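I suggest you read the file line by line and then split each line on whitespace
to get the number of words.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public static void main(String[] args) throws IOException {
    final File file = new File("myFile");
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            // Split on runs of whitespace; splitting on "\\b" would also
            // return the separators between words and inflate the count.
            final String trimmed = line.trim();
            final int wordCount = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            System.out.println(wordCount + " words in line \"" + line + "\".");
        }
    }
}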
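This way you avoid calling grep from your program.
The strange characters you are getting are most likely caused by using the wrong encoding. Are you sure your file is in "UTF-8"?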
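If you are not sure about the encoding, one way to check is to decode the raw bytes with a strict decoder and see whether it complains. This is only a rough sketch: the class name EncodingCheck and the method isValidUtf8 are made up for illustration, and passing the check does not prove the file was written as UTF-8 (plain ASCII passes too), it only rules out byte sequences that cannot be UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class EncodingCheck {

    // Returns true if the bytes are well-formed UTF-8, false otherwise.
    static boolean isValidUtf8(final byte[] bytes) {
        // REPORT makes the decoder throw instead of silently inserting
        // replacement characters for malformed input.
        final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // In-memory sample data; for a real file the bytes could come from
        // java.nio.file.Files.readAllBytes(file.toPath()).
        final String sample = "h\u00e9llo";
        System.out.println(isValidUtf8(sample.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(sample.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

If the check fails, try opening the reader with "ISO-8859-1" or "ISO-8859-15" instead of "UTF-8" and see whether the strange characters go away.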
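EDIT
The OP wants to read one file line by line and then search another file for occurrences of each line that was read.
This can still be done more easily in Java. Depending on how large your other file is, you can either read it into memory first and search it there, or search it line by line as well.
A simple example that reads the file into memory: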
public static void main(String[] args) throws IOException {
    final File corpusFile = new File("corpus");
    final String corpusFileContent = readFileToString(corpusFile);
    final File file = new File("myEngramFile");
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            // Count how often this line occurs in the corpus and report it.
            final int matches = countOccurencesOf(line, corpusFileContent);
            System.out.println(matches + " matches for \"" + line + "\".");
        }
    }
}
// Needs java.nio.file.Files and java.nio.charset.StandardCharsets.
private static String readFileToString(final File file) throws IOException {
    // Read all bytes first and decode them in one go, so that a multi-byte
    // UTF-8 sequence can never be cut in half at a buffer boundary.
    return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
}
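// Needs java.util.regex.Matcher and java.util.regex.Pattern.
private static int countOccurencesOf(final String countMatchesOf, final String inString) {
    // Pattern.quote treats the search text literally, so characters such as
    // '.' or '(' in the line are not interpreted as regex syntax.
    final Matcher matcher =
            Pattern.compile("\\b" + Pattern.quote(countMatchesOf) + "\\b").matcher(inString);
    int count = 0;
    while (matcher.find()) {
        ++count;
    }
    return count;
}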
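This should work fine as long as your "corpus" file is smaller than a hundred megabytes or so. For anything larger you will want to change the "countOccurencesOf" method to something like this: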
private static int countOccurencesOf(final String countMatchesOf, final File inFile) throws IOException {
    // Compile once; Pattern.quote treats the search text literally.
    final Pattern pattern = Pattern.compile("\\b" + Pattern.quote(countMatchesOf) + "\\b");
    int count = 0;
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            // Count every match on the current line.
            final Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                ++count;
            }
        }
    }
    return count;
}
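Now you just pass your File object into the method instead of the stringified file.
Note that the streaming approach reads the file line by line and therefore drops the line breaks. If your Pattern
depends on them, you need to add them back before parsing the String.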
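To illustrate the point about line breaks, here is a minimal sketch of a streaming count that puts a "\n" back on every line before matching. The class name NewlineAwareCounter and the method countMatches are invented for this example. It assumes "\n" is the separator you care about, and it also appends one to a last line that had no terminator in the file, which may or may not be what you want.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class NewlineAwareCounter {

    // Counts matches of the pattern line by line, restoring the line break
    // that readLine() strips so patterns containing "\n" can still match.
    static int countMatches(final Pattern pattern, final BufferedReader reader) {
        int count = 0;
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                final Matcher matcher = pattern.matcher(line + "\n");
                while (matcher.find()) {
                    ++count;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the corpus file.
        final BufferedReader reader =
                new BufferedReader(new StringReader("a cat\ncat b\nthe cat\n"));
        // Matches "cat" only when it is the last thing on a line.
        System.out.println(countMatches(Pattern.compile("cat\\n"), reader));
    }
}
```

A match that spans more than one line still cannot be found this way, because each line is matched on its own; for that you would have to fall back to the in-memory version.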
Why not use the Java regex engine? – 2013-04-07 11:38:32
Are you sure your file is encoded in UTF-8? It is more likely ISO-8859-1 or ISO-8859-15 or something similar. – 2013-04-07 11:38:41