grep called from Java does not work with French characters

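I am calling grep from Java to count, one word at a time, how often each word of a list occurs in a corpus. When grep is called from Java, it does not work with French characters.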

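import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;

// "corpus" holds the path of the corpus file and is defined elsewhere.
BufferedReader fb = new BufferedReader(
    new InputStreamReader(
        new FileInputStream("french.txt"), "UTF8"));

String l;
while ((l = fb.readLine()) != null) {
    String lpt = "\\b" + l + "\\b";
    String[] args = new String[]{"grep", "-ic", lpt, corpus};
    Process grepCommand = Runtime.getRuntime().exec(args);
    grepCommand.waitFor();
    // grep -c prints a single line containing the count.
    BufferedReader grepInput = new BufferedReader(new InputStreamReader(grepCommand.getInputStream()));
    int tmp = Integer.parseInt(grepInput.readLine());
    System.out.println(l + "\t" + tmp);
}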

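This works for my English word list and corpus. But I also have a French word list and corpus, and it does not work for French. Sample output on the Java console looks like this: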

� bord  0 
� c�t�  0

The correct forms are "à bord" and "à côté".

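Now my question is: where is the problem? Should I fix my Java code, or is it a grep problem? If so, how do I solve it? (Even when I change the encoding to UTF-8, I cannot see the French characters correctly in the terminal.)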

Source

2013-04-07 MAZDAK

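Why not use the Java regular expression engine? – 2013-04-07 11:38:32

Are you sure your file is encoded in UTF-8? It is more likely ISO-8859-1 or ISO-8859-15 or something similar. – 2013-04-07 11:38:41

I suggest you read the file line by line and then call split on word boundaries to get the number of words.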

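import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class WordCount {
    public static void main(String[] args) throws IOException {
        final File file = new File("myFile");
        // Decode the file explicitly as UTF-8 rather than with the platform default.
        try (final BufferedReader bufferedReader =
                new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                // Note: splitting on \b also returns the separators between words.
                final String[] words = line.split("\\b");
                System.out.println(words.length + " words in line \"" + line + "\".");
            }
        }
    }
}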

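This way you avoid calling grep from your program.

The strange characters you are getting are most likely caused by using the wrong encoding. Are you sure your file is in "UTF-8"?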
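One way to check the encoding question is to decode the file strictly and see whether it fails. The following is only a minimal sketch: the class name `EncodingCheck` and the method `isValidUtf8` are made up for illustration, and "french.txt" is the word list name from the question. It also prints through an explicitly UTF-8 `PrintStream`, which only helps if the terminal itself is set to UTF-8.

```java
import java.io.IOException;
import java.io.PrintStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class EncodingCheck {

    // Returns true if the bytes are well-formed UTF-8, false otherwise.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // "french.txt" is the word list from the question.
        byte[] bytes = Files.readAllBytes(Paths.get("french.txt"));
        // Write through an explicitly UTF-8 stream instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(isValidUtf8(bytes)
                ? "french.txt decodes cleanly as UTF-8"
                : "french.txt is not valid UTF-8 (perhaps ISO-8859-1 or ISO-8859-15)");
    }
}
```

If the check fails, reading the file with the charset it was actually saved in (for example "ISO-8859-1") is probably the first thing to try.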

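Edit

The OP wants to read one file line by line and then search another file for occurrences of each line read.

This can still be done more easily in Java. Depending on how big your other file is, you can either read it into memory first and search it there, or search it line by line as well.

A simple example that reads the file into memory: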

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NgramCounter {

    public static void main(String[] args) throws IOException {
        final File corpusFile = new File("corpus");
        final String corpusFileContent = readFileToString(corpusFile);
        final File file = new File("myEngramFile");
        try (final BufferedReader bufferedReader =
                new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                final int matches = countOccurencesOf(line, corpusFileContent);
                System.out.println(line + "\t" + matches);
            }
        }
    }

    // Decodes the whole file in one step. Decoding fixed-size chunks separately
    // can split a multi-byte UTF-8 character across two buffers.
    private static String readFileToString(final File file) throws IOException {
        return new String(Files.readAllBytes(file.toPath()), StandardCharsets.UTF_8);
    }

    // Counts every occurrence of the given line, delimited by word boundaries.
    private static int countOccurencesOf(final String countMatchesOf, final String inString) {
        final Matcher matcher = Pattern.compile("\\b" + countMatchesOf + "\\b").matcher(inString);
        int count = 0;
        while (matcher.find()) {
            ++count;
        }
        return count;
    }
}

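This should work fine if your "corpus" file is smaller than a hundred megabytes or so. For anything bigger you will want to change the "countOccurencesOf" method to something like this: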

private static int countOccurencesOf(final String countMatchesOf, final File inFile) throws IOException {
    // Compile once, then reuse the pattern for every line of the corpus.
    final Pattern pattern = Pattern.compile("\\b" + countMatchesOf + "\\b");
    int count = 0;
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            final Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                ++count;
            }
        }
    }
    return count;
}

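Now you just pass your "File" object into the method instead of the stringified file.

Note that the streaming approach reads the file line by line and therefore discards the line breaks. If your Pattern depends on them, you need to add them back before parsing the String.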
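Two details of the pattern may matter for French n-grams. First, an n-gram containing regex metacharacters (a dot, for instance) is interpreted as a regex unless it is quoted. Second, depending on the JDK version, a plain `\b` may or may not treat accented letters as word characters, while `Pattern.UNICODE_CHARACTER_CLASS` makes it Unicode-aware. The sketch below is one possible hardening under those assumptions, not the code from the answer above. The class and method names are made up, and the case-insensitive flags are there only to mimic `grep -i`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeNgramCount {

    // Matches the n-gram literally, between Unicode-aware word boundaries,
    // ignoring case (also for non-ASCII letters such as À / à).
    static Pattern ngramPattern(String ngram) {
        return Pattern.compile("\\b" + Pattern.quote(ngram) + "\\b",
                Pattern.UNICODE_CHARACTER_CLASS | Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
    }

    // Counts every non-overlapping occurrence of the n-gram in the text.
    static int countOccurrences(String ngram, String text) {
        Matcher matcher = ngramPattern(ngram).matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding:
        // "Bienvenue à bord. À bord du navire, pas débord."
        String text = "Bienvenue \u00e0 bord. \u00c0 bord du navire, pas d\u00e9bord.";
        System.out.println(countOccurrences("\u00e0 bord", text)); // prints 2
    }
}
```

The same `ngramPattern` could be dropped into either variant of "countOccurencesOf" in place of the plain `Pattern.compile` call.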

Source

2013-04-07 16:41:39

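What I need is the count in a corpus of any given n-gram read from another file (fb). You are right, the strange characters are due to the file encoding. – MAZDAK 2013-04-08 11:18:32

The problem is in your design. Do not call grep from Java. Use a pure Java implementation instead: read the file line by line and implement your own "grep" with the pure Java API.

But seriously, I think the problem is in your shell. Have you tried running grep manually and filtering on the French characters? I believe it will not work for you there either. This depends on your shell configuration and therefore on the platform. Java can provide a platform-independent solution. To get there, you should avoid non-pure-Java techniques, including executing command-line tools, as much as possible.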
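If grep has to stay, one thing that could be tried is to stop relying on the inherited shell configuration and set the locale for the child process explicitly. This is only a sketch under assumptions: the class and method names are invented, "corpus.txt" is a placeholder path, and it assumes a locale named en_US.UTF-8 is installed on the machine. How the n-gram argument itself gets encoded still depends on the locale the JVM was started in.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class GrepWithLocale {

    // Prepares "grep -ic" with an explicit UTF-8 locale for the child process.
    static ProcessBuilder grepCount(String ngram, String corpusPath) {
        ProcessBuilder builder =
                new ProcessBuilder("grep", "-ic", "\\b" + ngram + "\\b", corpusPath);
        // Assumption: the locale en_US.UTF-8 exists on this system.
        builder.environment().put("LC_ALL", "en_US.UTF-8");
        // Let grep's error messages go straight to our stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        return builder;
    }

    // Runs grep and parses the single count line it prints.
    static int run(String ngram, String corpusPath) throws IOException, InterruptedException {
        Process process = grepCount(ngram, corpusPath).start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String firstLine = reader.readLine();
            process.waitFor();
            if (firstLine == null) {
                throw new IOException("grep produced no output");
            }
            return Integer.parseInt(firstLine.trim());
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // "corpus.txt" is a placeholder; "\u00e0 bord" is "à bord".
        System.out.println(run("\u00e0 bord", "corpus.txt"));
    }
}
```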

BTW, the code that reads your file line by line and filters the lines with String.contains() or pattern matching is even shorter than the code that runs grep.
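A minimal sketch of such a line filter follows. Note that `grep -c` counts matching lines rather than individual occurrences, so this version does the same. The names `JavaGrepCount` and `countMatchingLines` and the path "corpus.txt" are placeholders for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;
import java.util.stream.Stream;

class JavaGrepCount {

    // Counts the lines that contain at least one match, like "grep -c".
    static long countMatchingLines(Stream<String> lines, Pattern pattern) {
        return lines.filter(line -> pattern.matcher(line).find()).count();
    }

    // Same count for a file that is decoded explicitly as UTF-8.
    static long countMatchingLines(Path file, Pattern pattern) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return countMatchingLines(lines, pattern);
        }
    }

    public static void main(String[] args) throws IOException {
        // "\u00e0 bord" is "à bord"; the flags mimic grep -i with Unicode-aware \b.
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote("\u00e0 bord") + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);
        System.out.println(countMatchingLines(Paths.get("corpus.txt"), pattern));
    }
}
```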

Source

2013-04-07 11:43:04 AlexR

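I agree. Maybe not String.contains(), but I think pattern matching is a good idea. Calling grep takes a lot of time, so it might even be faster. However, I still have the same problem when displaying the results on the Java console. – MAZDAK 2013-04-07 16:43:33

It turns out that implementing the whole thing in Java is actually much slower on my huge corpus. – MAZDAK 2013-04-08 11:15:02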
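One possible reason for the slowdown, though nothing here has been measured, is that the corpus is read again for every n-gram. A variation that might be worth trying is to compile all patterns up front and walk the corpus once. The sketch below uses made-up names (`SinglePassCounter`, `countAll`) and a tiny in-memory corpus just to show the idea.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SinglePassCounter {

    // Counts occurrences of every n-gram while visiting each corpus line only once.
    static Map<String, Integer> countAll(List<String> ngrams, Iterable<String> corpusLines) {
        Map<String, Pattern> patterns = new LinkedHashMap<>();
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String ngram : ngrams) {
            // Compile each pattern once, with literal matching and Unicode-aware \b.
            patterns.put(ngram, Pattern.compile("\\b" + Pattern.quote(ngram) + "\\b",
                    Pattern.UNICODE_CHARACTER_CLASS));
            counts.put(ngram, 0);
        }
        for (String line : corpusLines) {
            for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
                Matcher matcher = entry.getValue().matcher(line);
                while (matcher.find()) {
                    counts.merge(entry.getKey(), 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "à bord du navire" and "à côté et à bord", written with Unicode escapes.
        List<String> corpus = Arrays.asList(
                "\u00e0 bord du navire", "\u00e0 c\u00f4t\u00e9 et \u00e0 bord");
        System.out.println(countAll(Arrays.asList("\u00e0 bord", "\u00e0 c\u00f4t\u00e9"), corpus));
    }
}
```

For a real corpus, the second argument could be supplied from a `BufferedReader`, for example as `bufferedReader.lines()::iterator`, so the file is streamed rather than loaded into memory.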
