StreamDecoder vs InputStreamReader when reading malformed files

I ran into some strange behavior when reading files in Java 8, and I'm wondering if someone can make sense of it.

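Scenario:

Reading a malformed text file. By malformed, I mean that it contains bytes that do not map to any Unicode code point.

The code I use to create such a file is as follows: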

byte[] text = new byte[1]; 
char k = (char) -60; 
text[0] = (byte) k; 
FileUtils.writeByteArrayToFile(new File("/tmp/malformed.log"), text);

This code generates a file containing exactly one byte (0xC4), which is not part of the ASCII table (nor the extended one).
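For anyone reproducing this without Commons IO, here is a minimal sketch that writes the same single byte with java.nio.file.Files. The class name, the method name, and the use of a temp file instead of /tmp/malformed.log are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriteMalformedFile {

    // Writes the single byte 0xC4 (the same value as (byte) (char) -60) to the given path.
    static Path writeMalformed(Path target) throws IOException {
        byte[] text = { (byte) 0xC4 };
        return Files.write(target, text);
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for /tmp/malformed.log (an assumption for portability).
        Path file = writeMalformed(Files.createTempFile("malformed", ".log"));
        System.out.println(Files.size(file) + " byte written to " + file);
    }
}
```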

Trying to cat this file outputs the following:

�

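Which is the Unicode Replacement Character (U+FFFD). This makes sense, because 0xC4 is the lead byte of a 2-byte UTF-8 sequence, so UTF-8 needs 2 bytes to decode this non-ASCII character, but we only have one. This is the behavior I expected from my Java code as well.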
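The same replacement can be seen in plain Java without touching a file. This is a minimal sketch, relying on the documented behavior of the String(byte[], Charset) constructor, which always replaces malformed input with the charset's replacement string. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class LoneLeadByteDemo {

    // Decodes a lone 0xC4 byte as UTF-8 using String's lenient (replacing) decoder.
    static String decodeLoneByte() {
        byte[] malformed = { (byte) 0xC4 };
        return new String(malformed, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String decoded = decodeLoneByte();
        // Prints the code point of the single decoded char: U+FFFD
        System.out.printf("U+%04X%n", (int) decoded.charAt(0));
    }
}
```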

Here is some common code used by both approaches:

private void read(Reader reader) throws IOException { 

    CharBuffer buffer = CharBuffer.allocate(8910); 

    buffer.flip(); 

    // move existing data to the front of the buffer 
    buffer.compact(); 

    // pull in as much data as we can from the socket 
    int charsRead = reader.read(buffer); 

    // flip so the data can be consumed 
    buffer.flip(); 

    ByteBuffer encode = Charset.forName("UTF-8").encode(buffer); 
    byte[] body = new byte[encode.remaining()]; 
    encode.get(body); 

    System.out.println(new String(body)); 
}

This is my first approach, using nio:

FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log")); 
read(Channels.newReader(inputStream.getChannel(), "UTF-8"));

This produces the following exception:

java.nio.charset.MalformedInputException: Input length = 1 

    at java.nio.charset.CoderResult.throwException(CoderResult.java:281) 
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339) 
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) 
    at java.io.Reader.read(Reader.java:100)

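This is not what I expected, but it also makes sense, because this really is a corrupt and illegal file, and the exception is basically telling us that it expected more bytes to be read.

My second approach (using regular java.io):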

FileInputStream inputStream = new FileInputStream(new File("/tmp/malformed.log")); 
read(new InputStreamReader(inputStream, "UTF-8"));

This does not fail, and it produces exactly the same output as cat:

�

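Which also makes sense.

So my questions are:

What is the expected behavior of a Java application in this scenario?
Why is there a difference between using Channels.newReader (which returns a StreamDecoder) and simply using a regular InputStreamReader? Am I doing something wrong in how I read?

Any clarification would be greatly appreciated.

Thanks :)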

Source

2017-08-01 Eli Polonsky

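Did you notice that you didn't specify 'UTF-8' for the 'InputStreamReader'? Is your platform default encoding 'UTF-8' or something else? 'InputStreamReader' also uses 'StreamDecoder' internally. – Kayaman

"The extended one": which extended one? IBM437, for example, uses all 256 byte values in one order or another. In any case, do you expect a text file to be malformed? Is there some part of your application that needs to handle bad input? If the application rejects it, can the bad input be fixed at the source? In other words, MalformedInputException is the expected behavior in many cases. –

@Kayaman Thanks, I hadn't noticed that. But my platform default is UTF-8. I changed the code to specify the Charset, and the behavior stays the same. (Edited the code here.) –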

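The difference in behavior actually goes all the way down to the StreamDecoder and Charset classes. InputStreamReader gets its decoder from StreamDecoder.forInputStreamReader(..), which builds a CharsetDecoder that replaces on error: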

StreamDecoder(InputStream in, Object lock, Charset cs) { 
    this(in, lock, 
    cs.newDecoder() 
    .onMalformedInput(CodingErrorAction.REPLACE) 
    .onUnmappableCharacter(CodingErrorAction.REPLACE)); 
}

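Channels.newReader(..), by contrast, creates a decoder with the default settings (i.e. report rather than replace, which results in an exception further up):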

public static Reader newReader(ReadableByteChannel ch, 
           String csName) { 
    checkNotNull(csName, "csName"); 
    return newReader(ch, Charset.forName(csName).newDecoder(), -1); 
}

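So they work differently, but nothing in the documentation points out the difference. This is badly documented, but I assume they changed the behavior because you would rather get an exception than have your data silently corrupted.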
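If you need one behavior or the other regardless of which reader you use, you can pass an explicitly configured CharsetDecoder to either API. The following is a minimal sketch rather than a definitive recipe: the class name and helper methods are invented for illustration, and an in-memory byte array stands in for the file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class ExplicitDecoderDemo {

    // Same content as the malformed file: a lone UTF-8 lead byte.
    static final byte[] MALFORMED = { (byte) 0xC4 };

    // Builds a UTF-8 decoder that applies the given action to bad input.
    static CharsetDecoder utf8Decoder(CodingErrorAction action) {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
    }

    // Drains a reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[1024];
        int n;
        while ((n = reader.read(chunk)) != -1) {
            sb.append(chunk, 0, n);
        }
        return sb.toString();
    }

    // Channels.newReader made lenient by handing it a REPLACE decoder.
    static String lenientChannelRead() throws IOException {
        try (Reader r = Channels.newReader(
                Channels.newChannel(new ByteArrayInputStream(MALFORMED)),
                utf8Decoder(CodingErrorAction.REPLACE), -1)) {
            return readAll(r);
        }
    }

    // InputStreamReader made strict by handing it a REPORT decoder.
    static String strictStreamRead() throws IOException {
        try (Reader r = new InputStreamReader(
                new ByteArrayInputStream(MALFORMED),
                utf8Decoder(CodingErrorAction.REPORT))) {
            return readAll(r);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("lenient channel reader: " + lenientChannelRead());
        try {
            strictStreamRead();
        } catch (MalformedInputException e) {
            System.out.println("strict stream reader: " + e);
        }
    }
}
```

Spelling out the CodingErrorAction at the call site keeps the choice between replacing and reporting visible, instead of leaving it to whichever factory method happens to create the reader.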

Be careful when dealing with character encodings!

Source

2017-08-11 07:38:07 Kayaman
