Encoding problem when fetching a web page - negative byte values

I use the following code to fetch a web page.

CloseableHttpClient httpclient = HttpClients.createDefault(); 
HttpGet httpget = new HttpGet(url); 
CloseableHttpResponse response = httpclient.execute(httpget); 
HttpEntity entity = response.getEntity(); 
System.out.println(entity.getContentType()); 
//output: Content-Type: text/html; charset=ISO-8859-1

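I found that the character ’ has the byte value -110, which cannot be mapped to a valid character in either ISO-8859-1 or UTF-8.

I tried opening the page manually, copying the text and saving it as a text file, and then the byte value was actually 39. I think the OS converted the character when it passed through the clipboard.

All I want is to save the web page to local disk exactly as it is.

I wrote some simple code to save the content to disk. I read the bytes and write them out directly. When I open the saved file in a hex editor, I can see that the byte value is 146 (-110).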

// Copy the response body to disk byte for byte; nothing is decoded here.
try (InputStream in = entity.getContent();
     FileOutputStream fos = new FileOutputStream(new File("D:/test.html"))) {
    byte[] buffer = new byte[1024];
    int len;
    // read() returns -1 at end of stream; the buffer can be reused between reads
    while ((len = in.read(buffer)) != -1) {
        fos.write(buffer, 0, len);
    }
}
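
If the goal is only a byte-for-byte copy, `java.nio.file.Files.copy` is a shorter route, since it never decodes the stream. The following is a minimal sketch, assuming Java 7 or later. The names `RawSave` and `saveRaw` and the temp file are illustrative and not from the original post, and the `ByteArrayInputStream` stands in for `entity.getContent()`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class RawSave {
    // Copies the stream to the target file without interpreting the bytes.
    static void saveRaw(InputStream in, Path target) throws IOException {
        try (InputStream src = in) {
            Files.copy(src, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the response body: "It", byte 0x92, "s"
        byte[] page = {'I', 't', (byte) 0x92, 's'};
        Path target = Files.createTempFile("page", ".html");
        saveRaw(new ByteArrayInputStream(page), target);
        byte[] saved = Files.readAllBytes(target);
        System.out.println(saved[2]);        // -110 (signed view)
        System.out.println(saved[2] & 0xFF); // 146 (unsigned view)
        Files.delete(target);
    }
}
```

The byte on disk is the same 0x92 that the hex editor shows; only the way Java prints it differs.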

So now the question becomes how to reconstruct the character from the byte 146 (-110). I will keep trying and post an update if I find anything.


2014-09-06 David

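Can you provide the code that has the problem with the ’ text? And, if it is different, the code you use to save the web page to disk. [mcve](http://stackoverflow.com/help/mcve) – NiematojakTomasz 2014-09-06 19:10:48

Maybe you could show some code for how you save the page to disk? Have you checked the value of ’? It looks like the character ’ is 3 bytes long, unless my copy or paste failed. Check this out: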

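import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharBytes {
    public static void main(String[] args) {
        char c = '’'; // U+2019 RIGHT SINGLE QUOTATION MARK
        System.out.println("character: " + c);
        System.out.println("int: " + (int) c); // 8217
        String s = "’";
        // Java strings are UTF-16 internally, but getBytes() without an argument uses the
        // platform default charset, so the charset is named explicitly here.
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes: " + Arrays.toString(bytes)); // [-30, -128, -103]
    }
}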

Edit: I found the following suggested way of handling the charset; it may be worth a try:

ContentType contentType = ContentType.getOrDefault(entity);
Charset charset = contentType.getCharset(); // may be null if the header names no charset
Reader reader = new InputStreamReader(entity.getContent(), charset);

Source: https://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html
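
One caveat with trusting the declared charset: pages labelled ISO-8859-1 often contain windows-1252 bytes, and browsers that follow the WHATWG Encoding Standard decode the `iso-8859-1` label as windows-1252. Below is a minimal sketch of that fallback using only the JDK. `CharsetFallback` and `effectiveCharset` are hypothetical names, and the `declared` argument stands for whatever `contentType.getCharset()` returned. Whether this particular page really is windows-1252 is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetFallback {
    private static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Chooses the charset to decode with: a missing charset or a declared
    // ISO-8859-1 is treated as windows-1252 (assumption: browser-like behaviour).
    static Charset effectiveCharset(Charset declared) {
        if (declared == null || StandardCharsets.ISO_8859_1.equals(declared)) {
            return WINDOWS_1252;
        }
        return declared;
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x92}; // the byte 146 from the question
        String decoded = new String(raw, effectiveCharset(StandardCharsets.ISO_8859_1));
        System.out.println((int) decoded.charAt(0)); // 8217, i.e. U+2019 ’
    }
}
```

The resulting charset can be passed to `InputStreamReader` in place of the raw value from the header, which also avoids passing null.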


2014-09-06 17:56:26 MirMasej

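A byte in Java is a signed type with values from -128 to 127. The most significant bit indicates the sign. For example, 0111 1111 == 127 and 1000 0000 == -128.

I looked up your character (’) in the ANSI (Windows-1252) table and found that its value is 146, which is of course greater than 127. Its binary representation is 1001 0010, so interpreting it as a signed value yields -110.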
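
The arithmetic above can be checked with a short sketch. `SignedByteDemo` and `bits` are illustrative names, not part of the original answer.

```java
class SignedByteDemo {
    // Returns the 8-bit pattern of a byte as a string of 0s and 1s.
    static String bits(byte b) {
        String s = Integer.toBinaryString(b & 0xFF);
        return "00000000".substring(s.length()) + s; // left-pad to 8 digits
    }

    public static void main(String[] args) {
        byte b = (byte) 146;          // the narrowing cast keeps the low 8 bits
        System.out.println(bits(b));  // 10010010
        System.out.println(b);        // -110 (signed view: 146 - 256)
        System.out.println(b & 0xFF); // 146 (unsigned view)
    }
}
```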

To reproduce what you are seeing:

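import java.nio.charset.Charset;

String s = "’"; // ’ is ANSI (windows-1252) character 146
// getBytes() without an argument depends on the platform default charset
byte[] bytes = s.getBytes(Charset.forName("windows-1252"));
System.out.println((int) bytes[0]); // prints -110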

To convert the byte value to its unsigned representation:

char c = (char)(bytes[0] & 0xFF); 
System.out.println((int)c);   // prints 146
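
Masking gives the number 146, but to get back to the character itself the byte has to be decoded with a charset in which 146 is defined. The sketch below assumes the page is really windows-1252 despite its ISO-8859-1 header, which the thread does not confirm. `DecodeByte` and `decode` are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DecodeByte {
    // Decodes a single byte with the given charset.
    static String decode(byte b, Charset cs) {
        return new String(new byte[] {b}, cs);
    }

    public static void main(String[] args) {
        byte b = (byte) 146;
        String latin1 = decode(b, StandardCharsets.ISO_8859_1);
        String cp1252 = decode(b, Charset.forName("windows-1252"));
        // ISO-8859-1 maps 146 to the control character U+0092
        System.out.println(Integer.toHexString(latin1.charAt(0))); // 92
        // windows-1252 maps 146 to U+2019, the right single quotation mark
        System.out.println(Integer.toHexString(cp1252.charAt(0))); // 2019
        // Re-encoded as UTF-8 the same character takes three bytes
        System.out.println(Arrays.toString(cp1252.getBytes(StandardCharsets.UTF_8))); // [-30, -128, -103]
    }
}
```

This also accounts for the "3 bytes long" observation in the other answer: that is the UTF-8 encoding of the same character.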


2014-09-06 18:52:04 trooper
