java.util.Scanner and Wikipedia

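I'm trying to use java.util.Scanner to fetch Wikipedia content and use it for word-based searches. It mostly works, but for some words it gives me errors. After looking at the code and running a few checks, it turns out that for some words the encoding doesn't seem to be recognized and the content is unreadable. This is the code I use to fetch a page: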

// -Start-

try {
    connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
    Scanner scanner = new Scanner(connection.getInputStream());
    scanner.useDelimiter("\\Z"); // read the whole page as a single token
    content = scanner.next();
//  if (word.equals("pubblico"))
//      System.out.println(content);
    System.out.println("Doing: " + word);
} catch (IOException e) {
    e.printStackTrace();
}
// End

The problem shows up with the word "pubblico" on the Italian Wikipedia. For that word, the println output looks like this (truncated): ï¿ï¿½] KSR>ï¿½〜戊 ï¿½1Aï¿½ï¿½ï¿½Eï¿½ER3tHZï¿½4vï¿½ï¿½&PZjtcï ¿½¿½ï¿½Dï¿½7_|ï¿½ï¿½ï¿½ï¿½=8ï¿½ï¿½Ø}

Do you know why? Looking at the page source and the headers, they look the same as for other pages and use the same encoding...

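It turned out that the content is gzipped. Can I tell Wikipedia not to send me its pages compressed, or is decompressing them the only way? Thanks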

Source

2009-02-11 luiss

I updated my answer to address your gzip problem. – erickson 2009-02-11 22:37:10

Try using a Reader instead of an InputStream. I think it works like this:

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
String ctype = connection.getContentType(); // e.g. "text/html; charset=UTF-8"; may be null
int csi = (ctype == null) ? -1 : ctype.indexOf("charset=");
Scanner scanner;
if (csi > 0) {
    // Decode with the charset named in the Content-Type header
    scanner = new Scanner(new InputStreamReader(connection.getInputStream(), ctype.substring(csi + 8)));
} else {
    // No charset given: InputStreamReader falls back to the platform default
    scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
}
scanner.useDelimiter("\\Z");
content = scanner.next();
if (word.equals("pubblico"))
    System.out.println(content);
System.out.println("Doing: " + word);

You can also pass the charset directly to the Scanner constructor, as pointed out in another answer.

Source

2009-02-11 22:02:35

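Don't use Content-Encoding for this. It specifies the compression that was applied and has nothing to do with character encoding. – erickson 2009-02-11 22:07:33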

Try using a Scanner with an explicitly specified charset:

public Scanner(InputStream source, String charsetName)

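For the default constructor, the documentation says:

Bytes from the stream are converted into characters using the underlying platform's default charset.

Scanner on java.sun.com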
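To illustrate why the charset argument matters, here is a minimal sketch that decodes the same UTF-8 bytes twice, once with the right charset name and once with the wrong one. The class name ScannerCharsetDemo and the helper readAll are made up for this example, and it uses the "\\A" delimiter (start of input) to grab the whole stream as one token, which is an assumption of this sketch rather than something from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical demo class, not part of the original question or answers.
final class ScannerCharsetDemo {

    // Reads the whole stream as a single token, decoding bytes with the named charset.
    static String readAll(InputStream in, String charsetName) {
        try (Scanner scanner = new Scanner(in, charsetName)) {
            scanner.useDelimiter("\\A"); // "\\A" only matches at the start, so next() returns everything
            return scanner.hasNext() ? scanner.next() : "";
        }
    }

    public static void main(String[] args) {
        // "citt\u00e0" is the Italian word "citta" with a grave accent on the final a.
        byte[] utf8 = "citt\u00e0".getBytes(StandardCharsets.UTF_8);

        // Decoded with the charset the bytes were written in: the text round-trips.
        System.out.println(readAll(new ByteArrayInputStream(utf8), "UTF-8"));

        // Decoded with the wrong charset: the two UTF-8 bytes of the accented
        // letter come out as two separate, unrelated characters.
        System.out.println(readAll(new ByteArrayInputStream(utf8), "ISO-8859-1"));
    }
}
```

Note that a wrong charset only mangles non-ASCII characters. Output that is garbage from the very first byte, as in the question, points at something else, such as compression.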

Source

2009-02-11 21:58:08 parkerfath

You need to use a URLConnection so that you can inspect the Content-Type header of the response. That should tell you which character encoding to use when you create your Scanner.

Specifically, look at the "charset" parameter of the Content-Type header.
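One way to pull the "charset" parameter out of a Content-Type value could look like the sketch below. The class ContentTypeCharset and the method charsetOf are hypothetical names; the fallback argument is an assumption of this example, since the header may be missing or may not name a charset at all.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original answer.
final class ContentTypeCharset {

    // Returns the charset named in a Content-Type value such as
    // "text/html; charset=UTF-8", or the fallback if none is usable.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback; // header missing
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            // Parameter names are case-insensitive, so compare ignoring case.
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback; // no charset parameter present
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

With a URLConnection, the input would be the result of connection.getContentType(), and the returned Charset's name() can be passed to the Scanner constructor.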

To suppress gzip compression, set the Accept-Encoding header to "identity". See the HTTP specification for more information.
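A minimal sketch of that request setup, assuming a plain URLConnection: the class IdentityRequest and the method open are hypothetical names for this example. The header has to be set before the connection is actually opened, that is, before getInputStream() is called.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper, not part of the original answer.
final class IdentityRequest {

    // Creates (but does not yet open) a connection that asks the server
    // for an uncompressed response body.
    static URLConnection open(String url) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            // "identity" means "no content coding"; must be set before connecting.
            connection.setRequestProperty("Accept-Encoding", "identity");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = open("http://it.wikipedia.org/wiki/pubblico");
        System.out.println(connection.getRequestProperty("Accept-Encoding"));
    }
}
```

A server is not obliged to honor the request header, so it is probably still worth checking getContentEncoding() on the response before reading the body as text.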

Source

2009-02-11 22:03:41 erickson

connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
connection.addRequestProperty("Accept-Encoding", "");
System.out.println(connection.getContentEncoding());
Scanner scanner = new Scanner(new InputStreamReader(connection.getInputStream()));
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

The encoding doesn't change. Why?

Source

2009-02-12 16:14:44

// Needs: java.io.InputStream, java.net.URL, java.util.Scanner,
//        java.util.zip.GZIPInputStream, java.util.zip.Inflater, java.util.zip.InflaterInputStream
connection = new URL("http://it.wikipedia.org/wiki/" + word).openConnection();
//connection.addRequestProperty("Accept-Encoding", "");
//System.out.println(connection.getContentEncoding());

InputStream resultingInputStream = null;  // stream the downloaded page is read from
String encoding = connection.getContentEncoding(); // content coding of the response (identity, gzip, deflate), may be null
// Pick the appropriate decompressor for reading the source
if (encoding != null && encoding.equals("gzip")) {
    resultingInputStream = new GZIPInputStream(connection.getInputStream());
}
else if (encoding != null && encoding.equals("deflate")) {
    // Inflater(true) expects raw deflate data without the zlib header
    resultingInputStream = new InflaterInputStream(connection.getInputStream(), new Inflater(true));
}
else {
    resultingInputStream = connection.getInputStream();
}

// Scanner that pulls the whole page out of the stream into a string
Scanner scanner = new Scanner(resultingInputStream);
scanner.useDelimiter("\\Z");
content = new String(scanner.next());

This way it works!
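The two concerns in this thread, compression (Content-Encoding) and character set (Content-Type), can be combined in one step. The sketch below is only an illustration under stated assumptions: BodyDecoder, decode, and gzip are hypothetical names, only the gzip coding is handled, and the body is built in memory rather than fetched from Wikipedia so the example runs without a network.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of the original answers.
final class BodyDecoder {

    // Undoes gzip compression if the Content-Encoding says so, then decodes
    // the bytes to text with the given charset (taken from Content-Type).
    static String decode(InputStream raw, String contentEncoding, Charset charset) {
        try {
            InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                    ? new GZIPInputStream(raw)
                    : raw; // null or any other value: read the bytes as they are
            try (Scanner scanner = new Scanner(in, charset.name())) {
                scanner.useDelimiter("\\A"); // whole stream as one token
                return scanner.hasNext() ? scanner.next() : "";
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: gzips a string the way a server might compress a response body.
    static byte[] gzip(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(charset));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] body = gzip("citt\u00e0 pubblica", StandardCharsets.UTF_8);
        System.out.println(decode(new ByteArrayInputStream(body), "gzip", StandardCharsets.UTF_8));
    }
}
```

With a real connection, the arguments would come from connection.getInputStream(), connection.getContentEncoding(), and the charset parameter of connection.getContentType().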

Source

2009-02-12 22:37:04
