HttpURLConnection with https: InputStream is garbled
I am using HttpURLConnection to fetch https://translate.google.com/.
InetSocketAddress addr = new InetSocketAddress("127.0.0.1", 1082);
Proxy proxy = new Proxy(Proxy.Type.HTTP, addr);
url = new URL("https://translate.google.com/");
HttpURLConnection conn = (HttpURLConnection) url.openConnection(proxy);
conn.setRequestProperty("Accept-Encoding", "gzip, deflate, sdch");
conn.setRequestProperty("Connection", "keep-alive");
conn.setRequestProperty("User-Agent",
        "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36");
conn.setRequestProperty("Accept", "*/*");
Map<String, List<String>> reqHeaders = conn.getHeaderFields();
List<String> reqTypes = reqHeaders.get("Content-Type");
for (String ss : reqTypes) {
    System.out.println(ss);
}
InputStream in = conn.getInputStream();
String s = IOUtils.toString(in, "UTF-8");
System.out.println(s.substring(0, 100));
Map<String, List<String>> resHeader = conn.getHeaderFields();
List<String> resTypes = resHeader.get("Content-Type");
for (String ss : resTypes) {
    System.out.println(ss);
}
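The console output is:
But when I change the URL to http://translate.google.com/, it works fine.
I know that the HttpURLConnection is actually an HttpsURLConnection when I fetch https://translate.google.com/. I tried using HttpsURLConnection directly, and the output is still garbled.
Any suggestions?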
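I will try updating the Accept-Encoding header. –
@TomGrylls Try *removing* it. At present you are lying to the server that you can handle gzip encoding, when you can't. Or don't. – EJP
I tried my code without Accept-Encoding. It returns a readable value, although not the correct one, and I will deal with that. Thanks! –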
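For anyone who wants to keep sending Accept-Encoding rather than remove it, the alternative is to decompress the body yourself, since HttpURLConnection hands back the raw bytes as the server sent them. Below is a minimal sketch of that idea, not a definitive fix. The class name GzipAwareReader and its helper methods are hypothetical, invented for this illustration. The demo compresses a string in memory instead of contacting the server, so it does not reproduce the exact response from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper, not part of the original question's code.
class GzipAwareReader {

    // Wrap the raw body stream according to the Content-Encoding response header.
    // Unknown or missing encodings fall through to the raw stream.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw;
        }
        String enc = contentEncoding.trim();
        if ("gzip".equalsIgnoreCase(enc)) {
            return new GZIPInputStream(raw);
        }
        if ("deflate".equalsIgnoreCase(enc)) {
            // Assumes the zlib-wrapped form of deflate; some servers send raw deflate instead.
            return new InflaterInputStream(raw);
        }
        return raw;
    }

    // Read a stream to the end and decode the bytes as UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Test helper: gzip a string in memory to stand in for a compressed response body.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] body = gzip("<!DOCTYPE html><html>example</html>");
        // Reading the compressed bytes directly as UTF-8 gives the garbled text from the question.
        String garbled = readAll(new ByteArrayInputStream(body));
        // Wrapping the stream first restores the original text.
        String decoded = readAll(decode(new ByteArrayInputStream(body), "gzip"));
        System.out.println(garbled.equals(decoded));
        System.out.println(decoded);
    }
}
```

With a real connection, the same helper would be applied as GzipAwareReader.decode(conn.getInputStream(), conn.getContentEncoding()). Note that sdch is not handled here, so it is probably safest to advertise only the encodings the code can actually decode.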