Fastest way to fetch multiple web pages in Java

I'm trying to write a fast HTML scraper, and at this point I'm only focused on maximizing throughput, without parsing. I've cached the IP addresses of the URLs:
import java.net.InetAddress;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.UnknownHostException;
import java.util.ArrayList;

public class Data {
    private static final ArrayList<String> sites = new ArrayList<String>();
    public static final ArrayList<URL> URL_LIST = new ArrayList<URL>();
    public static final ArrayList<InetAddress> ADDRESSES = new ArrayList<InetAddress>();

    static {
        /*
         * add all the URLs to the sites array list
         */
        // Resolve DNS up front so lookups don't count against the throughput test
        for (int i = 0; i < sites.size(); i++) {
            try {
                URL tmp = new URL(sites.get(i));
                InetAddress address = InetAddress.getByName(tmp.getHost());
                ADDRESSES.add(address);
                // Rebuild the URL around the raw IP so fetches skip DNS entirely
                URL_LIST.add(new URL("http", address.getHostAddress(), tmp.getPort(), tmp.getFile()));
                System.out.println(tmp.getHost() + ": " + address.getHostAddress());
            } catch (MalformedURLException e) {
            } catch (UnknownHostException e) {
            }
        }
    }
}
My next step is to test the speed with 100 URLs by fetching each one from the internet, reading only the first 64 KB, and moving on to the next URL. I create a thread pool of FetchTaskConsumers, and I've tried running it with multiple threads (16 to 64 on an i7 quad-core machine). Here is what each consumer looks like:
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchTaskConsumer implements Runnable {
    // Shared count of URLs left to fetch, used only for progress logging
    private static final AtomicInteger remaining = new AtomicInteger(Data.URL_LIST.size());
    private final CountDownLatch latch;
    private final int[] urlIndexes;

    public FetchTaskConsumer(int[] urlIndexes, CountDownLatch latch) {
        this.urlIndexes = urlIndexes;
        this.latch = latch;
    }

    @Override
    public void run() {
        URLConnection resource;
        InputStream is = null;
        for (int i = 0; i < urlIndexes.length; i++) {
            int numBytes = 0;
            try {
                resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();
                resource.setRequestProperty("User-Agent", "Mozilla/5.0");
                is = resource.getInputStream();
                // Read (and discard) at most the first 64 KB
                while (is.read() != -1 && numBytes < 65536) {
                    numBytes++;
                }
            } catch (IOException e) {
                System.out.println("Fetch Exception: " + e.getMessage());
            } finally {
                System.out.println(numBytes + " bytes for url index " + urlIndexes[i]
                        + "; remaining: " + remaining.decrementAndGet());
                if (is != null) {
                    try {
                        is.close();
                    } catch (IOException e1) { /* eat it */ }
                }
            }
        }
        latch.countDown();
    }
}
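For context, the driver partitions the URL indexes evenly across the consumers and then waits on the latch. Roughly (a sketch; the 1024/64 numbers are illustrative, and a no-op task stands in for `FetchTaskConsumer` so the sketch runs on its own):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Driver {
    public static void main(String[] args) throws InterruptedException {
        int urlCount = 1024, threads = 64;      // illustrative sizes
        CountDownLatch latch = new CountDownLatch(threads);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        int chunk = urlCount / threads;         // 16 URL indexes per consumer
        for (int t = 0; t < threads; t++) {
            int[] indexes = new int[chunk];
            for (int j = 0; j < chunk; j++) {
                indexes[j] = t * chunk + j;
            }
            // the real run submits new FetchTaskConsumer(indexes, latch) here;
            // a no-op stands in so this sketch is self-contained
            pool.submit(() -> latch.countDown());
        }
        latch.await();                          // block until every consumer finishes
        pool.shutdown();
        System.out.println("all " + threads + " consumers done");
    }
}
```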
At best I can get through 100 URLs in about 30 seconds, but the literature suggests I should be able to manage about 150 URLs per second. Note that I have access to Gigabit Ethernet, though at the moment I'm running the tests at home on my 20 Mbit connection... in either case, the connection is never actually fully utilized.
I've also tried using Socket connections directly, but I must be doing something wrong, because that was even slower! Any suggestions on how to improve throughput?
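For reference, a raw-Socket fetch has roughly this shape (a reconstructed sketch, not my exact code; `buildRequest`, `fetchFirst64K`, and the 2000 ms timeout are illustrative). One subtlety: without HTTP/1.0 semantics or `Connection: close`, keep-alive leaves the stream open, so a byte-by-byte read loop can block instead of hitting EOF:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class RawFetch {
    // Build a minimal HTTP/1.0 request; HTTP/1.0 makes the server close the
    // connection after the response, so read() reaches EOF instead of
    // blocking on a keep-alive stream.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "User-Agent: Mozilla/5.0\r\n"
             + "\r\n";
    }

    static int fetchFirst64K(String host, int port, String path) throws IOException {
        try (Socket s = new Socket(host, port)) {
            s.setSoTimeout(2000); // read timeout so a stalled host can't pin a thread
            s.getOutputStream().write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            // buffer the stream: one syscall per chunk instead of one per byte
            InputStream in = new BufferedInputStream(s.getInputStream());
            int numBytes = 0;
            while (in.read() != -1 && numBytes < 65536) {
                numBytes++;
            }
            return numBytes;
        }
    }

    public static void main(String[] args) {
        System.out.print(buildRequest("example.com", "/"));
    }
}
```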
P.S.
I have a list of about 1 million popular URLs, so I can add more if 100 is not enough for benchmarking.
Update:
The literature I'm referring to is the papers on Najork's web crawler, in which Najork states:

Processed 891 million URLs over 17 days, i.e. ~606 downloads per second [on] 4 Compaq DS20E Alpha servers [with] 4 GB main memory[,] 650 GB disk space[, and] 100 MBit/sec Ethernet. The ISP rate-limited bandwidth to 160 Mbit/sec.

So it's actually about 150 pages per second per machine, not 300. My computer is a Core i7 with 4 GB of RAM, and I'm nowhere near that. I haven't seen anything stating what they specifically used.
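The per-machine figure follows directly from the paper's numbers (891 million URLs over 17 days on 4 servers):

```java
public class NajorkRate {
    public static void main(String[] args) {
        long urls = 891_000_000L;
        long seconds = 17L * 24 * 60 * 60;      // 17 days in seconds
        double perSecond = (double) urls / seconds;
        double perMachine = perSecond / 4;      // spread over 4 Alpha servers
        System.out.printf("%.1f URLs/s total, %.1f per machine%n", perSecond, perMachine);
        // → 606.6 URLs/s total, 151.7 per machine
    }
}
```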
Update:
OK, to sum up... the final results are in! It turns out 100 URLs was a bit too low for a benchmark. I bumped it up to 1024 URLs and 64 threads, set a 2-second timeout on each fetch, and was able to reach up to 21 pages per second (in fact, my connection is about 10.5 Mbps, and 21 pages per second at 64 KB per page is about 10.5 Mbps). Here is what the fetcher looks like:
import java.io.IOException;
import java.io.InputStream;
import java.net.URLConnection;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

public class FetchTask implements Runnable {
    private final int timeoutMS = 2000;
    // Shared count of URLs left to fetch, used only for progress logging
    private static final AtomicInteger remaining = new AtomicInteger(Data.URL_LIST.size());
    private final CountDownLatch latch;
    private final int[] urlIndexes;

    public FetchTask(int[] urlIndexes, CountDownLatch latch) {
        this.urlIndexes = urlIndexes;
        this.latch = latch;
    }

    @Override
    public void run() {
        URLConnection resource;
        InputStream is = null;
        for (int i = 0; i < urlIndexes.length; i++) {
            int numBytes = 0;
            try {
                resource = Data.URL_LIST.get(urlIndexes[i]).openConnection();
                resource.setConnectTimeout(timeoutMS);
                resource.setReadTimeout(timeoutMS); // bound reads too, not just the connect
                resource.setRequestProperty("User-Agent", "Mozilla/5.0");
                is = resource.getInputStream();
                // Read (and discard) at most the first 64 KB
                while (is.read() != -1 && numBytes < 65536) {
                    numBytes++;
                }
            } catch (IOException e) {
                System.out.println("Fetch Exception: " + e.getMessage());
            } finally {
                System.out.println(numBytes + "," + urlIndexes[i] + "," + remaining.decrementAndGet());
                if (is != null) {
                    try {
                        is.close();
                    } catch (IOException e1) { /* eat it */ }
                }
            }
        }
        latch.countDown();
    }
}
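As a sanity check, the 10.5 Mbps figure works out exactly if you count 64 KB per page in binary megabits, which means the link itself is the bottleneck:

```java
public class ThroughputCheck {
    public static void main(String[] args) {
        int pagesPerSec = 21;
        long bitsPerPage = 64L * 1024 * 8;                   // 64 KB per page, in bits
        double mbps = pagesPerSec * bitsPerPage / (1024.0 * 1024.0); // binary megabits
        System.out.printf("%.1f Mbit/s%n", mbps);            // → 10.5 Mbit/s
    }
}
```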
Setting a browser user agent on a scraper is not good practice. – Mat 2011-04-16 17:36:34

Literature? You mean the javadocs? I couldn't find anything about 300 URLs per second in connection with URLConnection. – Babar 2011-04-16 17:44:56

URLConnection mostly takes about 500 ms per page; Java is slow for this purpose. – 2011-04-16 17:59:36