2014-10-20 92 views
-1

I want to create a thread that crawls all the links of a website and stores them in a LinkedHashSet, but when I print the size of this LinkedHashSet, it doesn't print anything. I have just started learning crawling! I referred to *The Art of Java*. Here is my code for crawling a web page and storing the links:

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStreamReader; 
import java.net.MalformedURLException; 
import java.net.URL; 
import java.util.LinkedHashSet; 
import java.util.logging.Level; 
import java.util.logging.Logger; 

public class TestThread { 

    public void crawl(URL url) { 
        try { 
            BufferedReader reader = new BufferedReader( 
                    new InputStreamReader(url.openConnection().getInputStream())); 
            String line = reader.readLine(); 
            LinkedHashSet toCrawlList = new LinkedHashSet(); 

            while (line != null) { 
                toCrawlList.add(line); 
                System.out.println(toCrawlList.size()); 
            } 
        } catch (IOException ex) { 
            Logger.getLogger(TestThread.class.getName()).log(Level.SEVERE, null, ex); 
        } 
    } 

    public static void main(String[] args) { 
        final TestThread test1 = new TestThread(); 
        Thread thread = new Thread(new Runnable() { 
            public void run() { 
                try { 
                    test1.crawl(new URL("http://stackoverflow.com/")); 
                } catch (MalformedURLException ex) { 
                    Logger.getLogger(TestThread.class.getName()).log(Level.SEVERE, null, ex); 
                } 
            } 
        }); 
    } 
} 
+1

What is the problem? – Marcin 2014-10-20 07:15:24

+0

I don't know how to get all the links I have crawled and stored. I just use a LinkedHashSet to store them, but when I crawl and print it, it shows nothing. – TrangVu 2014-10-21 10:46:01

Answer

0

You should fill your list like this:

// assign and test inside the loop condition so each iteration reads a new line 
while ((line = reader.readLine()) != null) { 
    toCrawlList.add(line); 
} 
// print the size once, after the whole page has been read 
System.out.println(toCrawlList.size()); 
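Note also that main() in the question only constructs the Thread object and never calls thread.start(), so run(), and therefore crawl(), is never executed at all. Even with the read loop fixed, nothing can print until the thread is actually started.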

If that still doesn't work, try setting a breakpoint in your code and find out whether your reader even contains anything.
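For reference, here is a minimal corrected sketch of the whole class. It reads a fresh line on each loop iteration, starts the thread, and, purely as an illustration (the original stores raw lines, not links), extracts href values with a simple regex. The LINK_PATTERN name and the regex are my own assumptions, not part of the original code, and a real crawler should use an HTML parser instead:

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStreamReader; 
import java.net.MalformedURLException; 
import java.net.URL; 
import java.util.LinkedHashSet; 
import java.util.Set; 
import java.util.logging.Level; 
import java.util.logging.Logger; 
import java.util.regex.Matcher; 
import java.util.regex.Pattern; 

public class TestThread { 

    // Illustrative assumption: a naive regex for absolute href values in HTML. 
    private static final Pattern LINK_PATTERN = 
            Pattern.compile("href=\"(http[^\"]+)\""); 

    public void crawl(URL url) { 
        // try-with-resources closes the reader even if an exception is thrown 
        try (BufferedReader reader = new BufferedReader( 
                new InputStreamReader(url.openConnection().getInputStream()))) { 
            Set<String> toCrawlList = new LinkedHashSet<String>(); 
            String line; 
            // read a fresh line on every iteration, otherwise the loop never advances 
            while ((line = reader.readLine()) != null) { 
                Matcher m = LINK_PATTERN.matcher(line); 
                while (m.find()) { 
                    toCrawlList.add(m.group(1)); // store the extracted link, not the raw line 
                } 
            } 
            System.out.println(toCrawlList.size()); 
        } catch (IOException ex) { 
            Logger.getLogger(TestThread.class.getName()).log(Level.SEVERE, null, ex); 
        } 
    } 

    public static void main(String[] args) { 
        final TestThread test1 = new TestThread(); 
        Thread thread = new Thread(new Runnable() { 
            public void run() { 
                try { 
                    test1.crawl(new URL("http://stackoverflow.com/")); 
                } catch (MalformedURLException ex) { 
                    Logger.getLogger(TestThread.class.getName()).log(Level.SEVERE, null, ex); 
                } 
            } 
        }); 
        thread.start(); // the original never started the thread, so run() never executed 
    } 
} 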