
After debugging the basic crawler example, I can crawl the sample URLs and write their data with crawler4j, but the same setup does not work for Wikipedia pages.

controller.addSeed("http://www.ics.uci.edu/"); 
controller.addSeed("http://www.ics.uci.edu/~lopes/"); 
controller.addSeed("http://www.ics.uci.edu/~welling/"); 

The data is written to a text file successfully. However, when I change the seed URLs to Wikipedia pages, NetBeans only reports "BUILD SUCCESSFUL" and nothing is crawled or written. I have tried other pages as well; some of them work and some do not. The seed change was along these lines (the Wikipedia article URL below is only an illustration, not the exact page I used):
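// hypothetical seed, for illustration only
controller.addSeed("https://en.wikipedia.org/wiki/Web_crawler"); 

My controller's code: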

public class BasicCrawlController { 

public static CrawlController controller; 

public static void main(String[] args) throws Exception { 

    // if (args.length != 2) { 
    // System.out.println("Needed parameters: "); 
    // System.out.println("\t rootFolder (it will contain intermediate crawl data)"); 
    // System.out.println("\t numberOfCralwers (number of concurrent threads)"); 
    // return; 
    // } 
      /* 
      * crawlStorageFolder is a folder where intermediate crawl data is 
      * stored. 
      */ 
    // String crawlStorageFolder = args[0]; 
    String crawlStorageFolder = "C:\\Users\\AD-PC\\Desktop"; 

    /* 
    * numberOfCrawlers shows the number of concurrent threads that should 
    * be initiated for crawling. 
    */ 
    int numberOfCrawlers = 1; 

    CrawlConfig config = new CrawlConfig(); 

    config.setCrawlStorageFolder(crawlStorageFolder); 

    /* 
    * Be polite: Make sure that we don't send more than 1 request per 
    * second (1000 milliseconds between requests). 
    */ 
    config.setPolitenessDelay(1000); 

    /* 
    * You can set the maximum crawl depth here. The default value is -1 for 
    * unlimited depth 
    */ 
    config.setMaxDepthOfCrawling(4); 

    /* 
    * You can set the maximum number of pages to crawl. The default value 
    * is -1 for unlimited number of pages 
    */ 
    config.setMaxPagesToFetch(1000); 

    /* 
    * Do you need to set a proxy? If so, you can use: 
    * config.setProxyHost("proxyserver.example.com"); 
    * config.setProxyPort(8080); 
    * 
    * If your proxy also needs authentication: 
    * config.setProxyUsername(username); config.setProxyPassword(password); 
    */ 
    /* 
    * This config parameter can be used to set your crawl to be resumable 
    * (meaning that you can resume the crawl from a previously 
    * interrupted/crashed crawl). Note: if you enable resuming feature and 
    * want to start a fresh crawl, you need to delete the contents of 
    * rootFolder manually. 
    */ 
    config.setResumableCrawling(false); 

    /* 
    * Instantiate the controller for this crawl. 
    */ 
    PageFetcher pageFetcher = new PageFetcher(config); 
    RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
    RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher); 
    controller = new CrawlController(config, pageFetcher, robotstxtServer); 

    /* 
    * For each crawl, you need to add some seed urls. These are the first 
    * URLs that are fetched and then the crawler starts following links 
    * which are found in these pages 
    */ 
    controller.addSeed("http://www.ics.uci.edu/"); 
    controller.addSeed("http://www.ics.uci.edu/~lopes/"); 
    controller.addSeed("http://www.ics.uci.edu/~welling/"); 

    /* 
    * Start the crawl. This is a blocking operation, meaning that your code 
    * will reach the line after this only when crawling is finished. 
    */ 
    controller.start(BasicCrawler.class, numberOfCrawlers); 

} 

}

And the BasicCrawler:

public class BasicCrawler extends WebCrawler { 

private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" + "|png|tiff?|mid|mp2|mp3|mp4" 
     + "|wav|avi|mov|mpeg|ram|m4v|pdf" + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"); 

/** 
* You should implement this function to specify whether the given url 
* should be crawled or not (based on your crawling logic). 
*/ 
@Override 
public boolean shouldVisit(WebURL url) { 
    String href = url.getURL().toLowerCase(); 
    return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu"); 
} 

/** 
* This function is called when a page is fetched and ready to be processed 
* by your program. 
*/ 
@Override 
public void visit(Page page) { 
    int docid = page.getWebURL().getDocid(); 
    String url = page.getWebURL().getURL(); 
    String domain = page.getWebURL().getDomain(); 
    String path = page.getWebURL().getPath(); 
    String subDomain = page.getWebURL().getSubDomain(); 
    String parentUrl = page.getWebURL().getParentUrl(); 
    String anchor = page.getWebURL().getAnchor(); 

    System.out.println("Docid: " + docid); 
    System.out.println("URL: " + url); 
    System.out.println("Domain: '" + domain + "'"); 
    System.out.println("Sub-domain: '" + subDomain + "'"); 
    System.out.println("Path: '" + path + "'"); 
    System.out.println("Parent page: " + parentUrl); 

    if (page.getParseData() instanceof HtmlParseData) { 

     try { 
      HtmlParseData htmlParseData = (HtmlParseData) page.getParseData(); 
      String text = htmlParseData.getText(); 
      String title = htmlParseData.getTitle(); 
      String html = htmlParseData.getHtml(); 
      List<WebURL> links = htmlParseData.getOutgoingUrls(); 
      System.out.println("Title: " + title); 
      System.out.println("Text length: " + text.length()); 
      System.out.println("Html length: " + html.length()); 
      System.out.println("Number of outgoing links: " + links.size()); 
      System.out.println("============="); 
      //create a PrintWriter that appends to the output file 
      PrintWriter out = new PrintWriter(new FileWriter("D:\\test.txt", true)); 

      //write one record to the file 
      out.println(docid + "."); 
      out.println("- Title: " + title); 
      out.println("- Content: " + text); 
      out.println("- Anchor: "+ anchor); 

      //close the file (VERY IMPORTANT!) 
      out.close(); 
     } catch (IOException e1) { 
      System.out.println("Error during reading/writing"); 
     } 

     if (docid == 300) { 
      controller.shutdown(); 
     } 
    } 
} 

Can someone tell me how to fix this? Does Wikipedia block crawler4j?


Not a specific answer, but check https://en.wikipedia.org/robots.txt. Also, if you look at crawler4j's page, they mention a crawl delay and discuss its use on Wikipedia. And you should move 'out.close()' into a finally block. – Bill
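For reference, a minimal sketch of that last suggestion, assuming it replaces the try/catch inside visit() above (docid and anchor come from the surrounding method):

PrintWriter out = null; 
try { 
    HtmlParseData htmlParseData = (HtmlParseData) page.getParseData(); 
    String text = htmlParseData.getText(); 
    String title = htmlParseData.getTitle(); 
    out = new PrintWriter(new FileWriter("D:\\test.txt", true)); 
    out.println(docid + "."); 
    out.println("- Title: " + title); 
    out.println("- Content: " + text); 
    out.println("- Anchor: " + anchor); 
} catch (IOException e1) { 
    System.out.println("Error during reading/writing"); 
} finally { 
    // the writer is closed even if FileWriter or println throws 
    if (out != null) { 
        out.close(); 
    } 
} 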

Answer


Here is where your problem lies:

@Override 
public boolean shouldVisit(WebURL url) { 
    String href = url.getURL().toLowerCase(); 
    return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu"); 
} 

This method is called for every URL the crawler retrieves. In your case, with Wikipedia pages it will always return false, because the example's default code assumes that every page worth crawling starts with http://www.ics.uci.edu.
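For example, to let the crawler follow Wikipedia links you could also accept that prefix; the https://en.wikipedia.org prefix below is just an illustration, so adjust it to whatever pages you actually want to visit:

@Override 
public boolean shouldVisit(WebURL url) { 
    String href = url.getURL().toLowerCase(); 
    // still skip binary resources, but accept both the example site and Wikipedia 
    return !FILTERS.matcher(href).matches() 
            && (href.startsWith("http://www.ics.uci.edu") 
                || href.startsWith("https://en.wikipedia.org")); 
} 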