2017-01-03

Nutch crawl not working

I want to use Apache Nutch 1.12 to crawl a site and index the data into Apache Solr. I am following this tutorial.

My seed.txt file contains this URL: http://nutch.apache.org/

In my regex URL filter I have this: +^* http://([a-z0-9])* nutch.apache.org/
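
A quick way to sanity-check a filter rule like the one above is to run it against a few sample URLs. The sketch below is plain Python, not Nutch code. It assumes the intended rule was `+^http://([a-z0-9]*\.)*nutch.apache.org/` (the pattern as pasted looks mangled by formatting), and it imitates a first-match-wins evaluation where a URL that matches no rule is rejected. The function name `accepts` and the sample URLs are made up for illustration.

```python
import re

# Assumed intended rule; the leading "+" means "accept URLs matching this regex".
RULES = [
    ("+", r"^http://([a-z0-9]*\.)*nutch.apache.org/"),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is a '+' rule; no match means reject."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False


if __name__ == "__main__":
    # Hypothetical sample URLs used only to exercise the rule.
    for url in [
        "http://nutch.apache.org/",
        "http://nutch.apache.org/downloads.html",
        "http://www.apache.org/",
        "https://nutch.apache.org/",
    ]:
        print(url, accepts(url))
```

Under this assumed rule, pages on nutch.apache.org served over http pass the filter, while other hosts and https URLs do not.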

When I try to fetch data, I only get the URL that is in my seed.txt file.

Fetcher: starting at 2017-01-03 09:56:23 
Fetcher: segment: crawl/segments/20170103095613 
Fetcher: threads: 10 
Fetcher: time-out divisor: 2 
QueueFeeder finished: total 2 records + hit by time limit :0 
Using queue mode : byHost 
Using queue mode : byHost 
Using queue mode : byHost 
fetching http://nutch.apache.org/ (queue crawl delay=5000ms) 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Using queue mode : byHost 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
Fetcher: throughput threshold: -1 
Fetcher: throughput threshold retries: 5 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=2 
robots.txt whitelist not configured. 
robots.txt whitelist not configured. 
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2 
Thread FetcherThread has no more work available 
Thread FetcherThread has no more work available 
-finishing thread FetcherThread, activeThreads=1 
-finishing thread FetcherThread, activeThreads=0 
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0 
-activeThreads=0 

What am I missing here?

Try it recursively: Generate > Fetch > Parse > Updatedb. Check your log entries for more details.
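
To illustrate why a single Generate > Fetch > Parse > Updatedb pass may fetch only the seed, here is a toy in-memory model of a round-based crawl. It is a sketch under stated assumptions, not Nutch code: `LINKS` is a made-up link graph (the page names are hypothetical), and `crawl_round` is an invented helper that compresses the four steps into one function.

```python
# Toy in-memory model of a round-based crawl; not Nutch code.
# LINKS is a made-up link graph standing in for the real pages.
LINKS = {
    "http://nutch.apache.org/": [
        "http://nutch.apache.org/downloads.html",
        "http://nutch.apache.org/mailing_lists.html",
    ],
    "http://nutch.apache.org/downloads.html": ["http://nutch.apache.org/"],
    "http://nutch.apache.org/mailing_lists.html": [],
}


def crawl_round(crawldb):
    """Run one generate > fetch > parse > updatedb round; return the URLs fetched."""
    # generate: select every known URL that has not been fetched yet
    fetch_list = sorted(url for url, fetched in crawldb.items() if not fetched)

    # fetch + parse: retrieve each page and collect its outlinks
    discovered = []
    for url in fetch_list:
        discovered.extend(LINKS.get(url, []))

    # updatedb: mark fetched pages and add newly discovered URLs as unfetched
    for url in fetch_list:
        crawldb[url] = True
    for link in discovered:
        crawldb.setdefault(link, False)
    return fetch_list


if __name__ == "__main__":
    crawldb = {"http://nutch.apache.org/": False}  # inject: seed URL only
    for round_no in (1, 2, 3):
        print(round_no, crawl_round(crawldb))
```

In this model the first round can only fetch the seed, because the crawl database contains nothing else yet. The links found on the seed page become fetchable in the second round, and a third round has nothing left to do.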

Answer

I ran the fetch again and got the expected result.