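I want to use Apache Nutch 1.12 to crawl a site and index the data into Apache Solr. I am following this tutorial, but the Nutch crawl is not working.
My seed.txt file contains this URL: http://nutch.apache.org/
In my regex URL filter, I have this: +^* http://([a-z0-9])* nutch.apache.org/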
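A filter rule of this kind can be sanity-checked outside Nutch before a crawl. The sketch below is a rough approximation only: it assumes rules are lines of the form +regex or -regex, that the first rule matching anywhere in the URL decides, and that a URL matching no rule is rejected. The rule inside it is an assumed tutorial-style rule, not necessarily the one in the filter file, and the helper names (parse_rules, accepts) are made up for illustration. Nutch evaluates its rules with Java regular expressions, so Python's re module stands in for that here.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first rule that matches; reject if none does."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Assumed tutorial-style rule (hypothetical, for illustration only).
RULES = parse_rules(r"""
+^http://([a-z0-9]*\.)*nutch.apache.org/
""")

for candidate in ("http://nutch.apache.org/",
                  "http://nutch.apache.org/downloads.html",
                  "http://example.com/"):
    print(candidate, accepts(RULES, candidate))
```

If the seed URL is accepted but pages linked from it are rejected, the filter is the likely culprit. If both are accepted, the filter is probably not what limits the crawl.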
When I try to fetch data, I only get the URLs that are in my seed.txt file.
Fetcher: starting at 2017-01-03 09:56:23
Fetcher: segment: crawl/segments/20170103095613
Fetcher: threads: 10
Fetcher: time-out divisor: 2
QueueFeeder finished: total 2 records + hit by time limit :0
Using queue mode : byHost
Using queue mode : byHost
Using queue mode : byHost
fetching http://nutch.apache.org/ (queue crawl delay=5000ms)
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Using queue mode : byHost
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
Fetcher: throughput threshold: -1
Fetcher: throughput threshold retries: 5
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=2
robots.txt whitelist not configured.
robots.txt whitelist not configured.
-activeThreads=2, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=2
Thread FetcherThread has no more work available
Thread FetcherThread has no more work available
-finishing thread FetcherThread, activeThreads=1
-finishing thread FetcherThread, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
-activeThreads=0
What am I missing here?
Try repeating the cycle: generate > fetch > parse > updatedb. Check your log entries for more details.
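The first round can only fetch what was injected from seed.txt. Links discovered on those pages enter the crawldb at the updatedb step and become fetchable in the next generate. One round of that cycle might be scripted roughly as below. This is a sketch under assumptions: it uses the crawl/crawldb and crawl/segments paths seen in the log, assumes bin/nutch is run from the Nutch runtime directory, and the helper names (newest_segment_name, round_commands, run_round) are hypothetical.

```python
import os
import subprocess


def newest_segment_name(names):
    """Pick the latest segment; segment names are timestamps, so they sort by age."""
    candidates = sorted(n for n in names if n[:1].isdigit())
    if not candidates:
        raise ValueError("no segment directories found")
    return candidates[-1]


def round_commands(segment, crawldb="crawl/crawldb", nutch="bin/nutch"):
    """Commands that follow 'generate' for one segment, in order."""
    return [
        [nutch, "fetch", segment],
        [nutch, "parse", segment],
        [nutch, "updatedb", crawldb, segment],
    ]


def run_round(crawldb="crawl/crawldb", segments_dir="crawl/segments",
              nutch="bin/nutch"):
    """Run generate, then fetch, parse and updatedb on the new segment."""
    subprocess.run([nutch, "generate", crawldb, segments_dir], check=True)
    segment = os.path.join(segments_dir,
                           newest_segment_name(os.listdir(segments_dir)))
    for command in round_commands(segment, crawldb, nutch):
        subprocess.run(command, check=True)
    return segment


# Uncomment inside a Nutch runtime directory to run two rounds:
# for _ in range(2):
#     print("finished segment", run_round())
```

Running it for two or more rounds should make the second fetch log list URLs beyond the seed, provided the URL filter accepts them.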