2011-03-24 53 views
6

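Nutch: invoke from Java, not the command line?

Is there really no way to invoke Apache Nutch programmatically from Java code? Where is the documentation (or a guide or tutorial) that shows how to do this? Google has failed me, so I actually tried Bing. (Yes, I know, pathetic.) Any ideas? Thanks in advance.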

(Also, if Nutch is a crapshoot, are there any other crawlers written in Java that have proven reliable at Internet scale with real-world documents?)

+0

Please tell me this isn't the answer: http://stackoverflow.com/questions/4340222/nutch-api-advice – ChrisJF 2011-03-24 15:07:39

Answers

6

If you look inside the bin/nutch script, you'll see that it maps each command to a Java class and then runs that class:

# figure out which class to run 
if [ "$COMMAND" = "crawl" ] ; then 
    CLASS=org.apache.nutch.crawl.Crawl 
elif [ "$COMMAND" = "inject" ] ; then 
    CLASS=org.apache.nutch.crawl.Injector 
elif [ "$COMMAND" = "generate" ] ; then 
    CLASS=org.apache.nutch.crawl.Generator 
elif [ "$COMMAND" = "freegen" ] ; then 
    CLASS=org.apache.nutch.tools.FreeGenerator 
elif [ "$COMMAND" = "fetch" ] ; then 
    CLASS=org.apache.nutch.fetcher.Fetcher 
elif [ "$COMMAND" = "fetch2" ] ; then 
    CLASS=org.apache.nutch.fetcher.Fetcher2 
elif [ "$COMMAND" = "parse" ] ; then 
    CLASS=org.apache.nutch.parse.ParseSegment 
elif [ "$COMMAND" = "readdb" ] ; then 
    CLASS=org.apache.nutch.crawl.CrawlDbReader 
elif [ "$COMMAND" = "convdb" ] ; then 
    CLASS=org.apache.nutch.tools.compat.CrawlDbConverter 
elif [ "$COMMAND" = "mergedb" ] ; then 
    CLASS=org.apache.nutch.crawl.CrawlDbMerger 
elif [ "$COMMAND" = "readlinkdb" ] ; then 
    CLASS=org.apache.nutch.crawl.LinkDbReader 
elif [ "$COMMAND" = "readseg" ] ; then 
    CLASS=org.apache.nutch.segment.SegmentReader 
elif [ "$COMMAND" = "segread" ] ; then 
    echo "[DEPRECATED] Command 'segread' is deprecated, use 'readseg' instead." 
    CLASS=org.apache.nutch.segment.SegmentReader 
elif [ "$COMMAND" = "mergesegs" ] ; then 
    CLASS=org.apache.nutch.segment.SegmentMerger 
elif [ "$COMMAND" = "updatedb" ] ; then 
    CLASS=org.apache.nutch.crawl.CrawlDb 
elif [ "$COMMAND" = "invertlinks" ] ; then 
    CLASS=org.apache.nutch.crawl.LinkDb 
elif [ "$COMMAND" = "mergelinkdb" ] ; then 
    CLASS=org.apache.nutch.crawl.LinkDbMerger 
elif [ "$COMMAND" = "index" ] ; then 
    CLASS=org.apache.nutch.indexer.Indexer 
elif [ "$COMMAND" = "solrindex" ] ; then 
    CLASS=org.apache.nutch.indexer.solr.SolrIndexer 
elif [ "$COMMAND" = "dedup" ] ; then 
    CLASS=org.apache.nutch.indexer.DeleteDuplicates 
elif [ "$COMMAND" = "solrdedup" ] ; then 
    CLASS=org.apache.nutch.indexer.solr.SolrDeleteDuplicates 
elif [ "$COMMAND" = "merge" ] ; then 
    CLASS=org.apache.nutch.indexer.IndexMerger 
elif [ "$COMMAND" = "plugin" ] ; then 
    CLASS=org.apache.nutch.plugin.PluginRepository 
elif [ "$COMMAND" = "server" ] ; then 
    CLASS='org.apache.nutch.searcher.DistributedSearch$Server' 
else 
    CLASS=$COMMAND 
fi 

# run it 
exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS -classpath "$CLASSPATH" $CLASS "$@"

From there, it's just a matter of looking at the API docs and, if necessary, the source code of those classes.
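To make the idea concrete, here is a minimal sketch of doing in Java what the last line of the script does: resolve a command to a class name and call that class's main method. The class name NutchLauncher and its methods are hypothetical, not part of Nutch; the mapping is a subset copied from the script above. It assumes the Nutch jars, their dependencies, and the Nutch conf directory are on the classpath, and it uses reflection so that it compiles without them.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.Map;

// Hypothetical helper (not part of Nutch): mirrors the dispatch in bin/nutch.
final class NutchLauncher {

    // Subset of the command-to-class table taken from the bin/nutch script.
    private static final Map<String, String> COMMANDS = Map.of(
            "crawl", "org.apache.nutch.crawl.Crawl",
            "inject", "org.apache.nutch.crawl.Injector",
            "generate", "org.apache.nutch.crawl.Generator",
            "fetch", "org.apache.nutch.fetcher.Fetcher",
            "parse", "org.apache.nutch.parse.ParseSegment",
            "updatedb", "org.apache.nutch.crawl.CrawlDb",
            "invertlinks", "org.apache.nutch.crawl.LinkDb");

    // Like the script's final "else" branch, an unknown command is
    // treated as a fully qualified class name.
    static String resolveClass(String command) {
        return COMMANDS.getOrDefault(command, command);
    }

    // Looks up the class and invokes its static main(String[]) in this JVM.
    static void run(String command, String... args) throws Exception {
        Class<?> cls = Class.forName(resolveClass(command));
        Method main = cls.getMethod("main", String[].class);
        // The cast passes the array as one argument instead of varargs.
        main.invoke(null, (Object) args);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: NutchLauncher COMMAND [ARGS...]");
            return;
        }
        run(args[0], Arrays.copyOfRange(args, 1, args.length));
    }
}
```

A call would look like NutchLauncher.run("inject", "crawl/crawldb", "urls"), where both paths are placeholders. One caveat: a main method written for the command line may call System.exit when it finishes, which would also end your program. If that turns out to be the case for the class you need, check its source for a non-static entry point you can call directly instead.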

+0

Touché, good sir! Thank you! – ChrisJF 2011-03-24 15:48:33