2010-11-11

Why does Nutch parse application/x-javascript files?

I have the following in conf/nutch-site.xml:

<property> 
    <name>plugin.includes</name> 
    <value>urlfilter-regex|protocol-(http|file)|parse-(text|html|pdf|msword)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to 
    include. Any plugin not matching this expression is excluded. 
    In any case you need at least include the nutch-extensionpoints plugin. By 
    default Nutch includes crawling just HTML and plain text via HTTP, 
    and basic indexing and search plugins. In order to use HTTPS please enable 
    protocol-httpclient, but be aware of possible intermittent problems with the 
    underlying commons-httpclient library. 
    </description> 
</property> 

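Note that Nutch is configured with the following list of parsers: plain text, HTML, PDF, and MS Word. Yet for some strange reason I just found some application/x-javascript files in the index. Why is this happening? Is Nutch using whatever is in the plugins directory and ignoring my plugin.includes?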

Answer

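I use Nutch 1.1 (not trunk) to parse RSS feeds, with the parse-rss plugin. If I activate the plugin, the feeds are parsed; if I don't, they are ignored. So, to answer your question: yes, Nutch should only use the plugins defined in plugin.includes.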
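To sanity-check which plugins a given plugin.includes value would select, a rough sketch like the one below may help. It assumes that Nutch applies the expression as a full match against each plugin directory name, which is how the property description reads but is not confirmed in this thread. The `included_plugins` helper and the `candidates` list are hypothetical and exist only for illustration; the pattern is the one from the question, joined onto a single line.

```python
import re

# The plugin.includes value from the question, joined onto one line.
PLUGIN_INCLUDES = (
    r"urlfilter-regex|protocol-(http|file)|parse-(text|html|pdf|msword)|"
    r"index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|"
    r"summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def included_plugins(pattern, plugin_dirs):
    """Return the names that the pattern matches in full.

    Assumption: the whole directory name must match the expression.
    This helper is hypothetical and is not part of Nutch.
    """
    compiled = re.compile(pattern)
    return [name for name in plugin_dirs if compiled.fullmatch(name)]


# Hypothetical directory names, chosen only to exercise the pattern.
candidates = [
    "parse-text", "parse-html", "parse-js",
    "parse-rss", "parse-pdf", "index-basic",
]

selected = included_plugins(PLUGIN_INCLUDES, candidates)
print(selected)
```

Under that assumption, neither parse-js nor parse-rss is selected by the pattern in the question, which is consistent with feeds being ignored until parse-rss is added to plugin.includes.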
