2012-01-09

Nutch: Unable to successfully parse content

I am trying to crawl with Nutch 1.4, but I am running into a parse error. Here is the log file:

2012-01-09 09:12:02,696 INFO parse.ParseSegment - ParseSegment: starting at 2012-01-09 09:12:02
2012-01-09 09:12:02,697 INFO parse.ParseSegment - ParseSegment: segment: crawl/segments/20120109091153
2012-01-09 09:12:03,416 WARN parse.ParseUtil - Unable to successfully parse content http://sujitpal.blogspot.com/ of type application/xhtml+xml
2012-01-09 09:12:03,417 INFO parse.ParseSegment - Parsing: http://sujitpal.blogspot.com/
2012-01-09 09:12:03,418 WARN parse.ParseSegment - Error parsing: http://sujitpal.blogspot.com/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content
2012-01-09 09:12:03,419 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature

Checking conf/nutch-site.xml, I found that html|text|xhtml|xml are included in the plugin.includes property:

<property> 
<name>plugin.includes</name> 
<value>myplugins|protocol-httpclient|query-(basic|site|url)|summary-basic|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|query-(basic|site|url)|response-(json|xml)</value>
<description>Regular expression naming plugin directory names to 
include. Any plugin not matching this expression is excluded. 
In any case you need at least include the nutch-extensionpoints plugin. By 
default Nutch includes crawling just HTML and plain text via HTTP, 
and basic indexing and search plugins. In order to use HTTPS please enable 
protocol-httpclient, but be aware of possible intermittent problems with the 
underlying commons-httpclient library. 
</description> 
</property> 

Why can't it parse xhtml/xml, or even text/xml?

Answer

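Which plugins have you configured? If you are using Tika, Tika maps MIME types (such as xhtml/xml) to parsers. If the configuration file has no entry for a type, nothing happens for content of that type.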

You can disable Tika and use only the parse-html plugin.

I tested your site with our default plugin configuration:

protocol-http|urlfilter-regex|parse-(html)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

and your page was parsed:

Parsed (32ms):http://sujitpal.blogspot.com/ 
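
To see which parser plugins each plugin.includes expression enables, the two expressions can be checked against a few plugin directory names. The sketch below is plain Python, not Nutch code. It assumes the expression has to match the whole directory name, and the candidate list is only a sample chosen for illustration.

```python
import re

# plugin.includes value from the question, joined onto one line.
QUESTION_INCLUDES = (
    "myplugins|protocol-httpclient|query-(basic|site|url)|summary-basic|"
    "urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|"
    "scoring-opic|urlnormalizer-(pass|regex|basic)|query-(basic|site|url)|"
    "response-(json|xml)"
)

# plugin.includes value from the answer's test configuration.
ANSWER_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html)|index-(basic|anchor)|"
    "query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)

# Sample plugin directory names to test (illustrative, not a full listing).
CANDIDATES = [
    "parse-html",
    "parse-tika",
    "parse-text",
    "protocol-http",
    "protocol-httpclient",
]


def included(expression, names):
    """Return the names matched in full by the expression.

    Assumption: a plugin is included only if the whole directory name
    matches, so re.fullmatch is used rather than re.search.
    """
    pattern = re.compile(expression)
    return [name for name in names if pattern.fullmatch(name)]


if __name__ == "__main__":
    print("question:", included(QUESTION_INCLUDES, CANDIDATES))
    print("answer:  ", included(ANSWER_INCLUDES, CANDIDATES))
```

Under these assumptions the question's expression enables parse-tika alongside parse-html, while the answer's expression enables parse-html only, which is the "disable Tika" setup described above.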

JPEE