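Nutch crawler fails to retrieve news article content

However, the article text from the page does not reach the content field of the index (Elasticsearch).

The crawl result is: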

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.09492774,
    "hits": [
      {
        "_index": "news",
        "_type": "doc",
        "_id": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
        "_score": 0.09492774,
        "_source": {
          "tstamp": "2016-08-04T07:21:59.614Z",
          "segment": "20160804125156",
          "digest": "d583a81c0c4c7510f5c842ea3b557992",
          "host": "www.bloomberg.com",
          "boost": "1.0",
          "id": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
          "url": "http://www.bloomberg.com/press-releases/2016-07-08/network-1-announces-settlement-of-patent-litigation-with-apple-inc",
          "content": ""
        }
      },
      {
        "_index": "news",
        "_type": "doc",
        "_id": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
        "_score": 0.009845509,
        "_source": {
          "tstamp": "2016-08-04T07:22:05.708Z",
          "segment": "20160804125156",
          "digest": "2a94a32ffffd0e03647928755e055e30",
          "host": "www.bloomberg.com",
          "boost": "1.0",
          "id": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
          "url": "http://www.bloomberg.com/press-releases/2016-07-05/apple-donate-life-america-bring-national-organ-donor-registration-to-iphone",
          "content": ""
        }
      }
    ]
  }
}

As we can see, the content field is empty. I tried different options in nutch-site.xml, but the result is still the same. Please help me.
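
For anyone reproducing this, the symptom can be confirmed across a larger result set by scanning the search response for hits whose content is empty. The following is only a minimal sketch, not part of the original setup: the class name EmptyContentCheck and the method idsWithEmptyContent are hypothetical, the response is assumed to be already available as a String, and a regular expression stands in for a proper JSON library purely to keep the example dependency-free.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Elasticsearch.
final class EmptyContentCheck {

    // One hit: its "_id" value, then (lazily, across lines) the first "content" value after it.
    // Assumption: every hit has a "content" key and neither value contains an escaped quote.
    private static final Pattern HIT = Pattern.compile(
            "\"_id\"\\s*:\\s*\"([^\"]*)\".*?\"content\"\\s*:\\s*\"([^\"]*)\"",
            Pattern.DOTALL);

    /** Returns the _id of every hit whose content is empty or whitespace only. */
    static List<String> idsWithEmptyContent(String searchResponseJson) {
        List<String> ids = new ArrayList<>();
        Matcher m = HIT.matcher(searchResponseJson);
        while (m.find()) {
            if (m.group(2).trim().isEmpty()) {
                ids.add(m.group(1));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up two-hit response with the same shape as the one above.
        String response = "{ \"hits\": { \"hits\": [ "
                + "{ \"_id\": \"http://example.com/a\", \"_source\": { \"content\": \"\" } }, "
                + "{ \"_id\": \"http://example.com/b\", \"_source\": { \"content\": \"some text\" } } ] } }";
        System.out.println(idsWithEmptyContent(response)); // prints [http://example.com/a]
    }
}
```

Run against the response above, both Bloomberg URLs would be reported, since both hits carry an empty content value. In real code a JSON parser would be the safer choice; the regular expression only works because the response has a fixed, simple shape.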

2016-08-04 Sachin

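I don't know why Nutch fails to extract the article content, but I found a workaround using Jsoup. I developed a custom parse filter plugin that parses the whole document and sets the parsed text in the ParseResult returned by the parse filter. I then replaced the parse-html plugin with my custom parse filter in parse-plugins.xml. It looks something like this: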

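// Inside the custom filter. Needs org.jsoup.Jsoup, org.jsoup.nodes.Document,
// java.nio.charset.StandardCharsets and the org.apache.nutch.parse classes used below.
Document document = Jsoup.parse(new String(content.getContent(), StandardCharsets.UTF_8), content.getUrl());
Parse parse = parseResult.get(content.getUrl());
ParseStatus status = parse.getData().getStatus();
String title = document.title();
// Keep outlinks and metadata from the original parse; only the title and the text are replaced.
ParseData parseData = new ParseData(status, title, parse.getData().getOutlinks(),
        parse.getData().getContentMeta(), parse.getData().getParseMeta());
// Store the full body text so that it ends up in the indexed content field.
parseResult.put(content.getUrl(), new ParseText(document.body().text()), parseData);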

2016-09-23 06:21:13 Sachin

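This is a bit off-topic as an answer, but try Apache ManifoldCF. It provides a built-in Elasticsearch connector, as well as a better log history for finding out why data is not being indexed. The connector section in ManifoldCF lets you specify which field the content should be indexed into. It is a good open-source alternative.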

2016-08-08 22:19:34

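Thanks :). I'll take a look at it. – Sachin

I want to select the links inside a specific div or any other tag, fetch the content those links point to, and index it. Can we do something like that with ManifoldCF? – Sachin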
