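Elasticsearch: the right strategy for indexing the contents of HTML files

Hello Elasticsearch experts!

I have a use case, and I am not sure what the best way to handle it is.

I have an HTML file that I need to index. That part is easy, since I can configure a custom analyzer and create the index.

However, I also have a special requirement: I need to extract some data and index it into dedicated fields.

Here is an excerpt; the file has thousands of lines like this.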

<td>....</td> 
<td>... 
<p>Great item to truck</p></td>... 
<a href="javascript:selectItem('1.a.b.c.1.d.f.11')">1.a.b.c.1.d.f.11</a> ...

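Lots of junk, even inline CSS.

My constraint:

I cannot change the HTML.

My challenges:

Index the text of the HTML file while removing HTML tags, CSS, and noise.
I need to build autocomplete on the text that is part of the link, e.g. 1.a.b.c.1.d.f.11

So when the user starts typing 1.a.b.c.1.d.f.11, I must be able to autocomplete it.

Should I create an analyzer that drops everything except the content of the tags? If so, how would I do that?

I would appreciate any comments or tips on what you think is the right approach here with Elasticsearch.

Source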

2014-09-30 Istvano

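Solution 1:

I suggest you develop a small application that parses the content of the HTML file and indexes only the data you are interested in. In other words, strip all HTML tags and unnecessary data before the document reaches Elasticsearch.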
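A minimal sketch of such a pre-processing step, using only Python's standard `html.parser`. The class name `ItemPageParser`, the helper `build_document`, and the field names `content` and `item_codes` are assumptions made for illustration; the sketch also assumes the item code always appears inside a `selectItem('...')` href, as in the excerpt from the question.

```python
import re
from html.parser import HTMLParser

# Matches the item code inside href="javascript:selectItem('...')".
ITEM_HREF = re.compile(r"selectItem\('([^']+)'\)")


class ItemPageParser(HTMLParser):
    """Collects visible text and item codes; skips <style>/<script> bodies."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.item_codes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("style", "script"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            match = ITEM_HREF.search(href)
            if match:
                self.item_codes.append(match.group(1))

    def handle_endtag(self, tag):
        if tag in ("style", "script") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <style>/<script>.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def build_document(html):
    """Return a dict with hypothetical field names, ready to be indexed."""
    parser = ItemPageParser()
    parser.feed(html)
    parser.close()
    return {
        "content": " ".join(parser.text_parts),
        "item_codes": parser.item_codes,
    }


sample = (
    "<td style=\"color:red\">....</td><td>...<p>Great item to truck</p></td>..."
    "<a href=\"javascript:selectItem('1.a.b.c.1.d.f.11')\">1.a.b.c.1.d.f.11</a> ..."
)
doc = build_document(sample)
```

The resulting dictionary could then be sent to Elasticsearch with whatever client you already use, with the item codes mapped to a field that has an autocomplete analyzer.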

Solution 2:

You can use the [html_strip] char filter to strip all HTML tags:

GET /_analyze?tokenizer=keyword&token_filters=lowercase&char_filters=html_strip&text=<td>....</td><td>...<p>Great item to truck</p></td>...<a href="javascript:selectItem('1.a.b.c.1.d.f.11')">1.a.b.c.1.d.f.11</a> ...
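The query-string form above matches the Elasticsearch version in use when this answer was written. As a hedged sketch, assuming a later release in which `_analyze` expects a JSON body instead, the same analysis chain could be expressed as follows. The snippet only builds and serializes the body; it sends nothing, and the exact parameter names should be checked against the documentation for your version.

```python
import json

html = (
    "<td>....</td><td>...<p>Great item to truck</p></td>..."
    "<a href=\"javascript:selectItem('1.a.b.c.1.d.f.11')\">1.a.b.c.1.d.f.11</a> ..."
)

# Same analysis chain as the query-string request above:
# keyword tokenizer, lowercase token filter, html_strip char filter.
analyze_body = {
    "tokenizer": "keyword",
    "char_filter": ["html_strip"],
    "filter": ["lowercase"],
    "text": html,
}

# Serialized payload intended for POST /_analyze (no request is sent here).
payload = json.dumps(analyze_body)
```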

Source

2014-09-30 19:22:26 Abdel

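Solution 1:

Now, if you want to strip the HTML completely before indexing and storing the content, you can use the mapper attachment plugin. When you define the mapping, you can set the content_type to "html".

The mapper attachment plugin is useful for many things, especially if you handle multiple document types. Most notably, I believe it is sufficient to use it just for stripping out the HTML tags from what gets stored (which you cannot do with the html_strip char filter, since a char filter only affects the analyzed tokens).

Just a forewarning, though: none of the HTML tags will be stored, so if you do need those tags, I would suggest defining another field to store the original content. Another note: you cannot specify multifields for mapper attachment documents, so you would need to store that outside of the mapper attachment document. See my working example below:

You will need to end up with this mapping: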

{
  "html5-es" : {
    "aliases" : { },
    "mappings" : {
      "document" : {
        "properties" : {
          "delete" : {
            "type" : "boolean"
          },
          "file" : {
            "type" : "attachment",
            "fields" : {
              "content" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "author" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets"
              },
              "title" : {
                "type" : "string",
                "store" : true,
                "term_vector" : "with_positions_offsets",
                "analyzer" : "autocomplete"
              },
              "name" : {
                "type" : "string"
              },
              "date" : {
                "type" : "date",
                "format" : "strict_date_optional_time||epoch_millis"
              },
              "keywords" : {
                "type" : "string"
              },
              "content_type" : {
                "type" : "string"
              },
              "content_length" : {
                "type" : "integer"
              },
              "language" : {
                "type" : "string"
              }
            }
          },
          "hash_id" : {
            "type" : "string"
          },
          "path" : {
            "type" : "string"
          },
          "raw_content" : {
            "type" : "string",
            "store" : true,
            "term_vector" : "with_positions_offsets",
            "analyzer" : "raw"
          },
          "title" : {
            "type" : "string"
          }
        }
      }
    },
    "settings" : {
      // insert your own settings here
    },
    "warmers" : { }
  }
}

In NEST, I then assemble the content like this:

Attachment attachment = new Attachment(); 
attachment.Content = Convert.ToBase64String(File.ReadAllBytes("path/to/document")); 
attachment.ContentType = "html"; 

Document document = new Document(); 
document.File = attachment; 
document.RawContent = InsertRawContentFromString(originalText);

I tested this in Sense; the results are as follows:

"file": { 
    "_content": "PGh0bWwgeG1sbnM6TWFkQ2FwPSJodHRwOi8vd3d3Lm1hZGNhcHNvZnR3YXJlLmNvbS9TY2hlbWFzL01hZENhcC54c2QiPg0KICA8aGVhZCAvPg0KICA8Ym9keT4NCiAgICA8aDE+VG9waWMxMDwvaDE+DQogICAgPHA+RGVsZXRlIHRoaXMgdGV4dCBhbmQgcmVwbGFjZSBpdCB3aXRoIHlvdXIgb3duIGNvbnRlbnQuIENoZWNrIHlvdXIgbWFpbGJveC48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+YXNkZjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD4xMDwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5MYXZlbmRlci48L3A+DQogICAgPHA+wqA8L3A+DQogICAgPHA+MTAvNiAxMjowMzwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD41IDA5PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPjExIDQ3PC9wPg0KICAgIDxwPsKgPC9wPg0KICAgIDxwPkhhbGxvd2VlbiBpcyBpbiBPY3RvYmVyLjwvcD4NCiAgICA8cD7CoDwvcD4NCiAgICA8cD5qb2c8L3A+DQogIDwvYm9keT4NCjwvaHRtbD4=", 
    "_content_length": 0, 
    "_content_type": "html", 
    "_date": "0001-01-01T00:00:00", 
    "_title": "Topic10" 
}, 
"delete": false, 
"raw_content": "<h1>Topic10</h1><p>Delete this text and replace it with your own content. Check your mailbox.</p><p> </p><p>asdf</p><p> </p><p>10</p><p> </p><p>Lavender.</p><p> </p><p>10/6 12:03</p><p> </p><p>5 09</p><p> </p><p>11 47</p><p> </p><p>Halloween is in October.</p><p> </p><p>jog</p>" 
}, 
"highlight": { 
"file.content": [ 
    "\n <em>Topic10</em>\n\n Delete this text and replace it with your own content. Check your mailbox.\n\n  \n\n asdf\n\n  \n\n 10\n\n  \n\n Lavender.\n\n  \n\n 10/6 12:03\n\n  \n\n 5 09\n\n  \n\n 11 47\n\n  \n\n Halloween is in October.\n\n  \n\n jog\n\n " 
    ] 
}

Solution 2:

You will need to set up an NGram analyzer to index your content, and SEARCH using the standard analyzer.

"analyzer" : {
  "standard" : {
    "type" : "standard"
  },
  "autocomplete" : {
    "filter" : [ "standard", "lowercase" ],
    "char_filter" : [ "html_strip" ],
    "type" : "custom",
    "tokenizer" : "ngram"
  }
}

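An example of this:

Input: "brown"

NGram analyzer:

[b], [br], [bro], [brow], [brown]
[r], [ro], [row], [rown]
[o], [ow], [own]
[w], [wn]
[n]

So when you run an autocomplete search, it will match any of these indexed fragments. However, it is very important to run the search itself (the one that returns the results page) with the standard analyzer only, so that it does not just match any of these random fragments.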
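To make the fragment list and the search-time warning concrete, here is a small plain-Python emulation of n-gram generation. It is not Elasticsearch code: the function name `ngrams` is made up for illustration, and the gram lengths 1 to 5 are an assumption inferred from the listing above (the tokenizer's min_gram and max_gram settings would have to be configured to match).

```python
def ngrams(text, min_gram, max_gram):
    """Return every substring of length min_gram..max_gram, ordered by start position."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


# Fragments indexed for the term "brown" (assumed gram lengths 1..5).
index_terms = set(ngrams("brown", 1, 5))

# A partially typed query such as "bro" matches an indexed fragment directly.
partial_query_matches = "bro" in index_terms

# If the query "bat" were ALSO split into n-grams at search time, its
# fragment "b" would hit "brown" even though the words are unrelated.
false_hits = set(ngrams("bat", 1, 5)) & index_terms

# Searched as a whole term, "bat" correctly finds nothing.
whole_query_matches = "bat" in index_terms
```

Here `partial_query_matches` is true, `false_hits` contains the single fragment `b`, and `whole_query_matches` is false, which is why the query side should not be split into n-grams.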

Source

2016-11-18 00:36:26
