Elasticsearch cross_fields, edge ngram analyzer

I have 999 documents that I am using to experiment with Elasticsearch.

In my type mapping, field f4 is analyzed, with the following analyzer settings:

"myNGramAnalyzer" => [ 
     "type" => "custom", 
     "char_filter" => ["html_strip"], 
     "tokenizer" => "standard", 
     "filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"] 
    ]

My filter is as follows:

"filter" => [ 
     "ngram_filter" => [ 
      "type" => "edgeNGram", 
      "min_gram" => "2", 
      "max_gram" => "20" 
     ] 
    ]
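
As an illustration of what this filter does, here is a minimal Python sketch of edge n-gram generation with min_gram 2 and max_gram 20. The function name `edge_ngrams` is hypothetical, and the sketch only lowercases the token; it leaves out the other filters in the chain (asciifolding, stop, snowball), so treat it as an approximation rather than the analyzer's actual implementation.

```python
from typing import List


def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 20) -> List[str]:
    """Return the leading-edge n-grams of one token (hypothetical helper)."""
    token = token.lower()  # stands in for the "lowercase" filter only
    longest = min(len(token), max_gram)
    # Every prefix from min_gram characters up to the full token length.
    return [token[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("Proj1"))   # ['pr', 'pro', 'proj', 'proj1']
print(edge_ngrams("Proj42"))  # ['pr', 'pro', 'proj', 'proj4', 'proj42']
```

Every value of the form "ProjN" shares the prefixes "pr", "pro" and "proj", which is worth keeping in mind when reading the results further down.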

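Field f4 has values such as "Proj1", "Proj2", "Proj3", and so on.

Now, when I run a cross_fields search for the string "proj1", I expect the document with "Proj1" to come back at the top of the response with the highest score. But that is not the case. All the other data is almost identical in content.

Also, I don't understand why it matches all 999 documents.

Here is my search: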

{ 
    "index": "myindex", 
    "type": "mytype", 
    "body": { 
     "query": { 
      "multi_match": { 
       "query": "proj1", 
       "type": "cross_fields", 
       "operator": "and", 
       "fields": "f*" 
      } 
     }, 
     "filter": { 
      "term": { 
       "deleted": "0" 
      } 
     } 
    } 
}

The response to my search is:

{ 
    "took": 12, 
    "timed_out": false, 
    "_shards": { 
     "total": 5, 
     "successful": 5, 
     "failed": 0 
    }, 
    "hits": { 
     "total": 999, 
     "max_score": 1, 
     "hits": [{ 
      "_index": "myindex", 
      "_type": "mytype", 
      "_id": "42", 
      "_score": 1, 
      "_source": { 
       "f1": "396","f2": "125650","f3": "BH.1511AI.001", 
       "f4": "Proj42", 
       "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0" 
      } 
     }, { 
      "_index": "myindex", 
      "_type": "mytype", 
      "_id": "47", 
      "_score": 1, 
      "_source": { 
       "f1": "396","f2": "137946","f3": "BH.152096.001", 
       "f4": "Proj47", 
       "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0" 
      } 
     }, 
     //....... 
     //....... 
     //MANY RECORDS IN BETWEEN HERE 
     //....... 
     //....... 
     { 
      "_index": "myindex", 
      "_type": "mytype", 
      "_id": "1", 
      "_score": 1, 
      "_source": { 
       "f1": "396","f2": "142095","f3": "BH.705215.001", 
       "f4": "Proj1", 
       "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0" 
      } 
     //....... 
     //....... 
     //MANY RECORDS IN BETWEEN HERE 
     //....... 
     //....... 
     }] 
    } 
}

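Is there anything I am doing wrong or missing? (Apologies for the lengthy question, but I wanted to give all the relevant information while leaving out unnecessary code.)

EDITED:

Term vector response: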

{ 
    "_index": "myindex", 
    "_type": "mytype", 
    "_id": "10", 
    "_version": 1, 
    "found": true, 
    "took": 9, 
    "term_vectors": { 
     "f4": { 
      "field_statistics": { 
       "sum_doc_freq": 5886, 
       "doc_count": 999, 
       "sum_ttf": 5886 
      }, 
      "terms": { 
       "pr": { 
        "doc_freq": 999, 
        "ttf": 999, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "pro": { 
        "doc_freq": 999, 
        "ttf": 999, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "proj": { 
        "doc_freq": 999, 
        "ttf": 999, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "proj1": { 
        "doc_freq": 111, 
        "ttf": 111, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "proj10": { 
        "doc_freq": 11, 
        "ttf": 11, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       } 
      } 
     } 
    } 
}
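
The statistics in this term vector response can be reproduced with a small Python sketch. It rests on an assumption that is not stated outright in the question: that f4 holds exactly the values "Proj1" through "Proj999", one per document. The helper `edge_ngrams` is hypothetical and models only lowercasing plus edge n-grams (min 2, max 20).

```python
from collections import Counter


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Leading-edge n-grams of one lowercased token (hypothetical helper)."""
    token = token.lower()
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


# Assumption: one document per value, "Proj1" .. "Proj999".
values = [f"Proj{i}" for i in range(1, 1000)]

# doc_freq[term] = number of documents whose f4 produces that term.
doc_freq = Counter()
for value in values:
    doc_freq.update(set(edge_ngrams(value)))

print(doc_freq["pr"], doc_freq["proj1"], doc_freq["proj10"])  # 999 111 11
print(sum(doc_freq.values()))  # 5886, the sum_doc_freq reported above
```

Under that assumption the numbers agree with the response: "pr", "pro" and "proj" occur in all 999 documents, "proj1" in 111 and "proj10" in 11.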

EDITED 2:

Mapping for field f4:

"f4" : { 
    "type" : "string", 
    "index_analyzer" : "myNGramAnalyzer", 
    "search_analyzer" : "standard" 
}

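I have updated the mapping to use the standard analyzer at query time. This has improved the results, but they are still not what I expect.

Instead of 999 (all documents), it now returns 111 documents, such as "Proj1", "Proj11", "Proj111", ... "Proj1", "Proj181", ... and so on.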
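
The drop from 999 to 111 hits can be sketched in Python. Two assumptions are made here: f4 holds "Proj1" through "Proj999", and n-grams emitted at the same token position (all at position 0 in the term vectors) are treated as alternatives at search time, so a document matches if it contains any one of them. The helpers `edge_ngrams` and `matching` are hypothetical and only approximate the real analysis chain.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Leading-edge n-grams of one lowercased token (hypothetical helper)."""
    token = token.lower()
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


# Assumption: one document per value, indexed with the edge n-gram analyzer.
values = [f"Proj{i}" for i in range(1, 1000)]
index = {value: set(edge_ngrams(value)) for value in values}


def matching(query_terms):
    """Values whose indexed terms contain at least one of the query terms."""
    wanted = set(query_terms)
    return [value for value, terms in index.items() if terms & wanted]


# Search text analyzed with the n-gram analyzer: pr, pro, proj, proj1.
with_ngram_search = matching(edge_ngrams("proj1"))
# Search text analyzed with the standard analyzer: the single term proj1.
with_standard_search = matching(["proj1"])

print(len(with_ngram_search), len(with_standard_search))  # 999 111
```

In this model the n-gram search analyzer matches everything because every value shares the term "pr", while the standard search analyzer matches only values that start with "Proj1".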

"Proj1" still shows up somewhere in the middle of the results rather than at the top.

Source

2016-05-12 Abubakkar

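You can check the term vectors of one of the documents: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html – alpert

@alpert I have updated the question with the term vector response – Abubakkar

Could you just change the 'type' of the multi_match search query from 'cross_fields' to 'best_fields' and check again whether the results are as desired? –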

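After spending hours trying to find a solution to this problem, I finally got it working.

I kept everything as described in my question, using the n-gram analyzer while indexing the data. The only thing I had to change was in my search query: I used the all field in a bool query together with my existing multi-match query.

Now a search for the text Proj1 returns results in an order like Proj1, Proj121, Proj11, etc.

Although this does not return the exact order Proj1, Proj11, Proj121, etc., it is still very close to the result I want.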
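
The answer does not show the final query, so the following Python sketch is only one possible shape for it, not the author's actual request. It assumes that "the all field" means Elasticsearch's `_all` field, and that the existing multi_match stays as a required clause while a match on `_all` is added as an optional scoring clause. The function name `build_query` and the choice of `must` versus `should` are hypothetical.

```python
import json


def build_query(text):
    """Build one possible bool query body (hypothetical reconstruction)."""
    # The multi_match clause from the question, unchanged.
    multi_match = {
        "multi_match": {
            "query": text,
            "type": "cross_fields",
            "operator": "and",
            "fields": "f*",
        }
    }
    # Assumption: "_all" is the field the answer refers to.
    all_match = {"match": {"_all": text}}
    return {"query": {"bool": {"must": [multi_match], "should": [all_match]}}}


print(json.dumps(build_query("proj1"), indent=2))
```

If `_all` is analyzed without edge n-grams, only the document containing the exact token "proj1" would match the extra clause and receive a higher score, which would be consistent with Proj1 moving to the top while the remaining order stays loose.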

Source

2016-06-27 11:41:32 Abubakkar

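There is no index_analyzer (at least not as of Elasticsearch version 1.7). For mapping parameters, you can use analyzer and search_analyzer. Try the following steps to get it working.

Create myindex with the analyzer settings: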

PUT /myindex 
{ 
    "settings": { 
    "analysis": { 
     "filter": { 
      "ngram_filter": { 
       "type": "edge_ngram", 
       "min_gram": 2, 
       "max_gram": 20 
      } 
     }, 
     "analyzer": { 
      "myNGramAnalyzer": { 
       "type": "custom", 
       "tokenizer": "standard", 
       "char_filter": "html_strip", 
       "filter": [ 
        "lowercase", 
        "standard", 
        "asciifolding", 
        "stop", 
        "snowball", 
        "ngram_filter" 
       ] 
      } 
     } 
     } 
    } 
}

Add the mapping to mytype (to keep it short, I only mapped the relevant fields):

PUT /myindex/_mapping/mytype 
{ 
    "properties": { 
     "f1": { 
     "type": "string" 
     }, 
     "f4": { 
     "type": "string", 
     "analyzer": "myNGramAnalyzer", 
     "search_analyzer": "standard" 
     }, 
     "deleted": { 
     "type": "string" 
     } 
    } 
}

Index some data:

PUT myindex/mytype/1 
{ 
    "f1":"396", 
    "f4":"Proj12" , 
    "deleted": "0" 
} 

PUT myindex/mytype/2 
{ 
    "f1":"42", 
    "f4":"Proj22" , 
    "deleted": "1" 
}

Now try your query:

GET myindex/mytype/_search 
{ 
    "query": { 
     "multi_match": { 
     "query": "proj1", 
     "type": "cross_fields", 
     "operator": "and", 
     "fields": "f*" 
     } 
    }, 
    "filter": { 
     "term": { 
     "deleted": "0" 
     } 
    } 
}

It should return document #1. It works for me in Sense. I am using an Elasticsearch 2.X version.

Hope I have managed to help :)

Source

2016-05-15 20:25:20

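Have you tried this with documents whose f4 field is Proj1, Proj11, Proj12, Proj13, Proj121, Proj111? That is the case that is not working for me. It already works for the documents you used in your example. – Abubakkar

Also, I am aware of 'index_analyzer'; I am using an older version that supports it. – Abubakkar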

When I index: 'PUT myindex/mytype/_bulk {"index":{"_id":"1"}} {"f1":"396","f4":"Proj1","deleted":"0"} {"index":{"_id":"2"}} {"f1":"396","f4":"Proj11","deleted":"0"} {"index":{"_id":"3"}} {"f1":"396","f4":"Proj13","deleted":"1"} {"index":{"_id":"4"}} {"f1":"396","f4":"Proj121","deleted":"1"} {"index":{"_id":"5"}} {"f1":"396","f4":"Proj111","deleted":"1"}' the documents I get back are '#1' and '#2'. Isn't that what you want? –
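
The outcome described in this comment can be checked with a short Python sketch over the same five documents. It assumes f4 is indexed with edge n-grams (min 2, max 20), the search text is analyzed into the single term "proj1", and the term filter keeps only documents with deleted equal to "0". The helpers `edge_ngrams` and `search` are hypothetical, and no scoring is modeled.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Leading-edge n-grams of one lowercased token (hypothetical helper)."""
    token = token.lower()
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


# The five documents from the bulk request in the comment.
docs = {
    "1": {"f1": "396", "f4": "Proj1", "deleted": "0"},
    "2": {"f1": "396", "f4": "Proj11", "deleted": "0"},
    "3": {"f1": "396", "f4": "Proj13", "deleted": "1"},
    "4": {"f1": "396", "f4": "Proj121", "deleted": "1"},
    "5": {"f1": "396", "f4": "Proj111", "deleted": "1"},
}


def search(term, deleted="0"):
    """Ids of documents that pass the filter and contain the term in f4."""
    return sorted(
        doc_id
        for doc_id, doc in docs.items()
        if doc["deleted"] == deleted and term in edge_ngrams(doc["f4"])
    )


print(search("proj1"))  # ['1', '2']
```

All five f4 values contain the term "proj1", so in this model it is the deleted filter that narrows the result to documents 1 and 2. The sketch says nothing about which of the two ranks first.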
