我有999个文件,我正在使用弹性搜索进行实验。弹性搜索交叉字段,边缘ngram分析器
中有我喜欢的类型映射场F4被分析,有以下设置分析仪:
"myNGramAnalyzer" => [
"type" => "custom",
"char_filter" => ["html_strip"],
"tokenizer" => "standard",
"filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"]
]
我的过滤器是如下:
"filter" => [
"ngram_filter" => [
"type" => "edgeNGram",
"min_gram" => "2",
"max_gram" => "20"
]
]
我有值字段F4为“Proj1”,“Proj2”,“Proj3”......等等。
现在,当我尝试使用“proj1”字符串的交叉字段进行搜索时,我期待将带有“Proj1”的文档返回到最高分的回应的顶部。但事实并非如此。其余所有数据在内容上几乎相同。
另外我不明白为什么它匹配所有的999文件?
以下是我的搜索:
{
"index": "myindex",
"type": "mytype",
"body": {
"query": {
"multi_match": {
"query": "proj1",
"type": "cross_fields",
"operator": "and",
"fields": "f*"
}
},
"filter": {
"term": {
"deleted": "0"
}
}
}
}
我搜索的回应是:
{
"took": 12,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 999,
"max_score": 1,
"hits": [{
"_index": "myindex",
"_type": "mytype",
"_id": "42",
"_score": 1,
"_source": {
"f1": "396","f2": "125650","f3": "BH.1511AI.001",
"f4": "Proj42",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
}, {
"_index": "myindex",
"_type": "mytype",
"_id": "47",
"_score": 1,
"_source": {
"f1": "396","f2": "137946","f3": "BH.152096.001",
"f4": "Proj47",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
},
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
{
"_index": myindex,
"_type": "mytype",
"_id": "1",
"_score": 1,
"_source": {
"f1": "396","f2": "142095","f3": "BH.705215.001",
"f4": "Proj1",
"f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0"
}
//.......
//.......
//MANY RECORDS IN BETWEEN HERE
//.......
//.......
}]
}
}
任何东西,我做错了或丢失? (道歉冗长的问题,但我想给所有可能的信息丢弃不必要的其他代码)。
EDITED:
期限矢量响应
{
"_index": "myindex",
"_type": "mytype",
"_id": "10",
"_version": 1,
"found": true,
"took": 9,
"term_vectors": {
"f4": {
"field_statistics": {
"sum_doc_freq": 5886,
"doc_count": 999,
"sum_ttf": 5886
},
"terms": {
"pr": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"pro": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj": {
"doc_freq": 999,
"ttf": 999,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj1": {
"doc_freq": 111,
"ttf": 111,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
},
"proj10": {
"doc_freq": 11,
"ttf": 11,
"term_freq": 1,
"tokens": [{
"position": 0,
"start_offset": 0,
"end_offset": 6
}]
}
}
}
}
}
EDITED 2名
映射为字段F4
"f4" : {
"type" : "string",
"index_analyzer" : "myNGramAnalyzer",
"search_analyzer" : "standard"
}
我已更新为使用第一andard分析仪的查询时间,这已经改善了结果,但仍然不是我所期望的。
而不是999(所有文档)现在它返回111个文档,如“Proj1”,“Proj11”,“Proj111”......“Proj1”,“Proj181”.........等等。
仍然“Proj1”在结果之间而不在顶部。
你可以检查文档之一的术语向量:https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html – alpert
@alpert更新了术语向量响应的问题 – Abubakkar
你能只需将** multi_match **搜索查询的'type'从'cross_fields'更改为'best_fields',然后再次检查结果是否是所需结果。 –