Elasticsearch cross_fields with an edge n-gram analyzer

I have 999 documents that I am using to experiment with Elasticsearch.

In my type mapping there is a field f4 that is analyzed, with the following analyzer settings:

"myNGramAnalyzer" => [ 
     "type" => "custom", 
     "char_filter" => ["html_strip"], 
     "tokenizer" => "standard", 
     "filter" => ["lowercase","standard","asciifolding","stop","snowball","ngram_filter"] 
    ] 

My filter is as follows:

"filter" => [ 
     "ngram_filter" => [ 
      "type" => "edgeNGram", 
      "min_gram" => "2", 
      "max_gram" => "20" 
     ] 
    ] 
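To make the effect of this filter concrete, here is a minimal Python sketch of which terms an edge n-gram filter with min_gram 2 and max_gram 20 would emit for a single token. It models only lowercasing and prefix generation; the other filters in the chain (stop, snowball, asciifolding) are ignored, and the function names `edge_ngrams` and `analyze` are illustrative, not Elasticsearch APIs.

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 20) -> list[str]:
    """Return the prefixes of `token` whose length lies in [min_gram, max_gram]."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(value: str) -> list[str]:
    # Simplified stand-in for the analyzer: lowercase, then edge n-grams.
    return edge_ngrams(value.lower())


print(analyze("Proj1"))   # ['pr', 'pro', 'proj', 'proj1']
print(analyze("Proj10"))  # ['pr', 'pro', 'proj', 'proj1', 'proj10']
```

Every value that starts with "Proj" shares the terms pr, pro and proj, so those prefixes alone cannot tell the documents apart.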

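The values of field f4 are "Proj1", "Proj2", "Proj3", and so on.

Now, when I search for the string "proj1" using cross_fields, I expect the document with "Proj1" to be returned at the top of the response with the highest score. But that is not what happens. All the other data is almost identical across the documents.

I also don't understand why it matches all 999 documents.

Here is my search: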

{ 
    "index": "myindex", 
    "type": "mytype", 
    "body": { 
     "query": { 
      "multi_match": { 
       "query": "proj1", 
       "type": "cross_fields", 
       "operator": "and", 
       "fields": "f*" 
      } 
     }, 
     "filter": { 
      "term": { 
       "deleted": "0" 
      } 
     } 
    } 
} 

The response to my search is:

{ 
    "took": 12, 
    "timed_out": false, 
    "_shards": { 
     "total": 5, 
     "successful": 5, 
     "failed": 0 
    }, 
    "hits": { 
     "total": 999, 
     "max_score": 1, 
     "hits": [{ 
      "_index": "myindex", 
      "_type": "mytype", 
      "_id": "42", 
      "_score": 1, 
      "_source": { 
       "f1": "396","f2": "125650","f3": "BH.1511AI.001", 
       "f4": "Proj42", 
       "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0" 
      } 
     }, { 
      "_index": "myindex", 
      "_type": "mytype", 
      "_id": "47", 
      "_score": 1, 
      "_source": { 
       "f1": "396","f2": "137946","f3": "BH.152096.001", 
       "f4": "Proj47", 
       "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0" 
      } 
     }, 
     //....... 
     //....... 
     //MANY RECORDS IN BETWEEN HERE 
     //....... 
     //....... 
     { 
      "_index": "myindex", 
      "_type": "mytype", 
      "_id": "1", 
      "_score": 1, 
      "_source": { 
       "f1": "396","f2": "142095","f3": "BH.705215.001", 
       "f4": "Proj1", 
       "f5": "BH.1511AI.001","f6": "","f7": "","f8": "","f9": "","f10": "","f11": "","f12": "","f13": "","f14": "","f15": "","f16": "09/05/16 | 01:02PM | User","deleted": "0" 
      } 
     //....... 
     //....... 
     //MANY RECORDS IN BETWEEN HERE 
     //....... 
     //....... 
     }] 
    } 
} 

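Is there anything I am doing wrong or missing? (Apologies for the lengthy question, but I wanted to give all possible information while leaving out other unnecessary code.)

EDITED:

Term vector response: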

{ 
    "_index": "myindex", 
    "_type": "mytype", 
    "_id": "10", 
    "_version": 1, 
    "found": true, 
    "took": 9, 
    "term_vectors": { 
     "f4": { 
      "field_statistics": { 
       "sum_doc_freq": 5886, 
       "doc_count": 999, 
       "sum_ttf": 5886 
      }, 
      "terms": { 
       "pr": { 
        "doc_freq": 999, 
        "ttf": 999, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "pro": { 
        "doc_freq": 999, 
        "ttf": 999, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "proj": { 
        "doc_freq": 999, 
        "ttf": 999, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "proj1": { 
        "doc_freq": 111, 
        "ttf": 111, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       }, 
       "proj10": { 
        "doc_freq": 11, 
        "ttf": 11, 
        "term_freq": 1, 
        "tokens": [{ 
         "position": 0, 
         "start_offset": 0, 
         "end_offset": 6 
        }] 
       } 
      } 
     } 
    } 
} 

EDITED 2:

Mapping for field f4:

"f4" : { 
    "type" : "string", 
    "index_analyzer" : "myNGramAnalyzer", 
    "search_analyzer" : "standard" 
} 

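I have updated the mapping to use the standard analyzer at query time. This has improved the results, but they are still not what I expect.

Instead of 999 (all documents), it now returns 111 documents such as "Proj1", "Proj11", "Proj111", ..., "Proj1", "Proj181", and so on.

"Proj1" still appears in the middle of the results and not at the top.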
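The figures in the term vector response and the drop from 999 to 111 hits can be reproduced with a small simulation. The sketch below is only a rough model: it assumes f4 holds exactly the values Proj1 through Proj999 (one per document), models only lowercasing and edge n-grams, and assumes that when the query is n-grammed as well, matching any one prefix is enough for a hit. `edge_ngrams`, `docs` and the other names are illustrative and not part of Elasticsearch.

```python
from collections import Counter


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Prefixes of `token` with length in [min_gram, max_gram]."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# Assumption: f4 holds Proj1 .. Proj999, one value per document.
docs = {i: edge_ngrams(f"proj{i}") for i in range(1, 1000)}

# Number of documents containing each indexed term.
doc_freq = Counter(term for terms in docs.values() for term in set(terms))
sum_doc_freq = sum(doc_freq.values())

print(doc_freq["pr"], doc_freq["proj1"], doc_freq["proj10"], sum_doc_freq)
# 999 111 11 5886

# search_analyzer "standard": the query stays the single term "proj1".
standard_hits = [i for i, terms in docs.items() if "proj1" in terms]

# Same n-gram analyzer at search time: the query becomes pr, pro, proj, proj1.
# Assumption: a document matches if it contains any one of these prefixes.
query_grams = set(edge_ngrams("proj1"))
ngram_hits = [i for i, terms in docs.items() if query_grams & set(terms)]

print(len(standard_hits), len(ngram_hits))  # 111 999
```

Under these assumptions each of the 111 remaining documents contains the term proj1 exactly once in f4, which may be why term statistics alone do not lift the exact value Proj1 above Proj11 or Proj181.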

You can check the term vectors of one of the documents: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-termvectors.html – alpert

@alpert Updated the question with the term vector response. – Abubakkar

Could you just change the 'type' of the multi_match search query from 'cross_fields' to 'best_fields' and check again whether the results are what you want? –

Answers

0

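After spending hours trying to find a solution to this problem, I finally got it working.

I kept everything as described in my question, using the n-gram analyzer while indexing the data. The only thing I had to change was my search query: I used the all field in a bool query together with my existing multi_match query.

Now searching for the text Proj1 returns results in an order like Proj1, Proj121, Proj11.

Although this is not exactly the order I wanted (Proj1, Proj11, Proj121, and so on), it is still very close to the result I want.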
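The answer does not show the modified query, so the following is only a guess at its shape, written as a Python dict that serializes to a request body. It assumes that "the all field" means Elasticsearch's `_all` field, that the existing multi_match goes in a `must` clause, and that the extra match on `_all` goes in a `should` clause; the actual query may have combined them differently.

```python
import json

# The multi_match clause exactly as it appears in the question.
multi_match = {
    "multi_match": {
        "query": "proj1",
        "type": "cross_fields",
        "operator": "and",
        "fields": "f*",
    }
}

# Hypothetical reconstruction: the clause placement and the use of `_all`
# are assumptions, not taken from the answer.
query = {
    "query": {
        "bool": {
            "must": [multi_match],
            "should": [{"match": {"_all": "proj1"}}],
        }
    }
}

print(json.dumps(query, indent=2))
```

The idea behind such a shape would be that documents also matching the whole term in a field that is not n-grammed receive an additional score contribution.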

1

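There is no index_analyzer (at least not from Elasticsearch version 1.7). For mapping parameters, you can use analyzer and search_analyzer. Please try the following steps to get it working.

Create myindex with the analyzer settings: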

PUT /myindex 
{ 
    "settings": { 
    "analysis": { 
     "filter": { 
      "ngram_filter": { 
       "type": "edge_ngram", 
       "min_gram": 2, 
       "max_gram": 20 
      } 
     }, 
     "analyzer": { 
      "myNGramAnalyzer": { 
       "type": "custom", 
       "tokenizer": "standard", 
       "char_filter": "html_strip", 
       "filter": [ 
        "lowercase", 
        "standard", 
        "asciifolding", 
        "stop", 
        "snowball", 
        "ngram_filter" 
       ] 
      } 
     } 
     } 
    } 
} 

Add the mapping to mytype (to keep it short, I only map the relevant fields):

PUT /myindex/_mapping/mytype 
{ 
    "properties": { 
     "f1": { 
     "type": "string" 
     }, 
     "f4": { 
     "type": "string", 
     "analyzer": "myNGramAnalyzer", 
     "search_analyzer": "standard" 
     }, 
     "deleted": { 
     "type": "string" 
     } 
    } 
} 

Index some data:

PUT myindex/mytype/1 
{ 
    "f1":"396", 
    "f4":"Proj12" , 
    "deleted": "0" 
} 

PUT myindex/mytype/2 
{ 
    "f1":"42", 
    "f4":"Proj22" , 
    "deleted": "1" 
} 

Now try your query:

GET myindex/mytype/_search 
{ 
    "query": { 
     "multi_match": { 
     "query": "proj1", 
     "type": "cross_fields", 
     "operator": "and", 
     "fields": "f*" 
     } 
    }, 
    "filter": { 
     "term": { 
     "deleted": "0" 
     } 
    } 
} 

It should return document #1. It works for me in Sense. I am using Elasticsearch version 2.x.

I hope this helps :)

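Did you try this by adding documents with field f4 set to Proj1, Proj11, Proj12, Proj13, Proj121, Proj111? My problem is that it does not work for those. It already works for the documents you used in your example. – Abubakkar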

Also, I am aware of 'index_analyzer'; I am using an older version that supports it. – Abubakkar

When I index: 'PUT myindex/mytype/_bulk {"index":{"_id":"1"}} {"f1":"396","f4":"Proj1","deleted":"0"} {"index":{"_id":"2"}} {"f1":"396","f4":"Proj11","deleted":"0"} {"index":{"_id":"3"}} {"f1":"396","f4":"Proj13","deleted":"1"} {"index":{"_id":"4"}} {"f1":"396","f4":"Proj121","deleted":"1"} {"index":{"_id":"5"}} {"f1":"396","f4":"Proj111","deleted":"1"}' the documents I get are '#1' and '#2'. Isn't that what you want? –
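The result reported in this comment can be checked against a simplified model. The sketch below assumes that only f4 is searched, that the query "proj1" is kept as a single term (standard search analyzer), and that the term filter on deleted is applied as an exact string comparison; `edge_ngrams` and `search` are illustrative helpers, not Elasticsearch calls.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Prefixes of `token` with length in [min_gram, max_gram]."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


# The five documents from the bulk request in the comment.
docs = {
    "1": {"f1": "396", "f4": "Proj1", "deleted": "0"},
    "2": {"f1": "396", "f4": "Proj11", "deleted": "0"},
    "3": {"f1": "396", "f4": "Proj13", "deleted": "1"},
    "4": {"f1": "396", "f4": "Proj121", "deleted": "1"},
    "5": {"f1": "396", "f4": "Proj111", "deleted": "1"},
}


def search(term, deleted="0"):
    # A document is a hit if its indexed f4 prefixes contain the query term
    # and it passes the filter on the deleted flag.
    return [
        doc_id
        for doc_id, doc in docs.items()
        if doc["deleted"] == deleted and term in edge_ngrams(doc["f4"].lower())
    ]


print(search("proj1"))  # ['1', '2']
```

All five values share the prefix proj1, so in this model it is the filter on deleted that narrows the hits to documents 1 and 2; the model says nothing about their relative order.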