2015-10-19

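Elasticsearch fuzzy query: maximum edits not as expected

I recently added the fuzzy operator and fuzzy query settings to our search query string to cover user typos (for example, searching for "zamestnanost" does not find "zamestnani"):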

POST /my_index/_search
{
  "query": {
    "query_string": {
      "query": "+(content:zamestnanost~)",
      "fuzzy_prefix_length": 3,
      "fuzzy_min_sim": 0.5,
      "fuzzy_max_expansions": 50
    }
  }
}

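As I understand the fuzzy query settings, fuzzy_min_sim = 0.5 should allow length(query) * 0.5 edits of the original query (6 edits in this case).

However, it does not match even "closer" words (tokens) such as:

  • "zamestnani"
  • "zamestnany"

I have a strange feeling that it still only matches words in the index that are at most 2 edits away from the original query string (2 being the default edit count in fuzzy queries).

I also ran explain on my query, and I think the result supports this hypothesis. The _explanation looks like this: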

"_explanation": { 
    "value": 0.057083897, 
    "description": "sum of:", 
    "details": [ 
     { 
     "value": 0.023866946, 
     "description": "weight(content:zamestnano^0.8 in 0) [PerFieldSimilarity], result of:", 
     "details": [ 
      { 
       "value": 0.023866946, 
       "description": "score(doc=0,freq=4.0), product of:", 
       "details": [ 
        { 
        "value": 0.66062796, 
        "description": "queryWeight, product of:", 
        "details": [ 
         { 
          "value": 0.8, 
          "description": "boost" 
         }, 
         { 
          "value": 4.624341, 
          "description": "idf(docFreq=1, maxDocs=75)" 
         }, 
         { 
          "value": 0.17857353, 
          "description": "queryNorm" 
         } 
        ] 
        }, 
        { 
        "value": 0.036127664, 
        "description": "fieldWeight in 0, product of:", 
        "details": [ 
         { 
          "value": 2, 
          "description": "tf(freq=4.0), with freq of:", 
          "details": [ 
           { 
           "value": 4, 
           "description": "termFreq=4.0" 
           } 
          ] 
         }, 
         { 
          "value": 4.624341, 
          "description": "idf(docFreq=1, maxDocs=75)" 
         }, 
         { 
          "value": 0.00390625, 
          "description": "fieldNorm(doc=0)" 
         } 
        ] 
        } 
       ] 
      } 
     ] 
     }, 
     { 
     "value": 0.03321695, 
     "description": "weight(content:zamestnanos^0.9090909 in 0) [PerFieldSimilarity], result of:", 
     "details": [ 
      { 
       "value": 0.03321695, 
       "description": "score(doc=0,freq=6.0), product of:", 
       "details": [ 
        { 
        "value": 0.7507135, 
        "description": "queryWeight, product of:", 
        "details": [ 
         { 
          "value": 0.9090909, 
          "description": "boost" 
         }, 
         { 
          "value": 4.624341, 
          "description": "idf(docFreq=1, maxDocs=75)" 
         }, 
         { 
          "value": 0.17857353, 
          "description": "queryNorm" 
         } 
        ] 
        }, 
        { 
        "value": 0.044247173, 
        "description": "fieldWeight in 0, product of:", 
        "details": [ 
         { 
          "value": 2.4494898, 
          "description": "tf(freq=6.0), with freq of:", 
          "details": [ 
           { 
           "value": 6, 
           "description": "termFreq=6.0" 
           } 
          ] 
         }, 
         { 
          "value": 4.624341, 
          "description": "idf(docFreq=1, maxDocs=75)" 
         }, 
         { 
          "value": 0.00390625, 
          "description": "fieldNorm(doc=0)" 
         } 
        ] 
        } 
       ] 
      } 
     ] 
     } 
    ] 
} 

Only the query terms "zamestnano" and "zamestnanos" were generated by the fuzzy query edits.
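As a rough cross-check, the edit distances between the query term and the four tokens mentioned in this question can be computed by hand. The sketch below uses a plain Levenshtein distance written from scratch; the `levenshtein` helper is a name made up for this illustration and is not part of Elasticsearch, and Lucene's own fuzzy matching may count edits slightly differently (for instance for transposed characters), so this only approximates what the engine does.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

query_term = "zamestnanost"
tokens = ["zamestnanos", "zamestnano", "zamestnani", "zamestnany"]
distances = {token: levenshtein(query_term, token) for token in tokens}
# {'zamestnanos': 1, 'zamestnano': 2, 'zamestnani': 3, 'zamestnany': 3}
```

Under these assumptions, the two terms in the explain output are within 2 edits of the query term, while "zamestnani" and "zamestnany" are 3 edits away, which is consistent with a cap of 2 edits.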

Do I understand the fuzzy query settings correctly? Can you point out my mistake?

Thanks a lot for any ideas!

Answer

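From the documentation:

0.0..1.0

[1.7.0] Deprecated in 1.7.0. Support for similarity will be removed in Elasticsearch 2.0. Converted to an edit distance using the formula length(term) * (1.0 - fuzziness), e.g. a fuzziness of 0.6 with a term of length 10 would result in an edit distance of 4. Note: in all APIs except the Fuzzy Like This Query, the maximum allowed edit distance is 2.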

The easiest way to double-check this is to use the validate API:

GET _validate/query?explain&index=my_index
{
  "query": {
    "query_string": {
      "query": "+(content:zamestnanost~)",
      "fuzzy_prefix_length": 3,
      "fuzzy_min_sim": 0.5,
      "fuzzy_max_expansions": 50
    }
  }
}

which gives this result:

"explanations": [
  {
    "index": "test",
    "valid": true,
    "explanation": "+content:zamestnanost~2"
  }
]

This shows the actual edit distance ES will use in the query: zamestnanost~2
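If this check needs to run from a script, the distance can be read out of the explanation string. This is a minimal sketch, assuming the explanation keeps the `field:term~N` shape seen in the validate output; `parse_fuzzy_clauses` is a hypothetical helper, and the `response` dict is copied from that output rather than fetched from a cluster.

```python
import re

# Matches clauses such as "content:zamestnanost~2" (field, term, edit distance).
FUZZY_CLAUSE = re.compile(r"(\w+):(\w+)~(\d+)")


def parse_fuzzy_clauses(explanation):
    """Return (field, term, edits) tuples found in a validate explanation."""
    return [(field, term, int(edits))
            for field, term, edits in FUZZY_CLAUSE.findall(explanation)]


# Sample response body, hard-coded for illustration.
response = {
    "explanations": [
        {"index": "test", "valid": True,
         "explanation": "+content:zamestnanost~2"}
    ]
}

clauses = parse_fuzzy_clauses(response["explanations"][0]["explanation"])
# [('content', 'zamestnanost', 2)]
```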


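Hi Andrei, thanks for your reply. That explains the behavior of my fuzzy search. Is there any other way to run a fuzzy search that matches terms more than 2 edits away from my search term? – shimon001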


According to the documentation, only the Fuzzy Like This query allows more than 2.