
elasticsearch ngram and PostgreSQL trigram search results do not match

I created an Elasticsearch index as shown below:

"settings" : { 
    "number_of_shards": 1, 
    "number_of_replicas": 0, 
    "analysis": { 
       "filter": { 
        "trigrams_filter": { 
         "type":  "ngram", 
         "min_gram": 3, 
         "max_gram": 3 
        } 
       }, 
       "analyzer": { 
        "trigrams": { 
         "type":  "custom", 
         "tokenizer": "standard", 
         "filter": [ 
          "lowercase", 
          "trigrams_filter" 
         ] 
        } 
       } 
    } 
}, 
"mappings": { 
    "issue": { 
     "properties": { 
      "description": { 
       "type":  "string", 
       "analyzer": "trigrams" 
      } 
     } 
    } 
} 

My test documents are as follows:

"alici onay verdi basarili satisiniz gerceklesti diyor ama hesabima para transferi gerceklesmemis" 

"otomatik onay işlemi gecikmiş" 

"************* nolu iade islemi urun kargoya verilmedi zamaninda iade islemlerinde urun erorr hata veriyor" 

I tested the index with the following query:

GET issue/_search 
{ 
    "query": { 
     "match": { 
      "description":{ 
       "query": "otomatik onay istemi zamaninda gerceklesmemis" 
      } 
     } 
    } 
} 

The result:

{ 
     .... 
     "hits": { 
      .... 
       "max_score": 2.3507352, 
       "hits": [ 
          { 
           ....         
           "_score": 2.3507352, 
           "_source": { 
            "issue_id": "*******", 
            "description": "alici onay verdi basarili satisiniz gerceklesti diyor ama hesabima para transferi gerceklesmemis" 
              } 
          } 
         ] 
       } 
} 

But the same data gives a different result in PostgreSQL with the SQL below:

SELECT 
    public.tbl_issue_descriptions_big.description, 
    similarity(description, 'otomatik onay islemi zamaninda gerceklesmemis') AS sml 
FROM 
    public.tbl_issue_descriptions_big 
WHERE 
    description %'otomatik onay islemi zamaninda gerceklesmemis' 
ORDER BY 
    sml DESC 
LIMIT 10 

The result is:

description                   | sml
==============================|=========
otomatik onay islemi gecikmis | 0,351852
What causes this difference?

Answer


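I don't know enough about Postgres to give a qualified answer (it also depends on the documents that are indexed, and on whether the two scoring formulas are exactly the same, which I doubt). However, Elasticsearch has the explain API and the explain parameter in search, which can help you find out why a document is scored the way it is.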
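As a minimal sketch of the explain parameter mentioned above: adding `"explain": true` to the search body asks Elasticsearch to return a score breakdown with each hit. The Python below only builds and prints the request for the index and query from the question. The helper name `build_explain_search` is made up for illustration, and nothing is sent to a cluster.

```python
import json


def build_explain_search(field, text):
    """Build a match-query body that also requests a per-hit score explanation."""
    return {
        "explain": True,  # ask Elasticsearch to attach an explanation to every hit
        "query": {"match": {field: {"query": text}}},
    }


# Field name and query text are taken from the question.
body = build_explain_search(
    "description", "otomatik onay istemi zamaninda gerceklesmemis"
)

# Print the request in the same console form the question uses.
print("GET issue/_search")
print(json.dumps(body, indent=2))
```

Each hit in the response should then carry an `_explanation` object listing the trigram terms that matched and how much each one contributed to `_score`.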


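Thank you for your answer. But I think the PostgreSQL equivalent of explain is ts_vector, which is used for full-text search, whereas ngrams and similarity are used for machine learning. I am looking for the similarity algorithm of Elasticsearch. –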
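For comparison, here is a rough Python sketch of how pg_trgm's `similarity()` is documented to work: each word is lowercased and padded with two leading spaces and one trailing space, the distinct trigrams are collected, and the score is the number of shared trigrams divided by the number of trigrams in the union. The function names are hypothetical, and the word splitting is an assumption that treats every non-alphanumeric character as a separator. Real pg_trgm behaviour for non-ASCII letters depends on the database locale.

```python
import re


def pg_trgm_set(text):
    """Distinct trigrams of a string, padded per word the way pg_trgm describes."""
    grams = set()
    # Assumption: words are runs of letters/digits; everything else separates words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = "  " + word + " "  # two spaces in front, one behind
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = pg_trgm_set(a), pg_trgm_set(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


# Query string from the SQL statement and test documents from the question.
query = "otomatik onay islemi zamaninda gerceklesmemis"
docs = [
    "alici onay verdi basarili satisiniz gerceklesti diyor ama hesabima para transferi gerceklesmemis",
    "otomatik onay işlemi gecikmiş",
    "************* nolu iade islemi urun kargoya verilmedi zamaninda iade islemlerinde urun erorr hata veriyor",
]

for doc in docs:
    print(round(similarity(doc, query), 6), doc)
# The second document scores 19/54, which rounds to 0.351852.
```

Under these assumptions the short document reproduces the 0,351852 from the PostgreSQL result. Because the score is divided by the union of both trigram sets, a long description is penalised for every trigram that does not appear in the query, which is not how Elasticsearch computes `_score`.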


Check the Lucene documentation, for example https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html or https://lucene.apache.org/core/6_6_0/core/org/apache/lucene/search/similarities/BM25Similarity.html (which is the default since ES 5.0 if you create a new index) – alr
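To make the BM25 formula from that documentation concrete, here is a toy Python re-implementation over trigram tokens. It is a sketch, not Lucene itself: the analyzer is approximated by splitting on non-alphanumeric characters, lowercasing, and cutting each word into 3-grams; `k1=1.2` and `b=0.75` are the Lucene defaults; document length is used exactly rather than in Lucene's compressed norm encoding. All function names are made up for illustration, and real scores also depend on the statistics of the whole index.

```python
import math
import re
from collections import Counter


def trigram_tokens(text):
    """Approximate the 'trigrams' analyzer: split into words, lowercase, 3-grams."""
    tokens = []
    for word in re.findall(r"[^\W_]+", text.lower()):
        # Words shorter than three characters yield no tokens.
        tokens.extend(word[i:i + 3] for i in range(len(word) - 2))
    return tokens


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a plain BM25 sum over terms."""
    doc_tokens = [trigram_tokens(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in doc_tokens) / n_docs
    # Document frequency: in how many documents each trigram occurs.
    df = Counter()
    for toks in doc_tokens:
        df.update(set(toks))
    scores = []
    for toks in doc_tokens:
        tf = Counter(toks)
        score = 0.0
        for term in trigram_tokens(query):
            if term not in tf:
                continue  # non-matching query terms contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Query and documents from the question.
es_query = "otomatik onay istemi zamaninda gerceklesmemis"
docs = [
    "alici onay verdi basarili satisiniz gerceklesti diyor ama hesabima para transferi gerceklesmemis",
    "otomatik onay işlemi gecikmiş",
    "************* nolu iade islemi urun kargoya verilmedi zamaninda iade islemlerinde urun erorr hata veriyor",
]

for doc, score in zip(docs, bm25_scores(es_query, docs)):
    print(round(score, 4), doc)
```

The point of the sketch is the shape of the formula: the score is a sum of per-trigram weights for the terms that match, with only a soft length normalisation. That is a different quantity from a ratio of shared to total trigrams, so the two systems need not agree on the ranking.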