2015-11-07 92 views
3

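Extract keywords (multi-word) from text using Elasticsearch

I have an index full of keywords, and I want to use those keywords to extract the matching keywords from an input text.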

Below is a sample of the keyword index. Note that a keyword can consist of several words; essentially, each keyword is a unique tag.

{
    "hits": {
        "total": 2000,
        "hits": [
            {
                "id": 1,
                "keyword": "thousand eyes"
            },
            {
                "id": 2,
                "keyword": "facebook"
            },
            {
                "id": 3,
                "keyword": "superdoc"
            },
            {
                "id": 4,
                "keyword": "quora"
            },
            {
                "id": 5,
                "keyword": "your story"
            },
            {
                "id": 6,
                "keyword": "Surgery"
            },
            {
                "id": 7,
                "keyword": "lending club"
            },
            {
                "id": 8,
                "keyword": "ad roll"
            },
            {
                "id": 9,
                "keyword": "the honest company"
            },
            {
                "id": 10,
                "keyword": "Draft kings"
            }
        ]
    }
}

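If I pass in the text "I saw the news of lending club on facebook, your story and quora", the search output should be ["lending club", "facebook", "your story", "quora"]. The search should also be case-insensitive.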

Answer

6

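There is only one real way to do this: index your data with a keyword analyzer and search it with a shingle analyzer.

See this reproduction.

First, we create an index with two custom analyzers, one keyword-based and one shingle-based: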

PUT test
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer_keyword": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [
                        "asciifolding",
                        "lowercase"
                    ]
                },
                "my_analyzer_shingle": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "asciifolding",
                        "lowercase",
                        "shingle"
                    ]
                }
            }
        }
    },
    "mappings": {
        "your_type": {
            "properties": {
                "keyword": {
                    "type": "string",
                    "index_analyzer": "my_analyzer_keyword",
                    "search_analyzer": "my_analyzer_shingle"
                }
            }
        }
    }
}

Now let's index the sample data you gave us:

POST /test/your_type/1 
{ 
    "id": 1, 
    "keyword": "thousand eyes" 
} 
POST /test/your_type/2 
{ 
    "id": 2, 
    "keyword": "facebook" 
} 
POST /test/your_type/3 
{ 
    "id": 3, 
    "keyword": "superdoc" 
} 
POST /test/your_type/4 
{ 
    "id": 4, 
    "keyword": "quora" 
} 
POST /test/your_type/5 
{ 
    "id": 5, 
    "keyword": "your story" 
} 
POST /test/your_type/6 
{ 
    "id": 6, 
    "keyword": "Surgery" 
} 
POST /test/your_type/7 
{ 
    "id": 7, 
    "keyword": "lending club" 
} 
POST /test/your_type/8 
{ 
    "id": 8, 
    "keyword": "ad roll" 
} 
POST /test/your_type/9 
{ 
    "id": 9, 
    "keyword": "the honest company" 
} 
POST /test/your_type/10 
{ 
    "id": 10, 
    "keyword": "Draft kings" 
} 

Finally, run the search query:

POST /test/your_type/_search
{
    "query": {
        "match": {
            "keyword": "I saw the news of lending club on facebook, your story and quora"
        }
    }
}

This is the result:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 4,
        "max_score": 0.009332742,
        "hits": [
            {
                "_index": "test",
                "_type": "your_type",
                "_id": "2",
                "_score": 0.009332742,
                "_source": {
                    "id": 2,
                    "keyword": "facebook"
                }
            },
            {
                "_index": "test",
                "_type": "your_type",
                "_id": "7",
                "_score": 0.009332742,
                "_source": {
                    "id": 7,
                    "keyword": "lending club"
                }
            },
            {
                "_index": "test",
                "_type": "your_type",
                "_id": "4",
                "_score": 0.009207102,
                "_source": {
                    "id": 4,
                    "keyword": "quora"
                }
            },
            {
                "_index": "test",
                "_type": "your_type",
                "_id": "5",
                "_score": 0.0014755741,
                "_source": {
                    "id": 5,
                    "keyword": "your story"
                }
            }
        ]
    }
}
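
The question asked for a flat list of keywords rather than the full response. Here is a minimal Python sketch of pulling that list out of the response body, assuming the body is already available as a JSON string. The helper name `extract_keywords` is made up for illustration, and the response is abbreviated from the result above.

```python
import json

# Abbreviated copy of the search response shown above.
response_text = """
{
  "hits": {
    "total": 4,
    "hits": [
      {"_id": "2", "_score": 0.009332742, "_source": {"id": 2, "keyword": "facebook"}},
      {"_id": "7", "_score": 0.009332742, "_source": {"id": 7, "keyword": "lending club"}},
      {"_id": "4", "_score": 0.009207102, "_source": {"id": 4, "keyword": "quora"}},
      {"_id": "5", "_score": 0.0014755741, "_source": {"id": 5, "keyword": "your story"}}
    ]
  }
}
"""


def extract_keywords(response):
    """Return the stored keyword of every hit, in the order they were returned."""
    # Hypothetical helper: each hit keeps the original document under "_source".
    return [hit["_source"]["keyword"] for hit in response["hits"]["hits"]]


keywords = extract_keywords(json.loads(response_text))
print(keywords)  # ['facebook', 'lending club', 'quora', 'your story']
```

Keep in mind that a search returns only the top 10 hits by default, so a text that may contain more keywords than that would need a larger `size` in the request.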

So what happens behind the scenes?

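  1. It indexes each document as a whole keyword (the keyword tokenizer emits the entire string as a single token). I also added the asciifolding filter (it normalizes letters, e.g. é becomes e) and the lowercase filter (for case-insensitive search). So, for example, Draft kings is indexed as draft kings.
  2. The search analyzer uses the same logic, except that its tokenizer emits individual word tokens and shingles (combinations of adjacent tokens) are built on top of them. Those shingles are what match the keywords indexed in the first step.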
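
The two steps above can be imitated outside Elasticsearch. The following Python sketch is only a rough emulation under stated assumptions, not Lucene's implementation: `normalize`, `keyword_analyzer`, `shingle_analyzer` and `extract` are hypothetical names, the tokenizer is a crude split on non-alphanumeric characters, and the result order differs from Elasticsearch's score order. It also shows one thing worth checking: as far as I know, the `shingle` filter builds only two-word shingles by default, so a three-word keyword such as "the honest company" would presumably need `max_shingle_size` raised to 3.

```python
import unicodedata

KEYWORDS = [
    "thousand eyes", "facebook", "superdoc", "quora", "your story",
    "Surgery", "lending club", "ad roll", "the honest company", "Draft kings",
]


def normalize(text):
    """Rough stand-in for the asciifolding + lowercase filters."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


def keyword_analyzer(text):
    """Index side: the whole string becomes a single normalized token."""
    return [normalize(text)]


def shingle_analyzer(text, max_shingle_size=2):
    """Search side: word tokens plus shingles of 2..max_shingle_size adjacent words."""
    # Crude replacement for the standard tokenizer: split on anything
    # that is not a letter or a digit.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in normalize(text))
    words = cleaned.split()
    tokens = list(words)  # single-word tokens (unigrams)
    for size in range(2, max_shingle_size + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens


# "Index time": one token per keyword, mapped back to the stored value.
INDEX = {keyword_analyzer(keyword)[0]: keyword for keyword in KEYWORDS}


def extract(text, max_shingle_size=2):
    """Return every stored keyword hit by one of the search-side tokens."""
    found = []
    for token in shingle_analyzer(text, max_shingle_size):
        original = INDEX.get(token)
        if original is not None and original not in found:
            found.append(original)
    return found


text = "I saw the news of lending club on facebook, your story and quora"
print(extract(text))  # ['facebook', 'quora', 'lending club', 'your story']

# A three-word keyword is missed with two-word shingles...
print(extract("The Honest Company sells soap"))  # []
# ...and found once three-word shingles are generated.
print(extract("The Honest Company sells soap", max_shingle_size=3))  # ['the honest company']
```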
+0

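Has anyone been able to run this on Elasticsearch 5.x? It seems the mapping type should be changed from string to text, and index_analyzer to just analyzer, but I get a too_many_clauses error when I try to run a search – mac

+0

@mac Let me try to get this working for you! –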

+0

@mac I was able to run the queries, but they did not bring back any data. I have logged the issue on GitHub: https://github.com/elastic/elasticsearch/issues/26989 –