2017-03-06

Failing to get a pattern_capture token filter working: I am trying to implement an Elasticsearch pattern_capture filter that turns EDR-00004 into the tokens [EDR-00004, 00004, 4]. I am (still) on Elasticsearch 2.4, but the documentation does not differ from current ES versions.

I followed the example from the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.4/analysis-pattern-capture-tokenfilter.html

Here are my tests and the results:

curl -XPUT 'localhost:9200/test_index' -d '{ 
    "settings": { 
     "analysis": { 
      "filter": { 
       "process_number_filter": { 
        "type": "pattern_capture", 
        "preserve_original": 1, 
        "patterns": [ 
         "([A-Za-z]+-([0]+([0-9]+)))" 
        ] 
       } 
      }, 
      "analyzer": { 
       "process_number_analyzer": { 
        "type": "custom", 
        "tokenizer": "pattern", 
        "filter": ["process_number_filter"] 
       } 
      } 
     } 
    } 
}' 

curl -XGET 'localhost:9200/test_index/_analyze' -d ' 
{ 
    "analyzer": "process_number_analyzer", 
    "text": "EDR-00002" 
}' 

curl -XGET 'localhost:9200/test_index/_analyze' -d ' 
{ 
    "analyzer": "standard", 
    "tokenizer": "standard", 
    "filter": ["process_number_filter"], 
    "text": "EDR-00002" 
}' 

This returns:

{"acknowledged":true} 

{ 
    "tokens": [{ 
     "token": "EDR", 
     "start_offset": 0, 
     "end_offset": 3, 
     "type": "word", 
     "position": 0 
    }, { 
     "token": "00002", 
     "start_offset": 4, 
     "end_offset": 9, 
     "type": "word", 
     "position": 1 
    }] 
} 

{ 
    "tokens": [{ 
     "token": "edr", 
     "start_offset": 0, 
     "end_offset": 3, 
     "type": "<ALPHANUM>", 
     "position": 0 
    }, { 
     "token": "00002", 
     "start_offset": 4, 
     "end_offset": 9, 
     "type": "<NUM>", 
     "position": 1 
    }] 
} 

I understand that:

  1. I don't need to group the entire regex, since I have preserve_original set.
  2. I could replace things with \d and/or \w, but this way I don't have to think about escaping.

I also made sure my regex is correct:

>>> import re
>>> m = re.match(r"([A-Za-z]+-([0]+([0-9]+)))", "EDR-00004")
>>> m.groups() 
('EDR-00004', '00004', '4') 

Answer

I hate answering my own question, but I found the answer, and maybe it will help someone in the future.

My problem was the default tokenizer, which split the text before it was ever passed to my filter. By adding my own tokenizer, which overrides the default pattern "\W+" with "[^\\w-]+", my filter receives the whole word and produces the proper tokens.
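The difference between the two tokenizer patterns can be checked outside Elasticsearch with Python's re module (a rough analogy — the pattern tokenizer splits on whatever the pattern matches, just like re.split):

```python
import re

# The default pattern tokenizer splits on \W+, which treats the hyphen as a
# separator, so the filter never sees "EDR-00002" as a single token:
print(re.split(r"\W+", "EDR-00002"))      # → ['EDR', '00002']

# Excluding the hyphen from the separator class keeps the word intact:
print(re.split(r"[^\w-]+", "EDR-00002"))  # → ['EDR-00002']
```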

These are my custom settings now:

curl -XPUT 'localhost:9200/test_index' -d '{ 
    "settings": { 
     "analysis": { 
      "filter": { 
       "process_number_filter": { 
        "type": "pattern_capture", 
        "preserve_original": 1, 
        "patterns": [ 
         "([A-Za-z]+-([0]+([0-9]+)))" 
        ] 
       } 
      }, 
      "tokenizer": { 
       "process_number_tokenizer": { 
        "type": "pattern", 
        "pattern": "[^\\w-]+" 
       } 
      }, 
      "analyzer": { 
       "process_number_analyzer": { 
        "type": "custom", 
        "tokenizer": "process_number_tokenizer", 
        "filter": ["process_number_filter"] 
       } 
      } 
     } 
    } 
}' 

Which produces the following result:

{ 
    "tokens": [ 
     { 
      "token": "EDR-00002", 
      "start_offset": 0, 
      "end_offset": 9, 
      "type": "word", 
      "position": 0 
     }, 
     { 
      "token": "00002", 
      "start_offset": 0, 
      "end_offset": 9, 
      "type": "word", 
      "position": 0 
     }, 
     { 
      "token": "2", 
      "start_offset": 0, 
      "end_offset": 9, 
      "type": "word", 
      "position": 0 
     } 
    ] 
}
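For reference, the filter's behavior on a single token can be sketched in Python (a rough simulation only, not the actual Lucene implementation — the real filter also assigns offsets and positions, which is why every token above shares position 0):

```python
import re

def pattern_capture(token, patterns, preserve_original=True):
    """Rough sketch of Elasticsearch's pattern_capture token filter:
    emit every capture group of every pattern as an extra token at the
    same position, optionally keeping the original, without duplicates."""
    out = [token] if preserve_original else []
    for pattern in patterns:
        for match in re.finditer(pattern, token):
            for group in match.groups():
                if group and group not in out:
                    out.append(group)
    return out

print(pattern_capture("EDR-00002", [r"([A-Za-z]+-([0]+([0-9]+)))"]))
# → ['EDR-00002', '00002', '2']
```

Group 1 matches the whole input here, so with preserve_original set it collapses into the original token, leaving exactly the three tokens shown in the _analyze output.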