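Failed to implement a pattern_capture token filter

I am trying to implement an Elasticsearch pattern_capture filter that turns EDR-00004 into the tokens [EDR-00004, 00004, 4]. I am (still) using Elasticsearch 2.4, but the documentation for this filter is no different in the current ES version.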
Here are my tests and their results:
curl -XPUT 'localhost:9200/test_index' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "process_number_filter": {
          "type": "pattern_capture",
          "preserve_original": 1,
          "patterns": [
            "([A-Za-z]+-([0]+([0-9]+)))"
          ]
        }
      },
      "analyzer": {
        "process_number_analyzer": {
          "type": "custom",
          "tokenizer": "pattern",
          "filter": ["process_number_filter"]
        }
      }
    }
  }
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '
{
  "analyzer": "process_number_analyzer",
  "text": "EDR-00002"
}'
curl -XGET 'localhost:9200/test_index/_analyze' -d '
{
  "analyzer": "standard",
  "tokenizer": "standard",
  "filter": ["process_number_filter"],
  "text": "EDR-00002"
}'
This returns:
{"acknowledged":true}
{
  "tokens": [{
    "token": "EDR",
    "start_offset": 0,
    "end_offset": 3,
    "type": "word",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "word",
    "position": 1
  }]
}
{
  "tokens": [{
    "token": "edr",
    "start_offset": 0,
    "end_offset": 3,
    "type": "<ALPHANUM>",
    "position": 0
  }, {
    "token": "00002",
    "start_offset": 4,
    "end_offset": 9,
    "type": "<NUM>",
    "position": 1
  }]
}
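I am aware that:
- I don't need to group the whole regex, since I have preserve_original set
- I could replace parts of it with \d and/or \w, but this way I don't have to think about escaping.
I also made sure that my regex itself is correct: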
>>> m = re.match(r"([A-Za-z]+-([0]+([0-9]+)))", "EDR-00004")
>>> m.groups()
('EDR-00004', '00004', '4')
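
One thing the check above does not cover is that a token filter only sees what the tokenizer hands it. The following is a rough Python sketch of that pipeline, not Elasticsearch itself. It assumes the pattern tokenizer splits on its documented default of \W+ and that pattern_capture is applied to each token separately. The helpers pattern_tokenize and capture are hypothetical names made up for this illustration.

```python
import re

# Same regex as in the index settings above.
PATTERN = re.compile(r"([A-Za-z]+-([0]+([0-9]+)))")


def pattern_tokenize(text, split_on=r"\W+"):
    """Approximate a pattern tokenizer: split on the regex, drop empty pieces.

    The default of \\W+ mirrors the documented default of the ES pattern tokenizer.
    """
    return [tok for tok in re.split(split_on, text) if tok]


def capture(tokens):
    """Approximate pattern_capture with preserve_original set.

    Each token is kept, followed by every capture group found inside
    that token (duplicates of already emitted strings are skipped).
    """
    out = []
    for tok in tokens:
        emitted = [tok]
        for match in PATTERN.finditer(tok):
            for group in match.groups():
                if group and group not in emitted:
                    emitted.append(group)
        out.extend(emitted)
    return out


# Splitting on \W+ removes the hyphen, so no single token can match the regex.
split_tokens = pattern_tokenize("EDR-00002")
print(capture(split_tokens))

# Splitting on whitespace only keeps "EDR-00002" whole, so the groups are found.
whole_token = pattern_tokenize("EDR-00002", split_on=r"\s+")
print(capture(whole_token))
```

If this approximation is faithful, the hyphen is already gone by the time the filter runs in both of my tests, which would match the output above. A tokenizer that keeps EDR-00002 as one token (for example whitespace or keyword) might then be worth trying.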