2016-02-12 53 views

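Elasticsearch path_hierarchy tokenizes only half of the path

I want to index paths using the path_hierarchy tokenizer, but it seems to tokenize only half of the path I provide. I have tried different paths and the result is the same.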

My settings:

{
    "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 0,
        "analysis": {
            "analyzer": {
                "keylower": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": "lowercase"
                },
                "path_analyzer": {
                    "type": "custom",
                    "tokenizer": "path_tokenizer",
                    "filter": [ "lowercase", "asciifolding", "path_ngrams" ]
                },
                "code_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [ "lowercase", "asciifolding", "code_stemmer" ]
                },
                "not_analyzed": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "lowercase", "asciifolding", "code_stemmer" ]
                }
            },
            "tokenizer": {
                "path_tokenizer": {
                    "type": "path_hierarchy"
                }
            },
            "filter": {
                "path_ngrams": {
                    "type": "edgeNGram",
                    "min_gram": 3,
                    "max_gram": 15
                },
                "code_stemmer": {
                    "type": "stemmer",
                    "name": "minimal_english"
                }
            }
        }
    }
}

My mapping is as follows:

{
    "dynamic": "strict",
    "properties": {
        "depot_path": {
            "type": "string",
            "analyzer": "path_analyzer"
        }
    },
    "_all": {
        "store": "yes",
        "analyzer": "english"
    }
}

When I run the analyzer, I find that the following tokens are produced for the depot_path "//cm/mirror/v1.2/Kolkata/ixin-packages/builds/":

    "key": "//c",
    "key": "//cm",
    "key": "//cm/",
    "key": "//cm/m",
    "key": "//cm/mi",
    "key": "//cm/mir",
    "key": "//cm/mirr",
    "key": "//cm/mirro",
    "key": "//cm/mirror",
    "key": "//cm/mirror/",
    "key": "//cm/mirror/v",
    "key": "//cm/mirror/v1",
    "key": "//cm/mirror/v1.",

Why isn't the whole path tokenized?

I expected tokens to be produced all the way up to //cm/mirror/v1.2/Kolkata/ixin-packages/builds/

I tried increasing the buffer size, but that did not help. Does anyone know what I am doing wrong?

https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pathhierarchy-tokenizer.html

Answers

"max_gram": 15 limits the token length to 15 characters. If you increase "max_gram", you will see that more of the path is tokenized.

Here is an example from my environment.

"max_gram": 15
input path: /var/log/www/html/web/
path_analyzer tokenized this path up to /var/log/www/ht, i.e. 15 characters

"max_gram": 100
input path: /var/log/www/html/web/WANTED
path_analyzer tokenized this path up to /var/log/www/html/web/WANTED, i.e. 28 characters (< 100)
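
To make the interaction concrete, here is a rough Python sketch of what the analyzer chain appears to be doing. It is only a simplified model, not Elasticsearch code: `path_hierarchy`, `edge_ngrams` and `analyze` are hypothetical helper names, the asciifolding step is left out because the path is plain ASCII, and the real tokenizer may handle edge cases such as the leading `//` differently. The point is that every path_hierarchy token is a prefix of the full path, so after the edge n-gram step nothing longer than `max_gram` characters survives.

```python
def path_hierarchy(path, delimiter="/"):
    """Simplified stand-in for the path_hierarchy tokenizer (an assumption,
    not the Lucene implementation): emit every prefix of `path` that ends
    just before a delimiter, followed by the full path."""
    tokens = [path[:i] for i, ch in enumerate(path) if ch == delimiter and i > 0]
    tokens.append(path)
    return tokens


def edge_ngrams(token, min_gram, max_gram):
    """Leading substrings of `token` with lengths min_gram..max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def analyze(path, min_gram=3, max_gram=15):
    """Model of path_analyzer: tokenize, lowercase, then edge n-grams.
    Returns the unique grams ordered from shortest to longest."""
    grams = set()
    for token in path_hierarchy(path.lower()):
        grams.update(edge_ngrams(token, min_gram, max_gram))
    return sorted(grams, key=len)


PATH = "//cm/mirror/v1.2/Kolkata/ixin-packages/builds/"

capped = analyze(PATH, max_gram=15)
print(capped[0], capped[-1], len(capped[-1]))  # //c //cm/mirror/v1. 15

raised = analyze(PATH, max_gram=100)
print(raised[-1])  # the full (lowercased) path
```

With `max_gram=15` the model yields the same 13 tokens listed in the question, from `//c` to `//cm/mirror/v1.`. Raising `max_gram` above the path length lets the full path through.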

Thanks :) I decided to just get rid of the 'path_ngrams' filter.


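This is because your "max_gram" value is set to 15. That is why the longest generated token ("//cm/mirror/v1.") is exactly 15 characters long. Set it to a value larger than your longest path and you will get the tokens you want.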


Thanks :) I am accepting Shubhangi's answer because she beat you by 16 seconds. :)