2017-07-10 82 views

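CircuitBreakingException caused by nested sorting of Japanese text with icu_collation

I'm using Elasticsearch 2.4 and have added the icu_analysis plugin to provide sorting on Japanese text. It works well enough in my local environment, which has a limited number of documents, but when I try it on a more realistic dataset, the query fails with the following CircuitBreakingException: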

"CircuitBreakingException[[fielddata] Data too large, data for [translations.name.jp_sort] would be larger than limit of [10239895142/9.5gb]]" 

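As far as I understand, this happens when sorting uses fielddata across a large number of documents, and doc values should be used instead. However, I'm not sure whether that is possible in this case, or why it isn't already happening.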

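The index holds about 470 million documents, which store translations as nested documents. Only about 35 million of the full set contain a Japanese translation. Here is the mapping for the documents: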

{
  "settings": {
    "number_of_shards": 6,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "trigrams_filter": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        },
        "japanese_ordering": {
          "type": "icu_collation",
          "language": "ja",
          "country": "JP"
        }
      },
      "analyzer": {
        "trigrams": {
          "tokenizer": "my_ngram_tokenizer",
          "filter": "lowercase"
        },
        "japanese_ordering": {
          "tokenizer": "keyword",
          "filter": [ "japanese_ordering" ]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "3",
          "max_gram": "3",
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }
    }
  },
  "mappings": {
    "product": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "trigrams",
          "fields": {
            "value": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        },
        "record_status": {
          "type": "integer"
        },
        "categories": {
          "type": "integer"
        },
        "variant_status": {
          "type": "integer"
        },
        "visit_count": {
          "type": "integer"
        },
        "translations": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "string",
              "fields": {
                "jp_sort": {
                  "type": "string",
                  "analyzer": "japanese_ordering"
                }
              }
            },
            "language_id": {
              "type": "short"
            }
          }
        }
      }
    }
  }
}

And this is the query that trips the circuit breaker:

{
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "should": [],
      "must_not": [],
      "must": [{
        "nested": {
          "path": "translations",
          "score_mode": "max",
          "query": {
            "bool": {
              "must": [{
                "match": {
                  "translations.name": {
                    "query": "\u30C6\u30B9\u30C8",
                    "boost": 5
                  }
                }
              }]
            }
          }
        }
      }]
    }
  },
  "filter": {
    "bool": {
      "must": [{
        "terms": {
          "variant_status": ["1"],
          "_cache": true
        }
      }, {
        "nested": {
          "path": "translations",
          "query": {
            "bool": {
              "must": [{
                "term": {
                  "translations.language_id": 9,
                  "_cache": true
                }
              }]
            }
          }
        }
      }, {
        "term": {
          "record_status": 1,
          "_cache": true
        }
      }],
      "must_not": [{
        "term": {
          "product_collections": 0
        }
      }]
    }
  },
  "sort": [{
    "translations.name.jp_sort": {
      "order": "asc",
      "nested_path": "translations"
    }
  }]
}

ES 5.5 introduced a new field type called `icu_collation_keyword` that solves the problem you are running into. You can read more about it here: https://www.elastic.co/blog/elasticsearch-5-5-0-released – Val
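
For illustration, here is a rough sketch of how the `translations` part of the mapping might look on ES 5.5 or later with the analysis-icu plugin installed. It is adapted from the question's mapping and is not a tested configuration: `string` becomes `text` in 5.x, the `jp_sort` sub-field uses `icu_collation_keyword` with the same `ja`/`JP` locale as the original collation filter, and `"index": false` is an assumption that the sub-field is only ever used for sorting. This field type stores its collation keys as doc values, so sorting on it should not need to load fielddata onto the heap.

```json
{
  "mappings": {
    "product": {
      "properties": {
        "translations": {
          "type": "nested",
          "properties": {
            "name": {
              "type": "text",
              "fields": {
                "jp_sort": {
                  "type": "icu_collation_keyword",
                  "index": false,
                  "language": "ja",
                  "country": "JP"
                }
              }
            },
            "language_id": {
              "type": "short"
            }
          }
        }
      }
    }
  }
}
```

The field type of an existing field cannot be changed in place, so the data would have to be reindexed into a new index created with this mapping.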


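Indeed, that did solve it. I spent a few hours updating my queries and indexer for the version change, and then tried `icu_collation_keyword`. It works well, and it is very fast! If you would like to post your comment as an answer, I will mark it as accepted. Thanks!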
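
For anyone making the same move, the query changes for 5.x might look roughly like the sketch below. This is a guess at the kind of update involved, not the poster's actual query. As far as I recall, the top-level `filter` has to be renamed to `post_filter` and the `_cache` flags have to be dropped, since 5.0 no longer accepts them. The empty `should` and `must_not` arrays and the single-clause inner `bool` are omitted for brevity. The sort clause itself can stay as it was, because `nested_path` is still accepted in 5.5.

```json
{
  "from": 0,
  "size": 20,
  "query": {
    "bool": {
      "must": [{
        "nested": {
          "path": "translations",
          "score_mode": "max",
          "query": {
            "match": {
              "translations.name": {
                "query": "\u30C6\u30B9\u30C8",
                "boost": 5
              }
            }
          }
        }
      }]
    }
  },
  "post_filter": {
    "bool": {
      "must": [{
        "terms": { "variant_status": ["1"] }
      }, {
        "nested": {
          "path": "translations",
          "query": {
            "term": { "translations.language_id": 9 }
          }
        }
      }, {
        "term": { "record_status": 1 }
      }],
      "must_not": [{
        "term": { "product_collections": 0 }
      }]
    }
  },
  "sort": [{
    "translations.name.jp_sort": {
      "order": "asc",
      "nested_path": "translations"
    }
  }]
}
```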

Answers