Elasticsearch: poor performance on a large data set

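I have a 7-node Elasticsearch cluster with 2 indices, both of which use nested object mappings. I am seeing delays when inserting into Index2 (through Spark Streaming). I am using bulk inserts, and each batch (~100k records) takes ~8-12 s.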

Node Configuration: 
RAM: 64 GB 
Core: 48 
HDD : 1 TB 
JVM allocated Memory: 32 GB 

Marvel Node Status: 
CPU Usage: ~10-20%
JVM Memory: ~60-75% 
Load Average : ~3-35 
Indexing Rate: ~10k/s 
Search Rate: ~2k/s 

Index1 (Replication 1): 
Status: green 
Documents: 84.4b 
Data: 9.3TB 
Total Shards: 400 (could this be the reason for the low performance?)

Index2 (Replication 1): 
Status: green 
Documents: 1.4b 
Data: 35.8GB 
Total Shards: 10 
Unassigned Shards: 0 

Spark Streaming configuration:
Executors: 2
Cores per executor: 8
Number of partitions: 16
Batch size: 10s
Events per batch: ~1k-200k
Elastic bulk insert count: 100k

Index2 mapping:

{
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1
        }
    },
    "mappings": {
        "parent_list": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "parents": {
                    "type": "nested",
                    "properties": {
                        "parent_id": {
                            "type": "integer",
                            "doc_values": false
                        },
                        "childs": {
                            "type": "nested",
                            "properties": {
                                "child_id": {
                                    "type": "integer",
                                    "doc_values": false
                                },
                                "timestamp": {
                                    "type": "long",
                                    "doc_values": false
                                },
                                "is_deleted": {
                                    "type": "boolean",
                                    "doc_values": false
                                }
                            }
                        }
                    }
                },
                "other_ID": {
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": false
                }
            }
        }
    }
}

My queries:

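Get the count by parent ID with at least one child where is_deleted is false.
Get the count by child ID where is_deleted is false.
Get a document by _id.

I was expecting higher performance from Elasticsearch, but it has become the bottleneck of my system. Can anyone suggest performance tuning? Can we achieve a higher insert rate from Elasticsearch with this cluster configuration?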
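For readers trying to reproduce the first query, here is a minimal sketch of how it might be expressed against the mapping above. It assumes that "count by parent ID" means counting documents that contain a parent with the given ID and at least one child whose is_deleted is false; the helper name `count_by_parent_query` is hypothetical, and the body would be sent to the index's `_count` endpoint.

```python
import json


def count_by_parent_query(parent_id: int) -> dict:
    """Build a _count request body (hypothetical helper, not from the post).

    Matches documents that have a nested parent with the given parent_id
    and, inside that parent, at least one child with is_deleted == false.
    """
    return {
        "query": {
            "nested": {
                "path": "parents",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"parents.parent_id": parent_id}},
                            {
                                # Second nesting level: childs inside parents.
                                "nested": {
                                    "path": "parents.childs",
                                    "query": {
                                        "term": {
                                            "parents.childs.is_deleted": False
                                        }
                                    },
                                }
                            },
                        ]
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body for an example parent ID.
    print(json.dumps(count_by_parent_query(42), indent=2))
```

Each level of nesting is stored as separate hidden documents, so a query like this touches many more Lucene documents than the visible document count suggests.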

Source

2016-12-30 Nishant Kumar

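A bulk of 100k documents sounds like a lot. Can you lower it and try again?

I tried 10k, but it did not improve much.

@AndreiStefan Index1 has 400 shards. Could that be the reason for the low performance? What insert rate should I expect?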

Your problem is probably not in the configuration; it may be at the hardware level.

Try disabling throttling:

PUT /_cluster/settings
{
    "transient": {
        "indices.store.throttle.type": "none"
    }
}

Turn off replicas (set them to 0). Lower the number of shards to at most 2-3 times the number of nodes (400 is ridiculously dangerous).

Change the refresh interval to -1 while indexing:

PUT /{INDICE}/_settings
{
    "index": {
        "refresh_interval": "-1"
    }
}
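The replica and refresh changes above are usually meant to be temporary, so a sketch of the settings bodies before and after a bulk load may help. The function names are hypothetical. The restore values are assumptions: "1s" is the Elasticsearch default refresh interval, and 1 replica matches the index settings shown in the question.

```python
import json


def bulk_load_settings() -> dict:
    """Settings body to apply before a heavy bulk load (hypothetical helper)."""
    return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}


def restore_settings(refresh_interval: str = "1s", replicas: int = 1) -> dict:
    """Settings body to apply once the load is done.

    Defaults are assumptions: "1s" is the stock refresh interval and
    1 replica is what the question's index was created with.
    """
    return {
        "index": {
            "refresh_interval": refresh_interval,
            "number_of_replicas": replicas,
        }
    }


if __name__ == "__main__":
    # Each body would be sent with PUT /{INDICE}/_settings.
    print(json.dumps(bulk_load_settings()))
    print(json.dumps(restore_settings()))
```

Note that with refresh disabled, newly indexed documents are not visible to searches until the interval is restored or a refresh is triggered manually, which matters for a cluster that also serves ~2k searches per second.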

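Load-balance the bulk requests across the servers (nodes).

Use persistent connections, for example via sockets.

Make sure you are not running into a network bottleneck.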

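Regarding the 100k-document bulk requests: it depends on the size of each document, but the sweet spot is always around 4-5k. Why? Because the bulk API does not insert the data immediately; it first caches it and then dumps it to disk, and if you end up sending bulks that are too large for the cache, you will run into problems.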
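As a rough illustration of the smaller bulk size suggested above, the following sketch splits a large batch into chunks of about 5k documents and renders each chunk in the newline-delimited format the bulk API expects. The index name `index2` and the helper names are assumptions for illustration; the type name `parent_list` comes from the mapping in the question.

```python
import json
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def chunked(docs: Iterable[Dict], size: int = 5000) -> Iterator[List[Dict]]:
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def bulk_body(batch: List[Dict], index: str = "index2",
              doc_type: str = "parent_list") -> str:
    """Render one chunk as a bulk request body.

    Each document becomes two lines: an action line and the source line.
    The body must end with a trailing newline.
    """
    lines = []
    for doc in batch:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    docs = ({"other_ID": str(i)} for i in range(100000))
    bodies = [bulk_body(b) for b in chunked(docs)]
    print(len(bodies), "bulk requests of up to 5000 documents each")
```

The right chunk size depends on document size, so it is worth measuring a few values rather than treating 5k as fixed.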

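If you are using persistent connections, you do not need to worry about the size of your bulk requests: you can open a socket and start sending batches of documents as fast as the cluster can take them. (Because it does not need to go through the handshake every time, this can save you about 50 ms per round trip.)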
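A minimal sketch of the persistent-connection idea, using only the Python standard library: one `http.client.HTTPConnection` is opened once and reused for every bulk request. The host, port, and function names are assumptions. The `Content-Type` header shown is what newer Elasticsearch versions expect for bulk bodies, and older versions may ignore it.

```python
import http.client
from typing import Iterable, List


def open_connection(host: str = "localhost", port: int = 9200):
    """Create a connection object; the socket is opened on first request."""
    return http.client.HTTPConnection(host, port, timeout=30)


def send_bulks(conn, bodies: Iterable[str]) -> List[int]:
    """Send each bulk body over the same connection and collect HTTP statuses."""
    statuses = []
    for body in bodies:
        conn.request(
            "POST",
            "/_bulk",
            body=body.encode("utf-8"),
            headers={"Content-Type": "application/x-ndjson"},
        )
        response = conn.getresponse()
        # The response must be read fully before the connection can be reused.
        response.read()
        statuses.append(response.status)
    return statuses
```

A caller would create the connection once with `open_connection()`, pass it to `send_bulks` for each group of batches, and close it at the end. An HTTP 200 only means the request was accepted; per-item failures are reported inside the response body, which this sketch discards.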

If you have any other questions, let me know. I know this is a bit late, but hopefully someone finds it useful at some point.

Source

2017-06-09 14:36:36
