Elasticsearch index size in 5.x is 40% larger than in 1.x

I have an old cluster running Elasticsearch 1.4.4. It holds about 11 billion documents, and all primaries together come to roughly 4 TB. I am now upgrading to Elasticsearch 5.2.2, which of course means reindexing my data. I am currently indexing into a separate cluster, and I reindex from my source database because I disabled _all and _source on the original index.

So far I have reindexed about 750 million documents and noticed that the new index is already 350 GB. Doing the math, it looks like the index will grow to roughly 5.5 TB when fully indexed. That is over 1.5 TB more than the 1.4.4 index. I was not expecting this; on the contrary, I expected the size to shrink, since I dropped several properties. Is this normal, or am I doing something wrong? Are there different defaults in 5.2.2 that could account for this growth?
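For reference, the extrapolation above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch using the figures from the question and decimal units (1 TB = 1e12 bytes); the per-document sizes are averages, not anything Elasticsearch reports directly:

```shell
# Rough extrapolation of the final 5.2.2 index size from partial reindex progress.
# Figures from the question; decimal units assumed (1 TB = 1e12 bytes).
awk 'BEGIN {
  old_bpd = 4e12  / 11e9    # 1.4.4: ~4 TB / 11 B docs  -> ~364 bytes per doc
  new_bpd = 350e9 / 750e6   # 5.2.2 so far: ~350 GB / 750 M docs -> ~467 bytes per doc
  printf "old: %.0f B/doc\n", old_bpd
  printf "new: %.0f B/doc\n", new_bpd
  printf "projected 5.2.2 size: %.1f TB\n", new_bpd * 11e9 / 1e12
}'
```

This lands at roughly 5.1 TB, in the same ballpark as the ~5.5 TB estimate in the question; the exact figure depends on whether the partial index is representative and how far segments have merged.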
1.4.4 index settings:
{
"index": {
"refresh_interval": "30s",
"number_of_shards": "20",
"creation_date": "1426251049131",
"analysis": {
"analyzer": {
"default": {
"filter": [
"icu_folding",
"icu_normalizer"
],
"type": "custom",
"tokenizer": "icu_tokenizer"
}
}
},
"uuid": "WdgnCLyITgmpb4DROegV3Q",
"version": {
"created": "1040499"
},
"number_of_replicas": "1"
}
}
1.4.4 index mapping:
{
"article": {
"_source": {
"enabled": false
},
"_all": {
"enabled": false
},
"properties": {
"date": {
"format": "dateOptionalTime",
"type": "date",
"doc_values": true
},
"has_enclosures": {
"type": "boolean"
},
"feed_subscribers": {
"type": "integer",
"doc_values": true
},
"feed_language": {
"index": "not_analyzed",
"type": "string"
},
"author": {
"norms": {
"enabled": false
},
"analyzer": "keyword",
"type": "string"
},
"has_pictures": {
"type": "boolean"
},
"title": {
"norms": {
"enabled": false
},
"type": "string"
},
"content": {
"norms": {
"enabled": false
},
"type": "string"
},
"has_video": {
"type": "boolean"
},
"url": {
"index": "not_analyzed",
"type": "string"
},
"feed_canonical": {
"type": "boolean"
},
"feed_id": {
"type": "integer",
"doc_values": true
}
}
}
}
5.2.2 index settings:
{
"articles": {
"settings": {
"index": {
"refresh_interval": "-1",
"number_of_shards": "40",
"provided_name": "articles",
"creation_date": "1489604158595",
"analysis": {
"analyzer": {
"default": {
"filter": [
"icu_folding",
"icu_normalizer"
],
"type": "custom",
"tokenizer": "icu_tokenizer"
}
}
},
"number_of_replicas": "0",
"uuid": "LOeOcZb_TMCX6E_86uMyXQ",
"version": {
"created": "5020299"
}
}
}
}
}
5.2.2 index mapping:
{
"articles": {
"mappings": {
"article": {
"_all": {
"enabled": false
},
"_source": {
"enabled": false
},
"properties": {
"author": {
"type": "text",
"norms": false,
"analyzer": "keyword"
},
"content": {
"type": "text",
"norms": false
},
"date": {
"type": "date"
},
"feed_canonical": {
"type": "boolean"
},
"feed_id": {
"type": "integer"
},
"feed_subscribers": {
"type": "integer"
},
"title": {
"type": "text",
"norms": false
},
"url": {
"type": "keyword"
}
}
}
}
}
}
Any help would be greatly appreciated, since a full reindex of this cluster takes about 30 days... Thanks!
Thanks for the suggestion. Indexing speed is not my concern; it is doing fine. The servers are quite powerful and tuned for ES. Your point about segment merging is valid, and I have indeed noticed some fluctuation, but the index is still much larger. I doubt merging alone will shrink it enough in the end. – Jacket
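On the segment-merging point: the fluctuation, and the fully merged size, can be checked with the standard cat and force-merge APIs. A hedged sketch, assuming the cluster answers on localhost:9200 and using the index name articles from the mappings above; these commands are cluster-specific and meant to be adapted, not run as-is:

```shell
# Inspect per-shard segment counts and sizes (merge fluctuation shows up here).
curl -s 'localhost:9200/_cat/segments/articles?v&h=index,shard,segment,size,docs.count'

# After bulk indexing is finished, merge each shard down to a single segment
# to see the final size. This is I/O-heavy; only run it once indexing is done.
curl -s -XPOST 'localhost:9200/articles/_forcemerge?max_num_segments=1'

# Total store size per index after the merge completes.
curl -s 'localhost:9200/_cat/indices/articles?v&h=index,docs.count,store.size'
```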
Indeed, 30 days is quite a long time for the amount of data you have (though I don't know the size of your cluster). Regarding disk space, this article shares an interesting experience: https://blog.discordapp.com/how-discord-indexes-billions-of-messages-e3d5e9be866f#.6zzwqchb6 – Adonis
The cluster runs happily on 3 servers (a 4th is being added now), each with 64 GB of RAM and 4 × 900 GB SSDs. The source data lives in 11 TB worth of MySQL databases, and those are production databases serving a busy service, so obviously I can't push them to the limit. The bottleneck is not ES. My only concern is the final overall index size. – Jacket