Elasticsearch index cleanup

Elasticsearch v5.6.*

I'm looking for a way to implement a mechanism by which one of my indices (which grows by roughly 1 million documents per day) automatically manages its storage constraints.

For example: I define a maximum document count or maximum index size as a variable "n". I would write a scheduler that checks whether "n" has been reached. If it has, I want to delete the oldest "x" documents (based on time).

I have a couple of questions here:

Obviously, I don't want to delete too much or too little. How do I know what "x" should be? Can I simply tell Elasticsearch, "Hey, delete the oldest 5 GB worth of documents"? My goal is simply to free a fixed amount of storage. Is this possible?

Secondly, I'd like to know what best practice is here. Obviously I don't want to reinvent a square wheel; if something (e.g. Curator, which I only recently heard of) can do the job, I'd be happy to use it.

In your case, the best practice is to use time-based indices, i.e. daily, weekly, or monthly indices, whichever makes sense for the volume of data you have and the retention period you need. You can also use the Rollover API to decide when a new index needs to be created (based on age, document count, or index size), as sketched below.
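For illustration, a minimal sketch of that flow against a local cluster (the alias name logs_write and the index name logs-000001 are placeholders, not anything from the question). Note that on 5.6 the rollover conditions cover age and document count; a maximum-size condition only arrived in later versions:

# Bootstrap: create the first index behind a write alias
curl -X PUT "http://localhost:9200/logs-000001" -H 'Content-Type: application/json' -d'
{
  "aliases": { "logs_write": {} }
}'

# Run this periodically (e.g. from cron): Elasticsearch creates logs-000002
# and repoints the alias if either condition is met
curl -X POST "http://localhost:9200/logs_write/_rollover" -H 'Content-Type: application/json' -d'
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 1000000
  }
}'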

It is much easier to delete an entire index than to delete documents matching some criteria within an index. If you do the latter, the documents are marked deleted, but the space is not freed until the underlying segments are merged. Whereas if you delete an entire time-based index, the space is guaranteed to be freed.
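To make the difference concrete, here is a rough sketch (the index name and the @timestamp field are assumptions): a delete-by-query only tombstones documents, and the space comes back when segments merge, whereas dropping the whole index frees it immediately:

# Delete-by-query: documents are only marked deleted; disk space is
# reclaimed later, when segments merge (the @timestamp field is assumed)
curl -X POST "http://localhost:9200/logs-2018.01.01/_delete_by_query" -H 'Content-Type: application/json' -d'
{ "query": { "range": { "@timestamp": { "lt": "now-7d" } } } }'

# You can force the merge to reclaim space sooner, but it is I/O-heavy
curl -X POST "http://localhost:9200/logs-2018.01.01/_forcemerge?only_expunge_deletes=true"

# Deleting the entire time-based index frees its space right away
curl -X DELETE "http://localhost:9200/logs-2018.01.01"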

I came up with a fairly simple bash script solution for cleaning up time-based indices in Elasticsearch, and figured I'd share it in case anyone is interested. Curator seems to be the standard answer for this, but I really didn't want to install and manage a Python application with all of its dependencies. It doesn't get much simpler than a bash script executed via cron, and it has no dependencies outside of core Linux.

#!/bin/bash

# Make sure expected arguments were provided
if [ $# -lt 3 ]; then
    echo "Invalid number of arguments!"
    echo "This script is used to clean time based indices from Elasticsearch. The indices must have a"
    echo "trailing date in a format that can be represented by the UNIX date command such as '%Y-%m-%d'."
    echo ""
    echo "Usage: `basename $0` host_url index_prefix num_days_to_keep [date_format]"
    echo "The date_format argument is optional and defaults to '%Y-%m-%d'"
    echo "Example: `basename $0` http://localhost:9200 cflogs- 7"
    echo "Example: `basename $0` http://localhost:9200 elasticsearch_metrics- 31 %Y.%m.%d"
    exit 1
fi

elasticsearchUrl=$1
indexNamePrefix=$2
numDaysDataToKeep=$3
dateFormat=%Y-%m-%d
if [ $# -ge 4 ]; then
    dateFormat=$4
fi

# Get the current date in a 'seconds since epoch' format
curDateInSecondsSinceEpoch=$(date +%s)

# Subtract numDaysDataToKeep from the current epoch value to get the last day to keep
let "targetDateInSecondsSinceEpoch=$curDateInSecondsSinceEpoch - ($numDaysDataToKeep * 86400)"

while : ; do
    # Step back one more day from the target date epoch
    let "targetDateInSecondsSinceEpoch=$targetDateInSecondsSinceEpoch - 86400"

    # Convert targetDateInSecondsSinceEpoch into the configured date format
    targetDateString=$(date --date="@$targetDateInSecondsSinceEpoch" +"$dateFormat")

    # Build the index name from the prefix and the calculated date string
    indexName="$indexNamePrefix$targetDateString"

    # First check whether an index with this date pattern exists
    # Curl options:
    #   -s                 silent mode; don't show progress meter or error messages
    #   -w "%{http_code}"  print only the HTTP status code after the transfer completes
    #   -I                 send a HEAD request; there is no body, so curl doesn't wait for one
    #   -o /dev/null       discard the response output (does not affect the -w output)
    httpCode=$(curl -o /dev/null -s -w "%{http_code}" -I "$elasticsearchUrl/$indexName")
    if [ "$httpCode" -ne 200 ]; then
        echo "Index $indexName does not exist. Stopping processing."
        break
    fi

    # Ask Elasticsearch to delete the index and save the HTTP return code
    httpCode=$(curl -o /dev/null -s -w "%{http_code}" -X DELETE "$elasticsearchUrl/$indexName")
    if [ "$httpCode" -eq 200 ]; then
        echo "Successfully deleted index $indexName."
    else
        echo "FAILURE! Delete command failed with return code $httpCode. Continuing processing with next day."
        continue
    fi

    # Verify the index no longer exists; this should return 404 once the index is gone
    httpCode=$(curl -o /dev/null -s -w "%{http_code}" -I "$elasticsearchUrl/$indexName")
    if [ "$httpCode" -eq 200 ]; then
        echo "FAILURE! Delete command responded successfully, but index still exists. Continuing processing with next day."
        continue
    fi
done
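For example, to run it nightly via cron, keeping 7 days of cflogs-YYYY-MM-DD indices (the script path and schedule here are just assumptions):

# /etc/cron.d/es-index-cleanup: run at 01:00 every night
0 1 * * * root /usr/local/bin/es-index-cleanup.sh http://localhost:9200 cflogs- 7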

I answered the same question at https://discuss.elastic.co/t/elasticsearch-efficiently-cleaning-up-the-indices-to-save-space/137019

If the index keeps growing, deleting documents is not a best practice. It sounds like you have time-series data. If so, what you want are time-based indices, or better yet, rollover indices.

5 GB is also a fairly small amount to purge, considering that a single Elasticsearch shard can healthily grow to 20-50 GB in size. Are you storage constrained? How many nodes do you have?
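If you're not sure, the _cat APIs will show you per-shard and per-node disk usage, e.g.:

# Store size of every shard, and which node holds it
curl "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,store,node"

# Disk used/available and shard count per node
curl "http://localhost:9200/_cat/allocation?v"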
