Elasticsearch node failure

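My Elasticsearch cluster dropped from 2B documents to 900M records. On AWS it shows

Relocating shards: 4

while showing

Active shards: 35

and

Active primary shards: 34

(Possibly unrelated, but here are the rest of the stats):

Number of nodes: 9

Number of data nodes: 6

Unassigned shards: 17

When I run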

GET /_cluster/allocation/explain

it returns:

{
  "index": "datauwu",
  "shard": 6,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "NODE_LEFT",
    "at": "2019-10-31T17:02:11.258Z",
    "details": "node_left[removedforsecuritybecimparanoid1]",
    "last_allocation_status": "no_valid_shard_copy"
  },
  "can_allocate": "no_valid_shard_copy",
  "allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
  "node_allocation_decisions": [
    {
      "node_id": "removedforsecuritybecimparanoid2",
      "node_name": "removedforsecuritybecimparanoid2",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid3",
      "node_name": "removedforsecuritybecimparanoid3",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid4",
      "node_name": "removedforsecuritybecimparanoid4",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid5",
      "node_name": "removedforsecuritybecimparanoid5",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid6",
      "node_name": "removedforsecuritybecimparanoid6",
      "node_decision": "no",
      "store": {
        "found": false
      }
    },
    {
      "node_id": "removedforsecuritybecimparanoid7",
      "node_name": "removedforsecuritybecimparanoid7",
      "node_decision": "no",
      "store": {
        "found": false
      }
    }
  ]
}

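I'm a bit confused about what this actually means. Does it mean my Elasticsearch cluster has not lost data and is just relocating it to different shards, or that it cannot find the shards?

If it cannot find the shards, does that mean my data is lost? If so, what caused it, and how can I prevent it from happening in the future?

I have not set up replicas because I am indexing data, and replicas slow down indexing.

Also, my record count dropped to 400M at one point but then randomly went back up to 900M. I don't know what this means; any insight would be appreciated.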

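"reason": "NODE_LEFT",

and:

I have not set up replicas because I am indexing data, and replicas slow down indexing.

If the node that held the primary shard is gone, then yes, your data is gone too. After all, with no replicas, if the primary (and only) copy of a shard is no longer part of the cluster, where would the cluster retrieve the data from? Either you bring the node that held those shards back and rejoin it to the cluster, or the data is gone.

The error message is saying: "You want me to allocate a primary shard for this index that I know exists, but there used to be another copy of that primary that I can no longer find. I am not going to allocate it again, in case the previous primary comes back."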
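The same reading of the explain output can be done in a few lines of code. Below is a minimal Python sketch, not anything Elasticsearch ships: the function name summarize_allocation_explain is made up for illustration, and the sample dict is trimmed from the response in the question. It assumes that a node with no on-disk copy reports "store": {"found": false}, as in the output above, and it treats any other store entry as "a copy exists on this node".

```python
def summarize_allocation_explain(explain):
    """Condense a GET /_cluster/allocation/explain response for one shard."""
    nodes_with_copy = []
    for decision in explain.get("node_allocation_decisions", []):
        store = decision.get("store")
        # Assumption: nodes without a copy report store.found == false;
        # any other store entry is treated as "a copy exists on this node".
        if store is not None and store.get("found", True):
            nodes_with_copy.append(decision["node_name"])
    return {
        "index": explain.get("index"),
        "shard": explain.get("shard"),
        "primary": explain.get("primary"),
        "reason": explain.get("unassigned_info", {}).get("reason"),
        "nodes_with_copy": nodes_with_copy,
        "data_recoverable_now": bool(nodes_with_copy),
    }


# Trimmed from the response in the question (two of the six node entries).
sample = {
    "index": "datauwu",
    "shard": 6,
    "primary": True,
    "current_state": "unassigned",
    "unassigned_info": {"reason": "NODE_LEFT"},
    "can_allocate": "no_valid_shard_copy",
    "node_allocation_decisions": [
        {"node_name": "removedforsecuritybecimparanoid2",
         "node_decision": "no", "store": {"found": False}},
        {"node_name": "removedforsecuritybecimparanoid3",
         "node_decision": "no", "store": {"found": False}},
    ],
}

summary = summarize_allocation_explain(sample)
print(summary["reason"], summary["data_recoverable_now"])
```

For the response in the question this reports NODE_LEFT with no node holding a copy, which is exactly the "data is gone unless the node comes back" case.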

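You can force Elasticsearch to reallocate the primary shard (and explicitly accept that the data in the previous primary is gone) by executing a reroute with allocate_stale_primary (doc):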

curl -H 'Content-Type: application/json' \
  -XPOST '127.0.0.1:9200/_cluster/reroute?pretty' -d '{
  "commands": [
    {
      "allocate_stale_primary": {
        "index": "datauwu",
        "shard": 6,
        "node": "target-data-node-id",
        "accept_data_loss": true
      }
    }
  ]
}'
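With 17 unassigned shards, typing that request once per shard gets tedious. As a rough sketch, the request body could be generated from the rows returned by GET /_cat/shards?format=json. The helper name stale_primary_commands and the sample rows are hypothetical, and sending every shard to the same node is a simplification: in practice each command has to name a node that actually holds a copy of that shard.

```python
import json


def stale_primary_commands(cat_shards_rows, target_node):
    """Build a reroute body with one allocate_stale_primary per unassigned primary."""
    commands = []
    for row in cat_shards_rows:
        if row["prirep"] == "p" and row["state"] == "UNASSIGNED":
            commands.append({
                "allocate_stale_primary": {
                    "index": row["index"],
                    "shard": int(row["shard"]),  # cat APIs return numbers as strings
                    "node": target_node,
                    "accept_data_loss": True,
                }
            })
    return {"commands": commands}


# Hypothetical rows shaped like GET /_cat/shards?format=json output.
rows = [
    {"index": "datauwu", "shard": "6", "prirep": "p", "state": "UNASSIGNED"},
    {"index": "datauwu", "shard": "7", "prirep": "p", "state": "STARTED"},
    {"index": "datauwu", "shard": "8", "prirep": "p", "state": "UNASSIGNED"},
]

body = stale_primary_commands(rows, "target-data-node-id")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to the same _cluster/reroute endpoint via -d. Only unassigned primaries are included; started shards and replicas are skipped.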

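Outside of development with throwaway data, turning replicas off is generally a bad idea.

Also, my record count dropped to 400M at one point but then randomly went back up to 900M. I don't know what this means; any insight would be appreciated.

That happens because the shards are not visible in the cluster. It can occur when all copies of a shard are being allocated, relocated, or recovered, and it corresponds to a red cluster status. You can mitigate it by making sure you have at least 1 replica (ideally, enough replicas to survive the loss of N data nodes in the cluster). That lets Elasticsearch keep one copy serving as the primary while it moves the others around.

If you have only primaries and no replicas, then while a primary is being recovered or relocated, the data in that shard is not visible in the cluster. Once the shard is active again, the documents in it become visible.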
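To put numbers on "enough replicas to survive the loss of N data nodes": each shard needs N + 1 copies (one primary plus N replicas), and Elasticsearch does not place two copies of the same shard on one node. A minimal sketch follows; the helper names are made up for illustration, and the body is meant for PUT /<index>/_settings, where index.number_of_replicas is the real setting.

```python
import json


def replicas_to_survive(node_losses, data_nodes):
    """Replicas per shard needed to survive `node_losses` simultaneous data-node failures."""
    if node_losses < 0:
        raise ValueError("node_losses must be non-negative")
    # Each shard needs node_losses + 1 copies (1 primary + replicas), and
    # two copies of the same shard are never placed on the same node.
    if node_losses + 1 > data_nodes:
        raise ValueError("not enough data nodes to hold that many copies")
    return node_losses


def replica_settings_body(replicas):
    """Body for PUT /<index>/_settings that changes the replica count."""
    return json.dumps({"index": {"number_of_replicas": replicas}})


# The cluster in the question has 6 data nodes; surviving one node loss needs 1 replica.
replicas = replicas_to_survive(node_losses=1, data_nodes=6)
print(replica_settings_body(replicas))
```

If bulk indexing speed is the concern, one common approach is to load with 0 replicas and then raise number_of_replicas back up with the same settings call as soon as the load finishes, accepting that the data is unprotected in between.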

As Chris Heald noted, when you try to use allocate_stale_primary to recover an unassigned shard whose primary has been lost, you may get:

"root_cause" : [
  {
    "type" : "illegal_argument_exception",
    "reason" : "No data for shard [0] of index [xyz] found on any node"
  }
]

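This means the data is lost unless the missing node rejoins the cluster. Alternatively, you can use the allocate_empty_primary command to allocate an empty shard in its place.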

curl -H 'Content-Type: application/json' \
  -XPOST '127.0.0.1:9200/_cluster/reroute?pretty' -d '{
  "commands": [
    {
      "allocate_empty_primary": {
        "index": "datauwu",
        "shard": 6,
        "node": "target-data-node-id",
        "accept_data_loss": true
      }
    }
  ]
}'

This wipes the shard's data: if the missing node rejoins later, its copy of the shard will be overwritten.
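The choice between the two commands can be read straight off the allocation explain response. Here is a hedged Python sketch of that decision; the function name choose_reroute_command and both sample dicts are hypothetical. It assumes, as in the question's output, that a node without any copy reports "store": {"found": false}, and that a node holding a stale copy reports other store fields (such as in_sync) instead.

```python
def choose_reroute_command(explain):
    """Return (command, node_name) for an unassigned primary, or (None, None)."""
    if not explain.get("primary") or explain.get("current_state") != "unassigned":
        return None, None
    for decision in explain.get("node_allocation_decisions", []):
        store = decision.get("store")
        # Assumption: store.found == false means "no copy on this node".
        if store is not None and store.get("found", True):
            # Some node still has a (possibly stale) copy: prefer it over an empty shard.
            return "allocate_stale_primary", decision["node_name"]
    # No copy anywhere: only an empty primary is possible, and the caller picks the node.
    return "allocate_empty_primary", None


# Hypothetical responses: one with no copies anywhere, one with a stale copy on node-b.
no_copy = {
    "primary": True,
    "current_state": "unassigned",
    "node_allocation_decisions": [
        {"node_name": "node-a", "store": {"found": False}},
    ],
}
stale_copy = {
    "primary": True,
    "current_state": "unassigned",
    "node_allocation_decisions": [
        {"node_name": "node-a", "store": {"found": False}},
        {"node_name": "node-b", "store": {"in_sync": False}},
    ],
}

print(choose_reroute_command(no_copy))
print(choose_reroute_command(stale_copy))
```

Either way the result is only a suggestion. Both commands require accept_data_loss, so it is worth first ruling out that the missing node can be brought back.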
