Katta索引部署问题



我看到在3节点Katta集群上部署失败。当我通过命令行触发katta部署时,它尝试部署索引3次,但失败了,并显示消息"未能在374635毫秒内部署索引'katta_index_1'"。

部署相关日志如下:

13/12/09 03:48:19 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=master.domain.com,node1.domain.com,node2.domain.com sessionTimeout=30000 watcher=org.I0Itec.zkclient.ZkClient@5c74c3aa
13/12/09 03:48:19 INFO zookeeper.ClientCnxn: Opening socket connection to server node1.domain.com/<node1-ip>:2181
13/12/09 03:48:19 INFO zookeeper.ClientCnxn: Socket connection established to node1.domain.com/<node1-ip>:2181, initiating session
13/12/09 03:48:19 INFO zookeeper.ClientCnxn: Session establishment complete on server node1.domain.com/<node1-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:48:19 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
...................................13/12/09 03:49:30 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 20000ms for sessionid 0x242d67304660002, closing socket connection and attempting reconnect
13/12/09 03:49:30 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:49:31 INFO zookeeper.ClientCnxn: Opening socket connection to server node2.domain.com/<node2-ip>:2181
13/12/09 03:49:31 INFO zookeeper.ClientCnxn: Socket connection established to node2.domain.com/<node2-ip>:2181, initiating session
13/12/09 03:49:31 INFO zookeeper.ClientCnxn: Session establishment complete on server node2.domain.com/<node2-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:49:31 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
13/12/09 03:49:31 WARN client.IndexDeployFuture: Reconnecting IndexDeployFuture
....................13/12/09 03:50:10 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 20000ms for sessionid 0x242d67304660002, closing socket connection and attempting reconnect
13/12/09 03:50:10 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:50:10 INFO zookeeper.ClientCnxn: Opening socket connection to server master.domain.com/<master-ip>:2181
13/12/09 03:50:10 INFO zookeeper.ClientCnxn: Socket connection established to master.domain.com/<master-ip>:2181, initiating session
13/12/09 03:50:10 INFO zookeeper.ClientCnxn: Session establishment complete on server master.domain.com/<master-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:50:10 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
13/12/09 03:50:10 WARN client.IndexDeployFuture: Reconnecting IndexDeployFuture
13/12/09 03:50:30 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 20000ms for sessionid 0x242d67304660002, closing socket connection and attempting reconnect
13/12/09 03:50:30 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:50:31 INFO zookeeper.ClientCnxn: Opening socket connection to server node1.domain.com/<node1-ip>:2181
13/12/09 03:50:31 INFO zookeeper.ClientCnxn: Socket connection established to node1.domain.com/<node1-ip>:2181, initiating session
13/12/09 03:50:31 INFO zookeeper.ClientCnxn: Session establishment complete on server node1.domain.com/<node1-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:50:31 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
13/12/09 03:50:51 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 20000ms for sessionid 0x242d67304660002, closing socket connection and attempting reconnect
13/12/09 03:50:51 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:50:51 INFO zookeeper.ClientCnxn: Opening socket connection to server node2.domain.com/<node2-ip>:2181
13/12/09 03:50:51 INFO zookeeper.ClientCnxn: Socket connection established to node2.domain.com/<node2-ip>:2181, initiating session
13/12/09 03:50:51 INFO zookeeper.ClientCnxn: Session establishment complete on server node2.domain.com/<node2-ip>:2181, sessionid = 0x242d67304660002, negotiated timeout = 30000
13/12/09 03:50:51 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
.13/12/09 03:51:09 WARN client.IndexDeployFuture: Reconnecting IndexDeployFuture
13/12/09 03:51:09 WARN client.IndexDeployFuture: Reconnecting IndexDeployFuture
...........................................................................................................................................................................................................
failed to deploy index 'katta_index_1' in 374635 ms
13/12/09 03:54:34 WARN protocol.InteractionProtocol: following components still connected:[net.sf.katta.client.IndexDeployFuture@604e280c]
13/12/09 03:54:34 INFO zkclient.ZkEventThread: Terminate ZkClient event thread.
13/12/09 03:54:34 INFO zookeeper.ClientCnxn: EventThread shut down
13/12/09 03:54:34 INFO zookeeper.ZooKeeper: Session: 0x242d67304660002 closed

startMaster日志中没有太多与此问题相关的有用信息,如下所示:

13/12/09 03:51:13 INFO master.Master: master.domain.com_cb1ea8a3-6b5e-463e-8ef5-ae9a01e1fd61 became master with 0 waiting master operations
13/12/09 03:51:13 INFO master.Master: start managing nodes...
13/12/09 03:51:13 INFO master.Master: found following nodes connected: [master.domain.com:20000, node2.domain.com:20000, node1.domain.com:20000]
13/12/09 03:51:13 INFO master.OperatorThread: starting...
13/12/09 03:51:13 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:52:03 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:52:53 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:53:43 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:54:33 INFO master.OperatorThread: SAFE MODE: leaving safe mode with 3 connected nodes
13/12/09 03:54:33 INFO master.OperationRegistry: watch operation 'IndexDeployOperation:65d9e449:katta_index_1' for node operations [master.domain.com:20000-operation-0000000003, node1.domain.com:20000-operation-0000000003, node2.domain.com:20000-operation-0000000003]
13/12/09 03:54:33 INFO protocol.InteractionProtocol: unregistering component net.sf.katta.master.OperationWatchdog@441aa8dc: {net.sf.katta.master.Master@626f50a8=[AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live]}
13/12/09 03:54:33 INFO master.AbstractIndexOperation: deployment of index katta_index_1 complete
13/12/09 03:54:34 INFO master.OperationWatchdog: watch for IndexDeployOperation:65d9e449:katta_index_1 finished
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'CheckIndicesOperation:58ca40d2'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:783614ab:master.domain.com:20000'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:48274f3c:node2.domain.com:20000'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:1550a7d:node1.domain.com:20000'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:4419391c:master.domain.com:20000'
13/12/09 03:54:34 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:519a9fe8:master.domain.com:20000'
13/12/09 03:54:35 INFO master.OperatorThread: executing operation 'BalanceIndexOperation:9f4152a:katta_index_1'
13/12/09 03:54:35 INFO master.AbstractIndexOperation: balancing shards for index 'katta_index_1'
13/12/09 03:54:35 INFO master.OperationRegistry: watch operation 'BalanceIndexOperation:9f4152a:katta_index_1' for node operations [master.domain.com:20000-operation-0000000000, node1.domain.com:20000-operation-0000000000, node2.domain.com:20000-operation-0000000000]
13/12/09 03:54:50 INFO protocol.InteractionProtocol: unregistering component net.sf.katta.master.OperationWatchdog@22f37b06: {net.sf.katta.master.Master@626f50a8=[AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live]}
13/12/09 03:54:50 INFO master.AbstractIndexOperation: balancing of index katta_index_1 complete
13/12/09 03:54:50 INFO master.OperationWatchdog: watch for BalanceIndexOperation:9f4152a:katta_index_1 finished
13/12/09 03:55:24 INFO zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6666ms for sessionid 0x342d68098830007, closing socket connection and attempting reconnect
13/12/09 03:56:13 INFO zkclient.ZkClient: zookeeper state changed (Disconnected)
13/12/09 03:56:13 INFO zookeeper.ClientCnxn: Opening socket connection to server node1.domain.com/<node1-ip>:2181
13/12/09 03:56:15 INFO protocol.InteractionProtocol: unregistering component net.sf.katta.master.OperationWatchdog@22f37b06: {net.sf.katta.master.Master@626f50a8=[AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live, AddRemoveListenerAdapter:/katta/nodes/live]}
13/12/09 03:56:15 INFO master.OperatorThread: operator thread stopped
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Socket connection established to node1.domain.com/<node1-ip>:2181, initiating session
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x342d68098830007 has expired, closing socket connection
13/12/09 03:56:15 INFO zkclient.ZkClient: zookeeper state changed (Expired)
13/12/09 03:56:15 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=master.domain.com,node1.domain.com,node2.domain.com sessionTimeout=10000 watcher=org.I0Itec.zkclient.ZkClient@3a860d49
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: EventThread shut down
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Opening socket connection to server node1.domain.com/<node1-ip>:2181
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Socket connection established to node1.domain.com/<node1-ip>:2181, initiating session
13/12/09 03:56:15 INFO zookeeper.ClientCnxn: Session establishment complete on server node1.domain.com/<node1-ip>:2181, sessionid = 0x242d67304660006, negotiated timeout = 10000
13/12/09 03:56:15 INFO zkclient.ZkClient: zookeeper state changed (SyncConnected)
13/12/09 03:56:15 INFO protocol.InteractionProtocol: master.domain.com_cb1ea8a3-6b5e-463e-8ef5-ae9a01e1fd61 started as master
13/12/09 03:56:15 INFO protocol.InteractionProtocol: master 'master.domain.com_cb1ea8a3-6b5e-463e-8ef5-ae9a01e1fd61' published
13/12/09 03:56:15 INFO upgrade.UpgradeRegistry: version of distribution 0.7.dev
13/12/09 03:56:15 INFO upgrade.UpgradeRegistry: version of cluster 0.7.dev
13/12/09 03:56:15 INFO master.Master: master.domain.com_cb1ea8a3-6b5e-463e-8ef5-ae9a01e1fd61 became master with 0 waiting master operations
13/12/09 03:56:15 INFO master.Master: start managing nodes...
13/12/09 03:56:15 INFO master.Master: found following nodes connected: [master.domain.com:20000, node1.domain.com:20000]
13/12/09 03:56:15 INFO master.OperatorThread: starting...
13/12/09 03:56:15 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:57:05 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:57:55 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:58:45 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 03:59:35 INFO master.OperatorThread: SAFE MODE: No nodes available or state unstable within the last 200000 ms.
13/12/09 04:00:25 INFO master.OperatorThread: SAFE MODE: leaving safe mode with 3 connected nodes
13/12/09 04:00:25 INFO master.OperatorThread: executing operation 'CheckIndicesOperation:6b114461'
13/12/09 04:00:25 INFO master.OperatorThread: executing operation 'RemoveObsoleteShardsOperation:7a8b0be7:master.domain.com:20000'

我猜这是一些与超时有关的问题。有人能帮我解决这个问题吗?提前谢谢。

此问题特定于我们的Katta集群。我发现这些节点上的磁盘I/O速度非常低。Katta主机正在等待节点在"协商时间"内回复部署状态,但在等待时间后没有收到回复,Katta失败了。

由于磁盘写入速度较低,Katta节点正忙于响应Zookeeper集群的查询,从而导致主机重新启动/进入安全模式"安全模式:在过去200000毫秒内没有可用的节点或状态不稳定。"

我们通过用不同的装载替换磁盘卷来解决磁盘写入问题,Katta部署问题也得到了解决。

最新更新