我们在Python (Luigi)中有ETL作业。它们都连接到Hive Metastore获取分区信息。
代码:
from hive_metastore import ThriftHiveMetastore
client = ThriftHiveMetastore.Client(protocol)
partitions = client.get_partition_names('sales', 'salesdetail', -1)
-1是max_parts(返回的最大分区)
随机超时如下:
File "/opt/conda/envs/etl/lib/python2.7/site-packages/luigi/contrib/hive.py", line 210, in _existing_partitions
partition_strings = client.get_partition_names(database, table, -1)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1703, in get_partition_names
return self.recv_get_partition_names()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/hive_metastore/ThriftHiveMetastore.py", line 1716, in recv_get_partition_names
(fname, mtype, rseqid) = self._iprot.readMessageBegin()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 126, in readMessageBegin
sz = self.readI32()
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/protocol/TBinaryProtocol.py", line 206, in readI32
buff = self.trans.readAll(4)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 58, in readAll
chunk = self.read(sz - have)
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TTransport.py", line 159, in read
self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
File "/opt/conda/envs/etl/lib/python2.7/site-packages/thrift/transport/TSocket.py", line 105, in read
buff = self.handle.recv(sz)
timeout: timed out
这个错误偶尔发生。
Hive Metastore超时15分钟。
当我单独运行get_partition_names时,它会在几秒钟内返回数据。
即使我设置了socket。超时至1或2秒,查询完成。
Hive metastore日志中没有socket close connection的记录cat /var/log/hive/..log.out
通常超时的表有大量的分区~10K+。但如前所述,它们只是随机暂停。当单独测试那部分代码时,它们会很快返回分区元数据。
知道为什么它会随机超时,或者如何在metastore日志中捕获这些超时错误,或者如何修复它们吗?
问题是LUIGI中的线程重叠
我们使用Singleton来实现穷人的连接池。但是Luigi的不同工作线程相互踩在一起,当一个线程的get_partition_names与另一个线程的冲突时,会导致奇怪的行为。
我们通过确保每个线程的连接对象在连接池中获得自己的"键"来修复这个问题(而不是所有线程共享进程id键)