How to loop over an HBase table in chunks in Python



I am currently writing a Python script that uses HappyBase to export an HBase table to CSV. The problem I am running into is that if the table is too large, I get the following error after reading roughly 2 million rows:

Hbase_thrift.IOError: IOError(message='org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x8dfa2f2 closed
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1182)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:212)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314)
    at org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:432)
    at org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:358)
    at org.apache.hadoop.hbase.client.AbstractClientScanner.next(AbstractClientScanner.java:70)
    at org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.scannerGetList(ThriftServerRunner.java:1423)
    at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.hbase.thrift.HbaseHandlerMetricsProxy.invoke(HbaseHandlerMetricsProxy.java:67)
    at com.sun.proxy.$Proxy10.scannerGetList(Unknown Source)
    at org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$scannerGetList.getResult(Hbase.java:4789)
    at org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$scannerGetList.getResult(Hbase.java:4773)
    at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
    at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
    at org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)
    at org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:64)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
')

I thought about splitting the for loop into chunks (i.e. open the HBase connection -> fetch the first 100,000 rows -> close the connection -> reopen it -> fetch the next 100,000 rows -> close it again... and so on), but I cannot figure out how to do it. Here is a sample of my code, which reads all the rows and then crashes:

import happybase
connection = happybase.Connection('localhost')   # connect to the local HBase Thrift server
table = 'some_table'
table_object = connection.table(table)
for row in table_object.scan():                  # full table scan, one long-lived scanner
    print(row)
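
For illustration, this is roughly what I have in mind, using the row_start and limit parameters of scan() (a rough, untested sketch; the scan_in_chunks helper, the 100,000 chunk_size and the trailing b'\x00' trick for resuming after the last key are my own, not part of HappyBase):

import happybase

def scan_in_chunks(host, table_name, chunk_size=100000):
    # Scan an HBase table in chunks, opening a fresh Thrift connection for
    # every chunk so no single scanner lives long enough to be dropped.
    row_start = None                       # None means "start at the first row"
    while True:
        connection = happybase.Connection(host)
        table = connection.table(table_name)
        rows = list(table.scan(row_start=row_start, limit=chunk_size))
        connection.close()

        if not rows:
            break                          # nothing left to read

        for key, data in rows:
            yield key, data

        # row_start is inclusive, so resume just after the last key we saw
        # by appending a zero byte (the smallest possible key suffix).
        row_start = rows[-1][0] + b'\x00'

for key, data in scan_in_chunks('localhost', 'some_table'):
    print(key, data)

Buffering a whole chunk in memory is of course a trade-off, so 100,000 may need to be lowered depending on row size, but I am not sure this is the right approach.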

Any help would be appreciated (even if you suggest a different solution :)

Thanks

Actually, I found a way to do it, which is the following:

import happybase
connection = happybase.Connection('localhost')
table = 'some_table'
table_object = connection.table(table)
while True:
  try:
    for row in table_object.scan():
      print(row)
    break                          # the scan finished without an error
  except Exception as e:
    if "org.apache.hadoop.hbase.DoNotRetryIOException" in str(e):
      connection.open()            # the Thrift connection was dropped: reopen it and retry
    else:
      print(e)
      quit()
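
This works, but every time the connection is dropped the scan restarts from the first row. A variant I am considering (again an untested sketch; the last_key bookkeeping and the b'\x00' suffix are my own additions, not a HappyBase feature) remembers the last key that was read and resumes from there:

import happybase

connection = happybase.Connection('localhost')
table_object = connection.table('some_table')

last_key = None                    # last row key that was successfully read
while True:
  try:
    # row_start is inclusive, so resume just after the last key we saw;
    # None scans from the very beginning.
    row_start = last_key + b'\x00' if last_key is not None else None
    for key, data in table_object.scan(row_start=row_start):
      print(key, data)
      last_key = key
    break                          # the scan finished without an error
  except Exception as e:
    if "org.apache.hadoop.hbase.DoNotRetryIOException" in str(e):
      connection.open()            # reopen the dropped Thrift connection and resume
    else:
      print(e)
      quit()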

Latest update