How to handle InvalidChunk exceptions during a Bigtable scan

I am trying to scan some rows of 'dirty' data, but depending on the scan I get (serialization?) InvalidChunk exceptions. The code is as follows:

from google.cloud import bigtable
from google.cloud import happybase
client = bigtable.Client(project=project_id, admin=True)
instance = client.instance(instance_id)
connection = happybase.Connection(instance=instance)
table = connection.table(table_name)
for key, row in table.scan(limit=5000):  #BOOM!
    pass

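Omitting some columns, limiting the scan to fewer rows, or specifying start and stop keys allows the scan to succeed. I cannot tell from the stack trace which values cause the problem (it varies across columns); the scan simply fails. This makes cleaning the source data problematic.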

When I step through with the Python debugger, I see that the chunk (of type google.bigtable.v2.bigtable_pb2.CellChunk) has no value (it is empty):

ipdb> pp chunk.value
b''
ipdb> chunk.value_size
0

I can confirm this from the HBase shell using the row key (which I took from self._row.row_key).

So the question becomes: how can a Bigtable scan filter out columns that have an undefined/empty/null value?

I get the same problem from both Google Cloud APIs that return generators which internally stream the data as gRPC chunks:

google.cloud.
google.cloud.

The abbreviated stack trace is as follows:

---------------------------------------------------------------------------
InvalidChunk                              Traceback (most recent call last)
<ipython-input-48-922c8127f43b> in <module>()
      1 row_gen = table.scan(limit=n) 
      2 rows = []
----> 3 for kvp in row_gen:
      4     pass
.../site-packages/google/cloud/happybase/table.py in scan(self, row_start, row_stop, row_prefix, columns, timestamp, include_timestamp, limit, **kwargs)
    391         while True:
    392             try:
--> 393                 partial_rows_data.consume_next()
    394                 for row_key in sorted(rows_dict):
    395                     curr_row_data = rows_dict.pop(row_key)
.../site-packages/google/cloud/bigtable/row_data.py in consume_next(self)
    273         for chunk in response.chunks:
    274 
--> 275             self._validate_chunk(chunk)
    276 
    277             if chunk.reset_row:
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk(self, chunk)
    388             self._validate_chunk_new_row(chunk)
    389         if self.state == self.ROW_IN_PROGRESS:
--> 390             self._validate_chunk_row_in_progress(chunk)
    391         if self.state == self.CELL_IN_PROGRESS:
    392             self._validate_chunk_cell_in_progress(chunk)
.../site-packages/google/cloud/bigtable/row_data.py in _validate_chunk_row_in_progress(self, chunk)
    368         self._validate_chunk_status(chunk)
    369         if not chunk.HasField('commit_row') and not chunk.reset_row:
--> 370             _raise_if(not chunk.timestamp_micros or not chunk.value)
    371         _raise_if(chunk.row_key and
    372                   chunk.row_key != self._row.row_key)
.../site-packages/google/cloud/bigtable/row_data.py in _raise_if(predicate, *args)
    439     """Helper for validation methods."""
    440     if predicate:
--> 441         raise InvalidChunk(*args)
InvalidChunk:
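
Reading line 370 of the trace above, the check appears to reject any in-progress chunk whose timestamp_micros or value is falsy, which would include an empty value such as b''. Here is a minimal sketch of just that predicate. FakeChunk is a hypothetical stand-in for the real CellChunk protobuf, and the surrounding commit_row/reset_row condition is not modelled:

```python
from collections import namedtuple

# Hypothetical stand-in for the CellChunk protobuf; only the two fields
# used by the predicate on line 370 of the trace are modelled.
FakeChunk = namedtuple("FakeChunk", ["timestamp_micros", "value"])


def fails_row_in_progress_check(chunk):
    """Mirror the predicate from the trace: a falsy timestamp or value fails."""
    return not chunk.timestamp_micros or not chunk.value


empty_value = FakeChunk(timestamp_micros=1500000000000000, value=b"")
normal = FakeChunk(timestamp_micros=1500000000000000, value=b"abc")

print(fails_row_in_progress_check(empty_value))  # True
print(fails_row_in_progress_check(normal))       # False
```

If this reading is right, a legitimately empty cell value is enough to trigger InvalidChunk, which matches the chunk.value and chunk.value_size output from the debugger.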

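Can you show me how to scan Bigtable from Python while ignoring/logging the dirty rows that raise InvalidChunk? (A try ... except does not work on the generator, which lives in the Google Cloud API's row_data PartialRowsData class.)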
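
One possible workaround, sketched here rather than tested against a real instance: once a generator has raised, it cannot be resumed, so the scan is split into many small key ranges and each range is wrapped in its own try/except. The helper name scan_windows and the boundaries list are assumptions for illustration. The row_start and row_stop keywords are the ones visible in the scan() signature in the trace.

```python
import logging

logger = logging.getLogger(__name__)


def scan_windows(table, boundaries, **scan_kwargs):
    """Yield (key, row) pairs window by window, logging windows that fail.

    `boundaries` is a sorted list of row keys chosen by the caller (an
    assumption: the key space must be known well enough to pick them).
    Each adjacent pair is scanned as its own [row_start, row_stop) range,
    so an exception only loses the remainder of that one window.
    """
    for row_start, row_stop in zip(boundaries, boundaries[1:]):
        try:
            for key, row in table.scan(row_start=row_start,
                                       row_stop=row_stop, **scan_kwargs):
                yield key, row
        except Exception as exc:  # e.g. InvalidChunk raised from row_data
            logger.warning("scan failed in [%r, %r): %r",
                           row_start, row_stop, exc)
```

Narrower windows lose fewer good rows per failure but cost more round trips. The logged ranges can then be re-scanned with finer boundaries to home in on the offending rows.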

Also, can you show me code to chunk-stream a table scan? The HappyBase batch_size & scan_batching options do not appear to be supported.
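
In the absence of batch_size support, one way to approximate chunked streaming is to paginate by hand with the row_start and limit arguments that appear in the scan() signature in the trace. This is a minimal sketch, assuming row keys come back as bytes. The helper name scan_in_batches is hypothetical.

```python
def scan_in_batches(table, batch_size=1000, row_start=None, **scan_kwargs):
    """Yield lists of (key, row) pairs, at most `batch_size` rows each.

    Each batch is a separate, short scan. The next scan starts at the last
    key seen plus a zero byte, which is the smallest key that sorts
    strictly after it, so no row is returned twice.
    """
    while True:
        batch = list(table.scan(row_start=row_start, limit=batch_size,
                                **scan_kwargs))
        if not batch:
            return
        yield batch
        if len(batch) < batch_size:
            return  # a short batch means the table is exhausted
        last_key = batch[-1][0]
        row_start = last_key + b"\x00"
```

Because every batch is an independent scan, a failure in one batch can be caught and logged by the caller without discarding the batches already received.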

This may be caused by this bug: https://github.com/googleapis/google-cloud-python/issues/2980

That bug has since been fixed, so this should no longer be a problem.
