Reading a parquet file and converting it to pandas with pyarrow



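I want to read a parquet file and convert it to pandas so that I can visualize the fields. I am new to the parquet format, and I get an error when converting to pandas. My code is as follows: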

import pyarrow as pa
import pyarrow.parquet as pq
parquet_file = pq.read_table('/biomedja01/disk1/software/GTEX/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_exon_reads.parquet')
parquet_file.to_pandas()

Here is some of the file's metadata:

metadata = pq.read_metadata('/biomedja01/disk1/software/GTEX/GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9$
print(metadata)
print(metadata.row_group(0))
print(metadata.row_group(0).column(0))

<pyarrow._parquet.FileMetaData object at 0x7f92fb146ef0>
created_by: parquet-cpp version 1.5.1-SNAPSHOT
num_columns: 17384
num_rows: 328671
num_row_groups: 1
format_version: 1.0
serialized_size: 4883225
<pyarrow._parquet.RowGroupMetaData object at 0x7f92fb100be0>
num_columns: 17384
num_rows: 328671
total_byte_size: 11453379595
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f931abfa150>
file_offset: 600791
file_path:
physical_type: BYTE_ARRAY
num_values: 328671
path_in_schema: Description
is_stats_set: True
statistics:
<pyarrow._parquet.RowGroupStatistics object at 0x7f931abfad80>
has_min_max: True
min: b'5S_rRNA'
max: b'yR211F11.2'
null_count: 0
distinct_count: 0
num_values: 328671
physical_type: BYTE_ARRAY
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 389078
total_compressed_size: 600787
total_uncompressed_size: 1028503

The error I get when calling parquet_file.to_pandas() is as follows:

Traceback (most recent call last):
File "file.py", line 4, in <module>
parquet_file.to_pandas()
File "pyarrow/table.pxi", line 1410, in pyarrow.lib.Table.to_pandas
File "/home/lingxu/.conda/envs/GTEX/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 618, in table_to_blockmanager
columns = _reconstruct_columns_from_metadata(columns, column_indexes)
File "/home/lingxu/.conda/envs/GTEX/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 735, in _reconstruct_columns_from_metadata
return pd.MultiIndex(levels=new_levels, labels=labels, names=columns.names)
TypeError: __new__() got an unexpected keyword argument 'labels'

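It looks like you have an incompatible version of pandas installed. The traceback shows pyarrow calling pd.MultiIndex with the labels keyword, which pandas renamed to codes in 0.24 and removed in 1.0. You could try installing an older pandas; 0.25.3 looks like it should work.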
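
To check whether this is what is happening in a given environment, here is a minimal sketch that classifies a pandas version string. The helper names (major_minor, labels_keyword_removed, diagnose) are made up for illustration and are not part of pandas or pyarrow. The hint about upgrading pyarrow is my own assumption and has not been tested against this file.

```python
import re


def major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '0.25.3' or '1.0.0rc1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def labels_keyword_removed(pandas_version: str) -> bool:
    """True if this pandas release no longer accepts MultiIndex(labels=...)."""
    # pandas 0.24 renamed `labels` to `codes`; 1.0 dropped the old keyword.
    return major_minor(pandas_version) >= (1, 0)


def diagnose(pandas_version: str) -> str:
    """Return a one-line hint for the TypeError shown in the traceback."""
    if labels_keyword_removed(pandas_version):
        # Assumption: a newer pyarrow passes `codes` instead of `labels`.
        return (f"pandas {pandas_version} rejects the 'labels' keyword; "
                "try pandas 0.25.3 or a newer pyarrow")
    return f"pandas {pandas_version} still accepts the 'labels' keyword"


# Typical use: import pandas as pd; print(diagnose(pd.__version__))
print(diagnose("1.0.3"))
print(diagnose("0.25.3"))
```

If the first message applies to your environment, pinning pandas to 0.25.3 as suggested should avoid the TypeError.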
