Error converting a pandas DataFrame to ORC using pyarrow



I am trying to save a Pandas DataFrame as an .orc file using Pyarrow. The package versions are pandas==1.3.5 and pyarrow==6.0.1. My Python 3 version is 3.9.12.

Here is the code snippet:
import pandas as pd
import pyarrow as pa
import pyarrow.orc as orc
df = pd.read_orc('sample.orc')
table = pa.Table.from_pandas(df, preserve_index=False)
orc.write_table(table, 'sample_rewritten.orc')

The error I get is: ArrowNotImplementedError: Unknown or unsupported Arrow type: null

How do I save a Pandas DataFrame (csv) as an .orc file in Python?

The write_table line fails. Here is the full stack trace:

ArrowNotImplementedError                  Traceback (most recent call last)
Input In [1], in <cell line: 7>()
5 df = pd.read_orc('hats_v2_sample.orc')
6 table = pa.Table.from_pandas(df, preserve_index=False)
----> 7 orc.write_table(table, 'sample_rewritten.orc')
File /opt/homebrew/lib/python3.9/site-packages/pyarrow/orc.py:176, in write_table(table, where)
174     table, where = where, table
175 writer = ORCWriter(where)
--> 176 writer.write(table)
177 writer.close()
File /opt/homebrew/lib/python3.9/site-packages/pyarrow/orc.py:146, in ORCWriter.write(self, table)
136 def write(self, table):
137     """
138     Write the table into an ORC file. The schema of the table must
139     be equal to the schema used when opening the ORC file.
(...)
144         The table to be written into the ORC file
145     """
--> 146     self.writer.write(table)
File /opt/homebrew/lib/python3.9/site-packages/pyarrow/_orc.pyx:159, in pyarrow._orc.ORCWriter.write()
File /opt/homebrew/lib/python3.9/site-packages/pyarrow/error.pxi:120, in pyarrow.lib.check_status()
ArrowNotImplementedError: Unknown or unsupported Arrow type: null

This problem occurs when exporting a DataFrame that contains null values. You can call df.fillna(value=0, inplace=True) and then export the DataFrame to an ORC file.

Not sure whether you are able to wait, but Pandas v1.5.0 natively supports writing ORC files.

DataFrame.to_orc()

https://pandas.pydata.org/pandas-docs/version/1.5/whatsnew/v1.5.0.html#writing-to-orc-files
https://github.com/pandas-dev/pandas/pull/44554
