pyarrow timestamp data type error on a Parquet file

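When I read a Parquet file with pyarrow and convert it to pandas to count the records, I get the error below. I don't want pyarrow to convert to timestamp[ns]; it can stay in timestamp[us]. Is there an option to keep the timestamps as they are? I am using pyarrow 11.0.0 and Python 3.10. Please advise.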

Code:

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.compute as pc
import pandas as pd
# Read the Parquet file into a PyArrow Table
table = pq.read_table('/Users/abc/Downloads/LOAD.parquet').to_pandas()
print(len(table))

Error:

pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 101999952000000000

I don't want pyarrow to convert to timestamp[ns]; it can stay in timestamp[us]. Is there an option to keep the timestamps unchanged?

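At the moment, pandas only supports nanosecond timestamps, so to_pandas() has to cast timestamp[us] to timestamp[ns], and a date as far in the future as yours does not fit in that range.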
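To see why the cast fails, here is a small standard-library sketch (no pyarrow or pandas needed). It assumes timestamps are stored as signed 64-bit integer counts since 1970-01-01, and takes the value from the error message as microseconds; the variable names are just for illustration.

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1                # largest value a signed 64-bit integer can hold
EPOCH = datetime(1970, 1, 1)

value_us = 101_999_952_000_000_000   # value from the error message, in microseconds
value_ns = value_us * 1_000          # the same instant expressed in nanoseconds

fits_in_us = value_us <= INT64_MAX   # fine as timestamp[us]
fits_in_ns = value_ns <= INT64_MAX   # too large for timestamp[ns]

# Latest instant that int64 nanoseconds since the epoch can represent
max_ns_datetime = EPOCH + timedelta(microseconds=INT64_MAX // 1_000)
# The instant actually stored in the file
actual_datetime = EPOCH + timedelta(microseconds=value_us)

print(fits_in_us, fits_in_ns)
print(max_ns_datetime)
print(actual_datetime)
```

The nanosecond range ends in the year 2262, while the stored value decodes to a date in the year 5202, which is why the safe cast refuses it.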

If you insist on keeping the us precision, you have a few options:

Don't use pandas; stick with pyarrow, which supports microseconds:

table = pq.read_table("data.parquet")
len(table)

Use datetime.datetime instead of pd.Timestamp in the DataFrame (very slow):

table = pq.read_table("data.parquet")
df = table.to_pandas(timestamp_as_object=True)

Ignore the loss of precision for out-of-bounds timestamps:

table = pq.read_table("data.parquet")
df = table.to_pandas(safe=False)

But then the original timestamp, 5202-04-02, silently becomes 1694-12-04.
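That bogus date can be reproduced by hand. The sketch below assumes the unsafe cast behaves like a wrapping 64-bit multiplication by 1000, which is consistent with the 1694-12-04 result reported above; wrap_int64 is a hypothetical helper written for this illustration, not a pyarrow function.

```python
from datetime import datetime, timedelta

def wrap_int64(n: int) -> int:
    """Reduce an arbitrary Python int to the signed 64-bit range, wrapping on overflow."""
    return (n + 2**63) % 2**64 - 2**63

EPOCH = datetime(1970, 1, 1)
value_us = 101_999_952_000_000_000          # 5202-04-02 in microseconds since the epoch

# us -> ns multiplies by 1000; the product overflows int64 and wraps to a negative number
wrapped_ns = wrap_int64(value_us * 1_000)

# Decode the wrapped nanosecond count back into a calendar date
wrapped_datetime = EPOCH + timedelta(microseconds=wrapped_ns // 1_000)

print(wrapped_ns)
print(wrapped_datetime.date())
```

So safe=False does not clamp or drop the value; it produces an unrelated date, which is easy to miss in a large table.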

If you are feeling brave, use pandas 2.0 with pyarrow as the pandas backend:

pip install pandas==2.0.0rc1

pd.read_parquet("data.parquet", dtype_backend="pyarrow")

Fix the data using pyarrow:

5202-04-02 is almost certainly a typo. See this question.
