Load "pivoted" data with pyarrow (or, "stack" or "melt" for a pyarrow.Table)

I have data in a "pivoted" format: the rows and columns are categorical, and the values all share one data type.

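What is the best (memory- and compute-efficient) way to load such a file into a pyarrow.Table with an "unpivoted" schema? In other words, given a CSV file with n rows and m columns, how do I get a pyarrow.Table with n*m rows and one value column?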

In pandas terms, I think I want the pyarrow equivalent of pandas.DataFrame.melt() or .stack().

For example...

Given this CSV file

item,A,B
item_0,0,0
item_1,370,1
item_2,43,0

I want this pyarrow.Table

item    group  value
item_0        A      0
item_0        B      0
item_1        A    370
item_1        B      1
item_2        A     43
item_2        B      0

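Pyarrow's compute functionality is limited, and it does not currently support melt. You can see what is available here: https://arrow.apache.org/docs/python/api/compute.html#

One option is to build the melted table yourself: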

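import pyarrow as pa
import pyarrow.csv

# Read the wide ("pivoted") CSV; the first column holds the row labels.
table = pa.csv.read_csv("data.csv")
tables = []
# Build one three-column table per value column, then stack them.
for column_name in table.schema.names[1:]:
    tables.append(pa.Table.from_arrays(
        [
            table[0],
            pa.array([column_name] * table.num_rows, pa.string()),
            table[column_name],
        ],
        names=[
            table.schema.names[0],
            "key",
            "value",
        ],
    ))

result = pa.concat_tables(tables)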

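Another option is to use polars, which is similar to pandas but uses Arrow as its backend. Unlike pyarrow, it has many more compute functions, including melt: https://pola-rs.github.io/polars/py-polars/html/reference/dataframe/api/polars.DataFrame.melt.html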
