Writing JSON to a Parquet file using pyarrow



I am running the following code:

import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json
parquet_schema = schema = pyarrow.schema(
[('id', pyarrow.string()),
('firstname', pyarrow.string()),
('lastname', pyarrow.string())])

user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
writer = pq.ParquetWriter('user.parquet', schema=parquet_schema)
df = pd.DataFrame.from_dict(json.loads(user_json))
table = pyarrow.Table.from_pandas(df)
print(table.schema)
writer.write_table(table)
writer.close()

But I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-a427a4cdd392> in <module>()
15 writer = pq.ParquetWriter('user.parquet', schema=parquet_schema)
16 
---> 17 df = pd.DataFrame.from_dict(json.loads(user_json))
18 table = pyarrow.Table.from_pandas(df)
19 print(table.schema)
4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in extract_index(data)
385 
386         if not indexes and not raw_lengths:
--> 387             raise ValueError("If using all scalar values, you must pass an index")
388 
389         if have_series:
ValueError: If using all scalar values, you must pass an index

I followed the documentation and tutorials, but I am missing something.

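Since you are working with columnar data, the libraries you are using expect you to pass a list of row values for each column rather than a single scalar.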

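I guess you won't be writing single-row Parquet files in real life. In that case you can group your values by column, which works for both pandas and Arrow.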
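As a minimal sketch of what "grouping by column" could look like, the snippet below turns several JSON records into one list of values per key using only the standard library. The helper name `records_to_columns` and the sample rows are hypothetical, and the sketch assumes every record carries the same keys (otherwise the columns end up with different lengths).

```python
import json
from collections import defaultdict


def records_to_columns(json_lines):
    """Group JSON object strings into one list of values per key."""
    columns = defaultdict(list)
    for line in json_lines:
        record = json.loads(line)
        for key, value in record.items():
            # Append each record's value to the column named after its key.
            columns[key].append(value)
    return dict(columns)


# Hypothetical sample input: one JSON object per row.
rows = [
    '{"id": "id1", "firstname": "John", "lastname": "Doe"}',
    '{"id": "id2", "firstname": "Jack", "lastname": "Ryan"}',
]
users = records_to_columns(rows)
print(users)
```

The resulting dict of lists has the shape that both `pd.DataFrame(...)` and `pyarrow.Table.from_pydict(...)` accept.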

You can also avoid pandas altogether and use the pyarrow.Table.from_pydict method:

import pyarrow
import pyarrow.parquet as pq
users = {"id" : ["id1", "id2"], 
"firstname": ["John", "Jack"], 
"lastname": ["Doe", "Ryan"]}
table = pyarrow.Table.from_pydict(users)
print(table.schema)
with pq.ParquetWriter('user.parquet', schema=table.schema) as writer:
    writer.write_table(table)

See https://arrow.apache.org/cookbook/py/create.html#create-table-from-plain-types and https://arrow.apache.org/cookbook/py/io.html#write-a-parquet-file

You have three options:

  1. Stop using scalar values and turn the values of the dict (from the JSON string) into lists.
import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json

user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)
# Make all values in the dict a list
for key, value in user_dict.items():
    user_dict[key] = [value]
df = pd.DataFrame(user_dict)
df.to_parquet('myfile.parquet')

  2. Simply pass an index when loading scalar values (e.g. 2 rather than [2]).
import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json

user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)
# Pass an index instead
df = pd.DataFrame(user_dict, index=[0])
df.to_parquet('myfile.parquet')

  3. Use `DataFrame.from_records`.
import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json

user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)
# Use `DataFrame.from_records`, wrapping the dict in a list so it is treated as one record
df = pd.DataFrame.from_records([user_dict])
df.to_parquet('myfile.parquet')

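The third is the simplest, but I would probably get into the habit of not passing scalar values to a DataFrame and use the solution from the first option.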

Related question: Constructing pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"
