I'm having trouble loading a large JSON Lines file in pandas, mainly because I need to "flatten" one of the columns I get after calling pd.read_json. For example, for this JSON line:
{"user_history": [{"event_info": 248595, "event_timestamp": "2019-10-01T12:46:03.145-0400", "event_type": "view"}, {"event_info": 248595, "event_timestamp": "2019-10-01T13:21:50.697-0400", "event_type": "view"}], "item_bought": 1909110}
I need to load 2 rows and 4 columns in pandas, like this:
+--------------+--------------------------------+--------------+---------------+
| "event_info" | "event_timestamp"              | "event_type" | "item_bought" |
+--------------+--------------------------------+--------------+---------------+
| 248595       | "2019-10-01T12:46:03.145-0400" | "view"       | 1909110       |
| 248595       | "2019-10-01T13:21:50.697-0400" | "view"       | 1909110       |
+--------------+--------------------------------+--------------+---------------+
The problem is that, given the size of the file (413,000+ lines, over 1 GB), every approach I've managed to come up with is too slow for me. For example:
history_df = pd.read_json('data/train_dataset.jl', lines=True)
history_df['index1'] = history_df.index
normalized_history = pd.DataFrame()
for index, row in history_df.iterrows():
    for dic in row['user_history']:
        dic['index1'] = row['index1']
        dic['item_bought'] = row['item_bought']
        normalized_history = normalized_history.append(dic, ignore_index=True)
So the question is: what is the fastest way to achieve this? Is there a way that avoids iterating over the entire history_df DataFrame?
Thanks in advance.
Maybe you can try this:
import pandas as pd
import json

data = []
# assuming each line of data/train_dataset.jl
# is a JSON object like the one you posted above:
with open('data/train_dataset.jl') as f:
    for line in f:
        data.append(json.loads(line))

normalized_history = pd.json_normalize(data, 'user_history', 'item_bought')
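As a quick sanity check, here is a minimal, self-contained sketch that applies the same pd.json_normalize call to just the single record posted in the question (record_path picks the list that becomes rows, meta repeats the top-level field on every row):

```python
import pandas as pd

# The sample record from the question.
record = {
    "user_history": [
        {"event_info": 248595, "event_timestamp": "2019-10-01T12:46:03.145-0400", "event_type": "view"},
        {"event_info": 248595, "event_timestamp": "2019-10-01T13:21:50.697-0400", "event_type": "view"},
    ],
    "item_bought": 1909110,
}

# record_path explodes each dict in 'user_history' into its own row;
# meta copies the top-level 'item_bought' value onto every row.
df = pd.json_normalize([record], record_path="user_history", meta="item_bought")

print(df.shape)          # (2, 4)
print(list(df.columns))  # ['event_info', 'event_timestamp', 'event_type', 'item_bought']
```

If memory becomes a problem with the full 1 GB file, note that pd.read_json also accepts a chunksize= argument together with lines=True, so the same normalization can be applied chunk by chunk instead of on one giant list.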