What is the most efficient way to "flatten" JSON in a pandas DataFrame?



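I'm having trouble loading a large JSON Lines file into pandas, mainly because I need to "flatten" one of the columns that pd.read_json produces. For example, for this JSON line: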

{"user_history": [{"event_info": 248595, "event_timestamp": "2019-10-01T12:46:03.145-0400", "event_type": "view"}, {"event_info": 248595, "event_timestamp": "2019-10-01T13:21:50.697-0400", "event_type": "view"}], "item_bought": 1909110}

I need it loaded into pandas as 2 rows and 4 columns, like this:

+--------------+--------------------------------+--------------+---------------+
| "event_info" |       "event_timestamp"        | "event_type" | "item_bought" |
+--------------+--------------------------------+--------------+---------------+
|       248595 | "2019-10-01T12:46:03.145-0400" | "view"       |       1909110 |
|       248595 | "2019-10-01T13:21:50.697-0400" | "view"       |       1909110 |
+--------------+--------------------------------+--------------+---------------+

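The problem is that, given the size of the file (413,000+ lines, over 1 GB), none of the approaches I have managed to get working is fast enough for me: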

history_df = pd.read_json('data/train_dataset.jl', lines=True)
history_df['index1'] = history_df.index
normalized_history = pd.DataFrame()
for index, row in history_df.iterrows():
    for dic in row['user_history']:
        dic['index1'] = row['index1']
        dic['item_bought'] = row['item_bought']
        normalized_history = normalized_history.append(dic, ignore_index=True)

So the question is: what is the fastest way to do this? Is there a way to avoid iterating over the whole history_df DataFrame?

Thanks in advance.

Maybe you could try this:

import pandas as pd
import json
data = []
# assuming each line from data/train_dataset.jl
# is a json object like the one you posted above:
with open('data/train_dataset.jl') as f:
    for line in f:
        data.append(json.loads(line))
normalized_history = pd.json_normalize(data, 'user_history', 'item_bought')
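
To check the result shape without touching the 1 GB file, here is a minimal sketch that runs the same json_normalize call on the single sample line from the question. The io.StringIO object is only a stand-in for the real file, and the keyword names record_path and meta spell out what the positional arguments above mean. It assumes pandas 1.0 or later, where pd.json_normalize is available at the top level.

```python
import io
import json

import pandas as pd

# Stand-in for data/train_dataset.jl: the one sample line from the question.
sample_file = io.StringIO(
    '{"user_history": ['
    '{"event_info": 248595, "event_timestamp": "2019-10-01T12:46:03.145-0400", "event_type": "view"}, '
    '{"event_info": 248595, "event_timestamp": "2019-10-01T13:21:50.697-0400", "event_type": "view"}'
    '], "item_bought": 1909110}\n'
)

# Parse each line into a plain dict, as in the answer above.
records = [json.loads(line) for line in sample_file]

# record_path: the nested list that becomes one row per element.
# meta: top-level fields repeated on every row produced from that record.
normalized_history = pd.json_normalize(
    records,
    record_path="user_history",
    meta=["item_bought"],
)

print(normalized_history)
```

This builds the DataFrame once from a list of dicts, whereas calling append inside the loop copies the growing DataFrame on every iteration, which is likely why the original version is so slow on 413,000+ lines.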
