Joining a nested JSON dataframe with another dataframe



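I am trying to join dataframe 1, which is generated from JSON, with dataframe 2 on the field order_id, and then assign the "status" from dataframe 2 to the "status" of dataframe 1. Does anyone know how to do this? Many thanks for your help.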

Dataframe 1

[{
"client_id": 1,
"name": "Test01",
"olist": [{
"order_id": 10000,
"order_dt_tm": "2012-12-01",
"status": ""   <== use "status" from dataframe2 to populate this field
},
{
"order_id": 10000,
"order_dt_tm": "2012-12-01",
"status": ""
}
]    
},
{
"client_id": 2,
"name": "Test02",
"olist": [{
"order_id": 10002,
"order_dt_tm": "2012-12-01",
"status": ""
},
{
"order_id": 10003,
"order_dt_tm": "2012-12-01",
"status": ""
}
]    
}
]

Dataframe 2

order_id     status
10002        "Delivered"
10001        "Ordered"

Here is the original dataset as a JSON string:

d = """[{
"client_id": 1,
"name": "Test01",
"olist": [{
"order_id": 10000,
"order_dt_tm": "2012-12-01",
"status": ""
},
{
"order_id": 10000,
"order_dt_tm": "2012-12-01",
"status": ""
}
]    
},
{
"client_id": 2,
"name": "Test02",
"olist": [{
"order_id": 10002,
"order_dt_tm": "2012-12-01",
"status": ""
},
{
"order_id": 10003,
"order_dt_tm": "2012-12-01",
"status": ""
}
]    
}
]"""

First, I would load it as JSON:

import json
data = json.loads(d)

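Then I would turn it into a pandas dataframe. Note that I drop the status field, because it will be filled in by the join step:

import pandas as pd
df1 = pd.json_normalize(data, 'olist')[['order_id', 'order_dt_tm']]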

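Then, using the second dataframe from the example, I would do a left join with the merge function:

# named data2 so the parsed JSON in data is not overwritten
data2 = {'order_id':[10002, 10001],'status':['Delivered', 'Ordered']}
df2 = pd.DataFrame(data2)
result = df1.merge(df2, on='order_id', how='left')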
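If you also need to know which client each order belongs to after the join, one option is to pass `meta` to `pd.json_normalize`. The following is a minimal sketch under that assumption; the variable name `orders` and the inline copy of the sample data are mine, not part of the question.

```python
import pandas as pd

# Sample data from the question, written as a Python literal
data = [
    {"client_id": 1, "name": "Test01", "olist": [
        {"order_id": 10000, "order_dt_tm": "2012-12-01", "status": ""},
        {"order_id": 10000, "order_dt_tm": "2012-12-01", "status": ""}]},
    {"client_id": 2, "name": "Test02", "olist": [
        {"order_id": 10002, "order_dt_tm": "2012-12-01", "status": ""},
        {"order_id": 10003, "order_dt_tm": "2012-12-01", "status": ""}]},
]

# Flatten the nested orders and carry the parent fields along as extra columns
orders = pd.json_normalize(data, record_path="olist", meta=["client_id", "name"])
# Drop the empty status column so the merge does not create status_x / status_y
orders = orders.drop(columns="status")

df2 = pd.DataFrame({"order_id": [10002, 10001], "status": ["Delivered", "Ordered"]})

# Left join keeps every order; orders without a match get an empty status
result = orders.merge(df2, on="order_id", how="left")
result["status"] = result["status"].fillna("")
print(result)
```

Only order 10002 exists in both dataframes, so it is the only row that ends up with a non-empty status.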

Good luck.

Update

# JSON to Dataframe
df1 = pd.json_normalize(data)
# Sub JSON to dataframe
df1['sub_df'] = df1['olist'].apply(lambda x: pd.json_normalize(x).drop('status', axis=1))
# Build second dataframe
data2 = {'order_id':[10002, 10001],'status':['Delivered', 'Ordered']}
df2 = pd.DataFrame(data2)
# Populates status in sub dataframes
df1['sub_df'] = df1['sub_df'].apply(lambda x: x.merge(df2, on='order_id', how='left').fillna(''))
# Sub dataframes back to JSON
def back_to_json_str(df):
    # turns a df back to string json
    return str(df.to_json(orient="records", indent=4))

df1['olist'] = df1['sub_df'].apply(lambda x: back_to_json_str(x))
# Global DF back to JSON string
parsed = str(df1.drop('sub_df', axis=1).to_json(orient="records", indent=4))
# Undo the escaping that to_json applied to the nested JSON strings
parsed = parsed.replace(r'\n', '\n')
parsed = parsed.replace(r'\"', '"')
# Print result
print(parsed)
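If the goal is only to fill in "status" inside the nested JSON, the round trip through sub-dataframes and string replacement may not be needed. Here is a rough alternative sketch, not the approach used above: it builds a plain dict from df2 and updates the parsed JSON in place. The name `status_by_order` is my own, and I am assuming unmatched orders should keep an empty status.

```python
import json
import pandas as pd

# Sample data from the question, written as a Python literal
data = [
    {"client_id": 1, "name": "Test01", "olist": [
        {"order_id": 10000, "order_dt_tm": "2012-12-01", "status": ""},
        {"order_id": 10000, "order_dt_tm": "2012-12-01", "status": ""}]},
    {"client_id": 2, "name": "Test02", "olist": [
        {"order_id": 10002, "order_dt_tm": "2012-12-01", "status": ""},
        {"order_id": 10003, "order_dt_tm": "2012-12-01", "status": ""}]},
]
df2 = pd.DataFrame({"order_id": [10002, 10001], "status": ["Delivered", "Ordered"]})

# order_id -> status lookup; tolist() converts to plain Python ints and strings
status_by_order = dict(zip(df2["order_id"].tolist(), df2["status"].tolist()))

# Walk the nested structure and fill in status where df2 has a match
for client in data:
    for order in client["olist"]:
        order["status"] = status_by_order.get(order["order_id"], "")

# json.dumps emits valid JSON directly, so no manual unescaping is required
parsed = json.dumps(data, indent=4)
print(parsed)
```

Because the nested lists stay as real lists rather than embedded strings, the output can be loaded again with json.loads.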

Update 2: here is one way to add an index column to the dataframe:

df1['index'] = [e for e in range(df1.shape[0])]

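This is my code for assigning the title values from the dataframe back to the JSON object. If the JSON object contains 100,000 records, the assignment takes some time. Does anyone know how to improve the performance of this code? Many thanks.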

import json
import random
import pandas as pd
import pydash as _
data = [{"pid":1,"name":"Test1","title":""},{"pid":2,"name":"Test2","title":""}] # 5000 records
# dataframe1
df = pd.json_normalize(data)
# dataframe2
pid = [x for x in range(1, 5000)]
title_set = ["Boss", "CEO", "CFO", "PMO", "Team Lead"]
titles = [title_set[random.randrange(0, 5)] for x in range(1, 5000)]
df2 = pd.DataFrame({'pid': pid, 'title': titles})
#left join dataframe1 and dataframe2
df3 = df.merge(df2, on='pid', how='left')
#assign title values from dataframe back to the json object
for row in df3.iterrows():
    idx = _.find_index(data, lambda x: x['pid'] == row[1]['pid'])
    data[idx]['title'] = row[1]['title_y']
print(data)
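The loop above scans the whole list once per row to find the matching pid, so the work grows roughly quadratically with the number of records. One possible way to speed it up is a dict lookup, sketched below. The names `title_by_pid` and `record` are mine, the 5000 generated records stand in for the real data, and unlike the merge version this sketch leaves the title unchanged when a pid has no match.

```python
import random
import pandas as pd

# Stand-in for the real JSON object: 5000 records with empty titles
data = [{"pid": i, "name": f"Test{i}", "title": ""} for i in range(1, 5001)]

# dataframe2, built the same way as in the question (pids 1..4999)
title_set = ["Boss", "CEO", "CFO", "PMO", "Team Lead"]
pid = list(range(1, 5000))
titles = [random.choice(title_set) for _ in pid]
df2 = pd.DataFrame({"pid": pid, "title": titles})

# Build the pid -> title lookup once; each later lookup is constant time on average
title_by_pid = dict(zip(df2["pid"].tolist(), df2["title"].tolist()))

# Single pass over the records, no merge and no per-row search
for record in data:
    record["title"] = title_by_pid.get(record["pid"], record["title"])

print(data[:3])
```

This replaces the merge, iterrows, and find_index steps with one pass over the list, which should scale much better to 100,000 records.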
