我一直在尝试规范化嵌套的json,但遇到的问题是如何处理NAN和空字段,因为我得到了一个=>KeyError:"address">当我尝试使用进行规范化时
pd.json_normalize(data ,record_path=["address"],record_prefix='add.')
我正在使用的数据
data = [{
"id": "1",
"name": "test",
"last": "last"
}, {
"id": "2",
"name": "test1",
"last": "last1",
"address": []
},
{
"id": "2",
"name": "test2",
"last": "last2",
"address": [{
"steet": "streed2",
"no": 2
}]
}]
输出我想要
id | name | last | add.street | add.no
1 | test | last | NAN | NAN
2 | test1 | last1 | |
3 | test2 | last2 | street2 | 2
您可以使用以下方法(我对您提供的json字符串进行了拼写检查(:
所以,你的数据是:
data = [{
"id": "1",
"name": "test",
"last": "last"
}, {
"id": "2",
"name": "test1",
"last": "last1",
"address": []
},
{
"id": "3",
"name": "test2",
"last": "last2",
"address": [{
"street": "street d2",
"no": 2
}]
}]
第一步是将其插入数据帧:
df = pd.DataFrame(data)
print(df)
返回:
id name last address
0 1 test last NaN
1 2 test1 last1 []
2 3 test2 last2 [{'steet': 'streed2', 'no': 2}]
正如你所指出的,nan
的存在是相当有问题的。因此,合理的做法是;走开";。第一分解地址列。这将公开json字符串,而不是json字符串列表:
df = df.explode('address')
print(df)
哪个给出:
id name last address
0 1 test last NaN
1 2 test1 last1 NaN
2 3 test2 last2 {'steet': 'streed2', 'no': 2}
然后可以通过用{}
替换它们来处理nan
,这允许您规范化df。
df.address = df.address.fillna({i: {} for i in df.index})
# use json_normalize
df = df.join(pd.json_normalize(df.address)).drop(columns=['address'])
print(df)
它给你:
id name last street no
0 1 test last NaN NaN
1 2 test1 last1 NaN NaN
2 3 test2 last2 street d2 2.0