pandas-json使用一些空的record_path进行规范化



我一直在尝试规范化嵌套的json,但遇到的问题是如何处理NAN和空字段,因为我得到了一个=>KeyError:"address">当我尝试使用进行规范化时

pd.json_normalize(data ,record_path=["address"],record_prefix='add.')

我正在使用的数据

data = [{
"id": "1",
"name": "test",
"last": "last"
}, {
"id": "2",
"name": "test1",
"last": "last1",
"address": []
},
{
"id": "2",
"name": "test2",
"last": "last2",
"address": [{
"steet": "streed2",
"no": 2
}]
}]

输出我想要

id | name  | last | add.street | add.no
1  | test  | last | NAN        | NAN
2  | test1 | last1 |           |  
3  | test2 | last2 | street2   | 2

您可以使用以下方法(我对您提供的json字符串进行了拼写检查(:

所以,你的数据是:

data = [{
"id": "1",
"name": "test",
"last": "last"
}, {
"id": "2",
"name": "test1",
"last": "last1",
"address": []
},
{
"id": "3",
"name": "test2",
"last": "last2",
"address": [{
"street": "street d2",
"no": 2
}]
}]

第一步是将其插入数据帧:

df = pd.DataFrame(data)
print(df)

返回:

id   name   last                          address
0  1   test   last                              NaN
1  2  test1  last1                               []
2  3  test2  last2  [{'steet': 'streed2', 'no': 2}]

正如你所指出的,nan的存在是相当有问题的。因此,合理的做法是;走开";。第一分解地址列。这将公开json字符串,而不是json字符串列表:

df = df.explode('address')
print(df)

哪个给出:

id   name   last                        address
0  1   test   last                            NaN
1  2  test1  last1                            NaN
2  3  test2  last2  {'steet': 'streed2', 'no': 2}

然后可以通过用{}替换它们来处理nan,这允许您规范化df。

df.address = df.address.fillna({i: {} for i in df.index})
# use json_normalize
df = df.join(pd.json_normalize(df.address)).drop(columns=['address'])
print(df)

它给你:

id   name   last     street   no
0  1   test   last        NaN  NaN
1  2  test1  last1        NaN  NaN
2  3  test2  last2  street d2  2.0

最新更新