Using pandas.json_normalize to "unfold" a dictionary of lists of dictionaries

I am new to Python (and to coding in general), so I will do my best to explain the challenge I am trying to solve.

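I am working with a large dataset that was exported from a database as a CSV. However, one column in this CSV export contains (as far as I can tell) a nested list of dictionaries. I have searched extensively online for a solution, including Stack Overflow, but have not found a complete one. I think I understand conceptually what I am trying to achieve, but I am not clear on the best method to use or how to prepare the data.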

Here is a sample of the data (cut down to the two columns I am interested in):

{
  "app_ID": {
    "0": "1abe23574",
    "1": "4gbn21096"
  },
  "locations": {
    "0": "[ {"loc_id" : "abc1",  "lat" : "12.3456",  "long" : "101.9876"},
           {"loc_id" : "abc2",  "lat" : "45.7890",  "long" : "102.6543"} ]",
    "1": "[ ]"
  }
}

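Basically, each app_ID can have multiple locations tied to a single ID, or it can be empty, as shown above. I have tried following some guides I found online that use pandas' json_normalize() function to "unfold" the list of dictionaries, that is, to put each dictionary into its own row in a pandas DataFrame.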
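
As a first step, the "locations" values appear to be JSON text rather than real lists, so they need parsing before anything can be normalized. Here is a minimal sketch, assuming the export has the shape shown above, with the IDs quoted and each "locations" cell holding valid JSON. The inline `data` dict is a stand-in for the real file.

```python
import json

import pandas as pd

# Stand-in for the exported data (assumed shape, not the real file).
data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": '[{"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"},'
             ' {"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"}]',
        "1": "[ ]",
    },
}

# The outer keys ("0", "1") become the index, the inner names the columns.
df = pd.DataFrame(data)

# Turn each JSON string into a real Python list of dicts.
df["locations"] = df["locations"].apply(json.loads)
```

After this, each cell in the "locations" column is an actual list (possibly empty) instead of a string.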

I would like to end up with something like this:

loc_id    lat      long       app_ID
abc1      12.3456  101.9876   1abe23574
abc2      45.7890  102.6543   1abe23574

And so on...

I am learning how to use the different parameters of json_normalize, such as "record_path" and "meta", but I have not been able to get it to work.
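
For reference, here is a small sketch of how record_path and meta are meant to work once the data is a list of records, each carrying its own list of locations. The `records` list below is hand-built for illustration and is not the shape of the export itself.

```python
import pandas as pd

# Hypothetical, already-parsed records: one dict per app.
records = [
    {
        "app_ID": "1abe23574",
        "locations": [
            {"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"},
            {"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"},
        ],
    },
    {"app_ID": "4gbn21096", "locations": []},
]

# record_path names the nested list that becomes the rows;
# meta names the parent fields copied onto each of those rows.
flat = pd.json_normalize(records, record_path="locations", meta=["app_ID"])
```

An app with an empty locations list contributes no rows here.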

I tried loading the JSON file into a Jupyter notebook with:

import json
import pandas as pd

with open('location_json.json', 'r') as f:
    data = json.loads(f.read())
df = pd.json_normalize(data, record_path=['locations'])

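But it only created a DataFrame that is 1 row by many columns. I would like to get multiple rows generated from the innermost dictionaries, with those rows tied to the app_ID and loc_id fields.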

Attempted solution:

I was able to get close to the DataFrame format I want using:

with open('location_json.json', 'r') as f:
    data = json.loads(f.read())
df = pd.json_normalize(data['locations']['0'])

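But this would require some kind of iteration over the lists to build the DataFrame, and then I would lose the connection to the app_ID field (as best I understand how the json_normalize function works).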
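
One way the iteration could keep the link to app_ID is to normalize each list separately, tag it with the matching ID, and concatenate the pieces. This is a sketch under the assumption that both columns share the same keys ("0", "1", ...) and that the location cells are JSON strings. The inline `data` stands in for the loaded file.

```python
import json

import pandas as pd

# Stand-in for the loaded file (assumed shape).
data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": '[{"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"},'
             ' {"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"}]',
        "1": "[ ]",
    },
}

frames = []
for key, raw in data["locations"].items():
    locations = json.loads(raw)      # JSON text -> list of dicts
    if not locations:
        continue                     # skip apps with no locations
    part = pd.json_normalize(locations)
    part["app_ID"] = data["app_ID"][key]  # re-attach the ID via the shared key
    frames.append(part)

result = pd.concat(frames, ignore_index=True)
```

Apps with no locations are dropped in this version; whether that is acceptable depends on what the final table is for.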

Am I on the right track trying to use json_normalize, or should I start over and try a different route? Any advice or guidance would be greatly appreciated.

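I cannot say that recommending the convtools library is a good idea given that you are a beginner, because this library is almost like another Python on top of Python. It helps to define data transformations dynamically (generating Python code under the hood).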

But anyway, if I understood the input data correctly, here is the code:

import json

from convtools import conversion as c

data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": """[ {"loc_id" : "abc1",  "lat" : "12.3456",  "long" : "101.9876" },
        {"loc_id" : "abc2",  "lat" : "45.7890",  "long" : "102.6543"} ]""",
        "1": "[ ]",
    },
}

# define it once and use multiple times
converter = (
    c.join(
        # converts "app_ID" data to iterable of dicts
        (
            c.item("app_ID")
            .call_method("items")
            .iter({"id": c.item(0), "app_id": c.item(1)})
        ),
        # converts "locations" data to iterable of dicts,
        # where each id like "0" is zipped to each location.
        # the result is iterable of dicts like {"id": "0", "loc": {"loc_id": ... }}
        (
            c.item("locations")
            .call_method("items")
            .iter(
                c.zip(id=c.repeat(c.item(0)), loc=c.item(1).pipe(json.loads))
            )
            .flatten()
        ),
        # join on "id"
        c.LEFT.item("id") == c.RIGHT.item("id"),
        how="full",
    )
    # process results, where 0 index is LEFT item, 1 index is the RIGHT one
    .iter(
        {
            "loc_id": c.item(1, "loc", "loc_id", default=None),
            "lat": c.item(1, "loc", "lat", default=None),
            "long": c.item(1, "loc", "long", default=None),
            "app_id": c.item(0, "app_id"),
        }
    )
    .as_type(list)
    .gen_converter()
)

result = converter(data)
assert result == [
    {'loc_id': 'abc1', 'lat': '12.3456', 'long': '101.9876', 'app_id': '1abe23574'},
    {'loc_id': 'abc2', 'lat': '45.7890', 'long': '102.6543', 'app_id': '1abe23574'},
    {'loc_id': None, 'lat': None, 'long': None, 'app_id': '4gbn21096'}
]
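
For comparison, a pandas-only sketch of the same join idea (not using convtools) could build one frame per column and left-join them on the shared key, so that apps without locations are kept with missing values. The column name `id` is made up here for the join key, and pandas fills the gaps with NaN rather than None.

```python
import json

import pandas as pd

# Same input as above.
data = {
    "app_ID": {"0": "1abe23574", "1": "4gbn21096"},
    "locations": {
        "0": '[{"loc_id": "abc1", "lat": "12.3456", "long": "101.9876"},'
             ' {"loc_id": "abc2", "lat": "45.7890", "long": "102.6543"}]',
        "1": "[ ]",
    },
}

# One row per app: columns "id" (the shared key) and "app_ID".
apps = pd.Series(data["app_ID"], name="app_ID").rename_axis("id").reset_index()

# One row per location, each tagged with the key of the app it belongs to.
locs = pd.DataFrame(
    [
        {"id": key, **loc}
        for key, raw in data["locations"].items()
        for loc in json.loads(raw)
    ]
)

# A left join keeps every app, including those with no locations.
merged = apps.merge(locs, on="id", how="left").drop(columns="id")
```

Switching `how="left"` to `how="inner"` would drop the apps that have no locations.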
