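I'm using Python 3.9.x. I have a problem: I want to merge a list of dictionaries in an optimal way. These are not simple dictionaries, though: each one has a simple numeric ID, plus a list of dictionaries called "codes", which holds the dictionaries to be merged.
An example of the raw data is shown below: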
[
    {
        "id" : "1234",
        "codes" : [
            {
                "provider" : "provider1",
                "id" : "1234",
            },
            {
                "provider" : "provider2",
                "id" : "AA0001",
            },
            {
                "provider" : "provider3",
                "id" : "tt00001",
            },
            {
                "provider" : "provider4",
                "id" : "0000-0000-27E0-0000-9-0000-0000-A",
            }
        ]
    },
    {
        "id" : "12345",
        "codes" : [
            {
                "provider" : "provider1",
                "id" : "2345",
            },
            {
                "provider" : "provider3",
                "id" : "tt00001",
            },
            {
                "provider" : "provider4",
                "id" : "0000-0000-27E0-0000-9-0000-0000-A",
            },
            {
                "provider" : "provider5",
                "id" : "F0046872",
            },
        ]
    },
    {
        "id": "123456",
        "codes": [
            {
                "id": "0000",
                "provider": "provider6"
            }
        ]
    }
]
In the example above, we can see that two dictionaries in "codes" are common to two of the objects (id "1234" and id "12345"):
    {
        "provider" : "provider3",
        "id" : "tt00001",
    },
    {
        "provider" : "provider4",
        "id" : "0000-0000-27E0-0000-9-0000-0000-A",
    }
As soon as two objects share a provider and id combination, a merge should be triggered. So the objects with id 1234 and 12345 should be merged.
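The trigger condition can be sketched as a set intersection over (provider, id) pairs. This is only a minimal illustration; `code_keys` and `should_merge` are hypothetical helper names that do not appear in the code shown in this question, and the records are shortened versions of the raw data above.

```python
from typing import Dict, List, Set, Tuple


def code_keys(record: Dict) -> Set[Tuple[str, str]]:
    """Return the (provider, id) pairs found in a record's "codes" list."""
    return {(code["provider"], code["id"]) for code in record["codes"]}


def should_merge(a: Dict, b: Dict) -> bool:
    """True when the two records share at least one (provider, id) pair."""
    return not code_keys(a).isdisjoint(code_keys(b))


rec_a = {"id": "1234", "codes": [
    {"provider": "provider1", "id": "1234"},
    {"provider": "provider3", "id": "tt00001"},
]}
rec_b = {"id": "12345", "codes": [
    {"provider": "provider1", "id": "2345"},
    {"provider": "provider3", "id": "tt00001"},
]}
rec_c = {"id": "123456", "codes": [
    {"provider": "provider6", "id": "0000"},
]}

print(should_merge(rec_a, rec_b))  # True: both contain provider3 / tt00001
print(should_merge(rec_a, rec_c))  # False: no pair in common
```

Building the sets once per record keeps the comparison independent of the order of the entries inside "codes".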
However, the desired output keeps every object, each with its own id, i.e.:
[
    {
        "id" : "1234",
        "codes" : [
            {
                "provider" : "provider1",
                "id" : "1234",
            },
            {
                "provider" : "provider2",
                "id" : "AA0001",
            },
            {
                "provider" : "provider3",
                "id" : "tt00001",
            },
            {
                "provider" : "provider4",
                "id" : "0000-0000-27E0-0000-9-0000-0000-A",
            },
            {
                "provider" : "provider5",
                "id" : "F0046872",
            }
        ]
    },
    {
        "id" : "12345",
        "codes" : [
            {
                "provider" : "provider1",
                "id" : "2345",
            },
            {
                "provider" : "provider3",
                "id" : "tt00001",
            },
            {
                "provider" : "provider5",
                "id" : "F0046872",
            },
            {
                "provider" : "provider2",
                "id" : "AA0001",
            },
            {
                "provider" : "provider4",
                "id" : "0000-0000-27E0-0000-9-0000-0000-A",
            }
        ]
    },
    {
        "id": "123456",
        "codes": [
            {
                "id": "0000",
                "provider": "provider6"
            }
        ]
    }
]
Some of my current code:
- codes_file contains the same data as the raw data above
- the Codes class that raw_data is read into is just a Python dataclass
import ast
import os

def test_codes_extractor_flattener(self, codes_extractor: CodesExtractor):
    codes_file = os.path.join(
        os.path.dirname(os.path.abspath(__file__)),
        "data/codes_cache_flatten.txt",
    )
    codes_cache = []
    with open(codes_file, "r") as in_file:
        data = ast.literal_eval(in_file.read())
        # Compare every code of every record against every code of every record.
        for raw_data in data:
            for code in raw_data["codes"]:
                for raw_data_ in data:
                    for code_ in raw_data_["codes"]:
                        if (code_["provider"] == code["provider"] and
                                code_["id"] == code["id"]):
                            self.merge(raw_data, raw_data_)
                            continue
                    continue
            codes_cache.append(Codes(**raw_data))
    assert len(codes_cache) == 2
    for codes in codes_cache:
        assert len(codes.codes) == 6

def merge(self, code_1, code_2):
    # Append to code_2 every code of code_1 that it does not already contain.
    code_2_codes = code_2["codes"]
    for code_ in code_1["codes"]:
        if code_ not in code_2_codes:
            code_2_codes.append(code_)
Notes:
- The order of the provider entries does not matter (as shown by the second entry in the desired output)
- The "id" field outside of "codes".
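Because the order of the entries does not matter, a test can compare two "codes" lists as sets rather than as lists. The following is a minimal sketch; `same_codes` is a hypothetical helper, and it assumes that every value inside a code dictionary is hashable (true for the strings shown here). Note that a set comparison also ignores duplicate entries.

```python
from typing import Dict, List


def same_codes(a: List[Dict[str, str]], b: List[Dict[str, str]]) -> bool:
    """Compare two code lists, ignoring element order and key order."""
    def as_set(codes: List[Dict[str, str]]) -> set:
        # frozenset(dict.items()) gives a hashable, order-free view of a dict.
        return {frozenset(code.items()) for code in codes}

    return as_set(a) == as_set(b)


expected = [
    {"provider": "provider1", "id": "2345"},
    {"provider": "provider3", "id": "tt00001"},
]
actual = [
    {"id": "tt00001", "provider": "provider3"},
    {"id": "2345", "provider": "provider1"},
]

print(same_codes(expected, actual))      # True: same entries, different order
print(same_codes(expected, actual[:1]))  # False: one entry is missing
```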
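I'm not sure whether you want to merge all the codes in the list as soon as some of them share a provider, or only the codes of any two objects that match. Here is one way to merge them all. If it is any two of them, you could still iterate over the pairs with itertools. I'll try to think of a solution for that case, but it's my lunch time now :D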
from pprint import pprint
import typing as ty


class Data(ty.TypedDict):
    id: str
    codes: ty.List[ty.Dict[str, str]]


data: ty.List[Data] = [
    ...  # your initial raw data
]

# merge all the codes
merged_codes = set()
id_codes: Data
for id_codes in data:
    for code in id_codes["codes"]:
        merged_codes.add(frozenset(code.items()))

len_all_codes = sum(len(id_codes["codes"]) for id_codes in data)
if len(merged_codes) < len_all_codes:  # so some got merged
    # update all codes to only have the merged providers
    for id_codes in data:
        id_codes["codes"] = [dict(obj) for obj in merged_codes]

pprint(data)
This prints:
$ python3 tmp.py
[{'codes': [{'id': '2345', 'provider': 'provider1'},
            {'id': 'F0046872', 'provider': 'provider5'},
            {'id': '1234', 'provider': 'provider1'},
            {'id': 'AA0001', 'provider': 'provider2'},
            {'id': 'tt00001', 'provider': 'provider3'},
            {'id': '0000-0000-27E0-0000-9-0000-0000-A',
             'provider': 'provider4'}],
  'id': '1234'},
 {'codes': [{'id': '2345', 'provider': 'provider1'},
            {'id': 'F0046872', 'provider': 'provider5'},
            {'id': '1234', 'provider': 'provider1'},
            {'id': 'AA0001', 'provider': 'provider2'},
            {'id': 'tt00001', 'provider': 'provider3'},
            {'id': '0000-0000-27E0-0000-9-0000-0000-A',
             'provider': 'provider4'}],
  'id': '12345'}]
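For the "any two of them" case, a pairwise pass with `itertools.combinations` might look like the sketch below. It rests on an assumption that is inferred from the desired output in the question and not stated there: when two objects share a (provider, id) pair, each one only gains the codes of providers it does not have yet, so each keeps its own provider1 entry. `code_keys` and `merge_pairwise` are hypothetical names.

```python
import copy
import itertools
from typing import Dict, List, Set, Tuple


def code_keys(record: Dict) -> Set[Tuple[str, str]]:
    """Return the (provider, id) pairs found in a record's "codes" list."""
    return {(code["provider"], code["id"]) for code in record["codes"]}


def merge_pairwise(records: List[Dict]) -> List[Dict]:
    """Return a merged deep copy of records; the input is left untouched."""
    merged = copy.deepcopy(records)
    # Matching uses the original codes only, so codes gained during the
    # merge never trigger further merges (the pass is not transitive).
    keys = [code_keys(record) for record in records]
    for i, j in itertools.combinations(range(len(records)), 2):
        if keys[i].isdisjoint(keys[j]):
            continue  # no shared (provider, id) pair, nothing to merge
        for src, dst in ((i, j), (j, i)):
            present = {code["provider"] for code in merged[dst]["codes"]}
            for code in records[src]["codes"]:
                # Assumption: only providers missing from dst are copied over.
                if code["provider"] not in present:
                    merged[dst]["codes"].append(dict(code))
                    present.add(code["provider"])
    return merged


data = [
    {"id": "1234", "codes": [
        {"provider": "provider1", "id": "1234"},
        {"provider": "provider2", "id": "AA0001"},
        {"provider": "provider3", "id": "tt00001"},
        {"provider": "provider4", "id": "0000-0000-27E0-0000-9-0000-0000-A"},
    ]},
    {"id": "12345", "codes": [
        {"provider": "provider1", "id": "2345"},
        {"provider": "provider3", "id": "tt00001"},
        {"provider": "provider4", "id": "0000-0000-27E0-0000-9-0000-0000-A"},
        {"provider": "provider5", "id": "F0046872"},
    ]},
    {"id": "123456", "codes": [
        {"provider": "provider6", "id": "0000"},
    ]},
]

merged = merge_pairwise(data)
for record in merged:
    print(record["id"], sorted(code["provider"] for code in record["codes"]))
```

With the raw data from the question, the first two objects each end up with one code per provider (provider1 through provider5) and the third is unchanged. Because matching is done against the original codes, a chain such as A sharing with B and B sharing with C does not carry A's codes over to C; that would need a grouping step first.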