Convert a list of dicts [{str: int}, {str: int}, ...] into a single {str: int} dict



Given a data structure like this:

[{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]

The goal is to parse the data to produce:

{'a': 2.5, 'b': 4, 'c': 6, 'd': 0}

by:

  • summing the values for each unique key
  • averaging the values for each key

Is there a simple way to do the data munging described above?


I tried the following, and it works:

from collections import defaultdict
from statistics import mean

x = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]
z = defaultdict(list)
for y in x:
    for k, v in y.items():
        z[k].append(v)
output = {k: mean(v) for k,v in z.items()}

But is there a simpler way to do the same parsing, perhaps with collections.Counter or something similar?

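If you want something with Counter, you can total the values and count the keys separately, then build the averages like this:

from collections import Counter

original = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]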
sum_counter = dict(sum([Counter(x) for x in original], Counter()))
count_counter = dict(sum([Counter(x.keys()) for x in original], Counter()))
final = {k: sum_counter.get(k,0)/count_counter[k] for k in count_counter}
print(final)

Output:

{'a': 2.5, 'b': 4.0, 'c': 6.0, 'd': 0.0}
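One detail of the Counter version is worth spelling out: adding Counter objects with + keeps only keys whose combined count is positive. That appears to be why the lookup uses .get(k, 0), since 'd' sums to 0 and drops out of the summed counter. A minimal check, with illustrative variable names that are not from the answer:

```python
from collections import Counter

# Counter addition discards keys whose combined count is zero or negative.
summed = Counter({'d': 0}) + Counter({'d': 0, 'b': 6})
print('d' in summed)      # False: the zero total was dropped
print(summed['b'])        # 6

# Counting key occurrences is unaffected, because each occurrence adds 1.
occurrences = Counter(['d']) + Counter(['d', 'b'])
print(occurrences['d'])   # 2

# Caveat: a negative intermediate total is dropped as well, so the
# running sum can silently lose data when values may be negative.
lossy = sum([Counter({'a': -1}), Counter({'a': 3})], Counter())
print(lossy['a'])         # 3, although -1 + 3 == 2
```

With non-negative values, as in the sample data, the averages come out right; with negative values the defaultdict-based approaches look like the safer choice.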

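Edit: I had another idea that may be a simpler solution to your problem (and it turns out to be much faster too). The idea is to walk through the list of dicts and build a new dict that stores, for each key, the sum of its values and the number of times it occurs. We can then compute the average for each key simply by dividing the sum by the count.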

from collections import defaultdict
original = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]
ddict = defaultdict(lambda: [0,0])
for dictionary in original:
    for key in dictionary:
        ddict[key][0] += dictionary[key]  # running sum of values
        ddict[key][1] += 1                # number of occurrences

final = {k: ddict[k][0]/ddict[k][1] for k in ddict}
print(final)

The output is still the same:

{'a': 2.5, 'b': 4.0, 'c': 6.0, 'd': 0.0}

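One option (similar to @JANO's answer) is to use collections.Counter twice: once to get the sum of the values, and once more to count how many values each key has across all the dicts. A dict comprehension then gives the averages: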

from collections import Counter
from itertools import chain
sums = sum(map(Counter, lst), Counter())
counts = Counter(chain.from_iterable(map(dict.keys, lst)))
out = {k: sums[k] / v for k,v in counts.items()}

Another option is to use cytoolz.dicttoolz.merge_with to create a dict of lists, then iterate over it to get the averages:

from cytoolz.dicttoolz import merge_with
out = {k: sum(v)/len(v) for k,v in merge_with(list, *lst).items()}

Output:

{'a': 2.5, 'b': 4.0, 'c': 6.0, 'd': 0.0}

Timings:

>>> lst = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}] * 100000
>>> %timeit counter_dc(lst)
3.32 s ± 90.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit defaultdict_dc(lst)
241 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dicttools_dc(lst)
66.9 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

where the functions are:

def counter_dc(lst):
    sums = sum(map(Counter, lst), Counter())
    counts = Counter(chain.from_iterable(map(dict.keys, lst)))
    return {k: sums[k] / v for k,v in counts.items()}

def defaultdict_dc(lst):
    out = defaultdict(list)
    for d in lst:
        for k,v in d.items():
            out[k].append(v)
    return {k: sum(v)/len(v) for k,v in out.items()}

def dicttools_dc(lst):
    return {k: sum(v)/len(v) for k,v in merge_with(list, *lst).items()}
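The %timeit lines are IPython magics. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. The helper name best_time and the smaller repeat factor (1000 instead of 100000) are assumptions for illustration, and only the two stdlib-only functions are included; absolute numbers will vary by machine.

```python
import timeit
from collections import Counter, defaultdict
from itertools import chain

def counter_dc(lst):
    sums = sum(map(Counter, lst), Counter())
    counts = Counter(chain.from_iterable(map(dict.keys, lst)))
    return {k: sums[k] / v for k, v in counts.items()}

def defaultdict_dc(lst):
    out = defaultdict(list)
    for d in lst:
        for k, v in d.items():
            out[k].append(v)
    return {k: sum(v) / len(v) for k, v in out.items()}

def best_time(func, data, repeat=3):
    """Hypothetical helper: best wall-clock seconds for one call of func(data)."""
    return min(timeit.repeat(lambda: func(data), number=1, repeat=repeat))

# Smaller input than in the timings so the check finishes quickly.
data = [{'a': 1, 'b': 2}, {'c': 3}, {'a': 4, 'c': 9}, {'d': 0}, {'d': 0, 'b': 6}] * 1000

for func in (counter_dc, defaultdict_dc):
    print(f"{func.__name__}: {best_time(func, data):.4f} s")
```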

If you're open to using pandas, it's simply:

import pandas as pd

lst = [{"a": 1, "b": 2}, {"c": 3}, {"a": 4, "c": 9}, {"d": 0}, {"d": 0, "b": 6}]
print(pd.DataFrame(lst).mean().to_dict())

Prints:

{'a': 2.5, 'b': 4.0, 'c': 6.0, 'd': 0.0}

How does this work for you?

def average_dicts(in_data):
    # Keep a running sum and an occurrence count per key. Updating with
    # (old + val) / 2 is a true mean only while a key appears at most twice.
    sums, counts = {}, {}
    for dic in in_data:
        for key, val in dic.items():
            sums[key] = sums.get(key, 0) + val
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    in_data = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]
    print(average_dicts(in_data))
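A running update of the form (old + val) / 2 matches the true mean only while a key has appeared at most twice, which happens to hold for the sample data. A small check with three values, next to an incremental update that does track the mean (the variable names here are illustrative):

```python
from statistics import mean

values = [1, 2, 6]

# Pairwise "halving" update: later values get more weight than earlier ones.
halved = values[0]
for v in values[1:]:
    halved = (halved + v) / 2

# Incremental mean: move the estimate toward each value by 1/n of the gap.
incremental = 0.0
for n, v in enumerate(values, start=1):
    incremental += (v - incremental) / n

print(halved)        # 3.75
print(incremental)   # 3.0
print(mean(values))  # 3
```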
