Given a data structure like this:
[{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]
the goal is to parse the data to produce:
{'a': 2.5, 'b': 4, 'c': 6, 'd': 0}
by:
- accumulating the values for each unique key
- averaging the values per key
Is there a simple way to do this?
I tried the following, and it works:
from collections import defaultdict
from statistics import mean

x = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]
z = defaultdict(list)
for y in x:
    for k, v in y.items():
        z[k].append(v)
output = {k: mean(v) for k,v in z.items()}
But is there a simpler way to do the same parsing, perhaps with collections.Counter or something similar?
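If you want something that uses Counter, you can total the values and count the keys separately, then build the averages like this:
from collections import Counter

original = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]
# Counter addition drops keys whose total is not positive (here 'd'), hence .get(k, 0) below.
sum_counter = dict(sum([Counter(x) for x in original], Counter()))
count_counter = dict(sum([Counter(x.keys()) for x in original], Counter()))
final = {k: sum_counter.get(k,0)/count_counter[k] for k in count_counter}
print(final)
Output: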
{'a': 2.5, 'b': 4.0, 'c': 6.0, 'd': 0.0}
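One caveat with summing Counters using `+`: the result only keeps keys whose total is positive. That is harmless for the sample data (the dropped total for 'd' is 0 anyway), but with negative values an intermediate total can be discarded and the final average comes out wrong. A minimal sketch that avoids this by using `Counter.update`, which keeps zero and negative totals; the function name `mean_by_key` is only illustrative:

```python
from collections import Counter


def mean_by_key(dicts):
    sums = Counter()    # key -> running total of values
    counts = Counter()  # key -> number of dicts containing the key
    for d in dicts:
        sums.update(d)           # a mapping argument adds its values to the totals
        counts.update(d.keys())  # an iterable argument counts each element once
    return {k: sums[k] / counts[k] for k in counts}


original = [{'a': 1, 'b': 2}, {'c': 3}, {'a': 4, 'c': 9}, {'d': 0}, {'d': 0, 'b': 6}]
print(mean_by_key(original))  # {'a': 2.5, 'b': 4.0, 'c': 6.0, 'd': 0.0}

# With a negative value, `+` silently drops the intermediate total of -5:
signed = [{'a': -5}, {'a': 10}]
print(sum(map(Counter, signed), Counter()))  # Counter({'a': 10})
print(mean_by_key(signed))                   # {'a': 2.5}
```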
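Edit: I had another idea that may be a simpler solution to your problem (and it turns out to be much faster as well). The idea is to walk through the list of dictionaries and build a new dictionary that stores, for each key, the sum of its values and the number of times it occurs. The average for each key is then simply the first of those two numbers divided by the second.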
from collections import defaultdict

original = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]
# Each key maps to [sum of values, number of occurrences]
ddict = defaultdict(lambda: [0,0])
for dictionary in original:
    for key in dictionary:
        ddict[key][0] += dictionary[key]
        ddict[key][1] += 1
final = {k: ddict[k][0]/ddict[k][1] for k in ddict}
print(final)
The output is still the same:
{'a': 2.5, 'b': 4.0, 'c': 6.0, 'd': 0.0}
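One option (similar to @JANO's answer) is to use collections.Counter twice: once to get the sum of the values, and once more to get the number of values per key (by counting the keys across all the dictionaries). A dict comprehension then produces the averages: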
from collections import Counter
from itertools import chain
sums = sum(map(Counter, lst), Counter())
counts = Counter(chain.from_iterable(map(dict.keys, lst)))
out = {k: sums[k] / v for k,v in counts.items()}
Another option is to use cytoolz.dicttoolz.merge_with to build a dictionary of lists, then iterate over it to get the averages:
from cytoolz.dicttoolz import merge_with
out = {k: sum(v)/len(v) for k,v in merge_with(list, *lst).items()}
Output:
{'a': 2.5, 'b': 4.0, 'c': 6.0, 'd': 0.0}
Timings:
>>> lst = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}] * 100000
>>> %timeit counter_dc(lst)
3.32 s ± 90.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit defaultdict_dc(lst)
241 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit dicttools_dc(lst)
66.9 ms ± 1.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
where the functions are:
def counter_dc(lst):
    sums = sum(map(Counter, lst), Counter())
    counts = Counter(chain.from_iterable(map(dict.keys, lst)))
    return {k: sums[k] / v for k,v in counts.items()}

def defaultdict_dc(lst):
    out = defaultdict(list)
    for d in lst:
        for k,v in d.items():
            out[k].append(v)
    return {k: sum(v)/len(v) for k,v in out.items()}

def dicttools_dc(lst):
    return {k: sum(v)/len(v) for k,v in merge_with(list, *lst).items()}
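Outside IPython, where %timeit is not available, a comparable measurement can be sketched with the standard timeit module. The smaller input size and the repeat counts below are arbitrary choices for illustration, not the settings used for the numbers above, and only defaultdict_dc is timed here:

```python
import timeit
from collections import defaultdict


def defaultdict_dc(lst):
    out = defaultdict(list)
    for d in lst:
        for k, v in d.items():
            out[k].append(v)
    return {k: sum(v) / len(v) for k, v in out.items()}


# Smaller than the benchmark input so the example runs quickly.
lst = [{'a': 1, 'b': 2}, {'c': 3}, {'a': 4, 'c': 9}, {'d': 0}, {'d': 0, 'b': 6}] * 1000

# repeat() returns one total time per run; the minimum is usually the least noisy.
calls_per_run = 10
times = timeit.repeat(lambda: defaultdict_dc(lst), number=calls_per_run, repeat=5)
best_per_call = min(times) / calls_per_run
print(f"defaultdict_dc: {best_per_call * 1e3:.3f} ms per call")
```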
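If you're open to using pandas, then simply:
import pandas as pd

lst = [{"a": 1, "b": 2}, {"c": 3}, {"a": 4, "c": 9}, {"d": 0}, {"d": 0, "b": 6}]
# Keys missing from a dict become NaN, which mean() skips by default.
print(pd.DataFrame(lst).mean().to_dict())
Prints: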
{'a': 2.5, 'b': 4.0, 'c': 6.0, 'd': 0.0}
How does this work for you?
def average_dicts(in_data):
    # Keep a running sum and count per key so the mean is correct
    # for any number of occurrences of a key.
    sums = {}
    counts = {}
    for dic in in_data:
        for key, val in dic.items():
            sums[key] = sums.get(key, 0) + val
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


if __name__ == "__main__":
    in_data = [{'a':1, 'b': 2}, {'c':3 }, {'a':4, 'c':9}, {'d':0}, {'d': 0, 'b':6}]
    print(average_dicts(in_data))
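A caveat on updating an average in place: replacing the stored value with (old + new) / 2 gives the true mean only while a key has been seen at most twice, because earlier values get progressively less weight. For the values 1, 2, 6 that update yields 3.75 rather than 3.0. A minimal sketch of an incremental mean that stays correct for any number of occurrences, assuming numeric values (the name `running_mean_by_key` is hypothetical):

```python
def running_mean_by_key(dicts):
    means = {}   # key -> mean of the values seen so far
    counts = {}  # key -> number of values seen so far
    for d in dicts:
        for key, val in d.items():
            n = counts.get(key, 0) + 1
            counts[key] = n
            prev = means.get(key, 0.0)
            # Incremental update: new_mean = prev + (val - prev) / n
            means[key] = prev + (val - prev) / n
    return means


data = [{'a': 1}, {'a': 2}, {'a': 6}]
print(running_mean_by_key(data))  # {'a': 3.0}
```

This needs the per-key count as well as the running mean, so it saves no state compared with keeping a sum and a count; it is mainly useful when the mean must be available after every update.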