Take a list of dictionaries and return a dictionary of lists grouped by multiple keys

I am writing a function with two parameters (data, keys). It takes schools_list (see below) as the first argument and the tuple groupby_keys as the second:

groupby_keys = ('region', 'state')
schools_list = [
{'region': 'northeast', 'state': 'MA', 'school': 'Brandeis', 'stu_pop': 5800},
{'region': 'south', 'state': 'GA', 'school': 'Gatech', 'stu_pop': 36489},
{'region': 'westcoast', 'state': 'CA', 'school': 'Stanford', 'stu_pop': 17249},
{'region': 'northeast', 'state': 'MA', 'school': 'Olin', 'stu_pop': 390},
{'region': 'south', 'state': 'TX', 'school': 'UT Austin', 'stu_pop': 51090},
{'region': 'northeast', 'state': 'CT', 'school': 'Yale', 'stu_pop': 13609},
{'region': 'northeast', 'state': 'CT', 'school': 'Trinity College', 'stu_pop': 2198},
{'region': 'westcoast', 'state': 'OR', 'school': 'Reed', 'stu_pop': 1470},
{'region': 'westcoast', 'state': 'CA', 'school': 'Harvey Mudd', 'stu_pop': 895},
{'region': 'westcoast', 'state': 'WA', 'school': 'UW', 'stu_pop': 47571},
{'region': 'south', 'state': 'TX', 'school': 'TCU', 'stu_pop': 11024},
{'region': 'northeast', 'state': 'MA', 'school': 'Tufts', 'stu_pop': 11878},
{'region': 'south', 'state': 'TX', 'school': 'SMU', 'stu_pop': 12373},
{'region': 'westcoast', 'state': 'OR', 'school': 'Lewis & Clark', 'stu_pop': 3390}
]

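The function should group the dictionaries in the list passed as the first argument by the keys named in the tuple passed as the second argument (without using numpy or pandas), and return output like this: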

{
('northeast','MA'):
[{'region':'northeast', 'state':'MA', 'school':'Brandeis', 'stu_pop':5800},
{'region':'northeast', 'state':'MA', 'school':'Tufts', 'stu_pop':11878}],
('northeast','CT'): 
[{'region':'northeast', 'state':'CT', 'school':'Yale', 'stu_pop':13609}, 
{'region':'northeast', 'state':'CT', 'school':'Trinity College', 'stu_pop':2198}],
...
}

Here is my code:

def group_by_field(source, fields):
    data = source
    value_sets = []
    # create a dict with unique tuple-keys from the data
    for datum in data:
        temp = []
        for field in fields:
            if datum[field] not in temp:
                temp.append(datum[field])
        if temp not in value_sets:
            value_sets.append(tuple(temp))
    groups = dict.fromkeys(value_sets, [])
    # the check function checks whether a dict has the values specified by the tuple
    def check(dic, fields, tup):
        sum_check = len(fields)
        for field, val in zip(fields, tup):
            if dic[field] == val:
                sum_check = sum_check - 1
        if sum_check == 0:
            return True
        return False
    # append the correct dict to the correct tuple-key
    for value_set in value_sets:
        for datum in data:
            if check(datum, fields, value_set):
                groups[value_set].append(datum)
                data.remove(datum)  # so that the 1st for-loop doesn't have to loop through this value again
    return groups

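The problem is that my code works when the list has only a few elements, but when there are several thousand elements it runs very slowly. How can I optimize it?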

Thanks a lot!

You can use itertools.groupby():

from itertools import groupby
school_tuples = sorted((((d["region"], d["state"]), d) for d in schools_list), key=lambda x: x[0])
print({key: [d for k, d in group] for key, group in groupby(school_tuples, key=lambda x: x[0])})

This should print:

{('northeast', 'CT'): [{'region': 'northeast',
'school': 'Yale',
'state': 'CT',
'stu_pop': 13609},
{'region': 'northeast',
'school': 'Trinity College',
'state': 'CT',
'stu_pop': 2198}],
('northeast', 'MA'): [{'region': 'northeast',
'school': 'Brandeis',
'state': 'MA',
'stu_pop': 5800},
...

You can use itertools.groupby() to group all similar elements together.

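itertools.groupby() starts a new group every time the value of the key function changes, which is why the data usually needs to be sorted with that same key function first.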
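A small sketch of that behavior may help. The `rows` data and the `key_func` name below are made up for illustration and are not part of the question. On unsorted input the same key comes back as several separate groups; after sorting with the same key function it comes back once.

```python
from itertools import groupby

# Hypothetical sample data, only for illustration.
rows = [
    {"region": "south", "state": "TX"},
    {"region": "northeast", "state": "MA"},
    {"region": "south", "state": "TX"},
]


def key_func(row):
    # Build the grouping key from the two fields of interest.
    return (row["region"], row["state"])


# Unsorted: groupby starts a new group each time the key changes,
# so ("south", "TX") shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(rows, key=key_func)]

# Sorted with the same key function: each key shows up exactly once.
sorted_keys = [key for key, _ in groupby(sorted(rows, key=key_func), key=key_func)]

print(unsorted_keys)
print(sorted_keys)
```

If the unsorted groups were collected in a dict comprehension, the second ("south", "TX") group would silently overwrite the first, so rows would be lost without any error.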

The iterable (schools_list) must already be sorted by the same key function that groupby uses.

schools_list = sorted(schools_list, key=lambda x: tuple(x[key] for key in keys))

key specifies a function that extracts a comparison key from each element of the iterable.

from itertools import groupby

def your_funct(data, keys):
    # data must already be sorted by the same key (see the sorted() call above)
    return {
        key: list(group)
        for key, group in groupby(data, key=lambda x: tuple(x[key] for key in keys))
    }

Each group that is returned is itself an iterator, which is why I use list(group) to expand it.
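To illustrate why the group has to be expanded right away, here is a minimal sketch on a toy string (the `letters`, `eager`, and `stored` names are illustrative only). The group iterators share the underlying iterable with groupby, so a group read after groupby has moved on yields nothing.

```python
from itertools import groupby

letters = "aabbb"  # already sorted, so each letter forms one run

# Expand each group while it is the current one: contents are preserved.
eager = {key: list(group) for key, group in groupby(letters)}

# Store the group iterators and read them later: by then groupby has
# advanced past the first group, so it yields no items.
stored = [(key, group) for key, group in groupby(letters)]
first_key, first_group = stored[0]
late_read = list(first_group)

print(eager)
print(first_key, late_read)
```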

See the Python docs for more about itertools.groupby.
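As a possible alternative that the answers above do not use, the grouping can also be done in a single pass with collections.defaultdict, which avoids the sort that groupby needs. This is only a sketch: the function name `group_by_keys` and the `sample` data are hypothetical, and it assumes every dictionary contains all of the requested keys.

```python
from collections import defaultdict


def group_by_keys(data, keys):
    """Group dicts by the tuple of values found under `keys`, in one pass."""
    groups = defaultdict(list)
    for row in data:
        # The tuple of field values is hashable, so it can be a dict key.
        groups[tuple(row[k] for k in keys)].append(row)
    # Convert to a plain dict so missing keys raise KeyError as usual.
    return dict(groups)


# Hypothetical sample, shaped like a few entries of schools_list.
sample = [
    {"region": "northeast", "state": "MA", "school": "Brandeis", "stu_pop": 5800},
    {"region": "south", "state": "TX", "school": "TCU", "stu_pop": 11024},
    {"region": "northeast", "state": "MA", "school": "Tufts", "stu_pop": 11878},
]

result = group_by_keys(sample, ("region", "state"))
print(result[("northeast", "MA")])
```

Unlike the groupby version, this does not require sorted input, and the groups come out in the order their keys are first seen rather than in sorted order.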
