groupby contents of list



I have the following DataFrame:

import pandas as pd
d1 = {'id': ["car", "car", "bus", "plane", "plane"], 'value': [["a","b"], ["b","a"], ["a","b"], ["c","d"], ["d","c"]]}
df1 = pd.DataFrame(data=d1)
df1

      id   value
0    car  [a, b]
1    car  [b, a]
2    bus  [a, b]
3  plane  [c, d]
4  plane  [d, c]

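I want to group the ids by the contents of the lists in the value column. The order of the elements should not matter. Afterwards, I want to sort the groups by size, so that I get something like this: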

id  value
0   car [a, b]
1   car [b, a]
2   bus [a, b]
id      value
0   plane   [c, d]
1   plane   [d, c]

I tried using Counter() to turn each list into a dictionary and then get the group sizes. However, I get the following error:

import collections
df1["temp"] = list(map(collections.Counter,  df1["value"]))
df1 = df1.groupby('temp').size().sort_values(ascending = True)

TypeError: unhashable type: 'Counter'

You can sort each list so that element order is ignored. Lists are not hashable, so convert them to tuples, which can then be used as groupby keys:

# Group on an order-independent, hashable key: the sorted list as a tuple
for _, g in df1.groupby(df1['value'].map(lambda x: tuple(sorted(x)))):
    print(g)

Output:

id   value
0  car  [a, b]
1  car  [b, a]
2  bus  [a, b]
id   value
3  plane  [c, d]
4  plane  [d, c]
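As for the error in the question: groupby keys have to be hashable, and Counter is a dict subclass, so it is not. If the Counter idea is worth keeping (for instance because repeated elements matter), one possible sketch is to freeze the counts into a frozenset of (element, count) pairs. The helper name multiset_key is made up for this illustration, and sort=False is passed because frozensets have no useful total order to sort the group keys by.

```python
import collections

import pandas as pd

df1 = pd.DataFrame({
    "id": ["car", "car", "bus", "plane", "plane"],
    "value": [["a", "b"], ["b", "a"], ["a", "b"], ["c", "d"], ["d", "c"]],
})


def multiset_key(items):
    """Hypothetical helper: hashable, order-independent key that keeps counts."""
    return frozenset(collections.Counter(items).items())


keys = df1["value"].map(multiset_key)

# sort=False: keep groups in order of first appearance instead of sorting keys
group_sizes = [len(g) for _, g in df1.groupby(keys, sort=False)]
print(group_sizes)
```

For lists without repeated elements this groups the rows the same way as tuple(sorted(x)); the sorted-tuple key is simpler and is probably the better default.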

Sort the lists in the value column, join each one into a string, and use that string as the grouping key:

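# val_str is a temporary helper column, e.g. "a,b" for both [a, b] and [b, a]
groups = df1.assign(val_str=df1['value'].apply(sorted).str.join(',')).groupby('val_str')
for _, g in groups:  # separate groups
    g = g.drop('val_str', axis=1)  # drop the helper column before printing
    print(g)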

id   value
0  car  [a, b]
1  car  [b, a]
2  bus  [a, b]
id   value
3  plane  [c, d]
4  plane  [d, c]
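Neither loop above orders the groups by size, which the question also asks for. A minimal sketch of one way to do that, assuming the sorted-tuple key: collect the groups into a plain Python list and sort that list by row count. The variable names key and groups are chosen here for illustration, and largest-first order is an assumption based on the desired output shown in the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "id": ["car", "car", "bus", "plane", "plane"],
    "value": [["a", "b"], ["b", "a"], ["a", "b"], ["c", "d"], ["d", "c"]],
})

# Order-independent, hashable key for each row
key = df1["value"].map(lambda x: tuple(sorted(x)))

# One sub-DataFrame per distinct key
groups = [g for _, g in df1.groupby(key)]

# len() of a DataFrame is its row count; reverse=True puts the largest group first
groups.sort(key=len, reverse=True)

for g in groups:
    print(g)
```

Use reverse=False instead if the smallest group should come first.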