Group rows if they share at least one element



I have the following DataFrame:

import pandas as pd
d1 = {'id': ["car", "car", "bus", "plane", "plane", "plane"], 'value': [["ab","b"], ["b","ab"], ["ab","b"], ["cd","df"], ["d","cd"], ["df","df"]]}
df = pd.DataFrame(data=d1)
df
      id     value
0    car   [ab, b]
1    car   [b, ab]
2    bus   [ab, b]
3  plane  [cd, df]
4  plane   [d, cd]
5  plane  [df, df]

I want to group my ids whenever their lists in the value column have at least one element in common. The desired output looks like this:


    id    value
0  car  [ab, b]
1  car  [b, ab]
2  bus  [ab, b]
      id     value
0  plane  [cd, df]
1  plane   [d, cd]
      id     value
0  plane  [cd, df]
1  plane  [df, df]

I tried using groupby, but the problem is that some rows should appear in more than one of the resulting DataFrames, for example:

plane   [cd, df]

You can use set operations:

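# For each list element, collect the set of row labels that contain it,
# then keep only the sets with more than one row and drop duplicate sets
keep = (df.explode('value').reset_index().groupby('value')['index'].agg(frozenset)
          .loc[lambda s: s.str.len()>1].unique()
       )
for idx in keep:
    # sorted() turns the frozenset into a list of labels in index order
    print(df.loc[sorted(idx)])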

Output:

    id    value
0  car  [ab, b]
1  car  [b, ab]
2  bus  [ab, b]
      id     value
3  plane  [cd, df]
4  plane   [d, cd]
      id     value
3  plane  [cd, df]
5  plane  [df, df]

How it works

First, get the set of matching row indices for each value:

df.explode('value').reset_index().groupby('value')['index'].agg(frozenset)
value
ab    (0, 1, 2)
b     (0, 1, 2)
cd       (3, 4)
d           (4)
df       (3, 5)
Name: index, dtype: object

Then drop duplicate sets and keep only the groups with more than one member:

keep = (df.explode('value').reset_index().groupby('value')['index'].agg(frozenset)
.loc[lambda s: s.str.len()>1].unique()
)
[frozenset({0, 1, 2}), frozenset({3, 4}), frozenset({3, 5})]

Finally, loop over the groups and select the matching rows from df.
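The three steps can also be collected into one helper that returns the sub-DataFrames instead of printing them. This is only a minimal sketch: the function name `split_by_shared_values` and its `column` parameter are made up for illustration and are not part of pandas or of the answer. It assumes the default unnamed index (so that `reset_index()` produces a column called `index`) and uses `map(len)` and `sorted(idx)` as plain-Python stand-ins for the length check and the row selection.

```python
import pandas as pd


def split_by_shared_values(df, column="value"):
    """Return one sub-DataFrame per list element shared by at least two rows.

    Hypothetical helper: the name and signature are illustrative only.
    Assumes df has an unnamed index and that `column` holds lists.
    """
    groups = (
        df[column]
        .explode()                  # one row per list element, labels repeated
        .reset_index()              # row labels move into an 'index' column
        .groupby(column)["index"]
        .agg(frozenset)             # row labels that contain each element
    )
    # Keep elements found in more than one row, then drop duplicate label sets
    keep = groups[groups.map(len) > 1].unique()
    # sorted() gives a list of labels in index order for .loc
    return [df.loc[sorted(idx)] for idx in keep]


d1 = {'id': ["car", "car", "bus", "plane", "plane", "plane"],
      'value': [["ab", "b"], ["b", "ab"], ["ab", "b"],
                ["cd", "df"], ["d", "cd"], ["df", "df"]]}
df = pd.DataFrame(data=d1)

parts = split_by_shared_values(df)
for part in parts:
    print(part)
```

Returning a list makes it easy to post-process each group (for example with `part.reset_index(drop=True)` if the 0-based numbering from the question is wanted).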

Alternative syntax (same logic)

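s = df['value'].explode()
# Index.groupby returns {value: row labels}; count distinct labels so that a
# value repeated inside a single row does not form a group on its own
keep = dict.fromkeys(frozenset(x) for x in s.index.groupby(s).values() if len(set(x))>1)
for idx in keep:
    print(df.loc[sorted(idx)])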
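For readers who would rather see the set logic spelled out, the same idea can be sketched with a plain dictionary instead of `explode` and `groupby`. This is an illustrative variant, not the answer's code; the names `rows_by_value` and `groups` are assumptions introduced here. Note that the groups come out in first-seen order of the values rather than in sorted order.

```python
from collections import defaultdict

import pandas as pd

d1 = {'id': ["car", "car", "bus", "plane", "plane", "plane"],
      'value': [["ab", "b"], ["b", "ab"], ["ab", "b"],
                ["cd", "df"], ["d", "cd"], ["df", "df"]]}
df = pd.DataFrame(data=d1)

# Map each list element to the set of row labels whose list contains it
rows_by_value = defaultdict(set)
for label, values in df["value"].items():
    for v in values:
        rows_by_value[v].add(label)

# dict.fromkeys drops duplicate label sets while keeping first-seen order;
# sets with a single row are skipped because nothing is shared
keep = dict.fromkeys(
    frozenset(labels) for labels in rows_by_value.values() if len(labels) > 1
)

groups = [df.loc[sorted(idx)] for idx in keep]
for g in groups:
    print(g)
```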
