我正在寻找一种更Python/高效(且简短(的方法来分解(=枚举唯一实例(聚合数据帧中的行集差异。下表说明了数据帧操作不应过于复杂的内容:
产品 | product_group | 位置|
---|---|---|
10 | 世界其他地区 | |
11 | 美国 | |
12 | 1 | CA |
13 | 2 | 世界其他地区 |
14 | 2 | JP |
15 | 2 | 美国 |
16 | 3 | FR |
17 | 3 | BE |
18 | 4 | 世界其他地区 |
19 | 4 | 美国|
20 | 4 | CA |
每个组具有set
具体集的解决方案,不筛选出RoW
行,使用join
获得差异,最后使用factorize
和frozenset
s:
list_io_countries = ['US', 'CA', 'JP', 'BE', 'FR']
s = set(list_io_countries)
df = df.groupby(df['product_group'])['location'].agg(set).reset_index(name='location_list')
df = (df[['RoW' in x for x in df['location_list']]]
.assign(rest_of_world_location_list = lambda x: x['location_list'].apply(lambda x: ','.join(s - x)),
rest_of_world_index = lambda x: pd.factorize(x['location_list'].apply(lambda x: frozenset(x - set(['RoW']))))[0] + 1,
location_list = lambda x: x['location_list'].agg(','.join)
)
.assign(rest_of_world_index = lambda x: 'RoW_' + x['rest_of_world_index'].astype(str)))
print (df)
product_group location_list rest_of_world_location_list rest_of_world_index
0 1 RoW,CA,US JP,BE,FR RoW_1
1 2 RoW,JP,US CA,BE,FR RoW_2
3 4 RoW,CA,US JP,BE,FR RoW_1
IIUC,您可以使用一个包含3个步骤的管道:
world = set(list_locations)
(df.groupby('product_group', as_index=False)
# aggregate locations as string and the rest of the from from a set difference
.agg(**{'location_list': ('location', ', '.join),
'rest_of_world_location_list': ('location', lambda l: ', '.join(sorted(world.difference(l))))
})
# filter the rows without RoW
.loc[lambda d: d['location_list'].str.contains('RoW')]
# add category
.assign(rest_of_world_index=lambda d: 'RoW_'+d['location_list'].astype('category').cat.codes.add(1).astype(str)
)
)
输出:
product_group location_list rest_of_world_location_list rest_of_world_index
0 1 RoW, US, CA BE, FR, JP RoW_2
1 2 RoW, JP, US BE, CA, FR RoW_1
3 4 RoW, US, CA BE, FR, JP RoW_2