I have a DataFrame like this, with an ID column and a column of comma-separated strings:

ID | Preferences
---|---
1 | banana, apple
1 | banana, apple, kiwi
1 | avocado, apple
2 | avocado, grapes
2 | banana, apple, kiwi

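To get the preferences, first split the strings and expand the DataFrame into a longer Series, one preference per row. Then count the occurrences of each preference within each ID and rank them; a further `groupby` + `agg` lets you join ties for the first and second preferences into a single string.

The index of this result will be the unique 'ID' values from your original DataFrame, so you can use `concat` to combine it with the results of your other `groupby` + `agg` operations.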
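The expand, count, and rank steps described above can also be looked at in isolation. The following is only a minimal sketch, assuming pandas 0.25 or later so that `DataFrame.explode` is available; it swaps `explode` in for the `split(expand=True).stack()` approach, and the names `long_df`, `counts`, and `ranks` are illustrative.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Preferences': ['banana, apple', 'banana, apple, kiwi', 'avocado, apple',
                    'avocado, grapes', 'banana, apple, kiwi'],
})

# Split each string into a list, then give every list element its own row.
# (`long_df` is an illustrative name, not part of the answer's code.)
long_df = (df.assign(preference=df['Preferences'].str.split(', '))
             .explode('preference')[['ID', 'preference']])

# Count how often each preference occurs within each ID.
counts = long_df.groupby(['ID', 'preference']).size()

# Dense rank within each ID: tied counts share a rank and no rank is skipped.
ranks = counts.groupby(level='ID').rank(method='dense', ascending=False)

print(counts)
print(ranks)
```

With `method='dense'`, every preference that ties for the highest count gets rank 1, which is why all five fruits for ID 2 end up as "first" in the result.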

    import pandas as pd

    df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                       'Preferences': ['banana, apple', 'banana, apple, kiwi',
                                       'avocado, apple', 'avocado, grapes',
                                       'banana, apple, kiwi']})

    # Expand to long Series
    s = df.set_index(['ID']).Preferences.str.split(', ', expand=True).stack()

    # Within each ID, rank preferences based on # of occurrences
    s = (s.groupby([s.index.get_level_values(0), s.rename('preference')]).size()
          .groupby(level=0).rank(method='dense', ascending=False)
          .map({1: 'first', 2: 'second'}).rename('order'))

    # Keep only first and second preferences and join ties into one string
    res = (s[s.isin(['first', 'second'])].reset_index()
            .groupby(['ID', 'order']).agg(', '.join).unstack(-1))

    # Collapse MultiIndex to get simple column labels
    res.columns = [f'{y}_{x}' for x, y in res.columns]

    print(res)

Output:

                            first_preference second_preference
    ID
    1                                  apple            banana
    2   apple, avocado, banana, grapes, kiwi               NaN
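Because `res` is indexed by the unique IDs, combining it with another per-ID summary is a plain `concat` along the columns. Below is a minimal sketch of that step; the second summary, `n_rows`, is a hypothetical example that is not in the original post, standing in for whatever other `groupby` + `agg` results you already have.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Preferences': ['banana, apple', 'banana, apple, kiwi',
                                   'avocado, apple', 'avocado, grapes',
                                   'banana, apple, kiwi']})

# Recompute `res` exactly as in the answer's code.
s = df.set_index(['ID']).Preferences.str.split(', ', expand=True).stack()
s = (s.groupby([s.index.get_level_values(0), s.rename('preference')]).size()
      .groupby(level=0).rank(method='dense', ascending=False)
      .map({1: 'first', 2: 'second'}).rename('order'))
res = (s[s.isin(['first', 'second'])].reset_index()
        .groupby(['ID', 'order']).agg(', '.join).unstack(-1))
res.columns = [f'{y}_{x}' for x, y in res.columns]

# Hypothetical second summary (an assumption for illustration):
# the number of non-null preference strings per ID.
n_rows = df.groupby('ID').agg(n_rows=('Preferences', 'count'))

# Both frames are indexed by ID, so concat aligns them row by row on that index.
out = pd.concat([res, n_rows], axis=1)
print(out)
```

Using `axis=1` relies on index alignment rather than row position, so the order of IDs in the individual summaries does not matter.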