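I have raw data in the following format (the real input is a huge Excel sheet). Because one column holds a different value on each line, every person is spread over several rows, roughly like this example: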
import pandas as pd

data = [['tom', 10, 'vanilla'], ['tom', 10, 'chocolate'], ['nick', 15, 'strawberry'], ['nick', 15, 'butter pecan'], ['nick', 15, 'mint chocolate chip'], ['juli', 14, 'strawberry'], ['juli', 14, 'chocolate'], ['juli', 14, 'vanilla']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Favorite Ice Cream'])
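But I need to build a DataFrame from the raw data above in the format below, where the values of the last column are spread across separate columns and the duplicate rows are collapsed: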
data2 = [['tom', 10, 'vanilla', 'chocolate', '', ''], ['nick', 15, 'butter pecan', '', 'strawberry', 'mint chocolate chip'], ['juli', 14, 'vanilla', 'chocolate', 'strawberry', '']]
df2 = pd.DataFrame(data2, columns = ['Name', 'Age', 'Favorite Vanilla-based Ice Cream', 'Favorite Chocolate-based Ice Cream', 'Favorite Berry Ice Cream', 'Favorite Other Ice Cream'])
Thanks for your help!
Use pivot with a mapping dict:
mapping = {'Vanilla': ['vanilla', 'butter pecan'],
           'Chocolate': ['chocolate'],
           'Berry': ['strawberry']}

# Invert the mapping so each flavor points to its category
reverse_mapping = {v: k for k, l in mapping.items() for v in l}

# Flavors missing from the mapping fall into 'Other'
df['X-based'] = df['Favorite Ice Cream'].map(reverse_mapping).fillna('Other')

# One row per (Name, Age), one column per category
out = (df.pivot(index=['Name', 'Age'],
                columns='X-based',
                values='Favorite Ice Cream')
         .rename_axis(columns=None).reset_index().fillna(''))
Output:
>>> out
   Name  Age       Berry  Chocolate                Other       Vanilla
0  juli   14  strawberry  chocolate                            vanilla
1  nick   15  strawberry             mint chocolate chip  butter pecan
2   tom   10              chocolate                            vanilla
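The pivoted result sorts the rows by Name and the category columns alphabetically, and it keeps the short category names. If the exact layout of df2 from the question is needed, the following is a minimal sketch of one way to get there. The names `category`, `order`, `new_names`, `wide`, `keys` and `result` are made up for this illustration, and the target column names are copied from the question.

```python
import pandas as pd

data = [['tom', 10, 'vanilla'], ['tom', 10, 'chocolate'],
        ['nick', 15, 'strawberry'], ['nick', 15, 'butter pecan'],
        ['nick', 15, 'mint chocolate chip'], ['juli', 14, 'strawberry'],
        ['juli', 14, 'chocolate'], ['juli', 14, 'vanilla']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Favorite Ice Cream'])

mapping = {'Vanilla': ['vanilla', 'butter pecan'],
           'Chocolate': ['chocolate'],
           'Berry': ['strawberry']}
reverse_mapping = {v: k for k, l in mapping.items() for v in l}

# Category of each flavor; anything unmapped becomes 'Other'
category = df['Favorite Ice Cream'].map(reverse_mapping).fillna('Other')

# Assumed target layout, taken from df2 in the question
order = ['Vanilla', 'Chocolate', 'Berry', 'Other']
new_names = {'Vanilla': 'Favorite Vanilla-based Ice Cream',
             'Chocolate': 'Favorite Chocolate-based Ice Cream',
             'Berry': 'Favorite Berry Ice Cream',
             'Other': 'Favorite Other Ice Cream'}

wide = (df.assign(category=category)
          .pivot(index=['Name', 'Age'], columns='category',
                 values='Favorite Ice Cream')
          .reindex(columns=order)        # put the categories in the wanted order
          .rename(columns=new_names)     # switch to the long column names
          .rename_axis(columns=None)
          .fillna('')
          .reset_index())

# A left merge on the de-duplicated keys restores the original row order
keys = df[['Name', 'Age']].drop_duplicates()
result = keys.merge(wide, on=['Name', 'Age'], how='left')

print(result)
```

Note that pivot raises a ValueError when one person has two flavors in the same category (for example both 'vanilla' and 'butter pecan'). In that case, pivot_table with an aggregation such as `aggfunc=', '.join` is one possible alternative.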