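I want to build a dataset of unique fruits. I don't know in advance all the types (e.g. colour, store, price) that can appear under each fruit, and each type may also have duplicate rows. Is there a fully general way to detect all possible duplicates and capture all the unique information?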
   type    val        detail
0  fruit   apple
1  colour  green      greenish
2  colour  yellow
3  store   walmart    usa
4  price   10
5  NaN
6  fruit   banana
7  colour  yellow
8  fruit   pear
9  fruit   jackfruit
...
Expected output
   fruit      colour           store      price  detail           ...
0  apple      [green, yellow]  [walmart]  [10]   [greenish, usa]
1  banana     [yellow]         NaN        NaN
2  pear       NaN              NaN        NaN
3  jackfruit  NaN              NaN        NaN
This is what I tried, but it is far from the expected output, and it does not show the column names either.
df.groupby("type")["val"].agg(size=len, set=lambda x: set(x))
0 fruit {"apple",...}
1 colour ...
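First create a fruit column: take the val values where type is fruit, replace the non-matching values with NaN, and forward-fill the missing values. Then use DataFrame.pivot_table with a custom function that collects the unique non-NaN values, and finally flatten the MultiIndex: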
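# Mark the rows that start a new fruit
m = df['type'].eq('fruit')
# Carry each fruit name down to the rows that describe it
df['fruit'] = df['val'].where(m).ffill()
# Collect the unique non-NaN values per fruit and type, keeping first-seen order
df1 = (df.pivot_table(index='fruit', columns='type',
                      aggfunc=lambda x: list(dict.fromkeys(x.dropna())))
         .drop('fruit', axis=1, level=1))
# Flatten the (column, type) MultiIndex into names such as val_colour
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df1)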
          detail_colour detail_price detail_store       val_colour val_price
fruit
apple        [greenish]           []        [usa]  [green, yellow]      [10]
banana               []          NaN          NaN         [yellow]       NaN
jackfruit           NaN          NaN          NaN              NaN       NaN
pear                NaN          NaN          NaN              NaN       NaN

          val_store
fruit
apple     [walmart]
banana          NaN
jackfruit       NaN
pear            NaN
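
If the goal is the exact layout from the question (one column per type plus a single detail column, with fruits in their original order), the same forward-fill idea can be combined with a plain groupby loop instead of pivot_table. The following is only a sketch under assumptions: the sample frame is rebuilt by hand from the question, row 5 is assumed to be NaN in every column, and unique_non_null is a hypothetical helper name, not a pandas function.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question (row 5 assumed to be all NaN).
df = pd.DataFrame({
    "type": ["fruit", "colour", "colour", "store", "price", np.nan,
             "fruit", "colour", "fruit", "fruit"],
    "val": ["apple", "green", "yellow", "walmart", 10, np.nan,
            "banana", "yellow", "pear", "jackfruit"],
    "detail": [np.nan, "greenish", np.nan, "usa", np.nan, np.nan,
               np.nan, np.nan, np.nan, np.nan],
})


def unique_non_null(values):
    """Hypothetical helper: unique non-null values in first-seen order, or NaN if none."""
    out = list(dict.fromkeys(v for v in values if pd.notna(v)))
    return out if out else np.nan


# Same first step as above: carry each fruit name down to its attribute rows.
df["fruit"] = df["val"].where(df["type"].eq("fruit")).ffill()

records = []
for fruit, grp in df.groupby("fruit", sort=False):
    row = {"fruit": fruit}
    # Keep only attribute rows (a known type other than 'fruit').
    attrs = grp[grp["type"].notna() & grp["type"].ne("fruit")]
    for type_name, sub in attrs.groupby("type", sort=False):
        row[type_name] = unique_non_null(sub["val"])
    # All details of one fruit go into a single column, whatever their type.
    row["detail"] = unique_non_null(grp["detail"])
    records.append(row)

# Types missing for a fruit become NaN when the records are combined.
result = pd.DataFrame(records)
print(result)
```

Because the loop discovers type names from the data, no list of types has to be known up front. Using dict.fromkeys for deduplication keeps the first-seen order, which a set would not guarantee.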