Capture all unique information by group



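I want to create a dataset with one row per fruit. I don't know in advance all the types that can appear under each fruit (for example colour, store, price). There may also be duplicate rows for each type. Is there a fully general way to detect all possible duplicates and capture all the unique information?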

     type        val    detail
0   fruit      apple
1  colour      green  greenish
2  colour     yellow
3   store    walmart       usa
4   price         10
5     NaN
6   fruit     banana
7  colour     yellow
8   fruit       pear
9   fruit  jackfruit
...

Expected output

       fruit           colour      store  price           detail  ...
0      apple  [green, yellow]  [walmart]   [10]  [greenish, usa]
1     banana         [yellow]        NaN    NaN
2       pear              NaN        NaN    NaN
3  jackfruit              NaN        NaN    NaN

I tried the following, but it is far from the expected output. It also doesn't show column names.

df.groupby("type")["val"].agg(size=len, set=lambda x: set(x))
0 fruit   {"apple",...}
1 colour  ...

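First create a fruit column: take the val values where type is fruit, replace the non-matching values with NaN, and forward fill the missing values. Then pivot with DataFrame.pivot_table, using a custom aggregation function that collects the unique non-NaN values, drop the fruit column from the result, and finally flatten the MultiIndex columns: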

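# Mark the rows that start a new fruit.
m = df['type'].eq('fruit')
# Keep val only on those rows, then forward fill so every row carries its fruit.
df['fruit'] = df['val'].where(m).ffill()
# Pivot to one row per fruit; each cell lists the unique non-NaN values in order of appearance.
df1 = (df.pivot_table(index='fruit', columns='type',
                      aggfunc=lambda x: list(dict.fromkeys(x.dropna())))
         .drop('fruit', axis=1, level=1))
# Flatten the (column, type) MultiIndex into names such as val_colour.
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df1)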
detail_colour detail_price detail_store       val_colour val_price  
fruit                                                                          
apple        [greenish]           []        [usa]  [green, yellow]      [10]   
banana               []          NaN          NaN         [yellow]       NaN   
jackfruit           NaN          NaN          NaN              NaN       NaN   
pear                NaN          NaN          NaN              NaN       NaN   
val_store  
fruit                 
apple      [walmart]  
banana           NaN  
jackfruit        NaN  
pear             NaN  
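
If the layout from the question is wanted (one column per type, plus a single combined detail column), the same forward-fill idea can be paired with groupby and unstack instead of pivot_table. The following is only a minimal sketch under some assumptions: the sample DataFrame is reconstructed from the question with val stored as strings, and unique_list, attrs, wide and result are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question (assumption: val holds strings).
df = pd.DataFrame({
    "type": ["fruit", "colour", "colour", "store", "price", np.nan,
             "fruit", "colour", "fruit", "fruit"],
    "val": ["apple", "green", "yellow", "walmart", "10", np.nan,
            "banana", "yellow", "pear", "jackfruit"],
    "detail": [np.nan, "greenish", np.nan, "usa", np.nan, np.nan,
               np.nan, np.nan, np.nan, np.nan],
})


def unique_list(s):
    """Hypothetical helper: unique non-NaN values, in order of appearance."""
    return list(dict.fromkeys(s.dropna()))


# Same key as in the answer: every row is tagged with the fruit it belongs to.
df["fruit"] = df["val"].where(df["type"].eq("fruit")).ffill()

# Attribute rows only: skip the fruit header rows and rows without a type.
attrs = df[df["type"].notna() & df["type"].ne("fruit")]

# One list of unique values per (fruit, type), spread into one column per type.
wide = attrs.groupby(["fruit", "type"])["val"].agg(unique_list).unstack("type")

# All non-missing details of a fruit, collected into a single list.
detail = df.dropna(subset=["detail"]).groupby("fruit")["detail"].agg(unique_list)

# Restore fruits that have no attributes at all, in their original order.
fruits = unique_list(df["fruit"])
result = wide.join(detail).reindex(fruits).rename_axis("fruit").reset_index()
result.columns.name = None

print(result)
```

With this sample, the apple row should hold [green, yellow] under colour and [greenish, usa] under detail, while pear and jackfruit are all NaN. The type columns come out in sorted order (colour, price, store) because groupby sorts its keys; reorder them afterwards if the order matters.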
