Pandas groupby: aggregate string columns and move empty cells up



I have a DataFrame with duplicate identifiers and string columns that contain many empty cells.

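I want to group by the identifier and get all values in a single row. If a column has multiple entries for one identifier, I need a new column with a suffix, so I can identify it later.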

Here is my data:

ID   name1  name2   name3   name4   name5   name6   name7   name8
Tom  sarah          mike                
Tom                                 john    john        
Gen  paul                           
Gen         sandra                      
Gen                 lara    lara    lara    lara        
Gen                                                 mike    mike
Lara bill   bill    bill                    
Lara                josh    josh            
Lara                kevin   kevin   kevin   mike    
Lara                                        devon   devon   devon

This is the code I have so far:

grouped = df1.groupby('ID')
df1 = grouped.aggregate(lambda x: tuple(x))

It gives me the following result:

                      name1                    name2                  name3  
ID                                                                            
Gen   (paul, nan, nan, nan)  (nan, sandra, nan, nan)  (nan, nan, lara, nan)   
Lara  (bill, nan, nan, nan)    (bill, nan, nan, nan)  (bill, nan, nan, nan)   
Tom            (sarah, nan)               (nan, nan)             (nan, nan)   
                        name4                    name5  
ID                                                       
Gen     (nan, nan, lara, nan)    (nan, nan, lara, nan)   
Lara  (nan, josh, kevin, nan)  (nan, josh, kevin, nan)   
Tom               (mike, nan)              (nan, john)   
                       name6                    name7                   name8  
ID                                                                             
Gen    (nan, nan, lara, nan)    (nan, nan, nan, mike)   (nan, nan, nan, mike)  
Lara  (nan, nan, kevin, nan)  (nan, nan, mike, devon)  (nan, nan, nan, devon)  
Tom              (nan, john)               (nan, nan)              (nan, nan) 

But this is what I actually want, and I just don't know how to get there:

ID   name1  name2   name3   name3_suffixA   name3_suffixB   name4   name4_suffixA   name5   name6   name6_suffixA   name7   name8
Tom  sarah          mike                                                            john    john            
Gen  paul   sandra  lara                                    lara                    lara    lara                    mike    mike
Lara bill   bill    bill    josh            kevin           josh    kevin           kevin   mike    devon           devon   devon

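The actual names of the suffixes don't matter, and it also doesn't matter whether the additional entries appear at the end or in between.

I know there are some similar questions, but I couldn't apply any of those cases/solutions. I would really appreciate your help.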

Use:

s = df.set_index('ID').stack().to_frame('c')
df = (s.set_index(s.groupby(level=[0,1]).cumcount().astype(str), append=True)['c']
       .unstack([1,2])
       .sort_index(level=0, axis=1))
df.columns = ['{}_{}'.format(i,j) if j != '0' else '{}'.format(i) for i, j in df.columns]
print (df)
      name1   name2 name3 name3_1 name3_2 name4 name4_1  name5 name6 name6_1  
ID                                                                             
Gen    paul  sandra  lara    None    None  lara    None   lara  lara    None   
Lara   bill    bill  bill    josh   kevin  josh   kevin  kevin  mike   devon   
Tom   sarah    None  mike    None    None  None    None   john  john    None   
      name7  name8  
ID                  
Gen    mike   mike  
Lara  devon  devon  
Tom    None   None  

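Explanation

  1. set_index + stack: reshape to a long format with a MultiIndex of ID and column name; the NaN values are removed by stack
  2. cumcount: number the repeated values within each group of the first and second index levels, and append that counter as a new index level
  3. unstack: move the last two levels (column name and counter) into the columns, then sort the column MultiIndex with sort_index, by the first level only
  4. Flatten the MultiIndex with a list comprehension, leaving out the suffix 0 so the first column keeps its plain name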
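The four steps above can be traced one at a time. The following is a minimal sketch, assuming the empty cells are NaN and that the sample data matches the stacked output printed under Details. The names `records`, `cols`, `counter` and `wide` are illustrative only, and the explicit `.dropna()` is an addition so the result does not depend on how the installed pandas version's `stack` treats NaN.

```python
import pandas as pd

# Sample data reconstructed from the question; keys missing from a dict become NaN.
records = [
    ("Tom", {"name1": "sarah", "name3": "mike"}),
    ("Tom", {"name5": "john", "name6": "john"}),
    ("Gen", {"name1": "paul"}),
    ("Gen", {"name2": "sandra"}),
    ("Gen", {"name3": "lara", "name4": "lara", "name5": "lara", "name6": "lara"}),
    ("Gen", {"name7": "mike", "name8": "mike"}),
    ("Lara", {"name1": "bill", "name2": "bill", "name3": "bill"}),
    ("Lara", {"name3": "josh", "name4": "josh"}),
    ("Lara", {"name3": "kevin", "name4": "kevin", "name5": "kevin", "name6": "mike"}),
    ("Lara", {"name6": "devon", "name7": "devon", "name8": "devon"}),
]
cols = [f"name{i}" for i in range(1, 9)]
df = pd.DataFrame([row for _, row in records], columns=cols)
df.insert(0, "ID", [ident for ident, _ in records])

# Step 1: long format, one row per non-empty cell, indexed by (ID, column name).
s = df.set_index("ID").stack().dropna().to_frame("c")

# Step 2: number repeated (ID, column name) pairs as "0", "1", "2", ...
counter = s.groupby(level=[0, 1]).cumcount().astype(str)

# Step 3: append the counter as a third index level, then move the column
# name and the counter into the columns and sort them.
wide = (s.set_index(counter, append=True)["c"]
          .unstack([1, 2])
          .sort_index(level=0, axis=1))

# Step 4: flatten the column MultiIndex; counter "0" keeps the plain name.
wide.columns = [col if n == "0" else f"{col}_{n}" for col, n in wide.columns]
print(wide)
```

Because the counter is computed per ID and column, an ID with three values in name3 produces name3, name3_1 and name3_2, while IDs with fewer values get missing entries in the extra columns.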

Details

print (s)
                 c
ID                
Tom  name1   sarah
     name3    mike
     name5    john
     name6    john
Gen  name1    paul
     name2  sandra
     name3    lara
     name4    lara
     name5    lara
     name6    lara
     name7    mike
     name8    mike
Lara name1    bill
     name2    bill
     name3    bill
     name3    josh
     name4    josh
     name3   kevin
     name4   kevin
     name5   kevin
     name6    mike
     name6   devon
     name7   devon
     name8   devon

Another solution:

df = df.groupby('ID').agg(lambda x: list(x.dropna()))
L = ([pd.DataFrame(df[x].values.tolist(),
                  index=df.index,
                  columns = ['{}_{}'.format(x,i) if i != 0 
                                                 else '{}'.format(x) 
                                                 for i in range(df[x].str.len().max())]) for x in df])
df = pd.concat(L, axis=1)
print (df)
      name1   name2 name3 name3_1 name3_2 name4 name4_1  name5 name6 name6_1  
ID                                                                             
Gen    paul  sandra  lara    None    None  lara    None   lara  lara    None   
Lara   bill    bill  bill    josh   kevin  josh   kevin  kevin  mike   devon   
Tom   sarah    None  mike    None    None  None    None   john  john    None   
      name7  name8  
ID                  
Gen    mike   mike  
Lara  devon  devon  
Tom    None   None  

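Explanation

  1. Aggregate each column per ID: remove the NaN values with dropna and convert the rest to a list
  2. In a list comprehension, create a DataFrame from each column of lists, then join them together with concat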
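The two steps can also be wrapped in a small function. This is only a sketch under assumptions: `widen_by_lists` and the toy frame are hypothetical names and data, not part of the answer, and the lists are collected one column at a time with `groupby(...)[col].apply` rather than with a single `agg` call.

```python
import numpy as np
import pandas as pd

def widen_by_lists(df, id_col="ID"):
    """Hypothetical helper: one row per id, extra values in suffixed columns."""
    parts = []
    for col in df.columns.drop(id_col):
        # Step 1: per id, collect the non-missing values of this column in a list.
        lists = df.groupby(id_col)[col].apply(lambda x: x.dropna().tolist())
        width = int(lists.map(len).max())
        if width == 0:
            continue  # the column is empty for every id
        # Step 2: one DataFrame per column; shorter lists are padded with None.
        names = [col if i == 0 else f"{col}_{i}" for i in range(width)]
        parts.append(pd.DataFrame(lists.tolist(), index=lists.index, columns=names))
    # Join the per-column frames side by side.
    return pd.concat(parts, axis=1)

# Toy data (an assumption for illustration, smaller than the question's frame).
toy = pd.DataFrame({
    "ID": ["A", "A", "B", "B", "B"],
    "name1": ["x", np.nan, "p", np.nan, np.nan],
    "name2": [np.nan, "y", "q", "r", "s"],
})
wide = widen_by_lists(toy)
print(wide)
```

The widest list in each column decides how many suffixed columns are created, so name2 here expands to name2, name2_1 and name2_2 while name1 stays a single column.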

It's a bit more complicated than I thought. I'm using replace, stack and unstack:

import numpy as np

s = df.replace('', np.nan).set_index('ID').stack().reset_index(level=1)
# groupby here creates the suffix; as mentioned in the question, all distinct values are needed
s['suffix'] = s['level_1'] + '_' + s.groupby([s.index, s.level_1]).cumcount().astype(str)
s.set_index('suffix', append=True)[0].unstack()

suffix name1_0 name2_0 name3_0 name3_1 name3_2 name4_0 name4_1 name5_0  
ID                                                                       
Gen       paul  sandra    lara    None    None    lara    None    lara   
Lara      bill    bill    bill    josh   kevin    josh   kevin   kevin   
Tom      sarah    None    mike    None    None    None    None    john   
suffix name6_0 name6_1 name7_0 name8_0  
ID                                      
Gen       lara    None    mike    mike  
Lara      mike   devon   devon   devon  
Tom       john    None    None    None  
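The same suffixing idea can be written with `melt` and `pivot` in place of `stack` and `unstack`. This is a sketch of a variant, not the answer's own code: the toy frame and the column names `col`, `value`, `n` and `label` are assumptions made for illustration, and empty cells are assumed to be empty strings, as in the `replace` call above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); empty cells are empty strings.
toy = pd.DataFrame({
    "ID": ["A", "A", "B", "B"],
    "name1": ["x", "", "p", "q"],
    "name2": ["", "y", "", ""],
})

# Long format: one row per non-empty cell.
long = (toy.replace("", np.nan)
           .melt(id_vars="ID", var_name="col", value_name="value")
           .dropna(subset=["value"]))

# Counter per (ID, original column), then the suffixed label.
long["n"] = long.groupby(["ID", "col"]).cumcount()
long["label"] = long["col"] + "_" + long["n"].astype(str)

# Back to wide format: one row per ID, one column per label.
wide = long.pivot(index="ID", columns="label", values="value")
print(wide)
```

As in the answer above, every column gets a suffix here, including the first occurrence (name1_0), which the question says is acceptable.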
