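I have a data frame of string variables with duplicate identifiers and many empty cells.
I want to group by the identifier and have all values in one row. If a column has multiple entries for one identifier, I need a new column with a suffix so that I can identify it later.
This is my data: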
ID name1 name2 name3 name4 name5 name6 name7 name8
Tom sarah mike
Tom john john
Gen paul
Gen sandra
Gen lara lara lara lara
Gen mike mike
Lara bill bill bill
Lara josh josh
Lara kevin kevin kevin mike
Lara devon devon devon
This is the code I have used so far:
grouped = df1.groupby('ID')
df1 = grouped.aggregate(lambda x: tuple(x))
which gives me the following result:
name1 name2 name3
ID
Gen (paul, nan, nan, nan) (nan, sandra, nan, nan) (nan, nan, lara, nan)
Lara (bill, nan, nan, nan) (bill, nan, nan, nan) (bill, nan, nan, nan)
Tom (sarah, nan) (nan, nan) (nan, nan)
name4 name5
ID
Gen (nan, nan, lara, nan) (nan, nan, lara, nan)
Lara (nan, josh, kevin, nan) (nan, josh, kevin, nan)
Tom (mike, nan) (nan, john)
name6 name7 name8
ID
Gen (nan, nan, lara, nan) (nan, nan, nan, mike) (nan, nan, nan, mike)
Lara (nan, nan, kevin, nan) (nan, nan, mike, devon) (nan, nan, nan, devon)
Tom (nan, john) (nan, nan) (nan, nan)
But this is what I actually want, and I just don't know how to get there:
ID name1 name2 name3 name3_suffixA name3_suffixB name4 name4_suffixA name5 name6 name6_suffixA name7 name8
Tom sarah mike john john
Gen paul sandra lara lara lara lara mike mike
Lara bill bill bill josh kevin josh kevin kevin mike devon devon devon
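The actual name of the suffix does not matter, and it also does not matter whether the additional entries appear at the end or in between.
I know there are some similar questions, but I could not apply any of those cases or solutions to mine. I would really appreciate your help.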
Use:
s = df.set_index('ID').stack().to_frame('c')
df = (s.set_index(s.groupby(level=[0,1]).cumcount().astype(str), append=True)['c']
.unstack([1,2])
.sort_index(level=0, axis=1))
df.columns = ['{}_{}'.format(i,j) if j != '0' else '{}'.format(i) for i, j in df.columns]
print (df)
name1 name2 name3 name3_1 name3_2 name4 name4_1 name5 name6 name6_1
ID
Gen paul sandra lara None None lara None lara lara None
Lara bill bill bill josh kevin josh kevin kevin mike devon
Tom sarah None mike None None None None john john None
name7 name8
ID
Gen mike mike
Lara devon devon
Tom None None
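Explanation:
- Reshape with set_index and stack; NaNs are removed, and to_frame turns the resulting Series into a one-column DataFrame.
- Add a new index level with cumcount, counting within groups of the first and second index levels.
- Reshape the last two levels with unstack and sort the MultiIndex in the columns with sort_index on the first level.
- Flatten the MultiIndex with a list comprehension, leaving out the 0 suffix for the first column of each name.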
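As a minimal, self-contained sketch of the steps above, the following runs the same reshaping on a small hypothetical frame. The sample data and the names `counter` and `wide` are my own and not from the question. The explicit `dropna()` is an added safeguard: as far as I know, whether `stack` drops NaN on its own depends on the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data: repeated IDs, many empty cells.
df = pd.DataFrame({
    'ID':    ['Tom', 'Tom', 'Lara', 'Lara'],
    'name1': ['sarah', np.nan, 'bill', np.nan],
    'name3': ['mike', np.nan, 'bill', 'josh'],
    'name5': [np.nan, 'john', np.nan, np.nan],
})

# Step 1: long format, one row per (ID, column) pair that holds a value.
s = df.set_index('ID').stack().dropna().to_frame('c')

# Step 2: number repeated (ID, column) pairs 0, 1, 2, ... as strings.
counter = s.groupby(level=[0, 1]).cumcount().astype(str)

# Step 3: append the counter as a third index level, then move the
# column name and the counter into the columns and sort them.
wide = (s.set_index(counter, append=True)['c']
          .unstack([1, 2])
          .sort_index(level=0, axis=1))

# Step 4: flatten the column MultiIndex; counter '0' gets no suffix.
wide.columns = [name if n == '0' else '{}_{}'.format(name, n)
                for name, n in wide.columns]

print(wide)
```

Only `name3` has two values for the same ID here, so it is the only column that gets a suffixed companion (`name3_1`).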
Details:
print (s)
c
ID
Tom name1 sarah
name3 mike
name5 john
name6 john
Gen name1 paul
name2 sandra
name3 lara
name4 lara
name5 lara
name6 lara
name7 mike
name8 mike
Lara name1 bill
name2 bill
name3 bill
name3 josh
name4 josh
name3 kevin
name4 kevin
name5 kevin
name6 mike
name6 devon
name7 devon
name8 devon
Another solution:
df = df.groupby('ID').agg(lambda x: list(x.dropna()))
L = ([pd.DataFrame(df[x].values.tolist(),
index=df.index,
columns = ['{}_{}'.format(x,i) if i != 0
else '{}'.format(x)
for i in range(df[x].str.len().max())]) for x in df])
df = pd.concat(L, axis=1)
print (df)
name1 name2 name3 name3_1 name3_2 name4 name4_1 name5 name6 name6_1
ID
Gen paul sandra lara None None lara None lara lara None
Lara bill bill bill josh kevin josh kevin kevin mike devon
Tom sarah None mike None None None None john john None
name7 name8
ID
Gen mike mike
Lara devon devon
Tom None None
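Explanation:
- Aggregate each column with dropna and convert the result to a list.
- In a list comprehension, create a DataFrame for each column and join them together with concat.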
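The same idea, written as an explicit loop on a small hypothetical frame, might look like the sketch below. The sample data and the names `lists`, `parts` and `wide` are assumptions of mine, and I use `map(len)` instead of `.str.len()` to find the longest list, which should be equivalent here.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the question's data.
df = pd.DataFrame({
    'ID':    ['Tom', 'Tom', 'Lara', 'Lara'],
    'name1': ['sarah', np.nan, 'bill', np.nan],
    'name3': ['mike', np.nan, 'bill', 'josh'],
    'name5': [np.nan, 'john', np.nan, np.nan],
})

# One list of non-missing values per ID and column.
lists = df.groupby('ID').agg(lambda x: list(x.dropna()))

parts = []
for col in lists.columns:
    # The longest list decides how many output columns this name needs.
    width = lists[col].map(len).max()
    names = [col if i == 0 else '{}_{}'.format(col, i) for i in range(width)]
    # Shorter lists are padded with missing values by the constructor.
    parts.append(pd.DataFrame(lists[col].tolist(),
                              index=lists.index,
                              columns=names))

wide = pd.concat(parts, axis=1)
print(wide)
```

Compared with the stack/unstack version, this keeps the columns in their original order without a separate sort step.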
It is a bit more complicated than I thought. I am using replace, stack and unstack:
s=df.replace('',np.nan).set_index('ID').stack().reset_index(level=1)
# groupby here creates the suffix; as you mention in the question, you then need all distinct values
s['suffix']=s['level_1']+'_'+s.groupby([s.index,s.level_1]).cumcount().astype(str)
s.set_index('suffix',append=True)[0].unstack()
suffix name1_0 name2_0 name3_0 name3_1 name3_2 name4_0 name4_1 name5_0
ID
Gen paul sandra lara None None lara None lara
Lara bill bill bill josh kevin josh kevin kevin
Tom sarah None mike None None None None john
suffix name6_0 name6_1 name7_0 name8_0
ID
Gen lara None mike mike
Lara mike devon devon devon
Tom john None None None
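If the `_0` suffix on the first column of each name is unwanted, one possible follow-up is to strip it after unstacking. The sketch below is self-contained and uses a small hypothetical frame in which empty cells are empty strings. The axis names `column` and `value` are my own choices, and the explicit `dropna()` is an added safeguard against version differences in `stack`.

```python
import numpy as np
import pandas as pd

# Hypothetical data where empty cells are '' rather than NaN.
df = pd.DataFrame({
    'ID':    ['Tom', 'Tom', 'Lara', 'Lara'],
    'name1': ['sarah', '', 'bill', ''],
    'name3': ['mike', '', 'bill', 'josh'],
})

# Long format with one row per non-empty cell, indexed by ID.
s = (df.replace('', np.nan)
       .set_index('ID')
       .stack()
       .dropna()
       .rename_axis(['ID', 'column'])
       .rename('value')
       .reset_index(level='column'))

# Suffix = running count of each (ID, column) pair.
s['suffix'] = s['column'] + '_' + s.groupby(['ID', 'column']).cumcount().astype(str)

wide = s.set_index('suffix', append=True)['value'].unstack()

# Drop the trailing '_0' so first occurrences keep the plain column name.
wide.columns = wide.columns.str.replace(r'_0$', '', regex=True)
print(wide)
```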