Comparing dataframes without loops

I have 2 dataframes like this:

Df1:

gene ids	Go terms
ID1	GO1
ID1	GO2
ID2	GO1
ID2	GO3
ID3	GO1
ID4	GO1

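You can use groupby.agg to join the rows that share an ID into a single string, and str.split + explode to expand the comma-separated IDs in df2 into multiple rows. Finally, merge the two parts to build the output: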

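out = (
    # one row per gene id, GO terms joined into a single string
    df1.groupby('gene ids', as_index=False).agg(','.join)
       .merge(
           # split the comma-separated ids, then one row per id
           df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*')})
              .explode('gene ids')
              .groupby('gene ids', as_index=False).agg(', '.join),
           how='left',
       )
)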

Output:

  gene ids Go terms  MP terms      MP names
0      ID1  GO1,GO2  MP1, MP2  Name1, Name2
1      ID2  GO1,GO3  MP1, MP3  Name1, Name3
2      ID3      GO1       MP2         Name2
3      ID4      GO1       MP1         Name1
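
For readers who want to run the approach end to end, here is a minimal sketch with the intermediate steps split out. df1 is copied from the question. df2 does not appear in the text above, so the frame below is an assumption reconstructed from the output (one row per MP term, with comma-separated gene ids); the names exploded, mp and go are illustrative only.

```python
import pandas as pd

# df1 as given in the question: one row per (gene id, GO term) pair.
df1 = pd.DataFrame({
    'gene ids': ['ID1', 'ID1', 'ID2', 'ID2', 'ID3', 'ID4'],
    'Go terms': ['GO1', 'GO2', 'GO1', 'GO3', 'GO1', 'GO1'],
})

# ASSUMPTION: df2 is not shown in the question; this frame is
# reconstructed from the expected output and may differ from the real data.
df2 = pd.DataFrame({
    'gene ids': ['ID1, ID2, ID4', 'ID1, ID3', 'ID2'],
    'MP terms': ['MP1', 'MP2', 'MP3'],
    'MP names': ['Name1', 'Name2', 'Name3'],
})

# Step 1: turn each comma-separated string into a list, then give
# every list element its own row.
exploded = (
    df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
       .explode('gene ids')
)

# Step 2: collapse both frames to one row per gene id.
mp = exploded.groupby('gene ids', as_index=False).agg(', '.join)
go = df1.groupby('gene ids', as_index=False).agg(','.join)

# Step 3: a left merge keeps every gene id present in df1.
out = go.merge(mp, on='gene ids', how='left')
print(out)
```

Because the merge uses how='left', a gene id that appears only in df2 would be dropped, and one that appears only in df1 would get NaN in the MP columns.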

If you are not interested in the "MP names" column, select only 'MP terms' in the second groupby.agg:

out = (
    df1.groupby('gene ids', as_index=False).agg(','.join)
       .merge(
           df2.assign(**{'gene ids': lambda d: d['gene ids'].str.split(r',\s*')})
              .explode('gene ids')
              # keep only the 'MP terms' column when aggregating
              .groupby('gene ids', as_index=False)['MP terms'].agg(', '.join),
           how='left',
       )
)

Output:

  gene ids Go terms  MP terms
0      ID1  GO1,GO2  MP1, MP2
1      ID2  GO1,GO3  MP1, MP3
2      ID3      GO1       MP2
3      ID4      GO1       MP1

Use concat together with GroupBy.agg and join for the aggregation, after splitting the values in df2 on "," and expanding them with DataFrame.explode:

df = pd.concat([df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
                   .explode('gene ids')
                   .groupby('gene ids')['MP terms'].agg(', '.join),
                df1.groupby('gene ids')['Go terms'].agg(', '.join)], axis=1).reset_index()
print(df)
  gene ids  MP terms  Go terms
0      ID1  MP1, MP2  GO1, GO2
1      ID2  MP1, MP3  GO1, GO3
2      ID3       MP2       GO1
3      ID4       MP1       GO1

If you need to aggregate all columns with join, use:

df = pd.concat([df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
                   .explode('gene ids')
                   .groupby('gene ids').agg(', '.join),
                df1.groupby('gene ids').agg(', '.join)], axis=1).reset_index()
print(df)
  gene ids  MP terms      MP names  Go terms
0      ID1  MP1, MP2  Name1, Name2  GO1, GO2
1      ID2  MP1, MP3  Name1, Name3  GO1, GO3
2      ID3       MP2         Name2       GO1
3      ID4       MP1         Name1       GO1
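
The concat variant can be tried as a standalone script along the following lines. As before, df2 is an assumption reconstructed from the printed output, not data taken from the question, and the names mp and go are illustrative.

```python
import pandas as pd

# df1 as given in the question.
df1 = pd.DataFrame({
    'gene ids': ['ID1', 'ID1', 'ID2', 'ID2', 'ID3', 'ID4'],
    'Go terms': ['GO1', 'GO2', 'GO1', 'GO3', 'GO1', 'GO1'],
})

# ASSUMPTION: df2 is not shown in the question; reconstructed from the output.
df2 = pd.DataFrame({
    'gene ids': ['ID1, ID2, ID4', 'ID1, ID3', 'ID2'],
    'MP terms': ['MP1', 'MP2', 'MP3'],
    'MP names': ['Name1', 'Name2', 'Name3'],
})

# Aggregate df2 per gene id; 'gene ids' becomes the index.
mp = (
    df2.assign(**{'gene ids': df2['gene ids'].str.split(r',\s*')})
       .explode('gene ids')
       .groupby('gene ids')
       .agg(', '.join)
)

# Aggregate df1 per gene id, also indexed by 'gene ids'.
go = df1.groupby('gene ids').agg(', '.join)

# concat with axis=1 aligns the two frames on their shared index.
df = pd.concat([mp, go], axis=1).reset_index()
print(df)
```

One design difference from the merge-based answer: pd.concat with axis=1 performs an outer join on the index by default, so a gene id present in only one of the two frames is kept and padded with NaN, whereas merge(..., how='left') keeps only the ids found in df1.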
