如何基于共享列对多个数据帧进行子集划分



我想阅读三个组学数据集——甲基化、cnv和mRNA表达。

首先,我删除了重复的列,并分别对所有3个数据帧mrnamethcna按索引排序。然后将这些数据帧级联以形成一个数据帧CCD_ 4。

最终,我希望common数据帧只包含"0";Hugo_Symbol"列,它们在所有三个数据帧mrnamethcna中是相同的。

import re
dfs = [mrna, meth, cna]
common = pd.concat(dfs, join='inner')

这里,common数据帧具有48625 rows × 348 columns维度。

现在,我只想在mrnacnameth这三个原始数据帧的行值相同的情况下保留common中的行。

keep_col = common.drop_duplicates(keep="first")["Hugo_Symbol"]
common = common[common.set_index(["Hugo_Symbol"]).index.isin(keep_col)]

我希望新的子集common数据帧的行数等于keep_col的长度

len(keep_col)
48618

然而,子化的common数据帧具有48625 rows × 348 columns维度,这意味着没有删除/子化任何行。

示例数据帧:

mrna

TCGA-1TCGA-2>TCGA-3789678>td>456//tr>742
Hugo_Symbol
0ABC123456
1DEF187
2MNO147147
3VWX184195268

理想的输出究竟是什么样子的,这就是您想要做的吗?我觉得列和行之间有些混淆。。。

dfs = [mrna, meth]
def do_stuff(df):
df = pd.DataFrame(df).T
df.columns = df.iloc[0]
df = df[1:]
return df
mrna, meth = map(do_stuff, dfs)
print(pd.concat([mrna, meth], axis=1).dropna())

输出:

Hugo_Symbol        AACS   FSTL1   ELMO2 CREB3L1   RPS11   PNMA1    MMP2  
TCGA-02-0001-01 -0.5909 -0.9099 -2.3351  0.2216  0.6798   -2.48  0.7912   
TCGA-02-0003-01 -0.5154  0.0896   -1.17  0.1255  0.2374 -3.2629  1.2846   
TCGA-02-0007-01  0.6932 -5.6511 -2.8365  2.0026 -0.6326 -1.3741  -3.437   
Hugo_Symbol      SAMD4A SMARCD3   A4GNT     MEOX2     COX8C     IMPA2  
TCGA-02-0001-01 -1.4578 -3.8009  3.4868  0.041451  0.981055   0.01858   
TCGA-02-0003-01  -1.474 -2.9891 -0.1511   0.10389  0.989163  0.127843   
TCGA-02-0007-01  -1.047  -4.185  2.1816  0.025519  0.985666  0.018484   
Hugo_Symbol          TTC8   TMEM186    RETSAT     TULP1     TAF15   CCDC88B  
TCGA-02-0001-01  0.037853  0.858835  0.028956  0.895705  0.024163   0.64503   
TCGA-02-0003-01  0.061488  0.718749  0.037052  0.875983   0.02025  0.338662   
TCGA-02-0007-01  0.084045  0.862945  0.029969  0.930821   0.02238  0.319357   
Hugo_Symbol       EPB41L3  
TCGA-02-0001-01  0.025546  
TCGA-02-0003-01  0.879701  
TCGA-02-0007-01  0.019343  

给定整行必须相等,如我下面的第一个答案所示,返回:

import pandas as pd
mrna = {
'Hugo_Symbol': ['ABC','DEF','MNO','VWX'],
'TCGA-1': [123, 678, 742, 184,],
'TCGA-2': [456, 187, 147, 195,],
'TCGA-3': [789, 456, 147, 268,],
}
cna = {
'Hugo_Symbol': ['DEF','ABC','GHI','JKL',],
'TCGA-1': [456, 123, 694, 384,],
'TCGA-2': [123, 456, 284, 843,],
'TCGA-3': [321, 789, 270, 147,],
}

meth = {
'Hugo_Symbol': ['DEF','PQR','STU','ABC',],
'TCGA-1': [456, 285, 225, 123,],
'TCGA-2': [123, 226, 632, 456,],
'TCGA-3': [432, 732, 532, 789,],
}

dfs = [pd.DataFrame.from_dict(mrna), pd.DataFrame.from_dict(cna), pd.DataFrame.from_dict(meth)]
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()])
for df in dfs]), columns=dfs[0].columns)
print(f"common: n{common}")
"""
common:
Hugo_Symbol  TCGA-1  TCGA-2  TCGA-3
0         ABC     123     456     789
"""

假设只有"Hugo_Symbol"列是相关的,下面返回:进口熊猫作为pd

mrna = {
'Hugo_Symbol': ['ABC','DEF','MNO','VWX'],
'TCGA-1': [123, 678, 742, 184,],
'TCGA-2': [456, 187, 147, 195,],
'TCGA-3': [789, 456, 147, 268,],
}
cna = {
'Hugo_Symbol': ['DEF','ABC','GHI','JKL',],
'TCGA-1': [456, 123, 694, 384,],
'TCGA-2': [123, 456, 284, 843,],
'TCGA-3': [321, 789, 270, 147,],
}

meth = {
'Hugo_Symbol': ['DEF','PQR','STU','ABC',],
'TCGA-1': [456, 285, 225, 123,],
'TCGA-2': [123, 226, 632, 456,],
'TCGA-3': [432, 732, 532, 789,],
}

dfs = [pd.DataFrame.from_dict(mrna), pd.DataFrame.from_dict(cna), pd.DataFrame.from_dict(meth)]
hugoFilter = set.intersection(*[set(df.loc[:, 'Hugo_Symbol'].values.tolist()) for df in dfs])
common = pd.concat(dfs)
common = common[common.Hugo_Symbol.isin(hugoFilter)].drop_duplicates(keep="first", subset='Hugo_Symbol')
print(f"common: n{common}")
"""
common:
Hugo_Symbol  TCGA-1  TCGA-2  TCGA-3
0         ABC     123     456     789
1         DEF     678     187     456
NOTE: make sure the DEF from the corrrect DF is returned, values differ.
"""

最新更新