我想阅读三个组学数据集——甲基化、cnv和mRNA表达。
首先,我删除了重复的列,并分别对所有3个数据帧mrna
、meth
和cna
按索引排序。然后将这些数据帧级联以形成一个数据帧CCD_ 4。
最终,我希望common
数据帧只包含"0";Hugo_Symbol"列,它们在所有三个数据帧mrna
、meth
和cna
中是相同的。
import re
dfs = [mrna, meth, cna]
common = pd.concat(dfs, join='inner')
这里,common
数据帧具有48625 rows × 348 columns
维度。
现在,我只想在mrna
、cna
和meth
这三个原始数据帧的行值相同的情况下保留common
中的行。
keep_col = common.drop_duplicates(keep="first")["Hugo_Symbol"]
common = common[common.set_index(["Hugo_Symbol"]).index.isin(keep_col)]
我希望新的子集common
数据帧的行数等于keep_col
的长度
len(keep_col)
48618
然而,子化的common
数据帧具有48625 rows × 348 columns
维度,这意味着没有删除/子化任何行。
示例数据帧:
mrna
Hugo_Symbol | ||||
---|---|---|---|---|
0 | ABC | 123 | 456 | 789|
1 | DEF | 678187 | >td>456//tr>||
2 | MNO | 742147 | 147 | |
3 | VWX | 184 | 195 | 268 |
理想的输出究竟是什么样子的,这就是您想要做的吗?我觉得列和行之间有些混淆。。。
dfs = [mrna, meth]
def do_stuff(df):
df = pd.DataFrame(df).T
df.columns = df.iloc[0]
df = df[1:]
return df
mrna, meth = map(do_stuff, dfs)
print(pd.concat([mrna, meth], axis=1).dropna())
输出:
Hugo_Symbol AACS FSTL1 ELMO2 CREB3L1 RPS11 PNMA1 MMP2
TCGA-02-0001-01 -0.5909 -0.9099 -2.3351 0.2216 0.6798 -2.48 0.7912
TCGA-02-0003-01 -0.5154 0.0896 -1.17 0.1255 0.2374 -3.2629 1.2846
TCGA-02-0007-01 0.6932 -5.6511 -2.8365 2.0026 -0.6326 -1.3741 -3.437
Hugo_Symbol SAMD4A SMARCD3 A4GNT MEOX2 COX8C IMPA2
TCGA-02-0001-01 -1.4578 -3.8009 3.4868 0.041451 0.981055 0.01858
TCGA-02-0003-01 -1.474 -2.9891 -0.1511 0.10389 0.989163 0.127843
TCGA-02-0007-01 -1.047 -4.185 2.1816 0.025519 0.985666 0.018484
Hugo_Symbol TTC8 TMEM186 RETSAT TULP1 TAF15 CCDC88B
TCGA-02-0001-01 0.037853 0.858835 0.028956 0.895705 0.024163 0.64503
TCGA-02-0003-01 0.061488 0.718749 0.037052 0.875983 0.02025 0.338662
TCGA-02-0007-01 0.084045 0.862945 0.029969 0.930821 0.02238 0.319357
Hugo_Symbol EPB41L3
TCGA-02-0001-01 0.025546
TCGA-02-0003-01 0.879701
TCGA-02-0007-01 0.019343
给定整行必须相等,如我下面的第一个答案所示,返回:
import pandas as pd
mrna = {
'Hugo_Symbol': ['ABC','DEF','MNO','VWX'],
'TCGA-1': [123, 678, 742, 184,],
'TCGA-2': [456, 187, 147, 195,],
'TCGA-3': [789, 456, 147, 268,],
}
cna = {
'Hugo_Symbol': ['DEF','ABC','GHI','JKL',],
'TCGA-1': [456, 123, 694, 384,],
'TCGA-2': [123, 456, 284, 843,],
'TCGA-3': [321, 789, 270, 147,],
}
meth = {
'Hugo_Symbol': ['DEF','PQR','STU','ABC',],
'TCGA-1': [456, 285, 225, 123,],
'TCGA-2': [123, 226, 632, 456,],
'TCGA-3': [432, 732, 532, 789,],
}
dfs = [pd.DataFrame.from_dict(mrna), pd.DataFrame.from_dict(cna), pd.DataFrame.from_dict(meth)]
common = pd.DataFrame(set.intersection(*[set([tuple(l) for l in df.values.tolist()])
for df in dfs]), columns=dfs[0].columns)
print(f"common: n{common}")
"""
common:
Hugo_Symbol TCGA-1 TCGA-2 TCGA-3
0 ABC 123 456 789
"""
假设只有"Hugo_Symbol"列是相关的,下面返回:进口熊猫作为pd
mrna = {
'Hugo_Symbol': ['ABC','DEF','MNO','VWX'],
'TCGA-1': [123, 678, 742, 184,],
'TCGA-2': [456, 187, 147, 195,],
'TCGA-3': [789, 456, 147, 268,],
}
cna = {
'Hugo_Symbol': ['DEF','ABC','GHI','JKL',],
'TCGA-1': [456, 123, 694, 384,],
'TCGA-2': [123, 456, 284, 843,],
'TCGA-3': [321, 789, 270, 147,],
}
meth = {
'Hugo_Symbol': ['DEF','PQR','STU','ABC',],
'TCGA-1': [456, 285, 225, 123,],
'TCGA-2': [123, 226, 632, 456,],
'TCGA-3': [432, 732, 532, 789,],
}
dfs = [pd.DataFrame.from_dict(mrna), pd.DataFrame.from_dict(cna), pd.DataFrame.from_dict(meth)]
hugoFilter = set.intersection(*[set(df.loc[:, 'Hugo_Symbol'].values.tolist()) for df in dfs])
common = pd.concat(dfs)
common = common[common.Hugo_Symbol.isin(hugoFilter)].drop_duplicates(keep="first", subset='Hugo_Symbol')
print(f"common: n{common}")
"""
common:
Hugo_Symbol TCGA-1 TCGA-2 TCGA-3
0 ABC 123 456 789
1 DEF 678 187 456
NOTE: make sure the DEF from the corrrect DF is returned, values differ.
"""