合并具有相同值的两列的数据



我有一个CSV文件,其中列n1 (ensembl_gene_id)在df1和和列n1 (gene1)在df2有相同的值在几行。我想合并具有相同值的df1 (column1)和df2(column2)的数据。我尝试了下面给出的脚本(我意识到代码有太多错误):

import os
import pandas as pd
df=pd.DataFrame()
for item in os.listdir('./'):
if item.endswith('.csv'):
df1 = pd.read_csv('gene1-protein.csv')
df2 = pd.read_csv('gene1.csv')
df = pd.Dataframe.merge(df1, df2, on=['ensembl_gene_id', 'gene1']
print(df)

我的文件看起来像-

df1
ensembl_gene_id hgnc_symbol gene_biotype
ENSG00000000419 DPM1        protein_coding
ENSG00000000938 FGR         protein_coding
ENSG00000000938 FGR         protein_coding
ENSG00000001084 GCLC        protein_coding
df2
gene1
ENSG00000000419.11
ENSG00000000938.11
ENSG00000001084.9
ENSG00000001084.9

我如何合并我的数据在一个单独的csv文件为df1 (column1)和df2(column2)具有相同的值?

我期望的结果应该看起来像-

gene1              ensembl_gene_id  hgnc_symbol gene_biotype
ENSG00000000419.11 ENSG00000000419  DPM1        protein_coding
ENSG00000000938.11 ENSG00000000938  FGR         protein_coding
ENSG00000000938.11 ENSG00000000938  FGR         protein_coding
ENSG00000001084.9  ENSG00000001084  GCLC        protein_coding

好吧,如果你真的需要保存你的副本,你可以这样做。(假设df2中的点后面的值对于给定的ensembl_gene_id总是相同的)

# ----- just to reproduce the dataframes
import pandas as pd
df1_s = "ensembl_gene_id hgnc_symbol gene_biotypen
ENSG00000000419 DPM1        protein_codingn
ENSG00000000938 FGR         protein_codingn
ENSG00000000938 FGR         protein_codingn
ENSG00000001084 GCLC        protein_coding"
df2_s = "gene1n
ENSG00000000419.11n
ENSG00000000938.11n
ENSG00000001084.9n
ENSG00000001084.9"
df1 = pd.DataFrame([i.split() for i in df1_s.split("n")])
df1.columns = df1.iloc[0]
df1.drop(0, inplace=True)
df1.set_index("ensembl_gene_id", inplace=True)
df2 = pd.DataFrame([i.split() for i in df2_s.split("n")])
df2.columns = df2.iloc[0]
df2.drop(0, inplace=True)
# -------------------------------------------------

# make the index of df2 similar to df1
df2.set_index(df2.gene1.str.split(".").str[0], inplace=True)
# append the values 
# (you need to drop duplicates since "same-index-values" are propagated automatically)
df1["gene1"] = df2["gene1"].drop_duplicates()
print(df1)
>>>                 hgnc_symbol    gene_biotype               gene1
>>> ensembl_gene_id                                                
>>> ENSG00000000419        DPM1  protein_coding  ENSG00000000419.11
>>> ENSG00000000938         FGR  protein_coding  ENSG00000000938.11
>>> ENSG00000000938         FGR  protein_coding  ENSG00000000938.11
>>> ENSG00000001084        GCLC  protein_coding   ENSG00000001084.9

相关内容

  • 没有找到相关文章

最新更新