我如何使用python比较两个不同的csv文件?



我想让代码比较两个csv文件!

import pandas as pd
import numpy as np
df = pd.read_csv("E:Dupfile.csv")
df1 = pd.read_csv("E:file.csv")

df['Correct'] = None

def Result(x):
if ....:
return int(1)
else:
return int(0)


df.loc[:,"Correct"]=df.apply(Result,axis=1)

print(df["Correct"])

df.to_csv("E:file.csv")
print(df.head(20))

例如,file.csv格式如下:

round    date  first  second  third  fourth  fifth  sixth  
0     1  2021.04      1      14     15      24     40     41     
1     2  2021.04      2       9     10      16     35     37      
2     3  2021.04      4      15     24      35     36     40      
3     4  2021.03     10      11     20      21     25     41     
4     5  2021.03      4       9     23      26     29     33     
5     6  2021.03      1       9     26      28     30     41     

Dupfile.csv如下所示:

round    date  first  second  third  fourth  fifth  sixth  
0     1  2021.04      1      14     15      24     40     41  
0     1  2021.04      1       2      3       4      5      6    
1     2  2021.04      2       9     10      16     35     37   
1     2  2021.04      1       2      3       4      5      6      
2     3  2021.04      4      15     24      35     36     40    
2     3  2021.04      1       2      3       4      5      6     
3     4  2021.03     10      11     20      21     25     41  
3     4  2021.03      1       2      3       4      5      6     
4     5  2021.03      4       9     23      26     29     33  
4     5  2021.03      1       2      3       4      5      6   

它有一个相同的轮,但值不同。

检查文件的四舍五入值与Dupfile的四舍五入值,如果第一个到第六个值相等,使另一个"正确";在Dupfile中放入1。如果不正确,在"正确"后面加上0。列。

我试着比较两个不同的csv文件,但是,我不知道怎么做。有人能帮帮我吗?

我的期望答案:

round    date  first  second  third  fourth  fifth  sixth Correct
0     1  2021.04      1      14     15      24     40     41    1
0     1  2021.04      1       2      3       4      5      6    0
1     2  2021.04      2       9     10      16     35     37    1
1     2  2021.04      1       2      3       4      5      6    0  
2     3  2021.04      4      15     24      35     36     40    1
2     3  2021.04      1       2      3       4      5      6    0 
3     4  2021.03     10      11     20      21     25     41    1
3     4  2021.03      1       2      3       4      5      6    0 
4     5  2021.03      4       9     23      26     29     33    1
4     5  2021.03      1       2      3       4      5      6    0

如果您使用pandas模块,最好获得模块中提供的方法。我建议您尝试使用merge来比较2个不同的数据帧。我把你的代码重写如下:

import pandas as pd
df = pd.read_csv("E:Dupfile.csv")
df1 = pd.read_csv("E:file.csv")
df1['Correct'] = 1
df = df.merge(
df1,
how='left',
on=['round',
'date',
'first',
'second',
'third',
'fourth',
'fifth',
'sixth']).fillna(0)
print(df)
print(df['Correct'])
df.to_csv("E:file.csv")
print(df.head(20))

它是如何工作的?

merge方法尝试将dfdf1中的列与on数组中存在的相同名称进行匹配。当您为how参数选择left时,合并(df)左侧的任何值都不会被删除(左连接)。另一种方式是,我们在file.csv中创建的correct列追加到Dupfil.csv数据,并且将不匹配赋值为nan值。fillna(0)方法帮助我们将nan的值替换为0。

pandas.DataFrame。合并API引用

您可以使用df.merge对纯熊猫执行此操作。

查看示例:

import pandas as pd

# file.csv
file_df = pd.DataFrame(
columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
data=[
("1", "2021.04", "1", "14", "15", "24", "40", "41"),
("2", "2021.04", "2", "9", "10", "16", "35", "37"),
("3", "2021.04", "4", "15", "24", "35", "36", "40"),
("4", "2021.03", "10", "11", "20", "21", "25", "41"),
("5", "2021.03", "4", "9", "23", "26", "29", "33"),
("6", "2021.03", "1", "9", "26", "28", "30", "41"),
],
)
# adding control column (we already know that those are the right values)
file_df["correct"] = 1
# Dupfile.csv
dup_file_df = pd.DataFrame(
columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
data=[
("1", "2021.04", "1", "14", "15", "24", "40", "41"),
("1", "2021.04", "1", "2", "3", "4", "5", "6"),
("2", "2021.04", "2", "9", "10", "16", "35", "37"),
("2", "2021.04", "1", "2", "3", "4", "5", "6"),
("3", "2021.04", "4", "15", "24", "35", "36", "40"),
("3", "2021.04", "1", "2", "3", "4", "5", "6"),
("4", "2021.03", "10", "11", "20", "21", "25", "41"),
("4", "2021.03", "1", "2", "3", "4", "5", "6"),
("5", "2021.03", "4", "9", "23", "26", "29", "33"),
("5", "2021.03", "1", "2", "3", "4", "5", "6"),
],
)
# We extract the column names to use in the merging process
cols = [x for x in dup_file_df.columns]
# We merge the 2 dataframes.
# The data frames are to match on every column (round, date and first to sixth). 
# The "correct" column will be populated only if all the columns are matching
merged = dup_file_df.merge(file_df, how="outer", left_on=cols, right_on=cols)
# We put "0" where correct is None and cast to integer (it was float)
merged["correct"] = merged["correct"].fillna(0).astype(int)
# Done!
print(merged)

相关内容

  • 没有找到相关文章

最新更新