我想让代码比较两个csv文件!
import pandas as pd
import numpy as np
df = pd.read_csv("E:Dupfile.csv")
df1 = pd.read_csv("E:file.csv")
df['Correct'] = None
def Result(x):
if ....:
return int(1)
else:
return int(0)
df.loc[:,"Correct"]=df.apply(Result,axis=1)
print(df["Correct"])
df.to_csv("E:file.csv")
print(df.head(20))
例如,file.csv格式如下:
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
1 2 2021.04 2 9 10 16 35 37
2 3 2021.04 4 15 24 35 36 40
3 4 2021.03 10 11 20 21 25 41
4 5 2021.03 4 9 23 26 29 33
5 6 2021.03 1 9 26 28 30 41
Dupfile.csv如下所示:
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
0 1 2021.04 1 2 3 4 5 6
1 2 2021.04 2 9 10 16 35 37
1 2 2021.04 1 2 3 4 5 6
2 3 2021.04 4 15 24 35 36 40
2 3 2021.04 1 2 3 4 5 6
3 4 2021.03 10 11 20 21 25 41
3 4 2021.03 1 2 3 4 5 6
4 5 2021.03 4 9 23 26 29 33
4 5 2021.03 1 2 3 4 5 6
它有一个相同的轮,但值不同。
检查文件的四舍五入值与Dupfile的四舍五入值,如果第一个到第六个值相等,使另一个"正确";在Dupfile中放入1。如果不正确,在"正确"后面加上0。列。
我试着比较两个不同的csv文件,但是,我不知道怎么做。有人能帮帮我吗?
我的期望答案:
round date first second third fourth fifth sixth Correct
0 1 2021.04 1 14 15 24 40 41 1
0 1 2021.04 1 2 3 4 5 6 0
1 2 2021.04 2 9 10 16 35 37 1
1 2 2021.04 1 2 3 4 5 6 0
2 3 2021.04 4 15 24 35 36 40 1
2 3 2021.04 1 2 3 4 5 6 0
3 4 2021.03 10 11 20 21 25 41 1
3 4 2021.03 1 2 3 4 5 6 0
4 5 2021.03 4 9 23 26 29 33 1
4 5 2021.03 1 2 3 4 5 6 0
如果您使用pandas
模块,最好获得模块中提供的方法。我建议您尝试使用merge
来比较2个不同的数据帧。我把你的代码重写如下:
import pandas as pd
df = pd.read_csv("E:Dupfile.csv")
df1 = pd.read_csv("E:file.csv")
df1['Correct'] = 1
df = df.merge(
df1,
how='left',
on=['round',
'date',
'first',
'second',
'third',
'fourth',
'fifth',
'sixth']).fillna(0)
print(df)
print(df['Correct'])
df.to_csv("E:file.csv")
print(df.head(20))
它是如何工作的?
merge
方法尝试将df
和df1
中的列与on
数组中存在的相同名称进行匹配。当您为how
参数选择left
时,合并(df
)左侧的任何值都不会被删除(左连接)。另一种方式是,我们在file.csv
中创建的correct
列追加到Dupfil.csv
数据,并且将不匹配赋值为nan
值。fillna(0)
方法帮助我们将nan
的值替换为0。
pandas.DataFrame。合并API引用
您可以使用df.merge
对纯熊猫执行此操作。
查看示例:
import pandas as pd
# file.csv
file_df = pd.DataFrame(
columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
data=[
("1", "2021.04", "1", "14", "15", "24", "40", "41"),
("2", "2021.04", "2", "9", "10", "16", "35", "37"),
("3", "2021.04", "4", "15", "24", "35", "36", "40"),
("4", "2021.03", "10", "11", "20", "21", "25", "41"),
("5", "2021.03", "4", "9", "23", "26", "29", "33"),
("6", "2021.03", "1", "9", "26", "28", "30", "41"),
],
)
# adding control column (we already know that those are the right values)
file_df["correct"] = 1
# Dupfile.csv
dup_file_df = pd.DataFrame(
columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
data=[
("1", "2021.04", "1", "14", "15", "24", "40", "41"),
("1", "2021.04", "1", "2", "3", "4", "5", "6"),
("2", "2021.04", "2", "9", "10", "16", "35", "37"),
("2", "2021.04", "1", "2", "3", "4", "5", "6"),
("3", "2021.04", "4", "15", "24", "35", "36", "40"),
("3", "2021.04", "1", "2", "3", "4", "5", "6"),
("4", "2021.03", "10", "11", "20", "21", "25", "41"),
("4", "2021.03", "1", "2", "3", "4", "5", "6"),
("5", "2021.03", "4", "9", "23", "26", "29", "33"),
("5", "2021.03", "1", "2", "3", "4", "5", "6"),
],
)
# We extract the column names to use in the merging process
cols = [x for x in dup_file_df.columns]
# We merge the 2 dataframes.
# The data frames are to match on every column (round, date and first to sixth).
# The "correct" column will be populated only if all the columns are matching
merged = dup_file_df.merge(file_df, how="outer", left_on=cols, right_on=cols)
# We put "0" where correct is None and cast to integer (it was float)
merged["correct"] = merged["correct"].fillna(0).astype(int)
# Done!
print(merged)