不使用diff选项。
我有两个不同的CSV文件(我知道列标题的名称)-old.CSV和new.CSV。它们的行数和列数不同。CSV文件可以包括数字、字符串、字符和特殊字符。例如:
旧.csv
round date first second third fourth fifth sixth
1 2 2021.04 2 45e69 10 16 4565 37
2 3 2021.04 4 15 456as df924 35 4N320
4 5 2021.03 4 43!d9 23 26 29 33
新.csv
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
1 2 2021.04 2 45e69 10 16 4565 37
2 3 2021.04 4 15 456as df924 35 4N320
3 4 2021.03 10 11 20 21 24325 41
5 6 2021.03 4321 9 2#@6 28 34350 41
两个CSV的第1行和第2行相同。
现在,我想打印出new.csv与old.csv之间的差异。我只想打印出new.csv文件中的新行。像这样:
result.csv
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
3 4 2021.03 10 11 20 21 24325 41
5 6 2021.03 4321 9 2#@6 28 34350 41
数数行是为了更容易阅读。我希望标题列将保留在result.csv文件中。
我认为在这里可能有帮助的选项是使用关键字。例如,按列比较两个CSV文件";第一个";以及";第五个"-如果其中一列不相同,请将其打印到result.csv文件中。
您需要分别检查要比较的每一列。以下代码是如何做到这一点的一个选项。
import pandas as pd
def getLinesOnlyInA(dfa, dfb, result):
# iterate over all lines in file a
for index1, row1 in dfa.iterrows():
aLineIsEqual = False
# iterate over all lines in file b
for index2, row2 in dfb.iterrows():
thisLineIsDifferent = False
# for each column, check if they are different
for column in columns:
if row1[column] != row2[column]:
thisLineIsDifferent = True
# ionly continue when the fields are the same
break
if not thisLineIsDifferent:
aLineIsEqual = True
# only continue when no equal line was found
break
# when no equal line was found, add that line to the result
if not aLineIsEqual:
result.append(row1)
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
columns = ['round', 'first', 'fifth'] # columns to be compared
results = []
getLinesOnlyInA(df1, df2, results) # find all lines only existing in file 1
getLinesOnlyInA(df2, df1, results) # find all lines only existing in file 2
dfResult = pd.DataFrame(results) # cast all lines into a dataframe
print(dfResult.to_string())
dfResult.to_csv('result.csv', sep=',')
结果:
round date first second third fourth fifth sixth
2 5 2021.03 4 43!d9 23 26 29 33
0 1 2021.04 1 14 15 24 40 41
3 4 2021.03 10 11 20 21 24325 41
4 6 2021.03 4321 9 2#@6 28 34350 41
如果一个文件包含较少的列(例如file1
不包含date
列),这仍然有效,并导致:
round first second third fourth fifth sixth date
2 5 4 43!d9 23 26 29 33 NaN
0 1 1 14 15 24 40 41 2021.04
3 4 10 11 20 21 24325 41 2021.03
4 6 4321 9 2#@6 28 34350 41 2021.03
您可以找到一个有用的库:csv diff
您可以从命令行使用它:
csv-diff one.csv two.csv --key=id --json
或作为python导入
from csv_diff import load_csv, compare
diff = compare(
load_csv(open("one.csv"), key="id"),
load_csv(open("two.csv"), key="id")
)
在读取两个文件后使用集合而不是列表:
fileone = set(fileone)
filetwo = set(filetwo)
那么在结果中,您可以获得对称差-除了fileone和filetwo集合中的元素之外,这两个集合的所有元素。
result = fileone.symmetric_difference(filetwo)
注意:标题行不包括在结果中,因为它在两个集合中是相同的
更新:filetwo减去fileone:
result = filetwo.difference(fileone)