如何按关键字比较列和行数不同的两个不同CSV文件

不使用diff选项。

我有两个不同的CSV文件(我知道列标题的名称)-old.CSV和new.CSV。它们的行数和列数不同。CSV文件可以包括数字、字符串、字符和特殊字符。例如：

旧.csv

round    date  first  second  third  fourth  fifth  sixth
1     2  2021.04      2    45e69     10    16      4565   37
2     3  2021.04      4      15    456as  df924     35   4N320
4     5  2021.03      4    43!d9    23      26      29     33

新.csv

round    date  first  second  third  fourth  fifth  sixth
0     1  2021.04      1      14     15      24      40     41
1     2  2021.04      2    45e69    10      16     4565    37
2     3  2021.04      4      15    456as   df924    35    4N320
3     4  2021.03     10      11     20      21     24325   41
5     6  2021.03    4321       9   2#@6     28     34350   41

两个CSV的第1行和第2行相同。

现在，我想打印出new.csv与old.csv之间的差异。我只想打印出new.csv文件中的新行。像这样：

result.csv

round    date  first  second  third  fourth  fifth  sixth
0     1  2021.04      1     14     15      24     40     41
3     4  2021.03     10     11     20      21     24325  41
5     6  2021.03    4321    9     2#@6     28     34350  41

数数行是为了更容易阅读。我希望标题列将保留在result.csv文件中。

我认为在这里可能有帮助的选项是使用关键字。例如，按列比较两个CSV文件"；第一个"；以及"；第五个"-如果其中一列不相同，请将其打印到result.csv文件中。

您需要分别检查要比较的每一列。以下代码是如何做到这一点的一个选项。

import pandas as pd
def getLinesOnlyInA(dfa, dfb, result):
# iterate over all lines in file a
for index1, row1 in dfa.iterrows():
aLineIsEqual = False
# iterate over all lines in file b
for index2, row2 in dfb.iterrows():
thisLineIsDifferent = False
# for each column, check if they are different
for column in columns:
if row1[column] != row2[column]:
thisLineIsDifferent = True
# ionly continue when the fields are the same
break
if not thisLineIsDifferent:
aLineIsEqual = True
# only continue when no equal line was found
break
# when no equal line was found, add that line to the result
if not aLineIsEqual:
result.append(row1)


df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
columns = ['round', 'first', 'fifth']     # columns to be compared
results = []
getLinesOnlyInA(df1, df2, results)        # find all lines only existing in file 1
getLinesOnlyInA(df2, df1, results)        # find all lines only existing in file 2
dfResult = pd.DataFrame(results)          # cast all lines into a dataframe
print(dfResult.to_string())   
dfResult.to_csv('result.csv', sep=',')

结果：

round     date  first second third fourth  fifth sixth
2      5  2021.03      4  43!d9    23     26     29    33
0      1  2021.04      1     14    15     24     40    41
3      4  2021.03     10     11    20     21  24325    41
4      6  2021.03   4321      9  2#@6     28  34350    41

如果一个文件包含较少的列(例如file1不包含date列)，这仍然有效，并导致：

round  first second third fourth  fifth sixth     date
2      5      4  43!d9    23     26     29    33      NaN
0      1      1     14    15     24     40    41  2021.04
3      4     10     11    20     21  24325    41  2021.03
4      6   4321      9  2#@6     28  34350    41  2021.03

您可以找到一个有用的库：csv diff

您可以从命令行使用它：

csv-diff one.csv two.csv --key=id --json

或作为python导入

from csv_diff import load_csv, compare
diff = compare(
load_csv(open("one.csv"), key="id"),
load_csv(open("two.csv"), key="id")
)

在读取两个文件后使用集合而不是列表：

fileone = set(fileone)
filetwo = set(filetwo)

那么在结果中，您可以获得对称差-除了fileone和filetwo集合中的元素之外，这两个集合的所有元素。

result = fileone.symmetric_difference(filetwo)

注意：标题行不包括在结果中，因为它在两个集合中是相同的

更新：filetwo减去fileone:

result = filetwo.difference(fileone)

相关内容

最新更新

热门标签：