比较两个csv文件的多列,并在新csv文件中将输出保存为匹配/不匹配



假设我在file1.csv中有列作为

Customer id    Name 
Q1             Alen
W2             Ricky
E3             Katrina
R4             Anya
T5             Leonardo

和file2.csv中的列作为

Customer id    Name
Q1             Alen
W2             Harry
E3             Katrina
R4             Anya
T5             Leonard

正如您所看到的,对于客户id:W2,对应的名称不匹配。所以output.csv应该如下所示:

Customer id  Status
Q1           Matching
W2           Not matching
E3           Matching
R4           Matching
T5           Matching

如何使用python获得上述输出。

附言:比较多列的代码是什么,而不仅仅是列名?

我的代码

import csv
with open('file1.csv', 'rt', encoding='utf-8') as csvfile1:
csvfile1_indices = dict((r[1], i) for i, r in enumerate(csv.reader(csvfile1)))
with open('file2.csv', 'rt', encoding='utf-8') as csvfile2:
with open('output.csv', 'w') as results:    
reader = csv.reader(csvfile2)
writer = csv.writer(results)
writer.writerow(next(reader, []) + ['status'])
for row in reader:
index = csvfile1_indices.get(row[1])
if index is not None:
message = 'matching'
writer.writerow(row + [message])
else:
message = 'not matching'
writer.writerow(row + [message])
results.close()

这很好,但我可以用其他更简单的方式来获得相同的输出吗?为了比较多个列,我需要做哪些更改?

如果你不介意使用Pandas,你可以用5行代码来完成:

import pandas as pd 
# assuming id columns are identical and contain the same values
df1 = pd.read_csv('file1.csv', index_col='Customer_id')
df2 = pd.read_csv('file2.csv', index_col='Customer_id')
df3 = pd.DataFrame(columns=['status'], index=df1.index)
df3['status'] = (df1['Name'] == df2['Name']).replace([True, False], ['Matching', 'Not Matching'])
df3.to_csv('output.csv')

编辑:删除sep = 't'以使用默认逗号分隔符。

将两个csv文件读取到两个不同的字典中,并在其中任何一个字典上迭代,并在另一个字典中检查相同的键。如果您想订购,请使用OrderedDict

您可以在多个列上merge

f1
Customer_id      Name
0          Q1      Alen
1          W2     Ricky
2          E3   Katrina
3          R4      Anya
4          T5  Leonardo
f2
Customer_id      Name
0          Q1      Alen
1          W2     Harry
2          E3   Katrina
3          R4      Anya
4          T5  Leonardo
m = f1.merge(f2, on=['Customer_id', 'Name'], indicator='Status', how='outer')
Customer_id      Name      Status
0          Q1      Alen        both
1          W2     Ricky   left_only
2          E3   Katrina        both
3          R4      Anya        both
4          T5  Leonardo        both
5          W2     Harry  right_only
m['Status'] = m['Status'].map({'both': 'Matching', 
'left_only': 'Not matching', 
'right_only': 'Not matching'})
m.drop_duplicates(subset=['Customer_id', 'Status'])
m.drop(['Name'], axis=1)
Customer_id        Status
0          Q1      Matching
1          W2  Not matching
2          E3      Matching
3          R4      Matching
4          T5      Matching

最新更新