Comparing two columns in CSV files



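I have two CSV files. In the first file, the third column holds some number of rows of data; in the second file, the first column holds similar data, also of unknown length. The values are MD5 hashes, for example:

file_1

column_1
blah blah blah blah blah blah etc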

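My answer works for all records in the files. It finds matches across all records in file 1 and file 2.

  1. Reverse each row with reader1 = [i[::-1] for i in reader1] so that the column to compare comes first (for a three-column file, the third column ends up at index 0)
  2. Read both readers, reader1 and reader2, into lists
  3. Build a dictionary keyed on the file 1 values and collect the row numbers of all matches from file 2
  4. Print the search results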
import csv
# Column indexes of interest (not used below; kept from the original snippet).
interesting_cols = [0, 2, 3, 4, 5]
with open("file1.csv", 'r', newline='') as file1, \
        open("file2.csv", 'r', newline='') as file2:
    reader1, reader2 = csv.reader(file1), csv.reader(file2)
    # Reverse each row of file 1 so its last column sits at index 0; skip empty rows.
    reader1 = [i[::-1] for i in reader1 if i]
    reader2 = [i for i in reader2 if i]
    # One entry per file 1 key; the list collects matching row indexes from file 2.
    dictionary_of_records = {item[0]: [] for item in reader1}
    for i, item in enumerate(reader2):
        key = item[0]
        if key in dictionary_of_records:
            dictionary_of_records[key].append(i)
    for key, value in dictionary_of_records.items():
        if len(value) >= 1:
            print(f"Match for {key}")
            for index in value:
                print(' '.join(reader2[index]))
        else:
            print(f"No match for {key}")
        print("-----------------------------")

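P.S. I feel this is rather hard-coded. You could also look at the pandas library or itertools for more interesting approaches.

If it is allowed, you can use pandas to do this. First install the package with pip: python -m pip install pandas

or with conda: conda install pandas

Then read the files and compare them with pandas:

Note: this only works when the two DataFrames have the same structure (for example, the same columns). If your DataFrames differ and you only want to compare one or a few columns between them, see below.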

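import pandas as pd
# Column indexes of interest (not used below; kept from the original snippet).
interesting_cols = [0, 2, 3, 4, 5]
file1 = pd.read_csv("/root/file1.csv")
file2 = pd.read_csv("/root/file2.csv")
# compare() needs identically labeled DataFrames (same shape, same row and column labels).
comp = file1.compare(file2)
# to_markdown() needs the optional "tabulate" package to be installed.
print(comp.to_markdown())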

Or, if you want to keep the "with" statement, you should create a class and define the __enter__ and __exit__ methods:

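import pandas as pd
# Column indexes of interest (not used below; kept from the original snippet).
interesting_cols = [0, 2, 3, 4, 5]
class DataCSV:
    def __init__(self, file) -> None:
        self.filename = file
    def __enter__(self):
        # read_csv loads the whole file into a DataFrame, so no handle stays open.
        self.file = pd.read_csv(self.filename)
        return self.file
    def __exit__(self, exc_type, exc_value, traceback):
        # Nothing to clean up; exceptions are not suppressed.
        pass
with DataCSV("/root/file1.csv") as file1, DataCSV("/root/file2.csv") as file2:
    comp = file1.compare(file2)
    print(comp.to_markdown())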

The output should look something like:

49269f413284abfa58f41687b6f631e0    9e5b91c360d6be29d556db7e1241ce82
('column_1', 'self')    ('column_1', 'other')
0    blah blah    aa7744226c695c0b2e440419848cf700
1    blah blah    a0879ff97178e03eb18470277fbc7056
2    blah blah    ad1172b28f277eab7ca91f96f13a242b    blah blah
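
For the case where the two files have different layouts and only one column from each matters, a minimal sketch could use Series.isin instead of compare. The sample data, the column names ("name", "size", "md5", "note") and the helper matching_rows are all hypothetical stand-ins for illustration, not part of the original files:

```python
import io
import pandas as pd

# Hypothetical sample data standing in for file1.csv and file2.csv.
FILE1 = io.StringIO("name,size,md5\na.txt,10,aa11\nb.txt,20,bb22\nc.txt,30,cc33\n")
FILE2 = io.StringIO("md5,note\nbb22,seen\ndd44,new\naa11,seen\n")

def matching_rows(left, right, left_col, right_col):
    """Return the rows of `left` whose `left_col` value occurs in `right[right_col]`."""
    return left[left[left_col].isin(right[right_col])]

# dtype=str keeps the hashes as text so they are compared character by character.
df1 = pd.read_csv(FILE1, dtype=str)
df2 = pd.read_csv(FILE2, dtype=str)

matches = matching_rows(df1, df2, "md5", "md5")
print(matches["md5"].tolist())
```

The rows come back in the order they appear in the first file, and the two frames do not need the same shape or the same column names.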

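The csv module documentation says that each row read from the CSV file is returned as a list of strings, so you can read the individual columns from those rows.

For example:

Using two simple CSV files
addresses.csv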

Doe,John,120 jefferson st.,Riverside, NJ, 08075
McGinnis,Jack,220 hobo Av.,Phila, PA,09119
Repici,"John ""Da Man""",120 Jefferson St.,Riverside, NJ,08075
Tyler,Stephen,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234

phones.csv

John,Doe,19122
Jack,McGinnis,20220
"John ""Da Man""",Repici,1202134
Stephen,Tyler,72384

>>> import csv
>>> with open('addresses.csv') as file1, open('phones.csv') as file2:
...     r1, r2 = csv.reader(file1), csv.reader(file2)
...     for line1, line2 in zip(r1, r2):
...             if line1[1] == line2[0]:
...                     print('found a duplicate', line1[1])
...
found a duplicate John
found a duplicate Jack
found a duplicate John "Da Man"
found a duplicate Stephen

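We get the rows that have the same value in the specified columns. In our case, these are the second column of the first CSV file and the first column of the second CSV file. To get the row numbers, you can use enumerate(zip()), just as in the sample code you provided.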
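
Because zip pairs rows by position, the loop above only reports a match when both values sit on the same row number. A minimal sketch for matching regardless of position, assuming the second file fits in memory, first collects one column into a set. The helper name common_values and the shortened sample data are hypothetical:

```python
import csv
import io

def common_values(rows1, col1, rows2, col2):
    """Yield (row_number, value) for rows1 entries whose col1 value appears in col2 of rows2."""
    # Read the second file's column into a set for fast membership tests.
    seen = {row[col2] for row in rows2 if len(row) > col2}
    for number, row in enumerate(rows1):
        if len(row) > col1 and row[col1] in seen:
            yield number, row[col1]

# Shortened stand-ins for addresses.csv and phones.csv, with rows in a different order.
addresses = io.StringIO("Doe,John,120 jefferson st.\nMcGinnis,Jack,220 hobo Av.\nTyler,Stephen,7452 Terrace road\n")
phones = io.StringIO("Stephen,Tyler,72384\nJohn,Doe,19122\n")

result = list(common_values(csv.reader(addresses), 1, csv.reader(phones), 0))
print(result)
```

Here enumerate supplies the row number within the first file only; the row order of the second file no longer matters.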

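You can read up on Python list comprehensions to understand the syntax used in the example.

You can use generators to reorder the columns and run the check quickly.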

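import csv
def parse_csv(filename, header=False, delim=',', quotechar='"'):
    """Yield the rows of a CSV file, optionally skipping a header row."""
    with open(filename, 'r', newline='') as f:
        csvfile = csv.reader(f, delimiter=delim, quotechar=quotechar)
        if header:
            next(csvfile, None)  # skip the header row
        for row in csvfile:
            yield row
def diff(l1, l2, reorder=None):
    """Yield (index, row) for each row of l1 that does not occur in l2."""
    if reorder:
        # Rearrange the columns of l2 in place so they line up with l1.
        for i, line in enumerate(l2):
            l2[i] = [line[x] for x in reorder]
    for i, line in enumerate(l1):
        if line not in l2:
            yield i, line
filename1 = ''  # path to the first CSV file
filename2 = ''  # path to the second CSV file
reorder = [2, 1, 0]
# diff() returns a generator, so collect its results into a list.
missing = list(diff(parse_csv(filename1, header=False), list(parse_csv(filename2, header=False)), reorder=reorder))
print(missing)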
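
The check "line not in l2" scans the whole list for every row. For larger files, one possible variation (a sketch, not the answer's own code) stores the reordered rows as tuples in a set, which makes each lookup fast. The name diff_rows and the in-memory sample rows are hypothetical:

```python
def diff_rows(rows1, rows2, reorder=None):
    """Yield (index, row) for rows of rows1 that do not occur in rows2."""
    # Lists are not hashable, so each row is stored as a tuple.
    if reorder:
        known = {tuple(row[i] for i in reorder) for row in rows2}
    else:
        known = {tuple(row) for row in rows2}
    for index, row in enumerate(rows1):
        if tuple(row) not in known:
            yield index, row

# Hypothetical rows; the second list has its columns in reverse order.
rows1 = [["a", "b", "c"], ["d", "e", "f"]]
rows2 = [["c", "b", "a"], ["x", "y", "z"]]

missing = list(diff_rows(rows1, rows2, reorder=[2, 1, 0]))
print(missing)
```

Unlike the in-place version, this leaves rows2 unmodified, and it accepts any iterable of rows, including a csv.reader.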
