比较两个csv文件并输出第三个csv



我正在尝试读取两个不同长度的csv文件。我的第一个文件是一个参考文件,它包含时间戳数据,如下所示,一直到23:59:58

代表问题的参考文件的一些样本数据如下所示

2021-04-04 00:00:00 ,-1.0
2021-04-04 00:00:01 ,-1.0
2021-04-04 00:00:02 ,-1.0
2021-04-04 00:00:03 ,-1.0
2021-04-04 00:00:04 ,-1.0
2021-04-04 00:00:05 ,-1.0
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
2021-04-04 00:00:08 ,-1.0
2021-04-04 00:00:09 ,-1.0
2021-04-04 00:00:10 ,-1.0
2021-04-04 00:00:11 ,-1.0
2021-04-04 00:00:12 ,-1.0
2021-04-04 00:00:13 ,-1.0

我的第二个文件是可以在这里找到的原始文件

下面显示了原始文件的一些示例数据来表示问题

HEADER_TIME_STAMP, UNIT
2021-04-04 00:00:00.005 ,0.4
2021-04-04 00:00:01.005 ,0.3
2021-04-04 00:00:02.005 ,0.2
2021-04-04 00:00:03.005 ,0.3
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:10.005 ,0.4
2021-04-04 00:00:11.005 ,0.2
2021-04-04 00:00:12.005 ,0.3
2021-04-04 00:00:13.005 ,0.1

与引用文件相比,它缺少时间戳。当原始文件缺少时间戳时,我需要将引用文件中的行添加到第三个csv文件中。如果时间戳没有丢失,则必须添加原始文件中的行。

应该只使用HH:MM:SS格式的时间戳来进行比较,以查看一行是否相同,忽略毫秒。

我已经尝试了以下代码,这些代码能够完成我希望它完成的大部分工作,但当从引用文件中添加行以使p代表缺失的行时,它无法停止对原始文件的迭代。因此导致程序也忽略了原始文件中存在的一些行。

# compare the difference between two given csv and produce a third csv that contains all possible time stamps
def csv_compare_new(c_o, c_r):
with open(c_o, "r") as original, open(c_r, "r") as reference:
original_reader = csv.reader(original, delimiter=',', quotechar='"')
reference_reader = csv.reader(reference, delimiter=',', quotechar='"')
with open('compare.csv', 'w') as out:
new_writer = csv.writer(out, delimiter=',', quotechar='"')
print(new_writer.dialect)
for line_or, line_ref in itertools.zip_longest(original_reader, reference_reader):
if line_or is None:
new_writer.writerow(line_ref)
else:
if line_ref[0][0:19] in line_or[0][0:19]:
new_writer.writerow(line_or)
else:
if line_ref[0][0:19] not in line_or[0][0:19]:
new_writer.writerow(line_ref)
out.close()

我希望有人能帮我解决这个bug。请注意,我希望它保持为python代码,最好是一个内存高效的解决方案。

样本数据的预期结果如下

2021-04-04 00:00:00.005 ,0.4
2021-04-04 00:00:01.005 ,0.3
2021-04-04 00:00:02.005 ,0.2
2021-04-04 00:00:03.005 ,0.3
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
2021-04-04 00:00:08 ,-1.0
2021-04-04 00:00:09 ,-1.0
2021-04-04 00:00:10.005 ,0.4
2021-04-04 00:00:11.005 ,0.2
2021-04-04 00:00:12.005 ,0.3
2021-04-04 00:00:13.005 ,0.1

但是,它并没有按预期添加原始文件中的行,而是添加参考文件中的线,如下所示。

2021-04-04 00:00:00.005 ,0.4
2021-04-04 00:00:01.005 ,0.3
2021-04-04 00:00:02.005,0.2
2021-04-04 00:00:03.005 ,0.3
2021-04-04 00:00:04.005 ,0.4
2021-04-04 00:00:05.005 ,0.5
2021-04-04 00:00:06 ,-1.0
2021-04-04 00:00:07 ,-1.0
2021-04-04 00:00:08 ,-1.0
2021-04-04 00:00:09 ,-1.0
2021-04-04 00:00:10 ,-1.0
2021-04-04 00:00:11 ,-1.0
2021-04-04 00:00:12 ,-1.0
2021-04-04 00:00:13 ,-1.0

您可以使用panda来实现此

import pandas as pd
df1 = pd.read_csv('reference.csv')
df2 = pd.read_csv('original.csv')

# Concatenate, making sure the df that needs to be updated is first in the list
df = pd.concat([df2,df1])
# This will drop the milliseconds
df['t'] = df['HEADER_TIME_STAMP'].astype('datetime64[s]')
# Droping duplicates on the new time column
# This will keep the first duplicated record and drop the second
df.drop_duplicates(subset='t').sort_values(by='HEADER_TIME_STAMP').drop(columns='t').to_csv('combined.csv', index=False)

最新更新