如何合并2个或更多的csv文件与时间重叠的数据?例如,
data1是
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0
data2是
Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
我要的是:
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
所以第一个csv文件的最后几行数据被删除,第二个csv文件被合并,这样时间序列就增加了。
怎么做呢?谢谢。
Python有一个很好的内置库函数heapq.merge()
来帮助完成这个任务。
假设您的数据是用空格分隔的,您可以这样使用:
from heapq import merge
import csv
filenames = ['data1.csv', 'data2.csv']
merge_list = []
for filename in filenames:
f_input = open(filename)
csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)
header = next(csv_input)
merge_list.append(csv_input)
with open('output.csv', 'w', newline='') as f_output:
csv_output = csv.writer(f_output, delimiter=' ')
csv_output.writerow(header)
csv_output.writerows(merge(*merge_list, key=lambda x: float(x[0])))
这将产生如下格式的CSV输出:
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84005458 0 0.0093211568 0
0.84043291 0 0.0095214359 0
0.94034343 0 0.0094739951 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
这将适用于任何数量的输入CSV文件。
如果两个文件已经分别按时间排序。使用for循环就足够了:
# csv cell should be separated by comma, change if required
dilimeter = ','
# open files and read lines
f1 = open('data1.csv', 'r')
f1_lines = f1.readlines()
f1.close()
f2 = open('data2.csv', 'r')
f2_lines = f2.readlines()
f2.close()
# extract header
output_lines = [f1_lines[0]]
# start scanning frome line 2 of both files (line 1 is header)
f1_index = 1
f2_index = 1
while True:
# all data1 are processed, append remaining lines from data2
if f1_index >= len(f1_lines):
output_lines += f2_lines[f2_index:]
break
# all data2 are processed, append remaining lines from data1
if f2_index >= len(f2_lines):
output_lines += f1_lines[f1_index:]
break
f1_line_time = float(f1_lines[f1_index].split(dilimeter)[0]) # get the time cell of data1
f2_line_time = float(f2_lines[f2_index].split(dilimeter)[0]) # get the time cell of data2
if f1_line_time < f2_line_time:
output_lines.append(f1_lines[f1_index])
f1_index += 1
elif f1_lines == f2_line_time:
# if they are equal in time, pick one
output_lines.append(f1_lines[f1_index])
f1_index += 1
f2_index += 1
else:
output_lines.append(f2_lines[f2_index])
f2_index += 1
f_output = open('out.csv', 'w')
f_output.write(''.join(output_lines))
f_output.close()
另一个选项:
import csv
delimiter = " "
with open("data1.csv", "r") as fin1,
open("data2.csv", "r") as fin2,
open("data.csv", "w") as fout:
reader1 = csv.reader(fin1, delimiter=delimiter)
reader2 = csv.reader(fin2, delimiter=delimiter)
writer = csv.writer(fout, delimiter=delimiter)
next(reader2)
first_row = next(reader2)
start2 = float(first_row[0])
writer.writerow(next(reader1))
for row in reader1:
if start2 <= float(row[0]):
break
writer.writerow(row)
writer.writerow(first_row)
writer.writerows(reader2)
假设文件已经单独排序:
- 首先取
data2.csv
的第一行数据,并将其第一个条目转换为浮点数start2
。 - 考虑到这一点,将
data1.csv
中的所有行以少于start2
的时间写入新文件data.csv
,并在条件不再满足时跳出循环。 - 然后将
data2.csv
中已经提取的第一行数据写入data.csv
,然后将data2.csv
中剩余的数据写入data.csv
。
结果data1.csv
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0
data2.csv
Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
一个更通用的解决方案(多个文件)可能看起来像:
import csv
delimiter = " "
files = ["data1.csv", "data2.csv", "data3.csv"]
stops = []
for file in files[1:]:
with open(file, "r") as file:
reader = csv.reader(file, delimiter=delimiter)
header = next(reader)
stops.append(float(next(reader)[0]))
stops.append(float("inf"))
with open("data.csv", "w") as fout:
writer = csv.writer(fout, delimiter=delimiter)
writer.writerow(header)
for stop, file in zip(stops, files):
with open(file, "r") as fin:
next(fin)
reader = csv.reader(fin, delimiter=delimiter)
for row in reader:
if stop <= float(row[0]):
break
writer.writerow(row)
对于像
这样的重叠部分可以使用1. file: |------|
2. file: |--------|
3. file: |------|
但不
1. file: |--------|
2. file: |-------|
3. file: |--------------|