I have a CSV file that looks like this:
    index  label  observation  groundTruth
    0      1      10.00        0
    1      3      5.50         0
    2      1      18.90        1
    ---------------------------
    3      1      12.00        1
    4      3      23.68        0
    5      1      21.45        0
    6      3      6.57         1
    7      1      10.00        1
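The data represent time-series observations in which every chain should have a fixed length of 5. Because not all observation chains are 5 rows long by default, padding was added to artificially increase their length, using the code shown further below, to obtain this file: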
    index  label  observation  groundTruth
    0      1      10.00        0
    1      3      5.50         0
    2      1      18.90        1
    3      0      0            0
    4      0      0            0
    --------------------------
    5      1      12.00        1
    6      3      23.68        0
    7      1      21.45        0
    8      3      6.57         1
    9      1      10.00        1
This is the code:
    line = [0,0,0]
    with open(input_file, 'r') as inp, open(output_file, 'a') as out:
        writer = csv.writer(out)
        reader = csv.reader(inp)
        counter = 0
        for row in reader:
            counter += 1
            if(row[0]=='s' and counter<6):
                while(counter<6):
                    writer.writerow(line)
                    counter+=1
                counter=0
            else:
                writer.writerow(row)
My problem is that the padding needs to be at the start of each sequence, not at the end.
What I need is a file like this:
    index  label  observation  groundTruth
    0      0      0            0
    1      0      0            0
    2      1      10.00        0
    3      3      5.50         0
    4      1      18.90        1
    --------------------------
    5      1      12.00        1
    6      3      23.68        0
    7      1      21.45        0
    8      3      6.57         1
    9      1      10.00        1
I tried simply reversing the output CSV file, like this:
    with open('data/test.csv', 'r') as inp, open('data/test_reverse.csv', 'a') as out:
        writer = csv.writer(out)
        reader = csv.reader(inp)
        for row in reversed(list(reader)):
            writer.writerow(row)
However, this reverses the entire time series as well, which again yields invalid data that I don't want:
    index  label  observation  groundTruth
    0      0      0            0
    1      0      0            0
    2      1      18.90        1
    3      3      5.50         0
    4      1      10.00        0
    --------------------------
    5      1      10.00        1
    6      3      6.57         1
    7      1      21.45        0
    8      3      23.68        0
    9      1      12.00        1
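Do you know how to do this?
Note: `---` is not part of my CSV; it is only there to make the question clearer.
Note 2: Padding rows can be detected reliably, because label 0 does not occur naturally in the data (in case that helps).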
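If every chain in the padded file is 5 rows long, you can use the following example to move all rows with label == "0" to the front of their chain: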
    import csv
    from itertools import zip_longest


    def grouper(iterable, n, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx"
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)


    with open("data.csv", "r") as f_in, open("out.csv", "w") as f_out:
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)

        # write headers to output
        writer.writerow(next(reader))

        for rows in grouper(reader, 5):
            # save index column
            index_column, *_ = zip(*rows)
            # move rows with label=="0" to front:
            rows = sorted(rows, key=lambda k: k[1] != "0")
            # correct index column
            for i, r in zip(index_column, rows):
                r[0] = i
            # write to csv file
            writer.writerows(rows)
This writes out.csv:
    index,label,observation,groundTruth
    0,0,0,0
    1,0,0,0
    2,1,10.00,0
    3,3,5.50,0
    4,1,18.90,1
    5,1,12.00,1
    6,3,23.68,0
    7,1,21.45,0
    8,3,6.57,1
    9,1,10.00,1
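The `sorted` call does the reordering because Python's sort is stable and `False` sorts before `True`: padding rows (key `False`) move to the front, while the real rows keep their relative order. A minimal sketch on in-memory rows, where `move_padding_to_front` is a hypothetical helper name used only for illustration and the rows mirror the first chunk of the padded file:

```python
def move_padding_to_front(rows):
    """Return rows with label "0" first, keeping the order of the others."""
    # Hypothetical helper for illustration; not part of the csv module.
    # sorted() is stable and False < True, so padding rows (key False)
    # come first and the real rows keep their relative order.
    return sorted(rows, key=lambda row: row[1] != "0")


chunk = [
    ["0", "1", "10.00", "0"],
    ["1", "3", "5.50", "0"],
    ["2", "1", "18.90", "1"],
    ["3", "0", "0", "0"],
    ["4", "0", "0", "0"],
]
reordered = move_padding_to_front(chunk)
print([row[1:] for row in reordered])
```

After the sort the index column is out of order, which is why the example above saves it beforehand and writes it back onto the reordered rows.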
Here is how to modify the original program so that it writes the padding correctly.
(I am using Python 3.10.)
    import csv
    from typing import Any

    Rows = list[list[Any]]


    def pad_rows(rows: Rows) -> Rows:
        max_rows = 6
        n_rows = len(rows)
        if n_rows >= max_rows:
            return rows

        pad_n = max_rows - n_rows
        pad = [[0, 0, 0]] * pad_n
        return rows + pad


    with (
        open("input.csv", newline="") as f_in,  # the csv module docs recommend newline=""
        open("output.csv", "w", newline="") as f_out,  # I changed "a" to "w" for my dev/testing
    ):
        reader = csv.reader(f_in)
        writer = csv.writer(f_out)

        writer.writerow(next(reader))  # header

        series: Rows = []
        for row in reader:
            if row[0] == "s" and series != []:
                writer.writerows(pad_rows(series))
                series = []
                continue

            series.append(row)

        # Write final series if "s" (break) wasn't the last non-empty row
        if series != []:
            writer.writerows(pad_rows(series))
As written, this actually produces the original, unwanted output:
| label | observation | groundTruth |
|-------|-------------|-------------|
| 1 | 10.00 | 0 |
| 3 | 5.50 | 0 |
| 1 | 18.90 | 1 |
| 0 | 0 | 0 |
| 0 | 0 | 0 |
| 1 | 12.00 | 1 |
| 3 | 23.68 | 0 |
| 1 | 21.45 | 0 |
| 3 | 6.57 | 1 |
| 1 | 10.00 | 1 |
I'm sure you can find the one-line change that makes it work the way you want. (Hint: it's in the pad_rows function.)
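For reference, a minimal sketch of what such a change might look like. It assumes each series should be padded to 5 rows (the chain length stated in the question) and that a padding row has three zero columns, as in the code above. `pad_rows_front` and `target_len` are hypothetical names introduced only for this illustration:

```python
from typing import Any

Rows = list[list[Any]]


def pad_rows_front(rows: Rows, target_len: int) -> Rows:
    """Prepend zero rows so the series has at least target_len rows."""
    # Assumption: a padding row is three zero columns, as in pad_rows.
    pad_n = max(target_len - len(rows), 0)
    # Build each padding row separately so the rows do not share one list.
    pad = [[0, 0, 0] for _ in range(pad_n)]
    return pad + rows  # padding first, then the real observations


series = [["1", "10.00", "0"], ["3", "5.50", "0"], ["1", "18.90", "1"]]
padded = pad_rows_front(series, target_len=5)
print(padded)
```

The only difference from appending is the order of the concatenation in the return statement; the real observations keep their original order, so the time series is not reversed.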