Read a CSV, group by one column and apply sum, without pandas



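As the title says, I want to read a CSV, group by one column of that same CSV, apply sum, and then replace the old CSV with the new values, using as few libraries as possible (and avoiding pandas). This is how far I've got: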

from csv import reader, writer

index = {}
with open('event.csv') as f:
    cr = reader(f)
    for row in cr:
        index.setdefault(row[0], []).append(int(row[1]))
f.close()
with open('event.csv', 'w', newline='\n') as csv_file:
    writer = writer(csv_file)
    for key, value in index.items():
        writer.writerow([key, value[0]])
csv_file.close()

But this way I can get the average, and I have to open the file twice, which doesn't seem smart to me. Here is a CSV similar to event.csv:

work1,100
work2,200
work3,200
work1,50
work3,20

Desired output:

work1,150
work2,200
work3,220

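You're actually very close. Just sum the values you read when rewriting the file. Note that when you use with on a file, you don't have to close it explicitly; it is closed for you automatically. Also note that CSV files should be opened with newline='', for both reading and writing, as per the documentation.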

import csv

index = {}
with open('event.csv', newline='') as csv_file:
    cr = csv.reader(csv_file)
    for row in cr:
        # collect every value seen for this key
        index.setdefault(row[0], []).append(int(row[1]))
with open('event2.csv', 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for key, values in index.items():
        value = sum(values)
        writer.writerow([key, value])
print('-fini-')

The code above can be written a bit more concisely by eliminating some temporary variables and using a generator expression:

import csv

index = {}
with open('event.csv', newline='') as csv_file:
    for row in csv.reader(csv_file):
        index.setdefault(row[0], []).append(int(row[1]))
with open('event2.csv', 'w', newline='') as csv_file:
    csv.writer(csv_file).writerows([key, sum(values)] for key, values in index.items())
print('-fini-')
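
The question asks to replace the old CSV, while the code above writes to event2.csv. One possible way to overwrite the original is to write the result to a temporary file and then swap it in with os.replace. This is only a sketch: the function name sum_csv_in_place is made up for illustration, and it assumes a two-column file with integer values and no header. The input still has to be read completely before anything is written, so two opens remain.

```python
import csv
import os
import tempfile


def sum_csv_in_place(path):
    """Group a two-column CSV by its first column, sum the second, overwrite the file.

    Hypothetical helper, not part of the answers above.
    """
    totals = {}
    with open(path, newline='') as src:
        for key, value in csv.reader(src):
            totals[key] = totals.get(key, 0) + int(value)

    # Write to a temporary file in the same directory, then swap it in,
    # so a failure while writing does not leave a truncated original.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.csv')
    try:
        with os.fdopen(fd, 'w', newline='') as dst:
            csv.writer(dst).writerows(totals.items())
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise
    return totals
```

Calling sum_csv_in_place('event.csv') would then leave the summed rows in event.csv itself.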

Another, simplified take on the solution already shown, with no extra libraries needed:

import csv

index = {}
with open('event.csv', newline='') as f:
    cr = csv.reader(f)
    for item, value in cr:
        index[item] = index.get(item, 0) + int(value)  # sum as you go
with open('event2.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(index.items())  # write all the items in one shot
print('-fini-')
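
One thing to watch with the `for item, value in cr` unpacking: csv.reader yields an empty list for a blank line, so a stray empty line in the file would raise a ValueError. Below is a rough sketch of the same sum-as-you-go idea that skips such rows, run against the sample data from the question held in memory. The function name sum_by_first_column is just an illustration, and skipping blank rows is my assumption about what you'd want.

```python
import csv
import io


def sum_by_first_column(lines):
    """Sum the second column per distinct first-column value (hypothetical helper)."""
    totals = {}
    for row in csv.reader(lines):
        if not row:  # csv.reader returns [] for blank lines; skip them
            continue
        item, value = row
        totals[item] = totals.get(item, 0) + int(value)
    return totals


# Sample data from the question, plus one blank line in the middle.
sample = io.StringIO("work1,100\nwork2,200\n\nwork3,200\nwork1,50\nwork3,20\n")
totals = sum_by_first_column(sample)

# Write to an in-memory buffer instead of a file to inspect the result.
out = io.StringIO(newline='')
csv.writer(out).writerows(totals.items())
print(out.getvalue())
```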

Using one additional library, convtools, which provides a lot of functionality so that you don't have to write lots of code by hand every time.

from convtools import conversion as c
from convtools.contrib.tables import Table

rows = Table.from_csv("event.csv", header=False).into_iter_rows(list)
converter = (
    c.group_by(c.item(0))
    .aggregate(
        (
            c.item(0),
            c.ReduceFuncs.Sum(c.item(1).as_type(int)),
        )
    )
    .gen_converter()
)
processed_rows = converter(rows)
Table.from_rows(processed_rows, header=False).into_csv(
    "event2.csv", include_header=False
)

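Here's another way to think about it.

Instead of storing lists of ints while reading and then "compressing" them into the desired value while writing, make it clear up front, during the read, that you are summing: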

import csv
from collections import defaultdict

summed_work = defaultdict(int)
with open('event_input.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        work_id = row[0]
        work_value = int(row[1])
        summed_work[work_id] += work_value
with open('event_processed.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for work_id, summed_value in summed_work.items():
        writer.writerow([work_id, summed_value])

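This is functionally equivalent to your goal and to martineau's answer, but I think it shows your intent to you and your readers sooner and more clearly.

Technically it uses an extra import, defaultdict, but that comes from the standard library, and I'm not sure how much weight you put on the number of libraries used.

Edit

Oh, I just remembered there is also a Counter class in collections. It might be clearer: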

from collections import Counter
summed_work = Counter()

and everything else stays the same.
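
For completeness, here is roughly what the whole thing might look like with Counter. I've wrapped it in a function with made-up names (sum_work, input_path, output_path) so the file names aren't hard-coded; treat it as a sketch that assumes two columns, integer values and no header row. A missing key in a Counter reads as 0, so the += works without any setup.

```python
import csv
from collections import Counter


def sum_work(input_path, output_path):
    """Sum the second column per first-column key and write the totals (hypothetical helper)."""
    summed_work = Counter()
    with open(input_path, newline='') as f:
        for row in csv.reader(f):
            work_id = row[0]
            work_value = int(row[1])
            summed_work[work_id] += work_value  # missing keys start at 0
    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for work_id, summed_value in summed_work.items():
            writer.writerow([work_id, summed_value])
    return summed_work
```

Usage would be something like sum_work('event_input.csv', 'event_processed.csv').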