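When reading from one file and writing to another, which makes the most efficient use of computing resources: working line by line, or batching into a list?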

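I am reading strings line by line from a text file and writing them to a CSV file.

I can think of two good ways to do this (I welcome other ideas or modifications):

Read a single line, process it into a list, and write that row out immediately.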

import csv

with open('dirty.txt', 'r') as dirty_text:
    with open('clean.csv', 'w', newline='') as clean_csv:
        cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for line in dirty_text:
            # Parse the fields into a list, replacing the previous row.
            linelist = line.split()  # placeholder parse; the real parsing is not shown here
            # Write the row into clean.csv.
            cleancsv_writer.writerow(linelist)

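Read and process lines into a list (until the list reaches its size limit), then write the list to the CSV in one big batch. Repeat until the end of the file (I have left the loop out of this example).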

import csv

linelist = []
seekpos = 0  # where to resume if looping through multiple batches (unused in this single batch)
with open('dirty.txt', 'r') as dirty_text:
    for line in dirty_text:
        # Parse fields into the list until the end of the file or the end of the
        # list's memory space, so that each list item is one row.
        linelist.append(line.split())  # placeholder parse; the real parsing is not shown here
        # Update the seek position to come back to after this batch, if batching.
with open('clean.csv', 'a', newline='') as clean_csv:
    cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # Write the list into clean.csv, each list item becoming a comma-separated row.
    cleancsv_writer.writerows(linelist)
    # This would likely be a loop for bigger files, but for my project and for simplicity, it's not necessary.

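Which process makes the most efficient use of resources?

For this case, I am assuming that nobody (human or otherwise) needs access to either file during the process (though I would be glad to hear about efficiency in that situation too).

I am also assuming that a list needs fewer resources than a dictionary.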

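Memory usage is my main concern. My hunch is that the first process uses the least memory, because the list never holds more than one item, so its peak memory at any given moment is lower than that of the second process, which fills the list to its limit. However, I am not sure how dynamic memory allocation works in Python, and the first process has two file objects open at the same time.

As for power usage and total time, I am not sure which process is more efficient. My hunch is that with multiple batches the second option would use more power and time, because it opens and closes the file for every batch.

As for code complexity and length, the first option looks like it would end up simpler and shorter.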

Other considerations?

Which process is best?

Is there a better way? Ten better ways?

Thanks in advance!

Reading all the data into memory is inefficient because it uses more memory than the task needs.
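One way to check the memory claim on your own data is to measure peak allocations with the standard `tracemalloc` module. The sketch below is only illustrative: `parse`, `stream_convert`, `batch_convert`, `peak_bytes` and `compare` are hypothetical names, the whitespace split stands in for the real cleaning step, and the input file is synthetic.

```python
import csv
import os
import tempfile
import tracemalloc


def parse(line):
    # Hypothetical parser (an assumption): split the line on whitespace.
    return line.split()


def stream_convert(src_path, dst_path):
    # Option 1: hold only the current row in memory.
    with open(src_path) as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        for line in src:
            writer.writerow(parse(line))


def batch_convert(src_path, dst_path):
    # Option 2: collect every row in a list, then write them all at once.
    with open(src_path) as src:
        rows = [parse(line) for line in src]
    with open(dst_path, 'w', newline='') as dst:
        csv.writer(dst).writerows(rows)


def peak_bytes(func, *args):
    # Peak size of Python-level allocations made while func runs.
    tracemalloc.start()
    try:
        func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def compare(n_lines=20000):
    # Build a synthetic input file and measure both approaches on it.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, 'dirty.txt')
        with open(src, 'w') as f:
            for i in range(n_lines):
                f.write(f'field{i} value{i} extra{i}\n')
        stream_peak = peak_bytes(stream_convert, src, os.path.join(tmp, 'a.csv'))
        batch_peak = peak_bytes(batch_convert, src, os.path.join(tmp, 'b.csv'))
    return stream_peak, batch_peak
```

Calling `compare()` returns the two peaks in bytes. The streaming peak should stay roughly constant as `n_lines` grows, while the batch peak grows with the size of the input.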

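You can trade some CPU for memory; a program that reads everything into memory will have a very simple main loop. But the main bottleneck will be the I/O channel, so it really won't be any faster. However fast the code runs, any reasonable implementation will spend most of its run time waiting on the disk.

If you have enough memory, reading the whole file into memory is fine. Once the data is larger than available memory, performance drops sharply (the operating system starts swapping regions of memory out to disk and then back in when they are needed again; in the worst case this essentially brings the system to a standstill, a condition known as thrashing). The main reason to prefer reading and writing one line at a time is that the program does not degrade even when it scales to much larger amounts of data.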
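If you still want to write in batches, you can cap the batch size so that memory use stays bounded however large the input gets. This is a minimal sketch rather than a recommendation: `clean_line` and `convert_in_batches` are hypothetical names, the whitespace split is a placeholder for the real parsing, and `itertools.islice` pulls at most `batch_size` lines from the open file on each pass.

```python
import csv
from itertools import islice


def clean_line(line):
    # Hypothetical stand-in for the real parsing: split on whitespace.
    return line.split()


def convert_in_batches(src_path, dst_path, batch_size=1000):
    """Convert src to CSV, holding at most batch_size rows in memory.

    Returns the number of batches written.
    """
    batches = 0
    with open(src_path, 'r') as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        while True:
            # islice continues from wherever the previous batch stopped.
            batch = [clean_line(line) for line in islice(src, batch_size)]
            if not batch:
                break
            writer.writerows(batch)
            batches += 1
    return batches
```

Both files stay open for the whole run, so there is no seek position to track and no cost from reopening the output for each batch.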

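I/O is already buffered; just write what looks natural, and let the file-like objects and the operating system take care of the actual disk reads and writes.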
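To illustrate the buffering point: the built-in `open()` takes a `buffering` argument, and by default Python chooses a buffer size for you, so a line-by-line loop does not turn into one disk operation per line. The sketch below just makes that buffer size explicit. `copy_lines` is a hypothetical helper, and in practice you would rarely need to tune this value.

```python
import io


def copy_lines(src_path, dst_path, buffer_size=io.DEFAULT_BUFFER_SIZE):
    """Copy a text file line by line; return the number of lines copied."""
    count = 0
    with open(src_path, 'r', buffering=buffer_size) as src, \
         open(dst_path, 'w', buffering=buffer_size) as dst:
        for line in src:
            # write() fills an in-memory buffer; the data reaches the
            # operating system in larger chunks, and on close at the latest.
            dst.write(line)
            count += 1
    return count
```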

import csv

with open('dirty.txt', 'r') as dirty_text:
    with open('clean.csv', 'w', newline='') as clean_csv:
        cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        for line in dirty_text:
            # some_function is whatever turns one dirty line into a list of fields.
            row = some_function(line)
            cleancsv_writer.writerow(row)

If all the work of cleaning a line is abstracted away in some_function, you don't even need the for loop.

with open('dirty.txt', 'r') as dirty_text, \
     open('clean.csv', 'w', newline='') as clean_csv:
    cleancsv_writer = csv.writer(clean_csv, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    # The generator expression hands rows to writerows() one at a time.
    cleancsv_writer.writerows(some_function(line) for line in dirty_text)
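For completeness, here is a self-contained version of that last form that you can run as is. The body of `some_function` is purely an assumption made for illustration (it treats the dirty file as pipe-delimited and trims whitespace around each field), and `clean_file` is a hypothetical wrapper name; swap in your own parsing.

```python
import csv


def some_function(line):
    # Hypothetical cleaner (an assumption): pipe-delimited fields, trimmed.
    return [field.strip() for field in line.rstrip('\n').split('|')]


def clean_file(src_path, dst_path):
    # Both files are open at once, but only one line is processed at a time.
    with open(src_path, 'r') as dirty_text, \
         open(dst_path, 'w', newline='') as clean_csv:
        writer = csv.writer(clean_csv, delimiter=',', quotechar='"',
                            quoting=csv.QUOTE_MINIMAL)
        # The generator is consumed lazily, so no full list of rows is built.
        writer.writerows(some_function(line) for line in dirty_text)
```

With `csv.QUOTE_MINIMAL`, a field is quoted only when it contains the delimiter, the quote character, or a line break, so a cleaned field such as `b,c` is written as `"b,c"`.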
