Reading files in parallel in Python

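I have a bunch of files (about 100) containing data in the following format: (number of people) \t (average age)

These files were generated from a random walk conducted on a certain demographic population. Each file has 100,000 lines, corresponding to the average age of populations of 1 to 100,000 people. Each file corresponds to a different locality in a third-world country. We will compare these values with the average ages of similar-sized localities in a developed country.

What I want to do is: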

for each i (i ranges from 1 to 100,000):
  Read in the first 'i' values of average-age
  perform some statistics on these values

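This means that, for each run i (where i ranges from 1 to 100,000), I read in the first i values of average age, add them to a list, and run a few tests (such as Kolmogorov-Smirnov or chi-square).

To open all these files in parallel, I figured the best way would be a dictionary of file objects. But I am stuck trying to do the operations above.

Is my method the best possible one (complexity-wise)?

Is there a better method?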

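Actually, it is possible to hold 10,000,000 lines in memory.

Make a dictionary where the keys are the number of people and the values are lists of average ages, with each element of a list coming from a different file. So if there are 100 files, each list will have 100 elements.

This way, you don't need to store the file objects in a dict.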
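
A minimal sketch of that dictionary, assuming tab-separated lines as described in the question. The names `load_by_population` and `sources` are made up for illustration, and `io.StringIO` objects stand in for the real files:

```python
from collections import defaultdict
from io import StringIO


def load_by_population(sources):
    """Map number of people -> list of average ages, one entry per source."""
    ages_by_population = defaultdict(list)
    for source in sources:
        for line in source:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            population, average_age = line.split('\t')
            ages_by_population[int(population)].append(float(average_age))
    return dict(ages_by_population)


# Hypothetical stand-ins for three of the ~100 files, 3 lines each.
sources = [
    StringIO('1\t10\n2\t11\n3\t12\n'),
    StringIO('1\t13\n2\t14\n3\t15\n'),
    StringIO('1\t16\n2\t17\n3\t18\n'),
]
ages = load_by_population(sources)
print(ages[2])  # [11.0, 14.0, 17.0]
```

With real files, each `StringIO` would be replaced by an `open(path)` handle that can be closed as soon as its lines have been consumed, so only one file is open at a time.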

Hope this helps.

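Why not take a simple approach:

Open each file sequentially and read its lines to fill an in-memory data structure
Perform statistics on the in-memory data structure

Here is a self-contained example with 3 "files", each containing 3 lines. It uses StringIO instead of actual files for convenience: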

#!/usr/bin/env python3
from io import StringIO

# for this example, each "file" has 3 lines instead of 100000
f1 = '1\t10\n2\t11\n3\t12'
f2 = '1\t13\n2\t14\n3\t15'
f3 = '1\t16\n2\t17\n3\t18'
files = [f1, f2, f3]

# data is a list of dictionaries mapping population to average age
# i.e. data[0][10000] contains the average age in location 0 (files[0]) with
# population of 10000.
data = []
for i, filename in enumerate(files):
    f = StringIO(filename)
    # f = open(filename, 'r')
    data.append(dict())
    for line in f:
        # each line is "<population>\t<average age>"
        population, average_age = (int(s) for s in line.split('\t'))
        data[i][population] = average_age
print(data)

# gather custom statistics on the data
# i.e. here's how to calculate the average age across all locations where
# population is 2:
num_locations = len(data)
pop2_avg = sum(data[loc][2] for loc in range(num_locations)) / num_locations
print('Average age with population 2 is', pop2_avg, 'years old')

The output is:

[{1: 10, 2: 11, 3: 12}, {1: 13, 2: 14, 3: 15}, {1: 16, 2: 17, 3: 18}]
Average age with population 2 is 14.0 years old
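
For the "first i values" loop in the question, one location's values can be kept in a list and a statistic updated incrementally instead of re-reading the file for every i. A rough sketch that uses a running mean as a placeholder statistic; `prefix_means` is a made-up helper name, and a Kolmogorov-Smirnov or chi-square test would have to be substituted for the mean:

```python
from itertools import accumulate


def prefix_means(values):
    """Return the mean of the first i values for i = 1..len(values)."""
    running_totals = accumulate(values)  # sum of the first i values
    return [total / i for i, total in enumerate(running_totals, start=1)]


# Average ages from one location, ordered by population (1, 2, 3).
average_ages = [10, 11, 12]
means = prefix_means(average_ages)
print(means)  # [10.0, 10.5, 11.0]
```

A test that cannot be updated incrementally can still be run on the slice `average_ages[:i]`, at the cost of roughly quadratic total work over all i.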

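I... don't know if I like this approach, but it might work for you. It has the potential to consume a large amount of memory, but it may do what you need. I am assuming that your data files are numbered. If that is not the case, this will need adapting.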

# open the files.
handles = [open('file-%d.txt' % i) for i in range(1, 101)]
# loop for the number of lines.
for _ in range(100000):
    lines = [fh.readline() for fh in handles]
    # Some sort of processing for the list of lines.
# close the files once all lines have been read.
for fh in handles:
    fh.close()

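This may come close to what you need, but again, I don't know that I like it. If any of your files do not have the same number of lines, this could run into trouble.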
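
One way to guard against files of unequal length is to iterate the handles together with `itertools.zip_longest`, which pads exhausted files with `None`. A sketch, assuming Python 3; `read_in_lockstep` is a hypothetical helper, and the two temporary files stand in for the numbered data files:

```python
import os
import tempfile
from contextlib import ExitStack
from itertools import zip_longest


def read_in_lockstep(paths):
    """Yield a tuple with the next line of every file, stripped of newlines."""
    with ExitStack() as stack:
        handles = [stack.enter_context(open(path)) for path in paths]
        for lines in zip_longest(*handles):
            if any(line is None for line in lines):
                raise ValueError('files do not have the same number of lines')
            yield tuple(line.rstrip('\n') for line in lines)


# Demo with two small temporary files instead of file-1.txt .. file-100.txt.
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [('a.txt', '1\t10\n2\t11\n'), ('b.txt', '1\t13\n2\t14\n')]:
        path = os.path.join(tmp, name)
        with open(path, 'w') as fh:
            fh.write(text)
        paths.append(path)
    rows = list(read_in_lockstep(paths))

print(rows)  # [('1\t10', '1\t13'), ('2\t11', '2\t14')]
```

`contextlib.ExitStack` closes every handle when the loop finishes or raises, so no file is left open if a length mismatch is detected.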
