Handling a 200-million-element dataset in Python



I have a directory of 1000 files. Each file has many lines, where each line is an ngram varying from 4 to 8 bytes. I am trying to parse all the files to get the distinct ngrams as a header row, and then, for each file, write a row with the frequency of each of those ngrams in that file.

The following code successfully collects the headers, but it hits a memory error when trying to write the headers to the CSV file. I am running it on an Amazon EC2 instance with 30 GB of RAM. Can anyone suggest an optimization that I am not aware of?

import collections
import csv
import os

#Note: A combination of a list and a set is used to maintain order of metadata
#but still get performance since non-meta headers do not need to maintain order
header_list = []
header_set = set()
header_list.extend(META_LIST)
for ngram_dir in NGRAM_DIRS:
  ngram_files = os.listdir(ngram_dir)
  for ngram_file in ngram_files:      
      with open(ngram_dir+ngram_file, 'r') as file:
        for line in file:
          if '.' not in line and line.rstrip('\n') not in IGNORE_LIST:
            header_set.add(line.rstrip('\n'))
header_list.extend(header_set)#MEMORY ERROR OCCURRED HERE
outfile = open(MODEL_DIR+MODEL_FILE_NAME, 'w')
csvwriter = csv.writer(outfile)
csvwriter.writerow(header_list)
#Convert ngram representations to vector model of frequencies
for ngram_dir in NGRAM_DIRS:
  ngram_files = os.listdir(ngram_dir)
  for ngram_file in ngram_files:      
      with open(ngram_dir+ngram_file, 'r') as file:
        write_list = []
        linecount = 0
        header_dict = collections.OrderedDict.fromkeys(header_set, 0)
        while linecount < META_FIELDS: #META_FIELDS = 3
          line = file.readline()
          write_list.append(line.rstrip('\n'))
          linecount += 1 
        file_counter = collections.Counter(line.rstrip('\n') for line in file)
        header_dict.update(file_counter)
        for value in header_dict.itervalues():
          write_list.append(value)
        csvwriter.writerow(write_list)
outfile.close() 

Don't extend that list, then. Use chain from itertools to chain the list and the set together instead.

Instead of this:

header_list.extend(header_set)#MEMORY ERROR OCCURRED HERE

do this (assuming csvwriter.writerow accepts any iterator):

headers = itertools.chain(header_list, header_set)
...
csvwriter.writerow(headers)

That should at least avoid the memory problem you are currently seeing.
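For completeness, here is a minimal self-contained sketch of that idea. The file name model.csv and the placeholder header values are made up for illustration; the only assumption is that csv.writer.writerow accepts any iterable (it just iterates over the row it is given):

import csv
import itertools

# Illustrative stand-ins for the META_LIST-seeded header_list and the
# header_set built while scanning the ngram files.
meta_headers = ['file_name', 'label', 'source']
ngram_headers = {'abcd', 'efgh', 'ijklmnop'}

outfile = open('model.csv', 'w')
csvwriter = csv.writer(outfile)

# chain() yields the meta headers followed by the ngram headers lazily,
# so the combined header row is never materialized as one big list.
csvwriter.writerow(itertools.chain(meta_headers, ngram_headers))

outfile.close()

In your original code the extended header_list is only ever used for that single writerow call; the per-file loop builds header_dict from header_set directly, so swapping in chain does not affect anything else.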
