Parsing a large (9GB) file using Python

I have a large text file that I need to parse into a pipe-delimited text file using Python. The file looks like this (basically):

product/productId: D7SDF9S9 
review/userId: asdf9uas0d8u9f 
review/score: 5.0 
review/some text here

product/productId: D39F99 
review/userId: fasd9fasd9f9f 
review/score: 4.1 
review/some text here

Each record is separated by two newline characters (\n\n). I wrote the parser below.

with open ("largefile.txt", "r") as myfile:
    fullstr = myfile.read()
allsplits = re.split("\n\n",fullstr)
articles = []
for i,s in enumerate(allsplits[0:]):
    splits = re.split("\n.*?: ",s)
    productId = splits[0]
    userId = splits[1]
    profileName = splits[2]
    helpfulness = splits[3]
    rating = splits[4]
    time = splits[5]
    summary = splits[6]
    text = splits[7]
fw = open(outnamename,'w')
fw.write(productId+"|"+userID+"|"+profileName+"|"+helpfulness+"|"+rating+"|"+time+"|"+summary+"|"+text+"\n")
return

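The problem is that the file I am reading is so large that I run out of memory before it finishes.
I suspect it is failing on the allsplits = re.split("\n\n",fullstr) line.
Can someone tell me a way to read just one record at a time, parse it, write it to a file, and then move on to the next record?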

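Don't read the whole file into memory in one go; use those blank lines to produce records as you read. Write the data with the csv module, which makes it easy to write out pipe-delimited records.

The following code reads the input file one line at a time and writes out a CSV row for each record as it goes. It never holds more than one line in memory, plus the one record being constructed.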

import csv

fields = ('productId', 'userId', 'profileName', 'helpfulness', 'rating', 'time', 'summary', 'text')
# outnamename is the output path variable from the question
with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.DictWriter(fw, fields, delimiter='|')
    record = {}
    for line in myfile:
        if not line.strip():
            # empty line is the end of a record; repeated blank lines are skipped
            if record:
                writer.writerow(record)
                record = {}
            continue
        # split on the first ': ' only, so values may contain colons
        field, value = line.split(': ', 1)
        # the part after the slash becomes the column name
        record[field.partition('/')[-1].strip()] = value.strip()
    if record:
        # handle last record
        writer.writerow(record)

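This code does assume that the text before the colon on each line has the form category/key, as in product/productId, review/userId, and so on. The part after the slash is used as the CSV column; the fields list at the top reflects these keys.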
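To make the key mapping concrete, here is a minimal sketch of how one input line turns into a column name and a value. The helper name parse_line is hypothetical (it does not appear in the answer), and the sketch assumes every line contains ': ' and a category/key prefix.

```python
def parse_line(line):
    """Split a 'category/key: value' line into a (key, value) pair."""
    # Split on the first ': ' only, so the value may itself contain colons.
    field, value = line.split(': ', 1)
    # Keep only the part after the slash: 'product/productId' -> 'productId'.
    return field.partition('/')[-1].strip(), value.strip()


print(parse_line("product/productId: D7SDF9S9 \n"))
print(parse_line("review/score: 5.0 \n"))
```

Note that a line without ': ' raises ValueError in this sketch, and a prefix without a slash yields an empty key, so real input may need extra checks.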

Alternatively, you can drop the fields list and use csv.writer instead, collecting the record values in a list:

import csv

# outnamename is the output path variable from the question
with open("largefile.txt", "r") as myfile, open(outnamename, 'w', newline='') as fw:
    writer = csv.writer(fw, delimiter='|')
    record = []
    for line in myfile:
        if not line.strip():
            # empty line is the end of a record; repeated blank lines are skipped
            if record:
                writer.writerow(record)
                record = []
            continue
        # only the value is kept; column order follows the order in the file
        _, value = line.split(': ', 1)
        record.append(value.strip())
    if record:
        # handle last record
        writer.writerow(record)

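This version requires that all record fields are present and that they appear in a fixed order, since they are written to the file in that order.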

Don't read the whole file into memory at once; iterate over it line by line instead, and use Python's csv module to parse the records:

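import csv

# newline='' lets the csv module handle line endings itself (Python 3)
with open('hugeinputfile.txt', newline='') as infile, open('outputfile.txt', 'w', newline='') as outfile:
    writer = csv.writer(outfile, delimiter='|')
    # caveat: csv.reader ignores lineterminator and always ends a row at '\r' or '\n',
    # so this loop may yield one row per line rather than one row per record
    for record in csv.reader(infile, delimiter='\n', lineterminator='\n\n'):
        values = [item.split(':')[-1].strip() for item in record[:-1]] + [record[-1]]
        writer.writerow(values)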
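Since csv.reader does not use lineterminator to find row boundaries, one alternative is to group the lines into records yourself and use csv only for writing. The following is a minimal sketch of that idea using itertools.groupby; the function names iter_records and convert are hypothetical, and it assumes records are separated by one or more blank lines.

```python
import csv
import io
from itertools import groupby


def iter_records(lines):
    """Yield one list of stripped lines per blank-line-separated record."""
    # groupby batches consecutive lines by whether they are blank, so only
    # one record's worth of lines is held in memory at a time.
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield [line.strip() for line in group]


def convert(infile, outfile):
    """Write each record from infile as one pipe-delimited row to outfile."""
    writer = csv.writer(outfile, delimiter='|')
    for record in iter_records(infile):
        # Keep the text after the first ': ' (or the whole line if absent).
        writer.writerow([item.split(': ', 1)[-1] for item in record])


# In-memory demonstration; real code would pass open file objects instead.
sample = io.StringIO(
    "product/productId: D7SDF9S9\nreview/score: 5.0\n\n"
    "product/productId: D39F99\nreview/score: 4.1\n"
)
out = io.StringIO()
convert(sample, out)
print(out.getvalue())
```

Because iter_records accepts any iterable of lines, the same function should work on a file opened with open(), which yields lines lazily.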

A few things to note here:

Use with to open files. Why? Because with ensures that the file is close()d, even if an exception interrupts the script.

So:

with open('myfile.txt') as f:
    do_stuff_to_file(f)

is equivalent to:

f = open('myfile.txt')
try:
    do_stuff_to_file(f)
finally:
    f.close()
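As a quick check of that claim, this small sketch raises an exception inside a with block and then inspects the file object. The temporary file and the simulated error are assumptions made only to keep the example self-contained.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on existing data.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)

try:
    with open(path) as f:
        raise RuntimeError("simulated failure while processing")
except RuntimeError:
    pass

# The name f is still bound after the block, but the file has been closed.
closed_after_error = f.closed
print(closed_after_error)  # True

os.remove(path)
```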

To be continued... (I'm out of time at the moment)

Use "readline()" to read the fields of a record one at a time. Or you can use read(n) to read "n" bytes.
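A minimal sketch of the readline() approach might look like the following. The helper name read_record is hypothetical, and the sketch assumes records end at a blank line or at end of file.

```python
import io


def read_record(f):
    """Read one record with readline(); return its lines, or None at EOF."""
    lines = []
    while True:
        line = f.readline()
        if line == "":
            # readline() returns an empty string only at end of file.
            return lines or None
        if not line.strip():
            # A blank line ends the current record; extra blanks are skipped.
            if lines:
                return lines
            continue
        lines.append(line.strip())


# In-memory demonstration; a real file object from open() works the same way.
sample = io.StringIO("a: 1\nb: 2\n\nc: 3\n")
records = list(iter(lambda: read_record(sample), None))
print(records)
```

Calling read_record repeatedly until it returns None processes the file one record at a time, so memory use stays bounded by the largest single record.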
