Building a set of millions of items from a huge file, fast, in Python

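I am trying to build several sets of integer pairs from a huge file. In a typical file, each set takes a few million lines to parse and build. I wrote the code below, but it takes more than 36 hours for a single set built from only 360,000 lines!!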

Input file (a few million lines like this):

*|NET 2 0.000295965PF
...//unwanted sections
R2_42 2:1 2:2 3.43756e-05 $a=2.909040 $lvl=99 $llx=15.449 $lly=9.679 $urx=17.309 $ury=11.243
R2_43 2:2 2:3 0.805627 $l=0.180 $w=1.564 $lvl=71 $llx=16.199 $lly=9.679 $urx=16.379 $ury=11.243 $dir=0
R2_44 2:2 2:4 4.16241 $l=0.930 $w=1.564 $lvl=71 $llx=16.379 $lly=9.679 $urx=17.309 $ury=11.243 $dir=0
R2_45 2:3 2:5 0.568889 $a=0.360000 $lvl=96 $llx=15.899 $lly=10.185 $urx=16.499 $ury=10.785
R2_46 2:3 2:6 3.35678 $l=0.750 $w=1.564 $lvl=71 $llx=15.449 $lly=9.679 $urx=16.199 $ury=11.243 $dir=0
R2_47 2:5 2:7 0.0381267 $l=0.301 $w=0.600 $lvl=8 $llx=16.199 $lly=10.200 $urx=16.500 $ury=10.800 $dir=0
R2_48 2:5 2:8 0.0378733 $l=0.299 $w=0.600 $lvl=8 $llx=15.900 $lly=10.200 $urx=16.199 $ury=10.800 $dir=0

*|NET OUT 0.000895965PF
...etc

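Finally, I need to build a set of integer pairs from the above, where each integer is an index into the list made up of columns 2 and 3 of the file.
[(2:1,2:2), (2:2,2:3), (2:2,2:4), (2:3,2:5), (2:3,2:6), (2:5,2:7), (2:5,2:8)] becomes [(0,1),(1,2),(1,3),(2,4),(2,5),(4,6),(4,7)]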

I coded this:

import ast
import itertools
import json
import re

# construct_trees_by_TingYu comes from the page linked below; it is not defined here.

def f6(seq):
    # Not order preserving
    myset = set(seq)
    return list(myset)

if __name__ == '__main__':
    with open('myspf') as infile, open('tmp', 'w') as outfile:
        copy = False
        allspf = []
        for line in infile:
            if line.startswith("*|NET 2"):
                copy = True
            elif line.strip() == "":
                copy = False
            elif copy:
                # capture col2 and col3
                if line.startswith("R"):
                    allspf.extend(re.findall(r'^R.*?\s(.*?)\s(.*?)\s', line))
        final = f6(list(itertools.chain(*allspf)))  # to get unique list
        # build the final pairs again by index: I've found this was the bottleneck
        for x in allspf:
            left, right = x
            outfile.write("({},{}),".format(final.index(left), final.index(right)))
    pair = []
    f = open('tmp')
    pair = list(ast.literal_eval(f.read()))
    f.close()
    fopen = open('hopespringseternal.txt', 'w')
    fopen.write(json.dumps(construct_trees_by_TingYu(pair), indent=1))
    fopen.close()

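The bottleneck is the "for x in allspf" loop, and construct_trees_by_TingYu itself also ran out of memory once I gave it a set of millions of items. This guy's program needs the whole set at once: http://xahlee.info/python/python_construct_tree_from_edge.html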

The final output is a tree from parent to child:

{ "3": {  "1": { "0": {}  } }, "5": {  "2": { "1": {  "0": {} }  } }, "6": {  "4": { "2": {  "1": { "0": {}  } }  } }, "7": {  "4": { "2": {  "1": { "0": {}  } }  } } }

Building a set is always O(n). You need to traverse the whole list to add each item to your set.

However, it doesn't look like you are even using set operations in the code excerpt above.
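The slow part in the excerpt is most likely final.index(...), which scans the list linearly for every lookup, so the loop is quadratic overall. A minimal sketch of one alternative, assuming the pairs are already parsed into tuples of node names: keep a dict from node name to index, which gives constant-time lookups on average. The function name index_pairs is hypothetical, not from the question.

```python
def index_pairs(pairs):
    """Replace each node name with the index of its first appearance."""
    index_of = {}  # node name -> integer index (hypothetical helper state)
    result = []
    for left, right in pairs:
        # setdefault only assigns the next free index to unseen names
        li = index_of.setdefault(left, len(index_of))
        ri = index_of.setdefault(right, len(index_of))
        result.append((li, ri))
    return result


pairs = [("2:1", "2:2"), ("2:2", "2:3"), ("2:2", "2:4"), ("2:3", "2:5"),
         ("2:3", "2:6"), ("2:5", "2:7"), ("2:5", "2:8")]
print(index_pairs(pairs))
# [(0, 1), (1, 2), (1, 3), (2, 4), (2, 5), (4, 6), (4, 7)]
```

Unlike f6, this numbers the nodes in order of first appearance, which matches the expected output shown in the question.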

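If you are running out of memory, you may want to iterate over the huge set instead of waiting for the whole set to be created and then passing it to construct_trees_by_TingYu (I don't know what that is, by the way). Also, you could create a generator that yields each item of the set, which would reduce the memory footprint. I don't know whether construct_trees_by_TingYu will handle a generator being passed to it.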
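As a rough sketch of the generator idea, assuming each resistor line has the shape "R<name> <node_a> <node_b> <value> ..." as in the sample input: the function below yields one index pair at a time instead of building a list and a temporary file. The names iter_index_pairs and RES_LINE are hypothetical, and whether construct_trees_by_TingYu accepts an iterator is untested.

```python
import re

# Assumed line shape: "R<name> <node_a> <node_b> <value> ..."
RES_LINE = re.compile(r'^R\S*\s+(\S+)\s+(\S+)\s')


def iter_index_pairs(lines, net_header="*|NET 2"):
    """Lazily yield (left_index, right_index) for one net section."""
    index_of = {}    # node name -> index, in order of first appearance
    copying = False  # True while inside the wanted net section
    for line in lines:
        if line.startswith(net_header):
            copying = True
        elif not line.strip():
            copying = False  # a blank line ends the section
        elif copying:
            m = RES_LINE.match(line)
            if m:
                left, right = m.groups()
                yield (index_of.setdefault(left, len(index_of)),
                       index_of.setdefault(right, len(index_of)))


sample = [
    "*|NET 2 0.000295965PF\n",
    "R2_42 2:1 2:2 3.43756e-05 $a=2.909040\n",
    "R2_43 2:2 2:3 0.805627 $l=0.180\n",
    "\n",
    "R9_1 9:1 9:2 1.0\n",  # outside the section, so it is skipped
]
print(list(iter_index_pairs(sample)))
# [(0, 1), (1, 2)]
```

With a real file this would be called as iter_index_pairs(open('myspf')). The dict still holds every distinct node name, so memory use grows with the number of nodes, but the list of pairs is never materialized.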
