BeautifulSoup MemoryError when opening multiple files in a directory

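Context: Every week I receive a list of lab results as an HTML file. Each week there are about 3,000 results, and each set of results has between two and four tables associated with it. For each result/trial, I only care about some standard information stored in one of those tables. That table can be uniquely identified because the text in its first cell (first row, first column) is always "Clinical Results", the string the code below checks for.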

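Problem: The code below works fine when I process one file at a time, that is, when I point get_data = open() at one specific file instead of running a for loop over the directory. However, I want to pull in the data from the past few years without handling each file by hand, so I used the glob module and a for loop to go through every file in the directory. The issue is that I get a MemoryError by the time I reach the third file in the directory.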

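Question: Is there a way to clear or reset the memory between files, so that I can loop over all the files in the directory instead of pasting in each file name individually? As you can see in the code below, I tried clearing the variables with del, but that did not work.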

Thanks.

    from bs4 import BeautifulSoup
    import glob
    import gc

    for FileName in glob.glob(r"\Research Results\*"):
        # Read the whole file into memory and parse it
        get_data = open(FileName, 'r').read()
        soup = BeautifulSoup(get_data)
        VerifyTable = "Clinical Results"
        tables = soup.findAll('table')
        for table in tables:
            # The table of interest has "Clinical Results" in its first cell
            First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
            if VerifyTable == First_Row_First_Column.strip():
                v1 = table.findAll('tr')[1].findAll('td')[0].text
                v2 = table.findAll('tr')[1].findAll('td')[1].text
                complete_row = v1.strip() + ";" + v2.strip()
                print(complete_row)
                with open("Results_File.txt", "a") as out_file:
                    out_string = ""
                    out_string += complete_row
                    out_string += "\n"
                    out_file.write(out_string)
                    out_file.close()
        # Attempt to release memory before moving on to the next file
        del get_data
        del soup
        del tables
        gc.collect()

    print("done")

I'm a very inexperienced programmer and I ran into the same problem. I did three things that seem to have solved it:

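1. Also call the garbage collector (gc.collect()) at the start of each iteration.
2. Move the parsing into a function that is called once per iteration, so that all the former global variables become local variables and are released when the function returns.
3. Call soup.decompose() once the data has been extracted.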

I think the second change is probably what fixed it, but I have not had time to verify that, and I did not want to change working code.
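Why moving the work into a function can help is easier to see in isolation. The following is a minimal sketch using only the standard library, not BeautifulSoup itself: `Node` and `parse` are hypothetical stand-ins, and the sketch assumes a parsed tree links children back to their parents (as a BeautifulSoup tree does through `.parent`), so the tree contains reference cycles that only the cyclic garbage collector can reclaim.

```python
import gc
import weakref


class Node:
    """Toy tree node (hypothetical) whose parent link creates a reference cycle."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def parse(n_children):
    """Build a tree in a local variable and hand back only a weak reference."""
    root = Node()
    for _ in range(n_children):
        Node(parent=root)
    return weakref.ref(root)  # a weak reference does not keep the tree alive


root_ref = parse(1000)
gc.collect()  # reclaim the parent/child cycles left behind by the call
tree_is_freed = root_ref() is None
print(tree_is_freed)
```

Once `parse` returns, nothing outside the function refers to the tree, so a single `gc.collect()` is enough to release it. With module-level variables, the previous file's tree stays reachable until the names are rebound or deleted.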

For this code, the solution would look something like this:

    from bs4 import BeautifulSoup
    import glob
    import gc

    def parser(file):
        gc.collect()
        # read() returns a str, which has no close(); let the with-block close the file
        with open(file, 'r') as in_file:
            get_data = in_file.read()
        soup = BeautifulSoup(get_data)
        VerifyTable = "Clinical Results"
        tables = soup.findAll('table')
        for table in tables:
            # The table of interest has "Clinical Results" in its first cell
            First_Row_First_Column = table.findAll('tr')[0].findAll('td')[0].text
            if VerifyTable == First_Row_First_Column.strip():
                v1 = table.findAll('tr')[1].findAll('td')[0].text
                v2 = table.findAll('tr')[1].findAll('td')[1].text
                complete_row = v1.strip() + ";" + v2.strip()
                print(complete_row)
                # The with-block closes the output file; no explicit close() is needed
                with open("Results_File.txt", "a") as out_file:
                    out_file.write(complete_row + "\n")
        # Destroy the parse tree, then collect whatever is left
        soup.decompose()
        gc.collect()
        return None

    for filename in glob.glob(r"\Research Results\*"):
        parser(filename)

    print("done")
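A possible variation on the driver loop, offered only as a sketch and not part of the original answer: open the output file once instead of once per matching row, and close each input file before the next one is opened. The names `process_directory` and `extract_row` are hypothetical; `extract_row` stands in for the BeautifulSoup parsing step and is assumed to return a string such as "v1;v2", or None when no matching table is found. The demo at the bottom uses a temporary directory and a toy extractor so it runs without any HTML files.

```python
import glob
import os
import tempfile


def process_directory(pattern, extract_row, out_path):
    """Apply extract_row to each matching file and append the results to out_path.

    extract_row is a hypothetical callable: it takes the file's text and
    returns one output row as a string, or None if there is nothing to write.
    """
    count = 0
    # Open the output file once for the whole run
    with open(out_path, "a") as out_file:
        for filename in sorted(glob.glob(pattern)):
            # The with-block closes each input file before the next one is opened
            with open(filename, "r") as in_file:
                row = extract_row(in_file.read())
            if row is not None:
                out_file.write(row + "\n")
                count += 1
    return count


# Self-contained demo: three small files, one of which yields no row
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [("a.html", "x;1"), ("b.html", ""), ("c.html", "y;2")]:
        with open(os.path.join(tmp, name), "w") as f:
            f.write(text)
    out_path = os.path.join(tmp, "results.txt")
    written = process_directory(os.path.join(tmp, "*.html"),
                                lambda text: text or None, out_path)
    with open(out_path) as f:
        lines = f.read().splitlines()
```

In real use, `extract_row` would wrap the parsing from `parser` above and return `complete_row` instead of writing it, so that only one file's text and parse tree are alive at any time.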
