Memory error when using requests and BeautifulSoup on EC2



I am using requests and BeautifulSoup to parse Wikidata pages and build Person objects. This works, but when I do it in a loop I hit the MemoryError below after creating about 3,000 Person objects.

MemoryError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "TreeBuilder.py", line 11, in <module>
    ancestor = Person(next['id'])
  File "/home/ec2-user/Person.py", line 14, in __init__
    html = soup (data , 'lxml')
  File "/usr/local/lib/python3.7/site-packages/bs4/__init__.py", line 325, in __init__
    self._feed()
  File "/usr/local/lib/python3.7/site-packages/bs4/__init__.py", line 399, in _feed
    self.builder.feed(self.markup)
  File "/usr/local/lib/python3.7/site-packages/bs4/builder/_lxml.py", line 324, in feed
    self.parser.feed(markup)
  File "src/lxml/parser.pxi", line 1242, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 1285, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 855, in lxml.etree._BaseParser._getPushParserContext
  File "src/lxml/parser.pxi", line 871, in lxml.etree._BaseParser._createContext
  File "src/lxml/parser.pxi", line 528, in lxml.etree._ParserContext.__cinit__
SystemError: <class 'lxml.etree._ErrorLog'> returned a result with an error set

I tried to catch the exception with the following, which did not work:

try:
    data = requests.get(url).text
    html = soup(data, 'lxml')
except MemoryError:
    return None

The error does not occur on my local machine when I run the program in PyCharm, only on my AWS EC2 server.
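
Judging by the traceback above, the exception that reaches the caller is SystemError, with MemoryError only as its direct cause, which may be why an `except MemoryError` clause never fires. A minimal sketch of that situation follows. `parse_page` and `parse_or_none` are hypothetical names, and `parse_page` simulates the failure instead of calling lxml:

```python
def parse_page():
    """Hypothetical stand-in that fails the way the traceback does."""
    try:
        raise MemoryError("simulated allocation failure")
    except MemoryError as exc:
        # "raise ... from" produces the "direct cause" chaining seen in the traceback.
        raise SystemError("returned a result with an error set") from exc


def parse_or_none():
    """Return None when parsing fails for lack of memory, directly or indirectly."""
    try:
        return parse_page()
    except MemoryError:
        return None
    except SystemError as exc:
        # Only swallow the SystemError when a MemoryError caused it.
        if isinstance(exc.__cause__, MemoryError):
            return None
        raise


result = parse_or_none()
print(result)
```

Catching the error this way only hides the symptom: the process is still running out of memory.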

Update

See the code below. I added gc.collect() after every 100 iterations, which does not seem to help.

Person.py

import requests
from bs4 import BeautifulSoup as soup


class Person:
    def __init__(self, id):
        url = 'https://www.wikidata.org/wiki/' + id
        data = requests.get(url).text
        html = soup(data, 'lxml')
        ### id ###
        self.id = id
        ### Name ###
        if html.find("span", {"class": "wikibase-title-label"}) is not None:
            self.name = html.find("span", {"class": "wikibase-title-label"}).string
        else:
            self.name = ""
        ### Born ###
        self.birth = ""
        birth = html.find("div", {"id": "P569"})
        if birth is not None:
            birth = birth.findAll("div", {"class": "wikibase-snakview-variation-valuesnak"})
            if len(birth) > 0:
                self.birth = birth[0].string
        ### Death ###
        self.death = ""
        death = html.find("div", {"id": "P570"})
        if death is not None:
            death = death.findAll("div", {"class": "wikibase-snakview-variation-valuesnak"})
            if len(death) > 0:
                self.death = death[0].string
        #### Sex ####
        sex = html.find("div", {"id": "P21"})
        if sex is not None:
            for item in sex.strings:
                if item == 'male' or item == 'female':
                    self.sex = item
        ### Mother ###
        self.mother = ""
        mother = html.find("div", {"id": "P25"})
        if mother is not None:
            mother = mother.findAll("div", {"class": "wikibase-snakview-variation-valuesnak"})
            if len(mother) > 0:
                self.mother = {"name": mother[0].string, "id": mother[0].find('a')['title']}
        ### Father ###
        self.father = ""
        father = html.find("div", {"id": "P22"})
        if father is not None:
            father = father.findAll("div", {"class": "wikibase-snakview-variation-valuesnak"})
            if len(father) > 0:
                self.father = {"name": father[0].string, "id": father[0].find('a')['title']}
        ### Children ###
        self.children = []
        x = html.find("div", {"id": "P40"})
        if x is not None:
            x = x.findAll("div", {"class": "wikibase-statementview"})
            for i in x:
                a = i.find('a')
                if a is not None and a['title'][0] == 'Q':
                    self.children.append({'name': a.string, 'id': a['title']})

    def __str__(self):
        return (self.name + "\n\tBirth: " + self.birth + "\n\tDeath: " + self.death + "\n\n\tMother: " +
                self.mother['name'] + "\n\tFather: " + self.father['name'] + "\n\n\tNumber of Children: " +
                str(len(self.children)))

TreeBuilder.py

from Person import Person
import gc, sys

file = open('ancestors.txt', 'w+')
ancestors = [{'name':'Charlemange', 'id':'Q3044'}]
all = [ancestors[0]['id']]
i = 1
while ancestors != []:
    next = ancestors.pop(0)
    ancestor = Person(next['id'])
    for child in ancestor.children:
        if child['id'] not in all:
            all.append(child['id'])
            ancestors.append(child)
    if ancestor.mother != "" and ancestor.mother['id'] not in all:
        all.append(ancestor.mother['id'])
        ancestors.append(ancestor.mother)
    if ancestor.father != "" and ancestor.father['id'] not in all:
        all.append(ancestor.father['id'])
        ancestors.append(ancestor.father)

    file.write(ancestor.id + "*" + ancestor.name + "*" + "https://www.wikidata.org/wiki/" + ancestor.id + "*" + str(ancestor.birth) + "*" + str(ancestor.death) + "\n")
    if i % 100 == 0:
        print(ancestor.name + " (" + ancestor.id + ")" + " - " + str(len(all)) + " - " + str(len(ancestors)) + " - " + str(sys.getsizeof(all)))
        gc.collect()
    i += 1
file.close()
print("\nDone!")

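The solution, however annoying, is to wrap every reference to a BS object in a str() call. When you do birth[0].string it looks as though you are storing only a bare string, but .string returns a BeautifulSoup NavigableString, which keeps links back into the parse tree, so when you do this in a loop the BS objects keep piling up in memory. Give it a try: I just ran your code with this change and the memory footprint stayed low.

Just make sure you catch all of the references, for example: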

self.children.append({'name': str(a.string), 'id': str(a['title'])})

Edit: See this answer for a possible explanation of why BS ends up retaining these references.
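
As a rough illustration of that retention effect, here is a stdlib-only sketch. `Node`, `TreeString`, and `extract` are hypothetical stand-ins, not BeautifulSoup's actual classes; the assumption is only that the extracted string is a `str` subclass holding a reference back to its tree. A weak reference shows whether the tree is still alive after the text has been pulled out:

```python
import gc
import weakref


class TreeString(str):
    """Hypothetical stand-in for a str subclass that remembers its parent."""
    parent = None


class Node:
    """Hypothetical stand-in for a parsed document tree."""

    def __init__(self, text):
        self.payload = bytearray(1_000_000)  # pretend the tree is large
        self.string = TreeString(text)
        self.string.parent = self  # the string points back at its tree


def extract(text, copy):
    """Build a tree, pull out its text, and report whether the tree survives."""
    tree = Node(text)
    tree_ref = weakref.ref(tree)
    # str() on a str subclass returns a plain str without the extra attributes.
    value = str(tree.string) if copy else tree.string
    del tree
    gc.collect()  # the tree and its string form a reference cycle
    return value, tree_ref() is not None


kept, tree_alive_without_copy = extract("example", copy=False)
copied, tree_alive_with_copy = extract("example", copy=True)

print(type(kept).__name__, tree_alive_without_copy)
print(type(copied).__name__, tree_alive_with_copy)
```

In this sketch the tree stays alive for as long as the uncopied string is held, while the `str()` copy lets the whole tree be collected. Storing thousands of uncopied strings would therefore keep thousands of trees in memory.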
