How to fix "RuntimeError: can't start new thread" when web scraping with Selenium?



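I have built a script that collects products and their details from many sites (~120). It does what I want, but after a while (usually around page 70) it raises a "MemoryError" and a "RuntimeError: can't start new thread". I have tried looking for solutions, such as calling .clear() on my lists or using sys.getsizeof() to track down a memory leak, but without success so far. Do you have any idea what the problem might be?

Detailed error message: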

Traceback (most recent call last):
  File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\pydevd.py", line 1741, in <module>
    main()
  File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/EGYÉB/PYTHON/Projects/WebScraping/Selenium_scraping.py", line 63, in <module>
    soup1 = BeautifulSoup(driver.page_source, 'html.parser')
  File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 679, in page_source
    return self.execute(Command.GET_PAGE_SOURCE)['value']
  File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 319, in execute
    response = self.command_executor.execute(driver_command, params)
  File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 374, in execute
    return self._request(command_info[0], url, body=data)
  File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 423, in _request
    data = utils.load_json(data.strip())
  File "C:\EGYÉB\PYTHON\Projects\venv\lib\site-packages\selenium\webdriver\remote\utils.py", line 37, in load_json
    return json.loads(s)
  File "C:\EGYÉB\PYTHON\Python Core\lib\json\__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "C:\EGYÉB\PYTHON\Python Core\lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\EGYÉB\PYTHON\Python Core\lib\json\decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
MemoryError
Traceback (most recent call last):
  File "C:\EGYÉB\PYTHON\PyCharm\helpers\pydev\_pydevd_bundle\pydevd_comm.py", line 1505, in do_it
    t.start()
  File "C:\EGYÉB\PYTHON\Python Core\lib\threading.py", line 847, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

Code:

from selenium import webdriver
from bs4 import BeautifulSoup
from itertools import count
import pandas as pd
import os
import csv
import time
import re

os.chdir('C:...')

# Accumulators for the scraped fields; flushed to CSV in batches.
price = []
prod_name = []
href_link = []
specs = []
item_specs1 = []
item_specs2 = []

# Log in once before visiting the store pages.
url1 = 'https://login.aliexpress.com/'
driver = webdriver.Chrome()
driver.implicitly_wait(30)
driver.get(url1)
time.sleep(3)
driver.switch_to.frame('alibaba-login-box')
driver.find_element_by_id('fm-login-id').send_keys('..........')
driver.find_element_by_id('fm-login-password').send_keys('.........')
driver.find_element_by_id('fm-login-submit').click()
time.sleep(3)
driver.switch_to.default_content()

df = pd.read_csv('........csv', header=0)
for index, row in df.iterrows():
    page_nr = 1
    url = 'https://www.aliexpress.com/store/{}'.format(row['Link']) + '/search/{}'.format(page_nr) + '.html'
    driver.get(url)
    time.sleep(2)
    # Walk the store's result pages until the "next" button is disabled.
    for page_number in count(start=1):
        time.sleep(5)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        for div_b in soup.find_all('div', {'class': 'cost'}):
            price.append(div_b.text + 'Ł')
        for pr_name in soup.find_all('div', {'class': 'detail'}):
            for pr_h in pr_name.find_all('h3'):
                for pr_title in pr_h.find_all('a'):
                    prod_name_t = (pr_title.get('title').strip())
                    prod_name_l = (pr_title.get('href').strip())
                    href_link.append(prod_name_l + 'Ł')
                    prod_name.append(prod_name_t + 'Ł')
        links = [link.get_attribute('href') for link in driver.find_elements_by_xpath("//div[@id='node-gallery']/div[5]/div/div/ul/li/div[2]/h3/a")]
        # Visit each product page and collect its specification pairs.
        for link in links:
            driver.get(link)
            time.sleep(2)
            soup1 = BeautifulSoup(driver.page_source, 'html.parser')
            for item1 in soup1.find_all('span', {'class': 'propery-title'}):
                item_specs1.append(item1.text)
            for item2 in soup1.find_all('span', {'class': 'propery-des'}):
                item_specs2.append(item2.text + 'Ł')
            item_specs = list(zip(item_specs1, item_specs2))
            item_specs_join = ''.join(str(item_specs))
            item_specs_replace = [re.sub(r'[^a-zA-Z0-9 \n.:Ł]', '', item_specs_join)]
            specs.append(item_specs_replace)
            item_specs1.clear()
            item_specs2.clear()
            soup1.clear()
            driver.back()
        links.clear()
        # Flush to disk once enough rows have accumulated.
        if len(prod_name) > 500:
            data_csv = list(zip(prod_name, price, href_link, specs))
            with open('........csv', 'a', newline='') as f:  # the with block closes the file
                writer = csv.writer(f)
                for row0 in data_csv:
                    writer.writerow(row0)
            price.clear()
            prod_name.clear()
            href_link.clear()
            specs.clear()
            data_csv.clear()
        try:
            if soup.find_all('span', {'class': 'ui-pagination-next ui-pagination-disabled'}):
                print("Last page reached!")
                break
            else:
                driver.find_element_by_class_name('ui-pagination-next').click()
                time.sleep(1)
        except Exception:
            break
driver.quit()

# Write whatever is left after the last batch.
data_csv = list(zip(prod_name, price, href_link, specs))
print(len(data_csv))
with open('.......csv', 'a', newline='') as f:
    writer = csv.writer(f)
    for row1 in data_csv:
        writer.writerow(row1)

This error message...

RuntimeError: can't start new thread

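...means that the system "can't start a new thread" because too many threads are already running within your Python process, and the request to create another one was refused because of a resource limit.

Your main issue stems from the following line: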

item_specs_join = ''.join(str(item_specs))
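As a side note, here is a minimal sketch of what that line evaluates to. The sample values are made up for illustration and are not taken from the scraped pages. str() turns the whole list into a single string, and ''.join() then iterates over that string character by character and glues the characters back together, so the result is equal to str(item_specs) itself.

```python
# Hypothetical sample values standing in for the scraped spec spans.
item_specs1 = ["Brand Name:", "Material:"]
item_specs2 = ["ACMEŁ", "CottonŁ"]

item_specs = list(zip(item_specs1, item_specs2))

as_text = str(item_specs)           # one string: the repr of the whole list
item_specs_join = ''.join(as_text)  # re-joins the characters of that same string

# The join produces a string equal to the one it was given.
print(item_specs_join == as_text)
```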

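Depending on your environment, you need to compare the number of threads your program is creating with the maximum number of threads your system can create. Your program may be starting more threads than your system can handle. There is a limit on the number of threads that can be active in a single process.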
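One way to watch the thread count from inside the process is the standard threading module. The sketch below is only an illustration: report_threads and worker are hypothetical helper names, and the count covers threads the Python interpreter knows about, not threads owned by the browser or the driver process.

```python
import threading

def report_threads(label):
    """Print and return the number of Python threads currently alive."""
    names = [t.name for t in threading.enumerate()]
    print(f"{label}: {len(names)} thread(s) alive: {names}")
    return len(names)

def worker(stop_event):
    # Hypothetical worker: just waits until it is told to stop.
    stop_event.wait()

before = report_threads("before")

stop = threading.Event()
workers = [
    threading.Thread(target=worker, args=(stop,), name=f"demo-{i}")
    for i in range(3)
]
for t in workers:
    t.start()
during = report_threads("during")

stop.set()
for t in workers:
    t.join()
after = report_threads("after")
```

Calling such a helper at intervals inside a long-running loop would show whether the count keeps growing.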

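Another factor may be that your program starts threads faster than they run to completion. If you need to start many threads, do so in a more controlled manner, for example with a thread pool.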

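Since threads run asynchronously, redesigning the program flow would be a better approach. Consider using a pool of threads to fetch resources rather than starting a new thread for every request.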
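A bounded pool can be sketched with concurrent.futures.ThreadPoolExecutor from the standard library. This is an illustration under stated assumptions rather than a drop-in fix for the script above: fetch_details is a hypothetical placeholder that does no browser work, and MAX_WORKERS is an arbitrary cap. Sharing one browser driver between threads would need its own synchronization.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

MAX_WORKERS = 4  # assumed cap; tune for the machine

def fetch_details(item_id):
    """Hypothetical placeholder for per-item work (no real network access)."""
    return item_id, threading.current_thread().name

item_ids = list(range(20))

# At most MAX_WORKERS threads exist at once, however many items there are.
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(fetch_details, item_ids))

threads_used = {name for _, name in results}
print(len(results), "items processed by", len(threads_used), "thread(s)")
```

pool.map returns results in input order, and leaving the with block waits for all pending work before the pool's threads are released.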

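You can find a detailed discussion in "error: can't start new thread".

Here you will also find a detailed discussion on "Is there any way to kill a Thread?".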
