I've been using Scrapy and Selenium to download files from a UK government website (the government provides the downloads). When I last did this at the end of April it all worked fine.
The page has a list of XML files; I use Scrapy to get the URLs, then Selenium to open them and Scrapy to grab the content.
Recently, however, it has become very slow. The Chrome window launched by chromedriver.exe starts fine, the files open and display their XML content one by one, and I can scrape the XML data, but after a few files it slows down dramatically: the Chrome page takes a long time to open each file and frequently times out.
Given the recent Chrome update, I have already updated Chromedriver to 83.
Does anyone have any ideas?
My code is as follows:
import scrapy
from urllib.parse import urljoin
from foodstandardsagency.items import FoodstandardsagencyItem
from selenium import webdriver
from scrapy.http import TextResponse

class foodstandardsagencySpider(scrapy.Spider):
    name = "foodstandardsagency"
    allowed_domains = ["ratings.food.gov.uk"]
    start_urls = ["http://ratings.food.gov.uk/open-data/en-GB"]

    def parse(self, response):
        for href in response.xpath('//tr/td/div/a[text()[contains(.,"English")]]/@href'):
            url = urljoin('http://ratings.food.gov.uk/', href.extract())
            print(url)
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def __init__(self):
        try:
            self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
        except:
            self.driver = webdriver.Chrome("C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe")

    def parse_dir_contents(self, response):
        self.driver.get(response.url)
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        for sel in response1.xpath('//*[@id="folder2"]/div[@class="opened"]/div[@class="folder"]/div[@class="opened"]'):
            businessname = sel.xpath('.//span[text()[contains(.,"<BusinessName")]]/../span[2]/text()').extract()
            postcode = sel.xpath('.//span[text()[contains(.,"<PostCode")]]/../span[2]/text()').extract()
            businesstype = sel.xpath('.//span[text()[contains(.,"<BusinessType") and not(contains(., "<BusinessTypeID"))]]/../span[2]/text()').extract()
            businesstypeID = sel.xpath('.//span[text()[contains(.,"<BusinessTypeID")]]/../span[2]/text()').extract()
            ratingvalue = sel.xpath('.//span[text()[contains(.,"<RatingValue")]]/../span[2]/text()').extract()
            ratingdate = sel.xpath('.//span[text()[contains(.,"<RatingDate")]]/../span[2]/text()').extract()
            item = FoodstandardsagencyItem()
            item['businessname'] = businessname
            item['postcode'] = postcode
            item['businesstype'] = businesstype
            item['businesstypeID'] = businesstypeID
            item['ratingvalue'] = ratingvalue
            item['ratingdate'] = ratingdate
            yield item
So, in the end, I decided to add code that closes and restarts chromedriver after each XML file is opened, and to set concurrent requests to 1 in the settings (from the default of 16).
The code I added is:
self.driver.close()
try:
    self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
except:
    self.driver = webdriver.Chrome("C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe")
And the code added to the settings is:
CONCURRENT_REQUESTS = 1
It's not great for memory usage, but it's the best I could come up with to get this working.
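As a side note, the duplicated try/except around the two possible chromedriver locations could be folded into a small helper. This is only a sketch, assuming the same two paths from the code above; `first_existing` and `make_driver` are hypothetical names, and the selenium import is deferred so the path-probing logic stands on its own:

```python
import os

# Candidate chromedriver locations (the two paths from the post).
CHROMEDRIVER_PATHS = [
    "C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe",
    "C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe",
]

def first_existing(paths):
    """Return the first path that exists on disk, or None if none do."""
    for path in paths:
        if os.path.exists(path):
            return path
    return None

def make_driver(paths=CHROMEDRIVER_PATHS):
    """Start a fresh Chrome session from the first usable driver path."""
    # Deferred import so the path logic above is usable without selenium.
    from selenium import webdriver

    path = first_existing(paths)
    if path is None:
        raise FileNotFoundError("chromedriver.exe not found in any known location")
    return webdriver.Chrome(path)
```

With that, the restart step would shrink to `self.driver.close()` followed by `self.driver = make_driver()`, and `__init__` could call `make_driver()` too.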
So the full code is:
import scrapy
from urllib.parse import urljoin
from foodstandardsagency.items import FoodstandardsagencyItem
from selenium import webdriver
from scrapy.http import TextResponse

class foodstandardsagencySpider(scrapy.Spider):
    name = "foodstandardsagency"
    allowed_domains = ["ratings.food.gov.uk"]
    start_urls = ["http://ratings.food.gov.uk/open-data/en-GB"]

    def parse(self, response):
        for href in response.xpath('//tr/td/div/a[text()[contains(.,"English")]]/@href'):
            url = urljoin('http://ratings.food.gov.uk/', href.extract())
            print(url)
            yield scrapy.Request(url, callback=self.parse_dir_contents)

    def __init__(self):
        try:
            self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
        except:
            self.driver = webdriver.Chrome("C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe")

    def parse_dir_contents(self, response):
        self.driver.get(response.url)
        response1 = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        for sel in response1.xpath('//*[@id="folder2"]/div[@class="opened"]/div[@class="folder"]/div[@class="opened"]'):
            businessname = sel.xpath('.//span[text()[contains(.,"<BusinessName")]]/../span[2]/text()').extract()
            postcode = sel.xpath('.//span[text()[contains(.,"<PostCode")]]/../span[2]/text()').extract()
            businesstype = sel.xpath('.//span[text()[contains(.,"<BusinessType") and not(contains(., "<BusinessTypeID"))]]/../span[2]/text()').extract()
            businesstypeID = sel.xpath('.//span[text()[contains(.,"<BusinessTypeID")]]/../span[2]/text()').extract()
            ratingvalue = sel.xpath('.//span[text()[contains(.,"<RatingValue")]]/../span[2]/text()').extract()
            ratingdate = sel.xpath('.//span[text()[contains(.,"<RatingDate")]]/../span[2]/text()').extract()
            item = FoodstandardsagencyItem()
            item['businessname'] = businessname
            item['postcode'] = postcode
            item['businesstype'] = businesstype
            item['businesstypeID'] = businesstypeID
            item['ratingvalue'] = ratingvalue
            item['ratingdate'] = ratingdate
            yield item
        self.driver.close()
        try:
            self.driver = webdriver.Chrome("C:/Users/andrew/Downloads/chromedriver_win32/chromedriver.exe")
        except:
            self.driver = webdriver.Chrome("C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe")
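One loose end in this approach: the last browser started is never shut down when the crawl finishes. Scrapy calls a `closed(reason)` method on the spider when it closes, so a small mixin can quit the driver exactly once at shutdown. This is a sketch under that assumption; `DriverCleanupMixin` is a name I'm inventing here:

```python
class DriverCleanupMixin:
    """Sketch: shut the Selenium driver down when Scrapy closes the spider."""

    def closed(self, reason):
        # Scrapy invokes closed(reason) on the spider at the end of the crawl.
        driver = getattr(self, "driver", None)
        if driver is not None:
            # quit() ends the browser and the chromedriver.exe process;
            # close() only closes the current window.
            driver.quit()
```

Declaring the spider as `class foodstandardsagencySpider(DriverCleanupMixin, scrapy.Spider):` should pick this up with no other changes.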