Python Scrapy - can Scrapy handle a link this long, and if so, why does it throw an error?



I have this script. After ticking the boxes I wanted to search on, I extracted the link with Chrome's network dev tools; that link is the result below. I then used it as my start_urls, aiming to get just one page of results as a test, but I get an error in the terminal:

import scrapy
from ..items import PontsItems

class Names(scrapy.Spider):
    name = 'ponts'
    start_urls = [
        'https://www.ponts.org/fr/annuaire/recherche?result=1&annuaire_mode=standard&annuaire_as_no=&keyword=&PersonneNom=&PersonnePrenom=&DiplomePromo%5B%5D=2023&DiplomePromo%5B%5D=2022&DiplomePromo%5B%5D=2021&DiplomePromo%5B%5D=2020&DiplomePromo%5B%5D=2019&DiplomePromo%5B%5D=2018&DiplomePromo%5B%5D=2017&DiplomePromo%5B%5D=2016&DiplomePromo%5B%5D=2015&DiplomePromo%5B%5D=2014&DiplomePromo%5B%5D=2013&DiplomePromo%5B%5D=2012&DiplomePromo%5B%5D=2011&DiplomePromo%5B%5D=2010',
    ]

    def parse(self, response):
        items = PontsItems()
        for item in response.xpath('//div[@class="single_desc"]'):
            items['name'] = item.xpath('/div[@class="single_libel"]/a/text()').get()
            items['description'] = item.xpath('/div[@class="single_details]/div/text()').get()
            yield items

The error is:

[scrapy.core.scraper] ERROR: Error downloading <GET https://www.ponts.org/fr/annuaire/recherche?result=1&annuaire_mode=standard&annuaire_as_no=&keyword=&PersonneNom=&PersonnePrenom=&DiplomePromo%5B%5D=2023&DiplomePromo%5B%5D=2022&DiplomePromo%5B%5D=2021&DiplomePromo%5B%5D=2020&DiplomePromo%5B%5D=2019&DiplomePromo%5B%5D=2018&DiplomePromo%5B%5D=2017&DiplomePromo%5B%5D=2016&DiplomePromo%5B%5D=2015&DiplomePromo%5B%5D=2014&DiplomePromo%5B%5D=2013&DiplomePromo%5B%5D=2012&DiplomePromo%5B%5D=2011&DiplomePromo%5B%5D=2010>

Immediately followed by:

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "c:\users\adam\pycharmprojects\scrapy_things\venv\lib\site-packages\twisted\internet\defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
  File "c:\users\adam\pycharmprojects\scrapy_things\venv\lib\site-packages\scrapy\core\downloader\middleware.py", line 54, in process_response
    response = yield deferred_from_coro(method(request=request, response=response, spider=spider))
  File "c:\users\adam\pycharmprojects\scrapy_things\venv\lib\site-packages\scrapy_proxy_pool\middlewares.py", line 287, in process_response
    ban = is_ban(request, response)
  File "c:\users\adam\pycharmprojects\scrapy_things\venv\lib\site-packages\scrapy_proxy_pool\policy.py", line 15, in response_is_ban
    if self.BANNED_PATTERN.search(response.text):
  File "c:\users\adam\pycharmprojects\scrapy_things\venv\lib\site-packages\scrapy\http\response\__init__.py", line 108, in text
    raise AttributeError("Response content isn't text")
AttributeError: Response content isn't text
2020-09-15 08:12:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-15 08:12:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 656,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 15802,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.599773,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 15, 6, 12, 59, 743969),
'log_count/ERROR': 1,
'log_count/INFO': 10,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 9, 15, 6, 12, 58, 144196)}
2020-09-15 08:12:59 [scrapy.core.engine] INFO: Spider closed (finished)

I'd appreciate clarification: is this because Scrapy cannot handle the direct link, or is the site built in a way that forces me to use another framework? If so, what would be the concrete scraping solution?

Please make sure you provide accurate code and logs. I could not reproduce the error in my tests; the response content was extracted correctly without any exception.
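To the clarification question: Scrapy has no trouble with a long URL like this one. %5B%5D is merely the URL-encoded form of [] (PHP-style array parameters). If the hand-pasted query string feels unwieldy, it can be rebuilt programmatically; a standard-library sketch, with the parameter order copied from the URL above:

```python
from urllib.parse import urlencode

# Same parameters as the hand-copied URL; a list of tuples preserves order
# and allows the repeated DiplomePromo[] key.
params = [
    ("result", 1),
    ("annuaire_mode", "standard"),
    ("annuaire_as_no", ""),
    ("keyword", ""),
    ("PersonneNom", ""),
    ("PersonnePrenom", ""),
] + [("DiplomePromo[]", year) for year in range(2023, 2009, -1)]

# urlencode percent-encodes the brackets as %5B%5D automatically.
url = "https://www.ponts.org/fr/annuaire/recherche?" + urlencode(params)
print(url)
```

This produces the same query string as the one pasted into start_urls, so the length of the link is not the problem.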

By the way, a quote is not closed in your code:

item.xpath('/div[@class="single_details]/div/text()').get()

The closing " after single_details is missing; the expression should read './div[@class="single_details"]/div/text()' (note also the leading dot: without it the path is evaluated from the document root rather than from the current node).
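As for the AttributeError itself: the traceback shows it originates in the scrapy_proxy_pool middleware, whose ban-detection policy calls response.text on a response whose body cannot be decoded as text (for example a binary body returned through a broken proxy). If you keep using that middleware, a custom ban policy can guard against this. The sketch below is stand-alone illustrative code with invented class names, not scrapy_proxy_pool's actual classes:

```python
import re

# Hypothetical ban pattern, for illustration only.
BANNED_PATTERN = re.compile(r"captcha|access denied", re.IGNORECASE)

class SafeBanPolicy:
    """Ban-detection policy that tolerates non-text responses."""

    def response_is_ban(self, request, response):
        try:
            text = response.text  # raises AttributeError for non-text bodies
        except AttributeError:
            return False  # binary response: do not treat it as a ban
        return bool(BANNED_PATTERN.search(text))

# Minimal stand-ins for demonstration (not Scrapy's Response classes):
class FakeTextResponse:
    text = "<html>Access Denied</html>"

class FakeBinaryResponse:
    @property
    def text(self):
        raise AttributeError("Response content isn't text")

policy = SafeBanPolicy()
print(policy.response_is_ban(None, FakeTextResponse()))   # True: banned page
print(policy.response_is_ban(None, FakeBinaryResponse())) # False: binary body ignored
```

With a guard like this, a single undecodable response no longer crashes the download pipeline.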


Update: code that extracts the desired content with XPath.

import scrapy
from lxml.html import fromstring
from ..items import PontsItems

class Names(scrapy.Spider):
    name = 'ponts'
    start_urls = [
        'https://www.ponts.org/fr/annuaire/recherche?result=1&annuaire_mode=standard&annuaire_as_no=&keyword=&PersonneNom=&PersonnePrenom=&DiplomePromo%5B%5D=2023&DiplomePromo%5B%5D=2022&DiplomePromo%5B%5D=2021&DiplomePromo%5B%5D=2020&DiplomePromo%5B%5D=2019&DiplomePromo%5B%5D=2018&DiplomePromo%5B%5D=2017&DiplomePromo%5B%5D=2016&DiplomePromo%5B%5D=2015&DiplomePromo%5B%5D=2014&DiplomePromo%5B%5D=2013&DiplomePromo%5B%5D=2012&DiplomePromo%5B%5D=2011&DiplomePromo%5B%5D=2010',
    ]

    def parse(self, response):
        items = PontsItems()
        for item in response.xpath('//div[@class="single_desc"]'):
            name = item.xpath('./div[@class="single_libel"]/a/text()').get().strip()
            description = item.xpath('./div[@class="single_details"]').get()
            description = fromstring(description).text_content().strip()
            items['name'] = name
            items['description'] = description
            yield items
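A side note on the update: the extra fromstring()/text_content() round-trip can also be avoided with XPath 1.0's normalize-space() function, which concatenates all descendant text and collapses whitespace in one step. A self-contained sketch using lxml (already a Scrapy dependency); the sample HTML is invented to mirror the page structure:

```python
from lxml import html

# Invented sample markup mirroring the structure scraped above.
SAMPLE = """
<div class="single_desc">
  <div class="single_libel"><a> Jane Doe </a></div>
  <div class="single_details">
    <div>Promo 2020</div>
    <div>Paris</div>
  </div>
</div>
"""

tree = html.fromstring(SAMPLE)
for item in tree.xpath('//div[@class="single_desc"]'):
    # normalize-space() returns the concatenated descendant text with
    # leading/trailing whitespace stripped and internal runs collapsed,
    # so no second parsing pass is needed.
    name = item.xpath('normalize-space(./div[@class="single_libel"]/a)')
    description = item.xpath('normalize-space(./div[@class="single_details"])')
    print(name)         # Jane Doe
    print(description)  # Promo 2020 Paris
```

The same expressions work in Scrapy's own selectors, e.g. item.xpath('normalize-space(./div[@class="single_details"])').get().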
