Scrapy - ValueError: Missing scheme in request url: h



When running the command "scrapy crawl weather_spider2 -o output.json" I get the error [scrapy.core.engine] ERROR: Error while obtaining start requests,

followed by ValueError: Missing scheme in request url: h. I read some posts on Stack Overflow and tried to fix it, but nothing helped.

My code:

import scrapy
import re
from weather_parent.weather_spider.items import WeatherItem

class WeatherSpiderSpider(scrapy.Spider):
    name = "weather_spider2"
    allowed_domains = 'https://weather.com'
    start_urls = ['https://weather.com/en-MT/weather/today/l/bf01d09009561812f3f95abece23d16e123d8c08fd0b8ec7ffc9215c0154913c']

    def parse_url(self, response):
        city = response.xpath('//h1[contains(@class,"location")].text()').get()
        temp = response.xpath('//span[@data-testid="TemperatureValue"]/text()').get()
        air_quality = response.xpath('//span[@data-testid="AirQualityCategory"]/text()').get()
        cond = response.xpath('//div[@data-testid="wxPhrase"]/text()').get()
        item = WeatherItem()
        item["city"] = city
        item["temp"] = temp
        item["air_quality"] = air_quality
        item["cond"] = cond
        yield item

The error:

2021-10-25 22:07:39 [scrapy.core.engine] INFO: Spider opened
2021-10-25 22:07:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-10-25 22:07:39 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-10-25 22:07:39 [scrapy.core.engine] ERROR: Error while obtaining start requests
Traceback (most recent call last):
  File "D:\ca nhan\Anaconda\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "D:\ca nhan\Anaconda\weather_parent\weather_spider\spiders\crawl_weather.py", line 12, in start_requests
    yield scrapy.Request(url = url, callback= self.parse_url)
  File "D:\ca nhan\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "D:\ca nhan\Anaconda\lib\site-packages\scrapy\http\request\__init__.py", line 73, in _set_url
    raise ValueError(f'Missing scheme in request url: {self._url}')
ValueError: Missing scheme in request url: h
2021-10-25 22:07:39 [scrapy.core.engine] INFO: Closing spider (finished)
2021-10-25 22:07:39 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.015959,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 10, 25, 15, 7, 39, 874679),
'log_count/ERROR': 1,
'log_count/INFO': 10,
'start_time': datetime.datetime(2021, 10, 25, 15, 7, 39, 858720)}

Rename the parse_url function to parse, which is the default callback Scrapy uses to process downloaded responses when their requests don't specify one. The city xpath is wrong: use /text() instead of .text(). For allowed_domains, if your target url is https://www.example.com/1.html, then add 'example.com' to the list. It must be a list of bare domains, not a URL string. Nothing else needs to change; everything else is fine.
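As for where the stray "h" comes from: "Missing scheme in request url: h" almost always means a URL string was iterated character by character, e.g. when start_urls (or a loop in a custom start_requests) holds a plain string instead of a list. A minimal sketch of the mistake:

```python
# Iterating a plain string yields one character per step, so Scrapy tries to
# build a Request from 'h' -- the first character of 'https://...'.
start_urls = 'https://weather.com'        # wrong: a string, not a list
first = next(iter(start_urls))
print(first)                              # 'h'

# Wrapping the URL in a list fixes it: iterating the list yields whole URLs.
start_urls = ['https://weather.com']
print(next(iter(start_urls)))             # 'https://weather.com'
```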

from scrapy.crawler import CrawlerProcess
import scrapy
import re
# from weather_parent.weather_spider.items import WeatherItem

class WeatherSpiderSpider(scrapy.Spider):
    name = "weather_spider2"
    allowed_domains = ['weather.com']
    start_urls = ['https://weather.com/en-MT/weather/today/l/bf01d09009561812f3f95abece23d16e123d8c08fd0b8ec7ffc9215c0154913c']

    def parse(self, response):
        city = response.xpath('//h1[contains(@class,"location")]/text()').get()
        temp = response.xpath('//span[@data-testid="TemperatureValue"]/text()').get()
        air_quality = response.xpath('//span[@data-testid="AirQualityCategory"]/text()').get()
        cond = response.xpath('//div[@data-testid="wxPhrase"]/text()').get()
        item = {}
        item["city"] = city
        item["temp"] = temp
        item["air_quality"] = air_quality
        item["cond"] = cond
        yield item

process = CrawlerProcess()
process.crawl(WeatherSpiderSpider)
process.start()
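If you're unsure what belongs in allowed_domains, one way to derive it from a full URL is the standard library's urlparse (a sketch; the URL is the one from the question):

```python
from urllib.parse import urlparse

url = 'https://weather.com/en-MT/weather/today/l/bf01d09009561812f3f95abece23d16e123d8c08fd0b8ec7ffc9215c0154913c'
domain = urlparse(url).netloc   # 'weather.com' -- no scheme, no path
allowed_domains = [domain]      # allowed_domains must be a list of bare domains
print(allowed_domains)
```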

Output:
{'city': 'Chennai, Tamil Nadu, India Weather', 'temp': '29°', 'air_quality': 'Unhealthy for Sensitive Groups', 'cond': 'Partly Cloudy'}