scrapy.Request not going through



The crawl seems to ignore and/or never execute the line yield scrapy.Request(property_file, callback=self.parse_property). The first scrapy.Request in def start_requests goes through and is executed correctly, but none of the requests in def parse_navpage are, as shown below.

import scrapy


class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")
        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url)  # Works
        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        print("BEFORE YIELD")
        yield scrapy.Request(property_file, callback=self.parse_property)  # Not going through
        print("AFTER YIELD")

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")

Running scrapy crawl scrape_zoopla in the terminal returns:

2022-09-10 20:38:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html> (referer: None)
BEFORE YIELD
AFTER YIELD
2022-09-10 20:38:24 [scrapy.core.engine] INFO: Closing spider (finished)

Both scrapy.Requests fetch local files, but only the first one goes through. The files exist and render their pages correctly, and when one of the files didn't exist the spider raised "No such file or directory" and aborted. Here, though, the crawler seems to skip straight past the request without ever issuing it, and reports no error at all. What is the mistake here?

This is a complete shot in the dark, but you could try sending both requests from the start_requests method. Honestly, I don't see why this should make a difference, but it may be worth a try.

import scrapy


class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)
        yield scrapy.Request(property_file, callback=self.parse_property)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")
        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url)  # Works

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")

UPDATE

I just figured out why this happens. It is because you have the allowed_domains attribute set, while the requests you are making point at the local filesystem, which naturally doesn't match the allowed domain.

Scrapy assumes that all the initial URLs sent from start_requests are allowed, so it doesn't run any validation on those; but every request yielded from the subsequent parse methods is checked against the allowed_domains attribute, and any that don't match are silently dropped.
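If you wanted to keep allowed_domains in place for the eventual live crawl, marking the request with dont_filter should also get it through, since Scrapy's OffsiteMiddleware doesn't filter requests that have that flag set. A minimal sketch of the one changed line in parse_navpage (assuming everything else stays as in your original spider):

yield scrapy.Request(property_file, callback=self.parse_property, dont_filter=True)  # skips the offsite (and duplicate) filtering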

The simplest fix, though, is to remove that line from the top of your spider class; your original structure will then work fine.
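For reference, a minimal sketch of the corrected spider, i.e. your original structure with only the allowed_domains line removed (prints trimmed for brevity):

import scrapy


class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    # allowed_domains removed: file:// URLs have no matching host, so the
    # offsite filter was silently dropping every request yielded from a
    # parse method.

    def start_requests(self):
        navpage_file = "file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)

    def parse_navpage(self, response):
        property_file = "file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        # This request now passes the offsite check and reaches parse_property
        yield scrapy.Request(property_file, callback=self.parse_property)

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)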
