The crawl process seems to ignore and/or not execute the line yield scrapy.Request(property_file, callback=self.parse_property). The first scrapy.Request in def start_requests goes through and executes correctly, but none of the requests in def parse_navpage do, as shown below.
import scrapy


class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")
        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url)  # Works
        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        print("BEFORE YIELD")
        yield scrapy.Request(property_file, callback=self.parse_property)  # Not going through
        print("AFTER YIELD")

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")
Running scrapy crawl scrape_zoopla from the command line returns:
2022-09-10 20:38:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html> (referer: None)
BEFORE YIELD
AFTER YIELD
2022-09-10 20:38:24 [scrapy.core.engine] INFO: Closing spider (finished)
Both scrapy.Requests request local files, but only the first one works. The files exist and render the pages correctly, and when one of them does not exist the crawl raises "No such file or directory" and is interrupted. Here, though, the crawler seems to simply skip the request without ever issuing it, and returns no error. What is the bug here?
This is a complete shot in the dark, but you could try sending both requests from the start_requests method. Honestly, I don't see why this would work, but it might be worth a try.
import scrapy


class SmartproxySpider(scrapy.Spider):
    name = "scrape_zoopla"
    allowed_domains = ['zoopla.co.uk']

    def start_requests(self):
        # Read source from file
        navpage_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/navpage/NavPage_1.html"
        property_file = f"file:///C:/Users/user/PycharmProjects/ScrapeZoopla/ScrapeZoopla/ScrapeZoopla/spiders/html_source/properties/Property_1.html"
        yield scrapy.Request(navpage_file, callback=self.parse_navpage)
        yield scrapy.Request(property_file, callback=self.parse_property)

    def parse_navpage(self, response):
        listings = response.xpath("//div[starts-with(@data-testid, 'search-result_listing_')]")
        for listing in listings:
            listing_url = listing.xpath(
                "//a[@data-testid='listing-details-link']/@href").getall()  # List of property urls
            break
        print(listing_url)  # Works

    def parse_property(self, response):
        print("PARSE PROPERTY")
        print(response.url)
        print("PARSE PROPERTY AFTER URL")
UPDATE
I just figured out why this happens. It is because you have set the allowed_domains attribute, but the requests you are making target the local filesystem, which naturally does not match any of the allowed domains.
Scrapy assumes that all initial URLs sent from start_requests are allowed, so it performs no validation on them, but every request yielded from a subsequent parse method is checked against the allowed_domains attribute.
Simply remove that line from the top of your spider class and your original structure will work.
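To see why the file:// request gets silently dropped, the domain check can be approximated in plain Python: the request's host is compared against allowed_domains, and a file:// URL has an empty host, so it can never match. This is a simplified sketch of the idea, not Scrapy's actual OffsiteMiddleware code, and the is_offsite helper is a name invented here for illustration:

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    # A request is considered offsite when its host neither equals an
    # allowed domain nor is a subdomain of one.
    host = urlparse(url).netloc
    return not any(host == d or host.endswith("." + d) for d in allowed_domains)


# A file:// URL has an empty host, so it never matches 'zoopla.co.uk'.
print(is_offsite("file:///C:/html_source/properties/Property_1.html",
                 ["zoopla.co.uk"]))                                    # True  -> dropped
print(is_offsite("https://www.zoopla.co.uk/for-sale/",
                 ["zoopla.co.uk"]))                                    # False -> allowed
```

With no allowed_domains set, this filtering step never runs, which is why deleting the attribute lets the request to the local file go through.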