Why is the callback not executed immediately after calling yield in Scrapy?

I am building a web scraper to crawl remote job listings. The spider behaves in a way I don't understand, and I would appreciate it if someone could explain why.

Here is the spider's code:

import scrapy
import time


class JobsSpider(scrapy.Spider):
    name = "jobs"

    start_urls = [
        "https://stackoverflow.com/jobs/remote-developer-jobs"
    ]

    already_visited_links = []

    def parse(self, response):
        jobs = response.xpath("//div[contains(@class, 'job')]")
        links_to_next_pages = response.xpath("//a[contains(@class, 's-pagination--item')]").css("a::attr(href)").getall()

        # visit each job page (as I do in the browser) and scrape the relevant information (Job title etc.)
        for job in jobs:
            job_id = int(job.xpath('@data-jobid').extract_first())  # there will always be one element
            # now visit the link with the job_id and get the info
            job_link_to_visit = "https://stackoverflow.com/jobs?id=" + str(job_id)
            request = scrapy.Request(job_link_to_visit,
                                     callback=self.parse_job)
            yield request

        # sleep for 10 seconds before requesting the next page
        print("Sleeping for 10 seconds...")
        time.sleep(10)

        # go to the next job listings page (if you haven't already been there)
        # not sure if this solution is the best since it has a loop which has a recursion in it
        for link_to_next_page in links_to_next_pages:
            if link_to_next_page not in self.already_visited_links:
                self.already_visited_links.append(link_to_next_page)
                yield response.follow(link_to_next_page, callback=self.parse)

        print("End of parse method")

    def parse_job(self, response):
        print("In parse_job")
        print(response.body)
        print("Sleeping for 10 seconds...")
        time.sleep(10)

Here is the output (relevant parts):

Sleeping for 10 seconds...
End of parse method
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525754> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525748> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=497114> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=523136> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:49:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=525730> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
In parse_job
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs/remote-developer-jobs?so_source=JobSearch&so_medium=Internal> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=523319> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522480> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=511761> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522483> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=249610> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
2021-04-29 20:50:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://stackoverflow.com/jobs?id=522481> (referer: https://stackoverflow.com/jobs/remote-developer-jobs)
In parse_job
In parse_job
In parse_job
In parse_job
...

I don't understand why the parse method runs to completion before the parse_job method is ever called. My understanding was that as soon as I yield a request for a job from jobs, the parse_job method should be called. The spider should go through each page of job listings and visit the details page of every individual job on that listings page. However, that description does not match the output. I also don't understand why there are multiple GET requests between successive calls to the parse_job method.

Can someone explain what is going on here?

Scrapy is event-driven. First, requests are queued by the Scheduler. The queued requests are passed to the Downloader. When a response has been downloaded and is ready, the callback function is invoked, with the response passed to it as its first argument.
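
To make that ordering concrete, here is a toy sketch in plain Python (no Scrapy involved; the names are made up for illustration): the engine consumes your parse() generator in the single reactor thread, so each yielded request is merely queued, and callbacks can only run once that thread is free and the corresponding responses have arrived.

# Toy model, for illustration only: the engine drains the parse()
# generator in its single thread; callbacks run afterwards, one per response.
def parse():
    for i in range(3):
        print(f"yielding request {i}")   # analogous to "yield request" in the spider
        yield f"request-{i}"
    print("End of parse method")

queued_requests = list(parse())    # the engine consumes the generator first...
for request in queued_requests:    # ...then responses are handed to callbacks
    print(f"callback invoked for {request}")

This is why "End of parse method" appears in your log before the first "In parse_job".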

You are blocking the callback with time.sleep(). In the logs provided, after the first callback invocation the process was blocked inside parse_job() for 10 seconds, but in the meantime the Downloader kept working and prepared responses for the callback function, which is evident from the consecutive DEBUG: Crawled (200) log lines after the first parse_job() call. So while the callback was blocked, the Downloader finished its work and the responses were queued up to be fed to the callback function. As the last part of the log makes obvious, passing responses to the callback function becomes the bottleneck.
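
As an aside: if you ever do need a real pause inside a callback, Scrapy 2.0+ accepts async def callbacks, so the delay can be awaited instead of blocking the whole engine. A minimal sketch, assuming the asyncio reactor is enabled in settings.py:

import asyncio
import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    # assumes settings.py contains:
    # TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

    async def parse_job(self, response):
        print("In parse_job")
        await asyncio.sleep(10)  # non-blocking: the engine keeps serving other callbacks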

If you want a delay between requests, it is better to use the DOWNLOAD_DELAY setting instead of time.sleep().
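
For example, a minimal sketch (the value 10 simply mirrors the sleep from the question); the setting can live in the project's settings.py or per spider via custom_settings:

import scrapy

class JobsSpider(scrapy.Spider):
    name = "jobs"
    custom_settings = {
        # wait ~10 seconds between consecutive requests to the same domain;
        # RANDOMIZE_DOWNLOAD_DELAY is True by default, so the actual delay
        # varies between 0.5x and 1.5x this value
        "DOWNLOAD_DELAY": 10,
    }

Unlike time.sleep(), this delay is handled by the Downloader itself, so the reactor thread stays free to run your callbacks.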

See the Scrapy architecture overview (https://docs.scrapy.org/en/latest/topics/architecture.html) for more details on how these components fit together.
