Scrapy spider stops before crawling anything



So I have a Django project with a views.py from which I want to call a Scrapy spider when a particular condition is met. The spider does seem to get invoked, but it terminates so quickly that the parse function is never called (at least that's my assumption), as shown below:

2020-11-16 18:51:25 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'products',
'NEWSPIDER_MODULE': 'crawler.spiders',
'SPIDER_MODULES': ['crawler.spiders.my_spider'],
'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, '
'like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
2020-11-16 18:51:25 [scrapy.extensions.telnet] INFO: Telnet Password: ******
2020-11-16 18:51:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
['https://www.tesco.com/groceries/en-GB/products/307358055']
2020-11-16 18:51:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-11-16 18:51:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
[16/Nov/2020 18:51:26] "POST /productsinfo HTTP/1.1" 200 2

views.py

def get_info():
    url = data[product]["url"]
    setup()
    runner(url)
    products = []
    serializer = ProductSerializer(products, many=True)
    return Response(serializer.data)


@wait_for(timeout=10.0)
def runner(url):
    crawler_settings = Settings()
    configure_logging()
    crawler_settings.setmodule(my_settings)
    runner = CrawlerRunner(settings=crawler_settings)
    d = runner.crawl(MySpider, url=url)
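
For context, setup() and @wait_for here are presumably crochet's helpers (the import isn't shown above). If that's the case, one thing worth noting is that crochet's @wait_for only blocks the calling thread until the crawl finishes when the decorated function returns the Deferred produced by runner.crawl(); the sketch below shows that pattern, under the assumption that crochet really is in use and reusing my_settings and MySpider from above:

# Sketch only: assumes setup/wait_for come from crochet and that
# my_settings / MySpider are the same objects as in the question.
from crochet import setup, wait_for
from scrapy.crawler import CrawlerRunner
from scrapy.settings import Settings
from scrapy.utils.log import configure_logging

setup()  # starts the Twisted reactor in a background thread, once per process

@wait_for(timeout=10.0)
def run_crawl(url):
    crawler_settings = Settings()
    crawler_settings.setmodule(my_settings)
    configure_logging()
    runner = CrawlerRunner(settings=crawler_settings)
    # Returning the Deferred lets wait_for block until the crawl has finished
    return runner.crawl(MySpider, url=url)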

my_spider.py

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst
from crawler.items import ScraperItem


class MySpider(scrapy.Spider):
    name = "myspider"

    def __init__(self, *args, **kwargs):
        link = kwargs.get('url')
        self.start_urls = [link]
        super().__init__(**kwargs)

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.parse)

    def parse(self, response):
        # do stuff
        ...
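
The ItemLoader / TakeFirst / ScraperItem imports aren't exercised by the snippet above; for context, a typical parse body using them might look like the sketch below, with the field names purely hypothetical since ScraperItem's definition isn't shown:

def parse(self, response):
    # Hypothetical field names -- the real ScraperItem fields are not shown here
    loader = ItemLoader(item=ScraperItem(), response=response)
    loader.default_output_processor = TakeFirst()
    loader.add_css("name", "h1::text")
    loader.add_value("url", response.url)
    yield loader.load_item()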

Can anyone tell me why this is happening and how I can fix it?

I'm not sure why this happens, but I remember running into a similar problem. Could you change your __init__ and start_requests methods to the following and let me know the result:

def __init__(self, *args, **kwargs):
    self.link = kwargs.get('url')
    super().__init__(**kwargs)

def start_requests(self):
    yield scrapy.Request(url=self.link, callback=self.parse)
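
Nothing else should need to change if you try this: the view still passes the URL through runner.crawl(MySpider, url=url), so kwargs.get('url') receives the same value as before.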
