Generating Scrapy requests at a depth of 2

I'm having trouble getting logging out of a function that sits two requests down from the starting parse method. Here's the code:

from datetime import datetime
import scrapy
import requests
import re
import os

class ScrapyTest(scrapy.Spider):
    """
    Generic crawler
    """
    name = "test"
    start_urls = [
        'http://www.reddit.com',
    ]

    def __init__(self, *args, **kwargs):
        super(ScrapyTest, self).__init__(*args, **kwargs)

    def parse(self, response):
        """
        Entry point for the crawler
        """
        self.logger.debug('starting off in the parse function')
        yield scrapy.Request(self.start_urls[0], callback=self.parse_hw_post)

    def parse_hw_images(self, image_links):
        self.logger.debug("inside parse_hw_images about to scrapy request parse_hw_image")
        yield scrapy.Request(self.start_urls[0], callback=self.parse_hw_image)

    def parse_hw_image(self, response):
        self.logger.debug('inside ________internal________ parse hw image')
        yield 'test string to yield in to'

    def parse_hw_post(self, response):
        # Save the images to a tmp directory for now
        self.logger.debug('in parse_hw_post')
        self.parse_hw_images('whatever')

Right now the only logging that shows up is 'starting off in the parse function', followed by 'inside parse_hw_images about to scrapy request parse_hw_image'.

The expected behavior is:

1. parse
2. parse_hw_post
3. parse_hw_images
4. parse_hw_image

Can anyone see what I'm doing wrong?

yield scrapy.Request(self.start_urls[0], callback=self.parse_hw_post) requests the same URL that was already crawled from start_urls, so Scrapy filters it out as a duplicate and the callback never runs.

Set DUPEFILTER_DEBUG=True to see the duplicate URLs being dropped.
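If re-requesting the same URL is intentional, Scrapy's Request accepts a dont_filter=True flag that bypasses the duplicate filter. A minimal sketch combining it with the DUPEFILTER_DEBUG setting (spider name and URL taken from the question):

    import scrapy

    class ScrapyTest(scrapy.Spider):
        name = "test"
        start_urls = ['http://www.reddit.com']
        # Log every request the duplicate filter drops
        custom_settings = {'DUPEFILTER_DEBUG': True}

        def parse(self, response):
            # dont_filter=True schedules this request even though the
            # same URL was already crawled from start_urls
            yield scrapy.Request(self.start_urls[0],
                                 callback=self.parse_hw_post,
                                 dont_filter=True)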

The second problem is that parse_hw_post calls self.parse_hw_images('whatever') but never iterates the generator it returns, so the request inside it is never handed back to Scrapy. Yield the requests out:

    def parse_hw_post(self, response):
        # Save the images to a tmp directory for now
        self.logger.debug('in parse_hw_post')
        for req in self.parse_hw_images('whatever'):
            yield req
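For reference, here is a sketch of the whole spider with both fixes applied (same callbacks and URL as in the question; the final callback yields a dict because a Scrapy callback may only yield requests, items or dicts, and the bare string in the original code would raise an error):

    import scrapy

    class ScrapyTest(scrapy.Spider):
        """
        Generic crawler
        """
        name = "test"
        start_urls = ['http://www.reddit.com']

        def parse(self, response):
            self.logger.debug('starting off in the parse function')
            # dont_filter=True: the dupefilter would otherwise drop this
            # request, since start_urls already crawled the same URL
            yield scrapy.Request(self.start_urls[0],
                                 callback=self.parse_hw_post,
                                 dont_filter=True)

        def parse_hw_post(self, response):
            self.logger.debug('in parse_hw_post')
            # Iterate the helper's generator so its requests reach the scheduler
            for req in self.parse_hw_images('whatever'):
                yield req

        def parse_hw_images(self, image_links):
            self.logger.debug('inside parse_hw_images about to scrapy request parse_hw_image')
            yield scrapy.Request(self.start_urls[0],
                                 callback=self.parse_hw_image,
                                 dont_filter=True)

        def parse_hw_image(self, response):
            self.logger.debug('inside parse_hw_image')
            # A callback must yield a Request, an item or a dict;
            # yielding a bare string is rejected by Scrapy
            yield {'test': 'test string to yield'}

On Python 3 the for loop in parse_hw_post can be shortened to yield from self.parse_hw_images('whatever').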
