LinkExtractor in Scrapy: pagination and links two levels deep



I am trying to understand how the LinkExtractor works in Scrapy. What I am trying to accomplish:

  • Follow the pagination on the start page

  • Search those URLs and scan all links on them for a matching pattern

  • On each page found that way, follow another link on that page matching a pattern and scrape that page

My code:

class ToScrapeMyspider(CrawlSpider):
    name            = "myspider"
    allowed_domains = ["myspider.com"]
    start_urls      = ["www.myspider.com/category.php?k=766"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//link[@rel="next"]/a'), follow=True),
        Rule(LinkExtractor(allow=r"/product.php?p=d+$"), callback='parse_spider')
)
    def parse_spider(self, response):
        Request(allow=r"/product.php?e=d+$",callback=self.parse_spider2)
    def parse_spider2(self, response):
        #EXTRACT AND PARSE DATA HERE ETC (IS WORKING)
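
As far as I understand, scrapy.Request wants a concrete URL and has no allow argument (and in the regex the dot and question mark would need escaping, with \d+ rather than d+), so I think the second-level links have to be extracted inside the callback first. A rough sketch of what I mean, where the allow pattern is only a placeholder for the real /product.php URLs:

import scrapy
from scrapy.linkextractors import LinkExtractor

def parse_spider(self, response):
    # Find the second-level product links on the page that was just matched.
    # The allow pattern below is a placeholder; adjust it to the real URL scheme.
    for link in LinkExtractor(allow=r"/product\.php\?e=\d+$").extract_links(response):
        yield scrapy.Request(link.url, callback=self.parse_spider2)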

My pagination links look like this:

<link rel="next" href="https://myspider.com/category.php?k=766&amp;s=100" >

First of all, I get this error from restrict_xpaths:
'str' object has no attribute 'iter'

But I guess I have messed something up there.
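
If the pagination is really only exposed through that <link rel="next" href="..."> element, then from what I can tell LinkExtractor only scans <a> and <area> tags by default, so the <link> tag would have to be named explicitly. A sketch under that assumption:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Assumes the next-page URL is only available as <link rel="next" href="...">.
next_page = LinkExtractor(
    restrict_xpaths=('//link[@rel="next"]',),
    tags=('link',),   # default is ('a', 'area'), which ignores <link> elements
    attrs=('href',),
)

rules = (
    Rule(next_page, follow=True),
    # ... the product-page rule would go here ...
)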

Finally got it working:

rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@rel="next"]',)), follow=True),
    Rule(LinkExtractor(allow=('product.php', )), callback='parse_spy'),
)

BASE_URL = 'https://myspider.com/'

def parse_spy(self, response):
    # Collect the second-level links and request each one for the final parsing step.
    links = response.xpath('//li[@id="id"]/a/@href').extract()
    for link in links:
        absolute_url = self.BASE_URL + link
        yield scrapy.Request(absolute_url, callback=self.parse_spider2)
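
As a small variation, the response object can also resolve the URL itself, which copes with relative hrefs and avoids the hard-coded BASE_URL; a sketch of the same loop using response.urljoin:

def parse_spy(self, response):
    # Same loop, but let Scrapy resolve relative links against the response URL.
    for link in response.xpath('//li[@id="id"]/a/@href').extract():
        yield scrapy.Request(response.urljoin(link), callback=self.parse_spider2)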
