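How do I make Scrapy keep following links?

I'm trying to build a spider that scrapes information from the SCP wiki, follows the link to the next SCP, and keeps doing so.

With my current code, the spider extracts data from the first followed link and then stops following the next link.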

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "scp"
    start_urls = [
        'https://scp-wiki.wikidot.com/scp-002',
    ]

    def parse(self, response):
        for scp in response.xpath('//*[@id="main-content"]'):
            yield {
                'title': scp.xpath('//*[@id="page-content"]/p[1]').get(),
                'tags': scp.xpath('//*[@id="main-content"]/div[4]').get(),
                'class': scp.xpath('//*[@id="page-content"]/p[2]').get(),
                'scp': scp.xpath('//*[@id="page-content"]/p[3]').get(),
                'desc': scp.xpath('//*[@id="page-content"]/p[6]').get(),
            }
        next_page = response.xpath('//*[@id="page-content"]/div[3]/div/p/a[2]/@href').get()
        next_page = 'https://scp-wiki.wikidot.com'+next_page
        print(next_page)
        next_page = response.urljoin(next_page)
        print(next_page)
        yield response.follow(next_page, callback=self.parse)

When I run this spider, I get the following error:

next_page = 'https://scp-wiki.wikidot.com'+next_page
TypeError: can only concatenate str (not "NoneType") to str

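As the error clearly states, a "NoneType" value cannot be concatenated to a str.

This means the next_page variable did not receive any value from the XPath passed to response.xpath().get() on the previous line.

Nothing on the page matched that XPath, so get() returned None.

You can check the Scrapy documentation for details.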
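One way to keep the spider from crashing is to check for None before building the request. The following is only a minimal sketch: the helper name follow_next and the constant NEXT_XPATH are made up for illustration, and the XPath is simply the one from the question, which may not match the link on every page. It relies on response.follow accepting a relative href, so no manual string concatenation is needed.

```python
# XPath copied from the question; whether it matches on a given page is an assumption.
NEXT_XPATH = '//*[@id="page-content"]/div[3]/div/p/a[2]/@href'


def follow_next(response, callback):
    """Yield a follow-up request only if the next-page link was found.

    `response` is expected to behave like a Scrapy response
    (it needs .xpath(...).get() and .follow(...)).
    """
    href = response.xpath(NEXT_XPATH).get()
    if href is None:
        # No element matched: stop here instead of raising TypeError.
        return
    # response.follow resolves relative URLs against the current page.
    yield response.follow(href, callback=callback)
```

Inside parse, this could be used as `yield from follow_next(response, self.parse)` in place of the lines that build next_page by hand. If the spider then stops right away, the XPath itself is probably not matching and is worth testing in the Scrapy shell.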
