I'm trying to understand how link extractors work in Scrapy. What I'm trying to accomplish:
- Follow the pagination on the start page
- Scan all links on those pages for URLs matching a pattern
- On each page found, follow another link matching a second pattern and scrape that page
My code:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ToScrapeMyspider(CrawlSpider):
    name = "myspider"
    allowed_domains = ["myspider.com"]
    start_urls = ["www.myspider.com/category.php?k=766"]

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//link[@rel="next"]/a'), follow=True),
        Rule(LinkExtractor(allow=r"/product.php?p=d+$"), callback='parse_spider'),
    )

    def parse_spider(self, response):
        Request(allow=r"/product.php?e=d+$", callback=self.parse_spider2)

    def parse_spider2(self, response):
        # EXTRACT AND PARSE DATA HERE ETC (IS WORKING)
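As an aside, `allow=` takes a regular expression, and in `r"/product.php?p=d+$"` the unescaped metacharacters change its meaning: `.` matches any character, `?` makes the preceding `p` optional, and `d+` matches literal letter d's, so the pattern never matches a real product URL. The intended pattern is presumably `r"/product\.php\?p=\d+$"`; a quick check:

```python
import re

broken = re.compile(r"/product.php?p=d+$")     # '?' and 'd+' read as regex syntax
fixed = re.compile(r"/product\.php\?p=\d+$")   # escaped: literal '.', '?', and digits

url = "/product.php?p=123"
print(bool(broken.search(url)))  # False
print(bool(fixed.search(url)))   # True
```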
My pagination links look like this:
<link rel="next" href="https://myspider.com/category.php?k=766&amp;s=100" >
At first I got the error 'str' object has no attribute 'iter' from restrict_xpaths, but I think I've messed something up along the way.
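For reference, that error usually means the XPath given to `restrict_xpaths` returned strings rather than elements: the link extractor walks each selected node with `.iter()` looking for `<a>`/`<area>` tags inside it, and attribute or text results are plain strings, which have no `.iter()` method. (Note also that `//link[@rel="next"]/a` selects `<a>` children of the `<link>` tag, which is self-closing and has none, so it matches nothing at all.) A minimal illustration using only the standard library, not Scrapy itself:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<div><a href="/product.php?p=1">link</a></div>')

element = root.find('a')         # an Element, like XPath '//a': has .iter()
attribute = element.get('href')  # a plain str, like XPath '//a/@href'

print(hasattr(element, 'iter'))    # True
print(hasattr(attribute, 'iter'))  # False -> the source of the error
```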
Finally got it working:
rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@rel="next"]',)), follow=True),
    Rule(LinkExtractor(allow=('product.php',)), callback='parse_spy'),
)

BASE_URL = 'https://myspider.com/'

def parse_spy(self, response):
    links = response.xpath('//li[@id="id"]/a/@href').extract()
    for link in links:
        absolute_url = self.BASE_URL + link
        yield scrapy.Request(absolute_url, callback=self.parse_spider2)
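One small improvement over manual concatenation: `self.BASE_URL + link` breaks if the site ever emits root-relative or absolute hrefs. The standard library's `urljoin` (which Scrapy's `response.urljoin` wraps) handles all of these cases; a sketch:

```python
from urllib.parse import urljoin

BASE_URL = 'https://myspider.com/'

# Relative hrefs resolve against the base...
print(urljoin(BASE_URL, 'product.php?p=123'))    # https://myspider.com/product.php?p=123
# ...and root-relative and absolute hrefs are handled correctly too.
print(urljoin(BASE_URL, '/category.php?k=766'))  # https://myspider.com/category.php?k=766
print(urljoin(BASE_URL, 'https://other.com/x'))  # https://other.com/x
```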