Scraping pagination links with start urls doesn't work



I'm trying to scrape a website that has pagination links, so I did this:

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s' % page for page in range(1, 20)]

And it worked!! With a single URL it works fine, but when I try this:

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s',
                  'https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page for page in range(1, 20)]

it doesn't work. How can I apply the same logic to multiple URLs? Thanks

One approach is to use the start_requests() method of scrapy.Spider instead of the start_urls attribute. You can read more about it in the Scrapy documentation.

import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['dummymart.net']

    def start_requests(self):
        # Yield one request per page for each of the two URL templates
        for page in range(1, 20):
            yield scrapy.Request(
                url='https://www.dummymart.net/product/auto-parts--118?page%s' % page,
                callback=self.parse,
            )
            yield scrapy.Request(
                url='https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page,
                callback=self.parse,
            )

If you want to keep using the start_urls attribute, you can try something like this (I haven't tested it):

start_urls = (
    ['https://www.dummymart.net/product/auto-parts--118?page%s' % page
     for page in range(1, 20)]
    + ['https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page
       for page in range(1, 20)]
)

Also note that in the allowed_domains attribute you only need to specify the domain itself, not a full URL or path. See the Scrapy documentation on allowed_domains.
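
For example, using the domains from your spider, the attribute would look like this:

# Wrong: contains a path, so Scrapy's offsite filtering won't match it
allowed_domains = ['www.dummymart.com/product']

# Right: just the domain; subdomains such as www. are matched automatically
allowed_domains = ['dummymart.net']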