I'm trying to scrape a site that has pagination links, so I wrote this:
```python
import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s' % page for page in range(1, 20)]
```
and it works!! With a single URL it works fine, but when I try this:
```python
import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s',
                  'https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page for page in range(1, 20)]
```
it doesn't work. How can I achieve the same logic for multiple URLs? Thanks.
One way is to use the start_requests() method of scrapy.Spider instead of the start_urls attribute. You can find more information in the Scrapy documentation.
```python
import scrapy

class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['dummymart.com']

    def start_requests(self):
        for page in range(1, 20):
            yield scrapy.Request(
                url='https://www.dummymart.net/product/auto-parts--118?page%s' % page,
                callback=self.parse,
            )
            yield scrapy.Request(
                url='https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page,
                callback=self.parse,
            )
```
If you want to keep using the start_urls attribute, you could try something like this (I haven't tested it):

```python
start_urls = ['https://www.dummymart.net/product/auto-parts--118?page%s' % page
              for page in range(1, 20)] + \
             ['https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s' % page
              for page in range(1, 20)]
```
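Instead of concatenating one comprehension per URL, the same list can be built from a list of URL templates, which scales better as you add more pages. This is just a sketch of the idea (the `templates` name is mine, not from the original code):

```python
# Hypothetical consolidation: build start_urls from a list of URL templates.
# Each '%s' placeholder is filled with a page number, as in the snippets above.
templates = [
    'https://www.dummymart.net/product/auto-parts--118?page%s',
    'https://www.dummymart.net/product/accessories-tools--112?id=1316264860?page%s',
]
start_urls = [tmpl % page for tmpl in templates for page in range(1, 20)]
```

This yields 19 pages for each of the two templates, i.e. 38 URLs in total, and adding a third paginated listing is a one-line change.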
Also note that in the allowed_domains attribute you only need to specify the domain itself (e.g. dummymart.com), not a full URL with a path. See the Scrapy documentation on allowed_domains.
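If you're unsure what the bare domain of a URL is, Python's standard urllib.parse can show you; this is a quick illustration, separate from Scrapy itself:

```python
from urllib.parse import urlparse

# urlparse splits a URL into components; netloc is the host part.
# Stripping a leading 'www.' gives the bare domain suitable for allowed_domains.
url = 'https://www.dummymart.net/product/auto-parts--118?page1'
host = urlparse(url).netloc
domain = host.removeprefix('www.')
```

Here `host` is `'www.dummymart.net'` and `domain` is `'dummymart.net'`; note that the `/product` path is not part of the domain.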