Is it possible to run the pipeline and crawl multiple URLs at the same time?



My spider looks like this:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from ProjectName.items import ProjectName
class SpidernameSpider(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['webaddress']
    start_urls = ['webaddress/query1']
    rules = (
            Rule(LinkExtractor(restrict_css='horizontal css')),
            Rule(LinkExtractor(restrict_css='vertical css'),
                     callback='parse_item')
            )
    def parse_item(self, response):
        item = ProjectName()
        css_1 = 'css1::text'
        item['1'] = response.css(css_1).extract()
        css_2 = 'css2::text'
        item['2'] = response.css(css_2).extract()
        return item

My pipeline looks like this:

from scrapy.exceptions import DropItem
class RemoveIncompletePipeline(object):
    def process_item(self, item, spider):
        if item['1']:
            return item
        else:
            raise DropItem("Missing content in %s" % item)
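
For completeness, the pipeline is enabled in settings.py in the usual way (the module path below is just an assumption based on the project name):

# settings.py -- assumed module path, adjust to wherever the pipeline class actually lives
ITEM_PIPELINES = {
    'ProjectName.pipelines.RemoveIncompletePipeline': 300,
}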

Everything works fine: when the value of field 1 is missing, the corresponding item is dropped from the output.

But when I change start_urls so that the job covers multiple queries, like this:

f = open("queries.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()

or like this:

start_urls = [i.strip() for i in open('queries.txt').readlines()]

then the output contains items with missing values for field 1.

What is going on, and how can I avoid it?

For the record, queries.txt looks like this:

webaddress/query1
webaddress/query2

According to the docs, you should override the start_requests method.

This method must return an iterable with the first Requests to crawl for this spider.

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is also called only once from Scrapy, so it's safe to implement it as a generator.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ProjectName.items import ProjectName
class SpidernameSpider(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['webaddress']
    start_urls = ['webaddress/query1']
    rules = (
            Rule(LinkExtractor(restrict_css='horizontal css')),
            Rule(LinkExtractor(restrict_css='vertical css'),
                     callback='parse_item')
            )
    def start_requests(self):
        return [scrapy.Request(i.strip(), callback=self.parse_item) for i in open('queries.txt').readlines()]
    def parse_item(self, response):
        item = ProjectName()
        css_1 = 'css1::text'
        item['1'] = response.css(css_1).extract()
        css_2 = 'css2::text'
        item['2'] = response.css(css_2).extract()
        return item

UPD: just put this code into your spider class:

def start_requests(self):
    return [scrapy.Request(i.strip(), callback=self.parse_item) for i in open('queries.txt').readlines()]
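
Since the docs say it is safe to implement start_requests as a generator, here is a sketch that does the same thing but also closes the file cleanly (same queries.txt assumption as above):

def start_requests(self):
    # one URL per line in queries.txt; yield a Request for each non-empty line
    with open('queries.txt') as f:
        for line in f:
            url = line.strip()
            if url:
                yield scrapy.Request(url, callback=self.parse_item)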

UPD: your parse_item method has faulty logic. You need to fix it.

def parse_item(self, response):
    for job in response.css('div.card-top'):
        item = ProjectName()
        # just a quick example; note the ".//" so the XPath is relative to the current job
        item['city'] = job.xpath('string(.//span[@class="serp-location"])').extract()[0].replace(' ', '').replace('\n', '')
        # TODO: you should fill in the other item fields
        # ...
        yield item
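
Relatedly, if a field is never populated at all, item['1'] in the pipeline raises a KeyError instead of dropping the item. Here is a sketch of the same check written with .get(), assuming the field names from your code:

from scrapy.exceptions import DropItem

class RemoveIncompletePipeline(object):
    def process_item(self, item, spider):
        # .get() returns None when the field was never set, so the item is dropped instead of crashing
        if item.get('1'):
            return item
        raise DropItem("Missing content in %s" % item)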
