My spider looks like this:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http import Request
from ProjectName.items import ProjectName

class SpidernameSpider(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['webaddress']
    start_urls = ['webaddress/query1']

    rules = (
        Rule(LinkExtractor(restrict_css='horizontal css')),
        Rule(LinkExtractor(restrict_css='vertical css'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        item = ProjectName()
        css_1 = 'css1::text'
        item['1'] = response.css(css_1).extract()
        item = ProjectName()
        css_2 = 'css2::text'
        item['2'] = response.css(css_2).extract()
        return item
My pipeline looks like this:
from scrapy.exceptions import DropItem

class RemoveIncompletePipeline(object):
    def process_item(self, item, spider):
        if item['1']:
            return item
        else:
            raise DropItem("Missing content in %s" % item)
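For completeness, an item pipeline only runs if it is enabled in the project's settings.py. The module path and priority below are assumptions based on the project name shown above:

```python
# settings.py -- register the pipeline (module path and priority are assumptions)
ITEM_PIPELINES = {
    'ProjectName.pipelines.RemoveIncompletePipeline': 300,
}
```

The integer is the execution order (lower runs earlier) when several pipelines are enabled.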
Everything works fine: when the value of field 1 is missing, the corresponding item is dropped from the output. But when I change start_urls so that it runs over multiple queries, like this:
f = open("queries.txt")
start_urls = [url.strip() for url in f.readlines()]
f.close()
Or like this:
start_urls = [i.strip() for i in open('queries.txt').readlines()]
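As an aside, that one-liner never closes the file handle, and a blank trailing line in queries.txt would produce an empty URL. A small helper (the name `load_start_urls` is mine) keeps that contained:

```python
def load_start_urls(path="queries.txt"):
    """Return one stripped URL per non-blank line; the file is closed automatically."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

`start_urls = load_start_urls()` would then replace the open()/readlines() one-liner.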
then the output contains items with a missing value for field 1. What is going on, and how can I avoid it?
For the record, queries.txt looks like:

webaddress/query1
webaddress/query2
According to the docs, you should override the start_requests method.

This method must return an iterable with the first Requests to crawl for this spider.

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, make_requests_from_url() is used instead to create the Requests. This method is also called only once by Scrapy, so it's safe to implement it as a generator.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ProjectName.items import ProjectName

class SpidernameSpider(CrawlSpider):
    name = 'spidername'
    allowed_domains = ['webaddress']
    start_urls = ['webaddress/query1']

    rules = (
        Rule(LinkExtractor(restrict_css='horizontal css')),
        Rule(LinkExtractor(restrict_css='vertical css'),
             callback='parse_item'),
    )

    def start_requests(self):
        return [scrapy.Request(i.strip(), callback=self.parse_item) for i in open('queries.txt').readlines()]

    def parse_item(self, response):
        item = ProjectName()
        css_1 = 'css1::text'
        item['1'] = response.css(css_1).extract()
        item = ProjectName()
        css_2 = 'css2::text'
        item['2'] = response.css(css_2).extract()
        return item
UPD: Just put this code into your spider class:
def start_requests(self):
    return [scrapy.Request(i.strip(), callback=self.parse_item) for i in open('queries.txt').readlines()]
UPD: There is a flaw in the logic of your parse_item method. You need to fix it.
def parse_item(self, response):
    for job in response.css('div.card-top'):
        item = ProjectName()
        # just a quick example
        item['city'] = job.xpath('string(//span[@class="serp-location"])').extract()[0].replace(' ', '').replace('\n', '')
        # TODO: you should fill the other item fields
        # ...
        yield item
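The flaw in the original parse_item is that ProjectName() is instantiated a second time after field '1' is filled, so the first item is thrown away and only field '2' survives into the returned object. A plain-dict sketch (standing in for the Scrapy item, with placeholder values) of the buggy versus fixed flow:

```python
def buggy_parse(value1, value2):
    item = {}          # first item
    item['1'] = value1
    item = {}          # re-created: field '1' is silently lost
    item['2'] = value2
    return item        # only contains '2'

def fixed_parse(value1, value2):
    item = {}          # one item, filled completely
    item['1'] = value1
    item['2'] = value2
    return item        # contains both fields
```

This is why the pipeline saw items with field '1' empty: every returned item was missing it by construction.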