Following links and extracting data with Scrapy

Basically, I want to recursively follow each link and extract data from it. The problem I'm running into is that `finalTag` is a list of strings holding the URL of every link I want to follow. But if I build the request as `request = scrapy.Request(finalTag, callback=self.parse2)`, Scrapy complains that it is not a string. I also tried `str(finalTag)` in place, but that didn't work either.

So here is the code I have so far:

```
import scrapy

class RecursionSpider(scrapy.Spider):
    name = 'recursion'
    start_urls = ['https://www.jobbank.gc.ca/jobsearch/?fper=L&fper=P&fter=S&page=2&sort=M&fprov=ON#article-32316546']

    def parse(self, response):
        tag = response.xpath('//a/@href').extract()
        # Extracting all the href tags to the new links
        tag = [str for str in tag if '/jobsearch/jobposting' in str]
        finalTag = ['https://www.jobbank.gc.ca' + tag for tag in tag]
        request = scrapy.Request(finalTag, callback=self.parse2)
        yield request

    def parse2(self, response):
        # Extracting the content using css selectors
        vacancy = response.xpath('//span/text()').extract()
        status = response.css('span.attribute-value::text').extract()
        duration = response.css('span.attribute-value::text').extract()
        jobID = response.css('span::text').extract()
        vacancy = [str for str in vacancy if "Vacanc" in str]
        vacancy.remove('Vacancies')
        del status[1]
        del duration[0]
        duration = map(lambda s: s.strip(), duration)
        jobID = [str for str in jobID if "146" in str]
        for item in zip(vacancy, status, duration, jobID):
            # Create a dictionary to store the scraped info
            scraped_info = {
                'vacancy': item[0],
                'status': item[1],
                'duration': item[2],
                'job id': item[3],
            }
            # Yield or give the scraped info to scrapy
            yield scraped_info
```

If `finalTag` is a list, iterate over each member of that list and call `scrapy.Request()` once per member:

```
for url in finalTag:
    request = scrapy.Request(url, callback=self.parse2)
    yield request
```
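
As a side note, here is a minimal sketch of what the whole `parse` method could look like with that loop, using Scrapy's `response.follow` (available since Scrapy 1.4) instead of `scrapy.Request`. Because `response.follow` resolves relative URLs against the current page, the manual `'https://www.jobbank.gc.ca'` prefix from the question becomes unnecessary; the filtering logic mirrors the question's:

```
import scrapy

class RecursionSpider(scrapy.Spider):
    name = 'recursion'
    start_urls = ['https://www.jobbank.gc.ca/jobsearch/?fper=L&fper=P&fter=S&page=2&sort=M&fprov=ON#article-32316546']

    def parse(self, response):
        # Collect all hrefs and keep only the job-posting links
        links = [href for href in response.xpath('//a/@href').extract()
                 if '/jobsearch/jobposting' in href]
        for link in links:
            # response.follow accepts a relative URL and yields one
            # Request per link, each with parse2 as its callback
            yield response.follow(link, callback=self.parse2)
```

Either way, the key point is the same: each request must be built from a single URL string, never from the list itself.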
