I am trying to chain two spiders:
spider1 crawls pages and builds a list of URLs into a .csv
spider2 scrapes specific data from the URLs in that .csv
I keep getting this error: with open('urls.csv') as file: FileNotFoundError: [Errno 2] No such file or directory: 'urls.csv'
It looks like spider1 never gets to run first, and/or Python checks for urls.csv as soon as it reads the code, and errors out because the file does not exist yet.
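The timing is the key: Python executes a class body at the moment the class statement runs, not when the spider is later started, so the with open('urls.csv') in spider2 runs before either crawl has been scheduled. A minimal illustration of that behavior (demo.csv is just a hypothetical file name):

class Demo:
    # this open() runs as soon as the class statement is executed,
    # long before any instance of Demo is created or used;
    # if demo.csv does not exist yet, it raises FileNotFoundError immediately
    with open('demo.csv') as f:
        lines = [line.strip() for line in f]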
Here is the chained-crawl piece - I grabbed it from GitHub, but the link no longer seems to be up. I have tried placing it in different spots, even duplicating it or splitting it apart.
def crawl():
    yield runner.crawl(spider1)
    yield runner.crawl(spider2)
    reactor.stop()

crawl()
reactor.run()
I am fine with having urls.csv hold the URLs, but it would be better to store the URLs in a list as a variable [though I have not figured out the syntax to do that].
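For the list-as-a-variable idea, one possible shape (only a sketch, not tested against this site, assuming both spiders run in the same process as in the script below and using a hypothetical module-level list called collected_urls):

import scrapy

collected_urls = []  # hypothetical module-level list shared by both spiders

class spider1(scrapy.Spider):
    name = 'spider1'
    # start_urls / custom_settings as in the full code below

    def parse(self, response):
        for job in response.xpath('//@href').getall():
            url = response.urljoin(job)
            collected_urls.append(url)  # collect into the list instead of relying only on the csv feed
            yield {'url': url}

class spider2(scrapy.Spider):
    name = 'spider2'

    def start_requests(self):
        # evaluated only when spider2 actually starts, i.e. after runner.crawl(spider1) has finished
        for url in collected_urls:
            yield scrapy.Request(url)

    def parse(self, response):
        pass  # extraction logic goes here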
Below is the full code I am using. Any input would be greatly appreciated. Thank you!
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)


class spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = [
        'https://tsd-careers.hii.com/en-US/search?keywords=alion&location='
    ]
    custom_settings = {'FEEDS': {r'urls.csv': {'format': 'csv', 'item_export_kwargs': {'include_headers_line': False,}, 'overwrite': True,}}}

    def parse(self, response):
        for job in response.xpath('//@href').getall():
            yield {'url': response.urljoin(job),}
        next_page = response.xpath('//a[@class="next-page-caret"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)


class spider2(scrapy.Spider):
    name = 'spider2'
    with open('urls.csv') as file:
        start_urls = [line.strip() for line in file]
    custom_settings = {'FEEDS': {r'data_tsdfront.xml': {'format': 'xml', 'overwrite': True}}}

    def parse(self, response):
        reqid = response.xpath('//li[6]/div/div[@class="secondary-text-color"]/text()').getall()
        yield {
            'reqid': reqid,
        }


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(spider1)
    yield runner.crawl(spider2)
    reactor.stop()


crawl()
reactor.run()
I have come to understand that using a variable would take a fair amount of refactoring.
After more research and experimentation I made some changes. It may be ugly, but everything is working as expected. As my knowledge and experience grow, I can refine and adjust it.
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import pandas as pd

configure_logging()
settings = get_project_settings()
runner = CrawlerRunner(settings)


class spider1(scrapy.Spider):
    name = 'spider1'
    start_urls = [
        'https://tsd-careers.hii.com/en-US/search?keywords=alion&location='
    ]
    custom_settings = {'FEEDS': {r'urls.csv': {'format': 'csv', 'overwrite': True,}}}

    def parse(self, response):
        for job in response.xpath('//@href').getall():
            yield {'url': response.urljoin(job),}
        next_page = response.xpath('//a[@class="next-page-caret"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)


def read_csv():
    df = pd.read_csv('urls.csv')
    return df['url'].values.tolist()


class spider2(scrapy.Spider):
    name = 'spider2'
    custom_settings = {'FEEDS': {r'data_tsdfront.xml': {'format': 'xml', 'overwrite': True}}}

    def start_requests(self):
        # read_csv() is called only when spider2 actually starts, i.e. after
        # spider1 has finished and written urls.csv, so the file exists by then
        for url in read_csv():
            yield scrapy.Request(url)

    def parse(self, response):
        data = response.css('*').getall()
        yield {
            'data': data,
        }


@defer.inlineCallbacks
def crawl():
    yield runner.crawl(spider1)
    yield runner.crawl(spider2)
    reactor.stop()


crawl()
reactor.run()
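If pandas feels heavy just for reading one column back, read_csv() could also be written with the standard-library csv module; a minimal sketch, assuming urls.csv keeps its header row as in the feed settings above:

import csv

def read_csv():
    with open('urls.csv', newline='') as f:
        # DictReader uses the 'url' header written by spider1's csv feed
        return [row['url'] for row in csv.DictReader(f)]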