I'm using Debian Bullseye (11.2). I want to save the output to a .csv file. How can I do this?
from scrapy.spiders import CSVFeedSpider


class CsSpiderSpider(CSVFeedSpider):
    name = 'cs_spider'
    allowed_domains = ['ocw.mit.edu/courses/electrical-engineering-and-computer-science/']
    start_urls = ['http://ocw.mit.edu/courses/electrical-engineering-and-computer-science//feed.csv']
    # headers = ['id', 'name', 'description', 'image_link']
    # delimiter = '\t'

    # Do any adaptations you need here
    # def adapt_response(self, response):
    #     return response

    def parse_row(self, response, row):
        i = {}
        # i['url'] = row['url']
        # i['name'] = row['name']
        # i['description'] = row['description']
        return i
Below is an example that uses Scrapy's FEEDS export.
import scrapy
from scrapy.crawler import CrawlerProcess


class CsspiderSpider(scrapy.Spider):
    name = 'cs_spider'
    start_urls = ['http://ocw.mit.edu/courses/electrical-engineering-and-computer-science']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse_row)

    def parse_row(self, response):
        yield {
            'test': response.text
        }


process = CrawlerProcess(
    settings={
        'FEEDS': {
            'data.csv': {
                'format': 'csv'
            }
        }
    }
)
process.crawl(CsspiderSpider)
process.start()
This saves the output to a file in .csv format. In addition, to specify which columns to export and in what order, use FEED_EXPORT_FIELDS. You can read more about this in the documentation.
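For instance, FEED_EXPORT_FIELDS can be added to the same settings dict passed to CrawlerProcess. This is a sketch; the field names 'url', 'name', and 'description' are assumptions taken from the commented-out code in the question, not fields your spider necessarily yields:

```python
process = CrawlerProcess(
    settings={
        'FEEDS': {
            'data.csv': {'format': 'csv'},
        },
        # Columns appear in the CSV in exactly this order.
        # These field names are hypothetical examples.
        'FEED_EXPORT_FIELDS': ['url', 'name', 'description'],
    }
)
```

Fields not listed here are dropped from the export, so the list should match the keys of the dicts your spider yields.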
From the command line, you can run:

scrapy crawl cs_spider -o output.csv
However, when running the spider from the command line this way, make sure to comment out the `process = CrawlerProcess(...)` lines and everything below them, since `scrapy crawl` starts the crawler itself.