Scrapy file storage



I have a small project I'm working on, and while I'm making progress, I'm a bit stuck on the storage options.

I've installed Ubuntu 20 headless, the latest Python installed fine, and everything runs well.

My script looks like this:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "github"

    def start_requests(self):
        urls = [
            'https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt',
            'https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'github-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log(f'Saved file {filename}')

I run the spider like this:

sudo scrapy crawl github -o results.json

and get this output:

barsa@ubuntu20~/scrape/scrape/spiders$ sudo scrapy crawl github -o results.json
2020-10-14 09:36:44 [scrapy.utils.log] INFO: Scrapy 2.4.0 started (bot: scrape)
2020-10-14 09:36:44 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 18.9.0, Python 3.8.5 (default, Jul 28 2020, 12:59:40) - [GCC 9.3.0], pyOpenSSL 19.0.0 (OpenSSL 1.1.1f  31 Mar 2020), cryptography 2.8, Platform Linux-5.4.0-48-generic-x86_64-with-glibc2.29
2020-10-14 09:36:44 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-10-14 09:36:44 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrape',
'NEWSPIDER_MODULE': 'scrape.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrape.spiders']}
2020-10-14 09:36:44 [scrapy.extensions.telnet] INFO: Telnet Password: xxxxx
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-14 09:36:44 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-14 09:36:44 [scrapy.core.engine] INFO: Spider opened
2020-10-14 09:36:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-14 09:36:44 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (400) <GET https://raw.githubusercontent.com/robots.txt> (referer: None)
2020-10-14 09:36:44 [protego] DEBUG: Rule at line 1 without any user agent to enforce it on.
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://osint.digitalside.it/robots.txt> (referer: None)
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt> (referer: None)
2020-10-14 09:36:44 [github] DEBUG: Saved file github-lists.html
2020-10-14 09:36:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt> (referer: None)
2020-10-14 09:36:45 [github] DEBUG: Saved file github-lists.html
2020-10-14 09:36:45 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-14 09:36:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 995,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 1444016,
'downloader/response_count': 4,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/400': 1,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 0.727767,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 14, 9, 36, 45, 27397),
'log_count/DEBUG': 7,
'log_count/INFO': 10,
'memusage/max': 52645888,
'memusage/startup': 52645888,
'response_received_count': 4,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/400': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 10, 14, 9, 36, 44, 299630)}
2020-10-14 09:36:45 [scrapy.core.engine] INFO: Spider closed (finished)

Now when I check the JSON file it's empty, but github-lists.html contains both lists with no separator between them, so it just looks like one big long list.

What I can't figure out is how to do either of the following:

  1. Split the lists into their own separate files (github-list1.html and github-list2.html)
  2. Add a separator to github-lists.html so I can run some logic to extract it into two separate CSV files

I can't find any examples on the Scrapy site that show how this file storage part works:

filename = f'github-{page}.html'
with open(filename, 'wb') as f:
    f.write(response.body)
self.log(f'Saved file {filename}')

What's the best way to solve this? It looks to me like the function above only ever handles a single file instance... so I'm guessing maybe I need to use a pipeline?

Many thanks.

scrapy crawl github -o results.json

The -o argument tells Scrapy to use the feed exports (FEED_EXPORT, see the docs), but your spider never yields any items to the engine, so nothing gets exported; that's why your JSON is empty.

To see it working, you can add the following line at the bottom of your parse method, run the spider again (with -o results.json), and you'll see the URLs in the JSON.

def parse(self, response):
    ...
    yield {'url': response.url}  # Add this
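
If you want more than just the URL in the feed, here is a rough sketch (my addition, not part of the original answer) that yields one item per line of each downloaded list, tagged with the file it came from, so the feed export actually has data to write. The 'source' and 'value' field names are just illustrative choices:

def parse(self, response):
    # Assumes both responses are plain text (they are .txt files).
    source = response.url.split("/")[-1]  # e.g. 'latesturls.txt' or 'latesthashes.txt'
    for line in response.text.splitlines():
        line = line.strip()
        if line:
            yield {'source': source, 'value': line}

With that in place, running scrapy crawl github -o results.csv should give you a single CSV with a source column you can filter on, which would cover option 2 from the question without any delimiter tricks.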

def start_requests(self):
    urls = [
        'https://osint.digitalside.it/Threat-Intel/lists/latesturls.txt',
        'https://raw.githubusercontent.com/davidonzo/Threat-Intel/master/lists/latesthashes.txt',
    ]
    ...

def parse(self, response):
    page = response.url.split("/")[-2]
    filename = f'github-{page}.html'
    with open(filename, 'wb') as f:
        f.write(response.body)

Here your code splits the response URL into a list and uses the "second to last" element to name the file. If you check, that element is "lists" for both URLs (by coincidence), so both times the parse method is called it refers to the same file, github-lists.html (where "lists" comes from the page variable).

You can use whatever logic you like here to name the files.
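
For example, one minimal sketch (my suggestion, not the only option) is to name each file after the last path segment instead of the second-to-last one; for these two URLs that segment is already unique ('latesturls.txt' and 'latesthashes.txt'):

def parse(self, response):
    # Take the final path segment so each URL gets its own output file.
    page = response.url.split("/")[-1]
    filename = f'github-{page}'
    with open(filename, 'wb') as f:
        f.write(response.body)
    self.log(f'Saved file {filename}')

That would write github-latesturls.txt and github-latesthashes.txt as two separate files, which covers option 1 from the question.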


I'd recommend working through the Scrapy tutorial; you'll get a much better idea of how to use the framework to extract and store data.

In particular, these three sections:

https://docs.scrapy.org/en/latest/intro/tutorial.html#extracting-data

https://docs.scrapy.org/en/latest/intro/tutorial.html#extracting-data-in-our-spider

https://docs.scrapy.org/en/latest/intro/tutorial.html#storing-the-scraped-data
