Not sure how to troubleshoot Scrapy terminal output



Thanks in advance.

I'm trying to use Scrapy, which is fairly new to me. I've built (what I think is) a simple spider that does the following:

from scrapy.spiders import CrawlSpider


class SuperSpider(CrawlSpider):
    name = 'KYM_entries'
    start_urls = ['https://knowyourmeme.com/memes/all/page/1']

    def parse(self, response):
        for entry in response.xpath('/html/body/div[3]/div/div[3]/section'):
            yield {
                # The link to a meme entry page on Know Your Meme
                'entry_link': entry.xpath('./td[2]/a/@href').get()
            }

I then run the following command in a terminal window:

$ scrapy crawl KYM_entries -O practice.csv
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 0.1.43ubuntu1 is an invalid version and will not be supported in a future release
warnings.warn(
/usr/lib/python3/dist-packages/pkg_resources/__init__.py:116: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
warnings.warn(
2022-12-26 20:08:04 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: KYM_spider)
2022-12-26 20:08:04 [scrapy.utils.log] INFO: Versions: lxml 4.8.0.0, libxml2 2.9.13, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 22.1.0, Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0], pyOpenSSL 21.0.0 (OpenSSL 3.0.2 15 Mar 2022), cryptography 3.4.8, Platform Linux-5.15.0-56-generic-x86_64-with-glibc2.35
2022-12-26 20:08:04 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'KYM_spider',
'NEWSPIDER_MODULE': 'KYM_spider.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['KYM_spider.spiders']}
2022-12-26 20:08:04 [py.warnings] WARNING: /usr/local/lib/python3.10/dist-packages/scrapy/utils/request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2022-12-26 20:08:04 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-12-26 20:08:04 [scrapy.extensions.telnet] INFO: Telnet Password: 97ac3d17f1e4cea1
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-26 20:08:04 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-26 20:08:04 [scrapy.core.engine] INFO: Spider opened
2022-12-26 20:08:04 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-26 20:08:04 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-26 20:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/robots.txt> (referer: None)
2022-12-26 20:08:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/memes/all/page/1> (referer: None)
2022-12-26 20:08:05 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-26 20:08:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 466,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 11690,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'elapsed_time_seconds': 0.953839,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 12, 27, 1, 8, 5, 833510),
'httpcompression/response_bytes': 45804,
'httpcompression/response_count': 2,
'log_count/DEBUG': 3,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'memusage/max': 65228800,
'memusage/startup': 65228800,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2022, 12, 27, 1, 8, 4, 879671)}
2022-12-26 20:08:05 [scrapy.core.engine] INFO: Spider closed (finished)

This returns an empty CSV, which I assume means either something is wrong with the XPath or something is wrong with the connection to Know Your Meme. But apart from the 200 status codes saying it is reaching the site, I'm not sure how to troubleshoot what is going on here.

So I have a couple of questions, one more directly related to my problem and the other more out of curiosity about this output:

  1. Is there a way to see why my script is failing to retrieve the data specified by the XPath in this particular case?
  2. Is there a simple guide or reference for how to read Scrapy's terminal output?

I've looked at your code. There were a few issues with the selector/XPath. I've updated it to use a CSS selector and removed the XPath. The meme URLs are relative, so I added the urljoin method to make them absolute. I've also added a start_requests method because my Scrapy version is 2.6.0; if you are using an older version of Scrapy (1.6.0), you can remove this method.

from scrapy import Request
from scrapy.spiders import CrawlSpider


class SuperSpider(CrawlSpider):
    name = 'KYM_entries'
    start_urls = ['https://knowyourmeme.com/memes/all/page/1']

    def start_requests(self):
        yield Request(self.start_urls[0], callback=self.parse)

    def parse(self, response):
        for entry in response.css('.entry-grid-body .photo'):
            yield {
                # The link to a meme entry page on Know Your Meme
                'entry_link': response.urljoin(entry.css('::attr(href)').get())
            }
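
As for your first question: one way to see whether a selector actually matches anything, without running the whole crawl, is to test it interactively with scrapy shell. A minimal sketch using the page and the selectors above (your original XPath and the CSS selector from my version):

$ scrapy shell 'https://knowyourmeme.com/memes/all/page/1'
# Your original XPath; if this prints an empty list, the selector matches
# nothing, which is why the exported CSV ends up empty:
>>> response.xpath('/html/body/div[3]/div/div[3]/section/td[2]/a/@href').getall()
# The CSS selector used in the spider above; this should print the relative entry links:
>>> response.css('.entry-grid-body .photo::attr(href)').getall()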

The code now runs fine. The output looks like this:

2022-12-27 13:14:52 [scrapy.core.engine] INFO: Spider opened
2022-12-27 13:14:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-27 13:14:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-27 13:14:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://knowyourmeme.com/memes/all/page/1> (referer: None)
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/mayinquangcao'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/this-is-x-bitch-we-clown-in-this-muthafucka-betta-take-yo-sensitive-ass-back-to-y'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/choo-choo-charles'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/bug-fables-the-everlasting-sapling'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/onii-holding-a-picture'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/vintage-recipe-videos'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/ytpmv-elf'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/i-just-hit-a-dog-going-70mph-on-my-truck'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/women-dodging-accountability'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/grinchs-ultimatum'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/where-is-idos-black-and-white'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/basilisk-time'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/rankinbass-productions'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/subcultures/error143'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/whatsapp-university'}
2022-12-27 13:14:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://knowyourmeme.com/memes/all/page/1>
{'entry_link': 'https://knowyourmeme.com/memes/messi-autism-speculation-messi-is-autistic'}
