Scrapy works in the shell but crawls 0 pages



I am using Scrapy to parse the following site: http://www.banki.ru/services/responses/. When I step through the parse in the shell, everything works; that is, this line works:

response.xpath("//script[contains(., 'banksData')]/text()").re(r'"name":"(.*?)","code"')

But when I start a crawl, I get the following log:

2017-06-16 20:59:27 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: banksru)
2017-06-16 20:59:27 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'banksru', 'FEED_FORMAT': 'json', 'NEWSPIDER_MODULE': 'banksru.spiders', 'SPIDER_MODULES': ['banksru.spiders'], 'FEED_URI': 'banki.json'}
2017-06-16 20:59:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.feedexport.FeedExporter']
2017-06-16 20:59:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-06-16 20:59:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-06-16 20:59:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-06-16 20:59:28 [scrapy.core.engine] INFO: Spider opened
2017-06-16 20:59:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-06-16 20:59:28 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-06-16 20:59:28 [scrapy.core.engine] DEBUG: Crawled (429) <GET http://www.banki.ru/services/responses/> (referer: None)
2017-06-16 20:59:28 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <429 http://www.banki.ru/services/responses/>: HTTP status code is not handled or not allowed
2017-06-16 20:59:28 [scrapy.core.engine] INFO: Closing spider (finished)
2017-06-16 20:59:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 229,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 119,
'downloader/response_count': 1,
'downloader/response_status_count/429': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 6, 16, 17, 59, 28, 827696),
'httperror/response_ignored_count': 1,
'httperror/response_ignored_status_count/429': 1,
'log_count/DEBUG': 2,
'log_count/INFO': 8,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 6, 16, 17, 59, 28, 573054)}
2017-06-16 20:59:28 [scrapy.core.engine] INFO: Spider closed (finished)

I know the site blocks robots and is picky about user agents, so I changed my project's crawl settings in settings.py:

# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'banksru'
SPIDER_MODULES = ['banksru.spiders']
NEWSPIDER_MODULE = 'banksru.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'www.example.com'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banksru.middlewares.BanksruSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banksru.middlewares.MyCustomDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'banksru.pipelines.BanksruPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

The code I am trying to run is simple:

import scrapy


class BankRating(scrapy.Spider):
    name = "banki"
    start_urls = [
        "http://www.banki.ru/services/responses/",
    ]

    def parse(self, response):
        # All rating fields live in the same <script> block, so select it once
        script = response.xpath("//script[contains(., 'ratingData')]/text()")
        name = response.xpath("//script[contains(., 'banksData')]/text()").re(r'"name":"(.*?)","code"')
        rating = script.re(r'"rating":(.*?),"responseCount"')
        avg_grade = script.re(r'"middleGrade":(.*?),"middleRating"')
        checked_responses = script.re(r'"checkedResponseCount":(.*?),"checkedResponseCountForYear"')
        num_responses = script.re(r'"responseCount":(.*?),"responseCountForYear"')
        solved_problems = script.re(r'"solvedResponseCount":(.*?),"withAgentAnswer"')
        bank_answers = script.re(r'"withAgentAnswer":(.*?),"middleGrade"')
        # Scrapy expects dicts, Items or Requests from parse(); a bare tuple is rejected
        yield {
            'name': name,
            'rating': rating,
            'avg_grade': avg_grade,
            'checked_responses': checked_responses,
            'num_responses': num_responses,
            'solved_problems': solved_problems,
            'bank_answers': bank_answers,
        }
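For completeness: judging by the FEED_FORMAT and FEED_URI values in the log above, the spider is launched with the JSON feed export, i.e. something like:

scrapy crawl banki -o banki.json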

My machine runs Windows 8.1, with Scrapy installed for Python 3.5. Thanks in advance for any help.

Scrapy, as a bot, can be very hard on servers, because it is very fast and makes asynchronous calls, so there are some explicit guidelines to follow. These exist so that crawling behaves in a more tolerable, friendly way and does no harm to the network. They are well covered in the blog post "How to Crawl the Web Politely with Scrapy" by Valdir Stumm Jr.:

  • Website owners use a robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol. The file usually lives at the root of the website, and your crawler should follow the rules it defines.

  • How many requests a website can handle varies widely. AutoThrottle automatically adjusts the delay between requests based on the current web server load. It first computes the latency of a single request, then tunes the delay between requests to the same domain so that no more than AUTOTHROTTLE_TARGET_CONCURRENCY requests are active at the same time (see the sketch after this list).
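A minimal sketch of what these two points translate to in settings.py (the values are illustrative, taken from the commented-out defaults above, not tuned for banki.ru):

# Respect the Robots Exclusion Protocol
ROBOTSTXT_OBEY = True

# Throttle automatically based on observed server load
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0

# A fixed minimum delay between requests to the same site also helps against 429
DOWNLOAD_DELAY = 3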

Enabling these in settings.py should allow Scrapy to crawl the site. Thanks to @Ding for pointing out that "HTTP code 429: it means too many requests have been sent in a given amount of time".

When something about the incoming requests looks wrong, a website may try to protect itself by responding with various status codes.

This particular case is common but simple to deal with. You can usually get past it with a common USER_AGENT:

settings.py

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:50.0) Gecko/20100101 Firefox/50.0'
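If you only want to override the user agent for this one spider rather than the whole project, Scrapy also honors a per-spider custom_settings dict; a sketch using the same UA string:

import scrapy

class BankRating(scrapy.Spider):
    name = "banki"
    # Takes precedence over the project-wide settings.py for this spider only
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:50.0) Gecko/20100101 Firefox/50.0',
    }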

because Scrapy sends the following by default:

"Scrapy/1.3.0 (+http://scrapy.org)"
