I use Scrapy to scrape content, and now I'm trying to integrate it with Splash in order to render JavaScript pages. The problem is that when I start the crawler, roughly the first 20 requests come back with empty content, and every request after that returns a 504 status code. Why is this happening?
Here is the log file:
2018-06-20 10:43:14 [scrapy.core.scraper] WARNING: Dropped:
Not valid item dropped!
{'name': None, 'store': 'Centauro', 'tkbRatio': None, 'description': None, 'salesPrice': None, 'installmentsPrice': None, 'disponibility': True, 'image': None, 'category': None, 'timeStamp': '2018-06-20 13:43:14.875348', 'modifiedTime': None, 'url': 'https://www.centauro.com.br/camisa-compressao-adams-termica-ml-821229.html', 'rating': 0, 'numberOfReviews': 0}
2018-06-20 10:43:14 [centauro] WARNING: Not valid item dropped! https://www.centauro.com.br/camisa-do-brasil-i-2018-nike-masculina-918516.html
2018-06-20 10:43:14 [scrapy.core.scraper] WARNING: Dropped:
Not valid item dropped!
{'name': None, 'store': 'Centauro', 'tkbRatio': None, 'description': None, 'salesPrice': None, 'installmentsPrice': None, 'disponibility': True, 'image': None, 'category': None, 'timeStamp': '2018-06-20 13:43:14.940786', 'modifiedTime': None, 'url': 'https://www.centauro.com.br/camisa-do-brasil-i-2018-nike-masculina-918516.html', 'rating': 0, 'numberOfReviews': 0}
2018-06-20 10:43:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.centauro.com.br/tenis-adidas-duramo-7-lite-masculino-918742.html via http://0.0.0.0:8050/render.html> (referer: None)
2018-06-20 10:43:15 [centauro] WARNING: Not valid item dropped! https://www.centauro.com.br/tenis-adidas-duramo-7-lite-masculino-918742.html
2018-06-20 10:43:15 [scrapy.core.scraper] WARNING: Dropped:
Not valid item dropped!
{'name': None, 'store': 'Centauro', 'tkbRatio': None, 'description': None, 'salesPrice': None, 'installmentsPrice': None, 'disponibility': True, 'image': None, 'category': None, 'timeStamp': '2018-06-20 13:43:15.298537', 'modifiedTime': None, 'url': 'https://www.centauro.com.br/tenis-adidas-duramo-7-lite-masculino-918742.html', 'rating': 0, 'numberOfReviews': 0}
2018-06-20 10:43:22 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/tenis-oxer-netuno-masculino-913399.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/calca-termica-kappa-belquior-masculina-910118.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/jaqueta-oxer-water-repelent-feminina-858050.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/camiseta-do-brasil-2018-crest-nike-masculina-918483.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
2018-06-20 10:43:27 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://www.centauro.com.br/tenis-fila-infinity-m00kil-mktp.html via http://0.0.0.0:8050/render.html> (failed 1 times): 504 Gateway Time-out
Here is the main method where my spider starts scraping:
def start_requests(self):
    mode = self.settings.get('MODE')
    urls = util.get_urls_db(self.custom_settings['URLS_COLLECTION_NAME'])
    urls = list(urls)
    if mode == 'all':
        for url in urls:
            yield SplashRequest(url['url'], self.parse_item,
                args={
                    # optional; parameters passed to Splash HTTP API
                    'timeout': 10,
                    # 'url' is prefilled from request url
                    # 'http_method' is set to 'POST' for POST requests
                    # 'body' is set to request body for POST requests
                }
            )
Here is my settings.py:
SPLASH_URL = 'http://0.0.0.0:8050'

DOWNLOADER_MIDDLEWARES = {
    # 'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # 'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    # 'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    # 'updater.middlewares.SeleniumMiddleware': 700,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
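A common cause of bursts of 504s from Splash is firing more concurrent renders than the Splash instance has slots for, so that queued renders exceed their timeout before they even start. A minimal throttling sketch for settings.py (the values below are assumptions to tune for your setup, not part of the original configuration):

```python
# settings.py -- hypothetical throttling values; tune for your Splash instance
CONCURRENT_REQUESTS = 5    # keep in-flight renders at or below Splash's slot count
DOWNLOAD_DELAY = 0.5       # brief pause between requests to avoid bursts
RETRY_HTTP_CODES = [500, 502, 503, 504]  # retry gateway timeouts from Splash
```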
Try using <real_ip>:8050 instead of 0.0.0.0:8050.
Adding this specific argument to the yielded SplashRequest solved my problem: args={"timeout": 3000}. Like this:

yield SplashRequest(url, args={"timeout": 3000})
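Note that Splash caps the timeout argument at its --max-timeout limit (90 seconds by default, per the Splash HTTP API docs) and rejects larger values, so for a timeout like 3000 the Splash server itself needs to be started with a higher limit. A sketch, assuming the standard Docker image:

```shell
# allow render timeouts of up to one hour
docker run -it -p 8050:8050 scrapinghub/splash --max-timeout 3600
```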