Rotating proxies (Storm Proxies, Smartproxy) do not give a unique IP for every request



How can I make sure I get a new IP on every single request? I have tried Storm Proxies and Smartproxy, but the IP they give me stays the same for the whole session.

The IP is new each time I run the script, but within a single run every request comes back with the same IP.

My code is below:

import json
import uuid

import scrapy
from scrapy.crawler import CrawlerProcess

class IpTest(scrapy.Spider):
    name = 'IP_test'
    previous_ip = ''
    count = 1
    ip_url = 'https://ifconfig.me/all.json'

    def start_requests(self):
        yield scrapy.Request(
            self.ip_url,
            dont_filter=True,
            meta={
                'cookiejar': uuid.uuid4().hex,
                'proxy': MY_ROTATING_PROXY  # either Storm Proxies or Smartproxy
            }
        )

    def parse(self, response):
        ip_address = json.loads(response.text)['ip_addr']
        self.logger.info(f"IP: {ip_address}")
        if self.count < 10:
            self.count += 1
            yield from self.start_requests()

settings = {
    'DOWNLOAD_DELAY': 1,
    'CONCURRENT_REQUESTS': 1,
}
process = CrawlerProcess(settings)
process.crawl(IpTest)
process.start()

Output log:

2020-12-27 21:15:52 [scrapy.core.engine] INFO: Spider opened
2020-12-27 21:15:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-27 21:15:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-12-27 21:15:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: None)
2020-12-27 21:15:55 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:56 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:57 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:59 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:00 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:01 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:03 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:04 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:06 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:07 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:07 [scrapy.core.engine] INFO: Closing spider (finished)

What am I doing wrong here? I even tried disabling cookies (COOKIES_ENABLED = False) and removing cookiejar from request.meta, but no luck.

It was tricky, but I found the answer. For Storm Proxies you need to pass headers with "Connection": "close". In that case you get a new proxy IP for every request. For example:

HEADERS = {'Connection': 'close'}
yield Request(url=url, callback=self.parse, body=body, headers=HEADERS)

With this header Storm closes the connection after each response and hands you a new IP on the next request.
