Scraping through a proxy with Scrapy

I wrote a Scrapy middleware so that every scrapy.Request(url) is sent through a proxy.

My custom middleware:

import logging

log = logging.getLogger(__name__)


class MyCustomProxyMiddleware(object):

    def __init__(self, settings):
        self.chosen_proxy = settings.get('ROTATOR_PROXY', None)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings)

    def process_request(self, request, spider):
        if self.chosen_proxy is not None:
            request.meta["proxy"] = self.chosen_proxy
            log.debug('Using proxy <%s>' % self.chosen_proxy)

In my settings.py:

ROTATOR_PROXY = 'http://ip:port'  # this is my rotating gateway proxy

My spider:

def start_requests(self):
    urls = []    # thousand URLs
    for url in urls:
        # Don't redirect URL and scrape data
        if checkers.is_url(url):
            yield scrapy.Request(url)

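However, when I check the statistics of my rotator proxy gateway, I see that some scrapy.Request(url) calls go through the proxy at first, but many scrapy.Request(url) calls do not use my rotator proxy gateway at all. I need every request to go through my rotator gateway.

I can't work out what the problem is. Please let me know what is wrong and, if possible, point out my mistake.

Thanks in advance.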

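There are several ways to use a proxy with a Scrapy crawler. The first is the traditional way: run "pip install scrapy-rotating-proxies" and follow the official documentation. The second, which developers use most often, is to integrate an API that ships prebuilt functions for handling proxies and supports several languages (including Python), so that the API takes care of proxy rotation and anonymity for you. Apart from these, you can also try the code below to use a proxy with Scrapy. Before writing the code, note that the first snippet sets the proxy through a request parameter, while the second creates a custom proxy middleware.

Method 1: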

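from scrapy import Request

def start_requests(self):
    for url in self.start_urls:
        # yield rather than return, so that every URL in start_urls is requested
        yield Request(url=url, callback=self.parse,
                      headers={"User-Agent": "scrape web"},
                      meta={"proxy": "http://154.112.82.262:8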

Method 2:

# middlewares.py
from w3lib.http import basic_auth_header


class CustomProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://192.168.1.1:8050"
        request.headers["Proxy-Authorization"] = basic_auth_header(
            "<proxy_user>", "<proxy_pass>")


# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}
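
A custom proxy middleware like the one above can be checked in isolation before running a full crawl. The following is only a minimal sketch under stated assumptions: ForceProxyMiddleware and fake_request are hypothetical names introduced here for illustration, the stand-in request is a plain object with url and meta attributes rather than a real scrapy.Request, and ROTATOR_PROXY and 'http://ip:port' are taken from the question.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


class ForceProxyMiddleware:
    """Hypothetical downloader middleware that sets one proxy on every request."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # ROTATOR_PROXY is the setting name used in the question.
        return cls(crawler.settings.get("ROTATOR_PROXY"))

    def process_request(self, request, spider):
        if self.proxy_url:
            # Overwrite unconditionally so no request is left without a proxy.
            request.meta["proxy"] = self.proxy_url
            logger.debug("Using proxy <%s> for %s", self.proxy_url, request.url)
        # Returning None tells Scrapy to keep processing the request normally.
        return None


# Stand-in request object so the sketch runs without Scrapy installed.
fake_request = SimpleNamespace(url="http://example.com", meta={})
middleware = ForceProxyMiddleware("http://ip:port")
result = middleware.process_request(fake_request, spider=None)
```

In a real project the class only takes effect once it is listed in DOWNLOADER_MIDDLEWARES in settings.py. Scrapy calls process_request in increasing order of the configured number, and the built-in HttpProxyMiddleware sits at 750 by default, so a value such as 350 makes the custom middleware run first.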

Note: the first method sets the proxy by passing it as a request parameter; the second method creates a custom proxy middleware.
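
For the scrapy-rotating-proxies route mentioned at the start of this answer, the configuration goes into settings.py. The fragment below is a sketch based on that package's README as I recall it, so please check it against the version you install; the proxy addresses are placeholders, not real endpoints.

```python
# settings.py (sketch; proxy addresses below are placeholders)

# Proxies that scrapy-rotating-proxies rotates through.
ROTATING_PROXY_LIST = [
    "proxy1.com:8000",
    "proxy2.com:8031",
]

DOWNLOADER_MIDDLEWARES = {
    # Picks a proxy for each request and tracks which proxies are alive.
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    # Marks proxies as dead when responses look like bans.
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

With a single rotating gateway, as in the question, this package is probably unnecessary, since the gateway already rotates the outgoing IP addresses itself.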
