Python's requests triggers Cloudflare's security while urllib does not



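I'm working on an automated web scraper for a restaurant website, but I've run into a problem. The website uses Cloudflare's anti-bot security, which I'm trying to bypass. This is not the Under Attack mode, but a captcha test that only triggers when Cloudflare detects a non-American IP or a bot. I believe it can be bypassed because Cloudflare's security does not trigger when I clear cookies, disable JavaScript, or use an American proxy.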

Knowing this, I tried using Python's requests library:

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
response = requests.get("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers).text
print(response)

But this ends up triggering Cloudflare, no matter which proxy I use.

However, when using urllib.request with the same headers:

import urllib.request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}
request = urllib.request.Request("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers)
r = urllib.request.urlopen(request).read()
print(r.decode('utf-8'))

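When run from the same American IP, this time it does not trigger Cloudflare's security, even though it uses the same headers and IP as the requests version.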

So I'm trying to figure out what exactly triggers Cloudflare in the requests library but not in urllib.

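While the typical answer would be "just use urllib then", I'd like to figure out what exactly is different with requests and how I could fix it: first to understand how requests works and how Cloudflare detects bots, but also so that I can apply any fix I find to other HTTP libraries (notably asynchronous ones).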

EDIT N°2: Progress so far:

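Thanks to @TuanGeek, we can now bypass the Cloudflare block using requests, as long as we connect directly to the host IP rather than the domain name (for some reason, the DNS redirection with requests triggers Cloudflare, but urllib's doesn't):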

import requests
from collections import OrderedDict
import socket
# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})
s = requests.Session()
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", verify=False).text

Note: trying to access the site over plain HTTP (rather than HTTPS with the verify parameter set to False) will trigger Cloudflare's block.
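One fragile spot in the snippet above is the tuple unpacking of `answers[0]`: `socket.getaddrinfo` returns a two-element socket address for IPv4 but a four-element one for IPv6, so `(address, port)` fails if the first answer happens to be IPv6. A minimal sketch of a more tolerant version follows. The helper names `first_address` and `url_for_address` are hypothetical, not part of the original solution, and this only changes how the address is extracted, not how Cloudflare reacts.

```python
import socket


def first_address(host, port=443):
    """Return (family, address) for the first answer getaddrinfo gives."""
    answers = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    family, _socktype, _proto, _canonname, sockaddr = answers[0]
    # sockaddr is (address, port) for IPv4 and a 4-tuple for IPv6;
    # the address is the first element in both cases.
    return family, sockaddr[0]


def url_for_address(family, address, path):
    """Build an https URL for a literal IP; IPv6 literals need brackets."""
    host = f"[{address}]" if family == socket.AF_INET6 else address
    return f"https://{host}{path}"
```

With these, the request line would become something like `s.get(url_for_address(*first_address('grimaldis.myguestaccount.com'), '/guest/accountlogin'), verify=False)`.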

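Now this is great, but unfortunately my final goal of making this work asynchronously with the HTTP library HTTPX is still not met. With the following code the Cloudflare block is still triggered, even though we connect directly through the host IP, with proper headers, and with verify set to False: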

import trio
import httpx
import socket
from collections import OrderedDict
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})

async def asks_worker():
    # verify=False: the certificate cannot match a bare IP address
    async with httpx.AsyncClient(headers=headers, verify=False) as s:
        r = await s.get(f'https://{address}/guest/accountlogin')
        print(r.text)

async def run_task():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(asks_worker)

trio.run(run_task)

EDIT N°1: For further details, here are the raw HTTP requests and responses from urllib and requests.

requests:

send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Date: Thu, 02 Jul 2020 20:20:06 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: close
header: CF-Chl-Bypass: 1
header: Set-Cookie: __cfduid=df8902e0b19c21b364f3bf33e0b1ce1981593721256; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Expires: Thu, 01 Jan 1970 00:00:01 GMT
header: X-Frame-Options: SAMEORIGIN
header: cf-request-id: 03b2c8d09300000ca181928200000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=df8962e1b27c25b364f3bf66e8b1ce1981593723206; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Vary: Accept-Encoding
header: Server: cloudflare
header: CF-RAY: 5acb25c75c981ca1-EWR

URLLIB:

send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 02 Jul 2020 20:20:01 GMT
header: Content-Type: text/html;charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Set-Cookie: __cfduid=db9de9687b6c22e6c12b33250a0ded3251292457801; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Expires: Thu, 2 Jul 2020 20:20:01 GMT
header: Cache-Control: no-cache, private, no-store
header: X-Powered-By: Undertow/1
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://use.typekit.net connect.facebook.net/ https://googleads.g.doubleclick.net/ app.pendo.io cdn.pendo.io pendo-static-6351154740266000.storage.googleapis.com pendo-io-static.storage.googleapis.com https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.google.com/recaptcha/api.js apis.google.com https://www.googletagmanager.com api.instagram.com https://app-rsrc.getbee.io/plugin/BeePlugin.js https://loader.getbee.io api.instagram.com https://bat.bing.com/bat.js https://www.googleadservices.com/pagead/conversion.js https://connect.facebook.net/en_US/fbevents.js  https://connect.facebook.net/ https://fonts.googleapis.com/ https://ssl.gstatic.com/ https://tagmanager.google.com/;style-src 'unsafe-inline' *;img-src * data:;connect-src 'self' app.pendo.io api.feedback.us.pendo.io; frame-ancestors 'self' app.pendo.io pxsweb.com *.pxsweb.com;frame-src 'self' *.myguestaccount.com https://app.getbee.io/ *;
header: X-Lift-Version: Unknown Lift Version
header: CF-Cache-Status: DYNAMIC
header: cf-request-id: 01b2c5b1fa00002654a25485710000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Server: cloudflare
header: CF-RAY: 5acb58a62c5b5144-EWR
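The `send:` / `reply:` / `header:` lines above have the shape of `http.client`'s debug output. As a rough sketch of how such a trace can be produced with urllib, the example below turns on `debuglevel=1` and captures what gets printed. It deliberately talks to a throwaway local server rather than the real site; `_OkHandler` and `trace_urllib_request` are hypothetical names made up for this illustration.

```python
import contextlib
import io
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class _OkHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the real site: always answers 200."""

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the local server quiet


def trace_urllib_request(user_agent):
    """Return the debug trace http.client prints for one urllib GET."""
    server = HTTPServer(("127.0.0.1", 0), _OkHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({}),           # ignore proxy env vars
            urllib.request.HTTPHandler(debuglevel=1),  # print send:/reply:/header:
        )
        request = urllib.request.Request(
            f"http://127.0.0.1:{server.server_port}/guest/accountlogin",
            headers={"User-Agent": user_agent},
        )
        captured = io.StringIO()
        # http.client writes its trace with print(), so capture stdout
        with contextlib.redirect_stdout(captured):
            with opener.open(request) as response:
                response.read()
        return captured.getvalue()
    finally:
        server.shutdown()
        server.server_close()
```

Calling `print(trace_urllib_request("Mozilla/5.0 ..."))` shows the exact bytes urllib put on the wire, which is the kind of evidence compared in the two dumps above. For HTTPS, `urllib.request.HTTPSHandler(debuglevel=1)` plays the same role.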

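This really piqued my interest. Here is the requests solution that I was able to get working.

Solution

I finally narrowed down the problem. When you use requests, it goes through a urllib3 connection pool. There seems to be some inconsistency between a regular urllib3 connection and a connection pool. A working solution: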

import requests
from collections import OrderedDict
from requests import Session
import socket
# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)

Technical Background

I ran both methods through Burp Suite to compare the requests. Below are the raw dumps of the requests.

Using requests

GET /guest/accountlogin HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Connection: close
Host: grimaldis.myguestaccount.com
Accept-Language: en-GB,en;q=0.5
Upgrade-Insecure-Requests: 1
dnt: 1

Using urllib

GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: close
Upgrade-Insecure-Requests: 1
Dnt: 1

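The difference is the ordering of the headers. The difference in the casing of dnt is not actually the problem.

I was able to make a successful request with the following raw request: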

GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0

So the Host header has to be sent above User-Agent. If you want to keep using requests, consider using an OrderedDict to ensure the ordering of the headers.
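To check what order the headers really go out in, it helps to look at the raw bytes without a proxy in between. The sketch below uses the standard library's `http.client` with `putrequest(..., skip_host=True, skip_accept_encoding=True)` so that nothing is added ahead of the headers you list, and a tiny local socket that records the request. `send_ordered_get`, `record_one_request`, and `demo` are hypothetical helpers for illustration only; the real site is HTTPS, where `http.client.HTTPSConnection` would take the place of `HTTPConnection`.

```python
import http.client
import socket
import threading


def send_ordered_get(host, port, path, headers):
    """Send a GET whose headers go out in exactly the order given."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    # Stop http.client from inserting its own Host / Accept-Encoding lines
    conn.putrequest("GET", path, skip_host=True, skip_accept_encoding=True)
    for name, value in headers:
        conn.putheader(name, value)
    conn.endheaders()
    response = conn.getresponse()
    body = response.read()
    conn.close()
    return response.status, body


def record_one_request(listener, sink):
    """Accept one connection, store the raw request bytes, answer 200."""
    client, _ = listener.accept()
    with client:
        data = b""
        while b"\r\n\r\n" not in data:
            chunk = client.recv(4096)
            if not chunk:
                break
            data += chunk
        sink.append(data)
        client.sendall(
            b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\nConnection: close\r\n\r\nok"
        )


def demo():
    """Send Host before User-Agent to a local recorder and return what it saw."""
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    sink = []
    worker = threading.Thread(target=record_one_request, args=(listener, sink))
    worker.start()
    status, _body = send_ordered_get("127.0.0.1", port, "/guest/accountlogin", [
        ("Host", "grimaldis.myguestaccount.com"),
        ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) "
                       "Gecko/20100101 Firefox/77.0"),
    ])
    worker.join()
    listener.close()
    return status, sink[0]
```

`demo()` returns the status and the recorded bytes, so you can confirm that `Host:` appears before `User-Agent:` exactly as in the minimal raw request above.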

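After some debugging, and thanks to @TuanGeek's answer, we found that the problem with the requests library seems to come from a DNS issue on requests' side when dealing with Cloudflare. A simple fix is to connect directly to the host IP, like so: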

import requests
from collections import OrderedDict
from requests import Session
import socket
# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]
s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)

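Now, this fix does not work when using the HTTP library HTTPX, but I have found the root of that problem.

The issue comes from the h11 library (used by HTTPX to handle HTTP/1.1 requests). While urllib automatically fixes the letter case of header names, h11 takes a different approach and lowercases every header name. In theory this should not cause any issue, since servers are supposed to treat header names case-insensitively (and in many cases they do). The reality is that HTTP is Hard™️, and services such as Cloudflare do not respect RFC 2616 and require header names to be properly capitalized.

Discussions about capitalization have been going on for a while over at h11: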
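To make the casing difference concrete: the Burp dumps in the other answer show urllib sending `Dnt` where requests sent `dnt`, which is consistent with each header name being title-cased before it goes out. A minimal sketch of that transformation is below. `title_case_headers` is a hypothetical helper that only illustrates the normalization; it does not patch h11 or HTTPX, which would still lowercase the names when serializing.

```python
def title_case_headers(headers):
    """Return a copy whose names look like 'User-Agent', not 'user-agent'."""
    # str.title() capitalizes the first letter of every hyphen-separated part
    return {name.title(): value for name, value in headers.items()}


lowercased = {
    "host": "grimaldis.myguestaccount.com",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) "
                  "Gecko/20100101 Firefox/77.0",
    "dnt": "1",
}
normalized = title_case_headers(lowercased)
```

Here `normalized` has the keys `Host`, `User-Agent`, and `Dnt`, matching the casing seen in the urllib dump.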

https://github.com/python-hyper/h11/issues/31

And they have "recently" started popping up on HTTPX's repo as well:

https://github.com/encode/httpx/issues/538

https://github.com/encode/httpx/issues/728

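Now, the unsatisfying answer to the problem between Cloudflare and HTTPX is that, until something is done on h11's side (or until Cloudflare miraculously starts respecting RFC 2616), not much will change in how HTTPX and Cloudflare handle header capitalization.

Either use a different HTTP library such as aiohttp or requests-futures, try forking h11 and patching the header capitalization yourself, or wait and hope that the h11 team handles the issue properly.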
