This is a web-scraping project I'm working on.
I need to send this reCaptcha v2 response, but the request doesn't bring back the data I need:
import re
import requests

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36'
}
url = 'https://www2.detran.rn.gov.br/externo/consultarveiculo.asp'
session = requests.session()
fazer_get = session.get(url, headers=headers)
cookie = fazer_get.cookies
html = fazer_get.text
try:
    # the pattern was missing its backslashes ('divs*' instead of 'div\s*'),
    # so it never matched and captchaKey was never set
    rgxCaptchaKey = re.search(r'<div\s+class="g-recaptcha"\s+data-sitekey="([^"]*?)"></div>', html, re.IGNORECASE)
    captchaKey = rgxCaptchaKey.group(1)
except AttributeError:
    print('erro: site key not found')
resposta_captcha = captcha(captchaKey, url, KEY)  # external solver helper
placa = 'pcj90'
renavam = '57940'
payload = {
    'oculto': 'AvancarC',  # was 'oculto:' 'AvancarC' — missing comma, misplaced colon
    'placa': placa,
    'renavam': renavam,
    'g-recaptcha-response': resposta_captcha['code'],
    'btnConsultaPlaca': ''
}
fazerPost = session.post(
    url,
    data=payload,
    headers=headers,
    cookies=cookie)
I tried sending the captcha response in the payload, but I still can't reach the page I want.
If the website you're trying to scrape is protected by reCaptcha, your best option is a stealthy scraping approach. That means either Selenium (with at least selenium-stealth), or a third-party web scraper such as WebScrapingAPI, where I'm an engineer.
The benefit of using a third-party service is that it usually comes with reCaptcha solving, an IP rotation system, and various other features to avoid bot detection, so you can focus on processing the scraped data rather than on building the scraper itself.
To get a better idea of both options, here are two implementation examples you can compare:
1. Python With Stealthy Selenium
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium_stealth import stealth
from bs4 import BeautifulSoup

URL = 'https://www2.detran.rn.gov.br/externo/consultarveiculo.asp'

options = Options()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options)

# Patch the browser fingerprint so it looks like a regular desktop Chrome
stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True)

driver.get(URL)
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'html.parser')  # parse the rendered page
You should also consider integrating a captcha solver (such as 2captcha) with this code.
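As a minimal sketch of what that integration could look like, here is a polling wrapper around 2captcha's `in.php`/`res.php` endpoints for reCaptcha v2. The function name and the error handling are my own choices, not part of any official client; the returned token would go into the `g-recaptcha-response` field of your payload:

```python
import time
import requests

API_IN = "http://2captcha.com/in.php"
API_RES = "http://2captcha.com/res.php"

def solve_recaptcha(api_key, site_key, page_url, timeout=120):
    """Submit a reCaptcha v2 task to 2captcha and poll until a token is ready."""
    submit = requests.post(API_IN, data={
        "key": api_key,
        "method": "userrecaptcha",
        "googlekey": site_key,   # the data-sitekey scraped from the page
        "pageurl": page_url,
        "json": 1,
    }).json()
    if submit.get("status") != 1:
        raise RuntimeError(f"2captcha submit failed: {submit}")
    task_id = submit["request"]

    deadline = time.time() + timeout
    while time.time() < deadline:
        time.sleep(5)  # 2captcha recommends polling every few seconds
        res = requests.get(API_RES, params={
            "key": api_key,
            "action": "get",
            "id": task_id,
            "json": 1,
        }).json()
        if res.get("status") == 1:
            return res["request"]  # the g-recaptcha-response token
        if res.get("request") != "CAPCHA_NOT_READY":
            raise RuntimeError(f"2captcha error: {res}")
    raise TimeoutError("2captcha did not solve the captcha in time")
```

You would call it as `solve_recaptcha(KEY, captchaKey, url)` and put the result into the payload in place of `resposta_captcha['code']`.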
2. Python With WebScrapingAPI
import requests

URL = 'https://www2.detran.rn.gov.br/externo/consultarveiculo.asp'
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'

params = {
    "api_key": API_KEY,
    "url": URL,
    "render_js": "1",
    "js_instructions": '''
    [{
        "action": "value",
        "selector": "input#placa",
        "timeout": 5000,
        "value": "<YOUR_PLACA>"
    },
    {
        "action": "value",
        "selector": "input#renavam",
        "timeout": 5000,
        "value": "<YOUR_RENAVAM>"
    },
    {
        "action": "submit",
        "selector": "button#btnConsultaPlaca",
        "timeout": 5000
    }]
    '''
}

res = requests.get(SCRAPER_URL, params=params)
print(res.text)
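Either way, once you have the result HTML you still need to extract the vehicle data. Here is a minimal parsing sketch with BeautifulSoup; the `table#resultado` markup below is a made-up stand-in, so inspect the real response to find the actual selectors:

```python
from bs4 import BeautifulSoup

# Hypothetical result markup — the real page's structure must be inspected.
html = """
<table id="resultado">
  <tr><td>Placa</td><td>PCJ9000</td></tr>
  <tr><td>Renavam</td><td>579400000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Build a {label: value} dict from each two-cell row of the table
data = {
    row.find_all("td")[0].get_text(strip=True): row.find_all("td")[1].get_text(strip=True)
    for row in soup.select("table#resultado tr")
}
print(data)  # {'Placa': 'PCJ9000', 'Renavam': '579400000'}
```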