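I am trying to scrape this page - https://www.g2.com/products/dropbox/reviews - but I get detected as soon as the request arrives. Is there any way around this?
Before this I tried using requests, but that was detected as well. *I cannot use Scrapy in this project. I also could not find proper information online on how to solve this...
Maybe I need to add custom headers?
Right now the output of the code is the title of the page that tells you that you were detected:
Pardon Our Interruption
Code: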
from selenium import webdriver


def fetch(URL):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--disable-infobars')
    options.add_argument('--disable-extensions')
    options.add_argument('--profile-directory=Default')
    options.add_argument('--incognito')
    options.add_argument('--disable-plugins-discovery')
    options.add_argument('--start-maximized')
    # Selenium 3 style call: 'chromedriver' is looked up on PATH
    driver = webdriver.Chrome('chromedriver', options=options)
    driver.get(URL)
    print(driver.title)
    driver.quit()


fetch('https://www.g2.com/products/dropbox/reviews')
Edit: I was able to get around it and fetch a single page, but on the second run I was detected again. Code:
from selenium import webdriver


def fetch(URL):
    # Start Firefox in private browsing mode
    firefox_profile = webdriver.FirefoxProfile()
    firefox_profile.set_preference("browser.privatebrowsing.autostart", True)
    browser = webdriver.Firefox(executable_path='geckodriver.exe', firefox_profile=firefox_profile)
    browser.get(URL)
    print(browser.title)


fetch('https://www.g2.com/products/dropbox/reviews')
I took your code, made a few tweaks, and ran the script with the ChromeDriver/Chrome combination. I ran into a similar issue: a page titled "Pardon Our Interruption", as shown below:
- Code block:
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('window-size=1200x600')
    options.add_argument('--headless')
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option('useAutomationExtension', False)
    driver = webdriver.Chrome(options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe')
    driver.get("https://www.g2.com/products/dropbox/reviews")
    print(driver.page_source)
    driver.quit()
- Console output:
<html lang="zxx"><head> <title>Pardon Our Interruption</title> <link rel="stylesheet" type="text/css" href="//cdn.distilnetworks.com/css/distil.css" media="all"> <meta http-equiv="content-type" content="text/html; charset=UTF-8"> <meta name="viewport" content="width=1000"> <meta name="robots" content="noindex, nofollow"> <meta http-equiv="cache-control" content="max-age=0"> <meta http-equiv="cache-control" content="no-cache"> <meta http-equiv="expires" content="0"> <meta http-equiv="expires" content="Tue, 01 Jan 1980 1:00:00 GMT"> <meta http-equiv="pragma" content="no-cache"> <script type="text/javascript" async="" src="https://www.gstatic.com/recaptcha/releases/PRkVene3wKrZUWATSylf69ja/recaptcha__en.js"></script><script> function showBlockPage() { document.getElementsByClassName("container")[0].style.display = ""; } setTimeout(showBlockPage, 10000); </script> <script type="text/javascript" src="/g2-meta-data" async="" defer=""></script> <script>if (window.sessionStorage) { sessionStorage.setItem('distil_referrer', document.referrer); }</script> <script src="https://www.google.com/recaptcha/api.js" async="" defer=""></script> <script> function solvedCaptcha(payload) { const timeoutMs = 10000; protectionSubmitCaptcha("recaptcha", payload, timeoutMs).then(function() { window.location.reload(true); }); } </script> </head> <body class="block-page"> <div class="container" style=""> <script>document.getElementsByClassName("container")[0].style.display = "none";</script> <noscript>This page requires JavaScript!</noscript> <div class="row"> <div class="sidebar col-lg-4 col-sm-5"> <img src="//cdn.distilnetworks.com/images/anomaly-detected.png" alt="0"> </div> <div class="content col-lg-8 col-sm-7"> <h1>Pardon Our Interruption...</h1> <p> As you were browsing something about your browser made us think you were a bot. 
There are a few reasons this might happen: </p> <ul> <li>You're a power user moving through this website with super-human speed.</li> <li>You've disabled JavaScript and/or cookies in your web browser.</li> <li>A third-party browser plugin, such as Ghostery or NoScript, is preventing JavaScript from running. Additional information is available in this <a title="Third party browser plugins that block javascript" href="http://ds.tl/help-third-party-plugins" target="_blank">support article</a>.</li> </ul> <script>showBlockPage()</script> <p>After completing the CAPTCHA below, you will immediately regain access to the site again.</p> <div class="g-recaptcha" data-sitekey="6LcfNLkUAAAAALPSa4GI_zHIPcYVGlxNOdvMsUsh" data-callback="solvedCaptcha"><div style="width: 304px; height: 78px;"><div><iframe src="https://www.google.com/recaptcha/api2/anchor?ar=1&k=6LcfNLkUAAAAALPSa4GI_zHIPcYVGlxNOdvMsUsh&co=aHR0cHM6Ly93d3cuZzIuY29tOjQ0Mw..&hl=en&v=PRkVene3wKrZUWATSylf69ja&size=normal&cb=m8amuk5fpfe" width="304" height="78" role="presentation" name="a-x8exk2gk39a9" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox"></iframe></div><textarea id="g-recaptcha-response" name="g-recaptcha-response" class="g-recaptcha-response" style="width: 250px; height: 40px; border: 1px solid rgb(193, 193, 193); margin: 10px 25px; padding: 0px; resize: none; display: none;"></textarea></div></div> </div> </div> </div> <div id="d__fFH" style="position: absolute !important; top: -5000px !important; left: -5000px !important;"><object id="d_dlg" classid="clsid:3050f819-98b5-11cf-bb82-00aa00bdce0b" width="0px" height="0px"></object><span id="d__fF" style="font-family: ZWAdobeF, serif !important; font-size: 72px !important; visibility: hidden;">mmmmmmmmlli</span></div><div style="background-color: rgb(255, 255, 255); border: 1px solid rgb(204, 204, 204); box-shadow: rgba(0, 0, 0, 0.2) 2px 2px 3px; 
position: absolute; transition: visibility 0s linear 0.3s, opacity 0.3s linear 0s; opacity: 0; visibility: hidden; z-index: 2000000000; left: 0px; top: -10000px;"><div style="width: 100%; height: 100%; position: fixed; top: 0px; left: 0px; z-index: 2000000000; background-color: rgb(255, 255, 255); opacity: 0.05;"></div><div class="g-recaptcha-bubble-arrow" style="border: 11px solid transparent; width: 0px; height: 0px; position: absolute; pointer-events: none; margin-top: -11px; z-index: 2000000000;"></div><div class="g-recaptcha-bubble-arrow" style="border: 10px solid transparent; width: 0px; height: 0px; position: absolute; pointer-events: none; margin-top: -10px; z-index: 2000000000;"></div><div style="z-index: 2000000000; position: relative;"><iframe title="recaptcha challenge" src="https://www.google.com/recaptcha/api2/bframe?hl=en&v=PRkVene3wKrZUWATSylf69ja&k=6LcfNLkUAAAAALPSa4GI_zHIPcYVGlxNOdvMsUsh&cb=yl5twmy9lj55" name="c-x8exk2gk39a9" frameborder="0" scrolling="no" sandbox="allow-forms allow-popups allow-same-origin allow-scripts allow-top-navigation allow-modals allow-popups-to-escape-sandbox" style="width: 100%; height: 100%;"></iframe></div></div></body></html>
Analysis
If you inspect the page, you will find that the <body> tag contains:
<script>window.distilReferrerValue = function() {
    var value;
    try {
        if (window.sessionStorage) {
            value = sessionStorage.getItem('distil_referrer');
            sessionStorage.removeItem('distil_referrer');
        }
    } catch(e) {}
    window.distilReferrerValue = function() {
        return value;
    };
    return value;
};</script>
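This clearly indicates that the website https://www.g2.com/products/dropbox/reviews is protected by the bot management service provider Distil Networks, and that navigation by ChromeDriver is detected and subsequently blocked.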
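To tell a blocked response apart from a real one inside a script, one option is to check the returned HTML for markers of the block page. The following is only a minimal sketch: the function name `looks_like_distil_block` and the marker list are assumptions chosen for illustration, taken from the console output above (the page title and the `block-page` body class). Distil may change these at any time.

```python
# Markers copied from the block page in the console output above.
# The names below are hypothetical helpers, not part of Selenium or Distil.
BLOCK_PAGE_MARKERS = (
    "<title>pardon our interruption</title>",
    'class="block-page"',
)


def looks_like_distil_block(page_source: str) -> bool:
    """Return True if the HTML contains any known block-page marker."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in BLOCK_PAGE_MARKERS)


if __name__ == "__main__":
    blocked = "<html><head><title>Pardon Our Interruption</title></head></html>"
    normal = "<html><head><title>Dropbox Reviews</title></head></html>"
    print(looks_like_distil_block(blocked))
    print(looks_like_distil_block(normal))
```

With Selenium you would pass `driver.page_source` to this function right after `driver.get(...)`. This lets the script stop or back off once it has been flagged, rather than parsing the block page as if it were review content.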
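Distil
According to the article "There Really Is Something About Distil.it...":
Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.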
Further,
"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium, the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".
Reference
You can find a relevant discussion in "Chrome browser initiated through ChromeDriver gets detected".