Selenium HtmlUnitDriver web scrape gets a CAPTCHA page when run from an EC2 server

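I wrote a simple web scraper for expedia.com using Java Selenium HtmlUnitDriver. When I run it locally, it scrapes data from the site successfully.

However, when I deploy it to an EC2 server, it always gets back the page Expedia serves when it detects a bot, which shows a CAPTCHA to prove that a human is visiting.

I suspect the EC2 server's IP address has somehow been blacklisted by expedia.com?

I have tried scraping other websites from the same server, and they don't care / don't run a human test.

Any idea how to get around this?

Things I have tried that still get detected as a bot: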

Changing the user agent to the one my local browser uses
Setting a proxy
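
To check whether the block depends on the EC2 IP rather than on the driver, the same two settings can be reproduced outside HtmlUnitDriver. The following is only a sketch using the plain JDK `java.net.http.HttpClient` (Java 11+), not the HtmlUnitDriver setup from the question. The class name, the user-agent string, and the proxy host and port are placeholders assumed for illustration.

```java
import java.net.InetSocketAddress;
import java.net.ProxySelector;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Sketch only: plain JDK HttpClient, not HtmlUnitDriver.
class UserAgentProxySketch {

    // Assumption: replace with the exact value your local browser sends.
    static final String LOCAL_BROWSER_UA =
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36";

    // Builds a client that routes every HTTP/HTTPS request through one proxy.
    // The address is left unresolved so no DNS lookup happens until connect time.
    static HttpClient clientVia(String proxyHost, int proxyPort) {
        return HttpClient.newBuilder()
                .proxy(ProxySelector.of(
                        InetSocketAddress.createUnresolved(proxyHost, proxyPort)))
                .connectTimeout(Duration.ofSeconds(10))
                .build();
    }

    // Builds a GET request that carries the given User-Agent header.
    static HttpRequest requestFor(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .GET()
                .build();
    }
}
```

Sending `requestFor(url, LOCAL_BROWSER_UA)` through `clientVia("proxy.example.com", 8080)` with `HttpResponse.BodyHandlers.ofString()` returns the raw page, which can then be compared against what the driver receives. The proxy host here is hypothetical.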

Update: actually, setting a proxy server gives me a different error:

The current URL is https://www.expedia.com/things-to-do/search?location=Paris&pageNumber=1

htmlString:

<!--?xml version="1.0" encoding="ISO-8859-1"?-->
<html>
<head> 
<title>
500 Internal Server Error
</title> 
</head> 
<body> 
<h1> Internal Server Error </h1> 
<p> The server encountered an internal error or misconfiguration and was unable to complete your request. </p> 
<p> Please contact the server administrator at [no address given] to inform them of the time this error occurred, and the actions you performed just before this error. </p> 
<p> More information about this error may be available in the server error log. </p> 
<hr> 
<address> Apache/2.4.18 (Ubuntu) Server at www.expedia.com Port 443 </address>   
</body>
</html>
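
A scraper can tell this kind of error page apart from a real result page before it tries to parse it. Below is a minimal sketch that pulls the `<title>` out of the `htmlString` and flags titles that begin with a 5xx status code, as the page above does. The class and method names are hypothetical, and the regex is a rough heuristic rather than a full HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: a heuristic check on the page title, not a full HTML parser.
class ErrorPageCheck {

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the page title with surrounding and repeated whitespace collapsed.
    static Optional<String> title(String html) {
        Matcher m = TITLE.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1).trim().replaceAll("\\s+", " "));
    }

    // True when the title starts with a three-digit 5xx code,
    // e.g. "500 Internal Server Error".
    static boolean looksLikeServerError(String html) {
        return title(html)
                .map(t -> t.matches("5\\d\\d\\b.*"))
                .orElse(false);
    }
}
```

When `looksLikeServerError(htmlString)` is true, the scraper could log the page and retry or switch proxies instead of treating it as scraped data.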

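Have you covered the following points?

-Which user agent are you using? Make sure it is the same user agent you use when you browse the site manually.

-Are you inserting waits into your navigation? If you click or navigate the instant a page finishes loading, you are not simulating normal browsing.

-Which driver are you using? With chromedriver there is a trick: rename the internal variable "cdc_" to something else, such as "aaa_". If the site runs JavaScript that tries to detect this variable (cdc_), the detection will fail.

-If you really must not be detected by the server, there is more to look into: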

-Is there a honeypot in place?
-Is your IP (the EC2 IP) already blocked? You could route the traffic through a VPN tunnel.
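
On the point about waits, one way to avoid clicking at machine speed is to sleep for a randomized interval between navigation steps. The following is a minimal sketch using only the JDK. The class name, method names, and the 800 to 2500 ms range mentioned afterwards are assumptions for illustration, not values known to satisfy Expedia's bot detection.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch only: randomized pauses between navigation steps.
class HumanPause {

    // Picks a delay uniformly from [minMillis, maxMillis], both inclusive.
    static long randomDelayMillis(long minMillis, long maxMillis) {
        if (minMillis < 0 || maxMillis < minMillis) {
            throw new IllegalArgumentException("need 0 <= minMillis <= maxMillis");
        }
        // nextLong's upper bound is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    // Blocks the current thread for a random delay in the given range.
    static void pause(long minMillis, long maxMillis) throws InterruptedException {
        Thread.sleep(randomDelayMillis(minMillis, maxMillis));
    }
}
```

A call such as `HumanPause.pause(800, 2500)` would go between `driver.get(...)` and the next click or page request, so that consecutive actions are not evenly spaced.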

Interesting articles:

https://www.kdnuggets.com/2018/02/web-scraping-tutorial-python.html

https://antoinevastel.com/bot%20detection/2017/08/05/detect-chrome-headles.html

https://intoli.com/blog/making-chrome-headless-undetectable/
