How to avoid being detected as a bot in Puppeteer and PhantomJS?



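Puppeteer and PhantomJS are similar. The problem I'm having occurs with both, and the code is similar too.

I want to capture some information from a website that requires authentication to view it. I can't even reach the home page, because the visit is flagged as "suspicious activity", as in this screenshot: https://i.stack.imgur.com/Atovn.png

I found that the problem does not occur when I test in Postman with a header named Cookie whose value I captured from the browser, but that cookie expires after a while. So I guess neither Puppeteer nor PhantomJS is capturing the cookie, because the site denies access to headless browsers.

What can I do to get around this?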

// Simple JavaScript example
var page = require('webpage').create();
var url = 'https://www.expertflyer.com';
page.open(url, function (status) {
    if (status === "success") {
        page.render("home.png");
    }
    phantom.exit(); // exit on failure too, so PhantomJS does not hang
});

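In case someone needs this in the future for the same problem: use puppeteer-extra.

I tested the code on a server. On the second run there was a Google captcha. You can solve it yourself and restart the bot, or use a captcha-solving service.

I ran the code more than 10 times without an IP ban, and I did not get the captcha again while I kept running it.

But you can get the captcha again!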

// npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth puppeteer-extra-plugin-adblocker
// (readline is built into Node.js and does not need to be installed)
var headless_mode = process.argv[2]
const readline = require('readline');
const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())
const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker')
puppeteer.use(AdblockerPlugin({ blockTrackers: true }))

async function run () {
  const browser = await puppeteer.launch({
    headless: headless_mode === 'true',
    ignoreHTTPSErrors: true,
    slowMo: 0,
    args: ['--window-size=1400,900',
    '--remote-debugging-port=9222',
    "--remote-debugging-address=0.0.0.0", // Exposes the debugging port on all interfaces. Do you know what you're doing?
    '--disable-gpu', "--disable-features=IsolateOrigins,site-per-process", '--blink-settings=imagesEnabled=true'
    ]})
  const page = await browser.newPage();

  console.log(`Testing expertflyer.com`)
  //await page.goto('https://www.expertflyer.com')
  await goto_Page('https://www.expertflyer.com')
  await waitForNetworkIdle(page, 3000, 0)
  //await page.waitFor(7000)
  await checking_error(do_2nd_part)


  async function do_2nd_part(){
    try{await page.click('#yui-gen2 > a')}catch{}
    await page.waitFor(5000)
    var seat = '#headerTitleContainer > h1'
    try{console.log(await page.$eval(seat, e => e.innerText))}catch{}
    await page.screenshot({ path: 'expertflyer1.png'})
    await checking_error(do_3nd_part)
  }
  async function do_3nd_part(){
    try{await page.click('#yui-gen1 > a')}catch{}
    await page.waitFor(5000)
    var pro = '#headerTitleContainer > h1'
    try{console.log(await page.$eval(pro, e => e.innerText))}catch{}
    await page.screenshot({ path: 'expertflyer2.png'})
    console.log(`All done, check the screenshots?`)
  }

  async function checking_error(callback){
    try{
      try{var error_found = await page.evaluate(() => document.querySelectorAll('a[class="text yuimenubaritemlabel"]').length)}catch(error){console.log(`catch error ${error}`)}
      if (error_found === 0) {
        console.log(`Error found`)
        var captcha_msg = "Due to suspicious activity from your computer, we have blocked your access to ExpertFlyer. After completing the CAPTCHA below, you will immediately regain access unless further suspicious behavior is detected."
        var ip_blocked = "Due to recent suspicious activity from your computer, we have blocked your access to ExpertFlyer. If you feel this block is in error, please contact us using the form below."
        try{var error_msg = await page.$eval('h2', e => e.innerText)}catch{}
        try{var error_msg_details = await page.$eval('body > p:nth-child(2)', e => e.innerText)}catch{}
        if (error_msg_details == captcha_msg) {
          console.log(`Google captcha found. You have to solve the captcha manually or use an automated reCAPTCHA-solving service.`)
          await verify_User_answer()
          await callback()
        } else if (error_msg_details == ip_blocked) {
          console.log(`The current ip address is blocked. The only way is change the ip address.`)
        } else {
          console.log(`Waiting for error page load... Waiting for 10 sec before rechecking...`)
          await page.waitFor(10000)
          await checking_error(callback) // pass the callback along, otherwise it is undefined on the retry
        }
      } else {
        console.log(`Page loaded successfully! You can do things here.`)
        await callback()
      }
    }catch{}
  }
  async function goto_Page(page_URL){
    try{
      await page.goto(page_URL, { waitUntil: 'networkidle2', timeout: 30000 });
    } catch {
      console.log(`Error in loading page, re-trying...`)
      await goto_Page(page_URL)
    }
  }
  async function verify_User_answer(call_back){
      user_Answer = await readLine();
      if (user_Answer == 'yes') {
        console.log(`user_Answer is ${user_Answer}, Processing...`)
        // Not working what i want. Will fix later
        // Have to restart the bot after solving
        await call_back()
      } else {
        console.log(`answer not match. try again...`)
        var user_Answer = await readLine();
        console.log(`user_Answer is ${user_Answer}`)
        await verify_User_answer(call_back)
      }
    }
    async function readLine() {
      const rl = readline.createInterface({
        input: process.stdin,
        output: process.stdout
      });
      return new Promise(resolve => {
        rl.question('Solve the captcha and type yes to continue: ', (answer) => {
          rl.close();
          resolve(answer)
        });
      })
    }
  async function waitForNetworkIdle(page, timeout, maxInflightRequests = 0) {
    console.log('waitForNetworkIdle called')
    page.on('request', onRequestStarted);
    page.on('requestfinished', onRequestFinished);
    page.on('requestfailed', onRequestFinished);
    let inflight = 0;
    let fulfill;
    let promise = new Promise(x => fulfill = x);
    let timeoutId = setTimeout(onTimeoutDone, timeout);
    return promise;
    function onTimeoutDone() {
      page.removeListener('request', onRequestStarted);
      page.removeListener('requestfinished', onRequestFinished);
      page.removeListener('requestfailed', onRequestFinished);
      fulfill();
    }
    function onRequestStarted() {
      ++inflight;
      if (inflight > maxInflightRequests)
        clearTimeout(timeoutId);
    }
    function onRequestFinished() {
      if (inflight === 0)
        return;
      --inflight;
      if (inflight === maxInflightRequests)
        timeoutId = setTimeout(onTimeoutDone, timeout);
    }
  }

  await browser.close()
}
run();

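Note that the "Solve the captcha and type yes to continue:" prompt does not work as expected and needs some fixing: verify_User_answer is called without a callback, so its call_back() call throws, the empty catch in checking_error swallows the error, and the bot has to be restarted after the captcha is solved.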
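One way the prompt could be restructured is sketched below. This is not the answer author's code: `isConfirmation` and `waitForYes` are hypothetical helper names, and unlike the original exact comparison with 'yes' the check here also ignores case and surrounding whitespace. The idea is to keep a single readline interface open, re-ask until the answer matches, and simply return so the caller can continue.

```javascript
const readline = require('readline');

// True when the typed answer counts as a confirmation.
function isConfirmation(answer) {
  return answer.trim().toLowerCase() === 'yes';
}

// Keep asking on one readline interface until the user types "yes".
// The returned promise resolves once a confirmation has been entered.
function waitForYes(input = process.stdin, output = process.stdout) {
  const rl = readline.createInterface({ input, output });
  return new Promise(resolve => {
    const ask = () => {
      rl.question('Solve the captcha and type yes to continue: ', answer => {
        if (isConfirmation(answer)) {
          rl.close();
          resolve();
        } else {
          console.log('Answer did not match, try again...');
          ask();
        }
      });
    };
    ask();
  });
}

// Possible usage inside checking_error, in place of verify_User_answer():
//   await waitForYes();
//   await callback();
```

Because `waitForYes` takes no callback, the caller decides what happens next, which avoids the undefined `call_back` problem described above.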

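Edit: I re-ran the bot after 10 minutes and got the captcha again. I solved the captcha through chrome://inspect/#devices, restarted the bot, and everything worked again. No IP ban.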

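Things that can help in general:

  • Headers should resemble those of common browsers, including:
    • User-Agent: use a recent one (see https://developers.whatismybrowser.com/useragents/explore/), or better, if you make multiple requests, use a random recent one (see https://github.com/skratchdot/random-useragent)
    • Accept-Language: something like "en,en-US;q=0.5" (adapt to your language)
    • Accept: a standard one would be "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
  • If you make multiple requests, put a random timeout between them
  • If you open links found in a page, set the Referer header accordingly
  • Images should be enabled
  • JavaScript should be enabled
    • Check that "navigator.plugins" and "navigator.language" are set in the client-side JavaScript page context
  • Use proxies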
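As a rough sketch of how several of these tips might be applied in Puppeteer: the user-agent strings, the helper names (`pickRandom`, `randomDelay`, `browserLikeHeaders`, `politeVisit`), and the 2 to 6 second delay bounds are illustrative assumptions, not values from the answer. `page` is expected to be a Puppeteer Page created elsewhere.

```javascript
// Hypothetical pool of desktop user agents; replace with current ones.
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
];

// Pick a random element from a non-empty array.
function pickRandom(items) {
  return items[Math.floor(Math.random() * items.length)];
}

// Random integer number of milliseconds in [minMs, maxMs].
function randomDelay(minMs, maxMs) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

// Headers resembling what a common browser sends.
function browserLikeHeaders() {
  return {
    'Accept-Language': 'en,en-US;q=0.5',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  };
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Visit `url` with browser-like headers after a random pause.
// `referer` is the page on which the link was found (optional).
async function politeVisit(page, url, referer) {
  await page.setUserAgent(pickRandom(USER_AGENTS));
  await page.setExtraHTTPHeaders(browserLikeHeaders());
  await sleep(randomDelay(2000, 6000));
  await page.goto(url, referer ? { referer } : {});
  // Report what client-side scripts on the page would see.
  return page.evaluate(() => ({
    plugins: navigator.plugins.length,
    language: navigator.language,
  }));
}
```

The value returned by `politeVisit` can be logged to check that `navigator.plugins` and `navigator.language` look plausible for a regular browser.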

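From the website's point of view, you really are doing something suspicious. So whenever you want to get around something like this, make sure to consider how they think.

Set cookies properly

Puppeteer, PhantomJS, and similar tools use a real browser, and the cookies used there work better than those sent through Postman and the like. You just need to use the cookies properly.

You can use page.setCookie(...cookies) to set cookies. It takes each cookie as a separate argument, so if cookies is an array of objects, you can simply spread it like this: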

const cookies = [{name: 'test', value: 'foo'}, {name: 'test2', value: 'foo'}]; // just as example, use real cookies here;
await page.setCookie(...cookies);
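The question mentions copying a raw Cookie header value from the browser into Postman. A minimal sketch of turning such a string into the array shape used above follows; the `cookiesFromHeader` helper and the sample cookie names are hypothetical. The `domain` field is added on the assumption that the cookies are set before the first navigation, when Puppeteer has no page URL to attach them to.

```javascript
// Convert a raw "Cookie" request-header value into objects for page.setCookie.
// Parts without a name (no "=" or a leading "=") are skipped.
function cookiesFromHeader(header, domain) {
  return header
    .split(';')
    .map(part => part.trim())
    .filter(part => part.indexOf('=') > 0)
    .map(part => {
      const idx = part.indexOf('=');
      return { name: part.slice(0, idx), value: part.slice(idx + 1), domain };
    });
}

// Possible usage inside an async function, with `page` being a Puppeteer Page:
//   const cookies = cookiesFromHeader('JSESSIONID=abc123; theme=dark', 'www.expertflyer.com');
//   await page.setCookie(...cookies);
```

Since the copied cookie expires after a while, the header string would need to be refreshed from a real browser session from time to time.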

Try tweaking the behavior

Turn off headless mode and see how the website behaves.

await puppeteer.launch({headless: false})

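Try proxies

Some websites monitor traffic by IP address and block requests if too many hits come from the same IP. In that case, it is best to use rotating proxies.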
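A simple round-robin rotation could look like the sketch below. The proxy addresses and the helper names (`createProxyRotator`, `launchArgsFor`) are placeholders made up for illustration; `--proxy-server` is the Chromium command-line switch for routing traffic through a proxy.

```javascript
// Hypothetical proxy endpoints; replace with proxies you are allowed to use.
const PROXIES = [
  'http://10.0.0.1:8080',
  'http://10.0.0.2:8080',
  'http://10.0.0.3:8080',
];

// Returns a function that yields the proxies in round-robin order.
function createProxyRotator(proxies) {
  let next = 0;
  return () => {
    const proxy = proxies[next];
    next = (next + 1) % proxies.length;
    return proxy;
  };
}

// Build Chromium launch args that route traffic through the given proxy.
function launchArgsFor(proxy) {
  return [`--proxy-server=${proxy}`];
}

// Possible usage inside an async function, launching one browser per proxy:
//   const nextProxy = createProxyRotator(PROXIES);
//   const browser = await puppeteer.launch({ args: launchArgsFor(nextProxy()) });
```

The proxy is fixed at launch time here, so rotating means closing the browser and launching a new one with the next proxy.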

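The website you are trying to access uses Distil Networks to prevent web scraping.

In the past, people have had success bypassing Distil Networks by replacing the $cdc_ variable in Chromium's call_function.js (which is used in Puppeteer).

For example: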

function getPageCache(opt_doc, opt_w3c) {
  var doc = opt_doc || document;
  var w3c = opt_w3c || false;
  // var key = '$cdc_asdjflasutopfhvcZLmcfl_';    <-- This is the line that is changed.
  var key = '$something_different_';
  if (w3c) {
    if (!(key in doc))
      doc[key] = new CacheWithUUID();
    return doc[key];
  } else {
    if (!(key in doc))
      doc[key] = new Cache();
    return doc[key];
  }
}

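Note: According to this comment, if you were already blacklisted before making this change, you face another set of challenges, so you must "implement fake canvas fingerprinting, disable Flash, change your IP, and change the request header order (swap the language and Accept headers)".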
