Puppeteer times out on a specific website when running on a cloud server



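I wrote a Node.js web scraper that works fine on my computer. However, when I deploy it to a Google Cloud VM instance running Debian, it returns a timeout error for one specific website. I have tried many different Puppeteer settings, but none of them seem to work. I believe the website I am trying to scrape blocks my code when it runs from the Google Cloud server, but not when it runs from my computer. The scraping part works fine on my computer: Puppeteer finds the HTML tags and retrieves the information.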

const puppeteer = require('puppeteer');
const GoogleSpreadsheet = require('google-spreadsheet');
const { promisify } = require('util');
const credentials = require('./credentials.json');

async function main() {
  const scrapCopasa = await scrapCopasaFunction();
  console.log('Done!');
}

async function scrapCopasaFunction() {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox'],
  });
  const page = await browser.newPage();
  //await page.setDefaultNavigationTimeout(0);
  //await page.setViewport({width: 1366, height: 768});
  await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36');
  await page.goto('http://www.copasa.com.br/wps/portal/internet/abastecimento-de-agua/nivel-dos-reservatorios');
  //await new Promise(resolve => setTimeout(resolve, 5000));

  // Detect an error page by looking at the first <h2> element.
  let isUsernameNotFound = await page.evaluate(() => {
    if (document.getElementsByTagName('h2')[0]) {
      if (document.getElementsByTagName('h2')[0].textContent == "Sorry, this page isn't available.") {
        return true;
      }
    }
  });
  if (isUsernameNotFound) {
    console.log('Account not exists!');
    await browser.close();
    return;
  }

  // Collect the text of every table cell on the page.
  let reservoirLevelsCopasa = await page.evaluate(() => {
    const tds = Array.from(document.querySelectorAll('table tr td'));
    return tds.map(td => td.innerText);
  });

  // Pick the cells holding each reservoir level and convert the decimal comma to a dot.
  const riomanso = reservoirLevelsCopasa[13].replace(",", ".").substring(0, 5);
  const serraazul = reservoirLevelsCopasa[17].replace(",", ".").substring(0, 5);
  const vargemdasflores = reservoirLevelsCopasa[21].replace(",", ".").substring(0, 5);
  await browser.close();
  return [riomanso, serraazul, vargemdasflores];
}

main();

The error I get is as follows:

(node:6425) UnhandledPromiseRejectionWarning: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded
at /home/xxx/reservoirs/node_modules/puppeteer/lib/LifecycleWatcher.js:142:21
at async FrameManager.navigateFrame (/home/xxx/reservoirs/node_modules/puppeteer/lib/FrameManager.js:94:17)
at async Frame.goto (/home/xxx/reservoirs/node_modules/puppeteer/lib/FrameManager.js:406:12)
at async Page.goto (/home/xxx/reservoirs/node_modules/puppeteer/lib/Page.js:674:12)
at async scrapCopasaFunction (/home/xxx/reservoirs/reservatorios.js:129:5)
at async main (/home/xxx/reservoirs/reservatorios.js:9:25)
-- ASYNC --
at Frame.<anonymous> (/home/xxx/reservoirs/node_modules/puppeteer/lib/helper.js:111:15)
at Page.goto (/home/xxx/reservoirs/node_modules/puppeteer/lib/Page.js:674:49)
at Page.<anonymous> (/home/xxx/reservoirs/node_modules/puppeteer/lib/helper.js:112:23)
at scrapCopasaFunction (/home/xxx/reservoirs/reservatorios.js:129:16)
at processTicksAndRejections (internal/process/task_queues.js:93:5)
at async main (/home/xxx/reservoirs/reservatorios.js:9:25)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:6425) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:6425) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

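Cloud Functions are a bit slow for Puppeteer; there is a GitHub issue about this (#3120). If possible, allocate more CPU/RAM to the function. The more CPU and RAM you give Chrome, the faster it will run.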

You can pass a timeout to goto. It is the maximum navigation time in milliseconds and defaults to 30 seconds; pass 0 to disable the timeout.

await page.goto('http://www.copasa.com.br', { timeout: 60000 });

You can also set the navigation timeout with setDefaultTimeout and setDefaultNavigationTimeout; the latter takes priority over setDefaultTimeout.

page.setDefaultNavigationTimeout(60000)
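A longer timeout can be combined with a few retries, so that one slow response does not abort the whole run. The following is only a minimal sketch: gotoWithRetry is a hypothetical helper, not part of Puppeteer, and the number of attempts, the 60-second timeout, and the 'domcontentloaded' wait condition are assumptions to tune. It retries on any navigation error, not only on TimeoutError.

```javascript
// Hypothetical helper (not a Puppeteer API): retry page.goto() a few times.
// `page` is any object exposing a Puppeteer-style goto(url, options) method.
async function gotoWithRetry(page, url, options = {}) {
  // Assumed defaults; adjust them for the target site.
  const { attempts = 3, timeout = 60000, waitUntil = 'domcontentloaded' } = options;
  let lastError;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      // Resolve with whatever goto() returns on the first successful attempt.
      return await page.goto(url, { timeout, waitUntil });
    } catch (error) {
      lastError = error;
      console.warn(`Attempt ${attempt}/${attempts} failed: ${error.message}`);
    }
  }
  // Every attempt failed: rethrow the last error so the caller can handle it.
  throw lastError;
}
```

In the scraper above it would replace the direct call, for example await gotoWithRetry(page, url). If the site really does block the server's IP address, retries will not help and every attempt will time out.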

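The data you are extracting is already in the HTML, so you can fetch the HTML with a plain HTTP request and extract the data in a Node.js script instead of in a browser. This will be faster and use fewer resources. If authentication is required, you can send a POST request and reuse the cookies in the GET requests that follow. There is an example in this answer.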
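For the cookie reuse mentioned above, the login response's Set-Cookie headers have to be turned into a single Cookie request header. A minimal sketch, assuming the headers arrive as an array of strings (as Node's HTTP response headers expose them under 'set-cookie'); cookieHeaderFrom is a hypothetical helper, and it ignores cookie attributes such as Path, Domain, and expiry:

```javascript
// Hypothetical helper: build a Cookie request header from Set-Cookie response headers.
// Only the leading "name=value" pair of each header is kept.
function cookieHeaderFrom(setCookieHeaders = []) {
  return setCookieHeaders
    .map(header => header.split(';')[0].trim()) // drop Path, HttpOnly, Max-Age, ...
    .filter(Boolean)                            // skip empty entries
    .join('; ');
}

console.log(cookieHeaderFrom(['sid=abc123; Path=/; HttpOnly', 'lang=pt; Max-Age=3600']));
// prints "sid=abc123; lang=pt"
```

The resulting string would be sent as the cookie header of the following GET requests. HTTP clients with cookie jar support can handle this automatically, which is usually the more robust choice.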

Full example

const cheerio = require('cheerio')
const got = require('got')

const URL = 'http://www.copasa.com.br/wps/portal/internet/abastecimento-de-agua/nivel-dos-reservatorios'

// Print the error and exit with a non-zero status code.
function reportAndExit (error) {
  console.error(error)
  process.exit(1)
}

async function main () {
  const headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'
  }
  // Request headers must be passed under the `headers` option.
  const response = await got(URL, { headers })
  const $ = cheerio.load(response.body)
  // Take the fourth cell of each data row in the first table and parse it as a number.
  const reservoirLevelsCopasa = $('#conteudo-principal table:first-of-type tr:nth-of-type(n+3) td:nth-child(4)')
    .map((i, el) => parseFloat($(el).text().replace(',', '.')))
    .get()
  console.log(reservoirLevelsCopasa)
  return reservoirLevelsCopasa
}

main().catch(reportAndExit)

Output

[ 83.4, 88.8, 85.9 ]
