Uploading web-scraped data from Puppeteer to Firebase Cloud Storage in Node.js

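I am trying to scrape a news website by opening each article link and collecting the data. I can do the scraping with Puppeteer, but I cannot upload the result to Firebase Cloud Storage. How can I do that every hour or so? I did the web scraping in an asynchronous function and then called it from the cloud function. I use Puppeteer to scrape the article links from the newsroom page, then use those links to get more information from each article. I originally had everything in a single async function, but Cloud Functions threw an error saying there should not be any await inside a loop.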

Update:

I implemented the code above in a Firebase function, but I am still getting the no-await-in-loop error.

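There are a few things wrong here, but you are on a good path to getting this working. The main problem is most likely how await is combined with the try {} catch {} block: await only works directly inside an async function, not inside a nested non-async callback, and asynchronous JavaScript reports errors as rejected promises rather than ordinary thrown exceptions. See: try/catch blocks with async/await.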
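For reference, await inside try/catch does work when the await sits directly in an async function: a rejected promise is then thrown like a normal exception. Here is a minimal sketch; fetchTitle and safeFetchTitle are hypothetical names standing in for a Puppeteer call, not anything from your code.

```javascript
// Hypothetical stand-in for an async step such as page.goto() or page.evaluate().
async function fetchTitle(link) {
  if (!link.startsWith('https://')) {
    throw new Error(`Unsupported link: ${link}`);
  }
  return `Title of ${link}`;
}

// An awaited rejection surfaces as a normal exception, so try/catch
// can handle it as long as the await is directly inside an async function.
async function safeFetchTitle(link) {
  try {
    return await fetchTitle(link);
  } catch (err) {
    console.error(`Skipping ${link}: ${err.message}`);
    return undefined;
  }
}
```

Note the return await: with a bare return fetchTitle(link), the rejection would escape the catch block and reach the caller instead.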

In your case, it is perfectly fine to write everything in one async function. Here is how I would do it:

const puppeteer = require('puppeteer');

async function scrapeIfc() {
  const completeData = [];
  const url = 'https://www.ifc.org/wps/wcm/connect/news_ext_content/ifc_external_corporate_site/news+and+events/pressroom/press+releases';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Disable the navigation timeout before navigating (0 means no limit).
  page.setDefaultNavigationTimeout(0);
  await page.goto(url);
  // Collect the href of every article link on the listing page.
  const links = await page.evaluate(() =>
    Array.from(document.querySelectorAll('h3 > a')).map(anchor => anchor.href)
  );
  // Visit the articles one at a time so only one extra tab is open.
  for (const link of links) {
    const newPage = await browser.newPage();
    await newPage.goto(link);
    // Runs in the browser context; a missing element yields undefined.
    const data = await newPage.evaluate(() => {
      const titleElement = document.querySelector('td[class="PressTitle"] > h3');
      const contactElement = document.querySelector('center > table > tbody > tr:nth-child(1) > td');
      const txtElement = document.querySelector('center > table > tbody > tr:nth-child(2) > td');
      return {
        source: 'ITC',
        title: titleElement ? titleElement.innerText : undefined,
        contact: contactElement ? contactElement.innerText : undefined,
        txt: txtElement ? txtElement.innerText : undefined,
      };
    });
    completeData.push(data);
    // Wait for the tab to close before opening the next one.
    await newPage.close();
  }
  await browser.close();
  return completeData;
}
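If the error comes from a lint rule such as ESLint's no-await-in-loop (an assumption, since the exact message is not shown), there are two common ways to deal with it: keep the sequential loop and disable the rule for that line, or start all the work at once and await Promise.all. The sketch below shows both; scrapeArticle is a hypothetical stand-in for the per-article Puppeteer work.

```javascript
// Hypothetical stand-in for opening one article page and extracting its data.
async function scrapeArticle(link) {
  return { source: 'ITC', link };
}

// Option 1: one article at a time. The rule is silenced on purpose.
async function scrapeSequentially(links) {
  const results = [];
  for (const link of links) {
    // eslint-disable-next-line no-await-in-loop
    results.push(await scrapeArticle(link));
  }
  return results;
}

// Option 2: all articles concurrently. There is no await inside a loop,
// and Promise.all keeps the results in the same order as the links.
async function scrapeConcurrently(links) {
  return Promise.all(links.map((link) => scrapeArticle(link)));
}
```

With Puppeteer, option 2 opens one tab per link at the same time, which can use a lot of memory on a long listing, so the sequential version is often the safer choice inside a cloud function.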

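A few more things you should note:

You have a bunch of unused imports (title, link, resolve and reject) at the head of your script, which were probably added automatically by your code editor. Get rid of them, as they might shadow the real variables.
I changed your document.querySelector calls to be more specific, as I could not select the actual elements from the ITC website. You might need to revise them.
For local development I use Google's Functions Framework, which lets me run and test functions locally before deploying. If you get errors on your local machine, you will get errors when deploying to Google Cloud.
(Opinion) If you don't need Firebase, I would run this with Google Cloud Functions, Cloud Scheduler and Cloud Firestore. For me, that has been the go-to workflow for periodically scraping websites.
(Opinion) Puppeteer might be overkill for scraping a simple static website, since it runs a headless browser. Something like Cheerio is much more lightweight and much faster.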
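The question also asks how to store the result and repeat the run every hour. Here is a rough sketch, assuming the firebase-admin and firebase-functions (v1 API) packages: build a timestamped object name, write the articles as JSON through the bucket's file(name).save(...), and call that from a scheduled function. The names buildUpload, uploadArticles and hourlyScrape, the scrapes/ prefix, and the memory and timeout values are all illustrative, not taken from your project. The Firebase wiring is left in comments so the helper can be tried with a fake bucket first.

```javascript
// Pure helper: turns scraped articles into an object name and a JSON body.
function buildUpload(articles, date) {
  // Replace ':' and '.' in the ISO timestamp to keep the object name simple.
  const stamp = date.toISOString().replace(/[:.]/g, '-');
  return {
    name: `scrapes/${stamp}.json`,
    body: JSON.stringify(articles, null, 2),
  };
}

// Writes the articles to any bucket-like object exposing file(name).save(data, options),
// which is the shape of the Bucket returned by admin.storage().bucket().
async function uploadArticles(bucket, articles, date = new Date()) {
  const { name, body } = buildUpload(articles, date);
  await bucket.file(name).save(body, { contentType: 'application/json' });
  return name;
}

// Assumed wiring inside functions/index.js (not executed here):
//
// const functions = require('firebase-functions');
// const admin = require('firebase-admin');
// admin.initializeApp();
//
// exports.hourlyScrape = functions
//   .runWith({ memory: '1GB', timeoutSeconds: 300 })
//   .pubsub.schedule('every 60 minutes')
//   .onRun(async () => {
//     const articles = await scrapeIfc();
//     await uploadArticles(admin.storage().bucket(), articles);
//   });
```

Keeping the upload behind a small function that takes the bucket as an argument means the same code can be exercised locally, for example under the Functions Framework, without touching a real bucket.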

Hope I could help. If you run into other problems, let us know. Welcome to the Stack Overflow community!
