Unable to reuse newly populated links using Puppeteer



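I've written a script in Node.js using puppeteer to parse the links of all the posts on a webpage and then use those links to navigate to their inner pages and scrape the titles.

I could have scraped the titles from the landing page itself, but my intention is to navigate with the newly populated links and parse the titles from the target pages. When I execute the script, it scrapes the first title and then throws an error. How can I make it work while keeping the logic I've tried to apply?

Link to the site

Link to one such target page

This is my script so far: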

const puppeteer = require("puppeteer");
(async function main() {
    const browser = await puppeteer.launch({headless:false});
    const page = await browser.newPage();
    await page.goto("https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pageSize=50");
    page.waitForSelector(".summary");
    const sections = await page.$$(".summary");
    for (const section of sections) {
        const itemName = await section.$eval(".question-hyperlink", item => item.href);
        (async function main() {
            await page.goto(itemName);
            page.waitForSelector(".summary");
            const titles = await page.$$("#question-header");
            for (const title of titles) {
                const itmName = await title.$eval("#question-header .question-hyperlink", itm => itm.innerText);
                console.log(itmName);
            }
        })();
    }
    browser.close();
})();

What I can see in the console:

(node:1992) UnhandledPromiseRejectionWarning: Error: Execution context was destroyed, most likely because of a navigation.
    at rewriteError (c:\Users\WCS\node_modules\puppeteer\lib\ExecutionContext.js:144:15)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:189:7)
(node:1992) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 1)
(node:1992) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
How to search content related to keyword in an website?
(node:1992) UnhandledPromiseRejectionWarning: TimeoutError: waiting for selector ".summary" failed: timeout 30000ms exceeded
    at new WaitTask (c:\Users\WCS\node_modules\puppeteer\lib\FrameManager.js:862:28)
    at Frame._waitForSelectorOrXPath (c:\Users\WCS\node_modules\puppeteer\lib\FrameManager.js:753:12)
    at Frame.waitForSelector (c:\Users\WCS\node_modules\puppeteer\lib\FrameManager.js:711:17)
    at Page.waitForSelector (c:\Users\WCS\node_modules\puppeteer\lib\Page.js:1043:29)
    at main (c:\Users\WCS\scrape.js:15:18)
at <anonymous>
at process._tickCallback (internal/process/next_tick.js:189:7)
(node:1992) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). (rejection id: 2)

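You can see that I got one result in between the errors.

There are two ways to solve your problem:

First: build an array of the URLs you want to visit, then reuse page to go to each of them.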

const puppeteer = require("puppeteer");
(async function main() {
    const browser = await puppeteer.launch({headless:false});
    const page = await browser.newPage();
    await page.goto("https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pageSize=50", {waitUntil: 'networkidle2'});
    await page.waitForSelector(".summary");
    // Collect plain URL strings first; they stay valid after the page navigates away.
    const urls = await page.$$eval(".question-hyperlink", items => items.map(item => item.href));
    console.log(urls);
    for (const url of urls) {
        await page.goto(url);
        await page.waitForSelector("#question-header");
        const title = await page.$eval("#question-header a", item => item.textContent);
        console.log(title);
    }
    await browser.close();
})();
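The loop above stops at the first page that fails to load or lacks the expected selector. As a rough sketch of one way to keep going, the same loop can be pulled into a helper that records a failure per URL. The names collectTitles and fakePage are hypothetical (they are not part of Puppeteer), and the helper only assumes that its page argument offers goto, waitForSelector and $eval the way a Puppeteer page does.

```javascript
// Hypothetical helper: visits each URL with one reused page and collects titles.
// `page` is anything with goto / waitForSelector / $eval (e.g. a Puppeteer Page).
async function collectTitles(page, urls) {
  const results = [];
  for (const url of urls) {
    try {
      await page.goto(url);
      await page.waitForSelector("#question-header");
      const title = await page.$eval("#question-header a", (el) => el.textContent);
      results.push({ url, title: title.trim() });
    } catch (err) {
      // Record the failure and move on instead of aborting the whole run.
      results.push({ url, error: err.message });
    }
  }
  return results;
}

// Stub page for a dry run without a browser (an assumption for illustration only).
const fakePage = {
  current: null,
  async goto(url) {
    if (url.includes("broken")) throw new Error("navigation failed");
    this.current = url;
  },
  async waitForSelector() {},
  async $eval(selector, fn) {
    return fn({ textContent: ` Title of ${this.current} ` });
  },
};

const demoRun = collectTitles(fakePage, [
  "https://example.com/q/1",
  "https://example.com/broken",
]);
demoRun.then((rows) => console.log(rows));
```

With a real browser you would pass in the page returned by browser.newPage() and the urls array gathered from the listing page.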

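Second: as Romain suggested, create another page and use that one to visit the individual question pages.

Here is a copy of your script implementing approach 2, which also corrects a few other issues (missing await operators and an incorrect selector on the question page):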

const puppeteer = require("puppeteer");
(async function main() {
    const browser = await puppeteer.launch({headless:false});
    const page = await browser.newPage();
    const newPage = await browser.newPage();
    await page.goto("https://stackoverflow.com/questions/tagged/web-scraping?sort=newest&pageSize=50", {waitUntil: 'networkidle2'});
    await page.waitForSelector(".summary");
    const sections = await page.$$(".summary");
    for (const section of sections) {
        let itemURL = await section.$eval(".question-hyperlink", item => item.href);
        await newPage.goto(itemURL);
        await newPage.waitForSelector("#question-header"); // <-- was ".summary"
        let titles = await newPage.$$("#question-header");
        for (let title of titles) {
            let itmName = await title.$eval("#question-header .question-hyperlink", itm => itm.innerText);
            console.log(itmName);
        }
    }
    await browser.close();
})();
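Both scripts above call browser.close() only when everything succeeds, and a rejection inside main() has no catch handler, which is what produces an UnhandledPromiseRejectionWarning like the one in the question. A minimal sketch of one possible guard follows. withBrowser and stubBrowser are hypothetical names used for illustration; the only thing assumed about the browser object is an async close() method.

```javascript
// Hypothetical wrapper: runs `task(browser)` and always closes the browser,
// even when the task throws.
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Dry run with a stub launcher (illustration only, no real browser involved).
const stubBrowser = {
  closed: false,
  async close() {
    this.closed = true;
  },
};

const guardedRun = withBrowser(
  async () => stubBrowser,
  async () => {
    throw new Error("scrape failed");
  }
).catch((err) => err.message); // the rejection is handled here

guardedRun.then((msg) => console.log(msg, stubBrowser.closed));
```

In a real script the first argument would be something like () => puppeteer.launch({headless:false}), and the second would hold the scraping loop.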

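I haven't reproduced the scenario, but your two errors come from:

  • The await missing in front of both page.waitForSelector(".summary"); calls.
  • The page.goto() inside the for loop navigates away from the current context, and you then try to evaluate something on the section objects, which are no longer part of the DOM.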

To fix the first problem, simply add the two missing awaits.
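To see why the missing await matters, here is a small browser-free sketch. fakeWaitForSelector is a hypothetical stand-in for page.waitForSelector (it just resolves after a delay), so this only illustrates the timing and says nothing about Puppeteer's internals.

```javascript
// Stand-in for page.waitForSelector: resolves after `ms` milliseconds.
// `fakeWaitForSelector` is a hypothetical name, not a Puppeteer API.
function fakeWaitForSelector(state, ms) {
  return new Promise((resolve) =>
    setTimeout(() => {
      state.ready = true;
      resolve();
    }, ms)
  );
}

async function withoutAwait() {
  const state = { ready: false };
  fakeWaitForSelector(state, 20); // promise is created but nobody waits for it
  return state.ready; // read immediately, before the timer fires
}

async function withAwait() {
  const state = { ready: false };
  await fakeWaitForSelector(state, 20); // execution pauses until it resolves
  return state.ready;
}

const awaitDemo = Promise.all([withoutAwait(), withAwait()]);
awaitDemo.then(([early, late]) => console.log({ early, late }));
// logs { early: false, late: true }
```

Without await, the next line runs before the element is guaranteed to exist, and if the promise later rejects there is nothing in place to catch it.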

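To fix the second problem, open a new page with let newPage = await browser.newPage() and navigate with newPage.goto('whereveryouwanttogo.com'). That way you don't destroy the original page and can still work with your section handles.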
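The effect of navigating the page that owns the section handles can be imitated with a toy model. ToyPage and visitAll below are hypothetical and deliberately simplified; this is not how Puppeteer is implemented, only a sketch of the rule that handles obtained before a navigation stop working afterwards.

```javascript
// Toy model (not Puppeteer's real implementation): handles created before a
// navigation stop working once the page that produced them navigates away.
class ToyPage {
  constructor() {
    this.generation = 0;
  }
  async goto(url) {
    this.generation += 1; // each navigation replaces the document
    this.url = url;
  }
  async $$() {
    const page = this;
    const createdIn = this.generation;
    return ["/q/1", "/q/2"].map((href) => ({
      async $eval() {
        if (page.generation !== createdIn) {
          throw new Error("Execution context was destroyed");
        }
        return href;
      },
    }));
  }
}

// Walks the handles from `listPage`, navigating `detailPage` for each link.
async function visitAll(listPage, detailPage) {
  const visited = [];
  await listPage.goto("/list");
  for (const section of await listPage.$$()) {
    const href = await section.$eval();
    await detailPage.goto(href);
    visited.push(href);
  }
  return visited;
}

const onePage = new ToyPage();
const samePageRun = visitAll(onePage, onePage).catch((err) => err.message);
const twoPageRun = visitAll(new ToyPage(), new ToyPage());
Promise.all([samePageRun, twoPageRun]).then((out) => console.log(out));
```

Using a single page for both roles fails on the second handle, much like the question's script printing one title before the error, while using a separate page for the detail views lets the loop finish.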
