Extract text from a font tag in Node.js

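I use Cheerio to extract information from the HTML of different web pages. However, on one website the text I want to extract is generated by a script tag, so Cheerio's load method cannot see that part of the markup.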

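Looking for a solution, I found online that I could run that script with Puppeteer, a Node API for controlling a Chrome instance. Using it (probably not in the best way, since I only discovered it a few days ago), I finally got the HTML I need. Unfortunately, I still cannot extract the information I want. This is the HTML I want to extract data from: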

<h2 class="property-price">
  <a href="blablabla">
    <strong>
      <font style="vertical-align: inherit;">
        <font style="vertical-align: inherit;">Text that I wanna extract</font>
      </font>
      <small></small>
    </strong>
  </a>
</h2>

This is the code I used to try to extract the text, without success:

var cheerio = require("cheerio");
const puppeteer = require('puppeteer');
var $;
const POST_LINK_SELECTOR = 'div.property-title';
(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto('myUrl', {
    timeout: 0
  });
  $ = cheerio.load(renderedContent);
  console.log($('h2.property-price').find('font').children().text());
  await browser.close();
})();

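I'm sure this is not the best way to get the text I need, so if you have any suggestions I would be glad to hear them. I would also like to know whether I can extract what I need directly with the Puppeteer API, or whether I need Cheerio (as in my attempt, which does not work anyway). Thanks.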

You can get the data you need with Puppeteer alone, using the page.evaluate method:

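const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  // Wait until the network is idle so the script-generated markup is in the DOM
  await page.goto('myUrl', { waitUntil: 'networkidle0' });
  // Runs inside the page; textContent also covers the nested <font> elements
  const text = await page.evaluate(() => document.querySelector('h2.property-price a').textContent.trim());
  console.log(text);
  await browser.close();
})();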
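
If the element is rendered late or is sometimes missing, the one-liner above throws because querySelector returns null. A slightly more defensive variant might look like the sketch below. The helper name extractText and the timeoutMs parameter are my own, not part of Puppeteer; it only assumes a Puppeteer page object, using page.waitForSelector and page.evaluate:

```javascript
// Hypothetical helper (name and parameters are illustrative, not a Puppeteer API).
// Waits for the element to appear, then reads its text inside the page.
async function extractText(page, selector, timeoutMs = 30000) {
  // Rejects if the selector does not show up within timeoutMs
  await page.waitForSelector(selector, { timeout: timeoutMs });

  // The selector is passed as an argument because the callback runs
  // in the browser context and cannot see Node-side variables.
  return page.evaluate((sel) => {
    const el = document.querySelector(sel);
    return el ? el.textContent.trim() : null;
  }, selector);
}
```

Inside the async function from the answer it would be called as `const text = await extractText(page, 'h2.property-price a');`.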

If you want to keep using a jQuery-like syntax, as with Cheerio, you can do that too: just add jQuery to the page (if the site does not already use it):

await page.goto(...);
await page.addScriptTag({url: 'https://code.jquery.com/jquery-3.2.1.min.js'});
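
Once jQuery is on the page, it can be used inside page.evaluate. The following is a rough sketch of how that might fit together. The helper name extractWithJQuery is made up for illustration, and it assumes the site does not block external scripts (a strict Content Security Policy could make addScriptTag fail):

```javascript
// jQuery URL taken from the snippet above
const JQUERY_URL = 'https://code.jquery.com/jquery-3.2.1.min.js';

// Hypothetical helper (not a Puppeteer API): injects jQuery when the page
// does not already have it, then reads the text with a jQuery selector.
async function extractWithJQuery(page, selector, jqueryUrl = JQUERY_URL) {
  // Check in the browser context whether jQuery is already loaded
  const hasJQuery = await page.evaluate(() => typeof window.jQuery === 'function');
  if (!hasJQuery) {
    await page.addScriptTag({ url: jqueryUrl });
  }

  // window.jQuery is used rather than $, since a site may bind $ to something else
  return page.evaluate((sel) => window.jQuery(sel).text().trim(), selector);
}
```

With the markup from the question, `await extractWithJQuery(page, 'h2.property-price a')` would be the equivalent of the textContent version.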
