Cheerio: how to remove DOM elements from a selection

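I am trying to write a bot that converts a bunch of HTML pages to Markdown so that I can import them as Jekyll documents. For this, I use puppeteer to fetch the HTML documents and cheerio to manipulate them.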

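The source HTML is fairly complex and polluted with Google Ads tags, external scripts, and so on. What I need to do is get the HTML content of a predefined selector, then remove from it the elements that match a predefined set of selectors, so that I end up with plain HTML containing only the text and can convert it to Markdown.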

Suppose the source HTML looks like this:

<html>
<head />
<body>
<article class="post">
<h1>Title</h1>
<p>First paragraph.</p>
<script>That for some reason has been put here</script>
<p>Second paragraph.</p>
<ins>Google ADS</ins>
<p>Third paragraph.</p>
<div class="related">A block full of HTML and text</div>
<p>Forth paragraph.</p>
</article>
</body>
</html>

What I would like to end up with is something like

<h1>Title</h1>
<p>First paragraph.</p>
<p>Second paragraph.</p>
<p>Third paragraph.</p>
<p>Forth paragraph.</p>

I have defined an array of selectors that I want to strip from the source object:

stripFromText: ['.social-share', 'script', '.adv-in', '.postinfo', '.postauthor', '.widget', '.related', 'img', 'p:empty', 'div:empty', 'section:empty', 'ins'],

and wrote the following function:

const getHTMLContent = async ($, selector) => {
  let value;
  try {
    let content = await $(selector);
    for (const s of SELECTORS.stripFromText) {
      // 1
      content = await content.remove(s);
      // 2
      // await content.remove(s);
      // 3
      // content = await content.find(s).remove();
      // 4
      // await content.find(s).remove();
      // 5
      // const matches = await content.find(s);
      // for (m of matches) {
      //  await m.remove();
      // }
    }
    value = content.html();
  } catch(e) {
    console.log(`- [!] Unable to get ${selector}`);
  }
  console.log(value);
  return value;
};

$ is the cheerio object, obtained with const $ = await cheerio.load(html);
selector is the DOM selector of the container (in the example above it would be .post)

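What I am not able to do is use cheerio to remove() the objects. I tried all 5 versions that I left commented in the code, without success. Cheerio's documentation has not helped so far, and I only found this link, but the solution proposed there did not work for me.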

I was wondering whether someone more experienced could point me in the right direction, or explain what I am missing here.

You can use remove:

Remove the elements

$('script,ins,div').remove()

I found a classic newbie error in my code: I was missing an await before the .remove() call.

The function now looks like this, and it works:

const getHTMLContent = async ($, selector) => {
  let value;
  try {
    let content = await $(selector);
    for (const s of SELECTORS.stripFromText) {
      console.log(`--- Stripping ${s}`);
      await content.find(s).remove();
    }
    value = await content.html();
  } catch(e) {
    console.log(`- [!] Unable to get ${selector}`);
  }
  return value;
};
