NodeJS: Remove HTML from text but keep the image alt

I have HTML text as a string, with emoji embedded as image tags, like this:

const htmlText = '<p>test emoji <span title="Smile" class="animated-emoticon-20-smile"><img title="Smile" alt="  "></span> and more <span title="Emo" class="animated-emoticon-20-emo"><img title="Emo" alt="  "></span><span title="Star eyes" class="animated-emoticon-20-stareyes"><img title="Star eyes" alt="  "></span></p>'

How can I convert it to plain source text, without the HTML tags but with the emoji?

I tried this:

htmlText.replace(/<[^>]+>/g, '')

=>'test emoji and more '

But I also want the emoji to be shown, taken from the alt attribute of each image, like this:

test emoji and more

Perhaps the regular expression needs to be different.

You can use

const htmlText = '<p>test emoji <span title="Smile" class="animated-emoticon-20-smile"><img title="Smile" alt="  "></span> and more <span title="Emo" class="animated-emoticon-20-emo"><img title="Emo" alt="  "></span><span title="Star eyes" class="animated-emoticon-20-stareyes"><img title="Star eyes" alt="  "></span></p>'
console.log(htmlText.replace(/<(?:[^<>]*?\salt="([^"]*)")?[^<>]*>/g, '$1'))

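Details:

< - a < char
(?: - start of an optional non-capturing group:
- [^<>]*? - zero or more chars other than < and >, as few as possible
- \s - a whitespace char
- alt=" - a fixed string
- ([^"]*) - Group 1 ($1): any zero or more chars other than "
- " - a " char
)? - end of the group, repeated 1 or 0 times
[^<>]* - zero or more chars other than < and >, as many as possible
> - a > char.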

See the regex demo.
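
As a minimal sketch of the same pattern wrapped in a function: the helper name stripTagsKeepAlt and the sample string (including the emoji placed in alt) are assumptions made for illustration, not part of the question's data. A replacer callback is used instead of '$1' only to make the "no alt in this tag" case explicit.

```javascript
// Hypothetical sample input: the alt values are placeholder emoji chosen for illustration.
const sample = '<p>hi <span title="Smile"><img title="Smile" alt="😀"></span> there <img alt="⭐" title="Star"></p>';

// Strip every tag; when the tag carries alt="...", keep that value (capture group 1).
function stripTagsKeepAlt(html) {
  return html.replace(
    /<(?:[^<>]*?\salt="([^"]*)")?[^<>]*>/g,
    // Group 1 is undefined for tags without an alt attribute, so fall back to ''.
    (_, alt) => alt ?? ''
  );
}

console.log(stripTagsKeepAlt(sample)); // hi 😀 there ⭐
```

Because \s is required before alt, attributes such as data-alt are not picked up. The pattern assumes double-quoted attribute values and no < or > inside them; for arbitrary or untrusted HTML, a real HTML parser is likely the safer choice.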

Maybe something like this?

htmlText.replace(/(?:<|(?<=\p{EPres}))[^>\p{EPres}]+(?:>|(?=\p{EPres}))/gu, "")

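\p{EPres} is the escape sequence for the binary Unicode property Emoji_Presentation (there is a similar one for Emoji). The regex must have the u flag for the escape to be recognized.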

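Basically, you match every character other than > and emoji, starting at the first < or at the character right after an emoji, and ending at the first > or at the character right before an emoji.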
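
A small sketch of how this emoji-aware pattern behaves: the sample string, its emoji, and the variable names are assumptions for illustration. It relies on lookbehind and Unicode property escapes, both of which need a reasonably recent Node.js.

```javascript
// Hypothetical sample input; the emoji in alt are placeholders for illustration.
const sampleHtml = '<p>hi <img title="Smile" alt="😀"> there <span><img alt="😎"></span></p>';

// Match runs that start at "<" (or right after an emoji)
// and end at ">" (or right before an emoji), then delete them.
const emojiAwareTag = /(?:<|(?<=\p{EPres}))[^>\p{EPres}]+(?:>|(?=\p{EPres}))/gu;

const textWithEmoji = sampleHtml.replace(emojiAwareTag, '');
console.log(textWithEmoji); // hi 😀 there 😎
```

Note that this approach keys on the emoji themselves rather than on the alt attribute, so it only keeps characters that have the Emoji_Presentation property, and plain text that directly follows an emoji outside a tag can be consumed up to the next >.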
