How to replace HTML comments with custom <comment> elements

I am using BeautifulSoup in Python to batch-convert many HTML files to XML.

A sample HTML file looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!-- this is an HTML comment -->
<!-- this is another HTML comment -->
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        ...
        <!-- here is a comment inside the head tag -->
    </head>
    <body>
        ...
        <!-- Comment inside body tag -->
        <!-- Another comment inside body tag -->
        <!-- There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample. -->
    </body>
</html>
<!-- This comment is the last line of the file -->

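I figured out how to find the doctype and replace it with a <doctype>...</doctype> tag, but the comments are frustrating me. I want to replace the HTML comments with <comment>...</comment>. In this sample HTML, I was able to replace the first two HTML comments, but not anything inside the html tag, nor the last comment after the closing html tag.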

Here is my code:

import re
import bs4
from bs4 import BeautifulSoup

file = open("sample.html", "r")
soup = BeautifulSoup(file, "xml")
for child in soup.children:
    # This takes care of the first two HTML comments
    if isinstance(child, bs4.Comment):
        child.replace_with("<comment>" + child.strip() + "</comment>")
    # This should find all nested HTML comments and replace.
    # It looks like it works but the changes are not finalized
    if isinstance(child, bs4.Tag):
        re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)
        re.sub("(-->)|(--&gt;)", "</comment>", child.text, flags=re.MULTILINE)
# The HTML comments should have been replaced but nothing changed.
print (soup.prettify(formatter=None))

This is my first time using BeautifulSoup. How can I use BeautifulSoup to find all HTML comments and replace them with <comment> tags?

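Could I pickle it into a byte stream, apply the regex to the serialized form, and then unpickle it back into a BeautifulSoup object? Would that work, or would it just cause more problems?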

I tried using pickle on the child tag object, but unpickling failed with TypeError: __new__() missing 1 required positional argument: 'name'.

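Then I tried pickling only the tag's text via child.text, but unpickling failed with AttributeError: can't set attribute. Basically, child.text is read-only, which explains why the regex doesn't work. So I don't know how to modify the text.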

You have several problems:

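You cannot modify child.text. It is a read-only property that simply calls get_text() behind the scenes, and the result is a brand-new string that has no connection to the document.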

re.sub() does not modify anything in place; it returns a new string. Your line

re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

would have to be

child.text = re.sub("(<!--)|(&lt;!--)", "<comment>", child.text, flags=re.MULTILINE)

but this would not work anyway, because child.text is read-only (the first problem above).

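Trying to modify the document by replacing chunks of text with regular expressions is the wrong way to use BeautifulSoup. Instead, you need to find nodes and replace them with other nodes.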
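The first two problems can be checked directly. The snippet below is only a minimal sketch: the one-line HTML string and the variable names are made up for illustration, and html.parser is chosen simply because it ships with Python.

```python
import re
import bs4

# Tiny stand-in document (illustrative, not the sample file from the question).
soup = bs4.BeautifulSoup("<p>Hello <!-- note --> world</p>", "html.parser")
p = soup.p

# .text builds a fresh str via get_text(); it is not a view into the tree.
snapshot = p.text
same_as_get_text = snapshot == p.get_text()

# re.sub() returns a new string and leaves both its input and the tree untouched.
replaced = re.sub("Hello", "Goodbye", snapshot)
tree_unchanged = "Hello" in p.text and "Goodbye" not in p.text

# Assigning back to .text fails because the property has no setter.
try:
    p.text = replaced
    assignment_failed = False
except AttributeError:
    assignment_failed = True

print(same_as_get_text, tree_unchanged, assignment_failed)
```

None of these steps touches the parsed tree, which is why the changes in the question never show up in the output.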

Here is a working solution:

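import bs4

with open("example.html") as f:
    # Name the parser explicitly; html.parser ships with Python.
    soup = bs4.BeautifulSoup(f, "html.parser")

# Comment nodes are strings, so search for them through the text filter.
for comment in soup.find_all(text=lambda e: isinstance(e, bs4.Comment)):
    tag = soup.new_tag("comment")   # new, empty <comment> element
    tag.string = comment.strip()    # comment text without surrounding whitespace
    comment.replace_with(tag)       # swap the node in the tree

print(soup)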

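This code first iterates over the result of calling find_all(), taking advantage of the fact that we can pass a function as the text argument (newer BeautifulSoup releases also accept this argument under the name string). In BeautifulSoup, Comment is a subclass of NavigableString, so we search for it as a string, and the lambda ... is just shorthand for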

def is_comment(e):
    return isinstance(e, bs4.Comment)
soup.find_all(text=is_comment)
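To see why a string search finds comments at all, here is a small sketch. The one-line document is made up for illustration, and it uses string=, the newer spelling of the text= argument, on the assumption that a reasonably recent BeautifulSoup is installed.

```python
import bs4

# Illustrative input: one ordinary text node and one comment.
soup = bs4.BeautifulSoup("<p>text<!-- note --></p>", "html.parser")

# Comment inherits from NavigableString, so string filters can see it.
is_subclass = issubclass(bs4.Comment, bs4.NavigableString)

# string=True matches every string node, comments included.
all_strings = soup.find_all(string=True)

# The isinstance() filter narrows the match down to comments only.
comments = soup.find_all(string=lambda s: isinstance(s, bs4.Comment))

print(is_subclass, len(all_strings), [c.strip() for c in comments])
```

The isinstance() check is what keeps ordinary text nodes from being wrapped in <comment> tags along with the real comments.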

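Then we create a new Tag with the appropriate name, set its content to the stripped text of the original comment, and replace the comment with the tag we just created.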

The result is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<comment>this is an HTML comment</comment>
<comment>this is another HTML comment</comment>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
        ...
        <comment>here is a comment inside the head tag</comment>
</head>
<body>
        ...
        <comment>Comment inside body tag</comment>
<comment>Another comment inside body tag</comment>
<comment>There could be many comments in each file and scattered, not just 1 in the head and three in the body. This is just a sample.</comment>
</body>
</html>
<comment>This comment is the last line of the file</comment>
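Since the goal is to convert many files, the same replacement could be wrapped in a small helper. This is only a sketch: the function name comments_to_tags, its tag_name parameter, and the sample string are hypothetical, and it assumes html.parser and the string= spelling of the filter argument.

```python
import bs4


def comments_to_tags(html: str, tag_name: str = "comment") -> str:
    """Return html with every comment replaced by a <tag_name> element.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    for comment in soup.find_all(string=lambda s: isinstance(s, bs4.Comment)):
        tag = soup.new_tag(tag_name)
        tag.string = comment.strip()
        comment.replace_with(tag)
    return str(soup)


# Comments before <html>, nested inside it, and after </html>.
sample = (
    "<!-- top --><html><body><p>hi</p>"
    "<!-- inner --></body></html><!-- end -->"
)
converted = comments_to_tags(sample)
print(converted)
```

Reading each file, passing its contents through the helper, and writing the result back out would then be a plain loop over the file names.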
