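I'm writing some HTML preprocessing scripts that clean up and tag HTML from a web crawler, for use in a later semantic/link analysis step. I've already filtered the unwanted tags out of the HTML and reduced it to only visible text and <div>/<a> elements.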
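I'm now trying to write a "collapseDOM()" function that walks the DOM tree and does the following:
(1) destroys leaf nodes that have no visible text
(2) collapses any <div> that (a) does not directly contain visible text and (b) has exactly one <div> child, replacing it with that child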
For example, if I have the following HTML as input:
<html>
<body>
  <div>
    <div>
      <a href="www.foo.com">not collapsed into empty parent: only divs</a>
    </div>
  </div>
  <div>
    <div>
      <div>
        inner div not collapsed because this contains text
        <div>some more text ...</div>
        but the outer nested divs do get collapsed
      </div>
    </div>
  </div>
  <div>
    <div>This won't be collapsed into parent because </div>
    <div>there are two children ...</div>
  </div>
</body>
it should be transformed into this "collapsed" version:
<html>
<body>
  <div>
    <a href="www.foo.com">not collapsed into empty parent: only divs</a>
  </div>
  <div>
    inner div not collapsed because this contains text
    <div>some more text ...</div>
    but the outer nested divs do get collapsed
  </div>
  <div>
    <div>This won't be collapsed into parent because </div>
    <div>there are two children ...</div>
  </div>
</body>
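I haven't been able to figure out how to do this. I tried writing a recursive tree-traversal function using BeautifulSoup's unwrap() and decompose() methods, but that modifies the DOM while iterating over it, and I couldn't work out how to make it behave...
Is there a simple way to do what I want? I'm open to solutions in either BeautifulSoup or lxml. Thanks!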
You can start with this and adapt it to your needs:
import re

from bs4 import BeautifulSoup, NavigableString


def is_significant(node):
    # Keep tags, and keep strings that contain something other than whitespace.
    return not isinstance(node, NavigableString) or len(re.sub(r'\s+', '', node)) > 0


def stripTagWithNoText(soup):
    def remove(node):
        for index, item in enumerate(node.contents):
            if not isinstance(item, NavigableString):
                currentNodes = [n for n in item.contents if is_significant(n)]
                parentNodes = [n for n in item.parent.contents if is_significant(n)]
                # Candidate: one significant child, and the same tag name as its parent.
                if len(currentNodes) == 1 and item.name == item.parent.name:
                    if len(parentNodes) > 1:
                        continue  # the parent has other content, so leave it alone
                    if item.name == currentNodes[0].name and len(currentNodes) == 1:
                        item.unwrap()  # unwrap() is the current name for replaceWithChildren()
                    node.unwrap()

    for tag in soup.find_all():
        remove(tag)
    print(soup)


# `data` holds the input HTML from the question.
soup = BeautifulSoup(data, "lxml")
stripTagWithNoText(soup)
Output:
<html> <body> <div> <a href="www.foo.com">not collapsed into empty parent: only divs</a> </div> <div> inner div not collapsed because this contains text <div>some more text ...</div> but the outer nested divs do get collapsed </div> <div> <div>This won't be collapsed into parent because </div> <div>there are two children ...</div> </div> </body> </html>
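If the sticking point is modifying the DOM while iterating over it, another option is to recurse children-first (post-order) over a snapshot of each node's children, so that unwrap() and decompose() never touch a list that is still being iterated. The following is only a minimal sketch under some assumptions: the names collapse_dom and has_direct_text are my own, it uses the built-in "html.parser" rather than lxml, it treats only <div> and <a> as collapsible or removable, and it assumes comments and other non-visible strings were already filtered out as described in the question.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def has_direct_text(tag):
    """True if a direct child string of `tag` contains non-whitespace text."""
    return any(
        isinstance(child, NavigableString) and child.strip()
        for child in tag.contents
    )


def collapse_dom(node):
    """Collapse the subtree under `node` in place, children first."""
    # list(...) takes a snapshot, so edits to node.contents cannot skip siblings.
    for child in list(node.contents):
        if isinstance(child, Tag):
            collapse_dom(child)

    # Assumption: only <div> and <a> are touched; html/body are left alone.
    if node.name not in ("div", "a") or has_direct_text(node):
        return

    element_children = [c for c in node.contents if isinstance(c, Tag)]
    if not element_children:
        # Rule 1: a leaf with no visible text is destroyed.
        node.decompose()
    elif (
        node.name == "div"
        and len(element_children) == 1
        and element_children[0].name == "div"
    ):
        # Rule 2: a text-free div with exactly one div child is replaced by it.
        node.unwrap()


html = (
    '<body><div><div><a href="www.foo.com">link</a></div></div>'
    "<div><div></div></div></body>"
)
soup = BeautifulSoup(html, "html.parser")
collapse_dom(soup)
print(soup)  # <body><div><a href="www.foo.com">link</a></div></body>
```

Because children are handled before their parent, a div that becomes empty once its empty children are removed is itself removed, which may or may not be what you want for rule (1). Whitespace-only strings from unwrapped divs are left in the parent; strip them separately if the exact whitespace matters.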