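I have a string that has already been cleaned with lxml's Cleaner, so all links are now in the form <a ...>Content</a>. Now I want to strip every link that has no href attribute, e.g.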
<a rel="nofollow">Link to be removed</a>
should become
Link to be removed
Likewise,
<a>Other link to be removed</a>
should become:
Other link to be removed
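In short, every link that lacks an href attribute. It doesn't have to be a regex, but since lxml returns a clean markup structure, it should be possible. All I need is the source string with these non-functional a tags stripped out.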
You can use BeautifulSoup, which makes it easier to find <a> tags that have no href:
>>> from bs4 import BeautifulSoup as BS
>>> html = """
... <a rel="nofollow">Link to be removed</a>
... <a href="alink">This should not be included</a>
... <a>Other link to be removed</a>
... """
>>> soup = BS(html, 'lxml')  # name the parser explicitly; lxml adds the <html><body> wrapper
>>> for i in soup.find_all('a', href=False):
...     i.replace_with(i.text)
...
>>> print(soup)
<html><body>Link to be removed
<a href="alink">This should not be included</a>
Other link to be removed</body></html>
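One caveat with replace_with(i.text): any markup nested inside the link, such as <b>, is flattened to plain text. Below is a minimal sketch that keeps nested tags, assuming BeautifulSoup's unwrap() and the built-in html.parser (so no <html><body> wrapper is added). The function name unwrap_hrefless_links is made up for illustration.

```python
from bs4 import BeautifulSoup


def unwrap_hrefless_links(html):
    """Return html with every <a> that has no href replaced by its contents.

    Hypothetical helper, not part of BeautifulSoup itself.
    """
    # html.parser leaves a fragment as a fragment (no <html><body> added).
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.find_all("a", href=False):
        # unwrap() removes the tag but keeps its children in place.
        a.unwrap()
    return str(soup)


print(unwrap_hrefless_links(
    '<a rel="nofollow">Link to be <b>removed</b></a> <a href="#">kept</a>'
))
```

Whether to use unwrap() or replace_with(i.text) depends on whether the inner formatting should survive; for links that contain only text the two give the same result.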
Use the drop_tag method:
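import lxml.html

root = lxml.html.fromstring('<div>Test <a rel="nofollow">Link to be <b>removed</b></a>. <a href="#">link</a>')
# './/a' matches <a> elements at any depth below root, not only direct children
for a in root.xpath('.//a[not(@href)]'):
    a.drop_tag()
# encoding='unicode' makes tostring() return str rather than bytes on Python 3
assert lxml.html.tostring(root, encoding='unicode') == '<div>Test Link to be <b>removed</b>. <a href="#">link</a></div>'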
From http://lxml.de/lxmlhtml.html: .drop_tag(): Drops the tag, but keeps its children and text.