Removing specific span tags from a CSV file



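I am trying to remove specific span tags from a CSV file, but my code removes every tag. I only need to strip certain ones, such as '<span style="font-family: verdana,geneva; font-size: 10pt;">'. Some cells also contain '<b>', '<p>' and '<STRONG>' tags, for example <STRONG>name</STRONG>, and I need to keep those along with their text. I only want to remove the font-family and font-size spans mentioned above. How can I do this in Python?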

import re

# Matches any tag at all, which is why every tag gets removed
CLEANR = re.compile('<.*?>')

def cleanhtml(raw_html):
    cleantext = re.sub(CLEANR, '', raw_html)
    return cleantext

a_file = open("file.csv", 'r')
lines = a_file.readlines()
a_file.close()
newfile = open("file2.csv", 'w')
for line in lines:
    line = cleanhtml(line)
    newfile.write(line)
newfile.close()

If your input is always an HTML string, you can use BeautifulSoup.

Here is an example:

from bs4 import BeautifulSoup

doc = '''<span style="font-family: verdana,geneva; font-size: 10pt;"><b>xyz</b></span>'''
soup = BeautifulSoup(doc, "html.parser")
# .descendants is the current name for recursiveChildGenerator()
for tag in soup.descendants:
    try:
        # Drop every attribute whose value mentions font-family or font-size
        result = dict(filter(lambda elem: 'font-family' not in elem[1] and 'font-size' not in elem[1], tag.attrs.items()))
        tag.attrs = result
    except AttributeError:
        # Text nodes have no attrs; skip them
        pass
print(soup)

Output:

<span><b>xyz</b></span>

You can use it in your code like this:

from bs4 import BeautifulSoup

def cleanhtml(raw_html):
    soup = BeautifulSoup(raw_html, "html.parser")
    # .descendants is the current name for recursiveChildGenerator()
    for tag in soup.descendants:
        try:
            # Drop every attribute whose value mentions font-family or font-size
            result = dict(filter(lambda elem: 'font-family' not in elem[1] and 'font-size' not in elem[1], tag.attrs.items()))
            tag.attrs = result
        except AttributeError:
            # Text nodes have no attrs; skip them
            pass
    return str(soup)  # return as an HTML string
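
The snippets above strip only the style attribute, so a bare <span> is left behind. If the tag itself should go as well, one possible variation is to unwrap the matching spans, which keeps their children such as <b> or <strong>. The sketch below is an illustration, not part of the original answer: the names strip_font_spans, clean_csv and FONT_MARKERS are hypothetical, and it assumes the file is a regular CSV that the csv module can parse, so quoted cells containing commas or line breaks are handled cell by cell rather than line by line.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical marker list: style properties that identify the spans to drop.
FONT_MARKERS = ("font-family", "font-size")


def strip_font_spans(raw_html):
    """Remove <span> tags whose style sets a font, keeping their contents."""
    soup = BeautifulSoup(raw_html, "html.parser")
    for span in soup.find_all("span"):
        style = span.get("style", "")
        if any(marker in style for marker in FONT_MARKERS):
            span.unwrap()  # removes the tag itself, keeps its children
    return str(soup)


def clean_csv(src, dst):
    """Copy CSV rows from src to dst, cleaning every cell."""
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow([strip_font_spans(cell) for cell in row])


# Demo on an in-memory CSV (made-up sample data).
src = io.StringIO(
    'id,desc\r\n'
    '1,"<span style=""font-size: 10pt;""><b>x</b></span>"\r\n'
)
dst = io.StringIO()
clean_csv(src, dst)
print(dst.getvalue())
```

With real files, open both with newline="" as the csv module documentation recommends, for example open("file.csv", newline="") for reading and open("file2.csv", "w", newline="") for writing.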
