How do I convert HTML 'abbr' tag text to text in parentheses in Python?



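I need to convert hundreds of HTML sentences generated by an external source into readable text, and I have a question about converting the abbr tag. Here is an example: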

from bs4 import BeautifulSoup
# Single quotes outside so the double-quoted HTML attributes do not end the string.
text = '<abbr title="World Health Organization" style="color:blue">WHO</abbr> is a specialized agency of the <abbr title="United Nations" style="color:#CCCC00">UN</abbr>.'
print(BeautifulSoup(text, 'html.parser').get_text())

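This code returns "WHO is a specialized agency of the UN." However, what I want is "WHO (World Health Organization) is a specialized agency of the UN (United Nations)." Is there a way to do this? Perhaps with a module other than BeautifulSoup?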

You can iterate over the elements in soup.contents:

from bs4 import BeautifulSoup as soup
text = '<abbr title="World Health Organization" style="color:blue">WHO</abbr> is a specialized agency of the <abbr title="United Nations" style="color:#CCCC00">UN</abbr>.'
# Plain strings have no tag name and are kept as-is; each tag becomes "text (title)".
d = ''.join(str(i) if i.name is None else f'{i.text} ({i["title"]})' for i in soup(text, 'html.parser').contents)

Output:

'WHO (World Health Organization) is a specialized agency of the UN (United Nations).'

Possibly one of the worst algorithms in the history of algorithms:

import re
from bs4 import BeautifulSoup

text = '<abbr title="World Health Organization" style="color:blue">WHO</abbr> is a specialized agency of the <abbr title="United Nations" style="color:#CCCC00">UN</abbr>.'
soup = BeautifulSoup(text, 'html.parser')
# Split the plain text into words and punctuation marks.
split_soup = re.findall(r"\w+|[.,!?;]", soup.text)
string_out = ''
pos = 0  # resume where the previous abbreviation ended, so no word is emitted twice
for abbr in soup.find_all('abbr'):
    t = abbr['title']
    # Assumes each abbreviation is the capital letters of its title (WHO, UN).
    bind_caps = ''.join(re.findall(r'[A-Z]', t))
    while pos < len(split_soup):
        word = split_soup[pos]
        pos += 1
        if word == bind_caps:
            string_out += word + " (" + t + ") "
            break
        string_out += word + " "
string_out = string_out.strip()
string_out += '.'
print(string_out)

Output:

WHO (World Health Organization) is a specialized agency of the UN (United Nations).
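
On the question of using a module other than BeautifulSoup: below is a minimal sketch built on the standard library's html.parser, assuming the input is reasonably well-formed HTML. The class name AbbrExpander and the helper expand_abbr are made up for this illustration. Unlike iterating over top-level contents, it also picks up abbr tags nested inside other elements, and it leaves an abbreviation unchanged when the title attribute is missing.

```python
from html.parser import HTMLParser


class AbbrExpander(HTMLParser):
    """Collect the text of a document, adding ' (title)' after each <abbr>."""

    def __init__(self):
        super().__init__()
        self.parts = []   # text fragments in document order
        self.titles = []  # titles of the <abbr> tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'abbr':
            # attrs is a list of (name, value) pairs; title may be absent.
            self.titles.append(dict(attrs).get('title'))

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == 'abbr' and self.titles:
            title = self.titles.pop()
            if title:
                self.parts.append(f' ({title})')


def expand_abbr(html):
    """Return the plain text of html with abbr titles in parentheses."""
    parser = AbbrExpander()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.parts)


text = '<abbr title="World Health Organization" style="color:blue">WHO</abbr> is a specialized agency of the <abbr title="United Nations" style="color:#CCCC00">UN</abbr>.'
print(expand_abbr(text))
# WHO (World Health Organization) is a specialized agency of the UN (United Nations).
```

For hundreds of sentences, expand_abbr can be applied to each one in a loop or list comprehension. The BeautifulSoup answers above may still be the better fit for badly broken markup, since this sketch does no repair of its own.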
