How to merge multi-line paragraphs in HTML into a single line



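I have an HTML file that contains the headings and paragraphs of a PDF file. In this file, however, every line of a paragraph is treated as a separate paragraph, so I end up with many <p>-tagged lines and cannot build a single paragraph out of the multiple lines. Can anyone suggest a way to solve this?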

This is what I get:

["<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
"<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
"<p>cloud-based management and services. Forti accounts are free which require a license for ",
"<p>each solution. "]

This is what I want:

['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, FortiWeb Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']

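This is what I have done so far:

import json
from bs4 import BeautifulSoup

paragraphs_1 = []
local_path = "file.json"
# file.json holds the list of HTML strings shown above
with open(local_path) as f:
    data = json.load(f)
for x in data:
    soup = BeautifulSoup(x, 'html.parser')
    for paragraphs in soup.find_all("p"):
        paragraphs_1.append(paragraphs.get_text())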

You can use the replace function to strip every <p> tag like this:

yourtext.replace("<p>", "") 

Try this code:

new_list = []
for text in my_list_of_text:
    # first remove <p>
    new_list.append(text.replace('<p>', ''))
# next, build one long string from the list
listToStr = ' '.join(new_list)
# remove possible double spaces
final_text = listToStr.replace('  ', ' ')

There are more sophisticated approaches, for example using simplenlg, but for your problem this code should be enough.
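The replace-and-join idea above can be wrapped in a small function. This is only a sketch: `merge_lines` and `fragments` are hypothetical names, not part of the original answer, and it swaps the double-space `replace` for a `split()`/`join()` pass so that any run of whitespace collapses to one space. Note that this also drops the trailing space that the desired output in the question still has.

```python
def merge_lines(lines, tag="<p>"):
    """Strip a literal opening tag from each fragment and join them into one paragraph."""
    stripped = (line.replace(tag, "") for line in lines)
    # str.split() with no arguments splits on any run of whitespace, so
    # re-joining with single spaces collapses double spaces, tabs and newlines.
    return " ".join(" ".join(stripped).split())


# Sample input copied from the question.
fragments = [
    "<p>Forti provides access to a diverse array of Forti solutions through a single sign-on ",
    "<p>including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti ",
    "<p>cloud-based management and services. Forti accounts are free which require a license for ",
    "<p>each solution. ",
]

# Wrap the result in a list to match the shape asked for in the question.
print([merge_lines(fragments)])
```

If the trailing space matters, append it explicitly after the call rather than relying on the input fragments.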

The following function helps strip the tags from raw_html:

import re

def cleanhtml(raw_html):
    # non-greedy match of everything between a < and the next >
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

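If you want to combine several elements of a list and return them as a single paragraph, you can try .join() (here para is assumed to be the list of HTML strings):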

paragraph = cleanhtml(str(''.join(para)))

Output:

'Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. '

To return it as a list:

paragraph = [cleanhtml(str(''.join(para)))]

Output:

['Forti provides access to a diverse array of Forti solutions through a single sign-on including Forti Cloud, Forti Cloud, Forti, Forti, Forti and other Forti cloud-based management and services. Forti accounts are free which require a license for each solution. ']
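As a possible alternative to the regex, the tags can be dropped with the standard library's `html.parser` module, which also decodes entities such as `&amp;`. This is a swapped-in technique rather than what the answers above use, and `TextCollector` and `html_lines_to_paragraph` are hypothetical names chosen for this sketch.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text nodes of the HTML fed to it and ignores all tags."""

    def __init__(self):
        # convert_charrefs=True turns entities like &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_lines_to_paragraph(lines):
    """Feed every HTML fragment to one parser and return the merged text."""
    parser = TextCollector()
    for line in lines:
        parser.feed(line)
    parser.close()  # flush any text the parser is still buffering
    # Collapse runs of whitespace so the fragments read as one paragraph.
    return " ".join("".join(parser.chunks).split())


print([html_lines_to_paragraph(["<p>first line ", "<p>second &amp; last line "])])
```

Because a real parser does the tag handling, this variant should be less fragile than `<.*?>` when attribute values happen to contain a `>` character.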
