How to remove multiple tags from a text file using Python regular expressions

Newbie here! I am using Python 3.8.3 and trying to remove tags from the attached text file, listfile.txt.

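I want to extract three lists (titles, publication dates, and the main text of the articles) with the tags removed. In the code below, I have been able to strip the tags from the titles and publication dates. However, I cannot correctly remove all the tags from the main text. In the file, the main text starts with the tag <div class="story-element story-element-text"> and ends before the next <h1 class tag.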

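Any help extracting this part of the text would be greatly appreciated! The article is written in a non-English script, but all the HTML tags are in English.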

import re

# Open the text file containing newspaper article markup scraped from a website with BeautifulSoup
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    text = my_file.read()
print(text)
# Remove tags and generate a list of newspaper article titles
titles = re.findall(r'<h1.*?>(.*?)</h1>', text)
print(titles)
# Remove tags and generate a list of newspaper article publication dates
dates = re.findall(r'<div class="storyPageMetaData-m__publish-time__19bdV"><span>(.*?)</span>', text)
print(dates)
# Remove tags and generate a list containing the main text of the articles. This is where the code is incorrect
bodytext = re.findall(r'<div class="story-element story-element-text">(.*?)</div>', text)
print(bodytext)

I think you are using the wrong tool. I suggest using bs4; I promise you will like it 😊.

from bs4 import BeautifulSoup
raw_html = "YOUR RAW HTML"  # e.g. the contents of listfile.txt
soup = BeautifulSoup(raw_html, "html.parser")
# .text returns the text of a tag with all nested tags removed
titles = [h1_tag.text for h1_tag in soup.select('h1')]
dates = [span_tag.text for span_tag in soup.select('div.storyPageMetaData-m__publish-time__19bdV > span')]
bodytext = [div_tag.text for div_tag in soup.select('div.story-element.story-element-text')]
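
If an article's body is split across several story-element divs, the selectors above return one entry per div rather than one per article. The following is only a rough sketch of one way to group the body text under each title, stopping at the next h1 as the question describes. The sample_html string and the articles list are made up for illustration; the real markup in listfile.txt may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the contents of listfile.txt
sample_html = """
<h1 class="headline">First title</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>01 Jan 2020</span></div>
<div class="story-element story-element-text"><div><p>First paragraph.</p></div></div>
<div class="story-element story-element-text"><div><p>Second paragraph.</p></div></div>
<h1 class="headline">Second title</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>02 Jan 2020</span></div>
<div class="story-element story-element-text"><div><p>Other article.</p></div></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

articles = []
for h1_tag in soup.select("h1"):
    paragraphs = []
    # Walk forward through the document until the next <h1> is reached
    for node in h1_tag.find_all_next(["h1", "div"]):
        if node.name == "h1":
            break
        classes = node.get("class") or []
        if "story-element-text" in classes:
            # get_text() drops every nested tag and keeps only the text
            paragraphs.append(node.get_text().strip())
    articles.append((h1_tag.get_text().strip(), " ".join(paragraphs)))

print(articles)
```

Each entry in articles pairs a title with the joined text of every body div that follows it, so the lists stay aligned even when articles have different numbers of body divs.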

Enjoy 🤗

I am not familiar with how to set up regular expressions in Python, but this works in JavaScript.

If you still want to use a regex, use this one to capture the h1 tags in the text file: <h1(.*?)</h1>
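
For anyone who wants to stay with regular expressions in Python, here is a minimal sketch, not a definitive solution. It assumes the body text runs from the opening story-element div up to the next <h1, as the question describes. The sample string and the strip_tags helper are hypothetical and only serve the illustration. Two details matter: re.DOTALL lets . match newlines, and a lookahead stops the capture before the next <h1 instead of at the first </div>.

```python
import re

# Hypothetical sample standing in for listfile.txt; the real markup may differ
sample = '''<h1 class="headline">First title</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>01 Jan 2020</span></div>
<div class="story-element story-element-text"><div><p>First <b>paragraph</b>.</p>
<p>Second paragraph.</p></div></div>
<h1 class="headline">Second title</h1>
<div class="story-element story-element-text"><div><p>Other article.</p></div></div>
'''

# [^>]* skips the attributes of the opening tag, so only the inner text is captured
titles = re.findall(r'<h1[^>]*>(.*?)</h1>', sample, flags=re.DOTALL)

# Capture from the opening body div up to, but not including, the next <h1 or the end of the text
raw_bodies = re.findall(
    r'<div class="story-element story-element-text">(.*?)(?=<h1|\Z)',
    sample,
    flags=re.DOTALL,
)

def strip_tags(fragment):
    """Remove anything that looks like a tag, then collapse runs of whitespace."""
    no_tags = re.sub(r'<[^>]+>', '', fragment)
    return re.sub(r'\s+', ' ', no_tags).strip()

bodytext = [strip_tags(body) for body in raw_bodies]

print(titles)
print(bodytext)
```

Because the tags are replaced with an empty string, text from adjacent elements with no whitespace between them would run together. That is one of the reasons an HTML parser is usually the more robust choice for this kind of markup.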
