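Newbie here! I am using Python 3.8.3 and trying to remove the tags from the attached text file, listfile.txt.
I want to extract three lists (titles, publication dates, and the main text of the articles) with the tags removed. In the code below I have been able to strip the tags from the titles and publication dates. However, I cannot correctly strip all the tags from the main text. In the file, the main text starts with the tag <div class="story-element story-element-text"> and ends before the next <h1 class> tag.
Any help with extracting this part of the text would be greatly appreciated!! The articles are written in a non-English script, but all the HTML tags are in English.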
import re

#opening text file which contains newspaper article information scraped off website using beautifulsoup
with open('listfile.txt', 'r', encoding='utf8') as my_file:
    text = my_file.read()
print(text)
#removing tags and generating list of newspaper article titles
titles = re.findall('<h1.*?>(.*?)</h1>', text)
print(titles)
#removing tags and generating list of newspaper article publication dates
dates = re.findall('<div class="storyPageMetaData-m__publish-time__19bdV"><span>(.*?)</span>', text)
print(dates)
#removing tags and generating list containing main text of articles. This is where the code is incorrect
bodytext= re.findall('<div class="story-element story-element-text">(.*?)</div>', text)
print(bodytext)
I think you are using the wrong tool for this. I suggest you use bs4; I promise you will like it 😊.
from bs4 import BeautifulSoup
raw_html = "YOUR RAW HTML"
soup = BeautifulSoup(raw_html, "html.parser")
titles = [h1_tag.text for h1_tag in soup.select('h1')]
dates = [span_tag.text for span_tag in soup.select('div.storyPageMetaData-m__publish-time__19bdV > span')]
bodytext = [div_tag.text for div_tag in soup.select('div.story-element.story-element-text')]
Enjoy 🤗
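Since the question wants everything between the story div and the next h1 grouped per article, here is a minimal sketch of one way to do that with bs4. The sample HTML, the "headline" class, and the name `articles` are made up for illustration; the real listfile.txt may be structured differently, so treat this as a starting point rather than a drop-in fix.

```python
from bs4 import BeautifulSoup

# Hypothetical sample standing in for the contents of listfile.txt.
sample_html = """
<h1 class="headline">First title</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>01 Jan 2020</span></div>
<div class="story-element story-element-text"><div><p>First paragraph.</p></div></div>
<div class="story-element story-element-text"><div><p>Second paragraph.</p></div></div>
<h1 class="headline">Second title</h1>
<div class="storyPageMetaData-m__publish-time__19bdV"><span>02 Jan 2020</span></div>
<div class="story-element story-element-text"><div><p>Only paragraph.</p></div></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

articles = []
for h1_tag in soup.find_all("h1"):
    parts = []
    # Walk forward in document order and stop at the next <h1>.
    for element in h1_tag.find_all_next(["h1", "div"]):
        if element.name == "h1":
            break
        # Keep only the divs that hold article text; skip metadata and wrapper divs.
        classes = element.get("class") or []
        if "story-element-text" in classes:
            parts.append(element.get_text(" ", strip=True))
    articles.append((h1_tag.get_text(strip=True), " ".join(parts)))

for title, body in articles:
    print(title, "->", body)
```

Each entry in `articles` pairs a title with the text of all story divs that follow it, so an article split over several divs still ends up as one string.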
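I am not familiar with how to set up regular expressions in Python, but this works in JavaScript.
If you still want to use regex, use this pattern to capture the h1 tags in the text file: <h1(.*?)</h1>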
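For completeness, a rough Python version of the regex route might look like the following. It assumes the body really does run from the story div up to the next <h1 (or the end of the file), as described in the question. The sample HTML and the helper name `strip_tags` are hypothetical. Regex on HTML is fragile, so this is only a sketch.

```python
import re

# Hypothetical sample standing in for the contents of listfile.txt.
sample_html = """<h1 class="headline">First title</h1>
<div class="story-element story-element-text"><div><p>First
paragraph.</p></div></div>
<h1 class="headline">Second title</h1>
<div class="story-element story-element-text"><div><p>Second paragraph.</p></div></div>
"""

def strip_tags(fragment):
    """Replace every tag with a space, then collapse runs of whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(without_tags.split())

# re.DOTALL lets "." match newlines, so a body spanning several lines is captured.
titles = re.findall(r"<h1[^>]*>(.*?)</h1>", sample_html, re.DOTALL)

# Capture lazily from the story div up to (not including) the next <h1 or end of text.
body_pattern = re.compile(
    r'<div class="story-element story-element-text">(.*?)(?=<h1|\Z)',
    re.DOTALL,
)
bodytext = [strip_tags(chunk) for chunk in body_pattern.findall(sample_html)]

print(titles)
print(bodytext)
```

The lookahead `(?=<h1|\Z)` is what stops the match before the next title instead of at the first closing </div>, which is where the pattern in the question cuts off too early.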