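I need to extract an article/news story from an HTML file, and the best solution I have found is newspaper3k in Python. I am getting an empty result. I have tried a number of fixes, but I am stuck.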
from newspaper import Article

with open("index.html", 'r', encoding='utf-8') as f:
    # empty URL string, because the HTML is supplied from the local file
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()
    print(article.title)
Result: ''
It should print the text from the article tag inside the HTML file.
Your code looks fine.
I think the problem is with your index.html. What is in it? Can you provide the file, or the URL it was extracted from?
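One rough way to see what is in the file is to check whether it has a <title> element at all, since an empty article.title often points to a missing or empty one. The sketch below uses only the standard library; TitleProbe and probe_html are hypothetical helper names, not part of newspaper3k, and the sample string stands in for the contents of index.html.

```python
from html.parser import HTMLParser


class TitleProbe(HTMLParser):
    """Collects the text inside <title> and counts <article> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
        self.article_tags = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'article':
            self.article_tags += 1

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>
        if self.in_title:
            self.title += data


def probe_html(html):
    """Return (title text, number of <article> tags) for an HTML string."""
    probe = TitleProbe()
    probe.feed(html)
    probe.close()
    return probe.title.strip(), probe.article_tags


# Hypothetical sample; in practice pass the text read from index.html
sample = '<html><body><article><p>Some story text</p></article></body></html>'
print(probe_html(sample))
```

If the first value comes back empty for your file, the page has no usable <title>, which would be consistent with the blank result you are seeing.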
BTW, here is a code example for reading offline content with newspaper3k. It comes from my overview document on using newspaper3k.
from newspaper import Config
from newspaper import Article

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.cnn.com/2020/10/12/health/johnson-coronavirus-vaccine-pause-bn/index.html'
article = Article(base_url, config=config)
article.download()
article.parse()

# Save the downloaded HTML to a local file
with open('cnn.html', 'w') as fileout:
    fileout.write(article.html)

# Read the HTML file created above
with open("cnn.html", 'r') as f:
    # note the empty URL string
    article = Article('', language='en')
    article.download(input_html=f.read())
    article.parse()

print(article.title)
Johnson & Johnson pauses Covid-19 vaccine trial after 'unexplained illness'
article_meta_data = article.meta_data
article_published_date = {value for (key, value) in article_meta_data.items() if key == 'pubdate'}
print(article_published_date)
{'2020-10-13T01:31:25Z'}
article_author = {value for (key, value) in article_meta_data.items() if key == 'author'}
print(article_author)
{'Maggie Fox, CNN'}
article_summary = {value for (key, value) in article_meta_data.items() if key == 'description'}
print(article_summary)
{'Johnson&Johnson said its Janssen arm had paused its coronavirus vaccine trial after an "unexplained illness" in one of the volunteers testing its experimental Covid-19 shot.'}
article_keywords = {value for (key, value) in article_meta_data.items() if key == 'keywords'}
print(article_keywords)
{"health, Johnson & Johnson pauses Covid-19 vaccine trial after 'unexplained illness' - CNN"}
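The set comprehensions above each scan the whole dictionary to find one key. Assuming article.meta_data behaves like a regular dict (the .items() calls above suggest it does), a plain .get() lookup is a shorter alternative. This is only a sketch: pick_meta is a hypothetical helper, and sample_meta stands in for article.meta_data using values from the output above.

```python
def pick_meta(meta_data, keys=('pubdate', 'author', 'description', 'keywords')):
    """Return {key: value} for each requested key; missing keys map to None."""
    return {key: meta_data.get(key) for key in keys}


# Stand-in for article.meta_data, so the sketch runs without newspaper3k
sample_meta = {
    'pubdate': '2020-10-13T01:31:25Z',
    'author': 'Maggie Fox, CNN',
}

fields = pick_meta(sample_meta)
print(fields['pubdate'])
print(fields['keywords'])
```

With a real article you would call pick_meta(article.meta_data). Keys that the page does not define come back as None instead of an empty set.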