Python: How to get the timestamps of articles provided by newspaper3k?

When I run

import newspaper

# Example source; any news site URL can be used here
news_source_url = 'http://cnn.com'
cnn_paper = newspaper.build(news_source_url, memoize_articles=False)
for article in cnn_paper.articles:
    print(article.url)

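I get a list of URLs for the articles that the newspaper3k package can download from news_source_url (for example 'http://cnn.com'). Is there a way to get the timestamp of each article?

For CNN, many article URLs appear to have the date encoded in them, but I would like to get article timestamps for any news source. If possible, I would like both the date and the time.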
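
To make the URL observation concrete, here is a minimal sketch that pulls a date out of a URL. It assumes the date appears as a /YYYY/MM/DD/ path segment, which holds for many CNN URLs but is not a general rule. `date_from_url` is a hypothetical helper written for illustration, not part of newspaper3k.

```python
import re
from datetime import date
from typing import Optional

# Assumed pattern: a /YYYY/MM/DD/ segment somewhere in the URL path.
_DATE_IN_PATH = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url: str) -> Optional[date]:
    """Return the date encoded in the URL path, or None if there is none."""
    match = _DATE_IN_PATH.search(url)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched the pattern but do not form a valid calendar date.
        return None
```

This only recovers a date, never a time of day, which is why it cannot serve as a general answer for arbitrary news sources.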

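You can get the publish dates of the articles with Newspaper using the code below. I reformatted the date output because the values came back with a 00:00:00 time component.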

import newspaper

cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)
for item in cnn_paper.articles:
    article = newspaper.Article(item.url)
    article.download()
    article.parse()
    # publish_date is a datetime, or None when no date was found
    if article.url and article.publish_date is not None:
        print(article.url)
        # Keep only the date part and drop the 00:00:00 time component
        publish_date = article.publish_date.strftime('%Y-%m-%d')
        print(publish_date)

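If you need the exact publish date and time of an article, you have to extract them from the page at the article's URL. After looking through the Newspaper source code, I found a meta tag extractor.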

import newspaper

cnn_paper = newspaper.build('http://cnn.com', memoize_articles=False)
for item in cnn_paper.articles:
    article = newspaper.Article(item.url)
    article.download()
    article.parse()
    if article.url and article.publish_date is not None:
        # meta_data holds the values of the page's <meta> tags
        article_meta_data = article.meta_data
        article_published_date = sorted({value for (key, value) in article_meta_data.items() if key == 'pubdate'})
        if article_published_date:
            print(article_published_date)
        else:
            print('no published date provided')
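
Since the goal is a timestamp for any news source, the meta tag lookup can be generalized a little. The following is a minimal sketch, assuming the metadata is available as a dict of strings (as with `article.meta_data` above) and that the timestamp is in ISO 8601 form. `parse_meta_timestamp` and the candidate key names other than 'pubdate' are assumptions made for illustration; sites differ in which meta tags they emit.

```python
from datetime import datetime
from typing import Mapping, Optional, Sequence

# Candidate keys are assumptions; only 'pubdate' comes from the code above.
CANDIDATE_KEYS = ("pubdate", "lastmod", "date")


def parse_meta_timestamp(
    meta_data: Mapping[str, object],
    keys: Sequence[str] = CANDIDATE_KEYS,
) -> Optional[datetime]:
    """Return the first meta value that parses as an ISO 8601 timestamp."""
    for key in keys:
        value = meta_data.get(key)
        if not isinstance(value, str) or not value.strip():
            continue
        text = value.strip()
        # Older Python versions do not accept a trailing 'Z' in fromisoformat.
        if text.endswith("Z"):
            text = text[:-1] + "+00:00"
        try:
            return datetime.fromisoformat(text)
        except ValueError:
            # Not ISO 8601; try the next candidate key.
            continue
    return None
```

Inside the loop above it could be called as `parse_meta_timestamp(article.meta_data)`. It returns None when no candidate key holds a parseable value, so the caller can fall back to `article.publish_date`.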
