Python library Newspaper does not return the publish date

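I am using the newspaper Python library to extract some data from news stories. The problem is that I get no data for some URLs. These URLs work fine; they all return 200. I am doing this for a very large dataset, but this is one of the URLs for which date extraction does not work. The code works for some links and not for others (from the same domain), so I know the problem is not that my IP has been blocked for making too many requests. I tried it on just one URL and got the same result (no data).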

import os
import sys
from newspaper import Article

def split(link):
    try:
        story = Article(link)
        story.download()
        story.parse()
        date_time = str(story.publish_date)
        split_date = date_time.split()
        date = split_date[0]
        if date != "None":
            print(date)
    except:
        print("This URL did not return a published date. Try a different URL.")
        print(link)

if __name__ == "__main__":
    link = "https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one"
    split(link)

I get this output:

This URL did not return a published date. Try a different URL.
https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one

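Please check the link. I checked it, and it is no longer available. If the link is unavailable, the code will not work.

Try adding some error handling to your code to catch URLs that return a 404, such as https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one: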

from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one'

try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
except ArticleException as error:
    # parse() raises ArticleException when the download failed (here, a 404)
    print(error)

Output:

Article `download()` failed with 404 Client Error: Not Found for url: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one on URL https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one

Newspaper3k has several ways of extracting the publish date from an article. Take a look at the document I wrote on how to use Newspaper3k.
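As a rough illustration of what a fallback strategy could look like, the sketch below reads a date from the URL path when the parsed article has no publish date. This is not Newspaper3k's own implementation. The names `URL_DATE_PATTERN`, `date_from_url` and `first_available_date` are hypothetical, and the code assumes that `publish_date` is either `None` or a `datetime` object.

```python
import re
from datetime import date

# Hypothetical pattern: matches /YYYY/M/D or /YYYY/MM/DD inside a URL path.
URL_DATE_PATTERN = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})(?:/|$)")


def date_from_url(url):
    """Return the date embedded in the URL path, or None if there is none."""
    match = URL_DATE_PATTERN.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # The digits matched the pattern but do not form a real calendar date.
        return None


def first_available_date(publish_date, url):
    """Prefer the parsed publish date; fall back to the date in the URL."""
    if publish_date is not None:
        # Assumption: publish_date is a datetime, so .date() is available.
        return publish_date.date()
    return date_from_url(url)
```

After a successful `article.parse()`, this could be called as `first_available_date(article.publish_date, base_url)`. It cannot help with a URL that returns a 404, because in that case `parse()` fails before any date is available.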

Below is an example using a valid URL, https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water, which extracts data elements from the page's meta tags.

from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water'

try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()

    # meta_data holds the values parsed from the page's meta tags
    article_meta_data = article.meta_data

    article_title = [value for (key, value) in article_meta_data.items() if key == 'pageTitle']
    print(article_title)

    article_published_date = str([value for (key, value) in article_meta_data.items() if key == 'publishedDate'])
    print(article_published_date)

    article_description = [value for (key, value) in article_meta_data.items() if key == 'description']
    print(article_description)
except ArticleException as error:
    print(error)

Output:

['Lords of Water']
['2022-03-31T06:08:59']
['Is water the new oil? We expose the financialisation of water.']
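The published date above is printed as the string form of a one-element list. If a real date object is more convenient, a minimal sketch along these lines could convert the meta tag value. It assumes that `meta_data` behaves like a dictionary and that the value is an ISO 8601 string such as the one shown in the output. The helper name `published_date_from_meta` is hypothetical, and the `publishedDate` key is specific to this site.

```python
from datetime import datetime


def published_date_from_meta(meta_data, key="publishedDate"):
    """Return the meta tag value as a datetime, or None if missing or invalid."""
    value = meta_data.get(key)
    if not isinstance(value, str):
        # Key is absent, or the value is not text that can be parsed.
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        # The value is present but is not in ISO 8601 format.
        return None
```

With the article from the example, `published_date_from_meta(article.meta_data)` would be the call to try. Other sites may use a different key, so the `key` argument would need adjusting per domain.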
