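I'm using the newspaper Python library to extract some data from news stories. The problem is that I get no data for some URLs. The URLs work fine; they all return 200. I'm doing this for a very large dataset, but this is one of the URLs where date extraction fails. The code works for some links and not for others (from the same domain), so I know the problem isn't that my IP has been blocked for making too many requests. I also tried it on just this one URL and got the same result (no data).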
import os
import sys
from newspaper import Article

def split(link):
    try:
        story = Article(link)
        story.download()
        story.parse()
        date_time = str(story.publish_date)
        split_date = date_time.split()
        date = split_date[0]
        if date != "None":
            print(date)
    except:
        print("This URL did not return a published date. Try a different URL.")
        print(link)

if __name__ == "__main__":
    link = "https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one"
    split(link)
I get this output:
This URL did not return a published date. Try a different URL.
https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
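Please check the link. I checked it, and it is no longer available. If the link is unavailable, the code won't work.
Try adding some error handling to your code to catch URLs that return a 404, such as https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one: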
from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
except ArticleException as error:
    print(error)
Output:
Article `download()` failed with 404 Client Error: Not Found for url: https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one on URL https://www.aljazeera.com/program/featured-documentaries/2020/12/29/lords-of-water-episode-one
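The bare `except` in the question prints the same message for every failure, so a 404 and a page with no detectable date look identical. Once download errors are caught separately as above, the remaining "no date" case can be handled explicitly instead of by string-splitting `str(story.publish_date)`. The following is only a minimal sketch: `format_publish_date` is a hypothetical helper (not part of newspaper), and it assumes `publish_date` is a `datetime` when a date was found and something else (such as `None`) when it was not.

```python
from datetime import datetime
from typing import Optional


def format_publish_date(publish_date: object) -> Optional[str]:
    """Return 'YYYY-MM-DD' for a datetime, or None when no date is available.

    Hypothetical helper: assumes the caller passes whatever the article
    object exposes as its publish date after parsing.
    """
    if isinstance(publish_date, datetime):
        return publish_date.date().isoformat()
    # Anything else (None, empty string, ...) is treated as "no date found".
    return None


if __name__ == "__main__":
    print(format_publish_date(datetime(2022, 3, 31, 6, 8, 59)))
    print(format_publish_date(None))
```

After `article.parse()` succeeds, calling `format_publish_date(article.publish_date)` and checking for `None` would then report a missing date without relying on an exception being raised.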
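Newspaper3k has multiple ways to extract the publish date from an article. Take a look at the document I wrote on how to use Newspaper3k.
Below is an example for a valid URL, https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water, which extracts data elements from the page's meta tags.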
from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

base_url = 'https://www.aljazeera.com/program/featured-documentaries/2022/3/31/lords-of-water'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
    article_meta_data = article.meta_data

    article_title = [value for (key, value) in article_meta_data.items() if key == 'pageTitle']
    print(article_title)

    article_published_date = str([value for (key, value) in article_meta_data.items() if key == 'publishedDate'])
    print(article_published_date)

    article_description = [value for (key, value) in article_meta_data.items() if key == 'description']
    print(article_description)
except ArticleException as error:
    print(error)
Output:
['Lords of Water']
['2022-03-31T06:08:59']
['Is water the new oil? We expose the financialisation of water.']
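If a real `datetime` is wanted rather than a one-element list, the meta tag value can be looked up directly and parsed. This is a rough sketch only: `published_date_from_meta` is a hypothetical helper, the `publishedDate` key is the one this particular page happens to use (other sites may use different keys), and `sample_meta` simply mirrors the output shown above so the snippet runs without a network request.

```python
from datetime import datetime
from typing import Any, Mapping, Optional


def published_date_from_meta(
    meta_data: Mapping[str, Any], key: str = "publishedDate"
) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp stored under `key`, or return None.

    Hypothetical helper: the default key is taken from the Al Jazeera
    page in this answer and is an assumption for any other site.
    """
    raw = meta_data.get(key)
    if not isinstance(raw, str):
        return None
    try:
        # Handles values like '2022-03-31T06:08:59'; formats that
        # fromisoformat() cannot parse are reported as None.
        return datetime.fromisoformat(raw)
    except ValueError:
        return None


# Stand-in for article.meta_data, copied from the output above.
sample_meta = {
    "pageTitle": "Lords of Water",
    "publishedDate": "2022-03-31T06:08:59",
    "description": "Is water the new oil? We expose the financialisation of water.",
}

if __name__ == "__main__":
    print(published_date_from_meta(sample_meta))
```

In the example above, this would be called as `published_date_from_meta(article.meta_data)` after `article.parse()`.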