Newspaper API for scraping articles



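I am using the newspaper3k API in Python to scrape articles. It does not work for articles from the Times of India: the publish date in the result is not valid. Articles from other sites are parsed correctly.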

from newspaper import Article

# url holds the address of the article to scrape
article = Article(url)
article.download()
article.parse()
result = vars(article)
print(result['publish_date'])

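The current version of Newspaper cannot extract the publish date from the Times of India HTML, because the date is located inside a script tag. You can extract it with requests and BeautifulSoup; the latter is bundled with Newspaper. I also noticed that the keywords are in the meta tags, and Newspaper does not extract those either, so I added some code to pull them out. Hopefully the code below helps you query articles on the Times of India. Please let me know if you have any questions.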

import requests
import re as regex
from newspaper import Article
from newspaper.utils import BeautifulSoup
base_url = 'https://timesofindia.indiatimes.com/business/india-business/govt-working-to-reduce-e-vehicle-tax-niti-aayog-ceo/articleshow/78210495.cms'
raw_html = requests.get(base_url)
soup = BeautifulSoup(raw_html.text, 'html.parser')
# parse the publish date (this assumes it sits in the second <script> tag)
data = soup.find_all('script')[1]
find_date = regex.search(r'datePublished.{3}\d{4}-\d{2}-\d{2}', data.string)
date_published = find_date.group().split('"')[2]
# parse other elements using Newspaper
article = Article('')
article.download(raw_html.content)
article.parse()
article_tags = article.tags
article_content = article.text
article_title = article.title
# parse keywords
article_meta_data = article.meta_data
article_keywords = sorted({value for (key, value) in article_meta_data.items() if key == 'keywords'})
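If the script tag's position or the spacing around the date changes, the lookup above can fail with an AttributeError or an IndexError. Here is a minimal sketch of a more defensive version of the two parsing steps, using only the standard library. The sample string, its date value, and the helper names `extract_publish_date` and `split_keywords` are made up for illustration. The sketch also assumes the keywords meta value is a comma-separated string, which may not hold for every page.

```python
import re
from datetime import date
from typing import List, Optional

# Hypothetical sample mimicking the JSON-like metadata inside a <script> tag.
# The date value is invented for this example.
SAMPLE_SCRIPT = '{"@type": "NewsArticle", "datePublished":"2020-01-02T10:00:00+05:30"}'

# Allow optional whitespace around the colon and capture year, month, day.
DATE_PATTERN = re.compile(r'"datePublished"\s*:\s*"(\d{4})-(\d{2})-(\d{2})')


def extract_publish_date(script_text: str) -> Optional[date]:
    """Return the publish date found in script_text, or None if absent or invalid."""
    match = DATE_PATTERN.search(script_text or '')
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Digits matched but do not form a real calendar date.
        return None


def split_keywords(raw_keywords: str) -> List[str]:
    """Split a comma-separated keywords string into sorted, de-duplicated items."""
    return sorted({kw.strip() for kw in raw_keywords.split(',') if kw.strip()})


if __name__ == '__main__':
    print(extract_publish_date(SAMPLE_SCRIPT))
    print(split_keywords('NITI Aayog, e-vehicle tax, NITI Aayog'))
```

To use it with the code above, you could pass `data.string` to `extract_publish_date` and `article.meta_data.get('keywords', '')` to `split_keywords`. Returning None instead of raising lets the caller decide how to handle pages where the date is missing.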
