How do I scrape articles from multiple news sources into one list with the Newspaper library in Python?



Dear Stack Overflow community!

This is a follow-up to a question I posted here earlier.

I want to extract newspaper article URLs from multiple sources into one list using the Newspaper library. This works fine for a single source, but as soon as I add a second source link, only the URLs from the second source get extracted.

import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"},
           "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

for source, value in website.items():
    if 'rss' in value:
        d = fp.parse(value['rss'])
        # if there is an RSS value for a company, it will be extracted into d
        article_list = []
        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                article_list.append(article['link'])
                print(article['link'])

The output below only contains links appended from the second source:

['https://www.cnbc.com/2019/10/23/why-china-isnt-cutting-lending-rates-like-the-rest-of-the-world.html', 'https://www.cnbc.com/2019/10/22/stocks-making-the-biggest-moves-after-hours-snap-texas-instruments-chipotle-and-more.html', ...]

I would like all URLs from both sources to be extracted into the list. Does anyone know a solution to this problem? Many thanks in advance!!

article_list

is overwritten in your outer for loop. Every time the loop moves on to a new source, article_list is set to a fresh empty list, effectively discarding everything collected from the previous sources. That is why you end up with links from only one source, the last one.

You should initialize article_list once at the start instead of overwriting it.

import feedparser as fp
import newspaper
from newspaper import Article

website = {"cnn": {"link": "edition.cnn.com", "rss": "rss.cnn.com/rss/cnn_topstories.rss"},
           "cnbc": {"link": "cnbc.com", "rss": "cnbc.com/id/10000664/device/rss/rss.html"}}

article_list = []  # INIT ONCE
for source, value in website.items():
    if 'rss' in value:
        d = fp.parse(value['rss'])
        # if there is an RSS value for a company, it will be extracted into d
        # article_list = []  THIS IS WHERE IT WAS BEING OVERWRITTEN
        for entry in d.entries:
            if hasattr(entry, 'published'):
                article = {}
                article['link'] = entry.link
                article_list.append(article['link'])
                print(article['link'])
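For what it's worth, the same "one list across all sources" accumulation can also be written as a single list comprehension. The snippet below is a minimal offline sketch of that pattern: the `feeds` dict is a made-up stand-in for the parsed RSS results (mimicking the shape of feedparser's `d.entries`), so no network access is needed.

```python
# Illustrative stand-in for the parsed feeds: each source maps to a list of
# entry-like dicts, shaped like the entries feedparser would return.
feeds = {
    "cnn": [{"link": "https://edition.cnn.com/a", "published": "Mon"}],
    "cnbc": [{"link": "https://www.cnbc.com/b", "published": "Tue"},
             {"link": "https://www.cnbc.com/c", "published": "Wed"}],
}

# One list built across ALL sources: the comprehension walks every source's
# entries and keeps only those that carry a 'published' field.
article_list = [entry["link"]
                for entries in feeds.values()
                for entry in entries
                if "published" in entry]

print(article_list)
```

Because the list is created once, before any source is visited, links from every source survive to the end, which is exactly the fix applied above.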
