Python newspaper caching problem: same output on every call

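I am using this module: https://github.com/codelucas/newspaper to download Bitcoin articles from https://news.bitcoin.com/. But when I try to get the next articles from the next page, "https://news.bitcoin.com/page/2/page", I get the same output. The same happens for any other page.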

I have tried different websites and different starting pages. The articles from the first link I used show up for every other link as well.

import newspaper

url = 'https://news.bitcoin.com/page/2'
# memoize_articles=False turns off newspaper's cache of already-seen articles
btc_articles = newspaper.build(url, memoize_articles=False)
for article in btc_articles.articles:
    print(article.url)

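The newspaper library tries to crawl the whole site, not just the link you pass in. That means you should not have to loop over every page to get the articles. However, as you may have noticed, the library does not find all of the articles anyway.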

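The reason seems to be that it does not identify all of the pages as categories (and it finds no feeds). See below; the output is the same regardless of which page you start from: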

import newspaper

url = 'https://news.bitcoin.com/'
btc_paper = newspaper.build(url, memoize_articles=False)
# Show which category pages and feeds newspaper detected for the site
print('Categories:', [category.url for category in btc_paper.categories])
print('Feeds:', [feed.url for feed in btc_paper.feeds])

Output:

Categories: ['https://news.bitcoin.com/page/2', 'https://news.bitcoin.com']
Feeds: []

This appears to be a bug in the code (or poor website design on Bitcoin's part, depending on how you look at it), as you pointed out in the bug report https://github.com/codelucas/newspaper/issues/670.
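Until that is resolved, one possible workaround is to walk the listing pages yourself instead of relying on category detection. The sketch below is only an illustration using the standard library, not something newspaper provides: the helper names `page_urls` and `same_site_links` are made up here, the `/page/N` pattern is assumed from the URLs in the question, and downloading each page's HTML is left to you.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def page_urls(base, last_page):
    """Build listing URLs, assuming the site paginates as /page/N (hypothetical helper)."""
    base = base.rstrip("/")
    return [base + "/"] + [f"{base}/page/{n}" for n in range(2, last_page + 1)]


def same_site_links(html, page_url):
    """Return de-duplicated links from the HTML that stay on the page's own host."""
    collector = LinkCollector(page_url)
    collector.feed(html)
    host = urlparse(page_url).netloc
    seen, result = set(), []
    for link in collector.links:
        if urlparse(link).netloc == host and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

Each link returned this way could then be passed to `newspaper.Article(link)`, followed by `download()` and `parse()`. Note that the result will include navigation links as well as articles, so some filtering by URL pattern would probably still be needed for a real site.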
