Newspaper3k, user agents, and scraping

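I'm building text files made up of the author, publication date, and body text of news articles. I already have code for that part, but I first need Newspaper3k to identify the relevant information in these articles. Since user agent specification has been a problem before, I also specify the user agent. Here is my code so you can follow along. This is Python version 3.9.0.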

import time, os, random, nltk, newspaper
from newspaper import Article, Config

# Send a browser-like user agent with the request
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent

url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'
article = Article(url, config=config)
article.download()
# article.html  # uncomment to inspect the raw HTML
article.parse()
article.nlp()

# Expected: authors, date, and body text (not returned for this URL)
article.authors
article.publish_date
article.text

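To see why this case is especially puzzling, replace the link I gave above with this link and rerun the code. With this link, the code runs correctly and returns the authors, date, and text. With the link in the code above, it doesn't. What am I overlooking here?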

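Apparently, Newspaper requires us to specify the language we're interested in. For some strange reason this code still doesn't extract the authors, but it's good enough for me. Here is the code, in case anyone else can benefit from it.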


#
# Imports our modules
#
import time, os, random, nltk, newspaper
from newspaper import Article
from googletrans import Translator

translator = Translator()
# The link we're interested in
url = 'https://www.eluniversal.com.mx/estados/matan-3-policias-durante-ataque-en-nochistlan-zacatecas'

#
# Extracts the metadata, telling Newspaper the article is in Spanish
#
article = Article(url, language='es')
article.download()
article.parse()
article.nlp()
#
# Makes these into strings so they'll get into the list
#
authors = str(article.authors)
date = str(article.publish_date)
# Translates the generated summary (not the full text) into English
maintext = translator.translate(article.summary).text

# Makes the list we'll print, one element per line
elements = [authors + "\n", date + "\n", maintext + "\n", url]
for x in elements:
    print(x)
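
Since the end goal is a text file per article, here is a rough sketch of how the extracted fields could be turned into one. It uses only the standard library and doesn't touch Newspaper at all. The helper names `format_record` and `save_record`, the "Unknown author" / "Unknown date" fallbacks, and the sample values are my own assumptions for illustration, not anything from the library. The fallback is there because the authors list can come back empty, as it does above.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional, Sequence


def format_record(authors: Sequence[str], publish_date: Optional[datetime],
                  text: str, url: str) -> str:
    """Return one article as four newline-separated fields (hypothetical helper)."""
    # Fall back to a placeholder when extraction found no authors or date
    author_line = ", ".join(authors) if authors else "Unknown author"
    date_line = publish_date.isoformat() if publish_date else "Unknown date"
    return "\n".join([author_line, date_line, text.strip(), url]) + "\n"


def save_record(path: Path, record: str) -> None:
    """Write the record as UTF-8 so accented characters survive (hypothetical helper)."""
    path.write_text(record, encoding="utf-8")


# Placeholder values standing in for article.authors, article.publish_date, etc.
sample = format_record([], datetime(2021, 1, 2), "Example body text.",
                       "https://example.com/article")
print(sample)
```

With a parsed article, the call would presumably look like `format_record(article.authors, article.publish_date, article.text, url)`, followed by `save_record(Path("article.txt"), ...)`.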
