I wrote this code:
from newspaper import Article
url = 'http://www.infomoney.com.br/mercados/acoes-e-indices/noticia/7345670/dow-jones-tem-nova-derrocada-puxa-ibovespa-para-segunda-semana'
a = Article(url, language='pt')
a.download()
a.parse()
print(a.text)
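But I need the text with its HTML tags; for example, I need the IMG tags that appear within the text.
This question was asked a year ago, but someone may still find it through Google.
You can get the images and other HTML inside the article text with "a.article_html".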
from newspaper import Article
a = Article('https://www.nytimes.com/2019/04/25/us/politics/joe-biden-anita-hill.html',
            keep_article_html=True,
            language='en')
a.download()
a.parse()
print(a.html) # This article's unchanged and raw HTML
print(a.article_html) # The HTML of this article's main node
Remember to pass the parameter "keep_article_html=True".
You can get the HTML through the html member.
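Since the question asks specifically about IMG tags, here is a minimal sketch of one way to pull the image URLs out of an HTML string such as the one in a.article_html. It uses only Python's standard html.parser module; the names ImgCollector and extract_img_sources are hypothetical helpers invented for this illustration, not part of newspaper, and the sample string is a stand-in for a real article so the snippet runs without a network connection.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical helper: collects the src of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <IMG SRC=...> and <img src=...> are handled alike.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_img_sources(html_fragment):
    """Return the src values of all <img> tags in an HTML string."""
    collector = ImgCollector()
    collector.feed(html_fragment)
    collector.close()
    return collector.sources


# Stand-in for a.article_html (assumed to be a string of HTML).
sample = (
    '<div><p>Intro</p>'
    '<img src="https://example.com/chart.png" alt="chart">'
    '<p>More text</p></div>'
)
print(extract_img_sources(sample))
```

With a real article you would pass a.article_html to extract_img_sources after calling a.download() and a.parse() with keep_article_html=True.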
from newspaper import Article
url = 'http://www.infomoney.com.br/mercados/acoes-e-indices/noticia/7345670/dow-jones-tem-nova-derrocada-puxa-ibovespa-para-segunda-semana'
a = Article(url, language='pt')
a.download()
a.parse()
print(a.text)
html = a.html
print(html)