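I want to scrape an article from a website with the newspaper library (newspaper3k). However, it cannot find the article's published_date, which is in div.source-date in the page source, and it cannot find the author (or rather, the source), which is in div.delfi-source-name. How can I get the date and the author/source?
Example website/URL: https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501
My code: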
import newspaper
from newspaper import Article
from newspaper import Source
import pandas as pd
article = Article("url")
article.download()
article.parse()
article.nlp()
df = pd.DataFrame([{'Title':article.title, 'Author':article.authors, 'Text':article.text,
'published_date':article.publish_date, 'Source':article.source_url}])
df.to_excel('Delfi-1.xlsx')
Any suggestions?
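The date appears in two places in the page source. The one you see on the page, Wednesday, October 19, 2022, sits in a div tag, which newspaper3k cannot parse without BeautifulSoup.
The second date is tucked away in the meta tags, which newspaper3k can parse with a little extra code.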
from newspaper import Config
from newspaper import Article
from newspaper.article import ArticleException
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10
base_url = 'https://www.delfi.lt/en/politics/foreign-ministry-tsikhanouskayas-consultation-needed-for-treating-belarusians-in-lithuania.d?id=91531501'
try:
    article = Article(base_url, config=config)
    article.download()
    article.parse()
    # meta_data is a nested dictionary built from the page's <meta> tags
    article_meta_data = article.meta_data
    article_title = [value['title'] for (key, value) in article_meta_data.items() if key == 'og']
    print(article_title)
    # the publish time is stored under the cXenseParse meta tags
    article_published_date = [value['recs']['publishtime'] for key, value in article_meta_data.items()
                              if key == 'cXenseParse']
    print(article_published_date)
    article_description = [value['description'] for (key, value) in article_meta_data.items() if key == 'og']
    print(article_description)
except ArticleException as error:
    print(error)
Output:
["Foreign Ministry: Tsikhanouskaya's consultation needed for treating Belarusians in Lithuania"]
['2022-10-19T11:38:07+0300']
["As Belorus, a Belarus-owned sanatorium in Lithuania's southern resort of Druskininkai, complaints over the fact that Lithuania fails to issue visas to Belarusian citizens, forcing the sanatorium to fire a quarter of its staff, Lithuania's Foreign Ministry suggests coordinating the list of arrivals with Belarusian opposition leaders Sviatlana Tsikhanouskaya's office in Vilnius."]
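The publish time printed above is a plain string. If a datetime object is more convenient (for example, for the pandas DataFrame in the question), a minimal sketch along these lines may help. The helper name publish_time_from_meta is hypothetical, and the sample dictionary only mirrors the keys used in the code above; it is not a full copy of article.meta_data.

```python
from datetime import datetime


def publish_time_from_meta(meta_data):
    """Return the cXenseParse publish time as a datetime, or None if it is absent."""
    # Chained .get() calls avoid a KeyError when a page lacks these meta tags.
    raw = meta_data.get("cXenseParse", {}).get("recs", {}).get("publishtime")
    if not raw:
        return None
    # Format assumed from the value shown in the output: '2022-10-19T11:38:07+0300'
    return datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S%z")


# Sample data for illustration only; in practice pass article.meta_data.
sample_meta = {"cXenseParse": {"recs": {"publishtime": "2022-10-19T11:38:07+0300"}}}

published = publish_time_from_meta(sample_meta)
print(published.isoformat())
print(publish_time_from_meta({}))
```

Returning None instead of raising keeps the helper usable on pages from other sites, where the cXenseParse tags are probably missing.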
P.S. Newspaper3k has several ways to extract the publish date from an article. Take a look at the documentation I wrote on how to use Newspaper3k.
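For the visible date and the source name in the div elements (div.source-date and div.delfi-source-name), here is a rough sketch of pulling the text out of the raw HTML. It uses the standard library's html.parser rather than BeautifulSoup, purely to stay dependency-free; with BeautifulSoup, soup.select_one('div.source-date') would be the shorter route. The class DivTextExtractor, the function extract_date_and_source, and the sample markup are all made up for illustration, and the real page structure may differ.

```python
from html.parser import HTMLParser


class DivTextExtractor(HTMLParser):
    """Collect the text of <div> elements whose class list contains a wanted class."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.results = {}      # class name -> collected text
        self._current = None   # wanted class of the div currently being read
        self._depth = 0        # div nesting depth inside that div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._current is not None:
            self._depth += 1   # nested div inside a wanted div
            return
        classes = (dict(attrs).get("class") or "").split()
        match = self.wanted.intersection(classes)
        if match:
            self._current = sorted(match)[0]
            self._depth = 1
            self.results.setdefault(self._current, "")

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.results[self._current] += data


def extract_date_and_source(html):
    """Return a dict mapping each wanted class to its whitespace-normalized text."""
    parser = DivTextExtractor(["source-date", "delfi-source-name"])
    parser.feed(html)
    parser.close()
    return {name: " ".join(text.split()) for name, text in parser.results.items()}


# Made-up markup for illustration; in practice pass article.html after download().
SAMPLE_HTML = """
<div class="article-meta">
  <div class="delfi-source-name">Example Source</div>
  <div class="source-date">Wednesday, October 19, 2022</div>
</div>
"""

fields = extract_date_and_source(SAMPLE_HTML)
print(fields["source-date"])
print(fields["delfi-source-name"])
```

The returned values could then fill the 'published_date' and 'Author' columns of the DataFrame in the question.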