I'm new to web scraping. The code below returns an empty character vector, and I don't know how to fix it:
library(rvest)
google_url <- "https://news.google.com/topstories?hl=en-GB&gl=GB&ceid=GB:en"
google <- read_html(google_url)
articles <- google %>% html_nodes('.VDXfz') %>% html_text()
articles
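The following gets all the headlines from the page as it is currently loaded. If you need to scroll and extract further data, you will need RSelenium.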
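library(rvest)
url <- 'https://news.google.com/topstories?hl=en-GB&gl=GB&ceid=GB:en'
# Read the page, narrow to the .lBwEZb nodes, then take the text
# of the .DY5T1d nodes found inside them
url %>%
  read_html() %>%
  html_nodes('.lBwEZb') %>%
  html_nodes('.DY5T1d') %>%
  html_text()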
[1] "Liz Truss to hold Brexit talks with EU over NI protocol"
[2] "Lord Frost: I didn't support PM's coercive Covid plan"
[3] "David Frost: I never disagreed with Boris Johnson over Brexit policy – only coercive Covid rules"
[4] "Look at the lauding of David Frost and see a government deranged by the poison of Brexit"
[5] "What happened to the amiable, hard-working David Frost I once knew?"
[6] "COVID-19: Omicron now dominant variant in US after making up 73% of new cases, says CDC"
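
An empty character vector, as in the question, is what rvest returns when a selector matches nothing in the downloaded HTML. One possible cause (an assumption, not something confirmed above) is that the class name is absent from the markup that `read_html()` receives. The sketch below illustrates this offline. The HTML and the class names `card`, `headline` and `no-such-class` are made up for the example and are not Google News's real markup, and it assumes an rvest version that provides `minimal_html()`.

```r
library(rvest)

# Hypothetical stand-in for a downloaded page; markup and class names
# are invented for illustration only.
page <- minimal_html('
  <div class="card"><h3><a class="headline">First story</a></h3></div>
  <div class="card"><h3><a class="headline">Second story</a></h3></div>
')

# A selector that matches nothing gives a zero-length node set, so
# html_text() returns an empty character vector instead of an error.
missing <- page %>% html_nodes(".no-such-class") %>% html_text()
length(missing)

# Chaining html_nodes() searches inside the nodes matched so far.
headlines <- page %>%
  html_nodes(".card") %>%
  html_nodes(".headline") %>%
  html_text()
headlines
```

Checking `length()` on the result of `html_nodes()` before calling `html_text()` is a quick way to tell whether a selector still matches the page.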