Scraping dataTable with rvest by id, doesn't find table

I'm trying to scrape the data from the table here, selecting it by id with an XPath:

library(rvest)
library(dplyr)
url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"  
h <- url %>% read_html() 
h %>% html_nodes(xpath = "//*[@id='qs-rankings-indicators']") %>% html_table()

The last command gives me this error:

Error in matrix(NA_character_, nrow = n, ncol = maxp) : 
invalid 'ncol' value (too large or NA)
In addition: Warning messages:
1: In max(p) : no non-missing arguments to max; returning -Inf
2: In matrix(NA_character_, nrow = n, ncol = maxp) :
NAs introduced by coercion to integer range

What am I missing here?

The table is rendered by JavaScript. It may be simpler to take the JSON data directly from the source file the page loads. Try this:

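# Millisecond timestamp, appended as a cache-busting query parameter
tstamp <- function() as.character(trunc(as.numeric(Sys.time()) * 1e3))
# JSON file that the page's JavaScript loads to fill the table
url <- "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt"
# Parse the JSON and keep only the columns of interest
res <- 
  jsonlite::fromJSON(paste0(url, "?_=", tstamp()))$data[, c(
    "rank_display", "score", "title", "country", "region"
  )]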

Output

> head(res)
  rank_display score                                        title        country        region
1            1   100  Massachusetts Institute of Technology (MIT)  United States North America
2            2  98.7                          Stanford University  United States North America
3            3  98.4                           Harvard University  United States North America
4            4  97.7 California Institute of Technology (Caltech)  United States North America
5            5  95.6                      University of Cambridge United Kingdom        Europe
6            6  95.3                         University of Oxford United Kingdom        Europe
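
The printed scores (100 next to 98.7) suggest that the score column comes back as character rather than numeric. Assuming that is the case, a minimal sketch of converting it before filtering might look like this. The data frame res_sample is a hypothetical stand-in built from the first rows shown above, so the snippet runs without network access:

```r
# Hypothetical stand-in for the first rows of `res`, with score as character
res_sample <- data.frame(
  rank_display = c("1", "2", "3"),
  score = c("100", "98.7", "98.4"),
  title = c(
    "Massachusetts Institute of Technology (MIT)",
    "Stanford University",
    "Harvard University"
  ),
  stringsAsFactors = FALSE
)

# Convert score to numeric so comparisons are numeric, not alphabetical
res_sample$score <- as.numeric(res_sample$score)

# Keep only the rows with a score of at least 98.5
top <- res_sample[res_sample$score >= 98.5, ]
```

The same two steps should apply to res itself if its score column is also character.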

You have actually already found the table with

library(rvest)
library(dplyr)
url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"  
h <- url %>% read_html() 
h %>% 
html_nodes(xpath = "//*[@id='qs-rankings-indicators']")
{xml_nodeset (1)}
[1] <table id="qs-rankings-indicators" class="order-column" cellspacing="0" width="100%"></table>

i.e. without the final %>% html_table().

The reason there is no data in the table is that it is loaded with JavaScript after the initial HTML page load.
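
This can be checked offline by parsing the empty table shell on its own. The sketch below is only an illustration: static_html is a hypothetical copy of the node printed above, and the row count is used as a guard so that html_table() is called only when there is something to parse.

```r
library(rvest)

# Hypothetical copy of the empty <table> shell found in the static HTML
static_html <- paste0(
  '<table id="qs-rankings-indicators" class="order-column" ',
  'cellspacing="0" width="100%"></table>'
)

tbl <- read_html(static_html) %>%
  html_nodes(xpath = "//*[@id='qs-rankings-indicators']")

n_tables <- length(tbl)                  # the table node itself is found
n_rows <- length(html_nodes(tbl, "tr"))  # but it contains no rows

# Only parse the table when it actually has rows
parsed <- if (n_rows > 0) html_table(tbl[[1]]) else NULL
```

Here the node is matched but has no rows, so parsed stays NULL instead of triggering the error from the question.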

To get the table with the JavaScript-loaded content, you need a scraping tool that can run the site's JavaScript (I recommend RSelenium).
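
A rough sketch of that approach follows. It assumes RSelenium is installed and can start a local Firefox session. The helper name scrape_rendered_table, its parameters, and the fixed wait are all hypothetical choices for illustration, not part of the original answer:

```r
# Hypothetical helper: render the page in a real browser, then parse with rvest
scrape_rendered_table <- function(url, table_id, wait_secs = 5) {
  # Start a Selenium server and a browser session (assumes Firefox is available)
  rD <- RSelenium::rsDriver(browser = "firefox", verbose = FALSE)
  remDr <- rD$client
  # Close the browser and stop the server when the function exits
  on.exit({
    remDr$close()
    rD$server$stop()
  }, add = TRUE)

  remDr$navigate(url)
  Sys.sleep(wait_secs)  # crude wait for the JavaScript to fill the table

  # Hand the rendered page source to rvest and parse the table by id
  page <- xml2::read_html(remDr$getPageSource()[[1]])
  node <- rvest::html_node(page, xpath = sprintf("//*[@id='%s']", table_id))
  rvest::html_table(node)
}
```

It would be called as scrape_rendered_table(url, "qs-rankings-indicators"). If the table is paginated in the browser, this would presumably return only the rows currently rendered, which is one reason the JSON approach above may be simpler.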
