I'm trying to scrape the data from the data table here, selecting it by its id with an XPath call:
library(rvest)
library(dplyr)
url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"
h <- url %>% read_html()
h %>% html_nodes(xpath = "//*[@id='qs-rankings-indicators']") %>% html_table()
The last command gives me this error:
Error in matrix(NA_character_, nrow = n, ncol = maxp) :
invalid 'ncol' value (too large or NA)
In addition: Warning messages:
1: In max(p) : no non-missing arguments to max; returning -Inf
2: In matrix(NA_character_, nrow = n, ncol = maxp) :
NAs introduced by coercion to integer range
What am I missing here?
The table is rendered by JavaScript, so it may be easier to fetch the JSON data directly from its source. Try this:
tstamp <- function() as.character(trunc(as.numeric(Sys.time()) * 1e3))
url <- "https://www.topuniversities.com/sites/default/files/qs-rankings-data/357051.txt"
res <-
  jsonlite::fromJSON(paste0(url, "?_=", tstamp()))$data[, c(
    "rank_display", "score", "title", "country", "region"
  )]
Output
> head(res)
  rank_display score                                        title        country        region
1            1   100  Massachusetts Institute of Technology (MIT)  United States North America
2            2  98.7                          Stanford University  United States North America
3            3  98.4                           Harvard University  United States North America
4            4  97.7 California Institute of Technology (Caltech)  United States North America
5            5  95.6                      University of Cambridge United Kingdom        Europe
6            6  95.3                         University of Oxford United Kingdom        Europe
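To see why the `$data[, c(...)]` indexing works, here is a minimal offline sketch. It assumes the feed is a JSON object with a top-level `data` array of records; the two sample records reuse values from the output above, while `extra_field` is a hypothetical column added only to show that the selection drops it. Converting `score` with `as.numeric()` is also an assumption, in case the feed delivers scores as strings.

```r
library(jsonlite)

# Hypothetical sample mimicking the assumed shape of the feed:
# a top-level "data" array of objects. "extra_field" is invented.
sample_json <- '{"data":[
  {"rank_display":"1","score":"100","title":"Massachusetts Institute of Technology (MIT)",
   "country":"United States","region":"North America","extra_field":"x"},
  {"rank_display":"2","score":"98.7","title":"Stanford University",
   "country":"United States","region":"North America","extra_field":"y"}
]}'

# fromJSON() simplifies an array of records into a data frame,
# so $data can be subset by column name like any other data frame.
parsed <- fromJSON(sample_json)
res_sample <- parsed$data[, c("rank_display", "score", "title", "country", "region")]

# Assumption: scores arrive as strings, so convert them before doing arithmetic.
res_sample$score <- as.numeric(res_sample$score)

str(res_sample)
```

The same column selection should apply to the real feed as long as it keeps this structure.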
Your selector actually does find the table node, as you can see by running
library(rvest)
library(dplyr)
url <- "https://www.topuniversities.com/university-rankings/world-university-rankings/2018"
h <- url %>% read_html()
h %>%
  html_nodes(xpath = "//*[@id='qs-rankings-indicators']")
{xml_nodeset (1)}
[1] <table id="qs-rankings-indicators" class="order-column" cellspacing="0" width="100%"></table>
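i.e. the same pipeline without the final %>% html_table()
The table contains no data because its rows are loaded by JavaScript after the initial HTML page has loaded.
To get the table with its JavaScript-loaded content, you need a scraping tool that can run the site's JavaScript (I recommend RSelenium(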
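A quick way to confirm this diagnosis before calling html_table() is to count the rows inside the matched node. The sketch below is only an illustration: it uses a hypothetical inline HTML string as a stand-in for the downloaded page, so it runs without network access.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page: the table shell exists,
# but its rows would only be added later by JavaScript.
page <- read_html(
  '<html><body><table id="qs-rankings-indicators"></table></body></html>'
)

tbl <- html_nodes(page, xpath = "//*[@id='qs-rankings-indicators']")

# Count <tr> elements inside the matched node(s).
n_rows <- length(html_nodes(tbl, "tr"))

if (n_rows == 0) {
  message("Table node found, but it has no rows; the data is probably loaded by JavaScript.")
} else {
  print(html_table(tbl))
}
```

If the row count is zero, parsing the static HTML will not help, and the data has to come from the underlying request or from a browser-driven scraper.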