r - Extracting data from a scrolling table using rvest



I would like to extract the table located at https://thearcfooty.com/2017/01/28/a-complete-history-of-the-afl/

The challenge I face is that it is a scrolling table. The text at the bottom of the table shows that it contains 31,228 records:

Showing 1 to 10 of 31,228 entries

I am new to rvest. After inspecting the table in Google Chrome, I tried the following:

library(rvest)
url <- "https://thearcfooty.com/2017/01/28/a-complete-history-of-the-afl/"
Table <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="table_1"]') %>%
  html_table()
TableNew <- Table[[1]]
TableNew

But it just hangs. Ideally, I would like it to return a data frame containing every record, with all rows and columns.

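My guess is that some code inside html_table is slow, which is why it seems to run endlessly. Instead, you can read the text of every row and reshape it into a data frame yourself. I have not checked that the result is fully correct, but judging by the few examples I looked at, it should be fine.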

library(rvest)
#> Loading required package: xml2
library(data.table)
url <- "https://thearcfooty.com/2017/01/28/a-complete-history-of-the-afl/"
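page <- read_html(url)
# Grab the raw text of every table row instead of calling html_table()
tb_str <- page %>%
  html_nodes(css = 'tr') %>%
  html_text()
dt <- data.table(raw = tb_str)
# Column names come from the first row, split on non-word characters
headers <- strsplit(tb_str[1], split = "\\W+")[[1]]
# Cells within a row are separated by a newline followed by spaces
dt[, (headers) := tstrsplit(raw, split = "\n +")]
dt[, raw := NULL]
# Show only the rows with a non-missing season
str(dt[!is.na(season)])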
#> Classes 'data.table' and 'data.frame':   31228 obs. of  14 variables:
#>  $ date             : chr  "08/05/1897" "08/05/1897" "08/05/1897" "08/05/1897" ...
#>  $ season           : chr  "1897" "1897" "1897" "1897" ...
#>  $ round            : chr  "1" "1" "1" "1" ...
#>  $ home_away        : chr  "A" "A" "A" "A" ...
#>  $ team             : chr  "CA" "SK" "ME" "ES" ...
#>  $ opponent         : chr  "FI" "CW" "SY" "GE" ...
#>  $ margin_pred      : chr  "0.00" "0.00" "0.00" "-2.99" ...
#>  $ margin_actual    : chr  "-33.00" "-25.00" "17.00" "23.00" ...
#>  $ win_prob         : chr  "0.50" "0.50" "0.50" "0.47" ...
#>  $ result           : chr  "0.18" "0.24" "0.69" "0.74" ...
#>  $ team_elo_pre     : chr  "1500" "1500" "1500" "1500" ...
#>  $ opponent_elo_pre : chr  "1500" "1500" "1500" "1500" ...
#>  $ team_elo_post    : chr  "1473" "1478" "1515" "1522" ...
#>  $ opponent_elo_post: chr  "1526" "1521" "1484" "1477" ...
#>  - attr(*, ".internal.selfref")=<externalptr>

Created on 2020-07-27 by the reprex package (v0.3.0)
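
The str() output above shows every column as character, so a type-conversion step is probably wanted before any analysis. The following is a minimal offline sketch of the same split-then-convert idea, not part of the original answer. The three row strings are hypothetical stand-ins shaped like the html_text() output, the choice of four columns is only for brevity, and the day/month/year date format is an assumption based on the sample values.

```r
library(data.table)

# Hypothetical row strings imitating html_text() on each <tr>;
# the real page has 14 columns, only four are used here for brevity
tb_str <- c(
  "date\n      season\n      team\n      margin_actual",
  "08/05/1897\n      1897\n      CA\n      -33.00",
  "08/05/1897\n      1897\n      ME\n      17.00"
)

dt <- data.table(raw = tb_str)

# Same technique as the answer: header names from the first row,
# then split every row on newline-plus-spaces into columns
headers <- strsplit(tb_str[1], split = "\\W+")[[1]]
dt[, (headers) := tstrsplit(raw, split = "\n +")]
dt[, raw := NULL]

# In this toy data the header row is simply the first row, so drop it
dt <- dt[-1]

# Convert types: dates are assumed to be day/month/year
dt[, date := as.Date(date, format = "%d/%m/%Y")]
num_cols <- c("season", "margin_actual")
dt[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

str(dt)
```

On the real data, num_cols would need to list all the numeric columns, and it is worth spot-checking a few dates to confirm the assumed format.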
