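I would like to scrape the table on the page at https://thearcfooty.com/2017/01/28/a-complete-history-of-the-afl/
The challenge is that it is a scrolling table. The text at the bottom of the table shows that it contains 31,228 records: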
Showing 1 to 10 of 31,228 entries
I am new to rvest. After inspecting the table in Google Chrome, I tried the following:
library(rvest)
url <- "https://thearcfooty.com/2017/01/28/a-complete-history-of-the-afl/"
Table <- url %>%
read_html() %>%
html_nodes(xpath= '//*[@id="table_1"]') %>%
html_table()
TableNew <- Table[[1]]
TableNew
But it just hangs. Ideally, I would like to get back a data frame containing all of the records, with every row and column.
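My guess is that some of the code inside html_table is slow, which is why it seems to run endlessly. Instead, you can read the text of every table row and reshape it into a data frame yourself. I have not checked that the result is fully correct, but judging from the few rows I looked at, it should be fine.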
library(rvest)
#> Loading required package: xml2
library(data.table)
url <- "https://thearcfooty.com/2017/01/28/a-complete-history-of-the-afl/"
page <- read_html(url)
tb_str <- page %>%
html_nodes(css = 'tr') %>%
html_text()
dt <- data.table(raw=tb_str)
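# The first row holds the column names, separated by non-word characters
headers <- strsplit(tb_str[1],split = "\\W+")[[1]]
# Cells within a row are separated by a newline followed by spaces
dt[,(headers):=tstrsplit(raw,split="\n +")]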
dt[,raw:=NULL]
str(dt[!is.na(season)])
#> Classes 'data.table' and 'data.frame': 31228 obs. of 14 variables:
#> $ date : chr "08/05/1897" "08/05/1897" "08/05/1897" "08/05/1897" ...
#> $ season : chr "1897" "1897" "1897" "1897" ...
#> $ round : chr "1" "1" "1" "1" ...
#> $ home_away : chr "A" "A" "A" "A" ...
#> $ team : chr "CA" "SK" "ME" "ES" ...
#> $ opponent : chr "FI" "CW" "SY" "GE" ...
#> $ margin_pred : chr "0.00" "0.00" "0.00" "-2.99" ...
#> $ margin_actual : chr "-33.00" "-25.00" "17.00" "23.00" ...
#> $ win_prob : chr "0.50" "0.50" "0.50" "0.47" ...
#> $ result : chr "0.18" "0.24" "0.69" "0.74" ...
#> $ team_elo_pre : chr "1500" "1500" "1500" "1500" ...
#> $ opponent_elo_pre : chr "1500" "1500" "1500" "1500" ...
#> $ team_elo_post : chr "1473" "1478" "1515" "1522" ...
#> $ opponent_elo_post: chr "1526" "1521" "1484" "1477" ...
#> - attr(*, ".internal.selfref")=<externalptr>
Created on 2020-07-27 by the reprex package (v0.3.0)
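As the str() output shows, every column comes back as character. Below is a minimal offline sketch of the same split-and-reshape idea with a type-conversion step added at the end. The two strings in tb_str are a hypothetical stand-in for what html_text() returns for a header row and one data row (only three of the 14 columns are used), and the type.convert() step is my own addition, not part of the code above.

```r
library(data.table)

# Hypothetical stand-in for the html_text() output of two <tr> nodes:
# one header row and one data row, cells separated by newline + spaces.
tb_str <- c(
  "date\n    season\n    round",
  "08/05/1897\n    1897\n    1"
)

dt <- data.table(raw = tb_str)

# Column names come from the first row, split on non-word characters.
headers <- strsplit(tb_str[1], split = "\\W+")[[1]]

# Split every row into cells and spread them across the named columns.
dt[, (headers) := tstrsplit(raw, split = "\n +")]
dt[, raw := NULL]

# Drop the header row, then let R guess a type for each column;
# as.is = TRUE keeps non-numeric columns as character instead of factor.
dt <- dt[-1]
dt[, (headers) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = headers]

str(dt)
```

On the real table the same last line should turn columns such as season and margin_actual into numeric types, while date stays character and would still need an explicit as.Date(date, format = "%d/%m/%Y") if a Date column is wanted.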