R - Web scraping tables on college basketball stats

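I'm new to web scraping, and as a test project I'm trying to scrape every data table for this particular team from the site below. There should be 15 tables, but when I run my code it only pulls the first 6 of the 15. How do I get the rest of the tables?

The code is as follows: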

library(tidyverse)
library(rvest)
library(stringr)
library(lubridate)
library(magrittr)
iowa_stats<- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")
iowa_stats %>% html_table()

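Edit: I decided to dig deeper into the problem to see whether I could learn more. I started with the first table that does not show up when you call html_table(), which is the 'Totals' table. I walked down the HTML path toward that table to see where things go wrong, using the following code: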

iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper")

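This is as far as I can get before I get an error. The next step down should be the is_setup element where the table is stored, but if I add it to the code above, it does not exist. It also does not exist when I enter the following: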

iowa_stats %>% html_nodes("body") %>% html_nodes("div#wrap") %>% html_nodes("div#all_totals.table_wrapper") %>% html_nodes("div")

Does anyone who is better with HTML/CSS have any idea why this happens?

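It looks like this web page stores some of its tables inside HTML comments, so the parser never sees them as table nodes. To work around this, read and save the page, remove the comment tags, and then process it normally.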

library(rvest)
library(xml2)   # write_xml() comes from xml2, which newer rvest versions do not attach
library(dplyr)

iowa_stats <- read_html("https://www.sports-reference.com/cbb/schools/iowa/2021.html")

# Only save and work with the body
body <- html_node(iowa_stats, "body")
write_xml(body, "temp.xml")

# Strip the comment delimiters but keep the markup between them.
# (Negative indexing with grep() would drop every line if nothing matched.)
lines <- readLines("temp.xml")
lines <- gsub("<!--|-->", "", lines)
writeLines(lines, "temp2.xml")

# Read the file back in and process normally
body <- read_html("temp2.xml")
html_nodes(body, "table") %>% html_table()
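
If you would rather avoid the temporary files, the same idea can probably be done in memory by selecting the comment nodes with XPath and parsing their text as HTML. The sketch below is only an illustration: the miniature page and its ids (roster, totals, div_totals) are made up to mimic the structure described in the question, not copied from the real site.

```r
library(rvest)

# Hypothetical miniature page: one visible table and one wrapped in a comment
page <- read_html('
<html><body>
  <div id="all_roster"><table id="roster">
    <tr><th>Player</th><th>Pos</th></tr>
    <tr><td>A</td><td>G</td></tr>
  </table></div>
  <div id="all_totals">
  <!--
    <div id="div_totals"><table id="totals">
      <tr><th>Player</th><th>PTS</th></tr>
      <tr><td>A</td><td>10</td></tr>
    </table></div>
  -->
  </div>
</body></html>')

# A normal query only finds the table that is not commented out
visible <- html_table(html_nodes(page, "table"))

# Collect the text of every comment node and parse it as HTML
comment_html <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text() %>%
  paste(collapse = "")
hidden <- html_table(html_nodes(read_html(comment_html), "table"))

# Combine both sets of tables into one list
all_tables <- c(visible, hidden)
length(all_tables)
```

To try this on the real page, replace the inline string with the URL from the question. Note that read_html(comment_html) assumes at least one comment was found; on a page with no comments the string is empty and that call would fail, so a length check may be worth adding.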
