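I'm trying to use library(polite) to scrape data from a website politely, but I get the error "Error in ind_html[[1]] : subscript out of bounds". Here is what I'm doing: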
library(tidyverse)
library(lubridate)
library(janitor)
library(rvest)
library(httr)
library(polite)
url <- "https://cew.georgetown.edu/cew-reports/roi2022/"
url_bow <- polite::bow(url)
url_bow
ind_html <-
polite::scrape(url_bow) %>%
rvest::html_nodes("table_div") %>%
rvest::html_table(fill = TRUE)
ind_tab <-
ind_html[[1]] %>%
make_clean_names()
ROI_TABLE <- ind_tab %>%
bind_rows() %>%
as_tibble()
I think this error has to do with ind_html[[1]], but I don't know how to fix it. Thanks for your help!
If you are trying to get the table below, we can read the CSV file directly:
df = read_csv('https://cewgeorgetown.github.io/collegeROI-2022/ROIforWeb0222.csv')
# A tibble: 4,419 x 45
Institution State Level `Predominant degr~ Control `10-year NPV ra~ `10-year NPV` `15-year NPV ra~ `15-year NPV` `20-year NPV ra~ `20-year NPV`
<chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alaska Career Col~ AK 2-year Certificate Private f~ 2318 135000 2707 261000 2856 375000
2 Alaska Pacific Un~ AK 4-year Bachelor's Private n~ 3537 87000 2433 274000 1760 443000
3 Alaska Vocational~ AK Less tha~ Certificate Public 63 316000 240 458000 476 587000
4 University of Ala~ AK 4-year Bachelor's Public 2590 124000 1547 312000 1232 484000
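
As for the error itself, here is a minimal sketch of what is probably going on. It uses a made-up page, not the real one: the HTML string, the id `table_div`, and the toy table are all assumptions for illustration. The selector "table_div" has no `#` or `.` prefix, so it is read as a tag name, matches nothing, and `html_table()` returns an empty list; indexing that list with `[[1]]` then gives "subscript out of bounds". Also, if the real table is filled in by JavaScript from the CSV above (which I have not verified), it would not be in the HTML that `scrape()` returns, even with a corrected selector.

```r
library(rvest)

# Hypothetical stand-in for the page: an empty <div id="table_div"> (as it might
# look before any script fills it) plus one ordinary static table.
page <- read_html(
  '<html><body>
     <div id="table_div"></div>
     <table><tr><th>a</th></tr><tr><td>1</td></tr></table>
   </body></html>'
)

# "table_div" is treated as a tag name (<table_div>), so nothing matches.
by_tag <- html_nodes(page, "table_div")
length(by_tag)   # 0

# "#table_div" selects by id and finds the (empty) div.
by_id <- html_nodes(page, "#table_div")
length(by_id)    # 1

# Check that something was found before indexing with [[1]].
tables <- html_table(html_nodes(page, "table"), fill = TRUE)
if (length(tables) == 0) {
  stop("No <table> found in the returned HTML; the data may be loaded by a script.")
}
first_table <- tables[[1]]
```

A separate point about the original code: `janitor::make_clean_names()` works on a character vector of names, so for a whole data frame `janitor::clean_names()` is the function to reach for, e.g. `df %>% clean_names()`.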