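After I had used the following script for a while, it suddenly stopped working. I built a simple function that finds a table in a web page based on its XPath.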
library(rvest)
url <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08'
find_table <- function(x){read_html(x) %>%
html_nodes(xpath = '//*[@id="center"]/table[2]') %>%
html_table() %>%
as.data.frame()}
table <- find_table(url)
I also tried calling httr::GET before read_html, passing the argument query = list(r_date = "2017-12-22"), but nothing changed. Any ideas?
Well, that code doesn't work because you left out the closing ) on the url <- line.
We'll add httr:
library(httr)
library(rvest)
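url is the name of a base function. Using base function names as variable names can make problems in your code hard to debug. Unless you write perfect code, don't use names this way.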
URL <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08')
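As a rough illustration of why reusing a base name makes debugging harder, here is a minimal sketch. The helper `describe_input` and the name `page_url_not_defined` are hypothetical and exist only for this example. It assumes the assignment to the variable never ran, for instance because of a syntax error like the missing parenthesis.

```r
# `url` already exists in base R as a function that opens a connection.
# (base::url is used explicitly so the sketch works even if `url` was reassigned.)
url_is_function <- is.function(base::url)

# Hypothetical helper standing in for any function that expects a string.
describe_input <- function(x) {
  if (is.character(x)) "character string" else class(x)[1]
}

# If `url <- "..."` never ran, the name still resolves to the base function
# instead of raising "object not found", so the helper silently receives
# a function rather than a string.
got_with_base_name <- describe_input(base::url)

# A name that is not already taken fails immediately when it is undefined.
got_with_free_name <- tryCatch(
  describe_input(page_url_not_defined),  # hypothetical, deliberately undefined
  error = function(e) "error: object not found"
)
```

With a distinct name such as URL, a failed assignment shows up right away as an "object not found" error rather than as a confusing failure further down the pipe.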
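I don't know whether you're aware of the "rules" of web scraping, but if you're going to make repeated requests to this site, you should use a "crawl delay". Their robots.txt doesn't set one. My point is that you may be getting rate-limited.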
find_table <- function(x, crawl_delay = 5) {

  Sys.sleep(crawl_delay) # you can put this in a loop instead if you aren't often doing repeat GETs

  # switch to httr::GET so you can get web server interaction info.
  # since you're scraping, it's expected that you use a custom user agent
  # that also supplies contact info.
  res <- GET(x, user_agent("My scraper"))

  # check to see if there's a non-HTTP 200 response, which there may be
  # if you're getting rate-limited
  stop_for_status(res)

  # now, try to do the parsing. It looks like you're trying to target a
  # single table, so I switched from `html_nodes()` to `html_node()`, since
  # the former returns a node set, `html_table()` then returns a `list`, and
  # `as.data.frame()` can fail if there's more than one list element.
  content(res, "parsed") %>%
    html_node(xpath = '//*[@id="center"]/table[2]') %>%
    html_table() %>%
    as.data.frame()

}
table is also a base function name (see above).
result <- find_table(URL)
Works fine for me:
str(result)
## 'data.frame': 11 obs. of 5 variables:
## $ ENTI EROGATORI : chr "Cassa DD.PP." "Istituti di previdenza amministrati dal Tesoro" "Istituto per il credito sportivo" "Aziende di credito" ...
## $ : logi NA NA NA NA NA NA ...
## $ ACCENSIONE ACCERTAMENTI : chr "4.638.500,83" "0,00" "0,00" "953.898,47" ...
## $ ACCENSIONE RISCOSSIONI C|COMP. + RESIDUI: chr "2.177.330,12" "0,00" "129.114,22" "848.935,84" ...
## $ RIMBORSO IMPEGNI : chr "438.696,57" "975,07" "45.584,55" "182.897,01" ...
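The amount columns in the output above come back as character strings in Italian number format (dot as thousands separator, comma as decimal mark). If numeric values are needed, one possible approach is sketched below. The helper name `parse_it_number` is hypothetical, and the sample values are copied from the `str()` output.

```r
# Hypothetical helper: convert strings like "4.638.500,83" to numeric.
parse_it_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)  # drop thousands separators
  x <- sub(",", ".", x, fixed = TRUE)  # decimal comma -> decimal point
  as.numeric(x)
}

# Sample values taken from the ACCENSIONE ACCERTAMENTI column shown above.
amounts <- c("4.638.500,83", "0,00", "953.898,47")
parsed <- parse_it_number(amounts)
```

This assumes every value follows the same format. Cells with other content would become NA with a warning.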