r - rvest unexpectedly stopped working - scraping a table

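After I had used the following script for a while, it suddenly stopped working. I had built a simple function that finds a table in a web page based on its XPath.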

library(rvest)
url <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08'
find_table <- function(x){read_html(x) %>%
                          html_nodes(xpath = '//*[@id="center"]/table[2]') %>%
                          html_table() %>%
                          as.data.frame()}
table <- find_table(url)

I also tried calling httr::GET before read_html, passing the following argument:

query = list(r_date = "2017-12-22")

but nothing changed. Any ideas?

Well, that code doesn't work because you are missing the closing ) on the url <- line.

We'll add httr:

library(httr)
library(rvest)

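url is the name of a base R function. Using base function names as variable names can make problems in your code hard to debug. Unless you write perfect code, don't use names this way.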
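
To make the naming point concrete, here is a minimal sketch (my own illustration, assuming a fresh R session with no variable called `url` defined yet; the example.com address is a placeholder) showing how an assignment shadows the base function:

```r
# Before any assignment, the name `url` resolves to the base R function.
before <- is.function(url)

# Assigning to `url` shadows that function for ordinary variable lookup.
url <- "http://example.com/"
during <- is.function(url)

# Removing the variable makes the base function visible again.
rm(url)
after <- is.function(url)

print(c(before = before, during = during, after = after))
```

R still finds the function when the name is used in call position, so nothing breaks right away, but error messages and debugging sessions become harder to read once the same name means two things.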

URL <- c('http://finanzalocale.interno.it/apps/floc.php/certificati/index/codice_ente/1010020010/cod/4/anno/1999/md/0/cod_modello/CCOU/tipo_modello/U/cod_quadro/08')

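I don't know whether you're aware of the "rules" of web scraping, but if you are making repeated requests to this site, you should use a "crawl delay". Their robots.txt doesn't set one. I point this out because you may be getting rate-limited.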

find_table <- function(x, crawl_delay=5) { 
  Sys.sleep(crawl_delay) # you can put this in a loop vs here if you aren't often doing repeat gets
  # switch to httr::GET so you can get web server interaction info.
  # since you're scraping, it's expected that you use a custom user agent
  # that also supplies contact info.
  res <- GET(x, user_agent("My scraper"))
  # check to see if there's a non HTTP 200 response which there may be
  # if you're getting rate-limited
  stop_for_status(res) 
  # now, try to do the parsing. It looks like you're trying to target a
  # single table, so I switched from `html_nodes()` to `html_node()`: the
  # former returns a node set, which `html_table()` turns into a list of
  # data frames, and `as.data.frame()` can fail if that list has more
  # than one element.
  content(res, "parsed") %>% 
    html_node(xpath = '//*[@id="center"]/table[2]') %>%
    html_table() %>%
    as.data.frame()
}

table is also a base function name (see above):

result <- find_table(URL)

Works fine for me:

str(result)
## 'data.frame':  11 obs. of  5 variables:
##  $ ENTI EROGATORI                          : chr  "Cassa DD.PP." "Istituti di previdenza amministrati dal Tesoro" "Istituto per il credito sportivo" "Aziende di credito" ...
##  $                                         : logi  NA NA NA NA NA NA ...
##  $ ACCENSIONE ACCERTAMENTI                 : chr  "4.638.500,83" "0,00" "0,00" "953.898,47" ...
##  $ ACCENSIONE RISCOSSIONI C|COMP. + RESIDUI: chr  "2.177.330,12" "0,00" "129.114,22" "848.935,84" ...
##  $ RIMBORSO IMPEGNI                        : chr  "438.696,57" "975,07" "45.584,55" "182.897,01" ...
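
Note that the amounts come back as character strings in Italian number format (dot as thousands separator, comma as decimal mark). If you need them as numbers, something along these lines might work. This is only a sketch: `parse_it_number` is a hypothetical helper name of my own, not part of rvest or httr, and it assumes every value follows that format.

```r
# Hypothetical helper: convert strings such as "4.638.500,83" to numeric.
parse_it_number <- function(x) {
  x <- gsub(".", "", x, fixed = TRUE)  # drop the thousands separators
  x <- sub(",", ".", x, fixed = TRUE)  # turn the decimal comma into a point
  as.numeric(x)
}

# A few values copied from the str() output above
amounts <- c("4.638.500,83", "0,00", "953.898,47")
parsed <- parse_it_number(amounts)
print(parsed)
```

You could then apply it to a column of the scraped data frame, for example `result[[3]] <- parse_it_number(result[[3]])`.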
