R: Web scraping an Excel spreadsheet URL to read with openxlsx

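>I need to read parts of an Excel file into R. I have some existing code, but the authority changed the source. Previously there was a direct URL to the document; now the link to the document can only be reached through a landing page on the website.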

Can someone tell me which package I can use to achieve this? The page that links to the Excel file is: http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/ and the file I am looking at is: "Dati operativi relativi al bilanciamento del sistema post Del. 312/2016/R/gas - Database 2018"

I have added my previous code to show what I did. As you can see, I only need to read the .xlsx as the first step.

Thanks a lot in advance!

  library(ggplot2)
  library(lubridate)
  library(openxlsx)
  library(reshape2)
  library(dplyr)
  Bilres <- read.xlsx(xlsxFile = "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx",sheet = "Storico_G", startRow = 1, colNames = TRUE)

  # Selecting Column R from Storico_G and stored in variable Bilres_df
  Bilres_df <- data.frame(Bilres$pubblicazione, Bilres$BILANCIAMENTO.RESIDUALE )
  # Converting pubblicazione to date-time format
  Bilres_df$pubblicazione <- ymd_h(Bilres_df$Bilres.pubblicazione)
  Bilreslast=tail(Bilres_df,1)
  Bilreslast=data.frame(Bilreslast)
  Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE <- as.numeric(as.character((Bilreslast$Bilres.BILANCIAMENTO.RESIDUALE)))

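If you copy the URL from the web page, you can first use download.file() to download the file in binary mode and then read the data with read.xlsx(). Depending on how often the content of the page changes, it may be better to just copy the URL rather than parse it from the page.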

oldFile <- "http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx"
newFile <- "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
if(!file.exists("./data/downloadedXlsx.xlsx")){
     dir.create("./data", showWarnings = FALSE) # make sure the target folder exists
     download.file(newFile,"./data/downloadedXlsx.xlsx",
                   method="curl", #use "curl" for OS X / Linux, "wininet" for Windows
                   mode="wb") # "wb" means "write binary"
} else message("file already loaded locally, using disk version")
library(openxlsx)
Bilres <- read.xlsx(xlsxFile = "./data/downloadedXlsx.xlsx",
                sheet = "Storico_G", startRow = 1, colNames = TRUE)
head(Bilres[,1:3])

...and the output:

> head(Bilres[,1:3])
  pubblicazione aggiornato.il IMMESSO
1 2017_01_01_06      42736.24 1915484
2 2017_01_01_07      42736.28 1915484
3 2017_01_01_08      42736.33 1866326
4 2017_01_01_09      42736.36 1866326
5 2017_01_01_10      42736.41 1866326
6 2017_01_01_11      42736.46 1866326
>
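The `aggiornato.il` column in the output above comes back as Excel serial day numbers (for example 42736.24) rather than date-times. A minimal base R sketch of the conversion follows, assuming the workbook uses Excel's default 1900 date system and that the values can be treated as UTC; the helper name `excel_serial_to_datetime` is hypothetical and not part of openxlsx.

```r
# Hypothetical helper: convert Excel serial day numbers (1900 date system)
# to POSIXct. 25569 is the Excel serial number of 1970-01-01, and a day
# has 86400 seconds. The UTC time zone is an assumption.
excel_serial_to_datetime <- function(x) {
  as.POSIXct((x - 25569) * 86400, origin = "1970-01-01", tz = "UTC")
}

# 42736 is midnight on 2017-01-01; the fractional part is the time of day.
excel_serial_to_datetime(c(42736, 42736.25))
```

With the data frame from above, this could be applied as `Bilres$aggiornato.il <- excel_serial_to_datetime(Bilres$aggiornato.il)`.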

Update: added logic to avoid downloading the file again once it has already been downloaded.

You can find the .xlsx links via:

library(rvest)
library(magrittr)
pg <- read_html("http://www.snamretegas.it/it/business-servizi/dati-operativi-business/8_dati_operativi_bilanciamento_sistema/")
# get all the Excel (xlsx) links on that page:
html_nodes(pg, xpath=".//a[contains(@href, '.xlsx')]") %>% 
  html_attr("href") %>% 
  sprintf("http://www.snamretegas.it%s", .) -> excel_links
head(excel_links)
## [1] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx"
## [2] "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2018/DatiOperativi_2018-IT.xlsx"

And then pass the one you want to your Excel reading function:

openxlsx::read.xlsx(excel_links[1], sheet = "Storico_G", startRow = 1, colNames = TRUE)
## data frame output here that I'm not going to show
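Rather than relying on the position of a link in `excel_links`, the link for a given year can be picked by its file name. This is only a sketch: the helper `links_for_year` is hypothetical, and it assumes the site keeps naming files in the `DatiOperativi_<year>-IT.xlsx` pattern seen in the output above.

```r
# The two links printed above, copied here so the sketch runs offline.
excel_links <- c(
  "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2017/DatiOperativi_2017-IT.xlsx",
  "http://www.snamretegas.it/repository/file/it/business-servizi/dati-operativi-business/dati_operativi_bilanciamento_sistema/2018/DatiOperativi_2018-IT.xlsx"
)

# Hypothetical helper: keep only the links whose file name mentions the year.
# Assumes the "_<year>-" naming pattern shown in the scraped links.
links_for_year <- function(links, year) {
  links[grepl(paste0("_", year, "-"), basename(links), fixed = TRUE)]
}

link_2018 <- links_for_year(excel_links, 2018)
```

`link_2018` can then be downloaded once and read from disk instead of being passed straight to `read.xlsx()`.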

But!!

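This is a very selfish and unkind way to do it, because every time you want to read the Excel file you hit the site to fetch it, wasting their CPU and bandwidth as well as your own bandwidth.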

You should use the download.file() technique described by Len to cache a local copy and only re-download it when necessary.
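One way to decide when a re-download is "necessary" is to check the age of the cached file. The sketch below is one possible policy, not the only one: the helpers `is_stale` and `fetch_cached` and the one-day default are assumptions made for illustration.

```r
# Hypothetical helper: TRUE when the cached file is missing or older
# than max_age_days (judged by its last-modified time).
is_stale <- function(path, max_age_days = 1) {
  if (!file.exists(path)) return(TRUE)
  age_days <- as.numeric(difftime(Sys.time(), file.mtime(path), units = "days"))
  age_days > max_age_days
}

# Hypothetical wrapper: download only when the local copy is stale,
# and always return the local path so it can be passed to read.xlsx().
fetch_cached <- function(url, path, max_age_days = 1) {
  if (is_stale(path, max_age_days)) {
    dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
    download.file(url, path, mode = "wb")  # "wb" keeps the .xlsx binary intact
  } else {
    message("using cached copy: ", path)
  }
  path
}
```

A call such as `fetch_cached(excel_links[1], "./data/downloadedXlsx.xlsx")` would then hit the site at most once per day, and the returned path can be handed to `openxlsx::read.xlsx()`.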

This should get you going in the right direction.

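library(data.table)
# Note: fread() parses delimited text (e.g. CSV), so this only works if the URL
# serves such a file; an .xlsx workbook needs read.xlsx() as shown above.
mydat <- fread('http://www.snamretegas.it/repository/file/Info-storiche-qta-gas-trasportato/dati_operativi/2017/DatiOperativi_2017-IT.xlsx')
head(mydat)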
