将包含日期的文件名转换为 QQ-YYYY 格式,并作为 R 中 N 个 excel 文件的列插入



我有许多xls文件,我已经从网上下载了,我希望将这些文件名转换为QQ-YYYY并将这些字符串另存为新列。此日期字段当前不在原始 xls 文件中。请注意,我感兴趣的一些文件有时会有完整的月份名称,而其他文件将是缩写的,即十月和十月

> pricingFiles
[1] "Apr 2018 ASP Pricing File 031318.xls"           "Apr 2019 ASP Pricing File 032219.xls"          
[3] "Jan 18 ASP Pricing File updated 030218.xls"     "Jan 2019 ASP Pricing File - updated 052919.xls"
[5] "Jul 2018 ASP Pricing File updated 052919.xls"   "Jul 2019 ASP Pricing File 091119.xls"          
[7] "Oct 18 ASP Pricing File updated 052919.xls"     "Oct 2019 ASP Pricing File 092519.xls" 

正如我们在上面看到的,我希望在 ASP Pric 之前转换日期字符串片段......到 QQ-YYYY,然后将其保存到数据帧中的新列,即

2018 年 4 月 ASP 定价文件 031318.xls ->Q2-2018,然后在读取 R 时将其作为日期列插入 DF

。请注意,每个 xls 文件位于不同的行号上,我定义了一个我计划在 skip 参数中使用的函数

skip=header_begins[i]-1

我的代码如下所示:

startDate <- 2018
endDate <- as.numeric(format(Sys.Date(), "%Y"))
# FUNCTION DEFINITION TO FETCH 2018 TO 2019 CROSSWALK AND ASP PRICING ZIP FILES
fileFinder <- function(yearsVector)
{
for (i in yearsVector)
{
message(paste0("Getting data for ", i))
tryCatch({
webpages <- read_html(paste0("https://www.cms.gov/Medicare/Medicare-Fee-for-Service-Part-B-Drugs/McrPartBDrugAvgSalesPrice/", 
i, "ASPFiles.html"))
r <- webpages %>% html_nodes("a") %>% html_attr("href") %>% grep(c("\Crosswalk.zip$|\ASP-Pricing-File.zip$"), 
., value = TRUE) %>% gsub("apps\/ama\/license.asp\?file=\/", 
"", .) %>% paste0("https://www.cms.gov", .)
return(r)
}, error = function(e)
{
# TODO
return(NA)
})
}
}
# CREATE LIST TO STORE URLs
urls <- list()
# PASS DATE RANGE TO THE CMS ZIP FINDER FUNCTION
urls <- lapply(startDate:endDate, fileFinder)
# CONVERT LIST OF URLs INTO VECTOR
urls <- unlist(urls, recursive = F)
# CREATE FUNCTION TO DOWNLOAD FILES FROM CMS.GOV
download.cms <- function(urls, refetch = TRUE, path = ".")
{
dest <- file.path(path, basename(urls))
if (refetch || !file.exists(dest)) 
download.file(urls, dest)
dest
}
# DOWNLOAD THE CROSSWALK ZIP FILES
sapply(urls, download.cms)
# UNZIP ALL FILES
for (i in dir(path = ".", pattern = "*.zip$")) unzip(i)
# CREATE A LIST OF ALL EXCEL FILES IN THE DIRECTORY
listExcelFiles <- list.files(".", pattern = c("\.xls$|\.csv$|\.xlsx$"))
# RETAIN EXCEL FILES THAT ARE CROSSWALK FILES
crosswalkFiles <- listExcelFiles[grepl("ASP NDC-HCPCS", listExcelFiles) & !grepl("508 version", 
listExcelFiles)]
# RETAIN EXCEL FILES THAT ARE PRICING FILES
pricingFiles <- listExcelFiles[grepl("ASP Pricing File", listExcelFiles) & !grepl("508 version", 
listExcelFiles)]
# CREATE A LIST OF FILES TO DELETE 
# TBC
# SINCE EACH FILE BEGINS ON DIFFERENT ROWS, DEFINE FUNCTION TO DETECT HEADER LINE
detect_header_line <- function(file_names, column_name)
{
header_begins <- NULL
for (i in 1:length(file_names))
{
lines_read <- readLines(file_names[i], warn = FALSE)
header_begins[i] <- grep(column_name, lines_read)
}
}
# FIND THE FIRST ROW FOR EACH FILE
header_begins <- detect_header_line(myExcel, "Short Description")

我们可以获取子字符串并使用as.yearqtr

library(zoo)
as.yearqtr(sub("^(\w+\s+\d+).*", '\1', pricingFiles), "%b %Y")

最新更新