Download a list of files, apply a distinct cleaning function to each, then bind into a single data frame



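I am extracting information from 30+ PDFs that describe Army, Navy, Marine Corps, and Air Force spending.

Each service formats its PDFs differently, so I wrote four separate cleaning functions to extract the data I need. (However, the PDFs sometimes vary by year, so one day I may need to write cleaning functions for specific years.)

What technique should I use to download many files, apply the relevant cleaning function to each, and bind the results back together?

Conceptually, my idea is to store the relevant function in each row and somehow use purrr to download the file, apply the associated function, and then bind_rows?

I have never seen this done before, but I believe it must be a common practice. Examples, references, and tutorials are very welcome!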

#### Data (Example) ####
library(tidyverse)

df <- expand.grid(
  service = c("Army", "Navy", "Marines", "Air.Force"),
  year = 2010:2019
) %>%
  as_tibble() %>%
  mutate(
    my.hyperlink = str_c("http://", "_", service, "_", year, ".html"),
    my.cleaning.function = str_c(service, "cleaner", sep = "_")
  )

# A tibble: 40 x 4
service    year my.hyperlink                my.cleaning.function
<fct>     <int> <chr>                       <chr>               
1 Army       2010 http://_Army_2010.html      Army_cleaner        
2 Navy       2010 http://_Navy_2010.html      Navy_cleaner        
3 Marines    2010 http://_Marines_2010.html   Marines_cleaner     
4 Air.Force  2010 http://_Air.Force_2010.html Air.Force_cleaner   
5 Army       2011 http://_Army_2011.html      Army_cleaner        
6 Navy       2011 http://_Navy_2011.html      Navy_cleaner        
7 Marines    2011 http://_Marines_2011.html   Marines_cleaner     
8 Air.Force  2011 http://_Air.Force_2011.html Air.Force_cleaner   
9 Army       2012 http://_Army_2012.html      Army_cleaner        
10 Navy       2012 http://_Navy_2012.html      Navy_cleaner        
# ... with 30 more rows

Here is a quick example of how this can be done. If the example is not clear enough, let me know.

library(tidyverse, quietly = TRUE)

# one row per service/year, with the URL and the name of its cleaning function
df <- expand.grid(
  service = c("Army", "Navy", "Marines", "Air.Force"),
  year = 2010:2019
) %>%
  as_tibble() %>%
  mutate(
    my.hyperlink = str_c("http://", "_", service, "_", year, ".html"),
    my.cleaning.function = str_c(service, "cleaner", sep = "_")
  )

# define two example functions (stand-ins that return four dummy rows each)
Army_cleaner <- function(txt) {
  tibble(
    my_text = str_to_lower(txt),
    my_num  = runif(4)
  )
}
Navy_cleaner <- function(txt) {
  tibble(
    my_text = str_to_upper(txt),
    my_num  = runif(4)
  )
}

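# filter the data frame to the rows whose cleaning functions are defined above,
# then look up each function by name and apply it to its hyperlink
df %>%
  filter(my.cleaning.function %in% c("Army_cleaner", "Navy_cleaner")) %>%
  mutate(my_data = map2(my.hyperlink, my.cleaning.function, ~ {
    FUN <- get(.y, mode = "function")  # resolve the function from its name
    FUN(.x)
  })) %>%
  unnest(my_data)  # expand the list-column so all results sit in one data frame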
#> # A tibble: 80 x 6
#>    service  year my.hyperlink      my.cleaning.funct… my_text       my_num
#>    <fct>   <int> <chr>             <chr>              <chr>          <dbl>
#>  1 Army     2010 http://_Army_201… Army_cleaner       http://_army… 0.478 
#>  2 Army     2010 http://_Army_201… Army_cleaner       http://_army… 0.386 
#>  3 Army     2010 http://_Army_201… Army_cleaner       http://_army… 0.225 
#>  4 Army     2010 http://_Army_201… Army_cleaner       http://_army… 0.421 
#>  5 Navy     2010 http://_Navy_201… Navy_cleaner       HTTP://_NAVY… 0.450 
#>  6 Navy     2010 http://_Navy_201… Navy_cleaner       HTTP://_NAVY… 0.515 
#>  7 Navy     2010 http://_Navy_201… Navy_cleaner       HTTP://_NAVY… 0.429 
#>  8 Navy     2010 http://_Navy_201… Navy_cleaner       HTTP://_NAVY… 0.0371
#>  9 Army     2011 http://_Army_201… Army_cleaner       http://_army… 0.433 
#> 10 Army     2011 http://_Army_201… Army_cleaner       http://_army… 0.354 
#> # ... with 70 more rows
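
The question also mentions that some years may eventually need their own cleaner. One possible way to handle that, sketched here under assumptions, is to keep the functions in a named list and prefer a "service_year" entry when one exists. The cleaners, the `cleaners` registry, `pick_cleaner()`, and the file paths below are all hypothetical placeholders, not part of the original post.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Hypothetical cleaners; real ones would parse the file at `path`.
army_cleaner      <- function(path) tibble(source = path, layout = "army")
navy_cleaner      <- function(path) tibble(source = path, layout = "navy")
army_2012_cleaner <- function(path) tibble(source = path, layout = "army_2012")

# Registry keyed by service, with optional "service_year" overrides.
cleaners <- list(
  Army      = army_cleaner,
  Navy      = navy_cleaner,
  Army_2012 = army_2012_cleaner
)

# Use the year-specific cleaner if one is registered,
# otherwise fall back to the service default.
pick_cleaner <- function(service, year) {
  key <- paste(service, year, sep = "_")
  if (!is.null(cleaners[[key]])) cleaners[[key]] else cleaners[[service]]
}

# Made-up file paths, one per service/year combination.
files <- crossing(service = c("Army", "Navy"), year = 2011:2012) %>%
  mutate(path = paste0(service, "_", year, ".pdf"))

result <- files %>%
  mutate(
    cleaner = map2(service, year, pick_cleaner),               # list-column of functions
    data    = map2(cleaner, path, function(f, p) f(p))         # apply each function to its path
  ) %>%
  select(-cleaner) %>%
  unnest(data)

print(result)
```

Storing the functions themselves in a list-column avoids looking them up by name with `get()`, and adding a new year-specific layout only requires one more entry in the list.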

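The example above skips the download step. A minimal sketch of how downloading might be folded in, assuming each cleaner takes a local file path: wrap "download, then clean" in `purrr::possibly()` so one failed file does not abort the whole run. The names `fetch_file()`, `fake_fetch()`, `process_one()`, `line_cleaner()`, and the URLs are hypothetical; the demo swaps in a stub fetcher so it runs without network access.

```r
library(dplyr)
library(purrr)
library(tidyr)

# Real fetcher: download one URL to a temporary file and return its path.
fetch_file <- function(url) {
  dest <- tempfile(fileext = ".pdf")
  download.file(url, destfile = dest, mode = "wb", quiet = TRUE)
  dest
}

# Stub fetcher for this demo only: fails for URLs containing "missing",
# otherwise writes a one-line placeholder file.
fake_fetch <- function(url) {
  if (grepl("missing", url, fixed = TRUE)) stop("download failed: ", url)
  dest <- tempfile(fileext = ".txt")
  writeLines(url, dest)
  dest
}

# Download then clean; any error yields NULL instead of stopping the pipeline.
process_one <- possibly(
  function(url, cleaner, fetch = fetch_file) cleaner(fetch(url)),
  otherwise = NULL
)

# Hypothetical cleaner: counts the lines in the fetched file.
line_cleaner <- function(path) tibble(n_lines = length(readLines(path)))

jobs <- tibble(
  service = c("Army", "Navy"),
  url     = c("http://example.invalid/army.pdf",
              "http://example.invalid/missing.pdf"),
  cleaner = list(line_cleaner, line_cleaner)
)

results <- jobs %>%
  mutate(data = map2(url, cleaner,
                     function(u, f) process_one(u, f, fetch = fake_fetch)))

# Rows whose download or cleaning failed, kept for a later retry.
failed <- results %>% filter(map_lgl(data, is.null))

# Successful rows, expanded into one data frame.
cleaned <- results %>%
  filter(!map_lgl(data, is.null)) %>%
  select(-cleaner) %>%
  unnest(data)

print(cleaned)
```

With real files, dropping the `fetch = fake_fetch` argument would make `process_one()` fall back to `fetch_file()`.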