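I am extracting information from more than 30 PDFs that describe spending by the Army, Navy, Marine Corps, and Air Force.
Each service formats its PDFs differently, so I wrote four separate cleaning functions to extract the data I need. (The PDFs also sometimes change from year to year, so one day I may need to write cleaning functions for specific years.)
What technique should I use to download many files, apply the relevant cleaning function to each one, and bind the results back together?
Conceptually, my idea is to store the relevant function in each row and somehow use purrr to download each file, apply the associated function, and then call bind_rows?
I have never seen this done before, but I believe it must be a common practice. Examples, references, and tutorials are very welcome!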
#### Data (Example)#####
df <- expand.grid(
  service = c("Army", "Navy", "Marines", "Air.Force"),
  year = c(2010:2019)
) %>%
  as_tibble() %>%  # tbl_df() is deprecated in current dplyr
  mutate(
    my.hyperlink = str_c("http://", "_", service, "_", year, ".html"),
    my.cleaning.function = str_c(service, "cleaner", sep = "_")
  )
# A tibble: 40 x 4
service year my.hyperlink my.cleaning.function
<fct> <int> <chr> <chr>
1 Army 2010 http://_Army_2010.html Army_cleaner
2 Navy 2010 http://_Navy_2010.html Navy_cleaner
3 Marines 2010 http://_Marines_2010.html Marines_cleaner
4 Air.Force 2010 http://_Air.Force_2010.html Air.Force_cleaner
5 Army 2011 http://_Army_2011.html Army_cleaner
6 Navy 2011 http://_Navy_2011.html Navy_cleaner
7 Marines 2011 http://_Marines_2011.html Marines_cleaner
8 Air.Force 2011 http://_Air.Force_2011.html Air.Force_cleaner
9 Army 2012 http://_Army_2012.html Army_cleaner
10 Navy 2012 http://_Navy_2012.html Navy_cleaner
# ... with 30 more rows
Here is a quick example of how to do this. Let me know if the example is not clear enough.
library(tidyverse, quietly = TRUE)
df <- expand.grid(
  service = c("Army", "Navy", "Marines", "Air.Force"),
  year = c(2010:2019)
) %>%
  as_tibble() %>%  # tbl_df() is deprecated in current dplyr
  mutate(
    my.hyperlink = str_c("http://", "_", service, "_", year, ".html"),
    my.cleaning.function = str_c(service, "cleaner", sep = "_")
  )
# define two example functions
Army_cleaner <- function(txt) {
  tibble(
    my_text = str_to_lower(txt),
    my_num = runif(4)
  )
}
Navy_cleaner <- function(txt) {
  tibble(
    my_text = str_to_upper(txt),
    my_num = runif(4)
  )
}
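# filter the data frame to only the functions that we have defined,
# then run the example
df %>%
  filter(my.cleaning.function %in% c("Army_cleaner", "Navy_cleaner")) %>%
  mutate(my_data = map2(my.hyperlink, my.cleaning.function, ~ {
    # look up the cleaning function by its name, then apply it to the link
    FUN <- get(.y)
    FUN(.x)
  })) %>%
  unnest(my_data)  # name the list column explicitly; a bare unnest() is deprecated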
#> # A tibble: 80 x 6
#> service year my.hyperlink my.cleaning.funct… my_text my_num
#> <fct> <int> <chr> <chr> <chr> <dbl>
#> 1 Army 2010 http://_Army_201… Army_cleaner http://_army… 0.478
#> 2 Army 2010 http://_Army_201… Army_cleaner http://_army… 0.386
#> 3 Army 2010 http://_Army_201… Army_cleaner http://_army… 0.225
#> 4 Army 2010 http://_Army_201… Army_cleaner http://_army… 0.421
#> 5 Navy 2010 http://_Navy_201… Navy_cleaner HTTP://_NAVY… 0.450
#> 6 Navy 2010 http://_Navy_201… Navy_cleaner HTTP://_NAVY… 0.515
#> 7 Navy 2010 http://_Navy_201… Navy_cleaner HTTP://_NAVY… 0.429
#> 8 Navy 2010 http://_Navy_201… Navy_cleaner HTTP://_NAVY… 0.0371
#> 9 Army 2011 http://_Army_201… Army_cleaner http://_army… 0.433
#> 10 Army 2011 http://_Army_201… Army_cleaner http://_army… 0.354
#> # ... with 70 more rows
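If looking functions up by name with get() feels fragile, one alternative is to keep the cleaners in a named list and store the chosen function itself in a list column. This also leaves room for the year-specific cleaners mentioned in the question. The sketch below is only an illustration under assumptions: the cleaner bodies, the `cleaners` registry, the `pick_cleaner()` helper, and the `Navy_2015` override are all hypothetical names, and the URLs are the question's placeholders, so nothing is actually downloaded.

```r
library(dplyr)
library(purrr)
library(tibble)
library(tidyr)

# Hypothetical cleaners: each takes a path or URL and returns a tibble.
army_cleaner <- function(path) tibble(source = path, format = "army")
navy_cleaner <- function(path) tibble(source = path, format = "navy")
# Hypothetical override for a year whose layout differs from the default.
navy_cleaner_2015 <- function(path) tibble(source = path, format = "navy-2015")

# Registry keyed by service, with optional "<service>_<year>" overrides.
cleaners <- list(
  Army = army_cleaner,
  Navy = navy_cleaner,
  Navy_2015 = navy_cleaner_2015
)

# Return the year-specific cleaner if one is registered,
# otherwise fall back to the service default.
pick_cleaner <- function(service, year) {
  key <- paste(service, year, sep = "_")
  if (!is.null(cleaners[[key]])) {
    cleaners[[key]]
  } else {
    cleaners[[as.character(service)]]
  }
}

files <- crossing(service = c("Army", "Navy"), year = 2014:2016) %>%
  mutate(
    url = paste0("http://_", service, "_", year, ".html"),  # placeholder URLs
    cleaner = map2(service, year, pick_cleaner),            # list column of functions
    data = map2(cleaner, url, ~ .x(.y))                     # apply each function to its URL
  )

# Row-bind the per-file tibbles while keeping the identifying columns.
result <- files %>%
  select(service, year, data) %>%
  unnest(data)
```

In real use, each cleaner would presumably first fetch its file, for example with download.file() into a tempfile(), before parsing it. Wrapping the call in purrr::possibly() is one way to keep a single bad file from stopping the whole run.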