r - Applying the same code to multiple files with lapply


BorderData07 <- read_csv("Downloads/BorderData/BorderApprehension2007.csv")
BorderData08 <- read_csv("Downloads/BorderData/BorderApprehension2008.csv")
BorderData07[is.na(BorderData07)] = 0
BorderData08[is.na(BorderData08)] = 0
BorderData07$CITIZENSHIP <- str_to_title(BorderData07$CITIZENSHIP)
BorderData07$Region <- countrycode(sourcevar = BorderData07$CITIZENSHIP, origin = "country.name", destination = "region")
BorderData07[nrow(BorderData07), 26] <- "Total"
World_Region <- ddply(BorderData07,"Region",numcolwise(sum))
ggplot(World_Region, aes(x = Region, y = Total)) + geom_col(width = 0.5, position = position_dodge(3), fill = 'blue', alpha = 0.5) + scale_y_log10() + coord_flip() +  geom_text(aes(label=Total), alpha = 1.0, check_overlap = TRUE) +  ggtitle("Apprehension By World Region Totals in 2007")

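I am trying to use lapply to run this code on each of my yearly border data CSV files. The only differences from one year to the next are the ending of the CSV file name and the title of the plot. My knowledge of lapply is very limited, and I am having a hard time getting it to work.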

library(tidyverse) # a helpful package to make coding easier
library(stringr)
library(readr)
library(ggplot2)
library(countrycode) # provides countrycode()
list.files( # get multiple file paths
  path = "Downloads/BorderData",
  pattern = "BorderApprehension.*\\.csv$", # `pattern` is a regular expression, not a glob
  full.names = TRUE
) %>%
  setNames(., paste0("BorderData", str_extract(., "\\d{2}(?=\\.csv)"))) %>% # (optional; provides names to file paths)
  lapply(function(file) {
    year <- str_extract(file, "\\d+(?=\\.csv)") # use in `ggtitle`
    df <- read_csv(file) %>%
      # `BorderData07[is.na(BorderData07)] = 0` equivalent for the numeric columns
      mutate(across(where(is.numeric), ~ replace_na(.x, 0))) %>%
      mutate(
        CITIZENSHIP = str_to_title(CITIZENSHIP),
        Region = countrycode(sourcevar = CITIZENSHIP, origin = "country.name", destination = "region")
      )
    df[nrow(df), 26] <- "Total" # BorderData07[nrow(BorderData07), 26] <- "Total"
    # the plyr:: prefix avoids attaching plyr, which would mask dplyr's mutate()
    World_Region <- plyr::ddply(df, "Region", plyr::numcolwise(sum))
    ggplot(World_Region, aes(x = Region, y = Total)) +
      geom_col(width = 0.5, position = position_dodge(3), fill = 'blue', alpha = 0.5) +
      scale_y_log10() +
      coord_flip() +
      geom_text(aes(label = Total), alpha = 1.0, check_overlap = TRUE) +
      ggtitle(paste("Apprehension By World Region Totals in", year))
  })

The output is a list of ggplots.

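If you would rather have it return the data frames that were read and cleaned from the .csv files, add return(df) as the last line of the function inside lapply.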
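As a minimal sketch of that variation, the function handed to lapply can end with the data frame rather than the plot. The toy data frames and the name clean_one are made up for illustration; they stand in for whatever read_csv returns for each file.

```r
# Hypothetical stand-ins for two files that have already been read in
raw <- list(
  data.frame(CITIZENSHIP = c("MEXICO", "CANADA"), Total = c(10, NA)),
  data.frame(CITIZENSHIP = c("PERU", "CHINA"), Total = c(NA, 4))
)

# Same idea as the function inside lapply, but ending with the data frame
clean_one <- function(df) {
  df$Total[is.na(df$Total)] <- 0  # replace missing counts with 0
  return(df)                      # the returned value is what lapply collects
}

cleaned <- lapply(raw, clean_one)
cleaned[[1]]$Total
```

Here cleaned is a list of data frames, one per input, instead of a list of plots.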

If you use the optional setNames (as shown in the code), the list will have names corresponding to "BorderData07", "BorderData08", and so on.
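The naming behaviour can be checked without reading any files. The sketch below uses two hypothetical paths and, to stay free of package dependencies, swaps in base R's regmatches and regexpr for str_extract; it is an illustration of the idea rather than the code from the answer.

```r
# Hypothetical file paths; no files are read here
paths <- c("Downloads/BorderData/BorderApprehension2007.csv",
           "Downloads/BorderData/BorderApprehension2008.csv")

# Take the two digits just before ".csv" and build a name for each path
yy <- regmatches(paths, regexpr("\\d{2}(?=\\.csv)", paths, perl = TRUE))
named_paths <- setNames(paths, paste0("BorderData", yy))

# lapply keeps the names of its input on the list it returns
result <- lapply(named_paths, basename)
names(result)
```

Because the names travel with the list, an individual element can later be picked out as result$BorderData07.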

The str_extract(., "\\d{2}(?=\\.csv)") in the setNames call uses a regular expression to extract the last two digits before ".csv".

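The str_extract(file, "\\d+(?=\\.csv)") in the lapply call uses a regular expression to extract one or more digits before ".csv", which in your case is the year. The (?=\\.csv) lookahead is only needed if digits appear elsewhere in the file path: it requires ".csv" to follow the digits immediately, which makes the pattern more specific.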
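To see what the lookahead changes, the patterns can be compared on a path that has a digit in a folder name too. The path below is hypothetical and chosen only to show the difference.

```r
library(stringr)

# Hypothetical path with a digit in a folder name as well as in the file name
path <- "Downloads/Border2Data/BorderApprehension2007.csv"

str_extract(path, "\\d+")              # first run of digits anywhere: "2"
str_extract(path, "\\d+(?=\\.csv)")    # digits directly before ".csv": "2007"
str_extract(path, "\\d{2}(?=\\.csv)")  # two digits directly before ".csv": "07"
```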

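The pipe operator (%>%) and the mutate function come from the dplyr R package, which is included in the tidyverse package. They help cut down on redundant code, such as repeating the name of the data frame.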
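A small side-by-side comparison may make the point about redundancy concrete. The data frame below is made up, and tolower is used only as a simple stand-in for the cleaning steps.

```r
library(dplyr)

# Toy data frame standing in for one year's file (made up for illustration)
border <- data.frame(CITIZENSHIP = c("MEXICO", "CANADA"), Total = c(10, NA))

# Without the pipe: the data frame name is repeated on every line
base_version <- border
base_version$Total[is.na(base_version$Total)] <- 0
base_version$CITIZENSHIP <- tolower(base_version$CITIZENSHIP)

# With the pipe and mutate: the data frame is named once
piped_version <- border %>%
  mutate(
    Total = ifelse(is.na(Total), 0, Total),
    CITIZENSHIP = tolower(CITIZENSHIP)
  )

piped_version
```

Both versions are meant to produce the same cleaned data; the piped one simply avoids writing the data frame name again for each step.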

Put everything that you want to apply to each file in a function:

apply_fun <- function(file) {
  x <- read_csv(file)
  # take the year from the file name only, in case the folder path contains digits
  year <- str_extract(basename(file), '\\d+')
  x[is.na(x)] = 0
  x$CITIZENSHIP <- str_to_title(x$CITIZENSHIP)
  x$Region <- countrycode(sourcevar = x$CITIZENSHIP, origin = "country.name", destination = "region")
  x[nrow(x), 26] <- "Total"
  World_Region <- ddply(x, "Region", numcolwise(sum))
  ggplot(World_Region, aes(x = Region, y = Total)) +
    geom_col(width = 0.5, position = position_dodge(3), fill = 'blue', alpha = 0.5) +
    scale_y_log10() + coord_flip() +
    geom_text(aes(label = Total), alpha = 1.0, check_overlap = TRUE) +
    # paste() puts a space between the title text and the year
    ggtitle(paste("Apprehension By World Region Totals in", year))
}

Then use lapply:

filename <- list.files('Downloads/BorderData/', pattern = '\\.csv$', full.names = TRUE)
list_plots <- lapply(filename, apply_fun)
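To try the "one function, then lapply" pattern without the real data, a self-contained sketch along these lines may help. Everything in it is an assumption for illustration: the two small CSV files are written to a temporary folder, the helper summarise_file is hypothetical, and base R's read.csv and regmatches replace read_csv and str_extract so that no packages are needed.

```r
# Write two small made-up CSV files into a temporary folder
tmp <- file.path(tempdir(), "BorderDataDemo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(CITIZENSHIP = c("MEXICO", "CANADA"), Total = c(10, NA)),
          file.path(tmp, "BorderApprehension2007.csv"), row.names = FALSE)
write.csv(data.frame(CITIZENSHIP = c("PERU", "CHINA"), Total = c(NA, 4)),
          file.path(tmp, "BorderApprehension2008.csv"), row.names = FALSE)

# One function holds everything that should happen to a single file
summarise_file <- function(file) {
  x <- read.csv(file)
  x[is.na(x)] <- 0  # replace missing values with 0
  # digits directly before the ".csv" at the end of the path
  year <- regmatches(file, regexpr("\\d+(?=\\.csv$)", file, perl = TRUE))
  data.frame(year = year, total = sum(x$Total))
}

# Collect the file paths, then apply the function to each one
files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
results <- lapply(files, summarise_file)
do.call(rbind, results)
```

Once this works on the toy files, the body of summarise_file can be swapped for the cleaning and plotting steps, and do.call(rbind, ...) dropped, since plots are kept as a list.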