r-我如何用管道清理或不运行for循环通过数据帧



我目前执行这段代码来清理一些excel文件。我想知道管道会是什么样子来清理它,甚至可能通过使用purrr和map函数的列表来实现同样的事情。这个代码基本上是

  1. 读取excel文件
  2. 删除不需要的特殊字符
  3. 根据某些条件写入指定的文件夹

谢谢。

files <- 
list.files(path = "folder1/", pattern = "*xls")
for (i in 1:length(files)) {
files1 <- 
basename(files[i]) 

files1 <- gsub(pattern = "'", replacement = "" , x = files1) 
files1 <- gsub(pattern = "\*", replacement = "" , x = files1)

data <- 
read_excel(paste0("folder1/", files[i])) 

data$column1 <- as.character(data$column1) 
data <- data %>% 
mutate(charCount = nchar(data$column1))
data <- data %>% mutate(across(where(is.character), str_trim))

if(any(grepl(pattern = "10", x = data$charCount) == FALSE) || any(grepl(pattern = "Coolguy",x = data$column5,ignore.case = T) == TRUE)) {
write_xlsx(data, path = (paste0("bad/", files1, "x")))
}
if (any(grepl(pattern = "10", x = data$charCount) == TRUE) && any(grepl(pattern = "Coolerguy",x = data$column4,ignore.case = T) == TRUE)){
write_xlsx(data[,1:11], path = (paste0("good/", files1, "x"))) 
}else{
write_xlsx(data[,1:11], path = (paste0("bad/", files1, "x")))
}}

这里有一种方法:

步骤1:定义文件名

file_names = list.files(path = "folder1/", pattern = "*xls") %>%
map(basename) %>%
map(~ gsub(pattern = "'|\*", replacement = "", x = .x))

请注意我是如何将您最初的两个gsub()调用合并为一个的。

步骤2:创建一个用于清理数据的函数,并将该函数应用于每个文件。

clean_data = function(data) {
data %>%
mutate(
column1 = as.character(column1)
) %>%
mutate(
charCount = nchar(column1),
across(where(is.character), str_trim)
)
}
all_data = file_names %>%
map(~ read_excel(file.path("folder1", .x))) %>%
map(clean_data)

请注意,我们如何使用file.path()而不是paste0()来指定read_excel()中的文件路径。我们还将在步骤3中使用file.path()。(参见@Martin Gal的评论(

步骤3:创建一个用于保存数据的函数,并使用walk()将该函数应用于每个数据集。

save_data = function(data, file_name) {
charcount_10 = data$charCount == 10
coolguy = grepl(pattern = "Coolguy", x = data$column5, ignore.case = TRUE)
coolerguy = grepl(pattern = "Coolerguy", x = data$column4, ignore.case = TRUE)

if (any(!charcount_10) || any(coolguy)) {
write_xlsx(data, path = file.path("bad", file_name, "x"))
} 
if (any(charcount_10) && any(coolerguy)){
write_xlsx(data[, 1:11], path = file.path("good", file_name, "x"))) 
} else {
write_xlsx(data[,1:11], path = file.path("bad", file_name, "x"))
}
}
walk2(all_data, file_names, save_data)

步骤3注意事项:

  • 我根据@Maurits-Evers的帮助注释更新了charCount == 10条件(您不应该在数字向量上使用grepl(),而data$charCount是一个数字向量
  • 你是说第二个if()语句实际上是else if ()语句吗
  • 我使用walk2()对每个数据集及其相应的文件名进行迭代

最新更新