我目前执行这段代码来清理一些excel文件。我想知道管道会是什么样子来清理它,甚至可能通过使用purrr和map函数的列表来实现同样的事情。这个代码基本上是
- 读取excel文件
- 删除不需要的特殊字符
- 根据某些条件写入指定的文件夹
谢谢。
files <-
list.files(path = "folder1/", pattern = "*xls")
for (i in 1:length(files)) {
files1 <-
basename(files[i])
files1 <- gsub(pattern = "'", replacement = "" , x = files1)
files1 <- gsub(pattern = "\*", replacement = "" , x = files1)
data <-
read_excel(paste0("folder1/", files[i]))
data$column1 <- as.character(data$column1)
data <- data %>%
mutate(charCount = nchar(data$column1))
data <- data %>% mutate(across(where(is.character), str_trim))
if(any(grepl(pattern = "10", x = data$charCount) == FALSE) || any(grepl(pattern = "Coolguy",x = data$column5,ignore.case = T) == TRUE)) {
write_xlsx(data, path = (paste0("bad/", files1, "x")))
}
if (any(grepl(pattern = "10", x = data$charCount) == TRUE) && any(grepl(pattern = "Coolerguy",x = data$column4,ignore.case = T) == TRUE)){
write_xlsx(data[,1:11], path = (paste0("good/", files1, "x")))
}else{
write_xlsx(data[,1:11], path = (paste0("bad/", files1, "x")))
}}
这里有一种方法:
步骤1:定义文件名
file_names = list.files(path = "folder1/", pattern = "*xls") %>%
map(basename) %>%
map(~ gsub(pattern = "'|\*", replacement = "", x = .x))
请注意我是如何将您最初的两个gsub()
调用合并为一个的。
步骤2:创建一个用于清理数据的函数,并将该函数应用于每个文件。
clean_data = function(data) {
data %>%
mutate(
column1 = as.character(column1)
) %>%
mutate(
charCount = nchar(column1),
across(where(is.character), str_trim)
)
}
all_data = file_names %>%
map(~ read_excel(file.path("folder1", .x))) %>%
map(clean_data)
请注意,我们如何使用file.path()
而不是paste0()
来指定read_excel()
中的文件路径。我们还将在步骤3中使用file.path()
。(参见@Martin Gal的评论(
步骤3:创建一个用于保存数据的函数,并使用walk()
将该函数应用于每个数据集。
save_data = function(data, file_name) {
charcount_10 = data$charCount == 10
coolguy = grepl(pattern = "Coolguy", x = data$column5, ignore.case = TRUE)
coolerguy = grepl(pattern = "Coolerguy", x = data$column4, ignore.case = TRUE)
if (any(!charcount_10) || any(coolguy)) {
write_xlsx(data, path = file.path("bad", file_name, "x"))
}
if (any(charcount_10) && any(coolerguy)){
write_xlsx(data[, 1:11], path = file.path("good", file_name, "x")))
} else {
write_xlsx(data[,1:11], path = file.path("bad", file_name, "x"))
}
}
walk2(all_data, file_names, save_data)
步骤3注意事项:
- 我根据@Maurits-Evers的帮助注释更新了
charCount == 10
条件(您不应该在数字向量上使用grepl()
,而data$charCount
是一个数字向量 - 你是说第二个
if()
语句实际上是else if ()
语句吗 - 我使用
walk2()
对每个数据集及其相应的文件名进行迭代