r-使用pmap和a将不同的正则表达式应用于tibble中的不同变量

这个问题与使用pmap将不同的正则表达式应用于tibble？中的不同变量非常相似？，但有所不同，因为我意识到我的例子不足以描述我的问题。

我正在尝试将不同的正则表达式应用于tibble中的不同变量。例如，我制作了一个tibble列表：1(我想要修改的变量名，2(我想要匹配的正则表达式，以及3(替换字符串。我想将regex/replacement应用于不同数据帧中的变量。请注意，目标tibble中可能存在我不想修改的变量，并且我的"配置"tibble中的行顺序可能与我的"目标"tibable中的列/变量顺序不对应。

所以我的"配置"tibble可能看起来像这样：

test_config <-  dplyr::tibble(
string_col = c("col1", "col2", "col4", "col3"),
pattern = c("^\.$", "^NA$", "^$", "^NULL$"),
replacement = c("","","", "")
)

我想把它应用到一个目标上：

test_target <- dplyr::tibble(
col1 = c("Foo", "bar", ".", "NA", "NULL"),
col2 = c("Foo", "bar", ".", "NA", "NULL"),
col3 = c("Foo", "bar", ".", "NA", "NULL"),
col4 = c("NULL", "NA", "Foo", ".", "bar"),
col5 = c("I", "am", "not", "changing", ".")
)

因此，目标是在test_target的用户指定列/变量中用空字符串替换不同的字符串。

结果应该是这样的：

result <- dplyr::tibble(
col1 = c("Foo", "bar", "", "NA", "NULL"),
col2 = c("Foo", "bar", ".", "", "NULL"),
col3 = c("Foo", "bar", ".", "NA", ""),
col4 = c("NULL", "NA", "Foo", ".", "bar"),
col5 = c("I", "am", "not", "changing", ".")
)

我可以用for循环做我想做的事情，比如这样：

for (i in seq(nrow(test_config))) {
test_target <- dplyr::mutate_at(test_target,
.vars = dplyr::vars(
tidyselect::matches(test_config$string_col[[i]])),
.funs = dplyr::funs(
stringr::str_replace_all(
., test_config$pattern[[i]], 
test_config$replacement[[i]]))
)
}

相反，有没有一种更整洁的方式来做我想做的事？到目前为止，我认为purrr::pmap是这项工作的工具，我已经制作了一个函数，它接受数据帧、变量名、正则表达式和替换值，并返回修改了单个变量的数据帧。它的行为如预期：

testFun <- function(df, colName, regex, repVal){
colName <- dplyr::enquo(colName)
df <- dplyr::mutate_at(df,
.vars = dplyr::vars(
tidyselect::matches(!!colName)),
.funs = dplyr::funs(
stringr::str_replace_all(., regex, repVal))
)
}
# try with example
out <- testFun(test_target, 
test_config$string_col[[1]], 
test_config$pattern[[1]], 
"")

然而，当我尝试将该函数与pmap一起使用时，我遇到了几个问题：1( 有比这更好的方法来建立pmap调用的列表吗？

purrr::pmap(
list(test_target, 
test_config$string_col, 
test_config$pattern, 
test_config$replacement),
testFun
)

2( 当我调用pmap时，我得到一个错误：

Error: Element 2 has length 4, not 1 or 5.

所以pmap不高兴我试图传递一个长度为5的tibble作为其他元素长度为4的列表的元素(我以为它会回收tibble(。

还要注意，以前，当我用4行tibble调用pmap时，我得到了一个不同的错误，

Error in UseMethod("tbl_vars") : 
no applicable method for 'tbl_vars' applied to an object of class "character"
Called from: tbl_vars(tbl)

你们中的任何人能建议一种使用pmap来做我想做的事情的方法吗？或者有没有一种不同或更好的方法来解决这个问题？

谢谢！

这里有两种tidyverse方式。一个类似于data.table的答案，因为它涉及到对数据进行整形，将其与配置连接起来，并将其整形回宽。另一种是基于purrr的，在我看来，这是一种有点奇怪的方法。我推荐第一个，因为它感觉更直观。

使用tidyr::gather使数据变长，然后使用dplyr::left_join确保来自test_target的每个文本值都具有相应的模式&替换——即使是没有模式的case(col5(也将通过使用左联接来保留。

library(tidyverse)
...
test_target %>%
gather(key = col, value = text) %>%
left_join(test_config, by = c("col" = "string_col"))
#> # A tibble: 25 x 4
#>    col   text  pattern replacement
#>    <chr> <chr> <chr>   <chr>      
#>  1 col1  Foo   "^\.$" ""         
#>  2 col1  bar   "^\.$" ""         
#>  3 col1  .     "^\.$" ""         
#>  4 col1  NA    "^\.$" ""         
#>  5 col1  NULL  "^\.$" ""         
#>  6 col2  Foo   ^NA$    ""         
#>  7 col2  bar   ^NA$    ""         
#>  8 col2  .     ^NA$    ""         
#>  9 col2  NA    ^NA$    ""         
#> 10 col2  NULL  ^NA$    ""         
#> # ... with 15 more rows

使用ifelse替换存在模式的模式，或者如果没有，则保留原始文本。只保留必要的模式，因为spread需要唯一的ID，所以添加一个行号，并使数据再次变宽。

test_target %>%
gather(key = col, value = text) %>%
left_join(test_config, by = c("col" = "string_col")) %>% 
mutate(new_text = ifelse(is.na(pattern), text, str_replace(text, pattern, replacement))) %>%
select(col, new_text) %>%
group_by(col) %>%
mutate(row = row_number()) %>%
spread(key = col, value = new_text) %>%
select(-row)
#> # A tibble: 5 x 5
#>   col1  col2  col3  col4  col5    
#>   <chr> <chr> <chr> <chr> <chr>   
#> 1 Foo   Foo   Foo   NULL  I       
#> 2 bar   bar   bar   NA    am      
#> 3 ""    .     .     Foo   not     
#> 4 NA    ""    NA    .     changing
#> 5 NULL  NULL  ""    bar   .

第二种方法是只生成列名的小tibble，将其与configs连接起来，然后拆分成列表列表。然后purrr::map2_dfc映射到您创建的列表和test_target的列上，并通过cbinding返回一个数据帧。这之所以有效，是因为从技术上讲，数据帧是列的列表，所以如果映射到数据帧上，则将每一列视为列表项。我无法让ifelse在这里工作——逻辑中的某些东西只有单个字符串返回，而不是整个向量。

tibble(all_cols = names(test_target)) %>%
left_join(test_config, by = c("all_cols" = "string_col")) %>%
split(.$all_cols) %>%
map(as.list) %>%
map2_dfc(test_target, function(info, text) {
if (is.na(info$pattern)) {
text
} else {
str_replace(text, info$pattern, info$replacement)
}
})
#> # A tibble: 5 x 5
#>   col1  col2  col3  col4  col5    
#>   <chr> <chr> <chr> <chr> <chr>   
#> 1 Foo   Foo   Foo   NULL  I       
#> 2 bar   bar   bar   NA    am      
#> 3 ""    .     .     Foo   not     
#> 4 NA    ""    NA    .     changing
#> 5 NULL  NULL  ""    bar   .

^{创建于2018-10-30由reprex包(v0.2.1(}

我对purrr和dplyr没有经验，但这里有一种使用data.table的方法。这种方法可以通过谷歌搜索进入dplyr：(

就可解释性而言，使用循环的方法可以说是更好的，因为它更简单。

编辑：对代码进行了一些更改，最后没有使用purrr

# alternative with data.table
library(data.table)
library(dplyr)
# objects
test_config <-  dplyr::tibble(
string_col = c("col1", "col2", "col4", "col3"),
pattern = c("^\.$", "^NA$", "^$", "^NULL$"),
replacement = c("","","", "")
)
test_target <- dplyr::tibble(
col1 = c("Foo", "bar", ".", "NA", "NULL"),
col2 = c("Foo", "bar", ".", "NA", "NULL"),
col3 = c("Foo", "bar", ".", "NA", "NULL"),
col4 = c("NULL", "NA", "Foo", ".", "bar"),
col5 = c("I", "am", "not", "changing", ".")
)
multiColStringReplace <- function(test_target, test_config){
# data.table conversion
test_target <- as.data.table(test_target)
test_config <- as.data.table(test_config)
# adding an id column, as I'm reshaping the data, helps for identification of rows
# throughout the process
test_target[,id:=1:.N]
# wide to long format
test_target2 <- melt(test_target, id.vars="id")
head(test_target2)
# pull in the configuration, can join up on one column now
test_target2 <- merge(test_target2, test_config, by.x="variable",
by.y="string_col", all.x=TRUE)
# this bit still looks messy to me, haven't used pmap before.
# I've had to subset the data to the required format, run the pmap with gsub
# to complete the task, then assign the unlisted vector back in to the original
# data. Would like to see a better option too!
test_target2[, result := value]
test_target2[!is.na(pattern), result := gsub(pattern, replacement, value),
by = .(id, variable)]
# case from long to original format, and drop the id
output <- dcast(test_target2, id~variable,
value.var = "result")
output[, id := NULL]
# back to tibble
output <- as_tibble(output)
return(output)
}
output <- multiColStringReplace(test_target, test_config)
output
result <- dplyr::tibble(
col1 = c("Foo", "bar", "", "NA", "NULL"),
col2 = c("Foo", "bar", ".", "", "NULL"),
col3 = c("Foo", "bar", ".", "NA", ""),
col4 = c("NULL", "NA", "Foo", ".", "bar"),
col5 = c("I", "am", "not", "changing", ".")
)
output == result
# compare with old method
old <- test_target
for (i in seq(nrow(test_config))) {
old <- dplyr::mutate_at(old,
.vars = dplyr::vars(
tidyselect::matches(test_config$string_col[[i]])),
.funs = dplyr::funs(
stringr::str_replace_all(
., test_config$pattern[[i]], 
test_config$replacement[[i]]))
)
}
old == result
# speed improves, but complexity rises
microbenchmark::microbenchmark("old" = {
old <- test_target
for (i in seq(nrow(test_config))) {
old <- dplyr::mutate_at(old,
.vars = dplyr::vars(
tidyselect::matches(test_config$string_col[[i]])),
.funs = dplyr::funs(
stringr::str_replace_all(
., test_config$pattern[[i]], 
test_config$replacement[[i]]))
)
}
},
"data.table" = {
multiColStringReplace(test_target, test_config)
}, times = 20)

为了子孙后代的利益，如果我将test_target作为列表传递给pmap_dfr，我也可以完成这项任务(但这不是一个好的解决方案(：

purrr::pmap_dfr(
list(list(test_target),
test_config$string_col,
test_config$pattern,
test_config$replacement),
testFun
) %>% dplyr::distinct()

尽管它有效，但这不是一个好的解决方案，因为它回收了test_target列表的元素，在遍历参数时，有效地为test_config的每一行创建了test_target tibble的副本，然后将生成的4个tibble的行绑定在一起，以生成一个大的最终输出tibble(我正在用distinct()过滤它(。

可能有一些方法可以做一些类似<<-的方法来避免复制目标tibble，但这更奇怪和糟糕。

FYI，基准测试结果-@camille建议的"笨拙整洁"方法是我硬件上的赢家！

Unit: milliseconds
expr       min        lq      mean    median        uq      max neval
loop 14.808278 16.098818 17.937283 16.811716 20.438360 24.38021    20
pmap_function  9.486146 10.157526 10.978879 10.628205 11.112485 15.39436    20
nice_tidy  8.313868  8.633266  9.597485  8.986735  9.870532 14.32946    20
awkward_tidy  1.535919  1.639706  1.772211  1.712177  1.783465  2.87615    20
data.table  5.611538  5.652635  8.323122  5.784507  6.359332 51.63031    20

相关内容

最新更新

热门标签：