I have been trying to solve the following problem with dplyr, but any other approach would be appreciated.
I have this data frame:
df <- tibble(
x = sample(rep(c(0, 1),10),10),
a_1 = rnorm(10),
b_1 = rnorm(10),
c_1 = rnorm(10),
a_2 = rnorm(10),
b_2 = rnorm(10),
c_2 = rnorm(10),
...
)
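My goal is to create, in the same data frame, a set of new variables a_2_temp, b_2_temp, ... that take the values of a_2, b_2, ... depending on the value of x and of a second, different variable.
For example: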
df%>% mutate(a_2_temp = (ifelse(x==1 & a_1 > 0, a_2, 0)))
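Now I need a way to automate this, ideally with across, so that the new variables are created for a data frame with several hundred columns. I could do it by repeating the code and changing only the variable names, but that is impractical for my real dataset because it has several hundred variables: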
df%>% mutate(a_2_temp = (ifelse(x==1 & a_1 > 0, a_2, 0))) %>%
mutate(b_2_temp = (ifelse(x==1 & b_1 > 0, b_2, 0))) %>%
mutate(c_2_temp = (ifelse(x==1 & c_1 > 0, c_2, 0))) %>%
mutate(d_2_temp = (ifelse(x==1 & d_1 > 0, d_2, 0))) %>%
...
The closest I have come to a solution so far is this:
eval<-function(a,b){
ifelse(b==1 & a>0, a, 0)
}
df<-df%>%mutate(across(c("a_1":"n_2"), list(temp=~eval(a=.x, b=x))))
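However, this can only reference the variable it is using as the base, whereas I want it to use *_1 as the base and copy the values from *_2.
Here is one option with across. Loop over the columns whose names end with "_2" (ends_with), build a logical condition where 'x' is 1 and the corresponding '_1' column is greater than 0 (the '_1' column is found by replacing '_2' with '_1' in the current column name and fetching it with get), then return the '_2' column value or 0. The new column names are set by adding '_temp' as a suffix in .names ({.col} returns the original column name).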
library(dplyr)
library(stringr)
df1 <- df %>%
mutate(across(ends_with('_2'),
~ case_when(x == 1 & get(str_replace(cur_column(), '_2', '_1')) > 0 ~
.,
TRUE ~ 0), .names = '{.col}_temp'))
Output:
df1
# A tibble: 10 x 10
# x a_1 b_1 c_1 a_2 b_2 c_2 a_2_temp b_2_temp c_2_temp
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2.73 -0.140 -0.782 2.07 0.364 0.245 2.07 0 0
# 2 1 -0.321 -0.114 0.333 -0.0401 -0.547 0.719 0 0 0.719
# 3 1 -0.753 -0.103 1.53 0.0359 1.85 1.36 0 0 1.36
# 4 0 -0.994 0.980 -0.651 1.13 -0.179 0.557 0 0 0
# 5 0 -0.639 1.01 -0.374 -0.325 0.475 0.287 0 0 0
# 6 1 0.450 -0.0441 -0.924 0.856 0.217 1.65 0.856 0 0
# 7 0 0.120 0.282 -0.931 -1.36 -0.0353 -1.82 0 0 0
# 8 0 -0.756 0.0408 -0.309 0.731 -0.169 0.153 0 0 0
# 9 1 0.140 0.494 1.65 0.912 -0.330 -0.0840 0.912 -0.330 -0.0840
#10 0 -0.928 1.16 -1.06 -1.59 0.0439 -1.08 0 0 0
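As a quick sanity check, the across() call can be compared against the hand-written mutate() version on a small made-up tibble. This is only a sketch: the toy data and the names toy, auto and manual are invented for illustration, and the pattern is anchored as '_2$' so that it only matches a trailing suffix.

```r
library(dplyr)
library(stringr)

# Toy data, made up for illustration (not the data from the question)
toy <- tibble(
  x   = c(1, 1, 0),
  a_1 = c(0.5, -0.2, 0.9),
  b_1 = c(-1.0, 0.3, 0.4),
  a_2 = c(10, 20, 30),
  b_2 = c(40, 50, 60)
)

# Automated version: one across() call covers every column ending in "_2"
auto <- toy %>%
  mutate(across(ends_with("_2"),
                ~ case_when(x == 1 & get(str_replace(cur_column(), "_2$", "_1")) > 0 ~ .x,
                            TRUE ~ 0),
                .names = "{.col}_temp"))

# Hand-written version, one line per variable
manual <- toy %>%
  mutate(a_2_temp = ifelse(x == 1 & a_1 > 0, a_2, 0),
         b_2_temp = ifelse(x == 1 & b_1 > 0, b_2, 0))

# Compare the two results
all.equal(auto, manual)
```

If the two versions agree here, the same across() call should scale to any number of *_1 / *_2 pairs, provided every '_2' column has a matching '_1' column.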
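Also, since the replacement value is always 0, a simple multiplication by the logical vector is enough: TRUE is coerced to 1 and FALSE to 0, so any value multiplied by 0 returns 0 and multiplied by 1 returns the value itself.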
df %>%
mutate(across(ends_with('_2'),
~ . * (x == 1 & get(str_replace(cur_column(), '_2', '_1')) > 0),
.names = '{.col}_temp'))
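The coercion behind this trick can be seen on plain vectors. The sketch below uses made-up values (values and keep are illustrative names) and also shows one difference from the ifelse/case_when version: a missing value is not turned into 0 by multiplication.

```r
# Made-up vectors for illustration
values <- c(2.5, -1.2, 3.0)
keep   <- c(TRUE, FALSE, TRUE)

# Logicals are coerced to numbers in arithmetic: TRUE -> 1, FALSE -> 0
coerced <- as.numeric(keep)

# Multiplying keeps the value where keep is TRUE and gives 0 elsewhere
masked <- values * keep

# Caveat: a missing value stays missing even when the condition is FALSE,
# whereas ifelse(FALSE, NA, 0) would return 0
na_case <- NA_real_ * FALSE

print(coerced)
print(masked)
print(na_case)
```

So the multiplication form is a compact equivalent only when the '_2' columns contain no NA (or infinite) values.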
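Another option is to use split.default to split the data into chunks of columns that share a name prefix, loop over the resulting list with map, apply the transformation, and bind the new columns to the original data.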
library(purrr)
df %>%
select(-x) %>%
split.default(str_remove(names(.), '_\\d+$')) %>%
map_dfc(~ .x[[2]] * (df[['x']] > 0 & .x[[1]] > 0)) %>%
rename_all(~ str_c(., '_2_temp')) %>%
bind_cols(df, .)
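To see what the split.default step produces, here is a minimal base R sketch on a made-up data frame (toy, prefix and chunks are illustrative names). It groups the columns by the part of the name before the numeric suffix.

```r
# Made-up data frame for illustration
toy <- data.frame(a_1 = 1:2, b_1 = 3:4, a_2 = 5:6, b_2 = 7:8)

# Strip the trailing "_<digits>" to get one grouping label per column
prefix <- sub("_\\d+$", "", names(toy))

# split.default treats the data frame as a list of columns,
# so each chunk is a data frame holding the columns of one prefix
chunks <- split.default(toy, prefix)

print(names(chunks))
print(names(chunks$a))
```

Within each chunk the columns keep their original relative order, so indexing with .x[[1]] and .x[[2]] assumes that every '_1' column appears before its '_2' counterpart in the data frame.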
Data
df <- structure(list(x = c(1, 1, 1, 0, 0, 1, 0, 0, 1, 0), a_1 = c(2.73310355409357,
-0.320612007980402, -0.753457274553722, -0.993806784470467, -0.638863336940367,
0.449760522371564, 0.119872527846818, -0.755664301704646, 0.139745073657684,
-0.92777433835819), b_1 = c(-0.139788654259498, -0.114412680908762,
-0.102836187925709, 0.980330559943683, 1.01472611411422, -0.0441288105926913,
0.2815151064984, 0.0407677709798372, 0.49417281865305, 1.16312935730339
), c_1 = c(-0.78179575165366, 0.33274093322335, 1.53346307214684,
-0.650564763278306, -0.373704486693932, -0.924228720715619, -0.931179032930509,
-0.309468200147579, 1.6513839050529, -1.06455672195892), a_2 = c(2.07296416623927,
-0.040135834336151, 0.0359118773308408, 1.13285720793684, -0.324655504171795,
0.856081768489117, -1.36456191552214, 0.730800040331243, 0.912096452304384,
-1.59124725717562), b_2 = c(0.36365730618185, -0.547314112818983,
1.850134670075, -0.178995695839892, 0.474832212746808, 0.216839372888426,
-0.0353431588238, -0.169393100775411, -0.330432553833477, 0.043945304544359
), c_2 = c(0.245070864427874, 0.71886275016605, 1.35567222367957,
0.556607205459845, 0.287483186639216, 1.65350317111755, -1.81872622002345,
0.152993150129941, -0.0840400626089268, -1.08300472554552)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))