对于我的数据集中的每个人(每个人一行(,我试图搜索一组变量(月份,所以在我的例子中是Jan-Jul(,看看它们中是否有任何一个与不同变量(后续月份(中的值相匹配。我想创建一个新的变量,它表示是或否,在一组变量中有一个与1变量匹配的值。
基本上,我正在努力为后续访问创建一个时间表。我有下面的"拥有"one_answers"想要"数据集。
谢谢!
拥有:
ID | Jan | Feb | >Mar | Jul | 随访月份 |
---|---|---|---|---|---|
1 | NA | 2 | 3 | >td style="text-align:centre">4A>4 | |
2 | NA | NA | >td style="text align:center;">4A">6|||
3 | 1 | NA | 4 | NA | NA | >NA
这里有一个带有枢轴的版本:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(
-c(ID, Follow.up_month)
) %>%
group_by(ID) %>%
mutate(Follow_up_status = ifelse(Follow.up_month %in% value, "Yes", "No")) %>%
pivot_wider(
names_from = name,
values_from = value
)
输出:
ID Follow.up_month Follow_up_status Jan Feb Mar Apr May June Jul
<int> <int> <chr> <int> <int> <int> <int> <int> <int> <int>
1 1 4 Yes NA 2 3 4 NA NA NA
2 2 6 No NA NA NA 4 NA NA NA
3 3 5 Yes 1 NA 3 4 5 NA NA
4 4 9 No NA NA NA NA NA 6 7
我认为rowwise
和if_any
会为您工作:
library(dplyr)
quux %>%
rowwise() %>%
mutate(
Follow2_int = which(c_across(Jan:Jul) %in% Follow.up.month)[1],
Follow2_lgl = !is.na(Follow2_int)
) %>%
ungroup()
# # A tibble: 4 x 12
# ID Jan Feb Mar Apr May June Jul Follow.up.month Follow_up_Status Follow2_int Follow2_lgl
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <chr> <int> <lgl>
# 1 1 NA 2 3 4 NA NA NA 4 Yes 4 TRUE
# 2 2 NA NA NA 4 NA NA NA 6 No NA FALSE
# 3 3 1 NA 3 4 5 NA NA 5 Yes 5 TRUE
# 4 4 NA NA NA NA NA 6 7 9 No NA FALSE
编辑以包括匹配的逻辑列号和第一个列号(在Jan:Jul
内计数(。
数据
quux <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA), May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA, NA, NA, 7L), Follow.up.month = c(4L, 6L, 5L, 9L), Follow_up_Status = c("Yes", "No", "Yes", "No")), class = "data.frame", row.names = c(NA, -4L))
另一个dplyr
解决方案。
library(dplyr)
dat2 <- dat %>%
mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
bind_cols(dat, .)
dat2
# ID Jan Feb Mar Apr May June Jul Follow_up_month Follow_up_status
# 1 1 NA 2 3 4 NA NA NA 4 Yes
# 2 2 NA NA NA 4 NA NA NA 6 No
# 3 3 1 NA 3 4 5 NA NA 5 Yes
# 4 4 NA NA NA NA NA 6 7 9 No
日期
dat <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA), May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA, NA, NA, 7L), Follow_up_month = c(4L, 6L, 5L, 9L)), class = "data.frame", row.names = c(NA, -4L))
性能
当数据帧很小时,这里的所有解决方案都会起作用。但是,当数据帧很大时,旋转方法和逐行方法可能会很慢。下面我试图展示三种解决方案的性能比较。尽管最终输出不同,数据类型和列顺序不同,但我仍然会对它们进行比较,假设这些差异是可以接受的。
这是设置。
library(microbenchmark)
library(dplyr)
library(tidyr)
pivot_fun <- function(x){
x2 <- x %>%
pivot_longer(
-c(ID, Follow_up_month)
) %>%
group_by(ID) %>%
mutate(Follow_up_status = ifelse(Follow_up_month %in% value, "Yes", "No")) %>%
pivot_wider(
names_from = name,
values_from = value
)
return(x2)
}
rowwise_fun <- function(x){
x2 <- x %>%
pivot_longer(
-c(ID, Follow_up_month)
) %>%
group_by(ID) %>%
mutate(Follow_up_status = ifelse(Follow_up_month %in% value, "Yes", "No")) %>%
pivot_wider(
names_from = name,
values_from = value
)
return(x2)
}
rowSums_fun <- function(x){
x2 <- x %>%
mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
bind_cols(x, .)
return(x2)
}
以下是对原始示例的比较。本文提供的解决方案是最快的。
set.seed(1)
microbenchmark(pivot_fun(dat), rowwise_fun(dat), rowSums_fun(dat))
# Unit: milliseconds
# expr min lq mean median uq max neval
# pivot_fun(dat) 11.037401 11.927201 13.58003 12.659001 13.882151 30.0207 100
# rowwise_fun(dat) 10.907602 11.670701 13.56004 12.295051 13.614201 24.4249 100
# rowSums_fun(dat) 6.590502 7.147702 8.48469 7.714351 8.808602 17.0109 100
这里是一个较大数据帧的比较。这篇文章提供的解决方案比其他答案快10倍。
set.seed(12)
n <- 100000
dat_n <- data.frame(
ID = 1:n,
Jan = sample(dat$Jan, size = n, replace = TRUE),
Feb = sample(dat$Feb, size = n, replace = TRUE),
Mar = sample(dat$Mar, size = n, replace = TRUE),
Apr = sample(dat$Apr, size = n, replace = TRUE),
May = sample(dat$May, size = n, replace = TRUE),
June = sample(dat$June, size = n, replace = TRUE),
Jul = sample(dat$Jul, size = n, replace = TRUE),
Follow_up_month = sample(1:12, size = n, replace = TRUE)
)
set.seed(123)
microbenchmark(pivot_fun(dat_n), rowwise_fun(dat_n), rowSums_fun(dat_n))
# Unit: milliseconds
# expr min lq mean median uq max neval
# pivot_fun(dat_n) 1168.416 1405.5724 1496.6545 1471.0253 1574.3927 2327.1624 100
# rowwise_fun(dat_n) 1159.790 1401.0586 1494.9987 1465.8929 1580.0092 1982.5099 100
# rowSums_fun(dat_n) 84.494 102.0946 122.2843 111.8158 123.6288 296.3234 100