r-确定一组变量中的任何值是否与另一个变量中的值匹配



对于我的数据集中的每个人(每个人一行(,我试图搜索一组变量(月份,所以在我的例子中是Jan-Jul(,看看它们中是否有任何一个与不同变量(后续月份(中的值相匹配。我想创建一个新的变量,它表示是或否,在一组变量中有一个与1变量匹配的值。

基本上,我正在努力为后续访问创建一个时间表。我有下面的"拥有"one_answers"想要"数据集。

谢谢!

拥有:

>>td style="text-align:centre">4A>>td style="text align:center;">4A">6>NA
ID Jan FebMarJul随访月份
1 NA 2 34
2 NA NA
3 1 NA 4 NA NA

这里有一个带有枢轴的版本:

library(dplyr)
library(tidyr)
df %>% 
pivot_longer(
-c(ID, Follow.up_month)
) %>% 
group_by(ID) %>% 
mutate(Follow_up_status = ifelse(Follow.up_month %in% value, "Yes", "No")) %>% 
pivot_wider(
names_from = name,
values_from = value
)

输出:

ID Follow.up_month Follow_up_status   Jan   Feb   Mar   Apr   May  June   Jul
<int>           <int> <chr>            <int> <int> <int> <int> <int> <int> <int>
1     1               4 Yes                 NA     2     3     4    NA    NA    NA
2     2               6 No                  NA    NA    NA     4    NA    NA    NA
3     3               5 Yes                  1    NA     3     4     5    NA    NA
4     4               9 No                  NA    NA    NA    NA    NA     6     7

我认为rowwiseif_any会为您工作:

library(dplyr)
quux %>%
rowwise() %>%
mutate(
Follow2_int = which(c_across(Jan:Jul) %in% Follow.up.month)[1], 
Follow2_lgl = !is.na(Follow2_int)
) %>%
ungroup()
# # A tibble: 4 x 12
#      ID   Jan   Feb   Mar   Apr   May  June   Jul Follow.up.month Follow_up_Status Follow2_int Follow2_lgl
#   <int> <int> <int> <int> <int> <int> <int> <int>           <int> <chr>                  <int> <lgl>      
# 1     1    NA     2     3     4    NA    NA    NA               4 Yes                        4 TRUE       
# 2     2    NA    NA    NA     4    NA    NA    NA               6 No                        NA FALSE      
# 3     3     1    NA     3     4     5    NA    NA               5 Yes                        5 TRUE       
# 4     4    NA    NA    NA    NA    NA     6     7               9 No                        NA FALSE      

编辑以包括匹配的逻辑列号和第一个列号(在Jan:Jul内计数(。


数据

quux <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA),     May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA,     NA, NA, 7L), Follow.up.month = c(4L, 6L, 5L, 9L), Follow_up_Status = c("Yes",     "No", "Yes", "No")), class = "data.frame", row.names = c(NA, -4L))

另一个dplyr解决方案。

library(dplyr)
dat2 <- dat %>%
mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
bind_cols(dat, .)
dat2
#   ID Jan Feb Mar Apr May June Jul Follow_up_month Follow_up_status
# 1  1  NA   2   3   4  NA   NA  NA               4              Yes
# 2  2  NA  NA  NA   4  NA   NA  NA               6               No
# 3  3   1  NA   3   4   5   NA  NA               5              Yes
# 4  4  NA  NA  NA  NA  NA    6   7               9               No

日期

dat <- structure(list(ID = 1:4, Jan = c(NA, NA, 1L, NA), Feb = c(2L, NA, NA, NA), Mar = c(3L, NA, 3L, NA), Apr = c(4L, 4L, 4L, NA),     May = c(NA, NA, 5L, NA), June = c(NA, NA, NA, 6L), Jul = c(NA,     NA, NA, 7L), Follow_up_month = c(4L, 6L, 5L, 9L)), class = "data.frame", row.names = c(NA, -4L))

性能

当数据帧很小时,这里的所有解决方案都会起作用。但是,当数据帧很大时,旋转方法和逐行方法可能会很慢。下面我试图展示三种解决方案的性能比较。尽管最终输出不同,数据类型和列顺序不同,但我仍然会对它们进行比较,假设这些差异是可以接受的。

这是设置。

library(microbenchmark)
library(dplyr)
library(tidyr)

pivot_fun <- function(x){
x2 <- x %>%
pivot_longer(
-c(ID, Follow_up_month)
) %>% 
group_by(ID) %>% 
mutate(Follow_up_status = ifelse(Follow_up_month %in% value, "Yes", "No")) %>% 
pivot_wider(
names_from = name,
values_from = value
)
return(x2)
}
rowwise_fun <- function(x){
x2 <- x %>%
pivot_longer(
-c(ID, Follow_up_month)
) %>% 
group_by(ID) %>% 
mutate(Follow_up_status = ifelse(Follow_up_month %in% value, "Yes", "No")) %>% 
pivot_wider(
names_from = name,
values_from = value
)
return(x2)
}
rowSums_fun <- function(x){
x2 <- x %>%
mutate(across(Jan:Jul, .fns = ~.x - Follow_up_month == 0)) %>%
mutate(Follow_up_status = as.character(rowSums(select(., Jan:Jul), na.rm = TRUE))) %>%
transmute(Follow_up_status = recode(Follow_up_status, "0" = "No", "1" = "Yes")) %>%
bind_cols(x, .)
return(x2)
}

以下是对原始示例的比较。本文提供的解决方案是最快的。

set.seed(1)
microbenchmark(pivot_fun(dat), rowwise_fun(dat), rowSums_fun(dat))
# Unit: milliseconds
#             expr       min        lq     mean    median        uq     max neval
#   pivot_fun(dat) 11.037401 11.927201 13.58003 12.659001 13.882151 30.0207   100
# rowwise_fun(dat) 10.907602 11.670701 13.56004 12.295051 13.614201 24.4249   100
# rowSums_fun(dat)  6.590502  7.147702  8.48469  7.714351  8.808602 17.0109   100

这里是一个较大数据帧的比较。这篇文章提供的解决方案比其他答案快10倍。

set.seed(12)
n <- 100000
dat_n <- data.frame(
ID = 1:n,
Jan = sample(dat$Jan, size = n, replace = TRUE),
Feb = sample(dat$Feb, size = n, replace = TRUE),
Mar = sample(dat$Mar, size = n, replace = TRUE),
Apr = sample(dat$Apr, size = n, replace = TRUE),
May = sample(dat$May, size = n, replace = TRUE),
June = sample(dat$June, size = n, replace = TRUE),
Jul = sample(dat$Jul, size = n, replace = TRUE),
Follow_up_month = sample(1:12, size = n, replace = TRUE)
)
set.seed(123)
microbenchmark(pivot_fun(dat_n), rowwise_fun(dat_n), rowSums_fun(dat_n))
# Unit: milliseconds
#               expr      min        lq      mean    median        uq       max neval
#   pivot_fun(dat_n) 1168.416 1405.5724 1496.6545 1471.0253 1574.3927 2327.1624   100
# rowwise_fun(dat_n) 1159.790 1401.0586 1494.9987 1465.8929 1580.0092 1982.5099   100
# rowSums_fun(dat_n)   84.494  102.0946  122.2843  111.8158  123.6288  296.3234   100

最新更新