假设我有一个以下格式的表:
CowId DIM Type
1 13 Case
2 7 Case
3 3 Control
4 4 Control
5 9 Control
6 3 Control
7 5 Control
8 10 Control
9 1 Control
10 6 Control
11 7 Control
12 4 Control
我想根据+/-3 DIM将案例随机匹配到对照(1到1(。有没有一种方便的方法可以使用dplyr来完成这项任务?如有任何反馈,我们将不胜感激。
dput的输出被附加:
structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
基本R中的一种方法:
#Get the index where Type = 'Case'
inds <- df$Type == 'Case'
#Get all the values within -3-3 for each DIM value
vals <- unique(c(sapply(df$DIM[inds], `+`, -3:3)))
#select random rows within range
result <- sample(which(df$DIM %in% vals & !inds), sum(inds))
#Combine case and control data.
df[c(which(inds), result), ]
# CowId DIM Type
#1 1 13 Case
#2 2 7 Case
#5 5 9 Control
#10 10 6 Control
随机零件可能很棘手。这是我的方法:
- 对于每种情况,Id计算最小/最大DIM
- 然后随机选择一个或一半的可用控件
- 更新参照CAse ID拾取的控件,并从将来的拾取中排除这些行
- 重复此步骤,直到完成所有案例
- 如果没有可用的选择,将弹出一条消息
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(magrittr)
df <- structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L,
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")
# create variable for tracking sample picking process
df %<>% mutate(Picked = FALSE, Case_ID = -1)
# get list of case - assume the df is unique
list_case_id <- df$CowId[df$Type == "Case"]
for (i_case_id in list_case_id) {
# calculate the min/max DIM
current_case <- df %>% filter(CowId == i_case_id)
expecting_DIM_min <- current_case$DIM - 3
expecting_DIM_max <- current_case$DIM + 3
# Pick with sample
possible_sample <- df %>%
filter(Type == "Control", DIM >= expecting_DIM_min & DIM <= expecting_DIM_max,
Picked == FALSE)
if (nrow(possible_sample) == 0) {
message("There is no possible sample for Case ID: ", i_case_id)
message("DIM Range is: ", expecting_DIM_min, " - ", expecting_DIM_max)
} else {
max_sample <- nrow(possible_sample)
# Maximum pick - in this case OP ask for 1 - 1 matched
# pick_number <- max(1, max_sample / 2)
pick_number <- 1
sample <- possible_sample %>%
sample_n(size = 1)
df$Picked[df$CowId %in% sample$CowId] <- TRUE
df$Case_ID[df$CowId %in% sample$CowId] <- i_case_id
}
}
这是一个输出
df %>% filter(Picked | Type == "Case")
#> CowId DIM Type Picked Case_ID
#> 1 1 13 Case FALSE -1
#> 2 2 7 Case FALSE -1
#> 3 8 10 Control TRUE 1
#> 4 10 6 Control TRUE 2
更新:仅匹配1-1
创建于2021-04-10由reprex包(v1.0.0(