r-根据条件创建匹配对

  • 本文关键字:创建 条件 r dplyr
  • 更新时间 :
  • 英文 :


假设我有一个以下格式的表:

CowId    DIM    Type
1        13     Case
2         7     Case
3         3     Control
4         4     Control
5         9     Control
6         3     Control
7         5     Control
8        10     Control
9         1     Control
10        6     Control
11        7     Control
12        4     Control

我想根据+/-3 DIM将案例随机匹配到对照(1到1(。有没有一种方便的方法可以使用dplyr来完成这项任务?如有任何反馈,我们将不胜感激。

dput的输出被附加:

structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L, 
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")

基本R中的一种方法:

#Get the index where Type = 'Case'
inds <- df$Type == 'Case'
#Get all the values within -3-3 for each DIM value
vals <- unique(c(sapply(df$DIM[inds], `+`, -3:3)))
#select random rows within range
result <- sample(which(df$DIM %in% vals & !inds), sum(inds))
#Combine case and control data. 
df[c(which(inds), result), ]
#   CowId DIM    Type
#1      1  13    Case
#2      2   7    Case
#5      5   9 Control
#10    10   6 Control

随机零件可能很棘手。这是我的方法:

  • 对于每种情况,Id计算最小/最大DIM
  • 然后随机选择一个或一半的可用控件
  • 更新参照CAse ID拾取的控件,并从将来的拾取中排除这些行
  • 重复此步骤,直到完成所有案例
  • 如果没有可用的选择,将弹出一条消息
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
df <- structure(list(CowId = 1:12, DIM = c(13L, 7L, 3L, 4L, 9L, 3L, 
5L, 10L, 1L, 6L, 7L, 4L), Type = structure(c(2L, 2L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Control", "Case"
), class = "factor")), row.names = c(NA, -12L), class = "data.frame")

# create variable for tracking sample picking process
df %<>% mutate(Picked = FALSE, Case_ID = -1)
# get list of case - assume the df is unique
list_case_id <- df$CowId[df$Type == "Case"]
for (i_case_id in list_case_id) {
# calculate the min/max DIM
current_case <- df %>% filter(CowId == i_case_id)  
expecting_DIM_min <- current_case$DIM - 3
expecting_DIM_max <- current_case$DIM + 3

# Pick with sample
possible_sample <- df %>%
filter(Type == "Control", DIM >= expecting_DIM_min & DIM <= expecting_DIM_max,
Picked == FALSE)
if (nrow(possible_sample) == 0) {
message("There is no possible sample for Case ID: ", i_case_id)
message("DIM Range is: ", expecting_DIM_min, " - ", expecting_DIM_max)
} else {
max_sample <- nrow(possible_sample)
# Maximum pick - in this case OP ask for 1 - 1 matched
# pick_number <- max(1, max_sample / 2)
pick_number <- 1
sample <- possible_sample %>%
sample_n(size = 1)
df$Picked[df$CowId %in% sample$CowId] <- TRUE
df$Case_ID[df$CowId %in% sample$CowId] <- i_case_id
}
}

这是一个输出

df %>% filter(Picked | Type == "Case")
#>   CowId DIM    Type Picked Case_ID
#> 1     1  13    Case  FALSE      -1
#> 2     2   7    Case  FALSE      -1
#> 3     8  10 Control   TRUE       1
#> 4    10   6 Control   TRUE       2

更新:仅匹配1-1

创建于2021-04-10由reprex包(v1.0.0(

最新更新