r - Perform matched risk set sampling (incidence density sampling) on two variables, without replacement



I have a data frame like the one in the example below:

### Packages needed for reproducible example
library(lubridate)
library(dplyr)
### Create data frame:
Person_IDs <- seq(1,1000000,1)
Example_DF <- as.data.frame(Person_IDs)
### Sex and age for matching:
set.seed(2021)
Example_DF$Sex <- sample(c("Male", "Female"), size = 1000000, replace = TRUE)
set.seed(2021)
Example_DF$Age <- sample(c(1:100), size = 1000000, replace = TRUE)
### Study start and end date (just for clarity):
Example_DF$Start_Date <- as.Date("2020-01-01")
Example_DF$End_Date <- as.Date("2021-05-01")
### Study outcome (85% not experiencing the outcome, 15% experiencing the outcome):
set.seed(2021)
Example_DF$Outcome <- sample(c(0, 1), size = 1000000, replace = TRUE, prob = c(0.85, 0.15))
### Timestamp for outcome (either as exposed (Outcome = 1) or censored (Outcome = 0)):
Example_DF$Timestamp_Outcome <- as.Date("1900-01-01")
set.seed(2021)
Example_DF$Timestamp_Outcome[Example_DF$Outcome == 1] <- Example_DF$Start_Date[Example_DF$Outcome == 1] + days(sample(c(45:295), size = length(unique(Example_DF$Person_IDs[Example_DF$Outcome == 1])), replace = TRUE))
set.seed(2021)
Example_DF$Timestamp_Outcome[Example_DF$Outcome == 0] <- Example_DF$Start_Date[Example_DF$Outcome == 0] + days(sample(c(275:340), size = length(unique(Example_DF$Person_IDs[Example_DF$Outcome == 0])), replace = TRUE))
### Arrange data by timestamp outcome:
Example_DF <- Example_DF %>% arrange(Timestamp_Outcome)
### Show first rows of data frame:
head(Example_DF)

As you can see, there are:

  1. 1,000,000 unique individuals (Person_IDs), with a common start date of 2020-01-01 and a common end date of 2021-05-01.

  2. Information on sex and age is available for "matching" IDs with Outcome == 1 to controls.

  3. All individuals have an outcome date (either Outcome == 0 or Outcome == 1).

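**What I now want to perform is so-called risk set sampling (or incidence density sampling). The data frame is arranged by the timestamp of the outcome, and now:

  1. Every time the "algorithm" encounters a row where Outcome == 1, it should randomly select three (3) Person_IDs with the same sex, the same age, and a later timestamp (i.e. Timestamp_Outcome at least one day later, regardless of whether Outcome == 0 or Outcome == 1).

  2. These 4 individuals (1 exposed and 3 unexposed) should then be removed from the data frame (i.e. replace = FALSE), so that they cannot be selected again (known as sampling without replacement).**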
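The two steps above can be sketched in base R as follows. This is only a minimal illustration under stated assumptions, not a tuned solution: `risk_set_sample`, `set_id`, and `role` are hypothetical names, the toy data is made up, and the row-by-row loop would be far too slow for the full 1,000,000-row data frame.

```r
# Hypothetical helper: risk set sampling without replacement.
# Assumes columns Sex, Age, Outcome and Timestamp_Outcome exist in `df`.
risk_set_sample <- function(df, n_controls = 3L) {
  df <- df[order(df$Timestamp_Outcome), ]
  available <- rep(TRUE, nrow(df))      # FALSE once a person is in a set
  set_id <- rep(NA_integer_, nrow(df))
  role <- rep(NA_character_, nrow(df))
  next_set <- 1L
  for (i in seq_len(nrow(df))) {
    # skip non-cases and people already used as a control
    if (df$Outcome[i] != 1 || !available[i]) next
    # same sex, same age, strictly later timestamp, not yet used
    eligible <- which(available &
                        df$Sex == df$Sex[i] &
                        df$Age == df$Age[i] &
                        df$Timestamp_Outcome > df$Timestamp_Outcome[i])
    k <- min(n_controls, length(eligible))   # 0 to n_controls matches
    # sample positions, so a single eligible index is not misread by sample()
    picked <- if (k > 0L) eligible[sample.int(length(eligible), k)] else integer(0)
    members <- c(i, picked)
    set_id[members] <- next_set
    role[members] <- c("case", rep("control", k))
    available[members] <- FALSE              # sampling without replacement
    next_set <- next_set + 1L
  }
  df$set_id <- set_id
  df$role <- role
  df[!is.na(df$set_id), ]
}

# Toy data: 12 women aged 86 with consecutive outcome dates (made up)
set.seed(1)
toy <- data.frame(
  Person_IDs = 1:12,
  Sex = "Female",
  Age = 86L,
  Outcome = c(1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0),
  Timestamp_Outcome = as.Date("2020-02-15") + 0:11
)
sets <- risk_set_sample(toy)
sets[order(sets$set_id, sets$role), ]
```

In this sketch a case that was already drawn as a control for an earlier case is skipped, and a case with no eligible controls still forms a set of its own; both are design choices that may need adjusting.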

If needed, consider the following example:

head(Example_DF)

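As you can see, Person_IDs 1030, 1269, 3180, 4245, etc. all experience the outcome on 2020-02-15. Take Person_ID 1030 as an example: this is an 86-year-old female. She should therefore be matched with three 86-year-old females who were not exposed on 2020-02-15 (they may become exposed on 2020-02-26, on 2020-02-20, or at any later time). If this is not possible, as many matches as possible should be selected (from 0 to 3 matched individuals).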
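To make the matching rule for a single case concrete, here is a small sketch that lists the candidates for one index case. Only Person_ID 1030 comes from the example above; the other IDs, dates, and outcomes are invented for illustration.

```r
# Toy data: apart from Person_ID 1030, all rows are made up
toy <- data.frame(
  Person_IDs = c(1030, 2001, 2002, 2003, 2004, 2005),
  Sex = c("Female", "Female", "Female", "Male", "Female", "Female"),
  Age = c(86, 86, 86, 86, 85, 86),
  Outcome = c(1, 1, 0, 0, 0, 1),
  Timestamp_Outcome = as.Date(c("2020-02-15", "2020-02-20", "2020-11-01",
                                "2020-11-01", "2020-11-01", "2020-02-15"))
)

index_case <- toy[toy$Person_IDs == 1030, ]

# Risk set: same sex, same age, and a strictly later timestamp.
# Outcome is deliberately ignored: a later case (2001) is still eligible.
risk_set <- toy[toy$Sex == index_case$Sex &
                  toy$Age == index_case$Age &
                  toy$Timestamp_Outcome > index_case$Timestamp_Outcome, ]

# Take up to three controls; here only two candidates exist
n_take <- min(3L, nrow(risk_set))
controls <- risk_set[sample.int(nrow(risk_set), n_take), ]
```

Person 2003 is excluded by sex, 2004 by age, and 2005 because the outcome falls on the same day as the index case, so the risk set holds only 2001 and 2002.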

Do you know how to do this?

Here is a potential solution using data.table and recursion:

library(data.table)
library(lubridate)
set.seed(123)
dt <- data.table(
  Person_IDs = 1:1e6,
  Start_Date = as.Date("2020-01-01"),
  Exposure_Date = as.Date("2020-01-01") + days(sample(c(45:365), size = 1e6, replace = TRUE)),
  End_Date = as.Date("2021-05-01"),
  Sex = sample(c("Male", "Female"), size = 1e6, replace = TRUE),
  Age = sample(c(1:100), size = 1e6, replace = TRUE)
)
matched_risk_sample_rec <- function(id, Exposure_Date, size = 5L, out_vec, idx = 1L) {
  # perform the matched risk sampling

  # get the index of the next unexposed person
  idxUnexposed <- sum(Exposure_Date == Exposure_Date[1]) + 1L

  if (length(id) - idxUnexposed + 1L < size) {
    # not enough for another sample set
    return(out_vec)
  }

  # get a sample set
  sample.id <- c(1L, sample(idxUnexposed:length(id), size = size, replace = FALSE))
  out_vec[idx:(idx + size)] <- id[sample.id]
  # remove the samples and recurse
  return(matched_risk_sample_rec(id[-sample.id], Exposure_Date[-sample.id], size, out_vec, idx + size + 1L))
}
# order the dataset by Sex, Age, and Exposure_Date, and mark as sorted
setkey(dt, Sex, Age, Exposure_Date)
# add a column for the sample set ordering
# every 6 values of "set_ids" is a sample set of IDs, with the first value being the exposed person id
dt[, set_ids := matched_risk_sample_rec(Person_IDs, Exposure_Date, 5L, rep(NA_integer_, .N)), by = .(Sex, Age)]
# rearrange the data.table by the "set_ids" column
# override "set_ids" with a unique ID for each set
dtSamples <- dt[dt[!is.na(set_ids), "set_ids"], on = .(Person_IDs == set_ids)][, set_ids := rep(1:(.N/6L), each = 6L)]

dtSamples now has 166,588 sample sets of 6 people each, with the first person in each sample set being the exposed one.
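One way to sanity-check sets laid out like this is sketched below. `check_sample_sets` is a hypothetical helper, not part of data.table, and the toy table is made up. It assumes each set is identified by `set_ids` and that the first row of each set is the exposed person, as in dtSamples.

```r
library(data.table)

# Hypothetical helper: one summary row per sample set.
# Assumes the first row within each set is the exposed person.
check_sample_sets <- function(dtSets) {
  dtSets[, .(
    n_people       = .N,
    same_stratum   = uniqueN(Sex) == 1L && uniqueN(Age) == 1L,
    controls_later = all(Exposure_Date[-1L] > Exposure_Date[1L])
  ), by = set_ids]
}

# Toy input (made up): set 2 is deliberately invalid, because one
# "control" shares the exposed person's date instead of a later one
toy_sets <- data.table(
  set_ids = rep(1:2, each = 3L),
  Person_IDs = 1:6,
  Sex = rep(c("Female", "Male"), each = 3L),
  Age = rep(c(86L, 40L), each = 3L),
  Exposure_Date = as.Date("2020-02-15") + c(0, 5, 11, 0, 0, 7)
)

res <- check_sample_sets(toy_sets)
res[!(controls_later)]  # lists the sets that violate the "later date" rule
```

Applied to dtSamples, every set would be expected to show n_people equal to 6 with both flags TRUE; `anyDuplicated(dtSamples$Person_IDs)` returning 0 would additionally confirm that nobody was sampled twice.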
