我有一个血液测试标记的数据框架结果,我想通过以下标准填写NA:
对于每组ID(时间为上升顺序(,如果标记值为Na,则在该组中以最接近的Na值填充(可能是过去或将来(,但仅当时间差小于14。
我的数据示例:
df<-data.frame(ID=c(rep(2,5),rep(4,3)), TIME =c(1,22,33,43,85,-48,1,30),
CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
CA.19.9 = c(6.18,7.11,5.72, NA, 7.38,39.30,118.20,98.26),
CA.72.4 = c(rep(NA,5),1.32, NA, NA),
NSE = c(NA, 13.21, rep(NA,6)))
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA NA
2 43 2.33 22.34 NA NA NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
id是病人。时间是血液检查的时间。其他是标记。
我唯一可以做到的是循环,我尝试避免尽可能避免。
我希望输出为:
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
2 1 1.32 14.62 33.98 6.18 NA NA
2 22 1.42 14.59 27.56 7.11 NA 13.21
2 33 1.81 16.80 30.31 5.72 NA 13.21
2 43 2.33 22.34 30.31 5.72 NA NA
2 85 2.23 36.33 39.57 7.38 NA NA
4 -48 29.70 56.02 1171.00 39.30 1.32 NA
4 1 23.34 94.09 956.50 118.20 NA NA
4 30 18.23 121.50 825.30 98.26 NA NA
ca.19.9和ca.124充满了过去(10天前(NSE充满了前(11天(
ca.72.4没有填充,因为时间差为-48是下一个措施49天。
我敢打赌,有一个更简单的,矢量化的解决方案,但是以下工作。
fill_NA <- function(DF){
sp <- split(df, df$ID)
sp <- lapply(sp, function(DF){
d <- diff(DF$TIME)
i_diff <- c(FALSE, d < 14)
res <- sapply(DF[-(1:2)], function(X){
inx <- i_diff & is.na(X)
if(any(inx)){
inx <- which(inx)
last_change <- -1
for(i in inx){
if(i > last_change + 1){
if(i == 1){
X[i] <- X[i + 1]
}else{
X[i] <- X[i - 1]
}
last_change <- i
}
}
}
X
})
cbind(DF[1:2], res)
})
res <- do.call(rbind, sp)
row.names(res) <- NULL
res
}
fill_NA(df)
# ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE
#1 2 1 1.32 14.62 33.98 6.18 NA NA
#2 2 22 1.42 14.59 27.56 7.11 NA 13.21
#3 2 33 1.81 16.80 30.31 5.72 NA 13.21
#4 2 43 2.33 22.34 30.31 5.72 NA NA
#5 2 85 2.23 36.33 39.57 7.38 NA NA
#6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA
#7 4 1 23.34 94.09 956.50 118.20 NA NA
#8 4 30 18.23 121.50 825.30 98.26 NA NA
是的,您可以具有矢量化解决方案。首先,让我们考虑您仅使用未来价值进行算的情况。您需要创建几个辅助变量:
- 一个变量告诉您下一个观察结果是否属于同一ID(因此可以用来估算(,
- 一个变量,告诉您下一个观察结果是否与当前的观察相距不到14天。
这些不取决于您要算的特定变量。为了归纳每个变量,您还需要一个变量,该变量告诉您下一个变量是否丢失。
然后,您可以对以下逻辑进行矢量化:下一个观察结果具有相同的ID,而当它距离当前一个逻辑少于14天,并且在当前的ID中不丢失其值。
。当您需要决定使用过去还是将来的价值时,事情会变得更加复杂,但是逻辑是相同的。代码在下面,有点长,但是您可以简化它,我只想清楚它的作用。
希望这会有所帮助
x <-data.frame(ID=c(rep(2,5),rep(4,3)), TIME =c(1,22,33,43,85,-48,1,30),
CEA = c(1.32,1.42,1.81,2.33,2.23,29.7,23.34,18.23),
CA.15.3 = c(14.62,14.59,16.8,22.34,36.33,56.02,94.09,121.5),
CA.125 = c(33.98,27.56,30.31,NA,39.57,1171.00,956.50,825.30),
CA.19.9 = c(6.18,7.11,5.72, NA, 7.38,39.30,118.20,98.26),
CA.72.4 = c(rep(NA,5),1.32, NA, NA),
NSE = c(NA, 13.21, rep(NA,6)))
### these are the columns we want to input
cols.to.impute <- colnames(x)[! colnames(x) %in% c("ID","TIME")]
### is the next id the same?
x$diffidf <- NA
x$diffidf[1:(nrow(x)-1)] <- diff(x$ID)
x$diffidf[x$diffidf > 0] <- NA
### is the previous id the same?
x$diffidb <- NA
x$diffidb[2:nrow(x)] <- diff(x$ID)
x$diffidb[x$diffidb > 0] <- NA
### diff in time with next observation
x$difftimef <- NA
x$difftimef[1:(nrow(x)-1)] <- diff(x$TIME)
### diff in time with previous observation
x$difftimeb <- NA
x$difftimeb[2:nrow(x)] <- diff(x$TIME)
### if next (previous) id is not the same time difference is not meaningful
x$difftimef[is.na(x$diffidf)] <- NA
x$difftimeb[is.na(x$diffidb)] <- NA
### we do not need diffid anymore (due to previous statement)
x$diffidf <- x$diffidb <- NULL
### if next (previous) point in time is more than 14 days it is not useful for imputation
x$difftimef[abs(x$difftimef) > 14] <- NA
x$difftimeb[abs(x$difftimeb) > 14] <- NA
### create variable usef that tells us whether we should attempt to use the forward observation for imputation
### it is 1 only if difftime forward is less than difftime backward
x$usef <- NA
x$usef[!is.na(x$difftimef) & x$difftimef < x$difftimeb] <- 1
x$usef[!is.na(x$difftimef) & is.na(x$difftimeb)] <- 1
x$usef[is.na(x$difftimef) & !is.na(x$difftimeb)] <- 0
if (!is.na(x$usef[nrow(x)]))
stop("nlast observation usef is not missingn")
### now we get into column specific operations.
for (col in cols.to.impute){
### we will store the results in x$imputed, and copy into c[,col] at the end
x$imputed <- x[,col]
### x$usef needs to be modified depending on the specific column, so we define a local version of it
x$usef.local <- x$usef
### if a variable is not missing no point in looking at usef.local, so we make it missing
x$usef.local[!is.na(x[,col])] <- NA
### when usef.local is 1 but the next observation is missing it cannot be used for imputation, so we
### make it 0. but a value of 0 does not mean we can use the previous observation because that may
### be missing too. so first we make usef 0 and next we check the previous observation and if that
### is missing too we make usef missing
x$previous.value <- c(NA,x[1:(nrow(x)-1),col])
x$next.value <- c(x[2:nrow(x),col],NA)
x$next.missing <- is.na(x$next.value)
x$previous.missing <- is.na(x$previous.value)
x$usef.local[x$next.missing & x$usef.local == 1] <- 0
x$usef.local[x$previous.missing & x$usef.local == 0] <- NA
### now we can impute properly: use next value when usef.local is 1 and previous value when usef.local is 0
tmp <- rep(FALSE,nrow(x))
tmp[x$usef.local == 1] <- TRUE
x$imputed[tmp] <- x$next.value[tmp]
tmp <- rep(FALSE,nrow(x))
tmp[x$usef.local == 0] <- TRUE
x$imputed[tmp] <- x$previous.value[tmp]
### copy to column
x[,col] <- x$imputed
}
### get rid of useless temporary stuff
x$previous.value <- x$previous.missing <- x$next.value <- x$next.missing <- x$imputed <- x$usef.local <- NULL
ID TIME CEA CA.15.3 CA.125 CA.19.9 CA.72.4 NSE difftimef difftimeb usef
1 2 1 1.32 14.62 33.98 6.18 NA NA NA NA NA
2 2 22 1.42 14.59 27.56 7.11 NA 13.21 11 NA 1
3 2 33 1.81 16.80 30.31 5.72 NA 13.21 10 11 1
4 2 43 2.33 22.34 30.31 5.72 NA NA NA 10 0
5 2 85 2.23 36.33 39.57 7.38 NA NA NA NA NA
6 4 -48 29.70 56.02 1171.00 39.30 1.32 NA NA NA NA
7 4 1 23.34 94.09 956.50 118.20 NA NA NA NA NA
8 4 30 18.23 121.50 825.30 98.26 NA NA NA NA NA
>