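r - How to compare two data frames of dates, return the dates that match within a given interval, and flag the non-matching dates for each row in a new data frame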


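I have a data frame in which each row holds several measurement dates for one subject, and a second data frame in which each row holds several visit dates for the same subject (including some NAs).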

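What I want is to extract the measurement dates that fall within a given interval of one of that subject's visit dates (say +/-10 days from the visit date), flag the measurement dates that do not fall within that interval (e.g. with "FALSE" or -99), and keep the NAs as they are.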
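To make the matching rule concrete, here is a minimal base R sketch for a single measurement date and a single visit date. The two dates and the variable names are made up for illustration and are not taken from the data below.

```r
# Hypothetical dates, chosen only to illustrate the +/-10 day rule
measure <- as.Date("2018-05-09")
visit   <- as.Date("2018-05-15")

# Subtracting two Date objects gives a difference in days
days_apart <- abs(as.numeric(measure - visit))
days_apart        # 6
days_apart <= 10  # TRUE: this measurement counts as a match
```

The whole problem amounts to applying this test to every measurement date against every visit date of the same subject.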

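A similar question has been asked here, but it does not allow for the measurement date to lie within an interval around the visit date.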

set.seed(1)
# Dataframe with measure dates
df1 <- rbind.data.frame(sort(sample(seq(as.Date("2018-01-01"), as.Date("2019-01-01"), by = "day"), 10)),
c(sort(sample(seq(as.Date("2018-06-01"), as.Date("2019-06-01"), by = "day"), 8)), NA, NA),
c(sort(sample(seq(as.Date("2019-06-01"), as.Date("2020-06-01"), by = "day"), 6)), rep(NA, 4)))
names(df1) <- paste("MEASUREDATE", 1:10, sep = "")
myfun <- function(x) as.Date(x, format = "%Y-%m-%d", origin = "1970-01-01")
df1 <- data.frame(lapply(df1, myfun))
df1
# Dataframe with visit dates
df2 <- rbind.data.frame(as.numeric(df1[1, 2:7]), as.numeric(c(df1[2, 4:6], NA, NA, NA)), as.numeric(c(df1[3, 1:2], rep(NA, 4))))
df2 <- data.frame(lapply(df2, myfun))
names(df2) <- paste("VISIT", 1:6, sep = "")
df2

So the first row of the new data frame would look like this:

# New dataframe
df3 <- df1[1, ]
df3[1] <- FALSE
df3[9:10] <- FALSE
df3

Do you know how to solve this? Many thanks for your help.

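Here is a data.table solution. In the second-to-last step, the missing visit dates (measurements with no visit within the interval) are set to 1970-01-01. NA cannot be used, because it would be indistinguishable from the NAs that are already present, and the value has to be a date. If you do not need the date format, you can switch to character and fill in any value you like.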

library(data.table)
# set as data.table
setDT(df1); setDT(df2)
# add subject numbering
df1[, id := .I]
df2[, id := .I]
# melt to long format
df1.melt <- melt(setDT(df1), id.vars = "id")
df2.melt <- melt(setDT(df2), id.vars = "id")
# add margins around visit dates
df2.melt[, `:=`(mindate = value - 10, maxdate = value + 10)][]
# join visitdays within 10 days of measure (non-equi join)
df1.melt[df2.melt, visitdate := i.value, on = .(id, value >= mindate, value <= maxdate)]
# flag measurements without a matching visit as 1970-01-01 (keeps the Date class)
df1.melt[!is.na(value) & is.na(visitdate), visitdate := as.Date("1970-01-01")]
# last step is to cast to wide again
dcast(df1.melt, id ~ variable, value.var = "visitdate")
#    id MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
# 1:  1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-10-04   2018-10-04   2018-10-26   2018-10-26   1970-01-01    1970-01-01
# 2:  2   1970-01-01   1970-01-01   1970-01-01   2018-11-12   2019-01-03   2019-01-03   1970-01-01   1970-01-01         <NA>          <NA>
# 3:  3   2019-08-28   2020-03-15   2020-03-15   1970-01-01   1970-01-01   1970-01-01         <NA>         <NA>         <NA>          <NA>
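
The non-equi join is the step that does the actual matching in the solution above. A stripped-down sketch on a tiny made-up table may make it easier to follow; the tables `measures` and `visits` and their contents are hypothetical and only mirror the column names used above.

```r
library(data.table)

# Hypothetical long-format data: one row per measurement / per visit
measures <- data.table(
  id    = c(1L, 1L, 2L),
  value = as.Date(c("2018-05-09", "2018-08-01", "2018-11-12"))
)
visits <- data.table(
  id    = c(1L, 2L),
  visit = as.Date(c("2018-05-15", "2018-11-20"))
)

# window of +/-10 days around each visit
visits[, `:=`(mindate = visit - 10, maxdate = visit + 10)]

# non-equi join: same subject, and measurement inside the visit window
measures[visits, visitdate := i.visit,
         on = .(id, value >= mindate, value <= maxdate)]

# rows without a visit in range keep visitdate = NA
print(measures)
```

Note that, as in the output above, the joined column holds the matching visit date rather than the measurement date itself.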

As Wimpel said, logical values and dates cannot be stored in the same column. So I will use 1970-01-01 as the FALSE value.
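
If a true logical flag is needed later on, it can be recovered from the sentinel date. The following is a small sketch under the assumption that 1970-01-01 never occurs as a real measurement date; the vector `x` is invented for the example.

```r
# Sentinel standing in for FALSE (day 0 of R's Date scale)
sentinel <- as.Date("1970-01-01")
as.numeric(sentinel)  # 0

# Hypothetical result column: flagged, matched, and missing measurement
x <- as.Date(c("1970-01-01", "2018-05-09", NA))

# TRUE = matched, FALSE = flagged, NA = no measurement at all
in_window <- x != sentinel
in_window  # FALSE TRUE NA
```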

A solution using dplyr

library(dplyr)
# convert a row from a Date dataframe to a Date vector
convert_to_vector <- function(row){
  return(row %>% t %>% as.Date)
}
# given a Date vector where columns 1:10 are measurement date and
# 11:16 visit dates, create a logical vector of length 10 where
# the value is TRUE if the corresponding measurement column
# is within 10 days of any of the visit dates
check_within_10d <- function(row){
  return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, any))
}
# temporary dataframe of logical values for all checks on all dates
df_lgl <- cbind(df1,df2) %>%
  apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>%
  data.frame %>%
  t
# create a result dataframe replacing logicals with corresponding dates
df3 <- df1
for(i in 1:ncol(df3)){
  df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
}

Output

> df3
MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-09-27   2018-10-04   2018-10-26   2018-11-03   1970-01-01    1970-01-01
2         <NA>         <NA>         <NA>   2018-11-12   2018-12-30   2019-01-03         <NA>         <NA>         <NA>          <NA>
3   2019-08-28   2020-03-15   2020-03-16         <NA>         <NA>         <NA>         <NA>         <NA>         <NA>          <NA>

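Some values are NA because some of the visit dates are NA. For those rows the check_within_10d function cannot rule out that one of the missing visit dates lies within 10 days of the measurement date, so the check itself returns NA.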
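
The behaviour comes from how `any()` treats missing values in base R. A short sketch with made-up dates (not taken from the data above):

```r
# One real visit far away from the measurement, one missing visit
measure <- as.Date("2018-06-01")
visits  <- as.Date(c("2018-11-12", NA))

near <- abs(as.numeric(measure - visits)) <= 10
near                     # FALSE NA

any(near)                # NA: the missing visit might have been in range
any(c(TRUE, NA))         # TRUE: one confirmed match is enough
any(near, na.rm = TRUE)  # FALSE: missing visits are ignored
```

With the default `missing` argument, `if_else()` then returns NA wherever its condition is NA, which is why those cells end up as `<NA>` in the output.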

If you want to ignore the missing visit dates in the check, use

convert_to_vector <- function(row){
  return(row %>% t %>% as.Date)
}
# changed function to any(na.rm=TRUE)
check_within_10d <- function(row){
  return(sapply(row[1:10], function(x){abs(x-row[11:16])<=10}) %>% apply(2, function(x){any(x,na.rm=TRUE)}))
}
df_lgl <- cbind(df1,df2) %>%
  apply(1, function(row){check_within_10d(convert_to_vector(row))}) %>%
  data.frame %>%
  t
# flag non-matching dates, then restore NA where the measurement date itself is missing
df3 <- df1
for(i in 1:ncol(df3)){
  df3[,i] <- if_else(df_lgl[,i], df3[,i], as.Date("1970-01-01"))
  df3[,i] <- if_else(is.na(df1[,i]), df1[,i], df3[,i])
}

Output

> df3
MEASUREDATE1 MEASUREDATE2 MEASUREDATE3 MEASUREDATE4 MEASUREDATE5 MEASUREDATE6 MEASUREDATE7 MEASUREDATE8 MEASUREDATE9 MEASUREDATE10
1   1970-01-01   2018-05-09   2018-06-16   2018-07-06   2018-09-27   2018-10-04   2018-10-26   2018-11-03   1970-01-01    1970-01-01
2   1970-01-01   1970-01-01   1970-01-01   2018-11-12   2018-12-30   2019-01-03   1970-01-01   1970-01-01         <NA>          <NA>
3   2019-08-28   2020-03-15   2020-03-16   1970-01-01   1970-01-01   1970-01-01         <NA>         <NA>         <NA>          <NA>
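
For reference, the same per-subject logic can be condensed into a small base R helper. This is only a sketch, not part of either answer: the function name `flag_measurements`, its arguments, and the example dates are hypothetical, and it assumes that missing visit dates should be ignored.

```r
# Hypothetical helper: keep measurement dates within `tol` days of any visit,
# replace the others with `sentinel`, and leave NA measurements as NA.
flag_measurements <- function(measure, visits, tol = 10,
                              sentinel = as.Date("1970-01-01")) {
  visits <- visits[!is.na(visits)]  # ignore missing visit dates
  matched <- vapply(
    seq_along(measure),
    function(i) any(abs(as.numeric(measure[i] - visits)) <= tol),
    logical(1)
  )
  out <- measure
  # only non-missing measurements without a match are flagged
  out[!is.na(measure) & !matched] <- sentinel
  out
}

# Made-up dates for one subject
measure <- as.Date(c("2018-01-05", "2018-05-09", "2018-05-25", NA))
visits  <- as.Date(c("2018-05-15", NA))

res <- flag_measurements(measure, visits)
print(res)
# [1] "1970-01-01" "2018-05-09" "2018-05-25" NA
```

The third date is exactly 10 days from the visit and is kept because the comparison uses `<=`. To process whole data frames, the helper would be called once per row of measurement dates with the matching row of visit dates.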
