R - Determine which intervals a given vector of dates falls in



This is my first post here! I'm really stuck on something that I'm sure is easy to do.

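I have a data frame of irregularly spaced intervals and a vector of dates. If one of the dates falls within any given interval, I'd like a new column to flag it (because that interval needs to be removed). This is similar to this post, but the solution there doesn't work because my intervals are irregular. I have over 2000 intervals and 2000 dates.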

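I can use the %within% function to get the dates that fall within a given interval, but that doesn't help, because I can't find out which interval a date falls in.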

I tried the solutions in this similar post, but I don't have a grouping variable and couldn't get them to work.

Any suggestions would be really helpful! Many thanks!

Example raw data (not stored as lubridate intervals):

>df1
diveno               start                 fin
1      1 2018-08-01 08:20:40 2018-08-01 08:39:20
2      2 2018-08-01 08:40:50 2018-08-01 08:53:40
3      3 2018-08-01 10:01:00 2018-08-01 10:16:30
4      4 2018-08-01 15:45:30 2018-08-01 15:58:20
5      5 2018-08-01 17:06:00 2018-08-01 17:18:20
>df2
date
1 2018-08-01 08:30:00
2 2018-08-01 15:47:00
3 2018-08-02 17:10:00
What I'd like
> df3
diveno               start                 fin dateoccurs
1      1 2018-08-01 08:20:40 2018-08-01 08:39:20          Y
2      2 2018-08-01 08:40:50 2018-08-01 08:53:40          N
3      3 2018-08-01 10:01:00 2018-08-01 10:16:30          N
4      4 2018-08-01 15:45:30 2018-08-01 15:58:20          Y
5      5 2018-08-01 17:06:00 2018-08-01 17:18:20          N

The dateoccurs column flags whether any date in df2 falls within the given interval in df1.

Code for the example data:


df1<-data.frame(diveno=c(1,2,3,4,5), 
start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"))
df1$start <- as.POSIXct(df1$start,format="%Y-%m-%d %H:%M:%S",tz="CET")
df1$fin <- as.POSIXct(df1$fin,format="%Y-%m-%d %H:%M:%S",tz="CET")

df2<-data.frame(date=c("2018-08-01 08:30:00", "2018-08-01 15:47:00", "2018-08-02 17:10:00"))
df2$date <- as.POSIXct(df2$date,format="%Y-%m-%d %H:%M:%S",tz="CET")

What I need:


df3<-data.frame(diveno=c(1,2,3,4,5), 
start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"),
dateoccurs=c("Y","N","N","Y","N"))

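The closest I've got is with the answer from this post, but it returns altered 'fin' times, and when applied to the real, large data set it seems to duplicate values and change the number of 'diveno'!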

intervals<-df1
elements<-df2[,1]
library(data.table) #v1.10.0
j<-setDT(intervals)[data.table(elements), on = .(start <= elements, fin >= elements)]
j2<-as.data.frame(j)
na.omit(j2)
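A possible explanation for both symptoms, offered as a sketch rather than a diagnosis of the real data: in a data.table non-equi join of the form x[i, on = ...], the join columns in the result carry the values from i, so start and fin both show the matched date, and the result has one row per matching date rather than one row per interval. The example below uses the toy data from the question and asks for the original bounds through the x. prefix; the name matches is only illustrative.

```r
library(data.table)

# Toy data from the question
df1 <- data.table(
  diveno = c(1, 2, 3, 4, 5),
  start = as.POSIXct(c("2018-08-01 08:20:40", "2018-08-01 08:40:50",
                       "2018-08-01 10:01:00", "2018-08-01 15:45:30",
                       "2018-08-01 17:06:00"), tz = "CET"),
  fin = as.POSIXct(c("2018-08-01 08:39:20", "2018-08-01 08:53:40",
                     "2018-08-01 10:16:30", "2018-08-01 15:58:20",
                     "2018-08-01 17:18:20"), tz = "CET")
)
df2 <- data.table(
  date = as.POSIXct(c("2018-08-01 08:30:00", "2018-08-01 15:47:00",
                      "2018-08-02 17:10:00"), tz = "CET")
)

# x.start / x.fin request df1's own bounds; i.date is the matched date.
# nomatch = NULL drops dates that fall in no interval.
matches <- df1[df2,
               .(diveno, start = x.start, fin = x.fin, date = i.date),
               on = .(start <= date, fin >= date),
               nomatch = NULL]
matches
```

Each row of matches pairs one date with the interval that contains it, so an interval containing several dates appears several times. That is expected for a lookup table, but it is why the row count differs from nrow(df1).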

Update: sample data for df2 that seems to produce false positives?

> dput(df2) structure(list(date = structure(c(1533096000, 1533096300, 1533096600,  1533096900, 1533097200, 1533097500, 1533097800, 1533098100, 1533098400,  1533098700, 1533099000, 1533099300, 1533099600, 1533099900, 1533100200,  1533100500, 1533100800, 1533101100, 1533101400, 1533101700, 1533102000,  1533102300, 1533102600, 1533102900, 1533103200, 1533103500, 1533103800,  1533104100, 1533104400, 1533104700, 1533105000, 1533105300, 1533105600,  1533105900, 1533106200, 1533106500, 1533106800, 1533107100, 1533107400,  1533107700, 1533108000, 1533108300, 1533108600, 1533108900, 1533109200,  1533109500, 1533109800, 1533110100), tzone = "UTC", class = c("POSIXct",  "POSIXt")), depth = c(NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,  NA_real_, NA_real_)), class = "data.frame", row.names = c(NA, 
-48L))

Update 2 (sorry!)

30 2018-08-01 06:25:00    NA
31 2018-08-01 06:30:00    NA
32 2018-08-01 06:35:00    NA

seem to be identified as falling within the following intervals:

diveno               start                 fin dateoccurs
1        1 2018-08-01 08:20:40 2018-08-01 08:39:20          Y
2        2 2018-08-01 08:40:50 2018-08-01 08:53:40          Y
3        3 2018-08-01 10:01:00 2018-08-01 10:16:30          N

Why is this happening?

library(lubridate)
s <- df1[1, 2]
f <- df1[1, 3]
int <- interval(s, f)
df2[, 1] %within% int
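One thing worth checking here is the time zone. The updated df2 in the dput output carries tzone = "UTC", while df1 was built with tz = "CET". POSIXct values are compared as absolute instants, so 06:25:00 UTC is the same moment as 08:25:00 in the CET zone during summer time, which lies inside the first interval. The sketch below reproduces this with single values; d_utc, in_interval and d_local are illustrative names, and whether the real dates were meant as UTC or as local wall-clock time is an assumption only the data owner can settle.

```r
# Row 30 of the updated df2, stored as UTC
d_utc <- as.POSIXct("2018-08-01 06:25:00", tz = "UTC")

# The same instant, printed in the time zone used for df1
format(d_utc, tz = "CET", usetz = TRUE)

# Bounds of the first interval, as in df1
s <- as.POSIXct("2018-08-01 08:20:40", tz = "CET")
f <- as.POSIXct("2018-08-01 08:39:20", tz = "CET")

# Comparison is done on the underlying instant, not the printed clock time
in_interval <- d_utc >= s & d_utc <= f
in_interval

# If the clock time 06:25:00 was actually recorded in local time,
# reinterpret it in CET before comparing
d_local <- as.POSIXct(format(d_utc, "%Y-%m-%d %H:%M:%S"), tz = "CET")
d_local >= s & d_local <= f
```

If the dates really are UTC, the matches are genuine and not false positives. If they are local times that were read in as UTC, the last step shows one base R way to reinterpret them.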

A data.table solution using a non-equi join:

library(data.table)
setDT(df1); setDT(df2)
# initialise new column with "N"
df1[, dateoccurs := "N"]
# update join
df1[df2, dateoccurs := "Y", on = .(start <= date, fin >= date)][]
#    diveno               start                 fin dateoccurs
# 1:      1 2018-08-01 08:20:40 2018-08-01 08:39:20          Y
# 2:      2 2018-08-01 08:40:50 2018-08-01 08:53:40          N
# 3:      3 2018-08-01 10:01:00 2018-08-01 10:16:30          N
# 4:      4 2018-08-01 15:45:30 2018-08-01 15:58:20          Y
# 5:      5 2018-08-01 17:06:00 2018-08-01 17:18:20          N

You could use outer:

fun <- function(i, j) data.table::between(df2[i, 'date'], df1[j, 'start'], df1[j, 'fin'])
df1$occ <- colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun)))
df1
#   diveno               start                 fin occ
# 1      1 2018-08-01 08:20:40 2018-08-01 08:39:20   1
# 2      2 2018-08-01 08:40:50 2018-08-01 08:53:40   0
# 3      3 2018-08-01 10:01:00 2018-08-01 10:16:30   0
# 4      4 2018-08-01 15:45:30 2018-08-01 15:58:20   1
# 5      5 2018-08-01 17:06:00 2018-08-01 17:18:20   0

If you like, the resulting column can easily be wrapped as a factor (this works as long as the counts are only 0 and 1):

df1$occ <- colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun))) |> 
factor(labels=c("N", "Y"))
df1
#   diveno               start                 fin occ
# 1      1 2018-08-01 08:20:40 2018-08-01 08:39:20   Y
# 2      2 2018-08-01 08:40:50 2018-08-01 08:53:40   N
# 3      3 2018-08-01 10:01:00 2018-08-01 10:16:30   N
# 4      4 2018-08-01 15:45:30 2018-08-01 15:58:20   Y
# 5      5 2018-08-01 17:06:00 2018-08-01 17:18:20   N
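Note that colSums returns a count of matching dates per interval, so with more than one date in the same interval the factor call above would see three or more distinct values and labels = c("N", "Y") would no longer fit. A minimal variant, assuming only a yes/no flag is wanted, compares the count against zero first. It also applies outer directly to the numeric time stamps instead of going through Vectorize; the name hits and the extra 08:35:00 date are illustrative additions.

```r
# Toy intervals from the question
df1 <- data.frame(
  diveno = c(1, 2, 3, 4, 5),
  start = as.POSIXct(c("2018-08-01 08:20:40", "2018-08-01 08:40:50",
                       "2018-08-01 10:01:00", "2018-08-01 15:45:30",
                       "2018-08-01 17:06:00"), tz = "CET"),
  fin = as.POSIXct(c("2018-08-01 08:39:20", "2018-08-01 08:53:40",
                     "2018-08-01 10:16:30", "2018-08-01 15:58:20",
                     "2018-08-01 17:18:20"), tz = "CET")
)
# Two dates fall in the first interval, one in the fourth
df2 <- data.frame(
  date = as.POSIXct(c("2018-08-01 08:30:00", "2018-08-01 08:35:00",
                      "2018-08-01 15:47:00"), tz = "CET")
)

# hits[i, j] is TRUE when date i lies inside interval j (bounds inclusive)
hits <- outer(as.numeric(df2$date), as.numeric(df1$start), ">=") &
  outer(as.numeric(df2$date), as.numeric(df1$fin), "<=")

# Flag an interval when at least one date falls inside it
df1$occ <- factor(colSums(hits) > 0,
                  levels = c(FALSE, TRUE), labels = c("N", "Y"))
df1
```

For roughly 2000 intervals and 2000 dates the logical matrix has about four million cells, which should be manageable in memory; for much larger inputs the join-based approach is likely the better fit.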

Data:

df1 <- structure(list(diveno = c(1, 2, 3, 4, 5), start = structure(c(1533104440, 
1533105650, 1533110460, 1533131130, 1533135960), class = c("POSIXct", 
"POSIXt"), tzone = "CET"), fin = structure(c(1533105560, 1533106420, 
1533111390, 1533131900, 1533136700), class = c("POSIXct", "POSIXt"
), tzone = "CET"), occ = structure(c(2L, 1L, 1L, 2L, 1L), levels = c("N", 
"Y"), class = "factor")), row.names = c(NA, -5L), class = "data.frame")
df2 <- structure(list(date = structure(c(1533105000, 1533131220, 1533222600
), class = c("POSIXct", "POSIXt"), tzone = "CET")), row.names = c(NA, 
-3L), class = "data.frame")
