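This is my first post here! I'm stuck on something that I'm sure is easy to do.
I have a data frame of irregular intervals and a vector of dates. If one of the dates falls within any given interval, I'd like a new column to flag it (because that interval needs to be removed). This is similar to this post, but the solution there doesn't work because my intervals are irregular. I have over 2000 intervals and 2000 dates.
I can get the dates that fall within a given interval using the %within% function, but that's no good because I can't tell which interval a date fell in.
I tried the solutions in this similar post, but I have no grouping variable and can't get them to work.
Any suggestions would be very helpful!! Thanks so much!!
Example raw data (not as lubridate intervals):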
>df1
diveno start fin
1 1 2018-08-01 08:20:40 2018-08-01 08:39:20
2 2 2018-08-01 08:40:50 2018-08-01 08:53:40
3 3 2018-08-01 10:01:00 2018-08-01 10:16:30
4 4 2018-08-01 15:45:30 2018-08-01 15:58:20
5 5 2018-08-01 17:06:00 2018-08-01 17:18:20
>df2
date
1 2018-08-01 08:30:00
2 2018-08-01 15:47:00
3 2018-08-02 17:10:00
What I'd like
> df3
diveno start fin dateoccurs
1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 N
3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
4 4 2018-08-01 15:45:30 2018-08-01 15:58:20 Y
5 5 2018-08-01 17:06:00 2018-08-01 17:18:20 N
The dateoccurs column flags whether any date in df2 falls within the given interval in df1.
Code for the example data:
df1<-data.frame(diveno=c(1,2,3,4,5),
start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"))
df1$start <- as.POSIXct(df1$start,format="%Y-%m-%d %H:%M:%S",tz="CET")
df1$fin <- as.POSIXct(df1$fin,format="%Y-%m-%d %H:%M:%S",tz="CET")
df2<-data.frame(date=c("2018-08-01 08:30:00", "2018-08-01 15:47:00", "2018-08-02 17:10:00"))
df2$date <- as.POSIXct(df2$date,format="%Y-%m-%d %H:%M:%S",tz="CET")
What I need:
df3<-data.frame(diveno=c(1,2,3,4,5),
start=c("2018-08-01 08:20:40","2018-08-01 08:40:50", "2018-08-01 10:01:00","2018-08-01 15:45:30","2018-08-01 17:06:00"),
fin=c("2018-08-01 08:39:20","2018-08-01 08:53:40","2018-08-01 10:16:30","2018-08-01 15:58:20", "2018-08-01 17:18:20"),
dateoccurs=c("Y","N","N","Y","N"))
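The closest I've got is from this post, but it returns altered 'fin' times, and when applied to the real, much larger dataset it seems to duplicate values and change the number of 'diveno'!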
intervals<-df1
elements<-df2[,1]
library(data.table) #v1.10.0
j<-setDT(intervals)[data.table(elements), on = .(start <= elements, fin >= elements)]
j2<-as.data.frame(j)
na.omit(j2)
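A note on the altered 'fin' times: in a data.table non-equi join of the form x[i, on = ...], the join columns in the result take their values from i, so both start and fin show the date that was looked up rather than the interval bounds. The result also has one row per date (more if a date matches several intervals), which would explain the changed row count. A minimal sketch, assuming the example data above, that keeps the original bounds by using the x. prefix (the names hits and elements are only for illustration):

```r
library(data.table)

# Example intervals and dates from the question
intervals <- data.table(
  diveno = 1:5,
  start = as.POSIXct(c("2018-08-01 08:20:40", "2018-08-01 08:40:50",
                       "2018-08-01 10:01:00", "2018-08-01 15:45:30",
                       "2018-08-01 17:06:00"), tz = "CET"),
  fin = as.POSIXct(c("2018-08-01 08:39:20", "2018-08-01 08:53:40",
                     "2018-08-01 10:16:30", "2018-08-01 15:58:20",
                     "2018-08-01 17:18:20"), tz = "CET")
)
elements <- data.table(
  date = as.POSIXct(c("2018-08-01 08:30:00", "2018-08-01 15:47:00",
                      "2018-08-02 17:10:00"), tz = "CET")
)

# x.start / x.fin are the original interval bounds; i.date is the looked-up date.
# nomatch = NULL drops dates that fall in no interval (use nomatch = 0L on
# older data.table versions).
hits <- intervals[elements,
                  .(diveno, start = x.start, fin = x.fin, date = i.date),
                  on = .(start <= date, fin >= date),
                  nomatch = NULL]
hits
```

With the example data this should leave only the rows for dives 1 and 4, each with its original start and fin.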
Update: new sample data for df2 that seems to produce false positives?
> dput(df2) structure(list(date = structure(c(1533096000, 1533096300, 1533096600, 1533096900, 1533097200, 1533097500, 1533097800, 1533098100, 1533098400, 1533098700, 1533099000, 1533099300, 1533099600, 1533099900, 1533100200, 1533100500, 1533100800, 1533101100, 1533101400, 1533101700, 1533102000, 1533102300, 1533102600, 1533102900, 1533103200, 1533103500, 1533103800, 1533104100, 1533104400, 1533104700, 1533105000, 1533105300, 1533105600, 1533105900, 1533106200, 1533106500, 1533106800, 1533107100, 1533107400, 1533107700, 1533108000, 1533108300, 1533108600, 1533108900, 1533109200, 1533109500, 1533109800, 1533110100), tzone = "UTC", class = c("POSIXct", "POSIXt")), depth = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_)), class = "data.frame", row.names = c(NA,
-48L))
Update 2 (sorry!)
30 2018-08-01 06:25:00 NA
31 2018-08-01 06:30:00 NA
32 2018-08-01 06:35:00 NA
seem to be identified as falling within the following intervals:
diveno start fin dateoccurs
1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 Y
3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
Why does this happen?
library(lubridate)
s <- df1[1, 2]
f <- df1[1, 3]
int <- interval(s, f)
df2[, 1] %within% int
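One likely explanation, judging from the dput output above: the updated df2 is stored with tzone = "UTC", while df1 was created with tz = "CET". POSIXct values are absolute instants, and the time zone attribute only affects how they are printed, so 06:25:00 UTC is the same instant as 08:25:00 CEST and really does lie inside the first interval. The sketch below illustrates this with base R; whether the df2 clock times were actually meant as local CET times is an assumption only the poster can confirm.

```r
# The same instant, printed in two different time zones
t_utc <- as.POSIXct("2018-08-01 06:25:00", tz = "UTC")
t_cet <- as.POSIXct("2018-08-01 08:25:00", tz = "CET")
t_utc == t_cet                           # TRUE
format(t_utc, tz = "CET", usetz = TRUE)  # "2018-08-01 08:25:00 CEST"

# First interval from df1
start1 <- as.POSIXct("2018-08-01 08:20:40", tz = "CET")
fin1   <- as.POSIXct("2018-08-01 08:39:20", tz = "CET")
inside_as_utc <- t_utc >= start1 & t_utc <= fin1

# ASSUMPTION: if the clock time 06:25:00 was really recorded in CET,
# keep the clock time and change the zone (lubridate::force_tz does the same)
relabelled <- as.POSIXct(format(t_utc), tz = "CET")
inside_relabelled <- relabelled >= start1 & relabelled <= fin1
c(inside_as_utc, inside_relabelled)
```

If the times are genuinely UTC, the matches are correct and nothing needs fixing; if they are local times mislabelled as UTC, relabelling them before the join should remove the apparent false positives.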
A data.table solution using a non-equi join
library(data.table)
setDT(df1); setDT(df2)
# initialise new column with "N"
df1[, dateoccurs := "N"]
# update join
df1[df2, dateoccurs := "Y", on = .(start <= date, fin >= date)][]
# diveno start fin dateoccurs
# 1: 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
# 2: 2 2018-08-01 08:40:50 2018-08-01 08:53:40 N
# 3: 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
# 4: 4 2018-08-01 15:45:30 2018-08-01 15:58:20 Y
# 5: 5 2018-08-01 17:06:00 2018-08-01 17:18:20 N
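The same kind of update join can be run in the other direction to record which interval each date fell in, which the question also asks about. This is a minimal sketch that rebuilds the example data so it runs on its own; the new diveno column on df2 is a naming choice for illustration, and it assumes the intervals do not overlap (with overlaps, only one matching dive would be kept per date).

```r
library(data.table)

# Example data from the question
df1 <- data.table(
  diveno = c(1, 2, 3, 4, 5),
  start = as.POSIXct(c("2018-08-01 08:20:40", "2018-08-01 08:40:50",
                       "2018-08-01 10:01:00", "2018-08-01 15:45:30",
                       "2018-08-01 17:06:00"), tz = "CET"),
  fin = as.POSIXct(c("2018-08-01 08:39:20", "2018-08-01 08:53:40",
                     "2018-08-01 10:16:30", "2018-08-01 15:58:20",
                     "2018-08-01 17:18:20"), tz = "CET")
)
df2 <- data.table(
  date = as.POSIXct(c("2018-08-01 08:30:00", "2018-08-01 15:47:00",
                      "2018-08-02 17:10:00"), tz = "CET")
)

# For every interval in df1, tag the df2 rows whose date lies inside it
# with that interval's dive number; unmatched dates stay NA
df2[df1, diveno := i.diveno, on = .(date >= start, date <= fin)]
df2
```

Dates that belong to no interval keep NA in diveno, so !is.na(df2$diveno) gives the set of dates that occurred during a dive.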
You can use outer:
fun <- function(i, j) data.table::between(df2[i, 'date'], df1[j, 'start'], df1[j, 'fin'])
df1$occ <- colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun)))
df1
# diveno start fin occ
# 1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 1
# 2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 0
# 3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 0
# 4 4 2018-08-01 15:45:30 2018-08-01 15:58:20 1
# 5 5 2018-08-01 17:06:00 2018-08-01 17:18:20 0
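If you like, the column can easily be wrapped as a factor. Note that occ is a count of matching dates, so it is compared with 0 first; otherwise an interval containing more than one date would not fit the two labels.
df1$occ <- (colSums(outer(seq_len(nrow(df2)), seq_len(nrow(df1)), Vectorize(fun))) > 0) |>
  factor(levels=c(FALSE, TRUE), labels=c("N", "Y"))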
df1
# diveno start fin occ
# 1 1 2018-08-01 08:20:40 2018-08-01 08:39:20 Y
# 2 2 2018-08-01 08:40:50 2018-08-01 08:53:40 N
# 3 3 2018-08-01 10:01:00 2018-08-01 10:16:30 N
# 4 4 2018-08-01 15:45:30 2018-08-01 15:58:20 Y
# 5 5 2018-08-01 17:06:00 2018-08-01 17:18:20 N
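For a few thousand intervals and dates, outer with Vectorize calls fun once per interval/date pair. A base R alternative, sketched here under the assumption that only a yes/no flag per interval is needed, loops over the intervals and compares the whole date vector at once (in_any is just an illustrative name):

```r
# Example data from the question
df1 <- data.frame(
  diveno = c(1, 2, 3, 4, 5),
  start = as.POSIXct(c("2018-08-01 08:20:40", "2018-08-01 08:40:50",
                       "2018-08-01 10:01:00", "2018-08-01 15:45:30",
                       "2018-08-01 17:06:00"), tz = "CET"),
  fin = as.POSIXct(c("2018-08-01 08:39:20", "2018-08-01 08:53:40",
                     "2018-08-01 10:16:30", "2018-08-01 15:58:20",
                     "2018-08-01 17:18:20"), tz = "CET")
)
df2 <- data.frame(
  date = as.POSIXct(c("2018-08-01 08:30:00", "2018-08-01 15:47:00",
                      "2018-08-02 17:10:00"), tz = "CET")
)

# For each interval j: does any date lie between start[j] and fin[j] (inclusive)?
in_any <- vapply(
  seq_len(nrow(df1)),
  function(j) any(df2$date >= df1$start[j] & df2$date <= df1$fin[j]),
  logical(1)
)
df1$dateoccurs <- ifelse(in_any, "Y", "N")
df1
```

Because any() collapses multiple hits to a single TRUE, an interval containing several dates is still flagged "Y" exactly once.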
Data:
df1 <- structure(list(diveno = c(1, 2, 3, 4, 5), start = structure(c(1533104440,
1533105650, 1533110460, 1533131130, 1533135960), class = c("POSIXct",
"POSIXt"), tzone = "CET"), fin = structure(c(1533105560, 1533106420,
1533111390, 1533131900, 1533136700), class = c("POSIXct", "POSIXt"
), tzone = "CET"), occ = structure(c(2L, 1L, 1L, 2L, 1L), levels = c("N",
"Y"), class = "factor")), row.names = c(NA, -5L), class = "data.frame")
df2 <- structure(list(date = structure(c(1533105000, 1533131220, 1533222600
), class = c("POSIXct", "POSIXt"), tzone = "CET")), row.names = c(NA,
-3L), class = "data.frame")