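So far I have tried the various answers here:
- Combine IRanges objects and maintain mcols
- Find all date ranges for overlapping start and end dates in R
- Find groups of overlapping intervals with data.table
- Finding all overlaps in one iteration of foverlaps in R's data.table
- Find dates within a period interval by group
- R Find overlap among time periods
- Detect overlapping dates by group with R
Some of them work, but performance is poor on very large datasets (8-12 million rows).
Here is some sample code I have been trying: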
library(tidyverse)
library(data.table)
size = 10000
df <- data.frame(
ID = sample(1:round(size / 5, 0)),
period = sample(c(5,10,30,45), size, replace = TRUE),
start = sample(seq(
as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
), size, replace = TRUE)
) %>% mutate(end = start + period)
dt <-
data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]
overlapping <-
unique(foverlaps(dt, dt)[ID == i.ID & row != i.row, ID])
dt[, `:=`(Overlap = FALSE)][ID %in% overlapping, Overlap :=
TRUE][order(ID, start)] %>%
distinct(ID,Overlap) %>%
count(Overlap) %>%
mutate(freq = n/sum(n))
This works fine, but as the dataset grows it either becomes slow or fails with a negative-length vector error:
Error in foverlaps(dt, dt) : negative length vectors are not allowed
Is there a better way?
You can join on ID directly inside foverlaps and count the number of overlaps:
size = 1e5
df <- data.frame(
ID = sample(1:round(size / 5, 0)),
period = sample(c(5,10,30,45), size, replace = TRUE),
start = sample(seq(
as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
), size, replace = TRUE)
) %>% mutate(end = start + period)
dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]
setkey(dt,ID,start,end)
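# Self-join within each ID; every row matches itself once,
# so a count above 1 means the row overlaps another row of the same ID.
foverlaps(dt, dt, by.x = c("ID", "start", "end"), by.y = c("ID", "start", "end"))[
  , .(noverlap = .N), by = .(ID, row)][
  , .(Overlap = any(noverlap > 1)), by = ID][
  , .(n = .N), by = .(Overlap)][
  , freq := n / sum(n)][]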
   Overlap    n   freq
1:   FALSE  547 0.2735
2:    TRUE 1453 0.7265
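To see what the ID-keyed join does, a tiny hand-checkable case may help. This is only a sketch on made-up data: the `toy` table and its values are illustrative and not from the original post. ID 1 has two intervals that overlap and ID 2 has two that do not. Because ID is part of the key, rows are only ever matched within the same ID, which is what keeps the intermediate result small.

```r
library(data.table)

# Illustrative toy data (an assumption, not from the original post):
# ID 1 has two overlapping intervals, ID 2 has two disjoint ones.
toy <- data.table(
  ID    = c(1L, 1L, 2L, 2L),
  start = as.Date(c("1999-01-01", "1999-01-05", "1999-01-01", "1999-02-01")),
  end   = as.Date(c("1999-01-10", "1999-01-15", "1999-01-10", "1999-02-10"))
)
toy[, row := .I]
setkey(toy, ID, start, end)  # the last two key columns are the interval bounds

# Self-join on the key: each row always matches itself once,
# so a count above 1 means a real overlap with another row of the same ID.
hits <- foverlaps(toy, toy)[, .(noverlap = .N), by = .(ID, row)]
res  <- hits[, .(Overlap = any(noverlap > 1)), by = ID]
print(res)
```

Note that foverlaps treats intervals as closed, so two intervals that merely share an endpoint are counted as overlapping.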
Performance comparison:
microbenchmark::microbenchmark(old(),new())
Unit: milliseconds
  expr      min       lq      mean   median        uq       max neval
 old() 672.6338 685.8825 788.78851 694.7804 864.95855 1311.9752   100
 new()  16.9942  17.7659  24.66032  18.7095  20.59965   63.3928   100
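The definitions of old() and new() are not shown above. A plausible reconstruction, assuming they simply wrap the question's approach and the ID-keyed approach, might look like the sketch below. The function bodies, the seed, and the use of size = 1e4 are assumptions made for illustration, not the code that produced the timings.

```r
library(data.table)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
size <- 1e4
df <- data.frame(
  ID     = sample(1:round(size / 5, 0)),  # recycled to `size` rows
  period = sample(c(5, 10, 30, 45), size, replace = TRUE),
  start  = sample(seq(as.Date("1999/01/01"), as.Date("2000/01/01"), by = "day"),
                  size, replace = TRUE)
)
df$end <- df$start + df$period

# Hypothetical wrapper for the question's approach:
# join on the interval only, then filter to matching IDs afterwards.
old <- function() {
  dt <- data.table(df, key = c("start", "end"))[, row := .I]
  overlapping <- unique(foverlaps(dt, dt)[ID == i.ID & row != i.row, ID])
  dt[, Overlap := ID %in% overlapping]
  unique(dt[, .(ID, Overlap)])[, .(n = .N), by = Overlap][, freq := n / sum(n)][]
}

# Hypothetical wrapper for the answer's approach:
# ID is part of the key, so only rows of the same ID are compared.
new <- function() {
  dt <- as.data.table(df)[, row := .I]
  setkey(dt, ID, start, end)
  foverlaps(dt, dt)[, .(noverlap = .N), by = .(ID, row)][
    , .(Overlap = any(noverlap > 1)), by = ID][
    , .(n = .N), by = Overlap][
    , freq := n / sum(n)][]
}

# Both should classify every ID identically.
r_old <- old()[order(Overlap)]
r_new <- new()[order(Overlap)]
print(r_old)
print(r_new)

# Timing, if the microbenchmark package is installed:
# microbenchmark::microbenchmark(old(), new(), times = 10)
```

Checking that the two functions return the same counts before timing them is a cheap way to confirm the faster version has not changed the result.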