How to find the proportion of IDs with overlapping date intervals in a very large dataset in R



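So far I have tried various answers from here:

  • Combining IRanges objects and maintaining mcols
  • Find all date ranges with overlapping start and end dates in R
  • Finding overlapping interval groups with data.table
  • Finding all overlaps in one iteration of foverlaps in R's data.table
  • Find dates within a period interval by group
  • R: find overlap among time periods
  • Detect overlapping dates by group with R

Some of them work, but performance is poor on very large datasets (8-12 million rows).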

Here is some sample code I have been trying:

library(tidyverse)
library(data.table)
size = 10000
df <- data.frame(
  ID = sample(1:round(size / 5, 0)),
  period = sample(c(5,10,30,45), size, replace = TRUE),
  start = sample(seq(
    as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
  ), size, replace = TRUE)
) %>% mutate(end = start + period)

dt <-
  data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]
overlapping <-
  unique(foverlaps(dt, dt)[ID == i.ID & row != i.row, ID])
dt[, `:=`(Overlap = FALSE)][ID %in% overlapping, Overlap :=
  TRUE][order(ID, start)] %>%
  distinct(ID,Overlap) %>%
  count(Overlap) %>%
  mutate(freq = n/sum(n))

This works, but as the dataset grows it either becomes slow or fails with a negative length vector error:

Error in foverlaps(dt, dt) : negative length vectors are not allowed

Is there a better way?

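You can join on ID directly inside foverlaps, so that only intervals belonging to the same ID are compared, and then count the number of overlaps per row: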

size = 1e5
df <- data.frame(
  ID = sample(1:round(size / 5, 0)),
  period = sample(c(5,10,30,45), size, replace = TRUE),
  start = sample(seq(
    as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
  ), size, replace = TRUE)
) %>% mutate(end = start + period)
dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]
# Key on ID first so the overlap join only pairs rows with the same ID
setkey(dt,ID,start,end)
# Each row always matches itself, so noverlap > 1 means a real overlap
foverlaps(dt,dt,by.x=c("ID","start","end"),by.y=c("ID","start","end"))[
  ,.(noverlap=.N),by=.(ID,row)][
  ,.(overlap = max(noverlap>1)),by=ID][
  ,.(n=.N),by=.(overlap)][
  ,pct:=n/sum(n)][]
Overlap    n   freq
1:   FALSE  547 0.2735
2:    TRUE 1453 0.7265
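
To make the counting logic easier to check by hand, here is a minimal sketch of the same keyed self-join on a tiny table. The `toy` data and the intermediate names (`matches`, `per_row`, `per_id`, `result`) are made up for illustration, and plain integer day numbers stand in for dates. Note that foverlaps treats intervals as closed, so two intervals that merely touch at an endpoint count as overlapping.

```r
library(data.table)

# Hypothetical toy data: ID 1 has two overlapping intervals,
# ID 2 has two disjoint intervals, ID 3 has a single interval.
toy <- data.table(
  ID    = c(1L, 1L, 2L, 2L, 3L),
  start = c(1L, 5L, 32L, 60L, 3L),
  end   = c(10L, 20L, 41L, 69L, 8L)
)
toy[, row := .I]
setkey(toy, ID, start, end)

# Self-join restricted to equal ID; the last two key columns are the interval.
matches <- foverlaps(toy, toy)

# Every row matches itself once, so more than one match means a real overlap.
per_row <- matches[, .(noverlap = .N), by = .(ID, row)]
per_id  <- per_row[, .(overlap = any(noverlap > 1)), by = ID]
result  <- per_id[, .(n = .N), by = overlap][, pct := n / sum(n)][]
print(result)
```

Here only ID 1 should be flagged, giving one overlapping ID out of three.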

Performance comparison:

microbenchmark::microbenchmark(old(),new())
Unit: milliseconds
expr      min       lq      mean   median        uq       max neval
old() 672.6338 685.8825 788.78851 694.7804 864.95855 1311.9752   100
new()  16.9942  17.7659  24.66032  18.7095  20.59965   63.3928   100
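
If even the keyed join is too heavy at 8-12 million rows, one alternative worth trying is a sort-based check that never materialises the overlapping pairs. This is a sketch of a different technique, not the approach benchmarked above, and it has not been timed here. `overlap_share` is a hypothetical helper name; it assumes columns named ID, start and end, and closed intervals (matching the foverlaps default).

```r
library(data.table)

# Hypothetical helper (not part of data.table): share of IDs that have at
# least one pair of overlapping closed intervals.
overlap_share <- function(dt) {
  x <- copy(dt)                      # leave the caller's table untouched
  setorder(x, ID, start, end)
  # Largest 'end' seen among earlier rows of the same ID (NA for the first row)
  x[, prev_max_end := shift(cummax(as.numeric(end))), by = ID]
  # After sorting by start, a row overlaps an earlier row exactly when it
  # starts on or before that running maximum
  per_id <- x[, .(overlap = any(as.numeric(start) <= prev_max_end, na.rm = TRUE)),
              by = ID]
  per_id[, .(n = .N), by = overlap][, pct := n / sum(n)][]
}

# Usage on made-up data: only ID 1 contains overlapping intervals
toy <- data.table(
  ID    = c(1L, 1L, 2L, 2L, 3L),
  start = as.Date(c("1999-01-01", "1999-01-05", "1999-02-01", "1999-03-01", "1999-01-03")),
  end   = as.Date(c("1999-01-10", "1999-01-20", "1999-02-10", "1999-03-10", "1999-01-08"))
)
res <- overlap_share(toy)
print(res)
```

The conversion with as.numeric is there because cummax is not defined for Date objects; the work is one sort plus a grouped cumulative maximum, so the intermediate result stays the same size as the input.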
