R - data.table: fast way to compute statistics of row times within a two-sided moving time window


library(data.table)
library(lubridate)
df <- data.table(col1 = c('A', 'A', 'A', 'B', 'B', 'B'), col2 = c("2015-03-06 01:37:57", "2015-03-06 01:39:57", "2015-03-06 01:45:28", "2015-03-06 02:31:44", "2015-03-06 03:55:45", "2015-03-06 04:01:40"))

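For each row, I want to calculate the standard deviation of the times (col2) of the rows that have the same value of "col1" and whose time lies within the window from 10 minutes before this row's time (inclusive) to 10 minutes after this row's time (inclusive).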

I tried a fast approach based on the solution to a previous question:

# Parse the timestamps; gap is the half-width of the window in minutes
df$col2 <- as_datetime(df$col2)
gap <- 10L
df[, feat1 := .SD[.(col1 = col1, t1 = col2 - gap * 60L, t2 = col2 + gap * 60L)
                  , on = .(col1, col2 >= t1, col2 <= t2)
                  , .(col1, col2 = x.col2, times = as.numeric(col2))
                  ][, .(sd_times = sd(times))
                    , by = .(col1, col2)]$sd_times][]

But I get the following error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__,  : 
  Join results in 14 rows; more than 12 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.
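
The row count in the message can be checked directly. The sketch below assumes the example data from the question and uses base `as.POSIXct` in place of lubridate to stay self-contained; the helper table `windows` and the names `counts` and `total` are introduced here purely for illustration. It counts how many rows fall inside each row's window.

```r
library(data.table)

df <- data.table(
  col1 = c("A", "A", "A", "B", "B", "B"),
  col2 = as.POSIXct(c("2015-03-06 01:37:57", "2015-03-06 01:39:57",
                      "2015-03-06 01:45:28", "2015-03-06 02:31:44",
                      "2015-03-06 03:55:45", "2015-03-06 04:01:40"),
                    tz = "UTC")
)
gap <- 10L

# One window per row: [col2 - 10 min, col2 + 10 min] (hypothetical helper table)
windows <- df[, .(col1, t1 = col2 - gap * 60, t2 = col2 + gap * 60)]

# by = .EACHI evaluates j once per row of `windows`,
# so .N is the number of rows of df inside that window
counts <- df[windows, on = .(col1, col2 >= t1, col2 <= t2), .N, by = .EACHI]

# Total number of rows the full non-equi join would return
total <- sum(counts$N)
```

With this data the three "A" rows all lie in each other's windows, so `total` comes to 14, the figure quoted in the error message and more than the 12 = nrow(x)+nrow(i) that the message mentions.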

I solved my task using Frank's comment above, by adding allow.cartesian=TRUE to the join:

df[, feat1 := .SD[.(col1 = col1, t1 = col2 - gap * 60L, t2 = col2 + gap * 60L)
                  , on = .(col1, col2 >= t1, col2 <= t2)
                  , .(col1, col2 = x.col2, times = as.numeric(col2)), allow.cartesian=TRUE
                  ][, .(sd_times = sd(times))
                    , by = .(col1, col2)]$sd_times][]
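
The error message also suggests by=.EACHI as a way to avoid the large allocation. The following is a minimal sketch of that variant, not the approach from Frank's comment. It assumes the example data from the question, uses base `as.POSIXct` instead of lubridate, and introduces the helper names `windows` and `res` for illustration only.

```r
library(data.table)

df <- data.table(
  col1 = c("A", "A", "A", "B", "B", "B"),
  col2 = as.POSIXct(c("2015-03-06 01:37:57", "2015-03-06 01:39:57",
                      "2015-03-06 01:45:28", "2015-03-06 02:31:44",
                      "2015-03-06 03:55:45", "2015-03-06 04:01:40"),
                    tz = "UTC")
)
gap <- 10L

# One window per row: [col2 - 10 min, col2 + 10 min] (hypothetical helper table)
windows <- df[, .(col1, t1 = col2 - gap * 60, t2 = col2 + gap * 60)]

# For each window, take the standard deviation (in seconds) of the matching
# row times. x.col2 refers explicitly to the time column of df, not to the
# window bounds. by = .EACHI returns one row per row of `windows`, in order.
res <- df[windows,
          on = .(col1, col2 >= t1, col2 <= t2),
          .(sd_times = sd(as.numeric(x.col2))),
          by = .EACHI]

# Rows of `res` line up with rows of df, so the result can be assigned directly
df[, feat1 := res$sd_times]
```

Because the aggregation runs once per window, the 14-row intermediate table is never materialised and allow.cartesian is not needed. A row whose window contains only itself (the fourth row here) gets NA, since `sd()` of a single value is NA.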
