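r - Union of time intervals that are not necessarily contiguous

I am looking for an implementation of union for time intervals that can handle unions which are not themselves a single interval.

I noticed that lubridate includes a union function for time intervals, but it always returns a single interval even when the union is not an interval. That is, it returns the interval defined by the minimum of the two start dates and the maximum of the two end dates, ignoring the intervening period that neither interval covers: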

library(lubridate)
int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
union(int1, int2)
# Union includes intervening time between intervals.
# [1] 2001-01-01 UTC--2004-01-01 UTC

I have also looked at the interval package, but its documentation makes no mention of union.

My ultimate goal is to use this complex union with %within%:

my_int %within% Reduce(union, list_of_intervals)

So, to take a concrete example, suppose list_of_intervals is:

[[1]] 2000-01-01 -- 2001-01-02 
[[2]] 2001-01-01 -- 2004-01-02 
[[3]] 2005-01-01 -- 2006-01-02

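Then my_int <- 2001-01-01 -- 2004-01-01 is %within% the union of list_of_intervals (the first two intervals overlap and together cover 2000-01-01 -- 2004-01-02), so it should return TRUE, whereas my_int <- 2003-01-01 -- 2006-01-01 is not, because it spans the gap between 2004-01-02 and 2005-01-01, so it should return FALSE.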

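However, I suspect this kind of complex union has more uses than just this one.

If I understand your question correctly, you want to start with a set of possibly overlapping intervals and obtain a list of intervals that represents the UNION of the input set, rather than a single interval spanning the minimum and maximum of the input set. This is the same problem I had.

A similar question was asked: Union of intervals

But the accepted answer there fails on overlapping intervals. However, hosolmaz (I am new to SO, so I don't know how to link to this user) posted a modification (in Python) that fixes the problem, which I then converted to R as follows: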

library(dplyr) # for %>%, arrange, bind_rows
interval_union <- function(input) {
  # Nothing to merge with zero or one interval
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by start so each interval only needs comparing with the last merged one
  input <- input %>% arrange(start)
  output <- input[1, ]
  for (i in 2:nrow(input)) {
    x <- input[i, ]
    last <- nrow(output)
    if (output$stop[last] < x$start) {
      # Gap before x: start a new merged interval
      output <- bind_rows(output, x)
    } else if (x$stop > output$stop[last]) {
      # x overlaps or touches the last merged interval: extend it
      output$stop[last] <- x$stop
    }
  }
  return(output)
}

Take an example with both overlapping and non-contiguous intervals:

d <- as.data.frame(list(
  start = c('2005-01-01', '2000-01-01', '2001-01-01'),
  stop = c('2006-01-02', '2001-01-02', '2004-01-02')),
  stringsAsFactors = FALSE)

This produces:

> d
       start       stop
1 2005-01-01 2006-01-02
2 2000-01-01 2001-01-02
3 2001-01-01 2004-01-02
> interval_union(d)
       start       stop
1 2000-01-01 2004-01-02
2 2005-01-01 2006-01-02

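I am new to R programming, so it would be great if someone could adapt the interval_union() function above to accept not only the input data frame but also the names of the "start" and "stop" columns as arguments, so that the function can be reused more easily.

In the example you provided, the union of int1 and int2 can be seen as a vector containing the two intervals: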
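One possible way to do what is asked above is sketched below. It is not from the original thread: the function name interval_union_cols and its arguments start_col and stop_col are hypothetical, it uses base R only instead of dplyr, and it keeps only the two interval columns in the result. It assumes the column values compare correctly with < and > (for example Date objects or ISO-formatted date strings).

```r
# Hypothetical generalisation of interval_union(): the column names are
# passed as strings. Only the two interval columns are returned.
interval_union_cols <- function(input, start_col = "start", stop_col = "stop") {
  input <- input[, c(start_col, stop_col), drop = FALSE]
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by start so each row only needs comparing with the last merged interval
  input <- input[order(input[[start_col]]), , drop = FALSE]
  starts <- input[[start_col]]
  stops <- input[[stop_col]]
  out_start <- starts[1]
  out_stop <- stops[1]
  for (i in 2:nrow(input)) {
    n <- length(out_stop)
    if (out_stop[n] < starts[i]) {
      # Gap before this row: open a new merged interval
      out_start <- c(out_start, starts[i])
      out_stop <- c(out_stop, stops[i])
    } else if (stops[i] > out_stop[n]) {
      # Overlapping or touching: extend the last merged interval
      out_stop[n] <- stops[i]
    }
  }
  output <- data.frame(out_start, out_stop, stringsAsFactors = FALSE)
  names(output) <- c(start_col, stop_col)
  output
}

# Same data as in the example above, but with different column names
d2 <- data.frame(
  begin = c("2005-01-01", "2000-01-01", "2001-01-01"),
  end = c("2006-01-02", "2001-01-02", "2004-01-02"),
  stringsAsFactors = FALSE
)
merged <- interval_union_cols(d2, start_col = "begin", stop_col = "end")
print(merged)
```

Passing the names as strings and indexing with [[ ]] avoids non-standard evaluation, which keeps the sketch short; a dplyr-based version would need a different mechanism for the column names.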

int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
ints <- c(int1,int2)

%within% works on vectors, so you can do this:

my_int <- new_interval(ymd("2001-01-01"), ymd("2004-01-01"))
my_int %within% ints
# [1]  TRUE FALSE

So you can use any to check whether your interval lies within one of the intervals in the list:

any(my_int %within% ints)
# [1] TRUE

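You are right in your comment: the result given by %within% does not seem consistent with the documentation, which says:

If a is an interval, both its start and end dates must fall within b to return TRUE.

Looking at the source code of %within%, when a and b are both intervals it appears to be the following: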

setMethod("%within%", signature(a = "Interval", b = "Interval"), function(a,b){
    as.numeric(a@start) - as.numeric(b@start) <= b@.Data & as.numeric(a@start) - as.numeric(b@start) >= 0
})

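So it seems that only the start of a is tested against b, which is consistent with the result above. Perhaps this should be considered a bug and reported?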
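For comparison, here is a minimal sketch of a check that tests both endpoints, which is what the quoted documentation describes. It is not lubridate code: the helper within_both_ends is hypothetical and works on plain base R Date vectors rather than Interval objects.

```r
# Hypothetical helper: TRUE where [a_start, a_end] lies entirely inside
# [b_start, b_end]. Vectorised over the b intervals.
within_both_ends <- function(a_start, a_end, b_start, b_end) {
  b_start <= a_start & a_end <= b_end
}

# The same dates as int1 and int2 in the example above
b_start <- as.Date(c("2001-01-01", "2003-06-01"))
b_end <- as.Date(c("2002-01-01", "2004-01-01"))

# my_int from the example above: 2001-01-01 -- 2004-01-01
res <- within_both_ends(as.Date("2001-01-01"), as.Date("2004-01-01"),
                        b_start, b_end)
print(res)
# [1] FALSE FALSE
print(any(res))
# [1] FALSE
```

With both endpoints tested, my_int is not reported as lying inside int1, because its end date falls well after int1 ends.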
