Remove rows with overlapping dates and keep the longest time interval in R using dplyr or sqldf



I need to remove rows whose dates overlap and keep the largest x value among the overlapping rows. Here is a data frame:

data.frame(time_left = c("2011-08-05",
"2011-07-25",
"2017-08-20",
"2017-08-20", 
"2017-10-09", 
"2019-06-01"), 
time_right= c("2011-09-14",
"2011-09-01",
"2017-09-12",
"2017-09-26",
"2017-10-15",
"2019-11-05"),
x = c(114,20,10,1,5,100) ) -> df

So my input is:

time_left time_right   x
1 2011-08-05 2011-09-14 114
2 2011-07-25 2011-09-01  20
3 2017-08-20 2017-09-12  10
4 2017-08-20 2017-09-26   1
5 2017-10-09 2017-10-15   5
6 2019-06-01 2019-11-05 100

My desired output is:

time_left   time_right   x
1 2011-08-05 2011-09-14  114
2 2011-07-25 2011-09-01   20
4 2017-08-20 2017-09-26   10  
5 2017-10-09 2017-10-15    5
6 2019-06-01 2019-11-05  100

Thanks for your help.

@Maël brought this to my attention on the ivs issue page: https://github.com/DavisVaughan/ivs/issues/20

I think this can be solved quite elegantly and efficiently with ivs, but coming up with the solution is a bit tricky, so I may add a helper to make it easier.

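The solution has to handle "recursive" containers, that is, range A contains range B, but range C also contains range A, so you only want to list range C. I describe this in more detail, with examples, here: https://github.com/DavisVaughan/ivs/issues/20#issuecomment-123479783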
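To make the "recursive" containment concrete, here is a minimal base R sketch (not the ivs approach used below). The three integer ranges A, B and C are made up for illustration, and the sketch assumes no two ranges are identical.

```r
# Hypothetical ranges: C contains A, and A contains B
start <- c(A = 1, B = 2, C = 0)
end   <- c(A = 5, B = 4, C = 10)

# contains[i, j] is TRUE when range i contains range j
# (every range contains itself)
contains <- outer(start, start, "<=") & outer(end, end, ">=")

# A range is a top-level container when the only range containing it is itself
top <- names(start)[colSums(contains) == 1]
top
```

Here B is contained by A and C, and A is contained by C, so only C is left as a top-level container.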

library(ivs)
library(dplyr)
library(vctrs)
df <- tibble(
  time_left = as.Date(c(
    "2011-08-05", "2011-07-25", "2017-08-20",
    "2017-08-20", "2017-10-09", "2019-06-01"
  )),
  time_right = as.Date(c(
    "2011-09-14", "2011-09-01", "2017-09-12",
    "2017-09-26", "2017-10-15", "2019-11-05"
  )),
  x = c(114, 20, 10, 1, 5, 100)
)
df <- df %>%
  mutate(range = iv(time_left, time_right), .keep = "unused")
df
#> # A tibble: 6 × 2
#>       x                    range
#>   <dbl>               <iv<date>>
#> 1   114 [2011-08-05, 2011-09-14)
#> 2    20 [2011-07-25, 2011-09-01)
#> 3    10 [2017-08-20, 2017-09-12)
#> 4     1 [2017-08-20, 2017-09-26)
#> 5     5 [2017-10-09, 2017-10-15)
#> 6   100 [2019-06-01, 2019-11-05)
iv_locate_max_containment <- function(x) {
  # Find all locations where the range "contains" any other range
  # (including itself)
  locs <- iv_locate_overlaps(x, x, type = "contains")

  # Find the "top" ranges, i.e. the containers that aren't contained
  # by any other containers
  top <- !vec_duplicate_detect(locs$haystack)
  top <- vec_slice(locs$haystack, top)
  top <- vec_in(locs$needles, top)

  locs <- vec_slice(locs, top)

  locs
}
# i.e. row 4 "contains" rows 3 and 4
locs <- iv_locate_max_containment(df$range)
locs
#>   needles haystack
#> 1       1        1
#> 2       2        2
#> 3       4        3
#> 4       4        4
#> 5       5        5
#> 6       6        6
iv_align(df$range, df$x, locations = locs) %>%
  rename(range = needles) %>%
  group_by(range) %>%
  summarise(x = max(haystack))
#> # A tibble: 5 × 2
#>                      range     x
#>                 <iv<date>> <dbl>
#> 1 [2011-07-25, 2011-09-01)    20
#> 2 [2011-08-05, 2011-09-14)   114
#> 3 [2017-08-20, 2017-09-26)    10
#> 4 [2017-10-09, 2017-10-15)     5
#> 5 [2019-06-01, 2019-11-05)   100

Created on 2022-09-01 with reprex v2.0.2

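This may sound a bit verbose, but it could also be a solution:

  • First, we identify the observations that may overlap
  • Then we group the similar ones together
  • Within each group, we pick the minimum time_left and the maximum time_right and x
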
library(tidyverse)
library(lubridate)  # ymd(), interval() and %within%; not attached by older tidyverse versions
df %>%
  mutate(across(starts_with('time'), ymd),
         intv = interval(time_left, time_right),
         id = row_number()) %>%
  mutate(id2 = map2(intv, id, ~ if (any(.x %within% intv[intv != .x])) {
    id[which(.x %within% intv[intv != .x]) + 1]
  } else {
    .y
  })) %>%
  group_by(id2) %>%
  summarise(time_left = min(time_left),
            across(c(time_right, x), max)) %>%
  select(!(id2))

# A tibble: 4 × 3
time_left  time_right     x
<date>     <date>     <dbl>
1 2011-08-05 2011-09-14   114
2 2017-08-20 2017-09-26    10
3 2017-10-09 2017-10-15     5
4 2019-06-01 2019-11-05   100
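The three steps listed above can also be sketched in base R. This is only an illustration under one assumption: "overlapping" is read as "fully contained in another interval", which is what the desired output in the question suggests. The names `contained_by`, `len`, `group` and `keep` are made up for this sketch.

```r
df <- data.frame(
  time_left  = as.Date(c("2011-08-05", "2011-07-25", "2017-08-20",
                         "2017-08-20", "2017-10-09", "2019-06-01")),
  time_right = as.Date(c("2011-09-14", "2011-09-01", "2017-09-12",
                         "2017-09-26", "2017-10-15", "2019-11-05")),
  x = c(114, 20, 10, 1, 5, 100)
)

# Work on day counts to keep the comparisons simple
l <- as.numeric(df$time_left)
r <- as.numeric(df$time_right)

# Step 1: contained_by[i, j] is TRUE when row i lies inside row j (or i == j)
contained_by <- outer(l, l, ">=") & outer(r, r, "<=")

# Step 2: assign each row to the longest interval that contains it
len <- r - l
group <- apply(contained_by, 1, function(inside) {
  candidates <- which(inside)
  candidates[which.max(len[candidates])]
})

# Step 3: keep one row per group, with the largest x found in that group
keep <- sort(unique(group))
result <- df[keep, c("time_left", "time_right")]
result$x <- as.vector(tapply(df$x, group, max))
result
```

Because every row in a group lies inside its container, the container's own bounds are already the minimum time_left and maximum time_right of the group. Under the containment assumption, rows 1 and 2 overlap without either containing the other, so both are kept.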

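I combined Anousiravan's solution with the approach from this question:

How to determine whether a date interval overlaps another date interval for the same person in a data frame in R?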

I think it is working now.

df %>%
  mutate(id = row_number(),
         days = as.numeric(as.Date(time_right) - as.Date(time_left))) %>%
  mutate(Int = interval(time_left, time_right),
         within = map(seq_along(Int), function(x) {
           y = setdiff(seq_along(Int), x)
           if (any(id[which((Int[x] %within% Int[y])) + 1])) {
             return(id[days == max(days[which((Int[x] %within% Int[y])) + 1])])
           } else {
             return(0)
           }
         })
  ) %>%
  mutate(within = ifelse(within > 0, within, id)) %>%
  group_by(within) %>%
  summarise(time_left = min(time_left), time_right = max(time_right), x = max(x)) %>%
  select(!within)

But it still has some bugs. For the df below, this code does not work unless I change the order of the records.

df = data.frame(time_left = c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26"),
time_right = c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31"),
x = c(10,100,200,20))
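
One likely source of this order dependence is the `which(...) + 1` indexing: the positions returned by `which()` refer to the vector with the current row removed, so adding 1 only gives the right row number when the containing interval comes after the current row. A minimal base R sketch with the same dates (plain comparisons stand in for `%within%`; the names `hits`, `shifted` and `mapped` are made up for illustration):

```r
# Same intervals as the data frame above, as Date vectors
left  <- as.Date(c("2014-01-01", "2014-01-01", "2014-12-01", "2014-12-26"))
right <- as.Date(c("2014-04-23", "2014-12-31", "2014-12-31", "2014-12-31"))

x <- 3                            # look at row 3
y <- setdiff(seq_along(left), x)  # the other rows: 1, 2, 4

# Positions *within y* of the rows whose interval contains row 3
hits <- which(left[y] <= left[x] & right[y] >= right[x])

shifted <- hits + 1  # what the "+ 1" indexing computes: row 3 itself
mapped  <- y[hits]   # original row number of the container: row 2

c(shifted = shifted, mapped = mapped)
```

Indexing `y` with the hits maps them back to the original row numbers regardless of where the container sits, which may be enough to remove the dependence on record order.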
