Create a new Event_ID based on ID and a date column with a sliding window



Suppose I have a table like this:

ID  Date
1   2021-01-01
1   2021-01-05
1   2021-01-17
1   2021-02-01
1   2021-02-18
1   2021-02-28
1   2021-03-30
2   2021-01-01
2   2021-01-14
2   2021-02-15

Here is a data.table approach in R:

library(data.table)
#Change to data.table
setDT(df)
#Order the dataset
setorder(df, ID, Date)
#Set flag to TRUE/FALSE if difference is greater than 15
df[, greater_than_15 := c(TRUE, diff(Date) > 15), ID]
#Take cumulative sum to create consecutive event id.
df[, Event_ID := cumsum(greater_than_15)]
df
#    ID       Date greater_than_15 Event_ID
# 1:  1 2021-01-01            TRUE        1
# 2:  1 2021-01-05           FALSE        1
# 3:  1 2021-01-17           FALSE        1
# 4:  1 2021-02-01           FALSE        1
# 5:  1 2021-02-18            TRUE        2
# 6:  1 2021-02-28           FALSE        2
# 7:  1 2021-03-30            TRUE        3
# 8:  2 2021-01-01            TRUE        4
# 9:  2 2021-01-14           FALSE        4
#10:  2 2021-02-15            TRUE        5

df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2), 
Date = structure(c(18628, 18632, 18644, 18659, 18676, 18686, 18716, 
18628, 18641, 18673), class = "Date")), 
row.names = c(NA, -10L), class = "data.frame")
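
As a cross-check of the flag-then-cumsum idea above, here is a minimal base R sketch that needs no packages. It is not the answerer's method: it swaps data.table for `ave()`, and it assumes the same column names (`ID`, `Date`) and the same 15-day threshold used in the answer.

```r
# Same sample data as above, written out with readable dates
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
  Date = as.Date(c(
    "2021-01-01", "2021-01-05", "2021-01-17", "2021-02-01", "2021-02-18",
    "2021-02-28", "2021-03-30", "2021-01-01", "2021-01-14", "2021-02-15"
  ))
)

# Sort so that diff() compares each row with the previous date of the same ID
df <- df[order(df$ID, df$Date), ]

# 1 for the first row of each ID and for every row more than 15 days after
# the previous row of that ID, 0 otherwise
new_event <- ave(
  as.numeric(df$Date), df$ID,
  FUN = function(d) c(1, diff(d) > 15)
)

# The running total of the flags gives consecutive event ids across all IDs
df$Event_ID <- cumsum(new_event)
df
```

Because the first row of every ID is always flagged, the ids keep increasing across IDs without any extra bookkeeping.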

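Update for dplyr 1.1.0

dplyr 1.1.0 adds a new function and a new argument that help here.

  • consecutive_id(), which works like data.table::rleid(); and
  • the .by argument of mutate(), which removes the need for a subsequent ungroup().

Now we can do this: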

df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2), 
Date = structure(c(18628, 18632, 18644, 18659, 18676, 18686, 18716, 
18628, 18641, 18673), class = "Date")), 
row.names = c(NA, -10L), class = "data.frame")
library(dplyr)

df |>
  mutate(
    # Running count of gaps longer than 15 days within each ID
    Event = cumsum(c(0, diff(Date)) > 15),
    .by = ID
  ) |>
  mutate(
    # Number the (ID, Event) runs consecutively across the whole table
    Event = consecutive_id(ID, Event)
  )
#>    ID       Date Event
#> 1   1 2021-01-01     1
#> 2   1 2021-01-05     1
#> 3   1 2021-01-17     1
#> 4   1 2021-02-01     1
#> 5   1 2021-02-18     2
#> 6   1 2021-02-28     2
#> 7   1 2021-03-30     3
#> 8   2 2021-01-01     4
#> 9   2 2021-01-14     4
#> 10  2 2021-02-15     5

Created on 2023-03-29 with reprex v2.0.2
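
To see the two dplyr 1.1.0 additions mentioned above in isolation, here is a small sketch on made-up data. The data frame `toy` and its columns `g` and `x` are hypothetical and exist only for illustration; the sketch assumes dplyr 1.1.0 or later is installed.

```r
library(dplyr)  # assumes dplyr >= 1.1.0

# Hypothetical toy data: the value "a" appears in two separate runs
toy <- data.frame(
  g = c("a", "a", "b", "b", "a"),
  x = c(1, 2, 3, 4, 5)
)

# consecutive_id() starts a new id each time the value differs from the
# previous row, so the second run of "a" gets its own id
run_id <- consecutive_id(toy$g)

# .by groups the data for this one mutate() call only;
# the result comes back ungrouped, so no ungroup() is needed afterwards
out <- toy |>
  mutate(total = sum(x), .by = g)

is_grouped_df(out)
```

Note the difference between the two: `.by = g` treats both runs of "a" as one group, while `consecutive_id()` keeps them apart.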


Old answer

A solution in R can combine a dplyr pipeline with the rleid() function from data.table:

library(dplyr)
library(data.table)
df %>%
  group_by(ID) %>%
  mutate(Date = as.Date(Date)) %>% # make sure the Date column has class Date
  arrange(ID, Date) %>% # arrange the rows in order
  mutate(Event = if_else(is.na(Date - lag(Date)), Date - Date, Date - lag(Date)),
         Event = paste(ID, cumsum(if_else(Event > 15, 1, 0)), sep = "_")) %>%
  ungroup() %>% # since the event numbers are not to be created group-wise
  mutate(Event = rleid(Event))
# A tibble: 9 x 3
     ID Date       Event
  <int> <date>     <int>
1     1 2021-01-01     1
2     1 2021-01-05     1
3     1 2021-01-17     1
4     1 2021-02-15     2
5     1 2021-02-28     2
6     1 2021-03-30     3
7     2 2021-01-01     4
8     2 2021-01-14     4
9     2 2021-02-15     5
