Suppose I have a table like this:

ID  Date
1   2021-01-01
1   2021-01-05
1   2021-01-17
1   2021-02-01
1   2021-02-18
1   2021-02-28
1   2021-03-30
2   2021-01-01
2   2021-01-14
2   2021-02-15
Here is a data.table approach in R:
library(data.table)
#Change to data.table
setDT(df)
#Order the dataset
setorder(df, ID, Date)
#Set flag to TRUE/FALSE if difference is greater than 15
df[, greater_than_15 := c(TRUE, diff(Date) > 15), ID]
#Take cumulative sum to create consecutive event id.
df[, Event_ID := cumsum(greater_than_15)]
df
# ID Date greater_than_15 Event_ID
# 1: 1 2021-01-01 TRUE 1
# 2: 1 2021-01-05 FALSE 1
# 3: 1 2021-01-17 FALSE 1
# 4: 1 2021-02-01 FALSE 1
# 5: 1 2021-02-18 TRUE 2
# 6: 1 2021-02-28 FALSE 2
# 7: 1 2021-03-30 TRUE 3
# 8: 2 2021-01-01 TRUE 4
# 9: 2 2021-01-14 FALSE 4
#10: 2 2021-02-15 TRUE 5
# Sample data used above
df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
Date = structure(c(18628, 18632, 18644, 18659, 18676, 18686, 18716,
18628, 18641, 18673), class = "Date")),
row.names = c(NA, -10L), class = "data.frame")
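If data.table is not available, the same flag-and-cumulative-sum idea can be sketched in base R. This is only an illustration, not part of the original answer: it rebuilds the sample data with as.Date() and uses ave() to compute the within-ID gaps, and the variable name gap is made up for the example.

```r
# Sample data, rebuilt from the table in the question
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
  Date = as.Date(c("2021-01-01", "2021-01-05", "2021-01-17", "2021-02-01",
                   "2021-02-18", "2021-02-28", "2021-03-30",
                   "2021-01-01", "2021-01-14", "2021-02-15"))
)

# Sort so that diff() compares each date with the previous one for the same ID
df <- df[order(df$ID, df$Date), ]

# Days since the previous row within each ID; Inf marks the first row of an ID
gap <- ave(as.numeric(df$Date), df$ID, FUN = function(d) c(Inf, diff(d)))

# A gap of more than 15 days (or the start of a new ID) opens a new event
df$Event_ID <- cumsum(gap > 15)

df$Event_ID
# 1 1 1 1 2 2 3 4 4 5
```

Note that 2021-01-17 to 2021-02-01 is exactly 15 days, so with the strict `>` comparison that row stays in event 1.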
New in dplyr 1.1.0
With the dplyr 1.1.0 release, we get two new features:
- consecutive_id(), which works much like data.table::rleid(); and
- the .by argument of mutate(), which removes the need for a subsequent ungroup().
So now we can do this:
df <- structure(list(ID = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
Date = structure(c(18628, 18632, 18644, 18659, 18676, 18686, 18716,
18628, 18641, 18673), class = "Date")),
row.names = c(NA, -10L), class = "data.frame")
library(dplyr)
df |>
  mutate(
    # running count of gaps longer than 15 days within each ID
    Event = cumsum(c(0, diff(Date)) > 15),
    .by = ID
  ) |>
  mutate(
    # number each (ID, within-ID counter) run consecutively across the table
    Event = consecutive_id(ID, Event)
  )
#> ID Date Event
#> 1 1 2021-01-01 1
#> 2 1 2021-01-05 1
#> 3 1 2021-01-17 1
#> 4 1 2021-02-01 1
#> 5 1 2021-02-18 2
#> 6 1 2021-02-28 2
#> 7 1 2021-03-30 3
#> 8 2 2021-01-01 4
#> 9 2 2021-01-14 4
#> 10 2 2021-02-15 5
Created on 2023-03-29 with reprex v2.0.2
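As a rough illustration of what a run-length id such as consecutive_id() or data.table::rleid() computes, here is a base R sketch. The helper name run_id is hypothetical, and the sketch assumes an atomic vector with no NA values (rle() treats each NA as its own run, which the packaged functions may handle differently).

```r
# Hypothetical helper: give every run of identical consecutive values its own id
run_id <- function(x) {
  r <- rle(x)                            # lengths of consecutive equal values
  rep(seq_along(r$lengths), r$lengths)   # repeat the run number over each run
}

run_id(c("a", "a", "b", "a", "a"))
# 1 1 2 3 3
```

The id increases every time the value changes, including when it changes back to a value seen earlier, so a run id applied to a TRUE/FALSE flag is not the same thing as a cumulative count of the TRUE values.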
Old answer
A solution using dplyr together with the rleid() function from data.table:
library(dplyr)
library(data.table)
df %>%
  group_by(ID) %>%
  mutate(Date = as.Date(Date)) %>% # make sure Date is of class Date
  arrange(ID, Date) %>% # arrange the rows in order
  mutate(Event = if_else(is.na(Date - lag(Date)), Date - Date, Date - lag(Date)),
         Event = paste(ID, cumsum(if_else(Event > 15, 1, 0)), sep = "_")) %>%
  ungroup() %>% # since the event numbers are not to be created group-wise
  mutate(Event = rleid(Event))
# A tibble: 9 x 3
ID Date Event
<int> <date> <int>
1 1 2021-01-01 1
2 1 2021-01-05 1
3 1 2021-01-17 1
4 1 2021-02-15 2
5 1 2021-02-28 2
6 1 2021-03-30 3
7 2 2021-01-01 4
8 2 2021-01-14 4
9 2 2021-02-15 5