I have a table like the following:
         Date Season
1  2022-01-01  Val_1
2  2022-01-02  Val_1
3  2022-01-03  Val_1
4  2022-01-04  Val_2
5  2022-01-05  Val_2
6  2022-01-06  Val_2
7  2022-01-07  Val_1
8  2022-01-08  Val_1
9  2022-01-09  Val_1
10 2022-01-10  Val_2
11 2022-01-11  Val_2
12 2022-01-12  Val_2
13 2022-01-13  Val_1
14 2022-01-14  Val_1
15 2022-01-15  Val_1
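What I want to do is number each consecutive run of values in the Season column, separately for each value, from 1 up to the total number of runs of that value. I have seen similar problems solved with functions such as rle, but I have not yet worked out how to adapt them to this problem. Here is an example of the output I want: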
         Date Season Season_Num
1  2022-01-01  Val_1          1
2  2022-01-02  Val_1          1
3  2022-01-03  Val_1          1
4  2022-01-04  Val_2          1
5  2022-01-05  Val_2          1
6  2022-01-06  Val_2          1
7  2022-01-07  Val_1          2
8  2022-01-08  Val_1          2
9  2022-01-09  Val_1          2
10 2022-01-10  Val_2          2
11 2022-01-11  Val_2          2
12 2022-01-12  Val_2          2
13 2022-01-13  Val_1          3
14 2022-01-14  Val_1          3
15 2022-01-15  Val_1          3
Using a single mutate call, with cumsum and lag:
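library(dplyr)
# Increment the counter at the first row of each "Val_1" run; the "Val_2" rows
# that follow keep the same number. This relies on the two values alternating.
df %>%
  mutate(Season_num = cumsum(Season == "Val_1" & lag(Season, default = "Val_2") != Season))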
# Date Season Season_num
# 1 2022-01-01 Val_1 1
# 2 2022-01-02 Val_1 1
# 3 2022-01-03 Val_1 1
# 4 2022-01-04 Val_2 1
# 5 2022-01-05 Val_2 1
# 6 2022-01-06 Val_2 1
# 7 2022-01-07 Val_1 2
# 8 2022-01-08 Val_1 2
# 9 2022-01-09 Val_1 2
# 10 2022-01-10 Val_2 2
# 11 2022-01-11 Val_2 2
# 12 2022-01-12 Val_2 2
# 13 2022-01-13 Val_1 3
# 14 2022-01-14 Val_1 3
# 15 2022-01-15 Val_1 3
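To see why the cumsum and lag combination works, it can help to look at the intermediate vectors. The following is a minimal base R sketch on a shortened, made-up Season vector (the names season, prev and starts are only illustrative and do not appear in the answer above); prev is built by hand to mimic what lag(Season, default = "Val_2") returns.

```r
# A shortened, hypothetical Season column: two runs of "Val_1" split by "Val_2".
season <- c("Val_1", "Val_1", "Val_2", "Val_1")

# Previous value for every row; the first row gets the default "Val_2",
# mimicking dplyr::lag(season, default = "Val_2").
prev <- c("Val_2", head(season, -1))

# TRUE only on the first row of each "Val_1" run.
starts <- season == "Val_1" & prev != season

# The running total of run starts is the run number carried by every row.
season_num <- cumsum(starts)
print(data.frame(season, prev, starts, season_num))
```

Because the counter only moves when a "Val_1" run begins, each "Val_2" run shares the number of the "Val_1" run before it, which is what the desired output asks for when the two values alternate.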
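After grouping by Season, we can take the diff between consecutive Date values and then the cumulative sum. This assumes the table has one row per day, so a gap of more than one day within a group marks the start of a new run: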
library(dplyr)
df1 %>%
  group_by(Season) %>%
  # within each Season, a date gap other than 1 day starts a new run
  mutate(Season_Num = cumsum(c(TRUE, diff(Date) != 1))) %>%
  ungroup()
Output:
# A tibble: 15 × 3
Date Season Season_Num
<date> <chr> <int>
1 2022-01-01 Val_1 1
2 2022-01-02 Val_1 1
3 2022-01-03 Val_1 1
4 2022-01-04 Val_2 1
5 2022-01-05 Val_2 1
6 2022-01-06 Val_2 1
7 2022-01-07 Val_1 2
8 2022-01-08 Val_1 2
9 2022-01-09 Val_1 2
10 2022-01-10 Val_2 2
11 2022-01-11 Val_2 2
12 2022-01-12 Val_2 2
13 2022-01-13 Val_1 3
14 2022-01-14 Val_1 3
15 2022-01-15 Val_1 3
Data
df1 <- structure(list(Date = structure(c(18993, 18994, 18995, 18996,
18997, 18998, 18999, 19000, 19001, 19002, 19003, 19004, 19005,
19006, 19007), class = "Date"), Season = c("Val_1", "Val_1",
"Val_1", "Val_2", "Val_2", "Val_2", "Val_1", "Val_1", "Val_1",
"Val_2", "Val_2", "Val_2", "Val_1", "Val_1", "Val_1")), row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15"), class = "data.frame")
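Since the question mentions rle, here is one possible base R sketch of that route. It is not taken from either answer above: it rebuilds the example data so that it runs on its own, and the helper names runs and run_num are only illustrative. Unlike the date-based approach, it does not depend on the dates being consecutive, and it is not tied to there being exactly two values.

```r
# Rebuild the example data (15 consecutive days, runs of three rows each).
df1 <- data.frame(
  Date = seq(as.Date("2022-01-01"), by = "day", length.out = 15),
  Season = rep(rep(c("Val_1", "Val_2"), length.out = 5), each = 3)
)

# rle() collapses the column into runs: one value and one length per run.
runs <- rle(df1$Season)

# Number the runs separately for each distinct value.
# ave() applies seq_along() within the groups defined by runs$values.
run_num <- ave(seq_along(runs$values), runs$values, FUN = seq_along)

# Expand the per-run numbers back to one entry per row.
df1$Season_Num <- rep(run_num, times = runs$lengths)
print(df1)
```

The key step is numbering the runs per value before expanding them with rep, so that the first "Val_2" run gets 1 rather than continuing the count from the "Val_1" runs.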