对于每个事件,我试图计算在同一组(类和区域(内前一个冬天(以及之前的冬天(发生了多少以前的事件。
这是一个简单的数据集(我的实际数据集有85000条记录(
# creating a simple data.frame
class <- c(1,1,1,1,1,2,2,2,2)
area <- c("a", "a","b", "a", "a","b", "a", "a","b" )
event <- as.Date(c("2023-04-01", "2022-12-01", "2022-01-01",
"2021-12-01", "2022-12-01", "2022-12-01",
"2020-04-01", "2022-04-01", "2022-04-01"))
df <- data.frame(class, area, event)
str(df) # checking the structure of the data.frame
df <- df[order(class, area, event),] # sorting the order
df
df$events_in_previous_winter <- c(0,1,1,2,0,0,0,0,0) # this is the desired answer
df
我尝试过使用dplyr/group_by(class,area(和mutate来计数,但我无法使其工作。
我把冬天定义为12、1和2月份(12月、1月、2月(。
我想知道每个组(类和区域的唯一配对(有多少";事件";发生在该事件的前一个冬天。
有什么想法吗?
我会使用分组摘要来创建一个单独的previous_winter_events
数据帧,其中包含每个冬天的事件数。然后,您可以使用dplyr::lag()
获得前一个冬季每个的事件数。(您也可以通过设置lag(x, n = 2)
等来获取两个冬天前的事件。(然后,使用左联接将这些值合并回原始数据帧。
该解决方案使用winter_year
辅助列将每年12月与接下来的1月和2月分组,即使它们的年份不同。我使用tidyr::complete()
将没有事件的年份添加到previous_winter_events
。
library(dplyr)
library(tidyr)
library(lubridate)
event_df <- event_df %>%
mutate(winter_year = year(event %m-% months(2))) %>%
arrange(event_class, area, event)
previous_winter_events <- event_df %>%
complete(event_class, area, winter_year = full_seq(winter_year, 1)) %>%
group_by(event_class, area, winter_year) %>%
summarize(
events_this_winter = sum(month(event) %in% c(12, 1, 2)),
.groups = "drop_last"
) %>%
mutate(
events_in_previous_winter = dplyr::lag(events_this_winter, default = 0)
) %>%
ungroup()
event_df <- event_df %>%
left_join(previous_winter_events) %>%
select(!c(winter_year, events_this_winter)) # remove helper columns
event_df
输出:
event_class area event events_in_previous_winter
1 1 a 2018-01-01 0
2 1 a 2021-12-01 0
3 1 a 2022-01-31 0
4 1 a 2022-12-01 2
5 1 a 2022-12-01 2
6 1 a 2023-04-01 2
7 1 b 2022-01-01 0
8 2 a 2020-04-01 0
9 2 a 2022-04-01 0
10 2 b 2022-04-01 0
11 2 b 2022-12-01 0
数据:
# Added a couple additional test cases from OP comments.
# Changed names of `df` and `class` because those are function names in R.
event_df <- data.frame(
event_class = c(1,1,1,1,1,1,1,2,2,2,2),
area = c("a", "a","a", "a","b", "a", "a","b", "a", "a","b"),
event = as.Date(c("2018-01-01", "2023-04-01", "2022-12-01", "2022-01-31",
"2022-01-01", "2021-12-01", "2022-12-01", "2022-12-01",
"2020-04-01", "2022-04-01", "2022-04-01"))
)