r-查找组内前一个冬天的事件数



对于每个事件,我试图计算在同一组(类和区域(内前一个冬天(以及之前的冬天(发生了多少以前的事件。

这是一个简单的数据集(我的实际数据集有85000条记录(

# creating a simple data.frame
class <- c(1,1,1,1,1,2,2,2,2)
area <-  c("a", "a","b", "a", "a","b",  "a", "a","b" )
event <- as.Date(c("2023-04-01", "2022-12-01", "2022-01-01",
"2021-12-01", "2022-12-01", "2022-12-01",
"2020-04-01", "2022-04-01", "2022-04-01"))
df <- data.frame(class, area, event)

str(df) # checking the structure of the data.frame
df <- df[order(class, area, event),] # sorting the order
df
df$events_in_previous_winter <- c(0,1,1,2,0,0,0,0,0)  # this is the desired answer
df

我尝试过使用dplyr/group_by(class,area(和mutate来计数,但我无法使其工作。

我把冬天定义为12、1和2月份(12月、1月、2月(。

我想知道每个组(类和区域的唯一配对(有多少";事件";发生在该事件的前一个冬天。

有什么想法吗?

我会使用分组摘要来创建一个单独的previous_winter_events数据帧,其中包含每个冬天的事件数。然后,您可以使用dplyr::lag()获得前一个冬季每个的事件数。(您也可以通过设置lag(x, n = 2)等来获取两个冬天前的事件。(然后,使用左联接将这些值合并回原始数据帧。

该解决方案使用winter_year辅助列将每年12月与接下来的1月和2月分组,即使它们的年份不同。我使用tidyr::complete()将没有事件的年份添加到previous_winter_events

library(dplyr)
library(tidyr)
library(lubridate)
event_df <- event_df %>% 
mutate(winter_year = year(event %m-% months(2))) %>% 
arrange(event_class, area, event)
previous_winter_events <- event_df %>% 
complete(event_class, area, winter_year = full_seq(winter_year, 1)) %>% 
group_by(event_class, area, winter_year) %>% 
summarize(
events_this_winter = sum(month(event) %in% c(12, 1, 2)),
.groups = "drop_last"
) %>% 
mutate(
events_in_previous_winter = dplyr::lag(events_this_winter, default = 0)
) %>% 
ungroup()
event_df <- event_df %>% 
left_join(previous_winter_events) %>% 
select(!c(winter_year, events_this_winter)) # remove helper columns
event_df

输出:

event_class area      event events_in_previous_winter
1            1    a 2018-01-01                         0
2            1    a 2021-12-01                         0
3            1    a 2022-01-31                         0
4            1    a 2022-12-01                         2
5            1    a 2022-12-01                         2
6            1    a 2023-04-01                         2
7            1    b 2022-01-01                         0
8            2    a 2020-04-01                         0
9            2    a 2022-04-01                         0
10           2    b 2022-04-01                         0
11           2    b 2022-12-01                         0

数据:

# Added a couple additional test cases from OP comments.
# Changed names of `df` and `class` because those are function names in R.
event_df <- data.frame(
event_class = c(1,1,1,1,1,1,1,2,2,2,2),
area = c("a", "a","a", "a","b", "a", "a","b", "a", "a","b"),
event = as.Date(c("2018-01-01", "2023-04-01", "2022-12-01", "2022-01-31", 
"2022-01-01", "2021-12-01", "2022-12-01", "2022-12-01",
"2020-04-01", "2022-04-01", "2022-04-01"))
)

相关内容

  • 没有找到相关文章