Find the number of records open during a month in a dataset of individual records with start and end dates, in R



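I have a data frame where each row is a record with its id, start date, and end date. I want to create another data frame that contains the start date of each calendar month (e.g. "2020-01-01" for January) and a second column counting the number of unique records that were open during any or all of that month.

I could create a new column for each calendar month, generate a dummy variable for whether the record was open in that month, and then sum each column. Is there a more efficient way to do this?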

ds <- data.frame(record_id = c("00a", "00b", "00c"),
                 record_start_date = as.Date(c("2020-01-16", "2020-03-25", "2020-02-22")),
                 record_end_date = as.Date(c("2020-12-05", "2020-06-21", "2020-11-12")))

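The ivs package was created for working with intervals like these, and iv_count_between() is a good fit for this problem. Note that it counts the records open on the first day of each month, and that the intervals are half-open, so a record's end date is not itself counted as open.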

library(ivs)
library(dplyr)
library(clock)

ds <- data.frame(
  record_id = c("00a", "00b", "00c"),
  record_start_date = as.Date(c("2020-01-16", "2020-03-25", "2020-02-22")),
  record_end_date = as.Date(c("2020-12-05", "2020-06-21", "2020-11-12"))
)

# Record the start and end months to generate the counts for
start <- date_start(min(ds$record_start_date), "year")
end <- date_end(max(ds$record_end_date), "year") + 1L

# Construct an interval vector
ds <- ds %>%
  mutate(
    record_range = iv(record_start_date, record_end_date),
    .keep = "unused"
  )

ds
#>   record_id             record_range
#> 1       00a [2020-01-16, 2020-12-05)
#> 2       00b [2020-03-25, 2020-06-21)
#> 3       00c [2020-02-22, 2020-11-12)

# Generate the months sequence to count along
result <- tibble(
  month = date_seq(
    from = start,
    to = end,
    by = duration_months(1)
  )
)

# Count the number of times `month[[i]]` is between any of the
# ranges in `ds$record_range`
result %>%
  mutate(
    count = iv_count_between(month, ds$record_range)
  )
#> # A tibble: 13 × 2
#>    month      count
#>    <date>     <int>
#>  1 2020-01-01     0
#>  2 2020-02-01     1
#>  3 2020-03-01     2
#>  4 2020-04-01     3
#>  5 2020-05-01     3
#>  6 2020-06-01     3
#>  7 2020-07-01     2
#>  8 2020-08-01     2
#>  9 2020-09-01     2
#> 10 2020-10-01     2
#> 11 2020-11-01     2
#> 12 2020-12-01     1
#> 13 2021-01-01     0

Created on 2022-09-01 with reprex v2.0.2

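Here is an approach where we reshape the data and add a row for the start of each month. A vectorized cumulative sum then gives the number of active records as of the 1st of each month, which is quite efficient. If you want records that end on the 1st (or that end on the same day they start) to be included in the count, you can add a line that shifts the end date one day later; that line is commented out below.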

library(tidyverse); library(lubridate)

# `ds` is the original data frame from the question
ds %>%
  # One row per start/end event
  pivot_longer(-record_id) %>%
  # A start opens a record (+1), an end closes it (-1)
  mutate(change = if_else(name == "record_start_date", 1, -1)) %>%
  # mutate(value = value + if_else(name == "record_end_date", 1, 0)) %>%
  add_row(name = "month_start",
          value = seq.Date(floor_date(min(ds$record_start_date), "month"),
                           floor_date(max(ds$record_end_date), "month"),
                           by = "month"),
          change = 0) %>%
  # On ties, start/end events sort ahead of the month_start row
  arrange(value, desc(name)) %>%
  mutate(count = cumsum(change)) %>%
  filter(name == "month_start") %>%
  select(value, count)

Result:

# A tibble: 12 × 2
   value      count
   <date>     <dbl>
 1 2020-01-01     0
 2 2020-02-01     1
 3 2020-03-01     2
 4 2020-04-01     3
 5 2020-05-01     3
 6 2020-06-01     3
 7 2020-07-01     2
 8 2020-08-01     2
 9 2020-09-01     2
10 2020-10-01     2
11 2020-11-01     2
12 2020-12-01     1
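
Both outputs above count the records that are open on the 1st of each month, which is why January shows 0 even though record 00a opens on 2020-01-16. If the goal is instead to count records open during any part of a month, as the question describes, one option is to test each record for overlap with the whole month. The following is a minimal base R sketch of that idea, not a drop-in replacement for either answer. The helper `first_of_month()` and the names `month_start`, `next_month_start`, and `monthly` are made up for illustration, and it assumes that a record still counts as open on its end date and that each `record_id` appears only once.

```r
ds <- data.frame(
  record_id = c("00a", "00b", "00c"),
  record_start_date = as.Date(c("2020-01-16", "2020-03-25", "2020-02-22")),
  record_end_date = as.Date(c("2020-12-05", "2020-06-21", "2020-11-12"))
)

# Hypothetical helper: first day of the month containing each date
first_of_month <- function(x) as.Date(format(x, "%Y-%m-01"))

# First day of every calendar month spanned by the data
month_start <- seq(
  first_of_month(min(ds$record_start_date)),
  first_of_month(max(ds$record_end_date)),
  by = "month"
)

# First day of the following month, used as an exclusive upper bound
next_month_start <- seq(
  month_start[1],
  by = "month",
  length.out = length(month_start) + 1
)[-1]

# A record overlaps a month if it starts before the month is over and
# ends on or after the month's first day (assumption: end date inclusive)
open_count <- vapply(seq_along(month_start), function(i) {
  sum(ds$record_start_date < next_month_start[i] &
        ds$record_end_date >= month_start[i])
}, integer(1))

monthly <- data.frame(month = month_start, open_count = open_count)
monthly
```

With the example data this counts 1 record for January and 2 for February, since the records that open mid-month are now included in the month they open. If `record_id` could repeat, the overlap test would need to be applied per distinct id instead of per row.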
