我搜索了一段时间,但似乎大多数天数计数方法只针对两个日期(即单个日期范围(。我不知道是否有比使用for循环计算多个日期范围之间的天数更快的方法。
这是我的数据DF:
ItemName StartDate EndDate
Apple 2014-01-02 2018-01-01
Orange 2014-08-06 2017-02-03
Banana 2015-03-01 2019-12-31
...
现在我有一个报告日期范围:2015-01-01至2019-12-31。我想按季度计算报告日期所涵盖项目的天数。
例如,第一季度的报告日期在2015-01-01&2015-03-31(含两端(
ItemName StartDate EndDate CountDays SampleQuarter
Apple 2014-01-02 2018-01-02 90 2015-Q1
Orange 2014-08-06 2017-02-03 90 2015-Q1
Banana 2015-03-01 2019-12-31 31 2015-Q1
...
Apple 2014-01-02 2018-01-02 2 2018-Q1
Orange 2014-08-06 2017-02-03 0 2018-Q1
Banana 2015-03-01 2019-12-31 31 2018-Q1
....
我的目标是使用上面的结果来得到这个表:
SampleQuarter TotalDays
2015-Q1 211
2015-Q2 273
...
2018-Q1 33
2018-Q2 91
...
我使用for循环来处理它。首先我创建了一个参考日期列表:
Date = c("2015-01-01","2015-04-01","2015-06-01",....)
TotalCount = NULL
for(i in 1:length(Date)){
START = as.Date(Date[i])
END = START %m+% months(3) -1 ## the end of each quarter
## find items that start and end date range is covered by the selected quarter
## than calculated the days covered by the selected quarter only
df = DF %>% filter( StartDate <=END & EndDate >=END ) %>%
mutate(StartDateNEW = ifelse(StartDate <= START,START, StartDate ))%>%
mutate(EndDateNEW = ifelse(EndDate <= END,EndDate , END)) %>%
mutate(CountDays= EndDateNEW -StartDateNEW +1) %>%
select(-StartDateNEW, EndDateNEW ) %>%
mutate(SampleQuarter =paste0( format(START , "%Y") ,"-Q",quarter(START) ) %>%
group_by(SampleQuarter) %>% summarise(TotalDays = sum(CountDays))
TotalCount = rbind(TotalCount ,df)
START = START %m+% months(3) ## the beginning of next quarter
}
对于小样本数据,For循环运行良好。当样本量很大时,代码将需要一段时间。我想知道是否有一种方法可以绕过for循环,使整个过程更快?
提前谢谢。
也许我们可以使用
library(dplyr)
library(purrr)
library(tidyr)
library(zoo)
DF %>%
mutate(across(ends_with("Date"), as.Date),
new = map2(StartDate, EndDate,
~ tibble(days = seq(.x, .y, by = 'day'),
Quarter = cut(days, breaks = seq(.x, .y, by = 'quarter'))))) %>%
unnest(c(new)) %>%
group_by(ItemName, StartDate, EndDate, Quarter = as.yearqtr(Quarter)) %>%
summarise(n = n(), .groups = 'drop')