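My data frame has three columns: a grouping factor, Current_Date, and Start_Date (by definition, Current_Date ≥ Start_Date; the dates are in dmy format). Each column contains many repeated dates, with gaps of varying length between them. Some dates overlap between groups, while others do not.
The real data runs to hundreds of thousands of rows, so my question is how to efficiently assign to each row the number of unique dates in the entire date column (Current_Date in the code below), by grouping factor, that fall within the row-specific time window, i.e. the date sequence from Start_Date to Current_Date.
A dummy data set is shown below with the desired result in the added Dates_in_range column, but without the grouping factor, which I don't know how to handle in a for-loop format (for example, within the window [...]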
Current_Date Start_Date Dates_in_range
1 21-10-22 21-10-22 1
2 26-10-22 26-10-22 1
3 26-10-22 21-10-22 2
4 26-10-22 26-10-22 1
5 26-10-22 21-10-22 2
6 28-10-22 26-10-22 2
7 28-10-22 28-10-22 1
8 28-10-22 21-10-22 3
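My solution builds two kinds of date lists with for loops and adds each one to the data table as a temporary column. The first kind is the same list of all dates shared by the whole data set (or by a group within it), repeated on every row of the data table (or of the group). The second kind is a row-specific list derived from the time window given by Current_Date and Start_Date. I then find the intersect of the two list columns for each row, using another for loop.
The reproducible code is attached below: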
library(data.table)
## Load the data set
dt = data.table(Current_Date = c("21-10-22","26-10-22","26-10-22","26-10-22","26-10-22","28-10-22","28-10-22","28-10-22"),
                Start_Date = c("21-10-22","26-10-22","21-10-22","26-10-22","21-10-22","26-10-22","28-10-22","21-10-22"))
# Specify dates into DMY date format
library(lubridate)
dt$Current_Date<- dmy(dt$Current_Date)
dt$Start_Date <- dmy(dt$Start_Date)
## Create a list of all current dates within the data set (= Current_Date column)
Dates_all <- as.list(dt$Current_Date)
# Add the list as a Dates_all column to the data set
dt$All_dates <- list()
for (i in 1:length(dt[, Current_Date])){
  dt$All_dates[[i]] <- Dates_all
}
## Create a list of sequences of all possible dates within the date period (from Start_Date to Current_Date) for each row
Date_window <- list()
for (i in 1:length(dt[, Current_Date])){
  Date_window[[i]] <- as.list(seq(as.Date(dt[i, Start_Date]), as.Date(dt[i, Current_Date]), by="days"))
}
# Add the list as a Date_window column to the data set
dt$Date_window <- Date_window
## Add the Dates_in_range column containing the number of dates from Current_Date column, occurring in the row-specific time window
for (i in 1:length(dt[, Current_Date])){
  dt$Dates_in_range[[i]] <- length(intersect(dt$Date_window[[i]], dt$All_dates[[i]]))
}
# Cleanup & print
dt[, c("Date_window","All_dates") := NULL]
rm(Dates_all, Date_window, i)
print(dt)
I suspect this could be done with the foverlaps function, but I'm not sure how to apply it in this case.
Thanks in advance!
With data.table, you can do the following:
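library(data.table)
dt = data.table(Current_Date = c("21-10-22","26-10-22","26-10-22","26-10-22","26-10-22","28-10-22","28-10-22","28-10-22"),
                Start_Date = c("21-10-22","26-10-22","21-10-22","26-10-22","21-10-22","26-10-22","28-10-22","21-10-22"))
# For each row, count the unique Current_Date values lying between Start_Date and Current_Date.
# Note: the dates are still character strings here, so between() compares them as text;
# convert them to Date (e.g. with lubridate::dmy()) before using this on real data.
dt[, Dates_in_range := sum(between(dt[, unique(Current_Date)], Start_Date, Current_Date)),
   by = rownames(dt)]
dt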
#> Current_Date Start_Date Dates_in_range
#> 1: 21-10-22 21-10-22 1
#> 2: 26-10-22 26-10-22 1
#> 3: 26-10-22 21-10-22 2
#> 4: 26-10-22 26-10-22 1
#> 5: 26-10-22 21-10-22 2
#> 6: 28-10-22 26-10-22 2
#> 7: 28-10-22 28-10-22 1
#> 8: 28-10-22 21-10-22 3
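The question also asks for the count per grouping factor on a large table. One possible sketch is a data.table non-equi join against the unique (group, date) pairs, which avoids the row-by-row loop. The column name `Group`, the made-up group "B" rows, and the helper names `u` and `counts` are assumptions for illustration and do not come from the original post.

```r
library(data.table)

# Hypothetical data: the question's eight rows as group "A", plus an invented group "B".
dt <- data.table(
  Group = rep(c("A", "B"), times = c(8, 3)),
  Current_Date = c("21-10-22", "26-10-22", "26-10-22", "26-10-22", "26-10-22",
                   "28-10-22", "28-10-22", "28-10-22",
                   "22-10-22", "27-10-22", "27-10-22"),
  Start_Date = c("21-10-22", "26-10-22", "21-10-22", "26-10-22", "21-10-22",
                 "26-10-22", "28-10-22", "21-10-22",
                 "20-10-22", "22-10-22", "27-10-22")
)

# Parse the dd-mm-yy strings into integer-backed dates.
date_cols <- c("Current_Date", "Start_Date")
dt[, (date_cols) := lapply(.SD, function(x) as.IDate(as.Date(x, format = "%d-%m-%y"))),
   .SDcols = date_cols]

# One row per distinct (Group, date) pair taken from Current_Date.
u <- unique(dt[, .(Group, d = Current_Date)])

# For every row of dt, count the distinct dates of the same group that fall
# inside [Start_Date, Current_Date]; by = .EACHI returns one count per row of dt.
counts <- u[dt, on = .(Group, d >= Start_Date, d <= Current_Date), .N, by = .EACHI]
dt[, Dates_in_range := counts$N]

print(dt)
```

Because the join works on the distinct dates per group rather than on list columns, it should scale better than the nested loops, though that is worth benchmarking on the real data.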
Using sapply:
dt[, n := sapply(Start_Date, function(x, y) sum(x <= y), y = unique(Start_Date)), by = Current_Date]
Or using frank:
dt[, n := frank(1/as.integer(Start_Date), ties.method = "dense"), by = Current_Date]
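Here is another approach that still uses a for loop but should be efficient. Basically, we first get a vector of all possible dates, then define a function that checks how many of those dates fall within the range defined by a minimum and a maximum, and then loop over the data set to apply that function to each row. Of course, if you prefer something vectorized, we could use vapply() or something similar here.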
# get the unique dates from the Current_Date column
dates <- unique(dt$Current_Date)
# function to count how many dates fall in the range [mn, mx]
n_in_range <- function(d, mn, mx) {
  sum(d <= mx & d >= mn)
}
# for loop over the rows
dt$Dates_in_range <- NA
for (i in seq_len(nrow(dt))) {
  dt$Dates_in_range[i] <- n_in_range(dates, dt$Start_Date[i], dt$Current_Date[i])
}
dt
Current_Date Start_Date Dates_in_range
1: 2022-10-21 2022-10-21 1
2: 2022-10-26 2022-10-26 1
3: 2022-10-26 2022-10-21 2
4: 2022-10-26 2022-10-26 1
5: 2022-10-26 2022-10-21 2
6: 2022-10-28 2022-10-26 2
7: 2022-10-28 2022-10-28 1
8: 2022-10-28 2022-10-21 3
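For reference, the vapply() variant mentioned above could look roughly like the sketch below, next to a fully vectorized alternative based on base R's findInterval(). The names `df`, `by_vapply`, `by_findInterval`, and `u` are made up for this illustration, and the findInterval() version is a swapped-in technique rather than part of the answer above.

```r
# Rebuild the question's sample data with base R only.
df <- data.frame(
  Current_Date = as.Date(c("21-10-22", "26-10-22", "26-10-22", "26-10-22",
                           "26-10-22", "28-10-22", "28-10-22", "28-10-22"),
                         format = "%d-%m-%y"),
  Start_Date = as.Date(c("21-10-22", "26-10-22", "21-10-22", "26-10-22",
                         "21-10-22", "26-10-22", "28-10-22", "21-10-22"),
                       format = "%d-%m-%y")
)

# Sorted unique dates from the Current_Date column.
dates <- sort(unique(df$Current_Date))

# Same helper as in the loop version: how many dates lie in [mn, mx]?
n_in_range <- function(d, mn, mx) sum(d >= mn & d <= mx)

# vapply() replaces the explicit for loop and guarantees an integer result.
by_vapply <- vapply(
  seq_len(nrow(df)),
  function(i) n_in_range(dates, df$Start_Date[i], df$Current_Date[i]),
  integer(1)
)

# Vectorized alternative: (number of dates <= Current_Date) minus
# (number of dates < Start_Date), both looked up in the sorted vector u.
u <- as.numeric(dates)
by_findInterval <- findInterval(as.numeric(df$Current_Date), u) -
  findInterval(as.numeric(df$Start_Date), u, left.open = TRUE)

df$Dates_in_range <- by_findInterval
print(df)
```

The findInterval() version does a binary search per row instead of scanning every unique date, which may matter once the table has hundreds of thousands of rows.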
Without using grouping:
library(lubridate)
library(tidyverse)
dt = data.frame(Current_Date = c("21-10-22","26-10-22","26-10-22","26-10-22","26-10-22","28-10-22","28-10-22","28-10-22"),
                Start_Date = c("21-10-22","26-10-22","21-10-22","26-10-22","21-10-22","26-10-22","28-10-22","21-10-22"))
dt %>%
  mutate(across(ends_with("_Date"), dmy)) %>%
  mutate(Dates_in_range = map2_dbl(.x = Start_Date, .y = Current_Date,
                                   .f = ~sum(between(x = unique(Current_Date), left = .x, right = .y))))
#> Current_Date Start_Date Dates_in_range
#> 1 2022-10-21 2022-10-21 1
#> 2 2022-10-26 2022-10-26 1
#> 3 2022-10-26 2022-10-21 2
#> 4 2022-10-26 2022-10-26 1
#> 5 2022-10-26 2022-10-21 2
#> 6 2022-10-28 2022-10-26 2
#> 7 2022-10-28 2022-10-28 1
#> 8 2022-10-28 2022-10-21 3
Created on 2022-10-29 with reprex v2.0.2
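Since the approach above leaves the grouping factor out, here is a rough sketch of how it might be extended with group_by(). The `Group` column, the invented group "B" rows, and the names `df`, `res`, and `u` are assumptions for illustration only.

```r
library(dplyr)
library(purrr)

# Hypothetical data: the question's eight rows as group "A", plus an invented group "B".
df <- data.frame(
  Group = rep(c("A", "B"), times = c(8, 3)),
  Current_Date = c("21-10-22", "26-10-22", "26-10-22", "26-10-22", "26-10-22",
                   "28-10-22", "28-10-22", "28-10-22",
                   "22-10-22", "27-10-22", "27-10-22"),
  Start_Date = c("21-10-22", "26-10-22", "21-10-22", "26-10-22", "21-10-22",
                 "26-10-22", "28-10-22", "21-10-22",
                 "20-10-22", "22-10-22", "27-10-22")
)

res <- df %>%
  # Parse the dd-mm-yy strings into Date objects.
  mutate(across(ends_with("_Date"), ~ as.Date(.x, format = "%d-%m-%y"))) %>%
  group_by(Group) %>%
  mutate(Dates_in_range = {
    # Unique dates of the current group only.
    u <- unique(Current_Date)
    # Count, for each row, how many of them lie in [Start_Date, Current_Date].
    map2_int(Start_Date, Current_Date, function(s, e) sum(u >= s & u <= e))
  }) %>%
  ungroup()

print(res)
```

Computing `u` once per group inside the braces keeps the unique-date lookup out of the per-row function.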