r - Improving code execution time with data.table and a for loop



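Question: How can I make the for loop in the code below run more efficiently? For this toy example it finishes in a reasonable time. However, unique_ids will be a vector of roughly 8000 entries, and the for loop slows the computation down considerably. Any ideas? Many thanks!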

Purpose: For every day, retrospectively classify each IID as hop or top according to the calculation logic in the for loop.

Initial data:

   IID      ENTRY     FINISH     TARGET max_finish_target_date
1:   1 2020-02-11 2020-02-19 2020-02-15             2020-02-19
2:   2 2020-02-13 2020-02-17 2020-02-19             2020-02-19

Final (target) data:

    IID      Dates ind_frist
 1:   1 2020-02-10
 2:   1 2020-02-11       hop
 3:   1 2020-02-12       hop
 4:   1 2020-02-13       hop
 5:   1 2020-02-14       hop
 6:   1 2020-02-15       hop
 7:   1 2020-02-16       top
 8:   1 2020-02-17       top
 9:   1 2020-02-18       top
10:   1 2020-02-19       top
11:   2 2020-02-10
12:   2 2020-02-11
13:   2 2020-02-12
14:   2 2020-02-13       hop
15:   2 2020-02-14       hop
16:   2 2020-02-15       hop
17:   2 2020-02-16       hop
18:   2 2020-02-17       hop
19:   2 2020-02-18
20:   2 2020-02-19
21:   3 2020-02-10
22:   3 2020-02-11
23:   3 2020-02-12
24:   3 2020-02-13
25:   3 2020-02-14
26:   3 2020-02-15       hop
27:   3 2020-02-16       hop
28:   3 2020-02-17       top
29:   3 2020-02-18       top
30:   3 2020-02-19       top

Code:

rm(list = ls())
library(data.table)
library(lubridate)  # provides ymd()

# Some sample start data
initial_dt <- data.table(IID = c(1, 2, 3),
                         ENTRY = c("2020-02-11", "2020-02-13", "2020-02-15"),
                         FINISH = c("2020-02-19", "2020-02-17", ""),
                         TARGET = c("2020-02-15", "2020-02-19", "2020-02-16"))
initial_dt[, `:=`(ENTRY = ymd(ENTRY),
                  FINISH = ymd(FINISH),
                  TARGET = ymd(TARGET))]
# An empty FINISH parses to NA (with a warning); fill it with today's date
initial_dt[is.na(FINISH), FINISH := Sys.Date()]

initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]

# Specify target data shape and output format
unique_ids <- c(1, 2, 3)
dts <- seq(as.Date("2020-02-10"), Sys.Date(), by = "days")
ids <- rep(unique_ids, each = length(dts))
len <- length(unique_ids)
final_dt <- data.table(IID = ids,
                       Dates = rep(dts, times = len))

# Calculation logic
# QUESTION: How can I make this part below run more efficiently and less time costly?
for (d_id in unique_ids) {
  final_dt[(IID == d_id) & (Dates %between% c(initial_dt[IID == d_id, ENTRY], initial_dt[IID == d_id, max_finish_target_date])),
           ind_frist := ifelse((Dates > initial_dt[IID == d_id, TARGET]) & (Dates <= initial_dt[IID == d_id, max_finish_target_date]),
                               "hop",
                               "top")]
}

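The loop does not produce the output shown. The following non-equi joins do produce that output, and they can easily be adapted to other rules (for example, the rules from the for loop):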

final_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)
final_dt[initial_dt, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]
final_dt[initial_dt, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]

These joins should be very fast.

Result:

#    IID      Dates ind_frist
# 1:   1 2020-02-10      <NA>
# 2:   1 2020-02-11       hop
# 3:   1 2020-02-12       hop
# 4:   1 2020-02-13       hop
# 5:   1 2020-02-14       hop
# 6:   1 2020-02-15       hop
# 7:   1 2020-02-16       top
# 8:   1 2020-02-17       top
# 9:   1 2020-02-18       top
#10:   1 2020-02-19       top
#11:   2 2020-02-10      <NA>
#12:   2 2020-02-11      <NA>
#13:   2 2020-02-12      <NA>
#14:   2 2020-02-13       hop
#15:   2 2020-02-14       hop
#16:   2 2020-02-15       hop
#17:   2 2020-02-16       hop
#18:   2 2020-02-17       hop
#19:   2 2020-02-18      <NA>
#20:   2 2020-02-19      <NA>
#21:   3 2020-02-10      <NA>
#22:   3 2020-02-11      <NA>
#23:   3 2020-02-12      <NA>
#24:   3 2020-02-13      <NA>
#25:   3 2020-02-14      <NA>
#26:   3 2020-02-15       hop
#27:   3 2020-02-16       hop
#28:   3 2020-02-17       top
#29:   3 2020-02-18       top
#30:   3 2020-02-19       top
#    IID      Dates ind_frist
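
As a rough sketch of the adaptation mentioned above, the same two-step non-equi join can follow the for loop's rules instead: "top" inside the window from ENTRY to max_finish_target_date, then "hop" after TARGET. The fixed dates (FINISH for IID 3 set to 2020-02-19, the last date in the output above) and the name loop_rules_dt are assumptions made only to keep the example reproducible.

```r
library(data.table)

# Toy inputs with fixed dates (assumption: the open FINISH of IID 3 is 2020-02-19)
initial_dt <- data.table(
  IID    = c(1, 2, 3),
  ENTRY  = as.Date(c("2020-02-11", "2020-02-13", "2020-02-15")),
  FINISH = as.Date(c("2020-02-19", "2020-02-17", "2020-02-19")),
  TARGET = as.Date(c("2020-02-15", "2020-02-19", "2020-02-16"))
)
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]
dts <- seq(as.Date("2020-02-10"), as.Date("2020-02-19"), by = "days")

# loop_rules_dt is a hypothetical name for the result table
loop_rules_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)

# Step 1: every date inside ENTRY..max_finish_target_date starts as "top"
loop_rules_dt[initial_dt, ind_frist := "top",
              on = .(IID, Dates >= ENTRY, Dates <= max_finish_target_date)]

# Step 2: dates after TARGET (up to max_finish_target_date) become "hop".
# Assumption: TARGET is never earlier than ENTRY, as in the sample data;
# otherwise dates before ENTRY would be labelled here as well.
loop_rules_dt[initial_dt, ind_frist := "hop",
              on = .(IID, Dates > TARGET, Dates <= max_finish_target_date)]

loop_rules_dt[]
```

Rows outside the window stay NA, just as they do with the for loop.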

A possible alternative using a data.table join:

final_dt[initial_dt
         , on = .(IID)
         , ind_frist := c("", "top","hop")[1L + (Dates > TARGET & Dates <= max_finish_target_date) +
                                             Dates %between% .(ENTRY, max_finish_target_date)]][]

which gives:

    IID      Dates ind_frist
 1:   1 2020-02-10
 2:   1 2020-02-11       top
 3:   1 2020-02-12       top
 4:   1 2020-02-13       top
 5:   1 2020-02-14       top
 6:   1 2020-02-15       top
 7:   1 2020-02-16       hop
 8:   1 2020-02-17       hop
 9:   1 2020-02-18       hop
10:   1 2020-02-19       hop
11:   2 2020-02-10
12:   2 2020-02-11
13:   2 2020-02-12
14:   2 2020-02-13       top
15:   2 2020-02-14       top
16:   2 2020-02-15       top
17:   2 2020-02-16       top
18:   2 2020-02-17       top
19:   2 2020-02-18       top
20:   2 2020-02-19       top
21:   3 2020-02-10
22:   3 2020-02-11
23:   3 2020-02-12
24:   3 2020-02-13
25:   3 2020-02-14
26:   3 2020-02-15       top
27:   3 2020-02-16       top
28:   3 2020-02-17       hop
29:   3 2020-02-18       hop
30:   3 2020-02-19       hop

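This matches the labels produced by the for loop; the only difference is that the loop leaves rows outside the ENTRY to max_finish_target_date window as NA instead of an empty string.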
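
One way to check this on the toy data is sketched below. It runs a loop with the same rules and a join-based update side by side and compares the labels. The fixed dates, the names loop_dt, join_dt and rule, and the use of explicit comparisons in place of %between% are choices made for this sketch only; NA from the loop is treated as equivalent to the empty string from the join.

```r
library(data.table)

# Toy inputs with fixed dates (assumption: the open FINISH of IID 3 is 2020-02-19)
initial_dt <- data.table(
  IID    = c(1, 2, 3),
  ENTRY  = as.Date(c("2020-02-11", "2020-02-13", "2020-02-15")),
  FINISH = as.Date(c("2020-02-19", "2020-02-17", "2020-02-19")),
  TARGET = as.Date(c("2020-02-15", "2020-02-19", "2020-02-16"))
)
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]
dts <- seq(as.Date("2020-02-10"), as.Date("2020-02-19"), by = "days")

loop_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)
join_dt <- copy(loop_dt)

# Loop version: one pass per IID, same rules as in the question
for (d_id in initial_dt[["IID"]]) {
  rule <- initial_dt[IID == d_id]
  loop_dt[IID == d_id & Dates >= rule$ENTRY & Dates <= rule$max_finish_target_date,
          ind_frist := ifelse(Dates > rule$TARGET & Dates <= rule$max_finish_target_date,
                              "hop", "top")]
}

# Join version: a single update join on IID, labels picked by index 1, 2 or 3
join_dt[initial_dt, on = .(IID),
        ind_frist := c("", "top", "hop")[
          1L + (Dates > TARGET & Dates <= max_finish_target_date) +
            (Dates >= ENTRY & Dates <= max_finish_target_date)]]

# The loop leaves untouched rows as NA; map them to "" before comparing
loop_labels <- loop_dt[["ind_frist"]]
loop_labels[is.na(loop_labels)] <- ""
identical(loop_labels, join_dt[["ind_frist"]])
```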

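Some explanation: the part 1L + (Dates > TARGET & Dates <= max_finish_target_date) + Dates %between% .(ENTRY, max_finish_target_date) creates an index vector of ones, twos and threes whose length equals the number of rows of final_dt. If you put it in square brackets after c("", "top","hop"), every one yields an empty string, every two yields "top", and every three yields "hop".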
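
The indexing idea can be seen in isolation with a few hand-made flags. This is a minimal base R sketch; the vectors in_window and after_target are hypothetical stand-ins for the two conditions in the join, and it assumes, as in the sample data, that a date after TARGET is also inside the window.

```r
# Labels by position: 1 -> "", 2 -> "top", 3 -> "hop"
labels <- c("", "top", "hop")

# Hypothetical flags for four dates of one IID
in_window    <- c(FALSE, TRUE, TRUE, TRUE)   # ENTRY <= Dates <= max_finish_target_date
after_target <- c(FALSE, FALSE, TRUE, TRUE)  # Dates > TARGET & Dates <= max_finish_target_date

# Logical values count as 0 or 1 when added, so idx is 1, 2 or 3
idx <- 1L + after_target + in_window
idx          # 1 2 3 3
labels[idx]  # "" "top" "hop" "hop"
```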
