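Question: How can I make the for loop in the code below run more efficiently? For this toy example it finishes in a reasonable time. However, unique_ids will be a vector of roughly 8,000 entries, and the for loop then slows the computation down considerably. Any ideas? Many thanks!
Goal: Retrospectively classify each day of every IID as "hop" or "top", following the calculation logic in the for loop.
Initial data: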
IID ENTRY FINISH TARGET max_finish_target_date
1: 1 2020-02-11 2020-02-19 2020-02-15 2020-02-19
2: 2 2020-02-13 2020-02-17 2020-02-19 2020-02-19
Final (target) data:
IID Dates ind_frist
1: 1 2020-02-10
2: 1 2020-02-11 hop
3: 1 2020-02-12 hop
4: 1 2020-02-13 hop
5: 1 2020-02-14 hop
6: 1 2020-02-15 hop
7: 1 2020-02-16 top
8: 1 2020-02-17 top
9: 1 2020-02-18 top
10: 1 2020-02-19 top
11: 2 2020-02-10
12: 2 2020-02-11
13: 2 2020-02-12
14: 2 2020-02-13 hop
15: 2 2020-02-14 hop
16: 2 2020-02-15 hop
17: 2 2020-02-16 hop
18: 2 2020-02-17 hop
19: 2 2020-02-18
20: 2 2020-02-19
21: 3 2020-02-10
22: 3 2020-02-11
23: 3 2020-02-12
24: 3 2020-02-13
25: 3 2020-02-14
26: 3 2020-02-15 hop
27: 3 2020-02-16 hop
28: 3 2020-02-17 top
29: 3 2020-02-18 top
30: 3 2020-02-19 top
Code:
rm(list = ls())
library(data.table)
library(lubridate)  # provides ymd() and ymd_hms(), which are used below
# Some sample start data
initial_dt <- data.table(IID = c(1, 2, 3),
                         ENTRY = c("2020-02-11", "2020-02-13", "2020-02-15"),
                         FINISH = c("2020-02-19", "2020-02-17", ""),
                         TARGET = c("2020-02-15", "2020-02-19", "2020-02-16"))
initial_dt[, `:=`(ENTRY = ymd(ENTRY),
                  FINISH = ymd(FINISH),
                  TARGET = ymd(TARGET))]
initial_dt[is.na(FINISH), FINISH := as.Date(ymd_hms(Sys.time()), format = "%Y-%m-%d")]
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]
# Specify target data shape and output format
unique_ids <- c(1, 2, 3)
dts <- seq(as.Date("2020-02-10", format = "%Y-%m-%d"), as.Date(ymd_hms(Sys.time()), format = "%Y-%m-%d"), by = "days")
ids <- rep(unique_ids, each = length(dts))
len <- length(unique_ids)
final_dt <- data.table(IID = ids,
                       Dates = rep(dts, times = len))
# Calculation logic
# QUESTION: How can I make this part below run more efficiently and less time costly?
for (d_id in unique_ids) {
  final_dt[(IID == d_id) & (Dates %between% c(initial_dt[IID == d_id, ENTRY],
                                              initial_dt[IID == d_id, max_finish_target_date])),
           ind_frist := ifelse((Dates > initial_dt[IID == d_id, TARGET]) &
                                 (Dates <= initial_dt[IID == d_id, max_finish_target_date]),
                               "hop",
                               "top")]
}
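The loop does not produce the output shown in the question. The following non-equi joins do produce that output, and they can easily be adapted to other rules (for example, the rules from the for loop):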
final_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)
final_dt[initial_dt, ind_frist := "hop", on = .(IID, Dates >= ENTRY, Dates <= FINISH)]
final_dt[initial_dt, ind_frist := "top", on = .(IID, Dates > TARGET, Dates <= FINISH)]
These joins should be very fast.
Result:
# IID Dates ind_frist
# 1: 1 2020-02-10 <NA>
# 2: 1 2020-02-11 hop
# 3: 1 2020-02-12 hop
# 4: 1 2020-02-13 hop
# 5: 1 2020-02-14 hop
# 6: 1 2020-02-15 hop
# 7: 1 2020-02-16 top
# 8: 1 2020-02-17 top
# 9: 1 2020-02-18 top
#10: 1 2020-02-19 top
#11: 2 2020-02-10 <NA>
#12: 2 2020-02-11 <NA>
#13: 2 2020-02-12 <NA>
#14: 2 2020-02-13 hop
#15: 2 2020-02-14 hop
#16: 2 2020-02-15 hop
#17: 2 2020-02-16 hop
#18: 2 2020-02-17 hop
#19: 2 2020-02-18 <NA>
#20: 2 2020-02-19 <NA>
#21: 3 2020-02-10 <NA>
#22: 3 2020-02-11 <NA>
#23: 3 2020-02-12 <NA>
#24: 3 2020-02-13 <NA>
#25: 3 2020-02-14 <NA>
#26: 3 2020-02-15 hop
#27: 3 2020-02-16 hop
#28: 3 2020-02-17 top
#29: 3 2020-02-18 top
#30: 3 2020-02-19 top
# IID Dates ind_frist
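As a rough sketch of what adapting the joins to the for loop's rules could look like: label every day from ENTRY through max_finish_target_date as "top", then overwrite the days after TARGET with "hop". The name loop_rules_dt is hypothetical, the FINISH date of IID 3 is fixed to 2020-02-19 (an assumption, so the result does not depend on Sys.time()), and the second join assumes that TARGET is never earlier than ENTRY.

```r
library(data.table)

# Sample data with a fixed FINISH for IID 3 (assumption, replaces Sys.time())
initial_dt <- data.table(
  IID    = c(1, 2, 3),
  ENTRY  = as.Date(c("2020-02-11", "2020-02-13", "2020-02-15")),
  FINISH = as.Date(c("2020-02-19", "2020-02-17", "2020-02-19")),
  TARGET = as.Date(c("2020-02-15", "2020-02-19", "2020-02-16"))
)
initial_dt[, max_finish_target_date := pmax(FINISH, TARGET)]

# One row per IID and day
dts <- seq(as.Date("2020-02-10"), as.Date("2020-02-19"), by = "days")
loop_rules_dt <- CJ(IID = initial_dt[["IID"]], Dates = dts)

# "top" for every day from ENTRY up to max_finish_target_date ...
loop_rules_dt[initial_dt, ind_frist := "top",
              on = .(IID, Dates >= ENTRY, Dates <= max_finish_target_date)]
# ... then overwrite with "hop" for the days after TARGET
# (assumes TARGET is not before ENTRY)
loop_rules_dt[initial_dt, ind_frist := "hop",
              on = .(IID, Dates > TARGET, Dates <= max_finish_target_date)]

loop_rules_dt[]
```

Days outside the ENTRY to max_finish_target_date window stay NA, as they do in the for loop.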
A possible alternative using a data.table join:
final_dt[initial_dt
, on = .(IID)
, ind_frist := c("", "top","hop")[1L + (Dates > TARGET & Dates <= max_finish_target_date) +
Dates %between% .(ENTRY, max_finish_target_date)]][]
which gives:
IID Dates ind_frist
1: 1 2020-02-10
2: 1 2020-02-11 top
3: 1 2020-02-12 top
4: 1 2020-02-13 top
5: 1 2020-02-14 top
6: 1 2020-02-15 top
7: 1 2020-02-16 hop
8: 1 2020-02-17 hop
9: 1 2020-02-18 hop
10: 1 2020-02-19 hop
11: 2 2020-02-10
12: 2 2020-02-11
13: 2 2020-02-12
14: 2 2020-02-13 top
15: 2 2020-02-14 top
16: 2 2020-02-15 top
17: 2 2020-02-16 top
18: 2 2020-02-17 top
19: 2 2020-02-18 top
20: 2 2020-02-19 top
21: 3 2020-02-10
22: 3 2020-02-11
23: 3 2020-02-12
24: 3 2020-02-13
25: 3 2020-02-14
26: 3 2020-02-15 top
27: 3 2020-02-16 top
28: 3 2020-02-17 hop
29: 3 2020-02-18 hop
30: 3 2020-02-19 hop
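This assigns the same labels as the for loop; the only difference is that rows outside the ENTRY to max_finish_target_date window hold an empty string here instead of NA.
Some explanation: the part 1L + (Dates > TARGET & Dates <= max_finish_target_date) + Dates %between% .(ENTRY, max_finish_target_date) creates an index vector of ones, twos and threes whose length equals the number of rows of final_dt. If you put it in square brackets after c("", "top","hop"), every one yields an empty string, every two yields "top", and every three yields "hop".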
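The indexing trick can be seen in isolation with a small base R sketch. The vectors in_window and after_target are hypothetical stand-ins for the two logical conditions, evaluated by hand for IID 1 on 2020-02-10, 2020-02-15, 2020-02-16 and 2020-02-19:

```r
labels <- c("", "top", "hop")

# Stand-in for: Dates %between% .(ENTRY, max_finish_target_date)
in_window    <- c(FALSE, TRUE, TRUE, TRUE)
# Stand-in for: Dates > TARGET & Dates <= max_finish_target_date
after_target <- c(FALSE, FALSE, TRUE, TRUE)

# Logicals are coerced to 0/1 when added, so the sum is 1, 2 or 3
idx <- 1L + after_target + in_window
idx          # 1 2 3 3
labels[idx]  # "" "top" "hop" "hop"
```

Because after_target can only be TRUE where in_window is also TRUE, the index never reaches 3 without passing through 2, which is why three labels are enough.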