我想计算一列,该列统计特定ID在预定义时间间隔(例如2天)内向后看的发生次数。我在R中有以下数据结构(见下面的代码),并希望自动计算列countLast2d:
userID <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3)
datetime <-c("2015-07-02 13:20:00", "2015-07-03 13:20:00", "2015-07-04 01:20:00",
"2015-07-10 01:20:00", "2015-07-23 01:20:00", "2015-07-23 06:08:00", "2015-07-24 06:08:00",
"2015-09-02 09:01:00", "2015-08-19 11:41:00", "2015-08-19 14:38:00", "2015-08-19 17:36:00",
"2015-08-19 20:33:00", "2015-08-19 23:30:00", "2015-08-19 23:46:00", "2015-08-19 05:19:00",
"2015-09-13 17:02:00", "2015-10-01 00:32:00", "2015-10-01 00:50:00")
结果应具有以下值:
countLast2d <- c(0,1,2,0,0,1,2,0,0,1,0,0,0,1,0,0,0,1)
df <- data.frame(userID, countLast2d, datetime)
df$datetime = as.POSIXct(strptime(df$datetime, format = "%Y-%m-%d %H:%M:%S"))
在Excel中,我会使用以下公式:
=countifs([datecolumn],"<"&[date cell in that row],[datecolumn],"<"&[date cell in that row]-2,[idcolumn],[id cell in that row])
(例如[C2]=+COUNTIFS($B:$B,"<"&$B2,$B:$B,">="&$B2-2,$A:$A,$A2)
,如果A列包含id,B列包含日期)
我以前已经问过一次这个问题了(https://stackoverflow.com/questions/30998596/r-count-number-of-occurences-by-id-in-the-last-48h)但我的问题中没有包括一个例子。很抱歉再次询问。
这里有一个解决方案:
df <- data.frame(userID=c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3),datetime=as.POSIXct(c('2015-07-02 13:20:00','2015-07-03 13:20:00','2015-07-04 01:20:00','2015-07-10 01:20:00','2015-07-23 01:20:00','2015-07-23 06:08:00','2015-07-24 06:08:00','2015-09-02 09:01:00','2015-08-19 11:41:00','2015-08-19 14:38:00','2015-08-19 17:36:00','2015-08-19 20:33:00','2015-08-19 23:30:00','2015-08-19 23:46:00','2015-08-19 05:19:00','2015-09-13 17:02:00','2015-10-01 00:32:00','2015-10-01 00:50:00')));
window <- as.difftime(2,units='days');
df$countLast2d <- sapply(1:nrow(df),function(r) sum(df$userID==df$userID[r] & df$datetime<df$datetime[r] & df$datetime>=df$datetime[r]-window));
df;
## userID datetime countLast2d
## 1 1 2015-07-02 13:20:00 0
## 2 1 2015-07-03 13:20:00 1
## 3 1 2015-07-04 01:20:00 2
## 4 1 2015-07-10 01:20:00 0
## 5 1 2015-07-23 01:20:00 0
## 6 1 2015-07-23 06:08:00 1
## 7 1 2015-07-24 06:08:00 2
## 8 1 2015-09-02 09:01:00 0
## 9 2 2015-08-19 11:41:00 1
## 10 2 2015-08-19 14:38:00 2
## 11 2 2015-08-19 17:36:00 3
## 12 2 2015-08-19 20:33:00 4
## 13 2 2015-08-19 23:30:00 5
## 14 2 2015-08-19 23:46:00 6
## 15 2 2015-08-19 05:19:00 0
## 16 3 2015-09-13 17:02:00 0
## 17 3 2015-10-01 00:32:00 0
## 18 3 2015-10-01 00:50:00 1
请注意,这与预期输出不同,因为userID==2
的预期输出不正确。
无论df
的顺序如何,此解决方案都将起作用,这对于您的示例df
来说是必不可少的,因为它对于userID==2
来说是无序的(或者至少不是完全有序的)。
编辑以下是一种可能性,使用by()
按userID
分组,并仅将每个元素与较小的索引元素进行比较,假设只有这些元素可以在回顾窗口中:
df2 <- df[order(df$userID,df$datetime),];
df2$countLast2d <- do.call(c,by(df2$datetime,df$userID,function(x) c(0,sapply(2:length(x),function(i) sum(x[1:(i-1)]>=x[i]-window)))));
df2;
## userID datetime countLast2d
## 1 1 2015-07-02 13:20:00 0
## 2 1 2015-07-03 13:20:00 1
## 3 1 2015-07-04 01:20:00 2
## 4 1 2015-07-10 01:20:00 0
## 5 1 2015-07-23 01:20:00 0
## 6 1 2015-07-23 06:08:00 1
## 7 1 2015-07-24 06:08:00 2
## 8 1 2015-09-02 09:01:00 0
## 15 2 2015-08-19 05:19:00 0
## 9 2 2015-08-19 11:41:00 1
## 10 2 2015-08-19 14:38:00 2
## 11 2 2015-08-19 17:36:00 3
## 12 2 2015-08-19 20:33:00 4
## 13 2 2015-08-19 23:30:00 5
## 14 2 2015-08-19 23:46:00 6
## 16 3 2015-09-13 17:02:00 0
## 17 3 2015-10-01 00:32:00 0
## 18 3 2015-10-01 00:50:00 1