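After grouping by id, I want to count, for each row, the number of rows that fall within a time range around that row. For example, assume a 1-month window on either side of each datetime entry in the column "cleaned_date".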
head(data$cleaned_date)
[1] "2004-10-11 CDT" "2008-09-10 CDT" "2011-10-25 CDT" "2011-12-31 CST"
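The dates are in POSIXct format.
For the first entry, I need to count the rows between 2004-09-11 and 2004-11-11; for the second entry, the rows between 2008-08-10 and 2008-10-10; and so on.
I used roughly the following code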
data %>% group_by(id) %>% filter(cleaned_date %within% interval(cleaned_date - 24 * 60 * 60 * 30, cleaned_date + 24 * 60 * 60 * 30)) %>% mutate(counts = n())
But it doesn't seem to work: I get counts as an empty column. Any help would be appreciated, thanks!
A reproducible example follows.
The input is
cleaned_date id
1 2008-09-11 A
2 2008-09-10 B
3 2008-09-30 B
4 2011-10-25 A
5 2011-11-14 A
I would like the output to be
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
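For the first entry, I want to count the rows in the time frame 2008-08-11 to 2008-10-11. The second entry falls in that range, but we need to group by "id", so it does not count. For the second entry, I want to count the rows in the time frame 2008-08-10 to 2008-10-10; rows 2 and 3 qualify, so the count is 2. For the third entry, I want to count the rows in the time frame 2008-08-30 to 2008-10-30; rows 2 and 3 qualify again, and so on.
Note that the actual dataset I want to work on has millions of rows, so using tidyverse rather than base R may be more efficient.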
Perhaps not the most elegant solution.
# input data. Dates as character vector
input = data.frame(
cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"),
id = c("A", "B", "B", "A", "A")
)
# function to create a date window n months around specified date
window <- function(x, n = 1){
x <- rep(as.POSIXlt(x),2)
x[1]$mon <- x[1]$mon - n
x[2]$mon <- x[2]$mon + n
return(format(seq(from = x[1], to = x[2], by = "day"), format="%Y-%m-%d"))
}
# find counts for each row
input$counts <- unlist(lapply(1:nrow(input), function(x){
length(which((input$cleaned_date %in% window(input$cleaned_date[x])) & input$id == input$id[x]))
}))
input
cleaned_date id counts
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
4 2011-10-25 A 2
5 2011-11-14 A 2
Edit for large datasets:
# dummy dataset with 1,000,000 rows
years <- c(2000:2020)
months <- c(1:12)
days <- c(1:20)
n <- 1000000
dates <- paste(sample(years, size = n, replace = T), sample(months, size = n, replace = T), sample(days, size = n, replace = T), sep = "-")
groups <- sample(c("A","B","C"), size = n, replace = T)
input <- data.frame(
cleaned_date = dates,
id = groups
)
input$cleaned_date <- format(as.POSIXlt(input$cleaned_date), format="%Y-%m-%d")
# sort data by date: needed so that the counts appended in the loop below line up with the rows (it also gives a small boost in performance)
input <- input[order(input$cleaned_date),]
counts <- NULL
#pb <- progress::progress_bar$new(total = length(unique(input$cleaned_date)))
t1 <- Sys.time()
# split up vectorization for each unique date.
for(date in unique(input$cleaned_date)){
#pb$tick()
w <- window(date)
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == date), function(x){
length(which(tmp$id == tmp$id[x]))
}))
counts <- c(counts, tmp_counts)
}
# add counts to dataset
input$counts <- counts
# optional, re-order data to original format
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 3.247204 mins
If you want it to go faster, you can run the loop in parallel
library(foreach)
library(doParallel)
cores=detectCores()
cl <- makeCluster(cores[1]-1)
registerDoParallel(cl)
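# re-sort by date first: the combined counts come back in date order, so the rows must be in that order too
input <- input[order(input$cleaned_date),]
dates = unique(input$cleaned_date)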
t1 <- Sys.time()
counts <- foreach(i=1:length(dates), .combine= "c") %dopar% {
w <- window(dates[i])
tmp <- input[which(input$cleaned_date %in% w),]
tmp_counts <- unlist(lapply(which(tmp$cleaned_date == dates[i]), function(x){
length(which(tmp$id == tmp$id[x]))
}))
tmp_counts
}
stopCluster(cl)
input$counts <- counts
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)
Time difference of 37.37211 secs
Note that I ran this on a MacBook Pro with a 2.3 GHz quad-core Intel Core i7 and 16 GB of RAM.
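For what it's worth, the same per-row counts can probably be obtained without looping over every date. The sketch below is an alternative technique, not the code timed above: it sorts each id's dates once and uses base R's findInterval() to count how many of them fall inside each row's window. The helper names shift_months() and count_in_window() are made up for illustration, and the dates are assumed to convert cleanly with as.Date().

```r
# Hypothetical helper: shift a Date vector by n calendar months
# (same "adjust the mon field" trick as window() above).
shift_months <- function(d, n) {
  lt <- as.POSIXlt(d)
  lt$mon <- lt$mon + n
  as.Date(lt)
}

# Hypothetical helper: for each row, count the rows with the same id whose
# date lies within n months on either side (both ends inclusive).
count_in_window <- function(dates, ids, n = 1) {
  dates <- as.Date(dates)
  counts <- integer(length(dates))
  for (idx in split(seq_along(dates), ids)) {
    d <- dates[idx]
    sorted <- sort(as.numeric(d))            # whole days since 1970-01-01
    lo <- as.numeric(shift_months(d, -n))
    hi <- as.numeric(shift_months(d, n))
    # findInterval(x, v) is the number of elements of sorted v that are <= x,
    # so the difference counts the dates in [lo, hi].
    counts[idx] <- findInterval(hi, sorted) - findInterval(lo - 1, sorted)
  }
  counts
}

small <- data.frame(
  cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"),
  id = c("A", "B", "B", "A", "A")
)
small$counts <- count_in_window(small$cleaned_date, small$id)
small
```

Because the month arithmetic adjusts the mon field directly, a date such as the 31st may roll over into the following month when the target month is shorter. Whether that is acceptable depends on how "1 month" should be defined for your data.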
It's still hard to tell what you are trying to accomplish, but this will at least get you counts within a specified date range:
df %>%
group_by(id) %>%
filter(cleaned_date >= "2008-08-11" & cleaned_date <= "2008-10-11") %>%
mutate(counts = n())
This gives us:
cleaned_date id counts
<date> <chr> <int>
1 2008-09-11 A 1
2 2008-09-10 B 2
3 2008-09-30 B 2
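To get a count per row rather than for one fixed range, each row has to be compared against all the dates in its group. The filter() in the question compares each date only with its own interval, which is always TRUE, so nothing is filtered out and n() is just the group size. A minimal dplyr sketch follows, assuming cleaned_date is a Date column and using the same 30-day half-width as the question's code (not a calendar month):

```r
library(dplyr)

df <- data.frame(
  cleaned_date = as.Date(c("2008-09-11", "2008-09-10", "2008-09-30",
                           "2011-10-25", "2011-11-14")),
  id = c("A", "B", "B", "A", "A")
)

result <- df %>%
  group_by(id) %>%
  # for row i, count the rows of the same id whose date is within 30 days of it
  mutate(counts = vapply(
    seq_along(cleaned_date),
    function(i) sum(abs(as.numeric(cleaned_date - cleaned_date[i])) <= 30),
    integer(1)
  )) %>%
  ungroup()

result
```

On the example input this reproduces the counts column requested in the question. The comparison is quadratic in the size of each group, so for millions of rows spread over a few ids a sorted or join-based approach would likely be needed instead.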