Count the number of rows within a time range around each row in R (tidyverse)



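After grouping by id, I want to count, for each row, the number of rows that fall within a time range around that row. For example, assume a 1-month window on either side of each datetime entry in the column "cleaned_date".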

head(data$cleaned_date)

[1] "2004-10-11 CDT" "2008-09-10 CDT" "2011-10-25 CDT" "2011-12-31 CST"

The dates are in POSIXct format.

For the first entry, I need to count the rows between 2004-09-11 and 2004-11-11; for the second entry, the rows between 2008-08-10 and 2008-10-10; and so on.
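To make the window concrete, the bounds for a single entry can be computed with base R's `seq()` on a `Date`. This is only a sketch: `window_bounds` is a hypothetical helper name, not something from my actual code, and it assumes the window is exactly n calendar months on either side.

```r
# Hypothetical helper: lower and upper bound of an n-month window around one date
window_bounds <- function(d, n = 1) {
  d <- as.Date(d)
  c(seq(d, by = paste(-n, "month"), length.out = 2)[2],  # n months earlier
    seq(d, by = paste(n, "month"), length.out = 2)[2])   # n months later
}

window_bounds("2004-10-11")
# [1] "2004-09-11" "2004-11-11"
```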

I used roughly the following code:

data %>% group_by(id) %>% filter(cleaned_date %within% interval(cleaned_date - 24 * 60 * 60 * 30, cleaned_date + 24 * 60 * 60 * 30)) %>% mutate(counts = n())

But it doesn't seem to work: the count column comes back empty. Any help would be appreciated, thanks!

A reproducible example follows.

The input is

cleaned_date id
1   2008-09-11  A
2   2008-09-10  B
3   2008-09-30  B
4   2011-10-25  A
5   2011-11-14  A

and I would like the output to be

cleaned_date id counts
1   2008-09-11  A      1
2   2008-09-10  B      2
3   2008-09-30  B      2
4   2011-10-25  A      2
5   2011-11-14  A      2

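For the first entry, I want to count the rows in the time frame 2008-08-11 to 2008-10-11. The second row appears to qualify, but we need to group by "id", so it does not count. For the second entry, I want to count the rows in the time frame 2008-08-10 to 2008-10-10; rows 2 and 3 qualify, so the count is 2. For the third entry, I want to count the rows in the time frame 2008-08-30 to 2008-10-30; rows 2 and 3 qualify again, and so on.

Note that the actual dataset I want to work with has millions of rows, so using the tidyverse rather than base R may be more efficient.

Perhaps not the most elegant solution.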

也许不是最优雅的解决方案。

# input data. Dates as character vector
input = data.frame(
  cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"),
  id = c("A", "B", "B", "A", "A")
)
# function to create a date window n months around specified date
# (returns every day in the window as a "YYYY-MM-DD" string)
window <- function(x, n = 1){
  x <- rep(as.POSIXlt(x), 2)
  x[1]$mon <- x[1]$mon - n
  x[2]$mon <- x[2]$mon + n
  return(format(seq(from = x[1], to = x[2], by = "day"), format = "%Y-%m-%d"))
}
# find counts for each row: same id and date inside that row's window
input$counts <- unlist(lapply(seq_len(nrow(input)), function(x){
  length(which((input$cleaned_date %in% window(input$cleaned_date[x])) & input$id == input$id[x]))
}))
input
cleaned_date id counts
1   2008-09-11  A      1
2   2008-09-10  B      2
3   2008-09-30  B      2
4   2011-10-25  A      2
5   2011-11-14  A      2

Edit for large datasets:

# dummy dataset with 1,000,000 rows
years <- c(2000:2020)
months <- c(1:12)
days <- c(1:20)
n <- 1000000
dates <- paste(sample(years, size = n, replace = TRUE), sample(months, size = n, replace = TRUE), sample(days, size = n, replace = TRUE), sep = "-")
groups <- sample(c("A","B","C"), size = n, replace = TRUE)
input <- data.frame(
  cleaned_date = dates,
  id = groups
)
input$cleaned_date <- format(as.POSIXlt(input$cleaned_date), format = "%Y-%m-%d")
# sort data by date: the loop appends counts date by date,
# so the rows must be in the same order for the counts to line up
input <- input[order(input$cleaned_date),]
counts <- NULL
#pb <- progress::progress_bar$new(total = length(unique(input$cleaned_date)))
t1 <- Sys.time()
# split up vectorization for each unique date.
for(date in unique(input$cleaned_date)){
  #pb$tick()
  w <- window(date)
  # rows whose date falls inside the window around this date
  tmp <- input[which(input$cleaned_date %in% w),]
  # for each row on this date, count the window rows with the same id
  tmp_counts <- unlist(lapply(which(tmp$cleaned_date == date), function(x){
    length(which(tmp$id == tmp$id[x]))
  }))
  counts <- c(counts, tmp_counts)
}
# add counts to dataset
input$counts <- counts
# optional, re-order data to original format
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)

Time difference of 3.247204 mins

If you want it to go faster, you can run the loop in parallel:

library(foreach)
library(doParallel)
# leave one core free for the rest of the system
cores = detectCores()
cl <- makeCluster(cores[1] - 1)
registerDoParallel(cl)
dates = unique(input$cleaned_date)
t1 <- Sys.time()
counts <- foreach(i = 1:length(dates), .combine = "c") %dopar% {
  w <- window(dates[i])
  tmp <- input[which(input$cleaned_date %in% w),]
  tmp_counts <- unlist(lapply(which(tmp$cleaned_date == dates[i]), function(x){
    length(which(tmp$id == tmp$id[x]))
  }))
  tmp_counts
}
stopCluster(cl)
input$counts <- counts
input <- input[order(as.numeric(rownames(input))),]
print(Sys.time() - t1)

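Time difference of 37.37211 secs

Note that I ran this on a MacBook Pro with a 2.3 GHz quad-core Intel Core i7 and 16 GB of RAM.

It is still hard to tell exactly what you are trying to accomplish, but this will at least give you counts within a specified date range: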
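For millions of rows, a sort-and-search approach may scale better than looping over unique dates. The sketch below is an alternative under stated assumptions, not the method timed above: `shift_months` and `count_in_window` are hypothetical helper names, the window is taken as inclusive at both ends, and month arithmetic uses the same POSIXlt `mon` adjustment as `window()`.

```r
# Hypothetical helper: shift dates by n calendar months (vectorised)
shift_months <- function(d, n) {
  lt <- as.POSIXlt(d)
  lt$mon <- lt$mon + n
  as.Date(lt)
}

# Hypothetical helper: per-row count of same-id rows within +/- n months
count_in_window <- function(dates, ids, n = 1) {
  dates <- as.Date(dates)
  counts <- integer(length(dates))
  for (idx in split(seq_along(dates), ids)) {
    d <- dates[idx]
    s <- sort(as.numeric(d))                  # sorted dates of this id
    lower <- as.numeric(shift_months(d, -n))
    upper <- as.numeric(shift_months(d, n))
    # (number of dates <= upper) minus (number of dates < lower)
    counts[idx] <- findInterval(upper, s) - findInterval(lower, s, left.open = TRUE)
  }
  counts
}

example <- data.frame(
  cleaned_date = c("2008-09-11", "2008-09-10", "2008-09-30", "2011-10-25", "2011-11-14"),
  id = c("A", "B", "B", "A", "A")
)
example$counts <- count_in_window(example$cleaned_date, example$id)
example$counts
# [1] 1 2 2 2 2
```

Each id is sorted once and every row then needs only two binary searches, so nothing is compared date by date. I have not timed this on the 1,000,000-row dummy dataset.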

df %>%
  group_by(id) %>%
  filter(cleaned_date >= "2008-08-11" & cleaned_date <= "2008-10-11") %>%
  mutate(counts = n())

which gives us:

cleaned_date id    counts
<date>       <chr>  <int>
1 2008-09-11   A          1
2 2008-09-10   B          2
3 2008-09-30   B          2
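The filter above uses one fixed range, so it only covers the window of a single row. One likely reason the filter in the question does not give per-row counts either is that a vectorised comparison tests each date only against its own window, which is always true. The base-R sketch below illustrates this on the three id "A" dates, using the 30-day window from the question; `days`, `own_window` and `pairwise` are illustrative names.

```r
days <- as.numeric(as.Date(c("2008-09-11", "2011-10-25", "2011-11-14")))

# Elementwise: each date against its own window, so nothing is ever filtered out
own_window <- days >= days - 30 & days <= days + 30
sum(own_window)
# [1] 3

# Pairwise: every date against every row's window, then count per row
pairwise <- outer(days, days, function(x, centre) x >= centre - 30 & x <= centre + 30)
colSums(pairwise)
# [1] 1 2 2
```

With the elementwise test, `n()` would simply return the group size (3 here), whereas the pairwise test gives the wanted 1, 2, 2. The pairwise matrix grows quadratically with group size, so this is an illustration rather than something to run on millions of rows.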
