我有一个具有时间序列观测值的数据框架。我希望为每个观测值添加一个变量,其值为上一年最接近相似日期的值和次年最接近相似日期的值(例如,对于2014年5月15日的值,这可能是2013年5月13日和2015年5月21日)。是否有一个聪明的方法,例如使用dplyr,做到这一点?请在下面找到示例代码(大多数代码专注于创建一组随机的日期和值,这要归功于之前的SO问题)。提前感谢。
date value nearest_val_nextyear nearest_val_prevyear
1 2009-02-14 6.511781 0 0
2 2009-12-23 5.389843 0 0
3 2011-08-01 4.378759 0 0
4 2014-04-07 2.785300 0 0
5 2008-08-12 6.124931 0 0
6 2014-03-10 4.955066 0 0
7 2014-07-23 4.983810 0 0
8 2012-04-14 5.943836 0 0
9 2012-01-13 5.821221 0 0
10 2007-06-30 5.593901 0 0
11 2008-08-24 5.918977 0 0
12 2008-05-30 5.782136 0 0
13 2012-06-30 5.074565 0 0
14 2010-01-27 3.010648 0 0
15 2013-02-27 5.619826 0 0
16 2010-12-25 4.943871 0 0
17 2012-09-27 4.844204 0 0
18 2014-12-08 3.529248 0 0
19 2010-01-15 4.521850 0 0
20 2013-03-21 5.417942 0 0
# set start and end dates to sample between
day.start <- "2007/01/01"
day.end <- "2014/12/31"
set.seed(1)
# define a random date/time selection function
rand.day.time <- function(day.start,day.end,size) {
dayseq <- seq.Date(as.Date(day.start),as.Date(day.end),by="day")
dayselect <- sample(dayseq,size,replace=TRUE)
as.POSIXlt(paste(dayselect) )
}
dateval=rand.day.time(day.start,day.end,size=20)
value=rnorm(n=20,mean=5,sd=1)
df=data.frame(date=dateval,value=value)
df$nearest_val_nextyear=0
df$nearest_val_prevyear=0
df
这绝对不是一个聪明的方法,但我发帖希望有人,也许你,可以使它聪明/漂亮。
library(dplyr)
library(lubridate)
dat <- data.frame(dateval, value)
dat <- dat %>% mutate(year = year(dateval), nv_next = NA, nv_prev = NA)
#You don't really need dplyr just for this...
shifts <- c(1, -1) #nextyear, prevyear
for (s in 1:2) { #Once for each shift
for (i in 1:nrow(dat)) {
otheryear <- dat[dat[,"year"]==dat[i,"year"]+shifts[s],] #Subset the df with only dates of other year
if (nrow(otheryear) == 0) { #Ends if there's no other year
dat[i,3+s] <- NA
} else {
cands <- otheryear$dateval #Candidates to have their value chosen
cands_shifted <- cands
year(cands_shifted) <- dat[i,"year"] #Change the year in cand's copy
nearest_date <- which.min(abs(difftime(dat[i,"dateval"], cands_shifted))) #After the years are the same, the closest date can be calculated with difftime
dat[i,3+s] <- dat[dat$dateval == cands[nearest_date],"value"] #We check back on cands what real date that was, and assign it's value
}
}
}
结果是
> dat
dateval value year nv_next nv_prev
1 2009-02-14 6.511781 2009 3.010648 5.782136
2 2009-12-23 5.389843 2009 4.943871 5.918977
3 2011-08-01 4.378759 2011 5.074565 4.943871
4 2014-04-07 2.785300 2014 NA 5.417942
5 2008-08-12 6.124931 2008 5.389843 5.593901
6 2014-03-10 4.955066 2014 NA 5.619826
7 2014-07-23 4.983810 2014 NA 5.417942
8 2012-04-14 5.943836 2012 5.417942 4.378759
9 2012-01-13 5.821221 2012 5.619826 4.378759
10 2007-06-30 5.593901 2007 5.782136 NA
11 2008-08-24 5.918977 2008 5.389843 5.593901
12 2008-05-30 5.782136 2008 6.511781 5.593901
13 2012-06-30 5.074565 2012 5.417942 4.378759
14 2010-01-27 3.010648 2010 4.378759 6.511781
15 2013-02-27 5.619826 2013 4.955066 5.821221
16 2010-12-25 4.943871 2010 4.378759 5.389843
17 2012-09-27 4.844204 2012 5.417942 4.378759
18 2014-12-08 3.529248 2014 NA 5.417942
19 2010-01-15 4.521850 2010 4.378759 6.511781
20 2013-03-21 5.417942 2013 4.955066 5.943836
我嵌套了for循环,而不是为每次移位使用副本,但是您必须小心nv_next
和nv_prev
,因为它们是通过索引而不是名称选择的。
Molx的答案比我的更短更好,但因为我已经把它写出来了,以防你想要一个答案a)更多地使用函数,b)以R为底,这里是我的答案。
加载数据:
dates = read.table("date_data.txt")
此函数仅用于查找以月份和日期表示的天数之差;你想要这样不同的年份就不会在亲密程度上被比较。
#get differences in terms of the months and dates only
compare_dates_days <- function(date1, date2, date_format = "%Y-%m-%d"){
#give them all "blank" years of "00"
month_day_only1 = paste("00", strsplit(date1, "-")[[1]][2], strsplit(date1, "-")[[1]][3], sep = "-")
month_day_only2 = paste("00", strsplit(date2, "-")[[1]][2], strsplit(date2, "-")[[1]][3], sep = "-")
day_difference = as.numeric(as.Date(as.character(month_day_only1, format = "%m-%d")) -
as.Date(as.character(month_day_only2, format = "%m-%d")))
return(day_difference)
}
#testing the above function
a = "2009-02-14"
b = "2009-02-28"
diff = compare_dates_days(a, b)
求向量中非零值的绝对值的最小值的函数。
min_abs_index <- function(v){
v.na = abs(v)
v.na[v==0] = NA
return(c( which.min(v.na) ))
}
下面是一个函数,它将一个日期与一个日期向量进行比较,并给出该日期在上面和下面年份中最近的一天的索引;它使用了上述函数。
above_below_year_date <- function(date, date_ref_compare, date_format = "%Y-%m-%d"){
one_year_ahead_diffs = rep(0, length(date_ref_compare))
one_year_behind_diffs = rep(0, length(date_ref_compare))
date_diffs = unlist(lapply(seq_along(1:length(date_ref_compare)),
function(i) compare_dates_days(date_ref_compare[i],date )))
for(i in 1:length(date_ref_compare)){
#calendar year ahead
if(as.numeric(sapply(strsplit(date, "-"),"[[", 1)) -
as.numeric(sapply(strsplit(date_ref_compare[i],
"-"),"[[", 1)) == 1){
one_year_ahead_diffs[i] = date_diffs[i]
}
#calendar year behind
if(as.numeric(sapply(strsplit(date, "-"),"[[", 1)) -
as.numeric(sapply(strsplit(date_ref_compare[i],
"-"),"[[", 1)) == -1){
one_year_behind_diffs[i] = date_diffs[i]
}
}
res_ahead = min_abs_index(one_year_ahead_diffs)
print(res_ahead)
print(one_year_ahead_diffs[res_ahead])
print(one_year_ahead_diffs)
res_behind = min_abs_index(one_year_behind_diffs)
return(c(res_ahead, res_behind))
}
我们将上述函数应用于所提供向量中的每个日期:
vector_of_ahead_indices = rep(0, length(dates$date))
vector_of_behind_indices = rep(0, length(dates$date))
for(i in 1:length(dates$date)){
res = above_below_year_date(dates$date[i], dates$date)
vector_of_ahead_indices[i] = res[1]
vector_of_behind_indices[i] = res[2]
}
dates$nearest_val_nextyear = dates$value[vector_of_behind_indices]
dates$nearest_val_prevyear = dates$value[vector_of_ahead_indices]
然后我们订购,以便于手工检查,并重新订购第一年的NA值在错误的列。
#order to make it easier to manually check
dates = dates[order(dates$date), ]
#reorder the first year
dates[1, "nearest_val_nextyear"] = dates[1, "nearest_val_prevyear"]
dates[1, "nearest_val_prevyear"] = NA
最后是排序后的输出以及原始行名:
date value nearest_val_nextyear nearest_val_prevyear
10 2007-06-30 5.593901 5.782136 NA
12 2008-05-30 5.782136 6.511781 5.593901
5 2008-08-12 6.124931 5.389843 5.593901
11 2008-08-24 5.918977 5.389843 5.593901
1 2009-02-14 6.511781 3.010648 5.782136
2 2009-12-23 5.389843 4.943871 5.918977
19 2010-01-15 4.521850 4.378759 6.511781
14 2010-01-27 3.010648 4.378759 6.511781
16 2010-12-25 4.943871 4.378759 5.389843
3 2011-08-01 4.378759 5.074565 4.943871
9 2012-01-13 5.821221 5.619826 4.378759
8 2012-04-14 5.943836 5.417942 4.378759
13 2012-06-30 5.074565 5.417942 4.378759
17 2012-09-27 4.844204 5.417942 4.378759
15 2013-02-27 5.619826 4.955066 5.821221
20 2013-03-21 5.417942 4.955066 5.943836
6 2014-03-10 4.955066 NA 5.417942
4 2014-04-07 2.785300 NA 5.417942
7 2014-07-23 4.983810 NA 5.417942
18 2014-12-08 3.529248 NA 5.417942