I have a data table df with the columns ID, DATE and STOCK.
In this table, the same ID has multiple dates and stock values:
ID DATE STOCK
a1 2017-05-04 1
a1 2017-06-04 4
a1 2017-06-05 1
a1 2018-05-04 1
a1 2018-06-04 3
a1 2018-06-05 1
a2 2016-11-26 2
a2 ... ..
Using lubridate, I can get the week each date falls in like this:
dfWeeks <- df[, WEEK := floor_date(DATE, "week")]
ID DATE STOCK WEEK
a1 2017-05-04 1 2017-04-30
a1 2017-06-04 4 2017-06-04
a1 2017-06-05 1 2017-06-04
a1 2018-05-04 1 2018-04-29
a1 2018-06-04 3 2018-06-03
a1 2018-06-05 1 2018-06-03
a2 2016-11-26 2 2016-11-20
a2 ... ..
So, from the DATE column, I know that my oldest date is 2017-05-04 and my latest date is 2018-06-05, which is a span of about 56.71429 weeks:
dates <- c( "2017-05-04","2018-06-05")
dif <- diff(as.numeric(strptime(dates, format = "%Y-%m-%d")))/(60 * 60 * 24 * 7)
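The same span can be cross-checked with base R's difftime, and compared with the number of calendar weeks the two dates touch. This is only a sketch: `week_start` and `n_weeks` are illustrative names, and the Sunday week start is an assumption chosen to match the WEEK column shown above.

```r
# Same two dates as above, parsed as Date objects
dates <- as.Date(c("2017-05-04", "2018-06-05"))

# Elapsed time between the two dates, in weeks
dif <- as.numeric(difftime(dates[2], dates[1], units = "weeks"))

# Sunday that starts each date's week (assumed to match the WEEK column above)
week_start <- dates - as.integer(format(dates, "%w"))

# Number of calendar weeks touched, counting both end weeks
n_weeks <- length(seq(week_start[1], week_start[2], by = "week"))
```

The elapsed span and the count of calendar weeks are not the same number, so it is worth deciding which one the mean should be divided by.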
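My table has only 4 unique weeks, so the idea is to sum the stock per week and insert a stock value of 0 for the missing weeks (57 - 4 = 53 weeks).
Then I could compute the mean over all weeks, something like: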
meanStock <- dfWeeks[, .(mean = sum(STOCK, na.rm = TRUE) / (diff(as.numeric(strptime(c(min(DATE), max(DATE)), format = "%Y-%m-%d"))) / (60 * 60 * 24 * 7))), by = .(ID)]
But I don't know whether this works. I hope I have made myself clear; any suggestions or approaches are welcome.
Update:
This is how I get the max and min dates:
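# Latest and earliest date per ID (names chosen so base max()/min() are not masked)
maxDates <- aggregate(df$DATE, by = list(df$ID), max)
colnames(maxDates) <- c("ID", "MAX")
minDates <- aggregate(df$DATE, by = list(df$ID), min)
colnames(minDates) <- c("ID", "MIN")
test <- merge(maxDates, minDates, by = "ID", all = TRUE)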
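Since df is already a data.table, the two aggregate calls and the merge could probably be collapsed into one grouped expression. A minimal sketch, assuming DATE is stored as a Date; the toy rows are copied from the sample above and `ranges` is just an illustrative name:

```r
library(data.table)

# Toy rows copied from the sample above (all a1 rows plus the one visible a2 row)
df <- data.table(
  ID    = c("a1", "a1", "a1", "a1", "a1", "a1", "a2"),
  DATE  = as.Date(c("2017-05-04", "2017-06-04", "2017-06-05",
                    "2018-05-04", "2018-06-04", "2018-06-05", "2016-11-26")),
  STOCK = c(1, 4, 1, 1, 3, 1, 2)
)

# One row per ID with its earliest and latest date
ranges <- df[, .(MIN = min(DATE), MAX = max(DATE)), by = ID]
```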
Something like this:
library(data.table)
library(lubridate)  # provides floor_date()
setDT(df)[, DATE := as.Date(DATE)][, `:=` (st = min(DATE), end = max(DATE) + 7), by = ID][
, .(ID = ID, DATE = DATE, STOCK = STOCK, Expanded = seq(st, end, by = "week")), by = 1:nrow(df)][
, `:=` (WEEK = floor_date(Expanded, "week"), WEEK2 = floor_date(DATE, "week"))][
WEEK != WEEK2, STOCK := 0][
, .(SUM_STOCK = sum(STOCK)), by = .(WEEK, ID)]
Output (rows for the weeks 2017-04-02 to 2017-06-11 and ID a1):
WEEK ID SUM_STOCK
1: 2017-04-02 a1 0
2: 2017-04-09 a1 0
3: 2017-04-16 a1 0
4: 2017-04-23 a1 0
5: 2017-04-30 a1 1
6: 2017-05-07 a1 0
7: 2017-05-14 a1 0
8: 2017-05-21 a1 0
9: 2017-05-28 a1 0
10: 2017-06-04 a1 5
11: 2017-06-11 a1 0
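
To get from weekly sums like these to the per-ID mean the question asks for, one alternative is to sum per week first and then join the sums onto a complete grid of weeks, instead of expanding every row. This is a sketch under assumptions, not a tested solution for the real data: the toy rows are copied from the question's sample, and `weekly`, `grid`, `filled`, `N_WEEKS` and `MEAN_STOCK` are illustrative names.

```r
library(data.table)
library(lubridate)

# Toy rows copied from the question's sample (all a1 rows plus the one visible a2 row)
df <- data.table(
  ID    = c("a1", "a1", "a1", "a1", "a1", "a1", "a2"),
  DATE  = as.Date(c("2017-05-04", "2017-06-04", "2017-06-05",
                    "2018-05-04", "2018-06-04", "2018-06-05", "2016-11-26")),
  STOCK = c(1, 4, 1, 1, 3, 1, 2)
)

# 1. Sum the stock per ID and week
weekly <- df[, .(SUM_STOCK = sum(STOCK)),
             by = .(ID, WEEK = floor_date(DATE, "week"))]

# 2. Build every week between each ID's first and last observed week
grid <- weekly[, .(WEEK = seq(min(WEEK), max(WEEK), by = "week")), by = ID]

# 3. Join the sums onto the grid; weeks without any rows come back as NA
filled <- weekly[grid, on = .(ID, WEEK)]
filled[is.na(SUM_STOCK), SUM_STOCK := 0]

# 4. Mean weekly stock per ID over all weeks, zero weeks included
meanStock <- filled[, .(N_WEEKS = .N, MEAN_STOCK = mean(SUM_STOCK)), by = ID]
```

Here the denominator is the number of calendar weeks between the first and last observed week of each ID, which is a design choice; dividing by the elapsed span in weeks, as in the question, gives a slightly different value.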