我不确定我是否正确使用术语宽/长,但我正试图以一种我可以看到每小时增量使用多少劳动力的方式重新格式化移位数据。
假设我有以下数据集:
library(data.table)
sample_DT <- data.table(
store = c("A", "A", "A", "A", "B", "B"),
date = ymd(c("2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24")),
start_hr = c(23,0,2,7,4,2),
duration_hr = c(8,4,4,12,6,10)
)
看起来像:
store date start_hr duration_hr
<char> <Date> <num> <num>
1: A 2019-03-24 23 8
2: A 2019-03-24 0 4
3: A 2019-03-24 2 4
4: A 2019-03-24 7 12
5: B 2019-03-24 4 6
6: B 2019-03-24 2 10
我想看看每个日期的每个小时间隔(0- 1,1 - 2,2 - 3,3 -4,…)使用了多少劳动力。所以数据应该看起来像这样:
store date time_hr usage
A 2019-03-24 0 1
A 2019-03-24 1 1
A 2019-03-24 2 2
A 2019-03-24 3 2
A 2019-03-24 4 1
A 2019-03-24 5 1
B ...
B ...
其中,time_hr
表示时间间隔(例如,time_hr = 0
表示从午夜到凌晨1点的时间间隔)。请注意,有时轮班可以连续数天(例如,从23:00开始,持续8小时)。
谢谢!
这可以通过几种方式完成,这里有一种使用枢轴数据的方法,以便每个移位开始变成+1
,每个移位结束变成-1
。
library(dplyr); library(tidyr); library(lubridate)
sample_df %>%
transmute(store,
start_hr = ymd_h(paste(date, start_hr)),
end_hr = start_hr + dhours(duration_hr)) %>%
pivot_longer(-store,
names_to = "change",
values_to = "timestamp") %>%
mutate(change = if_else(change == "start_hr", 1, -1)) %>%
count(store, timestamp, wt = change, name = "change") %>% # total change/hr
arrange(store, timestamp) %>%
complete(store,
timestamp = seq.POSIXt(min(timestamp), max(timestamp),
by = "hour"),
fill = list(change = 0)) %>%
mutate(usage = cumsum(change))
结果
# A tibble: 64 × 4
store timestamp change usage
<chr> <dttm> <dbl> <dbl>
1 A 2019-03-24 00:00:00 1 1
2 A 2019-03-24 01:00:00 0 1
3 A 2019-03-24 02:00:00 1 2
4 A 2019-03-24 03:00:00 0 2
5 A 2019-03-24 04:00:00 -1 1
6 A 2019-03-24 05:00:00 0 1
7 A 2019-03-24 06:00:00 -1 0
8 A 2019-03-24 07:00:00 1 1
9 A 2019-03-24 08:00:00 0 1
10 A 2019-03-24 09:00:00 0 1
# … with 54 more rows
这是一个使用数据的方法。表,并拆分出一个小函数f
,该函数生成hr间隔序列。然后将不同的间隔按store求和。
f <- function(d,h, dur) seq(ymd_h(paste(d,h)), by="hour",length.out=dur)
sample_DT[,id:=.I] %>%
.[, .(hr_interval = f(date,start_hr, duration_hr)), by=.(store, id)] %>%
.[, .(usage = .N), .(store, hr_interval)] %>%
.[order(store,hr_interval)]
输出:
store hr_interval usage
1: A 2019-03-24 00:00:00 1
2: A 2019-03-24 01:00:00 1
3: A 2019-03-24 02:00:00 2
4: A 2019-03-24 03:00:00 2
5: A 2019-03-24 04:00:00 1
6: A 2019-03-24 05:00:00 1
7: A 2019-03-24 07:00:00 1
8: A 2019-03-24 08:00:00 1
9: A 2019-03-24 09:00:00 1
10: A 2019-03-24 10:00:00 1
11: A 2019-03-24 11:00:00 1
12: A 2019-03-24 12:00:00 1
13: A 2019-03-24 13:00:00 1
14: A 2019-03-24 14:00:00 1
15: A 2019-03-24 15:00:00 1
16: A 2019-03-24 16:00:00 1
17: A 2019-03-24 17:00:00 1
18: A 2019-03-24 18:00:00 1
19: A 2019-03-24 23:00:00 1
20: A 2019-03-25 00:00:00 1
21: A 2019-03-25 01:00:00 1
22: A 2019-03-25 02:00:00 1
23: A 2019-03-25 03:00:00 1
24: A 2019-03-25 04:00:00 1
25: A 2019-03-25 05:00:00 1
26: A 2019-03-25 06:00:00 1
27: B 2019-03-24 02:00:00 1
28: B 2019-03-24 03:00:00 1
29: B 2019-03-24 04:00:00 2
30: B 2019-03-24 05:00:00 2
31: B 2019-03-24 06:00:00 2
32: B 2019-03-24 07:00:00 2
33: B 2019-03-24 08:00:00 2
34: B 2019-03-24 09:00:00 2
35: B 2019-03-24 10:00:00 1
36: B 2019-03-24 11:00:00 1
store hr_interval usage
为了完整起见,我想提到IRanges
包中的coverage()
函数,它完成了所有繁重的工作。它的工作间隔允许紧凑的RLE(运行长度编码)表示结果,并将其转换为op所要求的网格显示。
library(data.table)
library(lubridate)
library(magrittr)
if (!"IRanges" %in% rownames(installed.packages())) {
install.packages("IRanges",
repos = "https://bioconductor.org/packages/3.15/bioc")
}
library(IRanges)
origin <- min(d$date)
usage <- d[, IRanges(start = as.integer(date - origin) * 24L + start_hr,
width = duration_hr,
names = store)] %>%
split(names(.)) %>%
coverage()
usage
RleList of length 2 $A integer-Rle of length 30 with 7 runs Lengths: 1 2 2 1 12 4 8 Values : 1 2 1 0 1 0 1 $B integer-Rle of length 11 with 4 runs Lengths: 1 2 6 2 Values : 0 1 2 1
对于每个商店,这是使用量(或覆盖率)的运行长度编码表示。例如,对于商店A
,第一个小时的使用量为1,接下来的两个小时的使用量为2,以此类推。
可以强制转换为更易于理解的使用间隔表示:
lapply(usage, function(x) data.table(
start_dt = origin + hours(start(x) - 1L),
duration_hr = width(x),
usage = runValue(x))) %>%
rbindlist(idcol = "shop")
shop start_dt duration_hr usage 1: A 2019-03-24 00:00:00 1 1 2: A 2019-03-24 01:00:00 2 2 3: A 2019-03-24 03:00:00 2 1 4: A 2019-03-24 05:00:00 1 0 5: A 2019-03-24 06:00:00 12 1 6: A 2019-03-24 18:00:00 4 0 7: A 2019-03-24 22:00:00 8 1 8: B 2019-03-24 00:00:00 1 0 9: B 2019-03-24 01:00:00 2 1 10: B 2019-03-24 03:00:00 6 2 11: B 2019-03-24 09:00:00 2 1
或者,根据OP的要求,使用情况可以显示为每小时间隔的序列:
lapply(usage, function(x) data.table(
usage = decode(x))[, timestamp := origin + hours(seq_along(usage) - 1L)]) %>%
rbindlist(idcol = "shop")
数据shop usage timestamp 1: A 1 2019-03-24 00:00:00 2: A 2 2019-03-24 01:00:00 3: A 2 2019-03-24 02:00:00 4: A 1 2019-03-24 03:00:00 5: A 1 2019-03-24 04:00:00 6: A 0 2019-03-24 05:00:00 7: A 1 2019-03-24 06:00:00 8: A 1 2019-03-24 07:00:00 9: A 1 2019-03-24 08:00:00 10: A 1 2019-03-24 09:00:00 11: A 1 2019-03-24 10:00:00 12: A 1 2019-03-24 11:00:00 13: A 1 2019-03-24 12:00:00 14: A 1 2019-03-24 13:00:00 15: A 1 2019-03-24 14:00:00 16: A 1 2019-03-24 15:00:00 17: A 1 2019-03-24 16:00:00 18: A 1 2019-03-24 17:00:00 19: A 0 2019-03-24 18:00:00 20: A 0 2019-03-24 19:00:00 21: A 0 2019-03-24 20:00:00 22: A 0 2019-03-24 21:00:00 23: A 1 2019-03-24 22:00:00 24: A 1 2019-03-24 23:00:00 25: A 1 2019-03-25 00:00:00 26: A 1 2019-03-25 01:00:00 27: A 1 2019-03-25 02:00:00 28: A 1 2019-03-25 03:00:00 29: A 1 2019-03-25 04:00:00 30: A 1 2019-03-25 05:00:00 31: B 0 2019-03-24 00:00:00 32: B 1 2019-03-24 01:00:00 33: B 1 2019-03-24 02:00:00 34: B 2 2019-03-24 03:00:00 35: B 2 2019-03-24 04:00:00 36: B 2 2019-03-24 05:00:00 37: B 2 2019-03-24 06:00:00 38: B 2 2019-03-24 07:00:00 39: B 2 2019-03-24 08:00:00 40: B 1 2019-03-24 09:00:00 41: B 1 2019-03-24 10:00:00 shop usage timestamp
library(data.table)
library(lubridate)
d <- data.table(
store = c("A", "A", "A", "A", "B", "B"),
date = ymd(rep("2019-03-24", 6L)),
start_hr = c(23, 0, 2, 7, 4, 2),
duration_hr = c(8, 4, 4, 12, 6, 10)
)