r语言 - 时间数据:从宽到长



我不确定我是否正确使用术语宽/长,但我正试图以一种我可以看到每小时增量使用多少劳动力的方式重新格式化移位数据。

假设我有以下数据集:

library(data.table)
sample_DT <- data.table(
store = c("A", "A", "A", "A", "B", "B"),
date = ymd(c("2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24", "2019-03-24")),
start_hr = c(23,0,2,7,4,2),
duration_hr = c(8,4,4,12,6,10)
)

看起来像:

store     date    start_hr duration_hr
<char>     <Date>    <num>       <num>
1:      A 2019-03-24       23           8
2:      A 2019-03-24        0           4
3:      A 2019-03-24        2           4
4:      A 2019-03-24        7          12
5:      B 2019-03-24        4           6
6:      B 2019-03-24        2          10

我想看看每个日期的每个小时间隔(0- 1,1 - 2,2 - 3,3 -4,…)使用了多少劳动力。所以数据应该看起来像这样:


store    date    time_hr usage
A   2019-03-24    0       1
A   2019-03-24    1       1
A   2019-03-24    2       2
A   2019-03-24    3       2
A   2019-03-24    4       1
A   2019-03-24    5       1 
B   ...
B   ...

其中,time_hr表示时间间隔(例如,time_hr = 0表示从午夜到凌晨1点的时间间隔)。请注意,有时轮班可以连续数天(例如,从23:00开始,持续8小时)。

谢谢!

这可以通过几种方式完成,这里有一种使用枢轴数据的方法,以便每个移位开始变成+1,每个移位结束变成-1

library(dplyr); library(tidyr); library(lubridate)
sample_df %>%
transmute(store, 
start_hr = ymd_h(paste(date, start_hr)),
end_hr = start_hr + dhours(duration_hr)) %>%
pivot_longer(-store, 
names_to = "change",
values_to = "timestamp") %>%
mutate(change = if_else(change == "start_hr", 1, -1)) %>%
count(store, timestamp, wt = change, name = "change") %>%  # total change/hr
arrange(store, timestamp) %>%
complete(store, 
timestamp = seq.POSIXt(min(timestamp), max(timestamp), 
by = "hour"), 
fill = list(change = 0)) %>%
mutate(usage = cumsum(change))

结果

# A tibble: 64 × 4
store timestamp           change usage
<chr> <dttm>               <dbl> <dbl>
1 A     2019-03-24 00:00:00      1     1
2 A     2019-03-24 01:00:00      0     1
3 A     2019-03-24 02:00:00      1     2
4 A     2019-03-24 03:00:00      0     2
5 A     2019-03-24 04:00:00     -1     1
6 A     2019-03-24 05:00:00      0     1
7 A     2019-03-24 06:00:00     -1     0
8 A     2019-03-24 07:00:00      1     1
9 A     2019-03-24 08:00:00      0     1
10 A     2019-03-24 09:00:00      0     1
# … with 54 more rows

这是一个使用数据的方法。表,并拆分出一个小函数f,该函数生成hr间隔序列。然后将不同的间隔按store求和。

f <- function(d,h, dur) seq(ymd_h(paste(d,h)), by="hour",length.out=dur)
sample_DT[,id:=.I] %>% 
.[, .(hr_interval = f(date,start_hr, duration_hr)), by=.(store, id)] %>% 
.[, .(usage = .N), .(store, hr_interval)] %>% 
.[order(store,hr_interval)]

输出:

store         hr_interval usage
1:     A 2019-03-24 00:00:00     1
2:     A 2019-03-24 01:00:00     1
3:     A 2019-03-24 02:00:00     2
4:     A 2019-03-24 03:00:00     2
5:     A 2019-03-24 04:00:00     1
6:     A 2019-03-24 05:00:00     1
7:     A 2019-03-24 07:00:00     1
8:     A 2019-03-24 08:00:00     1
9:     A 2019-03-24 09:00:00     1
10:     A 2019-03-24 10:00:00     1
11:     A 2019-03-24 11:00:00     1
12:     A 2019-03-24 12:00:00     1
13:     A 2019-03-24 13:00:00     1
14:     A 2019-03-24 14:00:00     1
15:     A 2019-03-24 15:00:00     1
16:     A 2019-03-24 16:00:00     1
17:     A 2019-03-24 17:00:00     1
18:     A 2019-03-24 18:00:00     1
19:     A 2019-03-24 23:00:00     1
20:     A 2019-03-25 00:00:00     1
21:     A 2019-03-25 01:00:00     1
22:     A 2019-03-25 02:00:00     1
23:     A 2019-03-25 03:00:00     1
24:     A 2019-03-25 04:00:00     1
25:     A 2019-03-25 05:00:00     1
26:     A 2019-03-25 06:00:00     1
27:     B 2019-03-24 02:00:00     1
28:     B 2019-03-24 03:00:00     1
29:     B 2019-03-24 04:00:00     2
30:     B 2019-03-24 05:00:00     2
31:     B 2019-03-24 06:00:00     2
32:     B 2019-03-24 07:00:00     2
33:     B 2019-03-24 08:00:00     2
34:     B 2019-03-24 09:00:00     2
35:     B 2019-03-24 10:00:00     1
36:     B 2019-03-24 11:00:00     1
store         hr_interval usage

为了完整起见,我想提到IRanges包中的coverage()函数,它完成了所有繁重的工作。它的工作间隔允许紧凑的RLE(运行长度编码)表示结果,并将其转换为op所要求的网格显示。

唯一的缺点是它需要整数间隔。因此,我们必须将日期-时间间隔强制转换为来回的整数间隔。
library(data.table)
library(lubridate)
library(magrittr)
if (!"IRanges" %in% rownames(installed.packages())) {
install.packages("IRanges", 
repos = "https://bioconductor.org/packages/3.15/bioc")
}
library(IRanges)
origin <- min(d$date)
usage <- d[, IRanges(start = as.integer(date - origin) * 24L + start_hr, 
width = duration_hr, 
names = store)] %>% 
split(names(.)) %>% 
coverage() 
usage
RleList of length 2
$A
integer-Rle of length 30 with 7 runs
Lengths:  1  2  2  1 12  4  8
Values :  1  2  1  0  1  0  1
$B
integer-Rle of length 11 with 4 runs
Lengths: 1 2 6 2
Values : 0 1 2 1

对于每个商店,这是使用量(或覆盖率)的运行长度编码表示。例如,对于商店A,第一个小时的使用量为1,接下来的两个小时的使用量为2,以此类推。

可以强制转换为更易于理解的使用间隔表示:

lapply(usage, function(x) data.table(
start_dt = origin + hours(start(x) - 1L),
duration_hr = width(x),
usage = runValue(x))) %>% 
rbindlist(idcol = "shop")
shop            start_dt duration_hr usage
1:      A 2019-03-24 00:00:00           1     1
2:      A 2019-03-24 01:00:00           2     2
3:      A 2019-03-24 03:00:00           2     1
4:      A 2019-03-24 05:00:00           1     0
5:      A 2019-03-24 06:00:00          12     1
6:      A 2019-03-24 18:00:00           4     0
7:      A 2019-03-24 22:00:00           8     1
8:      B 2019-03-24 00:00:00           1     0
9:      B 2019-03-24 01:00:00           2     1
10:      B 2019-03-24 03:00:00           6     2
11:      B 2019-03-24 09:00:00           2     1

或者,根据OP的要求,使用情况可以显示为每小时间隔的序列:

lapply(usage, function(x) data.table(
usage = decode(x))[, timestamp := origin + hours(seq_along(usage) - 1L)]) %>% 
rbindlist(idcol = "shop")
shop usage           timestamp
1:      A     1 2019-03-24 00:00:00
2:      A     2 2019-03-24 01:00:00
3:      A     2 2019-03-24 02:00:00
4:      A     1 2019-03-24 03:00:00
5:      A     1 2019-03-24 04:00:00
6:      A     0 2019-03-24 05:00:00
7:      A     1 2019-03-24 06:00:00
8:      A     1 2019-03-24 07:00:00
9:      A     1 2019-03-24 08:00:00
10:      A     1 2019-03-24 09:00:00
11:      A     1 2019-03-24 10:00:00
12:      A     1 2019-03-24 11:00:00
13:      A     1 2019-03-24 12:00:00
14:      A     1 2019-03-24 13:00:00
15:      A     1 2019-03-24 14:00:00
16:      A     1 2019-03-24 15:00:00
17:      A     1 2019-03-24 16:00:00
18:      A     1 2019-03-24 17:00:00
19:      A     0 2019-03-24 18:00:00
20:      A     0 2019-03-24 19:00:00
21:      A     0 2019-03-24 20:00:00
22:      A     0 2019-03-24 21:00:00
23:      A     1 2019-03-24 22:00:00
24:      A     1 2019-03-24 23:00:00
25:      A     1 2019-03-25 00:00:00
26:      A     1 2019-03-25 01:00:00
27:      A     1 2019-03-25 02:00:00
28:      A     1 2019-03-25 03:00:00
29:      A     1 2019-03-25 04:00:00
30:      A     1 2019-03-25 05:00:00
31:      B     0 2019-03-24 00:00:00
32:      B     1 2019-03-24 01:00:00
33:      B     1 2019-03-24 02:00:00
34:      B     2 2019-03-24 03:00:00
35:      B     2 2019-03-24 04:00:00
36:      B     2 2019-03-24 05:00:00
37:      B     2 2019-03-24 06:00:00
38:      B     2 2019-03-24 07:00:00
39:      B     2 2019-03-24 08:00:00
40:      B     1 2019-03-24 09:00:00
41:      B     1 2019-03-24 10:00:00
shop usage           timestamp

数据
library(data.table)
library(lubridate)
d <- data.table(
store = c("A", "A", "A", "A", "B", "B"),
date = ymd(rep("2019-03-24", 6L)),
start_hr = c(23, 0, 2, 7, 4, 2),
duration_hr = c(8, 4, 4, 12, 6, 10)
)

最新更新