I have a data frame like the following (edit: added the grouping variable measurement_type):
data <- data.frame(
  ID = as.factor(c(rep(1, 10),
                   rep(2, 10))),
  measurement_type = as.factor(c(rep("type_1", 5),
                                 rep("type_2", 5),
                                 rep("type_1", 5),
                                 rep("type_2", 5))),
  measurement_time = as.POSIXct(c("2014-06-17 04:00:00",
                                  "2014-06-17 11:52:00",
                                  "2014-06-17 18:58:00",
                                  "2014-06-18 02:05:00",
                                  "2014-06-18 08:00:00",
                                  "2014-06-17 05:27:00",
                                  "2014-06-17 11:10:00",
                                  "2014-06-17 17:02:00",
                                  "2014-06-17 23:56:00",
                                  "2014-06-18 07:01:00",
                                  "2014-07-03 16:01:00",
                                  "2014-07-03 19:19:00",
                                  "2014-07-03 23:55:00",
                                  "2014-07-04 08:08:00",
                                  "2014-07-04 13:55:00",
                                  "2014-07-03 22:12:00",
                                  "2014-07-04 08:59:00",
                                  "2014-07-04 14:10:00",
                                  "2014-07-04 17:00:00",
                                  "2014-07-04 23:00:00")),
  amount = c(350, 470, 310, 470, 650,
             175, 275, 45, 255, 395,
             130, 460, 540, 790, 69,
             80, 210, 58, 147, 326),
  entry_time = as.POSIXct(c(rep("2014-06-17 01:53:00", 10),
                            rep("2014-07-03 14:35:00", 10))))
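The subjects with ID 1 and ID 2 enter at the given entry_time, and cumulative amounts are then recorded at specific measurement_times. However, every day at noon the amount is reset to zero and counting starts again from zero. What I want to achieve is that, once the noon reset has happened, the newly started cumulative amount keeps being added to the amount that had already accumulated before noon (separately for each level of the grouping variable measurement_type).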
Update

Thanks to @Istrel, I get almost the correct output using the provided answer:
data %>%
  as_tibble() %>%
  # Check 12 hours passed --> `pm` column
  mutate(pm = format(measurement_time, "%H") >= 12) %>%
  mutate(date_fct = format(measurement_time, "%Y_%d")) %>%
  # Group by ID and `pm`
  group_by(ID, measurement_type, date_fct, pm) %>%
  # Turn cumsum into actual values
  mutate(amount_act = amount - lag(amount, default = 0)) %>%
  # Cumsum over ID
  ungroup() %>%
  group_by(ID, measurement_type) %>%
  mutate(amount_cums = cumsum(amount_act)) %>%
  ungroup() %>%
  select(-c(pm, date_fct, amount_act))
Output:
# A tibble: 20 x 6
   ID    measurement_type measurement_time    amount entry_time          amount_cums
   <fct> <fct>            <dttm>               <dbl> <dttm>                    <dbl>
 1 1     type_1           2014-06-17 04:00:00    350 2014-06-17 01:53:00         350
 2 1     type_1           2014-06-17 11:52:00    470 2014-06-17 01:53:00         470
 3 1     type_1           2014-06-17 18:58:00    310 2014-06-17 01:53:00         780
 4 1     type_1           2014-06-18 02:05:00    470 2014-06-17 01:53:00        1250
 5 1     type_1           2014-06-18 08:00:00    650 2014-06-17 01:53:00        1430
 6 1     type_2           2014-06-17 05:27:00    175 2014-06-17 01:53:00         175
 7 1     type_2           2014-06-17 11:10:00    275 2014-06-17 01:53:00         275
 8 1     type_2           2014-06-17 17:02:00     45 2014-06-17 01:53:00         320
 9 1     type_2           2014-06-17 23:56:00    255 2014-06-17 01:53:00         530
10 1     type_2           2014-06-18 07:01:00    395 2014-06-17 01:53:00         925
11 2     type_1           2014-07-03 16:01:00    130 2014-07-03 14:35:00         130
12 2     type_1           2014-07-03 19:19:00    460 2014-07-03 14:35:00         460
13 2     type_1           2014-07-03 23:55:00    540 2014-07-03 14:35:00         540
14 2     type_1           2014-07-04 08:08:00    790 2014-07-03 14:35:00        1330
15 2     type_1           2014-07-04 13:55:00     69 2014-07-03 14:35:00        1399
16 2     type_2           2014-07-03 22:12:00     80 2014-07-03 14:35:00          80
17 2     type_2           2014-07-04 08:59:00    210 2014-07-03 14:35:00         290
18 2     type_2           2014-07-04 14:10:00     58 2014-07-03 14:35:00         348
19 2     type_2           2014-07-04 17:00:00    147 2014-07-03 14:35:00         437
20 2     type_2           2014-07-04 23:00:00    326 2014-07-03 14:35:00         616
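As you can see, the cumulative sum is updated correctly across the noon reset. For overnight measurements, however, the code adds the full amount measured after midnight to the total from before midnight. There is no count reset at midnight, so the amount should simply keep accumulating on top of the pre-midnight value. In the output above this goes wrong: in row 10, for example, it adds the value 395 to the amount_cums of 530 (row 9), whereas it should only add the difference to the last value (395 - 255 = 140), so the correct amount_cums for row 10 is 670.

Do you know how I could modify your code?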
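I can suggest this strategy. First, group the data by ID, date (year_month_day), and an AM/PM flag. Then convert the cumulative sums back to the raw per-measurement amounts within each group. Finally, recompute the cumulative sum within each ID and measurement_type.

A solution might look like this: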
library(tidyverse)

dat_alt <- data %>%
  as_tibble() %>%
  # Flag measurements taken at or after 12:00 --> `pm` column
  mutate(pm = as.integer(format(measurement_time, "%H")) >= 12) %>%
  # Calendar date as a grouping key (year_month_day, including the month)
  mutate(date_fct = format(measurement_time, "%Y_%m_%d")) %>%
  # Group by ID, measurement type, date and `pm`
  group_by(ID, measurement_type, date_fct, pm) %>%
  # Turn the cumulative sums back into per-measurement amounts
  mutate(amount_act = amount - lag(amount, default = 0)) %>%
  # Cumulative sum over ID and measurement type
  ungroup() %>%
  group_by(ID, measurement_type) %>%
  mutate(amount_cums = cumsum(amount_act)) %>%
  ungroup() %>%
  select(-c(pm, date_fct, amount_act))
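
Regarding the midnight issue raised in the update: grouping by calendar date and `pm` starts a new group at midnight as well as at noon, which is why the post-midnight amounts are added in full. One possible way around this, offered only as a sketch, is to group by a noon-to-noon period instead. This assumes the counter resets only at noon and that a measurement taken at exactly 12:00 already belongs to the new period. The names `toy`, `noon_period` and `result` are made up for illustration, and the data is a subset of the question's data (ID 1 / type_2 and ID 2 / type_1) with an explicit UTC time zone.

```r
library(dplyr)

# Subset of the question's data: ID 1 / type_2 and ID 2 / type_1
toy <- data.frame(
  ID = factor(c(rep(1, 5), rep(2, 5))),
  measurement_type = factor(c(rep("type_2", 5), rep("type_1", 5))),
  measurement_time = as.POSIXct(c("2014-06-17 05:27:00",
                                  "2014-06-17 11:10:00",
                                  "2014-06-17 17:02:00",
                                  "2014-06-17 23:56:00",
                                  "2014-06-18 07:01:00",
                                  "2014-07-03 16:01:00",
                                  "2014-07-03 19:19:00",
                                  "2014-07-03 23:55:00",
                                  "2014-07-04 08:08:00",
                                  "2014-07-04 13:55:00"), tz = "UTC"),
  amount = c(175, 275, 45, 255, 395,
             130, 460, 540, 790, 69)
)

result <- toy %>%
  as_tibble() %>%
  # lag() relies on chronological order within each group
  arrange(ID, measurement_type, measurement_time) %>%
  # Shift every timestamp back by 12 hours: all measurements between one
  # noon and the next then fall on the same (shifted) calendar date
  mutate(noon_period = format(measurement_time - 12 * 60 * 60, "%Y-%m-%d")) %>%
  group_by(ID, measurement_type, noon_period) %>%
  # Within a noon-to-noon period, turn the running total into increments
  mutate(amount_act = amount - lag(amount, default = 0)) %>%
  ungroup() %>%
  # Re-accumulate the increments per ID and measurement type
  group_by(ID, measurement_type) %>%
  mutate(amount_cums = cumsum(amount_act)) %>%
  ungroup() %>%
  select(-c(noon_period, amount_act))

print(result)
```

With this subset, the last ID 1 / type_2 measurement (395 at 07:01) contributes only 395 - 255 = 140, so its amount_cums is 670, which is the value asked for in the question. Note that the fixed 12-hour shift ignores daylight-saving changes, so it may need adjusting for time zones where those matter.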