我有一个从部署的传感器中获取的每日水化学值的数据框架。我正在尝试计算每日最大值的7天滚动平均值。这在现场环境数据中,数据可能有点混乱。
以下是计算平均值和分配质量水平的规则:
- 对数据进行分级,并给出当天的质量值(DQL((dyDQL(
- "A"是高质量,"B"是中等质量,"E"是差质量
- 在7天周期结束时计算7天平均值
- 数据集只需要6天就可以计算出7天的平均值(可能会错过1天的数据(
- 如果至少有6天的"A"one_answers"B"分级数据和1天的"E",则丢弃"E"数据,并使用6天的"A"one_answers"B"数据计算7天的平均值
我使用一个循环来处理代码,该循环遍历每个结果,创建一个包含7天窗口的新数据帧,然后计算移动平均值。请参阅下面的最小示例。
请注意,本例中缺少11日、16日、17日和18日的日期:
daily_data <- tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A",
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A",
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A",
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A",
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A",
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A",
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A"
)
for (l in seq_len(nrow(daily_data))){
station_7day <- filter(daily_data,
dplyr::between(date, daily_data[[l,'date']] - lubridate::days(6), daily_data[l,'date']))
daily_data[l,"ma.max7"] <- dplyr::case_when(nrow(subset(station_7day, dyDQL %in% c("A")))== 7 & l >=7 ~ mean(station_7day$dyMax),
nrow(subset(station_7day, dyDQL %in% c("A", 'B'))) >= 6 & l >=7~ mean(station_7day$dyMax),
max(station_7day$dyDQL == 'E') & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >=7 ~ mean(station_7day$dyMax[station_7day$dyDQL %in% c("A", "B")]),
nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >=7~ mean(station_7day$dyMax),
TRUE ~ NA_real_)
daily_data[l, "ma.max7_DQL"] <- dplyr::case_when(nrow(subset(station_7day, dyDQL %in% c("A")))== 7 & l >=7 ~ "A",
nrow(subset(station_7day, dyDQL %in% c("A", 'B'))) >= 6 & l >=7~ "B",
max(station_7day$dyDQL == 'E') & nrow(subset(station_7day, dyDQL %in% c("A", "B"))) >= 6 & l >=7 ~ "B",
nrow(subset(station_7day, dyDQL %in% c("A", "B", "E"))) >= 6 & l >=7~ "E",
TRUE ~ NA_character_)
}
预期结果为:
tibble::tribble(
~Monitoring.Location.ID, ~date, ~dyMax, ~dyMin, ~dyDQL, ~ma.max7, ~ma.max7_DQL,
"River 1", as.Date("2018-07-01"), 24.219, 22.537, "A", NA, NA,
"River 1", as.Date("2018-07-02"), 24.557, 20.388, "A", NA, NA,
"River 1", as.Date("2018-07-03"), 24.847, 20.126, "A", NA, NA,
"River 1", as.Date("2018-07-04"), 25.283, 20.674, "A", NA, NA,
"River 1", as.Date("2018-07-05"), 25.501, 20.865, "A", NA, NA,
"River 1", as.Date("2018-07-06"), 25.04, 21.008, "A", NA, NA,
"River 1", as.Date("2018-07-07"), 24.847, 20.674, "A", 24.8991428571429, "A",
"River 1", as.Date("2018-07-08"), 23.424, 20.793, "B", 24.7855714285714, "B",
"River 1", as.Date("2018-07-09"), 22.657, 18.866, "E", 24.5141428571429, "B",
"River 1", as.Date("2018-07-10"), 22.298, 18.2, "A", 24.15, "B",
"River 1", as.Date("2018-07-12"), 22.92, 19.008, "A", 23.531, "E",
"River 1", as.Date("2018-07-13"), 23.978, 19.532, "A", 23.354, "E",
"River 1", as.Date("2018-07-14"), 24.508, 19.936, "A", 23.2975, "E",
"River 1", as.Date("2018-07-15"), 25.137, 20.627, "A", 23.583, "E",
"River 1", as.Date("2018-07-19"), 24.919, 20.674, "A", NA, NA
)
该代码运行良好,但在计算多个位置具有多个不同水质参数的多年水平数据值时非常缓慢。
由于7天的值可以从6天的数据中计算出来,我认为我不能使用zoo包中的任何滚动函数。我认为我不能使用roll包中的roll_mean函数,因为当有6天的"A"或"B"数据时,丢弃1天的"E"数据是可变的。
有没有办法将其向量化,以避免在每一行数据中循环?
我使用了tidyverse
和runner
,并在一个管道语法中这样做。语法解释-
- 我首先使用
runner
将七天(根据提供的逻辑(的DQL和MAX值收集到一个列表中 - 在此之前,我已经将DQL转换为一个有序因子变量,它将在最后一个语法中使用
- 其次,我使用
purrr::map
根据给定的条件对每个列表进行修改,- 不少于六个
- 如果7个值中正好有一个
E
,则不必计算在内
- 最后,我使用
unnest_wider
取消了列表的测试
library(runner)
daily_data %>% mutate(dyDQL = factor(dyDQL, levels = c("A", "B", "E"), ordered = T),
d = runner(x = data.frame(a = dyMax, b= dyDQL),
k = "7 days",
lag = 0,
idx = date,
f = function(x) list(x))) %>%
mutate(d = map(d, ~ .x %>% group_by(b) %>%
mutate(c = n()) %>%
ungroup() %>%
filter(!n() < 6) %>%
filter(!(b == 'E' & c == 1 & n() == 7)) %>%
summarise(ma.max7 = ifelse(n() == 0, NA, mean(a)), ma.max7.DQL = max(b))
)
) %>%
unnest_wider(d)
# A tibble: 15 x 7
Monitoring.Location.ID date dyMax dyMin dyDQL ma.max7 ma.max7.DQL
<chr> <date> <dbl> <dbl> <ord> <dbl> <ord>
1 River 1 2018-07-01 24.2 22.5 A NA NA
2 River 1 2018-07-02 24.6 20.4 A NA NA
3 River 1 2018-07-03 24.8 20.1 A NA NA
4 River 1 2018-07-04 25.3 20.7 A NA NA
5 River 1 2018-07-05 25.5 20.9 A NA NA
6 River 1 2018-07-06 25.0 21.0 A 24.9 A
7 River 1 2018-07-07 24.8 20.7 A 24.9 A
8 River 1 2018-07-08 23.4 20.8 B 24.8 B
9 River 1 2018-07-09 22.7 18.9 E 24.8 B
10 River 1 2018-07-10 22.3 18.2 A 24.4 B
11 River 1 2018-07-12 22.9 19.0 A 23.5 E
12 River 1 2018-07-13 24.0 19.5 A 23.4 E
13 River 1 2018-07-14 24.5 19.9 A 23.3 E
14 River 1 2018-07-15 25.1 20.6 A 23.6 E
15 River 1 2018-07-19 24.9 20.7 A NA NA
以下是一种矢量化方法,使用slider:slide_index
来计算高质量和备份质量值,然后组合以获得最佳可用值:
library(tidyverse); library(slider)
以下函数按位置分组,计算每周平均值(包括从日期-6到日期(包括日期(的所有值(和包括的观测值数量,然后过滤为只有6+个观测值。
get_weekly_by_loc <- function(df) {
df %>%
group_by(Monitoring.Location.ID) %>%
mutate(mean = slide_index_dbl(dyMax, date, mean, .complete = TRUE, .before = lubridate::days(6)),
count = slide_index_dbl(dyMax, date, ~sum(!is.na(.)), .before = lubridate::days(6))) %>%
ungroup() %>%
filter(count >= 6)
}
然后我们可以仅在A/B数据上运行此功能,并且总体而言:
daily_data_high_quality <- daily_data %>%
filter(dyDQL %in% c("A", "B")) %>%
get_weekly_by_loc() %>%
select(Monitoring.Location.ID, date, high_qual_mean = mean)
daily_data_backup <- daily_data %>%
get_weekly_by_loc() %>%
select(Monitoring.Location.ID, date, backup_mean = mean)
然后加入这些并使用高质量(如果可用(:
daily_data %>%
left_join(daily_data_high_quality) %>%
left_join(daily_data_backup) %>%
mutate(max7_DQL = coalesce(high_qual_mean, backup_mean)) %>%
mutate(moar_digits = format(max7_DQL, nsmall = 6))
结果
# A tibble: 15 x 9
Monitoring.Location.ID date dyMax dyMin dyDQL high_qual_mean backup_mean max7_DQL moar_digits
<chr> <date> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 River 1 2018-07-01 24.2 22.5 A NA NA NA " NA"
2 River 1 2018-07-02 24.6 20.4 A NA NA NA " NA"
3 River 1 2018-07-03 24.8 20.1 A NA NA NA " NA"
4 River 1 2018-07-04 25.3 20.7 A NA NA NA " NA"
5 River 1 2018-07-05 25.5 20.9 A NA NA NA " NA"
6 River 1 2018-07-06 25.0 21.0 A NA NA NA " NA"
7 River 1 2018-07-07 24.8 20.7 A 24.9 24.9 24.9 "24.899143"
8 River 1 2018-07-08 23.4 20.8 B 24.8 24.8 24.8 "24.785571"
9 River 1 2018-07-09 22.7 18.9 E NA 24.5 24.5 "24.514143"
10 River 1 2018-07-10 22.3 18.2 A 24.4 24.2 24.4 "24.398833"
11 River 1 2018-07-12 22.9 19.0 A NA 23.5 23.5 "23.531000"
12 River 1 2018-07-13 24.0 19.5 A NA 23.4 23.4 "23.354000"
13 River 1 2018-07-14 24.5 19.9 A NA 23.3 23.3 "23.297500"
14 River 1 2018-07-15 25.1 20.6 A NA 23.6 23.6 "23.583000"
15 River 1 2018-07-19 24.9 20.7 A NA NA NA " NA"