I have a large data frame with many columns, but for this question I am only interested in 3 of them.
df <- structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ireland", class = "factor"),
parameter = structure(c(2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L,2L, 1L, 3L), .Label = c("rainfall", "temp", "wind"), class = "factor"),
value = c(10L, 15L, 20L, 9L, 18L, 10L, 12L, 25L, 15L, 10L, 10L, 20L), unit = structure(c(3L, 2L, 1L, 3L, 2L, 1L, 3L,
2L, 1L, 3L, 2L, 1L), .Label = c("km/hr", "mm", "oC"), class = "factor")), class = "data.frame", row.names = c(NA, -12L))
country parameter value unit
ireland temp 10 oC
ireland rainfall 15 mm
ireland wind 20 km/hr
ireland temp 9 oC
ireland rainfall 18 mm
ireland wind 10 km/hr
ireland temp 12 oC
ireland rainfall 25 mm
ireland wind 15 km/hr
ireland temp 10 oC
ireland rainfall 10 mm
ireland wind 20 km/hr
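I want to group by country and parameter and extract the rows where the value column increases consecutively 3 or more times.
Example of the desired output: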
country parameter value unit
ireland rainfall 15 mm
ireland rainfall 18 mm
ireland rainfall 25 mm
ireland wind 10 km/hr
ireland wind 15 km/hr
ireland wind 20 km/hr
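You can group by country and parameter, then add a third grouping variable that increments whenever a value is not greater than the previous (lagged) value, and finally keep only the groups with 3 or more rows: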
library(dplyr)
df %>%
group_by(country, parameter) %>%
group_by(x = cumsum(value <= lag(value, default = FALSE)), .add = TRUE) %>%
filter(n() >= 3) %>%
ungroup() %>%
arrange(country, parameter) %>%
select(-x)
# A tibble: 6 x 4
country parameter value unit
<fct> <fct> <int> <fct>
1 ireland rainfall 15 mm
2 ireland rainfall 18 mm
3 ireland rainfall 25 mm
4 ireland wind 10 km/hr
5 ireland wind 15 km/hr
6 ireland wind 20 km/hr
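To see what the extra grouping variable does, here is a minimal base R sketch on the four rainfall values alone. The names prev, starts_new_run, run_id and run_size are illustrative and do not appear in the answer above; the sketch mirrors the idea of the cumsum() step rather than reproducing the dplyr pipeline exactly.

```r
# Rainfall values from the example data, in row order
value <- c(15L, 18L, 25L, 10L)

# Previous value; the first element has no predecessor
prev <- c(NA, head(value, -1))

# A new run starts whenever the value fails to increase
starts_new_run <- !is.na(prev) & value <= prev

# The running count of run starts works as a run identifier
run_id <- cumsum(starts_new_run)

# Size of the run that each element belongs to
run_size <- ave(value, run_id, FUN = length)

# Keep only the elements in runs of 3 or more
increasing <- value[run_size >= 3]
print(increasing)
```

Here 15, 18 and 25 share one run identifier, while the final 10 starts a new run of size 1 and is dropped.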
Here is a base R option using subset + ave + rle:
subset(
df[with(df,order(country, parameter)), ],
!!ave(value, country, parameter, FUN = function(x) with(rle(cumsum(c(1, diff(x) <= 0))), rep(lengths >= 3, lengths)))
)
which gives
country parameter value unit
2 ireland rainfall 15 mm
5 ireland rainfall 18 mm
8 ireland rainfall 25 mm
6 ireland wind 10 km/hr
9 ireland wind 15 km/hr
12 ireland wind 20 km/hr
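The function passed to ave() can be hard to read in one line, so here is a small sketch that unpacks it on the wind values only. The intermediate names run_id, runs and keep are introduced for illustration and are not part of the answer above.

```r
# Wind values from the example data, in row order
x <- c(20L, 10L, 15L, 20L)

# diff(x) <= 0 marks positions where the value did not increase;
# the cumulative sum turns those marks into a run identifier
run_id <- cumsum(c(1, diff(x) <= 0))

# rle() reports how long each run of identical identifiers is
runs <- rle(run_id)

# Expand the per-run test back to one logical value per element
keep <- rep(runs$lengths >= 3, runs$lengths)

wind_increasing <- x[keep]
print(wind_increasing)
```

The leading 20 forms a run of length 1, and the following 10, 15, 20 form a run of length 3, so only the last three values are kept.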
df %>%
group_by(country, parameter) %>%
mutate(
flag = c(0, diff(value)) > 0,
flag_lag = lead(flag),
seq_end = flag == TRUE & flag_lag %in% c(NA, FALSE),
seq_begin = flag == FALSE & flag_lag == TRUE,
) %>%
slice(if(length(which(seq_begin == TRUE):which(seq_end == TRUE)) >= 3) which(seq_begin == TRUE):which(seq_end == TRUE) else NA) %>%
select(-contains("flag"), -contains("seq"))
Here is a solution using dplyr:
df %>%
arrange(value) %>% # Sort by value; ascending is the default for arrange()
filter(lag(value) > 3) # Keep rows whose preceding value is greater than 3
df
country parameter value unit
1 ireland temp 10 oC
2 ireland wind 10 km/hr
3 ireland temp 10 oC
4 ireland rainfall 10 mm
5 ireland temp 12 oC
6 ireland rainfall 15 mm
7 ireland wind 15 km/hr
8 ireland rainfall 18 mm
9 ireland wind 20 km/hr
10 ireland wind 20 km/hr
11 ireland rainfall 25 mm