R: How to filter a data frame for consecutively increasing values



I have a large data frame with many columns, but for this question I am interested in 3 of them.

df <- structure(list(country = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "ireland", class = "factor"), 
parameter = structure(c(2L, 1L, 3L, 2L, 1L, 3L, 2L, 1L, 3L,2L, 1L, 3L), .Label = c("rainfall", "temp", "wind"), class = "factor"), 
value = c(10L, 15L, 20L, 9L, 18L, 10L, 12L, 25L, 15L, 10L, 10L, 20L), unit = structure(c(3L, 2L, 1L, 3L, 2L, 1L, 3L, 
2L, 1L, 3L, 2L, 1L), .Label = c("km/hr", "mm", "oC"), class = "factor")), class = "data.frame", row.names = c(NA, -12L))
country parameter   value   unit
ireland temp          10    oC
ireland rainfall      15    mm
ireland wind          20    km/hr
ireland temp           9    oC
ireland rainfall      18    mm
ireland wind          10    km/hr
ireland temp          12    oC
ireland rainfall      25    mm
ireland wind          15    km/hr
ireland temp          10    oC
ireland rainfall      10    mm
ireland wind          20    km/hr

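I want to group by country and parameter and extract the rows where the value column increases consecutively for a run of 3 or more rows.

Example of the desired output: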

country parameter   value   unit
ireland rainfall      15    mm
ireland rainfall      18    mm
ireland rainfall      25    mm
ireland wind          10    km/hr
ireland wind          15    km/hr
ireland wind          20    km/hr

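You can group by country and parameter, then create a third grouping variable that starts a new group whenever a value is not greater than the previous (lagged) value, and finally keep only the groups of size 3 or more: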

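library(dplyr)

df %>%
  group_by(country, parameter) %>%
  # x increments each time the value fails to increase, so each increasing run gets its own id
  group_by(x = cumsum(value <= lag(value, default = FALSE)), .add = TRUE) %>%
  filter(n() >= 3) %>%  # keep runs of 3 or more rows
  ungroup() %>%
  arrange(country, parameter) %>%
  select(-x)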
# A tibble: 6 x 4
country parameter value unit 
<fct>   <fct>     <int> <fct>
1 ireland rainfall     15 mm   
2 ireland rainfall     18 mm   
3 ireland rainfall     25 mm   
4 ireland wind         10 km/hr
5 ireland wind         15 km/hr
6 ireland wind         20 km/hr
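
To see why the cumsum() step works, it can help to run it on a single group. The following is a minimal sketch in base R using the wind values from the question; the names prev, new_run, run_id, run_len and kept are illustrative and do not appear in the answer above, and prev uses 0 as a stand-in for the missing first lag, mirroring lag(value, default = FALSE).

```r
# Values for one group (the wind rows from the question), in original order
value <- c(20L, 10L, 15L, 20L)

# Previous value within the group; 0 stands in for "no previous value"
prev <- c(0L, head(value, -1))

# TRUE wherever the value did not increase, i.e. a new run starts there
new_run <- value <= prev

# A running count of run starts gives each increasing run its own id
run_id <- cumsum(new_run)

# Length of the run each element belongs to
run_len <- ave(value, run_id, FUN = length)

# Keep only the elements that sit in a run of 3 or more
kept <- value[run_len >= 3]

print(data.frame(value, prev, new_run, run_id, run_len))
print(kept)
```

Here the first value forms a run of length 1 and is dropped, while the remaining three values share one run id and are kept.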

Here is a base R option using subset + ave + rle:

subset(
  df[with(df, order(country, parameter)), ],
  !!ave(value, country, parameter, FUN = function(x) {
    with(rle(cumsum(c(1, diff(x) <= 0))), rep(lengths >= 3, lengths))
  })
)

which gives

country parameter value  unit
2  ireland  rainfall    15    mm
5  ireland  rainfall    18    mm
8  ireland  rainfall    25    mm
6  ireland      wind    10 km/hr
9  ireland      wind    15 km/hr
12 ireland      wind    20 km/hr
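
The function passed to ave() in the base R option is dense, so here is a sketch that unpacks it step by step for a single group, using the rainfall values from the question. The intermediate names run_id, runs and keep are illustrative only. In the full expression, ave() applies these steps per country and parameter, and because value is an integer column the logical result is stored as 0/1, which the leading !! turns back into TRUE/FALSE for subset().

```r
# Values for one group (the rainfall rows from the question)
x <- c(15L, 18L, 25L, 10L)

# diff(x) <= 0 marks positions where the value failed to increase;
# the leading 1 opens the first run, and cumsum() numbers the runs
run_id <- cumsum(c(1, diff(x) <= 0))

# rle() collapses the run ids into one length per run
runs <- rle(run_id)

# Expand "is this run long enough?" back to one flag per element
keep <- rep(runs$lengths >= 3, runs$lengths)

print(run_id)
print(keep)
print(x[keep])
```

The first three values share a run id and are kept; the final drop to 10 starts a new run of length 1 and is discarded.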
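
Another dplyr option flags where an increasing run begins and ends within each group, then slices out the rows between them if the run spans at least 3 rows. Note that it assumes exactly one increasing run per group, as in the sample data:

df %>%
  group_by(country, parameter) %>%
  mutate(
    flag = c(0, diff(value)) > 0,                         # value rose from the previous row
    flag_lag = lead(flag),                                # does the next row rise?
    seq_end = flag == TRUE & flag_lag %in% c(NA, FALSE),  # last row of a run
    seq_begin = flag == FALSE & flag_lag == TRUE          # first row of a run
  ) %>%
  slice(if (length(which(seq_begin == TRUE):which(seq_end == TRUE)) >= 3) which(seq_begin == TRUE):which(seq_end == TRUE) else NA) %>%
  select(-contains("flag"), -contains("seq"))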

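Here is another attempt using dplyr. Note that it sorts the whole data frame by value and keeps the rows whose preceding value is greater than 3; it does not group by country and parameter, so it does not reproduce the desired output above:

df <- df %>%
  arrange(value) %>%      # sort by value, ascending (the default)
  filter(lag(value) > 3)  # keep rows whose previous value exceeds 3
df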
country parameter value  unit
1  ireland      temp    10    oC
2  ireland      wind    10 km/hr
3  ireland      temp    10    oC
4  ireland  rainfall    10    mm
5  ireland      temp    12    oC
6  ireland  rainfall    15    mm
7  ireland      wind    15 km/hr
8  ireland  rainfall    18    mm
9  ireland      wind    20 km/hr
10 ireland      wind    20 km/hr
11 ireland  rainfall    25    mm
