r - Group by sequence, then find the minimum value in a column



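I have a dataset that contains, among other columns, date, sequence and low columns; see df below. A run from 1 to 9 in the sequence column is treated as one block, or one complete cycle. The dataset has several such complete blocks/cycles as well as partially completed ones, e.g. 1 to 4.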

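This is what I am trying to solve:

  1. Remove the partially completed cycles, then group the complete cycles (see df1).
  2. For each block/cycle (i.e. a sequence from 1 to 9), find the low of the block and the date on which that low occurred.
  3. If the same low value occurs on two different dates, output only the latest date (see the third block in the output).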

    library(lubridate)
    library(tidyverse)
    ### Sample data
    df <- data.frame(stringsAsFactors=FALSE,
    date = c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019",
    "5/01/2019", "6/01/2019", "7/01/2019", "8/01/2019",
    "9/01/2019", "10/01/2019", "11/01/2019", "12/01/2019", "13/01/2019",
    "14/01/2019", "15/01/2019", "16/01/2019", "17/01/2019", "18/01/2019",
    "19/01/2019", "20/01/2019", "21/01/2019", "22/01/2019",
    "23/01/2019", "24/01/2019", "25/01/2019", "26/01/2019", "27/01/2019",
    "28/01/2019", "29/01/2019", "30/01/2019", "31/01/2019",
    "1/02/2019", "2/02/2019", "3/02/2019", "4/02/2019"),
    sequence = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8,
    9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9),
    low = c(96, 81, 43, 18, 43, 65, 48, 90, 69, 50, 41, 73, 1, 1, 7, 49,
    16, 79, 2, 74, 8, 88, 56, 57, 66, 29, 79, 51, 52, 47, 42, 9,
    41, 9, 50)) %>% mutate(date = dmy(date))
    

Data grouped by cycle/block:

    df1 <- data.frame(stringsAsFactors=FALSE,
    date = c("1/01/2019", "2/01/2019", "3/01/2019", "4/01/2019",
    "5/01/2019", "6/01/2019", "7/01/2019", "8/01/2019",
    "9/01/2019", "14/01/2019", "15/01/2019", "16/01/2019", "17/01/2019",
    "18/01/2019", "19/01/2019", "20/01/2019", "21/01/2019", "22/01/2019",
    "27/01/2019", "28/01/2019", "29/01/2019", "30/01/2019",
    "31/01/2019", "1/02/2019", "2/02/2019", "3/02/2019", "4/02/2019"),
    sequence = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3,
    4, 5, 6, 7, 8, 9),
    low = c(96, 81, 43, 18, 43, 65, 48, 90, 69, 1, 7, 49, 16, 79, 2, 74,
    8, 88, 79, 51, 52, 47, 42, 9, 41, 9, 50),
    group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
    3, 3, 3, 3, 3, 3)) %>% mutate(date = dmy(date))
    

The final output I am after:

    df_final <- data.frame(stringsAsFactors=FALSE,
    date = c("4/01/2019", "14/01/2019", "3/02/2019"),
    low = c(18, 1, 9)) %>% mutate(date = dmy(date))

Any ideas?
P.S. I had some trouble formatting this question, hence the untidiness.

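We create a grouping variable by taking the cumulative sum of sequence == 1, then filter only the groups that have 9 elements. After that we arrange by date in descending order, so that ties for the lowest value resolve to the latest date, and slice the row with the minimum low in each group.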

    df %>%
      group_by(group = cumsum(sequence == 1)) %>%   # new group at every restart at 1
      filter(n() == 9) %>%                          # keep only complete 1-to-9 cycles
      select(date, low) %>%
      arrange(desc(date)) %>%                       # latest date first, so ties resolve to it
      slice(which.min(low)) %>%
      ungroup() %>%
      select(-group)
    # A tibble: 3 x 2
    #  date         low
    #  <date>     <dbl>
    #1 2019-01-04    18
    #2 2019-01-14     1
    #3 2019-02-03     9
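To see why the cumulative sum works as a grouping key, here is a minimal base R sketch on a shortened, hypothetical sequence vector (not the question's data). Every time the sequence restarts at 1 the running count goes up by one, so each run gets its own id, and counting the rows per id shows which runs are complete. The names run_length and complete_ids are illustrative only.

```r
# Hypothetical short vector: a complete 1-to-9 run, a partial run, another complete run
sequence <- c(1:9, 1:4, 1:9)

# Each restart at 1 bumps the running total, giving one id per run
group <- cumsum(sequence == 1)
group
# [1] 1 1 1 1 1 1 1 1 1 2 2 2 2 3 3 3 3 3 3 3 3 3

# Rows per id; only ids with exactly 9 rows are complete cycles
run_length <- table(group)
complete_ids <- as.integer(names(run_length)[run_length == 9])
complete_ids
# [1] 1 3
```

Counting rows is enough here because a run can only reach 9 rows by going from 1 to 9; data with gaps or repeated values inside a run would need a stricter check.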

Or a similar option with data.table:

    library(data.table)
    setDT(df)[, .SD[.N == 9], .(group = cumsum(sequence == 1))
              ][order(-date), .SD[which.min(low)], group]

Another dplyr possibility could be:

    df %>%
      group_by(group = cumsum(sequence == 1), rleid = with(rle(group), rep(seq_along(lengths), lengths))) %>%
      filter(all(c(1:9) %in% sequence)) %>%
      slice(which.min(rank(low, ties.method = "last"))) %>%
      ungroup() %>%
      select(-group, -rleid)
    #   date       sequence   low
    #   <date>        <dbl> <dbl>
    # 1 2019-01-04        4    18
    # 2 2019-01-14        1     1
    # 3 2019-02-03        8     9

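Here, it first creates a cumulative sum of sequence == 1, together with an rleid()-like variable based on that cumulative sum, and groups by both. Second, it removes the groups whose sequence does not contain all nine values. Finally, it returns the row with the minimum low per group; in case of ties it returns the last minimum (you can change this through the ties.method argument).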
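The tie handling can be checked in isolation. Below is a small sketch on a hypothetical low vector (five made-up values with the minimum tied at positions 2 and 4). It compares plain which.min() with the rank(..., ties.method = "last") approach used above.

```r
# Hypothetical values with the minimum (9) tied at positions 2 and 4
low <- c(42, 9, 41, 9, 50)

# which.min() alone returns the first position of the minimum
first_min <- which.min(low)
first_min
# [1] 2

# With ties.method = "last", tied values are ranked in reverse order of
# appearance, so the last tied minimum receives rank 1
r <- rank(low, ties.method = "last")
r
# [1] 4 2 3 1 5

last_min <- which.min(r)
last_min
# [1] 4
```

This relies on the rows within each group already being in ascending date order, so that the last tied position is also the latest date.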

This is also possible in base R.

    w <- which(df$sequence == 1)
    w <- w[sapply(w, function(x) df$sequence[x + 8] == 9 & sum(df$sequence[x:(x + 8)]) == 45)]
    do.call(rbind, Map(function(x) x[which.min(x$low), ],
                       Map(function(s) df[s, ], Map(seq, w, l=9))))
    #          date sequence low
    # 4  2019-01-04        4  18
    # 14 2019-01-14        1   1
    # 32 2019-02-01        6   9

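The trick is to find the completed sequences and collect them into a list, then rbind the which.min row of each group. The sum(.) == 45 check is there to guard against sequences that start at 1 and end at 9 but are malformed in between.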
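One caveat: which.min() returns the first position of the minimum, so the base R output above reports 2019-02-01 for the third block, whereas the question asks for the latest tied date (2019-02-03). The following is a minimal sketch of one way to pick the last tied row instead. It uses a hypothetical data frame named block that reproduces only the third cycle, and it assumes the rows are sorted by ascending date.

```r
# Hypothetical single complete block; the minimum (9) is tied on two dates
block <- data.frame(
  date     = as.Date("2019-01-27") + 0:8,
  sequence = 1:9,
  low      = c(79, 51, 52, 47, 42, 9, 41, 9, 50)
)

# which.min() picks the first tied row (2019-02-01, sequence 6)
first_min <- block[which.min(block$low), ]

# Taking the largest index where low equals the minimum picks the
# last tied row instead (2019-02-03, sequence 8)
latest_min <- block[max(which(block$low == min(block$low))), ]
latest_min
```

In the base R answer, the same expression could replace which.min(x$low) inside the Map() call; that substitution is a suggestion, not part of the original answer.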

Data

    df <- structure(list(date = structure(c(17897, 17898, 17899, 17900, 
    17901, 17902, 17903, 17904, 17905, 17906, 17907, 17908, 17909, 
    17910, 17911, 17912, 17913, 17914, 17915, 17916, 17917, 17918, 
    17919, 17920, 17921, 17922, 17923, 17924, 17925, 17926, 17927, 
    17928, 17929, 17930, 17931), class = "Date"), sequence = c(1, 
    2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9, 
    1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 8, 9), low = c(96, 81, 43, 18, 
    43, 65, 48, 90, 69, 50, 41, 73, 1, 1, 7, 49, 16, 79, 2, 74, 8, 
    88, 56, 57, 66, 29, 79, 51, 52, 47, 42, 9, 41, 9, 50)), row.names = c(NA, 
    -35L), class = "data.frame")
