R: apply a rule to timestamps in one column based on the values in another column, to make filtering easier



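I found this situation tricky to put into words, so I apologize if it is a duplicate. I searched for a solution based on my best understanding of the output I need.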

Suppose I have a data frame like this:

df <- data.frame(ID = c(555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555),
A = c('2020-04-13 19:10:00', '2020-04-14 20:00:00', '2020-04-13 19:10:00', '2020-04-14 20:00:00',
'2020-04-22 08:13:00', '2020-04-23 19:00:00', '2020-04-13 19:10:00', '2020-04-14 20:00:00',
'2020-04-22 08:13:00', '2020-04-23 19:00:00', '2020-04-27 13:41:10', '2020-05-01 22:01:00'),
B = c('2020-04-15 12:00:00', '2020-04-15 12:00:00', '2020-04-24 11:00:00', '2020-04-24 11:00:00',
'2020-04-24 11:00:00', '2020-04-24 11:00:00', '2020-05-07 10:30:00', '2020-05-07 10:30:00',
'2020-05-07 10:30:00', '2020-05-07 10:30:00', '2020-05-07 10:30:00', '2020-05-07 10:30:00')
)
df$A <- as.POSIXct(df$A)
df$B <- as.POSIXct(df$B)

> df
    ID                   A                   B
1  555 2020-04-13 19:10:00 2020-04-15 12:00:00
2  555 2020-04-14 20:00:00 2020-04-15 12:00:00
3  555 2020-04-13 19:10:00 2020-04-24 11:00:00
4  555 2020-04-14 20:00:00 2020-04-24 11:00:00
5  555 2020-04-22 08:13:00 2020-04-24 11:00:00
6  555 2020-04-23 19:00:00 2020-04-24 11:00:00
7  555 2020-04-13 19:10:00 2020-05-07 10:30:00
8  555 2020-04-14 20:00:00 2020-05-07 10:30:00
9  555 2020-04-22 08:13:00 2020-05-07 10:30:00
10 555 2020-04-23 19:00:00 2020-05-07 10:30:00
11 555 2020-04-27 13:41:10 2020-05-07 10:30:00
12 555 2020-05-01 22:01:00 2020-05-07 10:30:00

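I want to reduce the data to one row per distinct value of B, keeping the minimum value of A for that B, with the requirement that A must be greater than the preceding value of B (if there is one). The desired output would therefore reduce to: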

> df
   ID                   A                   B
1 555 2020-04-13 19:10:00 2020-04-15 12:00:00
2 555 2020-04-22 08:13:00 2020-04-24 11:00:00
3 555 2020-04-27 13:41:10 2020-05-07 10:30:00

In the real data there will be more IDs, so ID and B will be the grouping variables. Is this possible with dplyr?

You can try the following code:

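library(dplyr)

df %>%
  group_by(ID) %>%
  # B of the previous row within each ID (assumes rows are ordered by B);
  # the epoch is used as the default when there is no previous row
  mutate(previous_B = lag(B, default = as.POSIXct(0, origin = '1970-01-01'))) %>%
  group_by(B, .add = TRUE) %>%
  # the first row of each (ID, B) group carries the B of the preceding group
  filter(A > first(previous_B)) %>%
  summarise(A = min(A)) %>%
  ungroup() %>%
  select(ID, A, B)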
#    ID    A                   B                  
#  <dbl> <dttm>              <dttm>             
#1   555 2020-04-13 19:10:00 2020-04-15 12:00:00
#2   555 2020-04-22 08:13:00 2020-04-24 11:00:00
#3   555 2020-04-27 13:41:10 2020-05-07 10:30:00

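This keeps only the rows where A is greater than the B value of the previous group (the last B seen before the current group starts). After that, it simply keeps the minimum value of A in each group.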
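Since the question mentions that the real data contains more than one ID, here is a minimal sketch of the same filtering rule on a two-ID sample. The rows for ID 777 are invented purely for illustration, and the names `df2`, `b_lookup`, and `prev_B` are hypothetical. Note that this is a variant rather than the exact code above: it looks up the previous distinct B per ID through a join, which should make it independent of row order, at the cost of an extra step.

```r
library(dplyr)

# Hypothetical sample: the ID 777 rows are made up for illustration.
df2 <- data.frame(
  ID = c(555, 555, 555, 555, 777, 777, 777),
  A  = as.POSIXct(c("2020-04-13 19:10:00", "2020-04-14 20:00:00",
                    "2020-04-13 19:10:00", "2020-04-22 08:13:00",
                    "2020-06-01 09:00:00",
                    "2020-06-01 09:00:00", "2020-06-03 10:00:00"), tz = "UTC"),
  B  = as.POSIXct(c("2020-04-15 12:00:00", "2020-04-15 12:00:00",
                    "2020-04-24 11:00:00", "2020-04-24 11:00:00",
                    "2020-06-02 12:00:00",
                    "2020-06-05 12:00:00", "2020-06-05 12:00:00"), tz = "UTC")
)

# One row per (ID, B) holding the previous distinct B of the same ID
# (NA for the first B of each ID).
b_lookup <- df2 %>%
  distinct(ID, B) %>%
  arrange(ID, B) %>%
  group_by(ID) %>%
  mutate(prev_B = lag(B)) %>%
  ungroup()

result <- df2 %>%
  left_join(b_lookup, by = c("ID", "B")) %>%
  # keep rows with no previous B, or with A later than the previous B
  filter(is.na(prev_B) | A > prev_B) %>%
  group_by(ID, B) %>%
  # smallest remaining A in each (ID, B) group
  summarise(A = min(A), .groups = "drop") %>%
  select(ID, A, B)

print(result)
```

With this sample, `result` should contain one row per (ID, B) pair, four rows in total. A (ID, B) group in which no A is later than the previous B would drop out entirely, which may or may not be what you want.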

I think I have found a solution for you:

my_df <- data.frame(ID = c(555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555, 555),
A = c('2020-04-13 19:10:00', '2020-04-14 20:00:00', '2020-04-13 19:10:00', '2020-04-14 20:00:00',
'2020-04-22 08:13:00', '2020-04-23 19:00:00', '2020-04-13 19:10:00', '2020-04-14 20:00:00',
'2020-04-22 08:13:00', '2020-04-23 19:00:00', '2020-04-27 13:41:10', '2020-05-01 22:01:00'),
B = c('2020-04-15 12:00:00', '2020-04-15 12:00:00', '2020-04-24 11:00:00', '2020-04-24 11:00:00',
'2020-04-24 11:00:00', '2020-04-24 11:00:00', '2020-05-07 10:30:00', '2020-05-07 10:30:00',
'2020-05-07 10:30:00', '2020-05-07 10:30:00', '2020-05-07 10:30:00', '2020-05-07 10:30:00')
)
my_df$A <- as.POSIXct(my_df$A)
my_df$B <- as.POSIXct(my_df$B)
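library(dplyr)

# Previous distinct B for each (ID, B); NA for the first B of an ID
b_lookup <- my_df %>% distinct(ID, B) %>% arrange(ID, B) %>%
  group_by(ID) %>% mutate(prev_B = lag(B)) %>% ungroup()

# Compare full timestamps (not time-of-day strings) so that the date counts,
# then keep the row with the smallest A in each (ID, B) group
my_result <- my_df %>% left_join(b_lookup, by = c("ID", "B")) %>%
  group_by(ID, B) %>% filter(is.na(prev_B) | A > prev_B) %>%
  filter(A == min(A)) %>% ungroup() %>% select(ID, A, B)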

Try it on your full data and let me know whether it gives you the expected result.
