In R: joining two data frames based on a time-period condition



As a newcomer to R, I'm trying to merge two data frames while taking a time-period condition into account.

df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"), "second_event" = c("9346","a839", "d939"), "device_serial" = c("123","123","123") , "start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),"end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"), "exp_id" = NA)
df2 <- data.frame("device_serial" =  c("123","123") , exp_id= c("a","b") ,    start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00") ,       end_timestamp = c("2020-01-17 00:05:10", NULL)     ,    current_event_id = c("1", "2")   ,current_event_timestamp= c("2020-01-17 00:05:09", "2020-01-17 00:05:09"))

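This is a bit hard to explain, but I'll do my best to state the problem.

Basically, I'm monitoring some expeditions (df2), and I want to know which events (df1) belong to which expedition (see exp_id in df1; that is the column I want to fill in).

Note that every expedition is created by a device and, naturally, every event is generated by a device. You might say this can be done by joining the two tables on the device id. The problem, however, is that each device can be associated with more than one expedition. So the goal is to find out which expedition a device was associated with during a specific time period, so that the events can be matched to that expedition. If you look at the third row of df1, you'll see where the time-period condition gives me trouble: given the time span recorded in that row, we cannot associate it with expedition a.

There is one more problem. Sometimes an expedition has not finished yet, so we have to use the timestamp of the last event seen instead (this is current_event_timestamp in df2).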

> df1
first_event  second_event  device_serial  start_timestamp      end_timestamp        exp_id
4f7d         9346          123            2019-12-06 11:47:0   2020-01-10 12:59:38  NA
a10a         a839          123            2019-09-06 11:47:0   2019-11-22 12:06:28  NA
e79b         d939          123            2019-09-05 10:00:00  2019-11-22 12:06:28  NA
> df2
device_serial  exp_id  start_timestamp      end_timestamp        current_event_id  current_event_timestamp
123            a       2019-12-03 07:12:20  2020-01-17 00:05:10  1                 2020-01-17 00:05:09
123            b       2019-09-04 10:00:00  NULL                 2                 2019-11-23 12:06:28

The result I'm looking for is a table df3 like this:

> df3
first_event  second_event  device_serial  start_timestamp      end_timestamp        exp_id
4f7d         9346          123            2019-12-06 11:47:0   2020-01-10 12:59:38  a
a10a         a839          123            2019-09-06 11:47:0   2019-11-22 12:06:28  b
e79b         d939          123            2019-09-05 10:00:00  2019-11-22 12:06:28  b

Thank you for reading this question and for helping me solve it.

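If I've understood you correctly, here are some suggestions.

First, your data, with a couple of edits:

  1. Following @r2evans's comment, I'm assuming the NULL should be NA_real_.
  2. The "current_event_timestamp" values in your first block of df2 code don't match what you typed in the second block; I used the date-times from the second block, because they lead to the answer you're looking for.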
df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"), 
"second_event" = c("9346","a839", "d939"), 
"device_serial" = c("123","123","123") , 
"start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),
"end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"), 
"exp_id" = NA)
df2 <- data.frame("device_serial" =  c("123","123") , 
exp_id= c("a","b") ,    
start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00") ,       
end_timestamp = c("2020-01-17 00:05:10", NA_real_)     ,   
current_event_id = c("1", "2")   ,
current_event_timestamp= c("2020-01-17 00:05:09", "2019-11-23 12:06:28"))
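
As a side note on the first edit, here is a minimal base-R sketch of why the NULL had to go. c() silently drops NULL, so the end_timestamp vector in the question ends up one element short of the other columns, whereas NA keeps its slot. The names with_null and with_na are only illustrative.

```r
# c() discards NULL entirely, so this vector has only one element
with_null <- c("2020-01-17 00:05:10", NULL)

# NA is a real (missing) value and keeps its position
with_na <- c("2020-01-17 00:05:10", NA)

length(with_null)
length(with_na)
is.na(with_na)
```

A column of length 1 next to columns of length 2 is what trips up data.frame() in the question's df2 code; with NA every column has two entries.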

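Now, let's tidy the data up a little.

Two main points:

  1. It looks as though the start_timestamp and end_timestamp columns in df1 refer to the start and end of an event, while the columns with the same names in df2 refer to the start and end of an expedition. If so, it's best to give these variables names that reflect the fact that they hold different data. In this case the distinction matters when the two tables are joined.
  2. At least in your example dfs, note that all columns were initially read in as factors. It is generally best to store variables as the data type they represent, and this is especially true for date-time data.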
library(dplyr)
library(lubridate)
df1 <- df1 %>% 
  as_tibble() %>% # convert to tibble; prints data type of each column
  select(-exp_id, evnt_start = start_timestamp, evnt_end = end_timestamp) %>% # drop exp_id (not needed, and it messes up the join) and rename the time columns
  mutate(evnt_start = as_datetime(evnt_start), # convert the time columns to datetime type
         evnt_end = as_datetime(evnt_end))
df1
# A tibble: 3 x 5
first_event second_event device_serial evnt_start          evnt_end           
<fct>       <fct>        <fct>         <dttm>              <dttm>             
1 4f7d        9346         123           2019-12-06 11:47:00 2020-01-10 12:59:38
2 a10a        a839         123           2019-09-06 11:47:00 2019-11-22 12:06:28
3 e79b        d939         123           2019-09-05 10:00:00 2019-11-22 12:06:28
df2 <- df2 %>% 
  as_tibble() %>% # convert to tibble
  rename(exp_start = start_timestamp, exp_end = end_timestamp) %>% # rename the time columns
  mutate_at(.vars = c("exp_start", "exp_end", "current_event_timestamp"), ~as_datetime(.)) # convert the time columns from factor to datetime type
df2
# A tibble: 2 x 6
device_serial exp_id exp_start           exp_end             current_event_id current_event_timestamp
<fct>         <fct>  <dttm>              <dttm>              <fct>            <dttm>                 
1 123           a      2019-12-03 07:12:20 2020-01-17 00:05:10 1                2020-01-17 00:05:09    
2 123           b      2019-09-04 10:00:00 NA                  2                2019-11-23 12:06:28    
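
To illustrate why the type conversion above matters, here is a small base-R sketch (not part of the original answer). It assumes UTC as the time zone, which is also the default that lubridate::as_datetime uses; the variable names are illustrative only.

```r
# Two of the event start times, stored as a factor as in the original data
ts_factor <- factor(c("2019-12-06 11:47:00", "2019-09-06 11:47:00"))

# Ordering comparisons are not defined for unordered factors:
# R warns and returns NA (the warning is suppressed here)
cmp_factor <- suppressWarnings(ts_factor[1] > ts_factor[2])

# After conversion to POSIXct (assumption: UTC), comparisons and
# time arithmetic behave as expected
ts_time   <- as.POSIXct(as.character(ts_factor), tz = "UTC")
cmp_time  <- ts_time[1] > ts_time[2]
span_days <- as.numeric(difftime(ts_time[1], ts_time[2], units = "days"))

cmp_factor
cmp_time
span_days
```

The filter step in the solution relies on exactly this kind of `>` and `<` comparison, so it only makes sense once the columns are real date-times.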

Now, a solution using dplyr::left_join and dplyr::filter:

df3 <- df2 %>% 
  mutate(exp_end_or_current = if_else(is.na(exp_end), current_event_timestamp, exp_end)) %>% # new column holding exp_end or, if that is NA, the current event timestamp
  left_join(df1, ., by = ("device_serial")) %>% # join df2 to df1 by device serial number
  filter(evnt_start > exp_start & evnt_end < exp_end_or_current) %>% # keep only records where the EVENT start & end times fall between the expedition start & end times
  select(-c(exp_end, current_event_id, current_event_timestamp))
df3
# A tibble: 3 x 8
first_event second_event device_serial evnt_start          evnt_end            exp_id exp_start           exp_end_or_current 
<fct>       <fct>        <fct>         <dttm>              <dttm>              <fct>  <dttm>              <dttm>             
1 4f7d        9346         123           2019-12-06 11:47:00 2020-01-10 12:59:38 a      2019-12-03 07:12:20 2020-01-17 00:05:10
2 a10a        a839         123           2019-09-06 11:47:00 2019-11-22 12:06:28 b      2019-09-04 10:00:00 2019-11-23 12:06:28
3 e79b        d939         123           2019-09-05 10:00:00 2019-11-22 12:06:28 b      2019-09-04 10:00:00 2019-11-23 12:06:28
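
For comparison, the same join-then-filter idea can be sketched in base R. This is a hedged alternative, not the method used above. It rests on several assumptions: the data frames events and expeditions and the helper to_time are hypothetical names, timestamps are treated as UTC, and first_event is assumed to identify each event uniquely. One difference in behaviour: because the last step uses all.x = TRUE, an event that falls inside no expedition would stay in the result with exp_id set to NA instead of being dropped by the filter.

```r
# Hypothetical helper: parse timestamps as POSIXct (assumption: UTC)
to_time <- function(x) as.POSIXct(x, tz = "UTC")

events <- data.frame(
  first_event   = c("4f7d", "a10a", "e79b"),
  device_serial = c("123", "123", "123"),
  evnt_start    = to_time(c("2019-12-06 11:47:00", "2019-09-06 11:47:00", "2019-09-05 10:00:00")),
  evnt_end      = to_time(c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28")),
  stringsAsFactors = FALSE
)

expeditions <- data.frame(
  device_serial = c("123", "123"),
  exp_id        = c("a", "b"),
  exp_start     = to_time(c("2019-12-03 07:12:20", "2019-09-04 10:00:00")),
  exp_end       = to_time(c("2020-01-17 00:05:10", NA)),
  current_event_timestamp = to_time(c("2020-01-17 00:05:09", "2019-11-23 12:06:28")),
  stringsAsFactors = FALSE
)

# For unfinished expeditions, fall back to the last event timestamp seen.
# Indexed assignment keeps the POSIXct class intact.
expeditions$exp_end_or_current <- expeditions$exp_end
unfinished <- is.na(expeditions$exp_end)
expeditions$exp_end_or_current[unfinished] <- expeditions$current_event_timestamp[unfinished]

# Pair every event with every expedition of the same device ...
candidates <- merge(events, expeditions, by = "device_serial")

# ... and keep the pairs where the event lies inside the expedition window
inside <- candidates$evnt_start > candidates$exp_start &
  candidates$evnt_end < candidates$exp_end_or_current
matched <- candidates[inside, c("first_event", "exp_id")]

# Attach exp_id to the events; events without a match keep exp_id = NA
result <- merge(events, matched, by = "first_event", all.x = TRUE)
result[, c("first_event", "device_serial", "exp_id")]
```

If an event could fall inside two overlapping expeditions of the same device, it would appear twice in result, so it may be worth checking for duplicated first_event values afterwards.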
