As a newcomer to R, I'm trying to merge two data frames subject to a time-period condition.
df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"), "second_event" = c("9346","a839", "d939"), "device_serial" = c("123","123","123") , "start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),"end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"), "exp_id" = NA)
df2 <- data.frame("device_serial" = c("123","123") , exp_id= c("a","b") , start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00") , end_timestamp = c("2020-01-17 00:05:10", NULL) , current_event_id = c("1", "2") ,current_event_timestamp= c("2020-01-17 00:05:09", "2020-01-17 00:05:09"))
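This is a bit hard to explain, so I'll do my best to state the problem.
Basically, I'm monitoring some expeditions (df2), and I want to know which events (df1) belong to which expedition (see exp_id in df1; that is the column I want to fill in).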
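Note that every expedition is created by a device and, naturally, every event is generated by a device. You could argue that joining the two tables on the device id is enough. The problem is that each device can be associated with more than one expedition. So the goal is to find out which expedition a device was associated with during a given time period, so that the event can be matched to that expedition. The third row of df1 shows the difficulty with the time-period condition: given the period over which that row was recorded, it cannot be associated with expedition a.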
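There is a second problem. Sometimes an expedition has not finished yet, so we have to use the timestamp of the last event seen instead (this is current_event_timestamp in df2).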
> df1
first_event  second_event  device_serial  start_timestamp      end_timestamp        exp_id
4f7d         9346          123            2019-12-06 11:47:0   2020-01-10 12:59:38  NA
a10a         a839          123            2019-09-06 11:47:0   2019-11-22 12:06:28  NA
e79b         d939          123            2019-09-05 10:00:00  2019-11-22 12:06:28  NA
> df2
device_serial  exp_id  start_timestamp      end_timestamp        current_event_id  current_event_timestamp
123            a       2019-12-03 07:12:20  2020-01-17 00:05:10  1                 2020-01-17 00:05:09
123            b       2019-09-04 10:00:00  NULL                 2                 2019-11-23 12:06:28
The result I'm looking for is a table like df3:
> df3
first_event  second_event  device_serial  start_timestamp      end_timestamp        exp_id
4f7d         9346          123            2019-12-06 11:47:0   2020-01-10 12:59:38  a
a10a         a839          123            2019-09-06 11:47:0   2019-11-22 12:06:28  b
e79b         d939          123            2019-09-05 10:00:00  2019-11-22 12:06:28  b
Thank you for reading this question and for helping me solve it.
If I understand correctly, here are some suggestions.
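First, your data, with a couple of edits:
- Following @r2evans' comment, I'm assuming NULL should be NA_real_
- The "current_event_timestamp" values in your first block of df2 code don't match what you typed in the second block; I used the date-times from the second block, since they lead to the answer you're looking for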
df1 <- data.frame("first_event" = c("4f7d", "a10a", "e79b"),
"second_event" = c("9346","a839", "d939"),
"device_serial" = c("123","123","123") ,
"start_timestamp" = c("2019-12-06 11:47:0", "2019-09-06 11:47:0", "2019-09-05 10:00:00"),
"end_timestamp" = c("2020-01-10 12:59:38", "2019-11-22 12:06:28", "2019-11-22 12:06:28"),
"exp_id" = NA)
df2 <- data.frame("device_serial" = c("123","123") ,
exp_id= c("a","b") ,
start_timestamp = c("2019-12-03 07:12:20", "2019-09-04 10:00:00") ,
end_timestamp = c("2020-01-17 00:05:10", NA_real_) ,
current_event_id = c("1", "2") ,
current_event_timestamp= c("2020-01-17 00:05:09", "2019-11-23 12:06:28"))
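As a side note on the NULL-to-NA edit, here is a minimal base-R sketch of why it matters. The variable names (with_null, with_na, d) are made up for illustration. c() drops NULL, and data.frame() then recycles the resulting length-1 vector, so the original df2 code would quietly give both expeditions the same end time rather than a missing one.

```r
# c() silently drops NULL, so this vector has one element, not two
with_null <- c("2020-01-17 00:05:10", NULL)
# NA is kept as a placeholder and coerced to match the character vector
with_na <- c("2020-01-17 00:05:10", NA_real_)

length(with_null)
length(with_na)
is.na(with_na)

# data.frame() recycles the length-1 vector across both rows
d <- data.frame(exp_id = c("a", "b"), end_timestamp = with_null)
as.character(d$end_timestamp)
```

With NA in place, the second expedition keeps a genuinely missing end time, which is what the fallback to current_event_timestamp relies on later.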
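Now, tidy the data a little.
Two main points:
- It looks like the start_timestamp and end_timestamp columns in df1 refer to the start and end of an event, whereas the same column names in df2 refer to the start and end of an expedition. If so, it is better to give these variables names that reflect the fact that they hold different data. Here the distinction matters when the two tables are joined.
- At least in your example dfs, note that all columns are initially read in as factors (the data.frame() default before R 4.0.0). It is generally better to store variables as the data type they actually represent, and that is especially true for date-time data.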
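To illustrate the second point, here is a small base-R sketch, using made-up names (ts, parsed), of why timestamps stored as factors are awkward to work with. The integer behind a factor is only its level code, so the values have to go through as.character() before they are parsed as date-times. The UTC time zone is an assumption for the example.

```r
# Timestamps stored as a factor, as data.frame() did by default before R 4.0.0
ts <- factor(c("2019-12-06 11:47:00", "2019-09-06 11:47:00"))

# The underlying integers are level codes (alphabetical order), not times
as.integer(ts)

# Convert via character first, then parse; tz = "UTC" is assumed here
parsed <- as.POSIXct(as.character(ts), tz = "UTC")

# Real date-time values can be compared and subtracted
parsed[1] > parsed[2]
difftime(parsed[1], parsed[2], units = "days")
```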
library(dplyr)
library(lubridate)
df1 <- df1 %>%
  as_tibble() %>% # convert to tibble; prints data type of each column
  select(-exp_id, evnt_start = start_timestamp, evnt_end = end_timestamp) %>% # removing exp_id (not necessary, & messes up join) & changing names of time cols.
  mutate(evnt_start = as_datetime(as.character(evnt_start)), # converting time columns to datetime type (via character, in case they are factors)
         evnt_end = as_datetime(as.character(evnt_end)))
df1
# A tibble: 3 x 5
first_event second_event device_serial evnt_start evnt_end
<fct> <fct> <fct> <dttm> <dttm>
1 4f7d 9346 123 2019-12-06 11:47:00 2020-01-10 12:59:38
2 a10a a839 123 2019-09-06 11:47:00 2019-11-22 12:06:28
3 e79b d939 123 2019-09-05 10:00:00 2019-11-22 12:06:28
df2 <- df2 %>%
  as_tibble() %>% # convert to tibble
  rename(exp_start = start_timestamp, exp_end = end_timestamp) %>% # changing names of time cols
  mutate_at(.vars = c("exp_start", "exp_end", "current_event_timestamp"), ~as_datetime(as.character(.))) # converting time cols from factor into datetime type (via character)
df2
# A tibble: 2 x 6
device_serial exp_id exp_start exp_end current_event_id current_event_timestamp
<fct> <fct> <dttm> <dttm> <fct> <dttm>
1 123 a 2019-12-03 07:12:20 2020-01-17 00:05:10 1 2020-01-17 00:05:09
2 123 b 2019-09-04 10:00:00 NA 2 2019-11-23 12:06:28
Now, a solution using dplyr::left_join and dplyr::filter:
df3 <- df2 %>%
mutate(exp_end_or_current = if_else(is.na(exp_end), current_event_timestamp, exp_end)) %>% #creating a new col with either exp_end OR, if NA, then current timestamp
left_join(df1, ., by = ("device_serial")) %>% #join df2 to df1 by serial #
filter(evnt_start > exp_start & evnt_end < exp_end_or_current) %>% #filter, keeping only records where EVENT start & end times are between expedition start & end times
select(-c(exp_end, current_event_id, current_event_timestamp))
df3
# A tibble: 3 x 8
first_event second_event device_serial evnt_start evnt_end exp_id exp_start exp_end_or_current
<fct> <fct> <fct> <dttm> <dttm> <fct> <dttm> <dttm>
1 4f7d 9346 123 2019-12-06 11:47:00 2020-01-10 12:59:38 a 2019-12-03 07:12:20 2020-01-17 00:05:10
2 a10a a839 123 2019-09-06 11:47:00 2019-11-22 12:06:28 b 2019-09-04 10:00:00 2019-11-23 12:06:28
3 e79b d939 123 2019-09-05 10:00:00 2019-11-22 12:06:28 b 2019-09-04 10:00:00 2019-11-23 12:06:28
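For comparison, the same matching rule can be sketched in base R, without dplyr or lubridate. This is an alternative under stated assumptions, not part of the solution above: the data frame names (events, expeditions, pairs, matched) are made up, the timestamps are assumed to be UTC, and second_event is left out for brevity. As in the filter above, the inequalities are strict, and an event that fits inside no expedition drops out of the result.

```r
events <- data.frame(
  first_event   = c("4f7d", "a10a", "e79b"),
  device_serial = c("123", "123", "123"),
  evnt_start    = as.POSIXct(c("2019-12-06 11:47:00", "2019-09-06 11:47:00",
                               "2019-09-05 10:00:00"), tz = "UTC"),
  evnt_end      = as.POSIXct(c("2020-01-10 12:59:38", "2019-11-22 12:06:28",
                               "2019-11-22 12:06:28"), tz = "UTC"),
  stringsAsFactors = FALSE
)
expeditions <- data.frame(
  device_serial = c("123", "123"),
  exp_id        = c("a", "b"),
  exp_start     = as.POSIXct(c("2019-12-03 07:12:20", "2019-09-04 10:00:00"), tz = "UTC"),
  exp_end       = as.POSIXct(c("2020-01-17 00:05:10", NA), tz = "UTC"),
  current_event_timestamp = as.POSIXct(c("2020-01-17 00:05:09",
                                         "2019-11-23 12:06:28"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Use the last seen event time when an expedition has no end yet.
# Subset assignment is used because ifelse() would drop the POSIXct class.
expeditions$exp_end_or_current <- expeditions$exp_end
still_open <- is.na(expeditions$exp_end)
expeditions$exp_end_or_current[still_open] <-
  expeditions$current_event_timestamp[still_open]

# Pair every event with every expedition of the same device
pairs <- merge(events, expeditions, by = "device_serial")

# Keep only pairs where the event lies strictly inside the expedition window
inside <- pairs$evnt_start > pairs$exp_start &
  pairs$evnt_end < pairs$exp_end_or_current
matched <- pairs[inside, c("first_event", "device_serial",
                           "evnt_start", "evnt_end", "exp_id")]
matched <- matched[order(matched$first_event), ]
matched
```

If events without a matching expedition should be kept with exp_id set to NA, one option is to merge matched back onto events with all.x = TRUE.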