r - How to merge data frames by the nearest date per ID (with repeated values)



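I have two data frames that I am trying to join by date (grouped by individual).

I have made example data frames of both (the actual df1 has 5700 rows and the actual df2 has 287 rows).

df1 has IDs (including some that are not in df2), dates, and behavior values.

df2 has IDs (fewer than df1), dates (fewer than the dates in df1), and hormone values.

My goal is to match, for a given individual, the hormone from the nearest date in df2 to the nearest date in df1 (matching as closely as possible, but only copying the df2 hormone value into df1 when the nearest dates are 2 days apart or less).

I would like hormones that do not match any behavioral observation to be printed at the bottom of the new data frame along with their dates, so that they are not lost (see df3 for an example).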

df1
ID    Date        behavior
a     1-12-2020   0
b     1-12-2020   1
b     1-13-2020   1
c     1-12-2020   2
d     1-12-2020   0
c     1-13-2020   1
c     1-14-2020   0
c     1-15-2020   1
c     1-16-2020   2
df2
ID    Date        hormone
a     1-10-2020   20
b     1-18-2019   70
c     1-10-2020   80
c     1-16-2020   90
#goal dataframe
df3
ID    Date        behavior hormone
a     1-12-2020   0        20
b     1-12-2020   1        NA [> 2 days from hormone]
b     1-13-2020   1        NA [> 2 days from hormone]
c     1-12-2020   2        80
d     1-12-2020   0        NA [no matching individual in df2]
c     1-13-2020   1        NA [> 2 days from hormone]
c     1-14-2020   0        90
c     1-15-2020   1        90
c     1-16-2020   2        90
b     1-18-2019   NA       70 [unmatched hormone at bottom of df3]

Here is the code to create these data frames:

df1 <- data.frame(ID = c("a", "b", "b", "c", "d", "c", "c", "c", "c"),
                  date = c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020", "1-12-2020",
                           "1-13-2020", "1-14-2020", "1-15-2020", "1-16-2020"),
                  behavior = c(0, 1, 1, 2, 0, 1, 0, 1, 2))
df2 <- data.frame(ID = c("a", "b", "c", "c"),
                  date = c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
                  hormone = c(20, 70, 80, 90))
# Parse the month-day-year strings directly into Date objects
# (needed for the nearest-date function to work)
df1$date <- as.Date(df1$date, format = "%m-%d-%Y")
df2$date <- as.Date(df2$date, format = "%m-%d-%Y")

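I have been able to use a function from a previous question on this forum (link and code below) to match the nearest dates and fill in repeated values, but I cannot limit the time range of the match or print the unmatched dates in new rows. Is there a way to do this?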

This is where I started from (code below): How to match by nearest date from two data frames?

library(dplyr)  # mutate(), left_join(), %>%
library(purrr)  # map_int()
# Function to get the index of the closest date in d2 that is on or before d1
# (dates in d2 that fall after d1 are given an infinite distance)
Ind_closest_or_after <- function(d1, d2){
  which.min(ifelse(d1 - d2 < 0, Inf, d1 - d2))
}
# Calculate the indices
closest_or_after_ind <- map_int(.x = df1$date, .f = Ind_closest_or_after, d2 = df2$date)
# Add index columns to the data frames and join
df2 <- df2 %>%
  mutate(ind = 1:nrow(df2))
df1 <- df1 %>%
  mutate(ind = closest_or_after_ind)
df3 <- left_join(df2, df1, by = 'ind')

This answer seems to come closest, but it does not limit the values: Merge two data frames by nearest date and ID

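# Does everything except limit the date range and print unmatched rows
library(data.table)
# Convert df2 to a data.table by reference (the := assignment leaves date unchanged)
setDT(df2)[, date := date]
# Rolling join: for each row of df1, take the df2 row with the same ID and the nearest date
df2[df1, on = .(ID, date = date), roll = 'nearest']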

You can join the tables by filtering over all possible combinations (a cross product built with expand_grid):

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
df1 <- data.frame(ID = c("a", "b", "b", "c", "d", "c", "c","c", "c"),
date = c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020", "1-12-2020","1-13-2020","1-14-2020","1-15-2020","1-16-2020"),
behavior = c(0,1,1,2,0,1,0,1,2) )
df2 <- data.frame(ID = c("a", "b", "c", "c"),
date = c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
hormone = c(20,70,80,90) )
joined <-
  df1 %>%
  rename_all(~ paste0(., ".1")) %>%
  expand_grid(df2 %>% rename_all(~ paste0(., ".2"))) %>%
  mutate(across(starts_with("date"), ~ .x %>% parse_date(format = "%m-%d-%Y"))) %>%
  mutate(time_diff = abs(date.1 - date.2)) %>%
  filter(time_diff <= days(2) & ID.1 == ID.2) %>%
  select(ID = ID.1, behavior = behavior.1, hormone = hormone.2)
joined
#> # A tibble: 5 x 3
#>   ID    behavior hormone
#>   <chr>    <dbl>   <dbl>
#> 1 a            0      20
#> 2 c            2      80
#> 3 c            0      90
#> 4 c            1      90
#> 5 c            2      90
df1 %>%
  left_join(joined) %>%
  full_join(df2) %>%
  as_tibble() %>%
  distinct(ID, behavior, .keep_all = TRUE) %>%
  arrange(ID, behavior)
#> Joining, by = c("ID", "behavior")
#> Joining, by = c("ID", "date", "hormone")
#> # A tibble: 9 x 4
#>   ID    date      behavior hormone
#>   <chr> <chr>        <dbl>   <dbl>
#> 1 a     1-12-2020        0      20
#> 2 a     1-10-2020       NA      20
#> 3 b     1-12-2020        1      NA
#> 4 b     1-18-2019       NA      70
#> 5 c     1-14-2020        0      90
#> 6 c     1-13-2020        1      90
#> 7 c     1-12-2020        2      80
#> 8 c     1-10-2020       NA      80
#> 9 d     1-12-2020        0      NA

Created on 2022-02-18 by the reprex package (v2.0.0)

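This produces one row per (ID, behavior) pair. You can replace that with, for example, ID, date in the call to distinct() so that each ID has only one row for any given time point.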
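As an alternative that keys on ID and date rather than on (ID, behavior), here is a minimal base R sketch of the matching described in the question. It is not the code from the answer above. The names `max_gap`, `nearest_idx`, `unused` and `extra` are hypothetical and introduced only for illustration. The sketch assumes the 2-day tolerance from the question and that ties between two equally close hormone dates can be resolved by taking the first one.

```r
# Example data from the question, with dates parsed directly as Date
df1 <- data.frame(
  ID = c("a", "b", "b", "c", "d", "c", "c", "c", "c"),
  date = as.Date(c("1-12-2020", "1-12-2020", "1-13-2020", "1-12-2020", "1-12-2020",
                   "1-13-2020", "1-14-2020", "1-15-2020", "1-16-2020"),
                 format = "%m-%d-%Y"),
  behavior = c(0, 1, 1, 2, 0, 1, 0, 1, 2)
)
df2 <- data.frame(
  ID = c("a", "b", "c", "c"),
  date = as.Date(c("1-10-2020", "1-18-2019", "1-10-2020", "1-16-2020"),
                 format = "%m-%d-%Y"),
  hormone = c(20, 70, 80, 90)
)

max_gap <- 2  # assumed tolerance in days, taken from the question

# For each df1 row: index of the nearest df2 row with the same ID
# that lies within max_gap days, or NA if there is none
nearest_idx <- vapply(seq_len(nrow(df1)), function(i) {
  gap <- abs(as.numeric(df2$date - df1$date[i]))
  gap[df2$ID != df1$ID[i] | gap > max_gap] <- NA
  if (all(is.na(gap))) NA_integer_ else which.min(gap)  # first match wins on ties
}, integer(1))

# Copy the matched hormone values next to the behavior rows
df3 <- df1
df3$hormone <- df2$hormone[nearest_idx]

# Hormone rows that were never matched are appended at the bottom
unused <- setdiff(seq_len(nrow(df2)), nearest_idx)
if (length(unused) > 0) {
  extra <- data.frame(ID = df2$ID[unused], date = df2$date[unused],
                      behavior = NA_real_, hormone = df2$hormone[unused])
  df3 <- rbind(df3, extra)
}
print(df3)
```

On the example data this should keep the nine behavior rows in their original order and add the single unmatched hormone row (ID b, 2019-01-18) at the end. The loop compares every df1 row with every df2 row, which is likely fine for the sizes mentioned in the question (5700 by 287) but may be slow on much larger tables.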
