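I want to join two datasets that contain information on individuals from the same households, but I do not have access to individual IDs. All I know is which observations belong to the same household, and the group IDs do not match between the two data frames.
I would like to match the data on the variables shared between the two data frames, knowing that the probability of duplicate matches tends to zero as the set of shared variables grows. In my case, however, this set of variables is not large enough.
Here is a working example with only two shared variables (age and gender):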
Df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), tenure = c(12,10,NA,15,9,9,10,NA) )
Df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), employed= c(0,1,NA,1,1,1,0,NA))
The final result should look like this:
Df3 <- data.frame(family_id_x = c(1,1,1,2,2,3,3,3), family_id_y = c("C","C","C","A","A","B","B","B"), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), tenure = c(12,10,NA,15,9,9,10,NA), employed= c(0,1,NA,1,1,1,0,NA))
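With a simple left join, however, I still get duplicates: for example, the observation with age = 42 and gender = M appears three times and produces duplicate matches.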
left_join(Df1, Df2, by = c("age","gender"))
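To see where the duplicates come from, it can help to count how often each (age, gender) key occurs in each table. The following is only a minimal sketch, assuming dplyr is installed; the names `key_counts`, `n_df1`, `n_df2` and `n_joined` are introduced here for illustration and are not part of the original example.

```r
library(dplyr)

Df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3),
                  age = c(42,37,10,42,35,42,37,10),
                  gender = c("M","F","M","M","F","M","F","F"),
                  tenure = c(12,10,NA,15,9,9,10,NA))
Df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"),
                  age = c(42,37,10,42,35,42,37,10),
                  gender = c("M","F","M","M","F","M","F","F"),
                  employed = c(0,1,NA,1,1,1,0,NA))

# Count occurrences of each (age, gender) key in both tables.
# A key that occurs a times in Df1 and b times in Df2 yields a * b joined rows.
key_counts <- inner_join(
  count(Df1, age, gender, name = "n_df1"),
  count(Df2, age, gender, name = "n_df2"),
  by = c("age", "gender")
) |>
  mutate(n_joined = n_df1 * n_df2)

key_counts

# Every key of Df1 also occurs in Df2 here, so this sum is the size of the left join
sum(key_counts$n_joined)
#> [1] 16
```

The key (42, M) alone contributes 3 x 3 = 9 rows and (37, F) contributes 2 x 2 = 4, so the join returns 16 rows instead of the 8 rows in Df1.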
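I would therefore like to use the information on the other family members to avoid duplicates in my join. For example, although there are three observations with age = 42 and gender = M, only one of them shares a household with two other members, one with age = 37 and gender = F and another with age = 10 and gender = M.
One solution I thought of is to widen the dataset so that each row also contains the information on the other family members, but this can get very cumbersome. Sorting the observations and adding columns may also fail if Df2 has more observations than Df1 and some observations differ between the two datasets. I wonder whether there is a more elegant solution that looks for matching combinations, as in the pictures below:
[Images: Df1, Df2]
In other words, is there a join function that matches observations which share the same combination of variables and are also part of the same group of linked observations? Thank you very much for your patience and help.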
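Well, if you are quite sure that all family members appear in both tables and that they have the same characteristics, you can join them like this: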
library(dplyr)
Df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), tenure = c(12,10,NA,15,9,9,10,NA) )
Df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), employed= c(0,1,NA,1,1,1,0,NA))
Df3 <- data.frame(family_id_x = c(1,1,1,2,2,3,3,3), family_id_y = c("C","C","C","A","A","B","B","B"), age = c(42,37,10,42,35,42,37,10), gender= c("M","F","M","M","F","M","F","F"), tenure = c(12,10,NA,15,9,9,10,NA), employed= c(0,1,NA,1,1,1,0,NA))
# aggregate by family
m1 <- Df1 |>
# keep unique combinations only. Not an issue in your example data
select(family_id, age, gender) |>
unique() |>
# important to have the same order in both tables
arrange(family_id, age, gender) |>
group_by(family_id) |>
summarise(age_gender = paste(age, gender, collapse = ", "))
m1
#> # A tibble: 3 × 2
#> family_id age_gender
#> <dbl> <chr>
#> 1 1 10 M, 37 F, 42 M
#> 2 2 35 F, 42 M
#> 3 3 10 F, 37 F, 42 M
m2 <- Df2 |>
# keep unique combinations only. Not an issue in your example data
select(family_id, age, gender) |>
unique() |>
# important to have the same order in both tables
arrange(family_id, age, gender) |>
group_by(family_id) |>
# create helper variable to identify same family members
summarise(age_gender = paste(age, gender, collapse = ", "))
m2
#> # A tibble: 3 × 2
#> family_id age_gender
#> <chr> <chr>
#> 1 A 35 F, 42 M
#> 2 B 10 F, 37 F, 42 M
#> 3 C 10 M, 37 F, 42 M
# create a mapping table of ids based on demographic characteristics (age + gender)
matches <- full_join(m1, m2, by = "age_gender", suffix = c("_x", "_y")) |> select(family_id_x, family_id_y)
matches
#> # A tibble: 3 × 2
#> family_id_x family_id_y
#> <dbl> <chr>
#> 1 1 C
#> 2 2 A
#> 3 3 B
# join both initial tables using their matched ids
df_result <- Df1 |>
left_join(matches, by = c("family_id" = "family_id_x")) |>
left_join(Df2, by = c("family_id_y" = "family_id", "age", "gender")) |>
select(family_id_x = family_id, family_id_y, age, gender, tenure, employed)
df_result
#> family_id_x family_id_y age gender tenure employed
#> 1 1 C 42 M 12 0
#> 2 1 C 37 F 10 1
#> 3 1 C 10 M NA NA
#> 4 2 A 42 M 15 1
#> 5 2 A 35 F 9 1
#> 6 3 B 42 M 9 1
#> 7 3 B 37 F 10 0
#> 8 3 B 10 F NA NA
# result as expected
all.equal(df_result, Df3)
#> [1] TRUE
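The approach above relies on every family having the same members in both tables and on no two families sharing the same age/gender signature. A possible way to check both conditions before trusting the mapping is sketched below. This is only a sketch: the helper `family_signature()` and the names `ambiguous_1`, `ambiguous_2`, `unmatched_1` and `unmatched_2` are hypothetical and introduced here for illustration.

```r
library(dplyr)

Df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3),
                  age = c(42,37,10,42,35,42,37,10),
                  gender = c("M","F","M","M","F","M","F","F"),
                  tenure = c(12,10,NA,15,9,9,10,NA))
Df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"),
                  age = c(42,37,10,42,35,42,37,10),
                  gender = c("M","F","M","M","F","M","F","F"),
                  employed = c(0,1,NA,1,1,1,0,NA))

# Hypothetical helper: one row per family with its sorted age/gender signature
family_signature <- function(df) {
  df |>
    distinct(family_id, age, gender) |>
    arrange(family_id, age, gender) |>
    group_by(family_id) |>
    summarise(age_gender = paste(age, gender, collapse = ", "), .groups = "drop")
}

m1 <- family_signature(Df1)
m2 <- family_signature(Df2)

# Signatures shared by several families within one table make the mapping ambiguous
ambiguous_1 <- m1 |> count(age_gender) |> filter(n > 1)
ambiguous_2 <- m2 |> count(age_gender) |> filter(n > 1)

# Families whose signature has no counterpart in the other table
unmatched_1 <- anti_join(m1, m2, by = "age_gender")
unmatched_2 <- anti_join(m2, m1, by = "age_gender")
```

For the example data all four tables are empty, so the signature join gives a one-to-one mapping. If any of them has rows, those families would need more shared variables or manual inspection.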
A second solution for creating the mapping table:
# solution by full join
df_full <- full_join(Df1 |>
select(family_id, age, gender) |>
unique(),
Df2 |>
select(family_id, age, gender) |>
unique(),
by = c("age", "gender"))
df_full |>
group_by(family_id.x, family_id.y) |>
summarise(n = n()) |>
filter(n == max(n))
#> `summarise()` has grouped output by 'family_id.x'. You can override using the
#> `.groups` argument.
#> # A tibble: 3 × 3
#> # Groups: family_id.x [3]
#> family_id.x family_id.y n
#> <dbl> <chr> <int>
#> 1 1 C 3
#> 2 2 A 2
#> 3 3 B 3
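The table of best-matching pairs can then serve as the mapping table in the same final join as in the first solution. The sketch below shows one way this might look. It swaps in `inner_join()` for the `full_join()` so that unmatched members do not create `NA` pairs, it assumes dplyr >= 1.1.0 for the `relationship` argument, and the names `shared`, `n_shared` and `ties` are introduced here for illustration.

```r
library(dplyr)

Df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3),
                  age = c(42,37,10,42,35,42,37,10),
                  gender = c("M","F","M","M","F","M","F","F"),
                  tenure = c(12,10,NA,15,9,9,10,NA))
Df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"),
                  age = c(42,37,10,42,35,42,37,10),
                  gender = c("M","F","M","M","F","M","F","F"),
                  employed = c(0,1,NA,1,1,1,0,NA))

# All pairs of members that agree on age and gender (many-to-many by design)
shared <- inner_join(
  distinct(Df1, family_id, age, gender),
  distinct(Df2, family_id, age, gender),
  by = c("age", "gender"), suffix = c("_x", "_y"),
  relationship = "many-to-many"
)

# For each family in Df1, keep the Df2 family with the most shared members
matches <- shared |>
  count(family_id_x, family_id_y, name = "n_shared") |>
  group_by(family_id_x) |>
  filter(n_shared == max(n_shared)) |>
  ungroup()

# Families with several equally good candidates cannot be matched unambiguously
ties <- matches |> count(family_id_x) |> filter(n > 1)

# Same final join as in the first solution
df_result <- Df1 |>
  left_join(select(matches, family_id_x, family_id_y),
            by = c("family_id" = "family_id_x")) |>
  left_join(Df2, by = c("family_id_y" = "family_id", "age", "gender")) |>
  select(family_id_x = family_id, family_id_y, age, gender, tenure, employed)
```

Unlike the signature approach, this variant tolerates a missing or differing family member, at the cost of possible ties, which is why `ties` should be checked before using `matches`.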
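Each case needs a unique label (in addition to the grouping variables)
In this case, the join variables (by = c("age", "gender")) contain duplicates. We therefore have to assign unique labels to these duplicates so that each case sharing the same key can be distinguished.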
library(tidyverse)
df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3),
age = c(42,37,10,42,35,42,37,10),
gender= c("M","F","M","M","F","M","F","F"),
tenure = c(12,10,NA,15,9,9,10,NA)) %>%
group_by(age, gender) %>% # The age and gender columns have duplicates
mutate(rown = row_number()) # Therefore, we need a case-specific column
df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"),
age = c(42,37,10,42,35,42,37,10),
gender= c("M","F","M","M","F","M","F","F"),
employed= c(0,1,NA,1,1,1,0,NA)) %>%
group_by(age, gender) %>%
mutate(rown = row_number())
df3 <- data.frame(family_id_x = c(1,1,1,2,2,3,3,3),
family_id_y = c("C","C","C","A","A","B","B","B"),
age = c(42,37,10,42,35,42,37,10),
gender= c("M","F","M","M","F","M","F","F"),
tenure = c(12,10,NA,15,9,9,10,NA),
employed= c(0,1,NA,1,1,1,0,NA))
# Join on three columns that together distinguish each case
df4 <- left_join(df1, df2, by = c('age', 'gender', "rown"))
df4
#> # A tibble: 8 x 7
#> # Groups: age, gender [5]
#>   family_id.x   age gender tenure  rown family_id.y employed
#>         <dbl> <dbl> <chr>   <dbl> <int> <chr>          <dbl>
#> 1           1    42 M          12     1 C                  0
#> 2           1    37 F          10     1 C                  1
#> 3           1    10 M          NA     1 C                 NA
#> 4           2    42 M          15     2 A                  1
#> 5           2    35 F           9     1 A                  1
#> 6           3    42 M           9     3 B                  1
#> 7           3    37 F          10     2 B                  0
#> 8           3    10 F          NA     1 B                 NA
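Note that the `rown` label is assigned by row position, so this join pairs the k-th occurrence of a key in one table with the k-th occurrence in the other. It gives the expected result here because the families happen to appear in corresponding order in both data frames. The sketch below illustrates what may happen when that is not the case. It assumes the same data with the second table sorted by `family_id`; the helper `add_rown()` and the names `Df2_sorted`, `joined` and `families_per_id` are hypothetical.

```r
library(dplyr)

Df1 <- data.frame(family_id = c(1,1,1,2,2,3,3,3),
                  age = c(42,37,10,42,35,42,37,10),
                  gender = c("M","F","M","M","F","M","F","F"),
                  tenure = c(12,10,NA,15,9,9,10,NA))
Df2 <- data.frame(family_id = c("C","C","C","A","A","B","B","B"),
                  age = c(42,37,10,42,35,42,37,10),
                  gender = c("M","F","M","M","F","M","F","F"),
                  employed = c(0,1,NA,1,1,1,0,NA))

# Hypothetical helper: number the occurrences of each (age, gender) key
add_rown <- function(df) {
  df |>
    group_by(age, gender) |>
    mutate(rown = row_number()) |>
    ungroup()
}

# Same rows as Df2, but with families in the order A, B, C
Df2_sorted <- arrange(Df2, family_id)

joined <- left_join(add_rown(Df1), add_rown(Df2_sorted),
                    by = c("age", "gender", "rown"),
                    suffix = c("_x", "_y"))

# Number of different Df2 families that each Df1 family was matched to
families_per_id <- joined |>
  group_by(family_id_x) |>
  summarise(n_matched = n_distinct(family_id_y))

families_per_id
```

With the sorted table, the members of family 1 are matched to three different families (A, B and C), so this approach is only safe when the row order of the groups is known to correspond across the two tables.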