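r - Match rows with the other rows that differ least on a column

I want to perform matching between two groups in a data frame: every row belonging to one group (a binary variable) should be matched, with replacement, to observations from the other group whenever their difference on another column is smaller than a preset threshold. Let's use the toy dataset below: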

set.seed(123)
df <- data.frame(id = c(1:10),
                 group = rbinom(10, 1, 0.3),
                 value = round(runif(10), 2))
threshold <- round(sd(df$value), 2)

It looks like this:

> df
   id group value
1   1     0  0.96
2   2     1  0.45
3   3     0  0.68
4   4     1  0.57
5   5     1  0.10
6   6     0  0.90
7   7     0  0.25
8   8     1  0.04
9   9     0  0.33
10 10     0  0.95
> threshold 
[1] 0.35

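In this case, I want to match rows with group==1 to rows with group==0 where the difference in value is smaller than threshold (0.35). This should result in a data frame like the following (apologies for potential errors; it was done manually):

   id matched_id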
1   2          3
2   2          7
3   2          9
4   4          3
5   4          6
6   4          7
7   4          9
8   5          7
9   5          9
10  8          7
11  8          9

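Thanks!

You can use df |> left_join(df, by = character()), which is another way of performing a Cartesian product, and then filter based on threshold. In dplyr 1.1.0 and later, cross_join(df, df) is the recommended spelling, since by = character() is deprecated there. Note that the id.x < id.y filter keeps each pair once with the lower id first, so id.x can come from either group (see rows 4 and 10 of the output).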

library(dplyr)
df |>
  left_join(df, by = character()) |>
  filter(group.x != group.y,
         id.x < id.y,
         abs(value.x - value.y) < threshold)
#>    id.x group.x value.x id.y group.y value.y
#> 1     2       1    0.45    3       0    0.68
#> 2     2       1    0.45    7       0    0.25
#> 3     2       1    0.45    9       0    0.33
#> 4     3       0    0.68    4       1    0.57
#> 5     4       1    0.57    6       0    0.90
#> 6     4       1    0.57    7       0    0.25
#> 7     4       1    0.57    9       0    0.33
#> 8     5       1    0.10    7       0    0.25
#> 9     5       1    0.10    9       0    0.33
#> 10    7       0    0.25    8       1    0.04
#> 11    8       1    0.04    9       0    0.33
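
If the result should always list the group == 1 row first, as in the expected output in the question, one option is to cross the two groups with each other instead of joining df to itself. Below is a minimal base R sketch of that idea. It assumes the toy data from the question (hard-coded here so the snippet is self-contained); the names treated, control and pairs are illustrative and not from the original post.

```r
# Toy data from the question, written out literally
df <- data.frame(
  id    = 1:10,
  group = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 0),
  value = c(0.96, 0.45, 0.68, 0.57, 0.10, 0.90, 0.25, 0.04, 0.33, 0.95)
)
threshold <- 0.35

# Split into the rows to be matched and the pool of candidate matches
treated <- df[df$group == 1, c("id", "value")]
control <- df[df$group == 0, c("id", "value")]
names(control) <- c("matched_id", "matched_value")

# by = NULL makes merge() return the Cartesian product of the two frames
pairs <- merge(treated, control, by = NULL)

# Keep only pairs whose values differ by less than the threshold
keep  <- abs(pairs$value - pairs$matched_value) < threshold
pairs <- pairs[keep, c("id", "matched_id")]

# Sort for readability and reset the row names
pairs <- pairs[order(pairs$id, pairs$matched_id), ]
rownames(pairs) <- NULL
pairs
```

On the toy data this gives the 11 id / matched_id pairs listed in the question. Like any Cartesian product, it builds every treated-by-control combination before filtering, so memory use grows with the product of the two group sizes.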

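Updated answer: the approach above was slow on a larger dataset, so I tried to make the code more efficient.

I came up with a solution that seems to do what I need. I am not sure how efficient this code is on larger data, but it appears to work.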

library(tidyverse)
# reshape2 provides melt() for matrices (it yields the Var1/Var2/value columns used below)
library(reshape2)
# All values
dist_mat <- df$value
# Adding identifier
names(dist_mat) <- df$id
# Dropping combinations that are not of interest
dist_mat_col <- dist_mat[df$group == 0]
dist_mat_row <- dist_mat[df$group == 1]
# Absolute difference between each group 1 value (rows) and group 0 value (columns)
dist_mat <- abs(outer(dist_mat_row, dist_mat_col, "-"))
# Identifying matches that fulfill the criterion (strictly below the threshold, as in the question)
dist_mat <- dist_mat < threshold
# From matrix to a long data frame
dist_mat <- reshape2::melt(dist_mat)
# Tidying up the data frame and dropping unnecessary columns and rows
dist_mat <- dist_mat %>%
  rename(id = Var1,
         matched_id = Var2,
         cond = value) %>%
  filter(cond == TRUE) %>%
  left_join(df, by = "id") %>%
  select(id, matched_id)

This results in the following data frame:

> arrange(dist_mat, id)
   id matched_id
1   2          3
2   2          7
3   2          9
4   4          3
5   4          6
6   4          7
7   4          9
8   5          7
9   5          9
10  8          7
11  8          9
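
Since efficiency on larger data is the open concern here, the following is a rough base R sketch of one way to avoid building the full difference matrix: sort the group 0 values once, then use findInterval() to locate, for each group 1 value, the range of sorted values that fall inside (value - threshold, value + threshold). This is not the approach from the answer above, just an alternative under the same matching rule. The toy data is hard-coded, and the names treated, control, lo, hi and matches are illustrative. Because the bounds are computed by adding and subtracting the threshold, values lying exactly on the boundary may be treated differently from abs(difference) < threshold due to floating-point rounding.

```r
# Toy data from the question, written out literally
df <- data.frame(
  id    = 1:10,
  group = c(0, 1, 0, 1, 1, 0, 0, 1, 0, 0),
  value = c(0.96, 0.45, 0.68, 0.57, 0.10, 0.90, 0.25, 0.04, 0.33, 0.95)
)
threshold <- 0.35

treated <- df[df$group == 1, ]
control <- df[df$group == 0, ]
control <- control[order(control$value), ]  # findInterval() needs sorted values

# lo: number of control values <= value - threshold (these are too small)
lo <- findInterval(treated$value - threshold, control$value)
# hi: number of control values < value + threshold (left.open = TRUE excludes ties)
hi <- findInterval(treated$value + threshold, control$value, left.open = TRUE)

# Matches for treated row i are the sorted control rows (lo[i] + 1):hi[i]
n_match <- pmax(hi - lo, 0L)
idx <- unlist(Map(function(l, h) if (h > l) (l + 1L):h else integer(0), lo, hi))

matches <- data.frame(
  id         = rep(treated$id, times = n_match),
  matched_id = control$id[idx]
)
matches <- matches[order(matches$id, matches$matched_id), ]
rownames(matches) <- NULL
matches
```

On the toy data this returns the same 11 pairs as above. Only the matching pairs are ever materialised, so memory use scales with the number of matches rather than with the product of the two group sizes.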
