In R, how do I extract rows based on another data frame when the values don't need to match exactly, only closely enough?



I have two data frames with roughly the same values.

DF1:

v1 v2 v3 v4 v5 v6 v7 ......
500 40 5.2 z1 .....
500 40 7.2 z2 .....
500 40 9.0 z3 .....
500 40 3.5 z4 .....
500 40 4.2 z5 .....

DF2:

v1 v2 v3 v4 v5 v6 v7 .....
500 40 5.1 m1 .....
500 40 7.9 m2 .....
500 20 8.6 m3 .....
500 40 3.7 m4 .....
500 40 4.0 m5 .....

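I want to merge (or use any similar function) so that my new df1 keeps the rows where v1 and v2 match exactly, while v3 does not need to match exactly. Is there a way to match v3 to within +/- 0.2?

I would like the final df1 to look like: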

v1 v2 v3 v4 v5 v6 v7 .....
500 40 5.2 z1 .....
500 40 3.5 z4 .....
500 40 4.2 z5 .....

I have the following so far, but I'm not sure how to account for the variability in column v3.

hed <- c("v1", "v2", "v3") #original data didn't have header
df1_final <- merge(df1, df2[hed],by=hed)

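If another language handles this better I'm open to it, but this is only one part of a larger R script I'm working on.

Using the tidyverse, we can first join on v1 and v2, then filter with near() and a tolerance: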

library(tidyverse)

df1 <- tibble(v1 = c(500, 500, 500, 500, 500),
              v2 = c(40, 40, 40, 40, 40),
              v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
              v4 = c("z1", "z2", "z3", "z4", "z5"))
df2 <- tibble(v1 = c(500, 500, 500, 500, 500),
              v2 = c(40, 40, 20, 40, 40),
              v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
              v4 = c("m1", "m2", "m3", "m4", "m5"))

df1 %>%
  full_join(df2, by = c("v1", "v2")) %>%    # join on v1 and v2 (exact match)
  # keep pairs whose v3 values are close; 0.21 rather than 0.2 absorbs floating-point error
  filter(near(v3.x, v3.y, tol = 0.21)) %>%
  rename(v3 = v3.x, v4 = v4.x) %>%          # restore df1's column names
  select(v1:v4)                             # keep only df1's columns

This yields

# A tibble: 3 x 4
     v1    v2    v3 v4
  <dbl> <dbl> <dbl> <chr>
1  500.   40.  5.20 z1
2  500.   40.  3.50 z4
3  500.   40.  4.20 z5
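
Why 0.21 and not 0.2? near() tests abs(x - y) < tol, and in double-precision arithmetic abs(3.5 - 3.7) comes out slightly larger than 0.2, so a tolerance of exactly 0.2 would drop row z4. Here is a minimal base R sketch of that boundary case. The helper name within_tol and the eps slack value are my own illustrative assumptions, not part of dplyr:

```r
# Row z4 (v3 = 3.5) against row m4 (v3 = 3.7): nominally exactly 0.2 apart.
d <- abs(3.5 - 3.7)

d < 0.2    # FALSE: the computed difference is a hair above 0.2
d < 0.21   # TRUE: the slightly larger tolerance keeps the pair

# Hypothetical helper: inclusive tolerance plus a tiny slack for rounding error.
within_tol <- function(x, y, tol, eps = 1e-9) {
  abs(x - y) <= tol + eps
}

within_tol(c(5.2, 3.5, 4.2), c(5.1, 3.7, 4.0), tol = 0.2)  # the three wanted pairs
within_tol(7.2, 7.9, tol = 0.2)                            # z2 vs m2, too far apart
```

If the padded tolerance feels arbitrary, an explicit slack like eps above states the intent more directly, though the right value depends on the scale of your data.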

If you're familiar with SQL syntax, this (and many other complex non-equi merges) is easy with sqldf:

library(sqldf)
df1 <- data.frame(v1 = c(500, 500, 500, 500, 500),
                  v2 = c(40, 40, 40, 40, 40),
                  v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
                  v4 = c("z1", "z2", "z3", "z4", "z5"))
df2 <- data.frame(v1 = c(500, 500, 500, 500, 500),
                  v2 = c(40, 40, 20, 40, 40),
                  v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
                  v4 = c("m1", "m2", "m3", "m4", "m5"))

# Match v1 and v2 exactly; match v3 to within +/- 0.2
sqldf('
  select df1.*
  from df1
  join df2
    on df1.v1 = df2.v1
   and df1.v2 = df2.v2
   and df1.v3 <= df2.v3 + 0.2
   and df1.v3 >= df2.v3 - 0.2
')
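
If you would rather stay close to the merge() call from the question, the same idea can be sketched in base R: merge on the exact keys, then filter the candidate pairs by distance. This is only a sketch under some assumptions: row_id is a hypothetical helper column I add to trace matches back to df1, and eps is an assumed slack for floating-point rounding.

```r
df1 <- data.frame(v1 = rep(500, 5), v2 = rep(40, 5),
                  v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
                  v4 = c("z1", "z2", "z3", "z4", "z5"))
df2 <- data.frame(v1 = rep(500, 5), v2 = c(40, 40, 20, 40, 40),
                  v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
                  v4 = c("m1", "m2", "m3", "m4", "m5"))

tol <- 0.2
eps <- 1e-9                          # assumed slack for rounding error
df1$row_id <- seq_len(nrow(df1))     # hypothetical helper column

# Exact match on v1 and v2; clashing columns get the .x / .y suffixes.
cand <- merge(df1, df2, by = c("v1", "v2"), suffixes = c(".x", ".y"))

# Keep only the candidate pairs whose v3 values are within the tolerance.
hits <- cand[abs(cand$v3.x - cand$v3.y) <= tol + eps, ]

# Original df1 rows with at least one match, in df1 order, helper column dropped.
df1_final <- df1[df1$row_id %in% hits$row_id, setdiff(names(df1), "row_id")]
df1_final
```

Filtering df1 by row_id rather than returning hits directly means a df1 row that is close to several df2 rows still appears only once. Note that merge() builds every v1/v2 pairing before filtering, so this can get memory-hungry on large inputs.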
