I have two data frames with roughly the same values:
DF1:
v1 v2 v3 v4 v5 v6 v7 ......
500 40 5.2 z1 .....
500 40 7.2 z2 .....
500 40 9.0 z3 .....
500 40 3.5 z4 .....
500 40 4.2 z5 .....
DF2:
v1 v2 v3 v4 v5 v6 v7 .....
500 40 5.1 m1 .....
500 40 7.9 m2 .....
500 20 8.6 m3 .....
500 40 3.7 m4 .....
500 40 4.0 m5 .....
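I want to merge them (or use any similar function) so that my new df1 keeps only the rows where v1 and v2 match exactly, while v3 does not need to match exactly. Is there a way to match v3 to within +/- 0.2?
I would like the final df1 to look like: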
v1 v2 v3 v4 v5 v6 v7 .....
500 40 5.2 z1 .....
500 40 3.5 z4 .....
500 40 4.2 z5 .....
This is what I have so far, but I'm not sure how to account for the variability in column v3:
hed <- c("v1", "v2", "v3") #original data didn't have header
df1_final <- merge(df1, df2[hed],by=hed)
If there is a better language for handling this I would be open to it, but this is only one part of a larger R script I'm working on.
Using tidyverse, we can first join on v1 and v2, then filter with near (and a tolerance):
library(tidyverse)
# tibble() replaces the deprecated data_frame()
df1 <- tibble(v1 = c(500, 500, 500, 500, 500),
              v2 = c(40, 40, 40, 40, 40),
              v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
              v4 = c("z1", "z2", "z3", "z4", "z5"))
df2 <- tibble(v1 = c(500, 500, 500, 500, 500),
              v2 = c(40, 40, 20, 40, 40),
              v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
              v4 = c("m1", "m2", "m3", "m4", "m5"))
# Every df1 row pairs with every df2 row sharing v1 and v2;
# dplyr >= 1.1.0 warns about this many-to-many match, which is intended here.
df1 %>%
  inner_join(df2, by = c("v1", "v2")) %>%    # exact match on v1 and v2
  filter(near(v3.x, v3.y, tol = 0.21)) %>%   # keep pairs whose v3 values are within the tolerance
  rename(v3 = v3.x, v4 = v4.x) %>%           # restore df1's column names
  select(v1:v4)                              # drop the columns that came from df2
This yields
# A tibble: 3 x 4
v1 v2 v3 v4
<dbl> <dbl> <dbl> <chr>
1 500. 40. 5.20 z1
2 500. 40. 3.50 z4
3 500. 40. 4.20 z5
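The code above passes `tol = 0.21` rather than `0.2` without saying why. A likely reason is floating-point rounding: `near()` tests `abs(x - y) < tol`, and differences such as `4.2 - 4.0` come out very slightly larger than `0.2` in double precision. The sketch below uses base R only to show the effect; the `1e-9` slack is an assumed value chosen for illustration, not something from the original answer.

```r
# Difference between two v3 values that are "exactly" 0.2 apart on paper
d <- abs(4.2 - 4.0)

exact_ok  <- d <= 0.2          # FALSE: d is slightly above 0.2 in double precision
padded_ok <- d <= 0.2 + 1e-9   # TRUE: a small slack absorbs the rounding error
wide_ok   <- d < 0.21          # TRUE: the looser tolerance used in the answer

print(c(exact = exact_ok, padded = padded_ok, wide = wide_ok))
```

A small slack keeps the cutoff close to the intended 0.2, whereas `0.21` would also accept a genuine difference of, say, 0.205. Which one is appropriate depends on how precise the v3 values really are.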
If you are familiar with SQL syntax, this (and many other complex non-equi merges) is easy with sqldf:
library(sqldf)
df1 <- data.frame(v1 = c(500, 500, 500, 500, 500),
v2 = c(40, 40, 40, 40, 40),
v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
v4 = c("z1", "z2", "z3", "z4", "z5"))
df2 <- data.frame(v1 = c(500, 500, 500, 500, 500),
v2 = c(40, 40, 20, 40, 40),
v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
v4 = c("m1", "m2", "m3", "m4", "m5"))
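sqldf('
  select df1.*
  from df1
  join df2
    on df1.v1 = df2.v1             -- v1 must match exactly
   and df1.v2 = df2.v2             -- v2 must match exactly
   and df1.v3 <= df2.v3 + 0.2      -- v3 within +/- 0.2
   and df1.v3 >= df2.v3 - 0.2
')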
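For comparison, the same join-then-filter idea can be sketched without extra packages, starting from the `merge` call in the question. This is only an illustration under stated assumptions: the names `ref`, `v3_ref`, `cand`, `keep` and `tol` are made up here, and the `1e-9` slack added to the tolerance is an assumed guard against floating-point rounding.

```r
df1 <- data.frame(v1 = rep(500, 5),
                  v2 = rep(40, 5),
                  v3 = c(5.2, 7.2, 9.0, 3.5, 4.2),
                  v4 = c("z1", "z2", "z3", "z4", "z5"))
df2 <- data.frame(v1 = rep(500, 5),
                  v2 = c(40, 40, 20, 40, 40),
                  v3 = c(5.1, 7.9, 8.6, 3.7, 4.0),
                  v4 = c("m1", "m2", "m3", "m4", "m5"))

tol <- 0.2 + 1e-9  # assumed: +/- 0.2 plus a tiny slack for rounding error

# Only the key columns and v3 are needed from df2; rename v3 to avoid a clash
ref <- df2[c("v1", "v2", "v3")]
names(ref)[names(ref) == "v3"] <- "v3_ref"

# Exact match on v1 and v2: every df1 row is paired with every candidate df2 row
cand <- merge(df1, ref, by = c("v1", "v2"))

# Keep pairs whose v3 values are close enough, then return to df1's columns
keep <- abs(cand$v3 - cand$v3_ref) <= tol
df1_final <- unique(cand[keep, names(df1)])  # unique() drops rows matched more than once

# merge() does not preserve df1's row order, so restore it via v4
df1_final <- df1_final[order(match(df1_final$v4, df1$v4)), ]
rownames(df1_final) <- NULL
print(df1_final)
```

Because the intermediate table holds every pairing that shares v1 and v2, it can get large when those keys repeat often; in that case the SQL join above may be the more economical choice.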