How can I merge two data frames by matching two numeric columns within a +/-5 range?



I have two data frames, as follows:

df1 <- data.frame(chrom = c(1,1,3,6,6),
chromStart = c(15433, 1959,34205,35043, 77456),
chromEnd = c(15700, 2001,36245,36245,78469), 
id = c('aaad', 'dfk', 'bb', 'llk', 'ie9o'))
df2 <- data.frame(chrom = c(1,1,5,1,6),
chromStart2 = c(15433, 1961,34205,1962, 77456),
chromEnd2 = c(15700, 2002,36245,1999,78480))

I want to merge the two data frames by matching chrom == chrom, chromStart within between(chromStart2 - 5, chromStart2 + 5), and chromEnd within between(chromEnd2 - 5, chromEnd2 + 5). This is what I have tried:

library(dplyr)
colnames(df2) <- c('chrom','chromStart', 'chromEnd')
merged <- inner_join(df1,df2)

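However, this only matches exact chromStart and chromEnd values, so in this example only aaad matches. I would like to allow a plus-or-minus range so that dfk matches as well. My actual data frames have 260,000 and 179,000 rows, so I would prefer a memory-efficient approach if possible. This is the result I am looking for: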

data.frame(chrom = c(1,1,1),
chromStart = c(15433, 1959,1959),
chromEnd = c(15700, 2001,2001), 
id = c('aaad', 'dfk', 'dfk'),
chromStart2 = c(15433, 1961,1962),
chromEnd2 = c(15700, 2002,1999))

There may be better or more efficient ways, but both of these should work.

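dplyr approach: starting from df2 with its original column names (chromStart2, chromEnd2), join on chrom, create two temporary logical columns from your conditions, filter to the rows that satisfy both, then drop the temporary columns with select: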

merged <- inner_join(df1, df2, by = "chrom") %>%
  mutate(
    # TRUE when each df1 coordinate lies within +/-5 of the df2 coordinate
    inStart = chromStart >= chromStart2 - 5 & chromStart <= chromStart2 + 5,
    inEnd = chromEnd >= chromEnd2 - 5 & chromEnd <= chromEnd2 + 5) %>%
  filter(inStart, inEnd) %>%
  select(-inStart, -inEnd)
### or in one `mutate` command:
# merged <- inner_join(df1, df2, by = "chrom") %>%
#   mutate(inrows = (chromStart >= chromStart2 - 5 & chromStart <= chromStart2 + 5) &
#       (chromEnd >= chromEnd2 - 5 & chromEnd <= chromEnd2 + 5)) %>%
#   filter(inrows) %>%
#   select(-inrows)

Output:

#   chrom chromStart chromEnd   id chromStart2 chromEnd2
# 1     1      15433    15700 aaad       15433     15700
# 2     1       1959     2001  dfk        1961      2002
# 3     1       1959     2001  dfk        1962      1999

And a check to confirm that it exactly matches the desired result:

all.equal(merged,
data.frame(chrom = c(1,1,1),
chromStart = c(15433, 1959,1959),
chromEnd = c(15700, 2001,2001), 
id = c('aaad', 'dfk', 'dfk'),
chromStart2 = c(15433, 1961,1962),
chromEnd2 = c(15700, 2002,1999))
)
# [1] TRUE

Base R approach: merge on chrom, then subset to the rows that satisfy the same conditions:

base1 <- merge(df1, df2, by = "chrom")
# Keep rows where both coordinates lie within +/-5 of their df2 counterparts
base_merged <- base1[(base1$chromStart >= base1$chromStart2 - 5 & base1$chromStart <= base1$chromStart2 + 5) &
(base1$chromEnd >= base1$chromEnd2 - 5 & base1$chromEnd <= base1$chromEnd2 + 5),]
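
Both approaches above first build every pairing of rows that share a chrom and only then filter, which may get large with 260,000 and 179,000 rows. As a possible alternative, here is a minimal sketch that puts the range conditions into the join itself. It assumes dplyr 1.1.0 or later (for join_by() and its between() helper); the names df2_win, startLo, startHi, endLo, endHi and merged_range are made up for this illustration. I have not benchmarked it on data of that size, so treat the memory benefit as something to verify.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, needed for join_by()

df1 <- data.frame(chrom = c(1, 1, 3, 6, 6),
                  chromStart = c(15433, 1959, 34205, 35043, 77456),
                  chromEnd = c(15700, 2001, 36245, 36245, 78469),
                  id = c('aaad', 'dfk', 'bb', 'llk', 'ie9o'))
df2 <- data.frame(chrom = c(1, 1, 5, 1, 6),
                  chromStart2 = c(15433, 1961, 34205, 1962, 77456),
                  chromEnd2 = c(15700, 2002, 36245, 1999, 78480))

# Precompute the +/-5 windows on df2 (hypothetical helper columns),
# because join_by() takes column names rather than arithmetic expressions.
df2_win <- df2 %>%
  mutate(startLo = chromStart2 - 5, startHi = chromStart2 + 5,
         endLo = chromEnd2 - 5, endHi = chromEnd2 + 5)

# Match on chrom exactly, and on both coordinates falling inside the windows.
merged_range <- inner_join(
  df1, df2_win,
  by = join_by(chrom,
               between(chromStart, startLo, startHi),
               between(chromEnd, endLo, endHi))
) %>%
  select(-c(startLo, startHi, endLo, endHi))  # drop the helper columns

merged_range
```

On the example data this should return the same three rows as the filter-based versions: one for aaad and two for dfk.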
