Populate a new column when a value falls between two columns of another data frame in R



I have two data frames of different lengths, for example: df1

   locusnum CHR     MinBP     MaxBP
1:        1   1  13982248  14126651
2:        2   1  21538708  21560253
3:        3   1  28892760  28992798
4:        4   1  43760070  43927877
5:        5   1 149999059 150971195
6:        6   1 200299701 200441048

df2

      position chr
27751 13982716   1
27750 13982728   1
10256 13984208   1
27729 13985591   1
27730 13988076   1
27731 13988403   1

Both data frames have other columns as well. df2 has 60,000 rows and df1 has 64 rows.

I want to fill a new column in df2 with locusnum from df1. The condition is df2$chr == df1$CHR & df2$position %in% df1$MinBP:df1$MaxBP

My expected output would be

      position chr locusnum
27751 13982716   1        1
27750 13982728   1        1
10256 13984208   1        1
27729 13985591   1        1
27730 13988076   1        1
27731 13988403   1        1

So far I have tried ifelse statements and a for loop, as follows:

if (df2$chr == df1$CHR & df2$position >= df1$MinBP & df2$position <= df1$MaxBP) df2$locusnum=df1$locusnum

for(i in 1:length(df2$position)){        #runs the following code for each line
  if(df2$chr[i] == df1$CHR & df2$position[i] %in% df1$MinBP:df1$MaxBP){              #if logical TRUE then it runs the next line
    df2$locusnum[i] <- df1$locusnum    #gives value of another column to a new column
  }
}

but got these errors:

the condition has length > 1
longer object length is not a multiple of shorter object length

Any help? Have I explained the problem clearly?

Use foverlaps(...) from the data.table package.

Your example is not very interesting, because every row corresponds to locusnum = 1, so I changed df2 a bit to demonstrate how this works.

##
#  df1 is as you provided it
#  in df2: note changes to position column in row 2, 3, and 6
#
df2 <- read.table(text="
id    position  chr
27751 13982716    1
27750 21538718    1
10256 43760080    1
27729 13985591    1
27730 13988076    1
27731 200299711   1", header=TRUE)
##
#   you start here
#
library(data.table)
setDT(df1)
setDT(df2)
df2[, c('indx', 'start', 'end'):=.(seq(.N), position, position)]
setkey(df1, CHR, MinBP, MaxBP)
setkey(df2, chr, start, end)
result <- foverlaps(df2, df1)[order(indx), .(id, position, chr, locusnum)]
##       id  position chr locusnum
## 1: 27751  13982716   1        1
## 2: 27750  21538718   1        2
## 3: 10256  43760080   1        4
## 4: 27729  13985591   1        1
## 5: 27730  13988076   1        1
## 6: 27731 200299711   1        6
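
foverlaps(...) works best if both data.tables are keyed, but keying changes the row order in df2. So I added an index column (indx) to restore the original order, then left it out of the final result.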
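
If restoring the row order is a concern, a non-equi update join may be worth trying instead. This is a different technique from foverlaps(...), not part of the approach above, and it needs a data.table version that supports non-equi joins in `on=`. The sketch below rebuilds df1 and the modified df2 from the rows shown in this thread, and assumes the intervals in df1 do not overlap. If they did, a position could match more than one locus and only one of them would be kept.

```r
library(data.table)

# df1 rebuilt from the six rows shown in the question
df1 <- data.table(
  locusnum = 1:6,
  CHR      = 1L,
  MinBP    = c(13982248, 21538708, 28892760, 43760070, 149999059, 200299701),
  MaxBP    = c(14126651, 21560253, 28992798, 43927877, 150971195, 200441048)
)

# df2 as modified in this answer
df2 <- data.table(
  id       = c(27751, 27750, 10256, 27729, 27730, 27731),
  position = c(13982716, 21538718, 43760080, 13985591, 13988076, 200299711),
  chr      = 1L
)

# Non-equi update join: for each df1 interval, find the df2 rows on the same
# chromosome whose position lies inside [MinBP, MaxBP], and write that
# interval's locusnum into those rows by reference.
df2[df1, on = .(chr = CHR, position >= MinBP, position <= MaxBP),
    locusnum := i.locusnum]

# df2 keeps its original row order; rows with no matching interval get NA
print(df2)
```

Because `:=` updates df2 in place, no keys are set and no helper columns (indx, start, end) are needed.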

This should be very fast, but 60,000 rows is a small data set, so you may not notice the difference.
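
For comparison, here is a sketch of a working base R loop along the lines of the one in the question. The original attempt failed because if() needs a single TRUE/FALSE, while df2$chr[i] == df1$CHR yields one value per row of df1, and df1$MinBP:df1$MaxBP only uses the first element of each column. The data frames below are rebuilt from the rows shown in this thread. Taking the first matching interval is an assumption that matters only if intervals overlap.

```r
# Plain data frames rebuilt from the rows shown in this thread
df1 <- data.frame(
  locusnum = 1:6,
  CHR      = 1,
  MinBP    = c(13982248, 21538708, 28892760, 43760070, 149999059, 200299701),
  MaxBP    = c(14126651, 21560253, 28992798, 43927877, 150971195, 200441048)
)
df2 <- data.frame(
  id       = c(27751, 27750, 10256, 27729, 27730, 27731),
  position = c(13982716, 21538718, 43760080, 13985591, 13988076, 200299711),
  chr      = 1
)

df2$locusnum <- NA_integer_
for (i in seq_len(nrow(df2))) {
  # Compare ONE row of df2 against ALL intervals in df1 (logical vector)
  hit <- df1$CHR == df2$chr[i] &
    df1$MinBP <= df2$position[i] &
    df1$MaxBP >= df2$position[i]
  # which() gives the matching df1 rows, so if() tests a single value
  idx <- which(hit)
  if (length(idx) > 0) {
    df2$locusnum[i] <- df1$locusnum[idx[1]]  # first match (assumption)
  }
}
print(df2)
```

This does one pass over df1 for every row of df2, which is fine at 60,000 by 64 but scales worse than an interval join.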
