I have two data frames of different lengths, for example:
df1
locusnum CHR MinBP MaxBP
1: 1 1 13982248 14126651
2: 2 1 21538708 21560253
3: 3 1 28892760 28992798
4: 4 1 43760070 43927877
5: 5 1 149999059 150971195
6: 6 1 200299701 200441048
df2
position chr
27751 13982716 1
27750 13982728 1
10256 13984208 1
27729 13985591 1
27730 13988076 1
27731 13988403 1
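Both data frames have other columns as well. df2 has 60,000 rows and df1 has 64 rows.
I want to fill a new column in df2 with the locusnum values from df1. The condition is df2$chr == df1$CHR & df2$position %in% df1$MinBP:df1$MaxBP
My expected output would be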
position chr locusnum
27751 13982716 1 1
27750 13982728 1 1
10256 13984208 1 1
27729 13985591 1 1
27730 13988076 1 1
27731 13988403 1 1
So far I have tried an ifelse statement and a for loop, as follows:
if (df2$chr == df1$CHR & df2$position >= df1$MinBP & df2$position <= df1$MaxBP) df2$locusnum=df1$locusnum
and
for(i in 1:length(df2$position)){ #runs the following code for each line
if(df2$chr[i] == df1$CHR & df2$position[i] %in% df1$MinBP:df1$MaxBP){ #if logical TRUE then it runs the next line
df2$locusnum[i] <- df1$locusnum #gives value of another column to a new column
}}
but I got these errors:
the condition has length > 1
longer object length is not a multiple of shorter object length
Any help? Have I explained the problem clearly?
Use foverlaps(...) from the data.table package.
Your example is not very informative, because every row maps to locusnum = 1, so I changed df2 a little to demonstrate how this works.
##
# df1 is as you provided it
# in df2: note changes to position column in row 2, 3, and 6
#
df2 <- read.table(text="
id position chr
27751 13982716 1
27750 21538718 1
10256 43760080 1
27729 13985591 1
27730 13988076 1
27731 200299711 1", header=TRUE)
##
# you start here
#
library(data.table)
setDT(df1)
setDT(df2)
df2[, c('indx', 'start', 'end'):=.(seq(.N), position, position)]
setkey(df1, CHR, MinBP, MaxBP)
setkey(df2, chr, start, end)
result <- foverlaps(df2, df1)[order(indx), .(id, position, chr, locusnum)]
## id position chr locusnum
## 1: 27751 13982716 1 1
## 2: 27750 21538718 1 2
## 3: 10256 43760080 1 4
## 4: 27729 13985591 1 1
## 5: 27730 13988076 1 1
## 6: 27731 200299711 1 6
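foverlaps(...) works best when both data.tables are keyed, but keying changes the row order of df2, so I added an index column (indx) to restore the original order and then dropped it at the end.
This should be very fast, but 60,000 rows is a small dataset, so you may not notice the difference.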
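If reordering df2 is a concern, a non-equi update join is one possible alternative to foverlaps(...). This is a different technique from the one above, offered only as a minimal sketch. It assumes a data.table version that supports non-equi joins in on= (1.9.8 or later, as far as I know), and it rebuilds df1 from just the four columns printed in the question and df2 from the modified example here.

```r
library(data.table)

# df1 rebuilt from the rows printed in the question (assumption: only these columns)
df1 <- data.table(
  locusnum = 1:6,
  CHR      = 1L,
  MinBP    = c(13982248L, 21538708L, 28892760L, 43760070L, 149999059L, 200299701L),
  MaxBP    = c(14126651L, 21560253L, 28992798L, 43927877L, 150971195L, 200441048L)
)

# the modified df2 used in this answer
df2 <- data.table(
  id       = c(27751L, 27750L, 10256L, 27729L, 27730L, 27731L),
  position = c(13982716L, 21538718L, 43760080L, 13985591L, 13988076L, 200299711L),
  chr      = 1L
)

# Update join: for each row of df1, find the rows of df2 on the same chromosome
# whose position lies within [MinBP, MaxBP], and write that locus number into df2.
# df2 is modified by reference, so its row order is left untouched.
# Rows of df2 that fall in no interval get NA.
df2[df1,
    on = .(chr == CHR, position >= MinBP, position <= MaxBP),
    locusnum := i.locusnum]

print(df2)
```

No keys or helper columns are needed here. If the intervals in df1 overlapped, a position could match more than one locus and only one of the matches would be kept, so foverlaps(...) is probably the safer choice in that case because it returns every match.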