r - How to remove overlapping sequences from two data tables



I have two data tables that give sequence coordinates on different chromosomes (categories). For example:

library(data.table)
dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
start = c(1, 50, 110, 150, 110),
end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
start = c(12, 60, 50),
end = c(20, 115, 80))

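I need to create a third data.table that gives the coordinates of sequences covering every integer in dt1 that does not overlap any integer in the sequences of dt2. For example: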

dt3 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
start = c(1, 50, 116, 150, 110),
end = c(11, 59, 121, 200, 200))

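The data.tables I need to run this on are very large, so I need to maximize performance. I tried using the foverlaps() function, but without success. Any help would be greatly appreciated!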
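As a reference point for the requirement stated above, here is a brute-force sketch that expands every interval into its integers, removes the dt2 positions, and collapses the remainder back into runs. It is far too slow and memory-hungry for large tables and is only meant for checking faster solutions on small inputs. The helper `expand` and the column names `pos` and `run` are illustrative assumptions. Note that it would also merge dt1 intervals that touch each other, which does not occur in this example.

```r
library(data.table)

dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 110, 150, 110),
                  end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
                  start = c(12, 60, 50),
                  end = c(20, 115, 80))

# Hypothetical helper: one row per integer position covered by an interval
expand <- function(dt) {
  dt[, .(pos = unlist(Map(seq, start, end))), by = chromosome]
}

p1 <- unique(expand(dt1))
p2 <- unique(expand(dt2))

# Anti-join: keep dt1 positions that do not occur in dt2
left <- p1[!p2, on = .(chromosome, pos)]
setorder(left, chromosome, pos)

# A new run starts wherever consecutive positions are not adjacent
left[, run := cumsum(c(TRUE, diff(pos) != 1)), by = chromosome]
ref <- left[, .(start = min(pos), end = max(pos)), by = .(chromosome, run)]
ref[, run := NULL]
print(ref)
```

For the example data this reproduces the five rows of dt3.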

You can start with foverlaps:

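# foverlaps() requires the second table to be keyed on the join and interval columns
setkey(dt2, chromosome, start, end)
# In the result, start/end come from dt2 and i.start/i.end from dt1;
# start/end are NA when a dt1 interval overlaps nothing in dt2
ds = foverlaps(dt1, dt2, type = "any")
ds[, .(chromosome,
       # no overlap, or dt1 starts at or before dt2: keep dt1's start;
       # otherwise dt1 extends past dt2's end: start just after it
       start = fcase(is.na(start) | i.start <= start, i.start,
                     i.end >= end, end + 1),
       # no overlap, or dt1 ends at or after dt2's end: keep dt1's end;
       # otherwise dt1 starts at or before dt2: stop just before dt2's start
       end = fcase(is.na(end) | i.end >= end, i.end,
                   i.start <= start, start - 1)
)]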
#   chromosome start   end
#       <char> <num> <num>
#1:          1     1    11
#2:          1    50    59
#3:          1   116   121
#4:          1   150   200
#5:          X   110   200
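The foverlaps() code above trims each dt1 interval against a single overlapping dt2 interval at one edge. If a dt1 interval can overlap several dt2 intervals, or a dt2 interval can lie strictly inside a dt1 interval and split it, something more general seems to be needed. The following is a minimal sketch of one way that might be done, not a tuned implementation. The function name `subtract_intervals` and the helper column `id` are hypothetical, and it assumes closed integer intervals with start <= end.

```r
library(data.table)

# Hypothetical helper: subtract the intervals in y from those in x, per chromosome
subtract_intervals <- function(x, y) {
  x <- copy(x)[, id := .I]              # remember which x row each piece comes from
  y <- copy(y)
  setkey(y, chromosome, start, end)
  ov <- foverlaps(x, y, type = "any")   # start/end from y, i.start/i.end from x
  setorder(ov, id, start, end)
  res <- ov[, {
    if (is.na(start[1L])) {
      # no overlap at all: keep the x interval unchanged
      list(start = as.numeric(i.start[1L]), end = as.numeric(i.end[1L]))
    } else {
      # candidate pieces: before the first y interval, between y intervals, after the last;
      # cummax() tracks the furthest y end seen so far, so overlapping y intervals are handled
      s <- c(i.start[1L], cummax(end) + 1)
      e <- c(start - 1, i.end[1L])
      keep <- s <= e                    # drop empty pieces
      list(start = s[keep], end = e[keep])
    }
  }, by = .(id, chromosome)]
  res[, id := NULL]
  res[]
}

dt1 <- data.table(chromosome = c("1", "1", "1", "1", "X"),
                  start = c(1, 50, 110, 150, 110),
                  end = c(11, 100, 121, 200, 200))
dt2 <- data.table(chromosome = c("1", "1", "X"),
                  start = c(12, 60, 50),
                  end = c(20, 115, 80))
res <- subtract_intervals(dt1, dt2)
print(res)

# An interval split by several (partly overlapping) intervals
x <- data.table(chromosome = "1", start = 1, end = 100)
y <- data.table(chromosome = "1", start = c(10, 15, 50), end = c(20, 30, 60))
pieces <- subtract_intervals(x, y)
print(pieces)   # pieces 1-9, 31-49 and 61-100
```

Grouping by `id` calls the j expression once per dt1 row, which may become a bottleneck on very large tables. In that case the GenomicRanges approach below is probably the safer choice.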

For the sake of completeness, there is a concise solution using Bioconductor's GenomicRanges package:

library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2))
GRanges object with 5 ranges and 0 metadata columns:
seqnames    ranges strand
<Rle> <IRanges>  <Rle>
[1]        1      1-11      *
[2]        1     50-59      *
[3]        1   116-121      *
[4]        1   150-200      *
[5]        X   110-200      *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths

If the result is required to be of class data.table:

library(data.table) # development version 1.14.3 used
library(GenomicRanges)
setdiff(makeGRangesFromDataFrame(dt1), makeGRangesFromDataFrame(dt2)) |> 
as.data.table() |>
DT(, .(chromosome = seqnames, start, end))
chromosome start   end
<fctr> <int> <int>
1:          1     1    11
2:          1    50    59
3:          1   116   121
4:          1   150   200
5:          X   110   200

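As Waldi noted, the GenomicRanges package is not available on CRAN. Waldi provided a link to the installation instructions in the BiocManager vignette. Here is the short version: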

install.packages("BiocManager")
BiocManager::install("GenomicRanges")
