我有一个GRanges
对象,每个seqnames
具有不同的基因组范围(例如染色体(
如何获得仅包含每个序列名/染色体的最长范围的GRanges
?
例如,如果gr
是GRanges
:
library(GenomicRanges)
# Make a GRanges object
set.seed(123)
gr <- GRanges(seqnames = rep(c("chr1", "chr2", "chr3"), times=2:4),
ranges = IRanges(start=sample.int(10000, 9),
width = c(3,5,50,20,10,500,100,500,200)))
# Add a column with the width for clarity:
mcols(gr)$width <- width(gr)
gr
#GRanges object with 9 ranges and 1 metadata column:
# seqnames ranges strand | width
# <Rle> <IRanges> <Rle> | <integer>
# [1] chr1 2463-2465 * | 3
# [2] chr1 2511-2515 * | 5
# [3] chr2 8718-8767 * | 50
# [4] chr2 2986-3005 * | 20
# [5] chr2 1842-1851 * | 10
# [6] chr3 9334-9833 * | 500
# [7] chr3 3371-3470 * | 100
# [8] chr3 4761-5260 * | 500
# [9] chr3 6746-6945 * | 200
# -------
# seqinfo: 3 sequences from an unspecified genome; no seqlengths
然后我想获得以下GRanges
:
#GRanges object with 3 ranges and 1 metadata column:
# seqnames ranges strand | width
# <Rle> <IRanges> <Rle> | <integer>
# [1] chr1 2511-2515 * | 5
# [2] chr2 8718-8767 * | 50
# [3] chr3 9334-9833 * | 500
对于我的应用程序,我可以只获得chr3
的第一个最长范围,但我希望有一个也可以选择所有领带(如果有的话(的解决方案。
我找到了这个解决方案:
library(GenomicRanges)
# Make the GRanges object
set.seed(123)
gr <- GRanges(seqnames = rep(c("chr1", "chr2", "chr3"), times=2:4),
ranges = IRanges(start=sample.int(10000, 9),
width = c(3,5,50,20,10,500,100,500,200)))
# Add a column with the width for clarity:
mcols(gr)$width <- width(gr)
# split gr by seqnames
grl <- split(gr, seqnames(gr))
# Get the width of each range organized as an IntegerList
wgr <- width(grl)
# if gr is ordered by seqnames this is equivalent to:
wgr <- splitAsList(width(gr), seqnames(gr))
wgr
#IntegerList of length 3
#[["chr1"]] 3 5
#[["chr2"]] 50 20 10
#[["chr3"]] 500 100 500 200
# Get the ranges with the largest width
# See ?'which.max,NumericList-method' for the global argument
longestRanges <- gr[which.max(wgr, global = TRUE)]
longestRanges
#GRanges object with 3 ranges and 1 metadata column:
# seqnames ranges strand | width
# <Rle> <IRanges> <Rle> | <integer>
# [1] chr1 2511-2515 * | 5
# [2] chr2 8718-8767 * | 50
# [3] chr3 9334-9833 * | 500
# -------
# seqinfo: 3 sequences from an unspecified genome; no seqlengths
请注意,此解决方案仅在gr
中的范围最初由seqnames
分组/排序时有效
如果不是,则需要先用gr <- sort(gr)
对gr
进行排序
要获得所有的平局,可以使用:
longestRanges <- unlist(grl[which(wgr == max(wgr))])
longestRanges
#GRanges object with 4 ranges and 1 metadata column:
# seqnames ranges strand | width
# <Rle> <IRanges> <Rle> | <integer>
# chr1 chr1 2511-2515 * | 5
# chr2 chr2 8718-8767 * | 50
# chr3 chr3 9334-9833 * | 500
# chr3 chr3 4761-5260 * | 500
# -------
# seqinfo: 3 sequences from an unspecified genome; no seqlengths