r-Sqldf非常慢,如何将此sql代码更改为data.table



我已经在r中运行了r脚本,我已经使用sqldf运行了大约一天,现在我正在尝试使用data.table包,但我仍然不知道如何更改所有这些脚本以使用data.table

> dput(df[1:4, ])
structure(list(chr = c("chr1", "chr1", "chr1", "chr1"), cpg = c(4222, 
4234, 4235, 4313), count_c = c(0L, 0L, 0L, 2L), total_coverage = c(8L, 
6L, 8L, 8L)), row.names = 12:15, class = "data.frame")
> dput(annotation_with_total_cpgs[1:4, ])
structure(list(gene_id = c("PSOL00004", "PSOL00004", "PSOL00004-TA", 
"PSOL00004-TA"), chr = c("chr5", "chr5", "chr5", "chr5"), start = c(9914646L, 
9914646L, 9914646L, 9914646L), end = c(9917882L, 9917882L, 9914818L, 
9914818L), feature = c("gene", "mRNA", "CDS", "exon"), cpg_count = c(101L, 
101L, 11L, 11L)), row.names = c(NA, 4L), class = "data.frame")
df = read.table("final_coverage.txt", header = T) annotation_with_total_cpgs <- read_table("total_cpgs.txt") 
output <- sqldf("
SELECT sample.chr, sample.cpg, sample.count_c, 
sample.total_coverage, annot.chr, annot.start, 
annot.end, annot.gene_id, annot.cpg_count, annot.feature 
FROM df AS sample 
LEFT JOIN annotation_with_total_cpgs AS annot 
ON sample.chr = annot.chr 
AND (sample.cpg >= annot.start AND sample.cpg <= annot.end)
") 
output <- output[!is.na(output$gene_id),]

如果没有样本数据,这只是猜测,但。。。

library(data.table)
setDT(df)
setDT(annotation_with_total_cpgs)
annotation_with_total_cpgs[, c("start0", "end0") := .(start, end)
][df[,cpg0 := cpg], on = .(chr, start0 <= cpg0, end0 >= cpg0)
][, .(chr, cpg, count_c, total_coverage, start, end, gene_id, cpg_count, feature)]

我创建start0end0cpg0的原因是,data.table-联接往往会覆盖重叠/非等联接中使用的联接变量,而且我永远不确定哪个是哪个,所以我明确地保留了所需的值。如果我们没有显式地用.(chr, cpg, ...)对列进行子集设置,那么您将同时看到startstart0,并且可能希望删除我创建的*0变量。

查看data.table包中的foverlaps函数

相关内容

最新更新