samtools calmd is pretty slow



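I am using "samtools calmd" to add the MD tag back to a BAM file. The original BAM is about 50 GB (whole-genome sequencing with PacBio HiFi reads). The problem is that calmd is very slow! The job has been running for 12 hours and has only produced a 600 MB BAM with MD tags. At this rate, the 50 GB BAM would take about 30 days to finish!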

Here is the code I use to add the MD tag (nothing unusual):

rule addMDTag:
    input:
        rules.pbmm2_alignment.output
    output:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Tmp/rawReads{readsIndex}.MD.bam"
    params:
        ref = strRef
    threads:
        16
    log:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Log/rawReads{readsIndex}.MD.log"
    benchmark:
        strBAMDir + "/pbmm2/v37/{wcReadsType}/Benchmark/rawReads{readsIndex}.MD.benchmark.txt"
    shell:
        "samtools calmd -@ {threads} {input} {params.ref} -bAr > {output}"

The samtools version I am using is v1.10.

By the way, I gave calmd 16 cores, but it looks like samtools is still using only one core:

top - 11:44:53 up 47 days, 20:35,  1 user,  load average: 2.00, 2.01, 2.00
Tasks: 1723 total,   3 running, 1720 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.8%us,  0.3%sy,  0.0%ni, 96.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  529329180k total, 232414724k used, 296914456k free,    84016k buffers
Swap: 12582908k total,    74884k used, 12508024k free, 227912476k cached
PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                       
93137 lix33     20   0  954m 151m 2180 R 100.2  0.0 659:04.13 samtools 

How can I make calmd faster? Or is there another tool that does the same job more efficiently?

Many thanks.

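After working with the samtools maintainers, this problem has been solved. If the BAM is not position-sorted, calmd is very slow. So always make sure the BAM has been sorted by position before running calmd.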

See the details from the maintainers below:

Are your files name sorted, and does your reference have more than one entry? 
If so calmd will be switching between references all the time, 
which means it may be doing a lot of reference loading and not much MD calculation.
You may find it goes a lot faster if you position-sort the input, and then run it through calmd.
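
As a rough illustration of that advice, here is a minimal Python sketch that (1) reads the sort order from the @HD line of a SAM header, which you can get with `samtools view -H`, and (2) builds the two command lines for sorting by position and then running calmd. The function names and the file names are made up for this example and are not part of samtools or of my pipeline; the calmd flags are simply the ones from the rule above.

```python
import shlex


def sort_order_from_header(header_text):
    """Return the SO value from the @HD header line, or None if absent.

    Per the SAM specification, SO is one of: unknown, unsorted,
    queryname, coordinate.
    """
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[len("SO:"):]
    return None


def sort_then_calmd_commands(in_bam, ref_fasta, sorted_bam, out_bam, threads=16):
    """Build shell commands: position sort first, then calmd on the result.

    All paths are hypothetical placeholders supplied by the caller.
    """
    # samtools sort orders by coordinate (position) by default.
    sort_cmd = shlex.join(
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, in_bam]
    )
    # Same calmd options as in the original rule, applied to the sorted BAM.
    calmd_cmd = shlex.join(
        ["samtools", "calmd", "-@", str(threads), "-bAr", sorted_bam, ref_fasta]
    ) + " > " + shlex.quote(out_bam)
    return sort_cmd, calmd_cmd


if __name__ == "__main__":
    # Example header text, as printed by `samtools view -H some.bam`.
    header = "@HD\tVN:1.6\tSO:queryname\n@SQ\tSN:1\tLN:249250621\n"
    if sort_order_from_header(header) != "coordinate":
        for cmd in sort_then_calmd_commands(
            "rawReads.bam", "ref.fa", "rawReads.sorted.bam", "rawReads.MD.bam"
        ):
            print(cmd)
```

In a Snakemake workflow the same idea would probably become a separate sort rule placed before addMDTag, so that calmd only ever sees position-sorted input.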
