Snakemake slows down drastically when handling a large number of files



I am currently writing a pipeline that generates positive RNA sequences, shuffles them, and then analyses both the positive and the shuffled (negative) sequences. For example, I want to generate 100 positive sequences and shuffle each of them 1000 times with three different algorithms. For this I use two wildcards (sample_index and pred_index), ranging from 0 to 100 and from 0 to 1000, respectively. As a final step, all files are analysed by three further tools.

Now my problem: building the DAG alone takes several hours, and the actual execution of the pipeline is even slower. When it starts, it runs one batch of 32 jobs (since I assigned 32 cores to Snakemake), but then it takes 10 to 15 minutes before the next batch is executed (presumably due to some file checks). A full run of the pipeline would take roughly two months.
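
For scale, here is a back-of-the-envelope job count for this setup (a sketch, assuming one Snakemake job per target file):

# Rough job count: 100 positive samples, each shuffled 1000 times
# by 3 tools, and everything analysed by 3 tools
n_generate     = 100                      # generatePosSample jobs
n_shuffle      = 3 * 100 * 1000           # 300,000 shuffling jobs
n_neg_analysis = 3 * n_shuffle            # 900,000 negative analyses
n_pos_analysis = 3 * 100                  # 300 positive analyses
total = n_generate + n_shuffle + n_neg_analysis + n_pos_analysis
print(total)                              # 1,200,400 jobs in the DAG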

Below is a simplified version of my Snakefile. Is there any way to optimize it so that Snakemake and its overhead are no longer the bottleneck?

ITER_POS = 100
ITER_PRED = 1000
SAMPLE_INDEX = range(0, ITER_POS)
PRED_INDEX = range(0, ITER_PRED)
SHUFFLE_TOOLS = ["1", "2", "3"]
PRED_TOOLS = ["A", "B", "C"]

rule all:
    input:
        # Expand for negative sample analysis
        expand("predictions_{pred_tool}/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt",
               pred_tool = PRED_TOOLS,
               shuffle_tool = SHUFFLE_TOOLS,
               sample_index = SAMPLE_INDEX,
               pred_index = PRED_INDEX),
        # Expand for positive sample analysis
        expand("predictions_{pred_tool}/pos_sample_{sample_index}.txt",
               pred_tool = PRED_TOOLS,
               sample_index = SAMPLE_INDEX)

# GENERATION
rule generatePosSample:
    output: "samples/pos_sample_{sample_index}.clu"
    shell:  "sequence_generation.py > {output}"

# SHUFFLING
rule shufflePosSamples1:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_1_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples2:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_2_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule shufflePosSamples3:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_3_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

# ANALYSIS
rule analysePosSamplesA:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_A/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"

rule analysePosSamplesB:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_B/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_B.py {input} > {output}"

rule analysePosSamplesC:
    input:  "samples/pos_sample_{sample_index}.clu"
    output: "predictions_C/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_C.py {input} > {output}"

rule analyseNegSamplesA:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_A/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"

rule analyseNegSamplesB:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_B/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_B.py {input} > {output}"

rule analyseNegSamplesC:
    input:  "samples/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.clu"
    output: "predictions_C/neg_sample_{shuffle_tool}_{sample_index}_{pred_index}.txt"
    shell:  "sequence_analysis_C.py {input} > {output}"

Although I am not really handling a huge number of files and did not experience the slowdown during execution, I did see a significant slowdown in the DAG computation step.

So I would like to share my solution:

If an input refers to the output of another rule, use Snakemake's built-in support for rule dependencies and rule references instead of spelling out the file name:

### Bad example
rule bad_example_rule:
    input:
        "output_from_previous_rule.txt"
    output:
        "output.txt"
    shell:
        "touch {output[0]}"

### Solution
rule solution_example_rule:
    input:
        rules.previous_rule_name.output[0]
    output:
        "output.txt"
    shell:
        "touch {output[0]}"

I don't know why, but for me this sped up the DAG building process by at least a factor of 100.
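
Applied to the Snakefile from the question, the same pattern would look like this (a sketch; note that rules.generatePosSample must already be defined at the point where it is referenced):

rule shufflePosSamples1:
    input:  rules.generatePosSample.output[0]   # resolves to "samples/pos_sample_{sample_index}.clu"
    output: "samples/neg_sample_1_{sample_index}_{pred_index}.clu"
    shell:  "sequence_shuffling.py {input} > {output}"

rule analysePosSamplesA:
    input:  rules.generatePosSample.output[0]
    output: "predictions_A/pos_sample_{sample_index}.txt"
    shell:  "sequence_analysis_A.py {input} > {output}"

The analyseNegSamples* rules use a {shuffle_tool} wildcard that spans three producing rules, so they would either keep the plain-string input or be split into one rule per shuffle tool.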
