当不需要某些通配符组合(缺少输入文件)时,如何使用 snakemake 中的 expand,并具有"merge"规则?



这是一个比这里报告的情况稍微复杂一些的情况。我的输入文件如下:

ont_Hv1_2.5+.fastq                                                                                                                                                                              
ont_Hv2_2.5+.fastq                                                                                                                                                                              
pacBio_Hv1_1-1.5.fastq                                                                                                                                                                          
pacBio_Hv1_1.5-2.5.fastq                                                                                                                                                                        
pacBio_Hv1_2.5+.fastq                                                                                                                                                                           
pacBio_Hv2_1-1.5.fastq                                                                                                                                                                          
pacBio_Hv2_1.5-2.5.fastq
pacBio_Hv2_2.5+.fastq
pacBio_Mv1_1-1.5.fastq
pacBio_Mv1_1.5-2.5.fastq
pacBio_Mv1_2.5+.fastq

我只想处理现有的输入文件,自动跳过那些与不存在的输入文件相对应的通配符组合。

我的蛇文件看起来像这样:

import glob
import os.path
from itertools import product
#make wildcards regexps non-greedy:
wildcard_constraints:
    capDesign = "[^_/]+",
    sizeFrac = "[^_/]+",
    techname = "[^_/]+",
# get TECHNAMES (sequencing technology, i.e. 'ont' or 'pacBio'), CAPDESIGNS (capture designs, i.e. Hv1, Hv2, Mv1) and SIZEFRACS (size fractions) variables from input FASTQ file names:
(TECHNAMES, CAPDESIGNS, SIZEFRACS) = glob_wildcards("{techname}_{capDesign}_{sizeFrac}.fastq")
# make lists non-redundant:
CAPDESIGNS=set(CAPDESIGNS)
SIZEFRACS=set(SIZEFRACS)
TECHNAMES=set(TECHNAMES)
# make list of authorized wildcard combinations (based on presence of input files)
AUTHORIZEDCOMBINATIONS = []
for comb in product(TECHNAMES,CAPDESIGNS,SIZEFRACS):
    if(os.path.isfile(comb[0] + "_" + comb[1] + "_" + comb[2] + ".fastq")):
        tup=(("techname", comb[0]),("capDesign", comb[1]),("sizeFrac", comb[2]))
        AUTHORIZEDCOMBINATIONS.append(tup)
# Function to create filtered combinations of wildcards, based on the presence of input files.
# Inspired by:
# https://stackoverflow.com/questions/41185567/how-to-use-expand-in-snakemake-when-some-particular-combinations-of-wildcards-ar
def filter_combinator(whitelist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in product(*args, **kwargs):
            for ac in AUTHORIZEDCOMBINATIONS:
                if(wc_comb[0:3] == ac):
                    print ("SUCCESS")
                    yield(wc_comb)
                    break
    return filtered_combinator
filtered_product = filter_combinator(AUTHORIZEDCOMBINATIONS)
rule all:
    input:
        expand("{techname}_{capDesign}_all.readlength.tsv", filtered_product, techname=TECHNAMES, capDesign=CAPDESIGNS, sizeFrac=SIZEFRACS)
#get read lengths for all FASTQ files:
rule getReadLength:
    input: "{techname}_{capDesign}_{sizeFrac}.fastq"
    output: "{techname}_{capDesign}_{sizeFrac}.readlength.tsv"
    shell: "fastq2tsv.pl {input} | awk -v s={wildcards.sizeFrac} '{{print s"\t"length($2)}}' > {output}" #fastq2tsv.pl converts each FASTQ record into a tab-separated line, with the sequence in second field
#combine read length data over all sizeFracs of a given techname/capDesign combo:
rule aggReadLength:
    input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)
    output: "{techname}_{capDesign}_all.readlength.tsv"
    shell: "cat {input} > {output}"

规则 getReadLength 收集每个输入 FASTQ 的读取长度(即每个techname, capDesign, sizeFrac组合(。

规则 aggReadLength 合并由 getReadLength 生成的读取长度统计信息,对于每个techname, capDesign组合。

工作流失败,并显示以下消息:

Missing input files for rule getReadLength:
ont_Hv1_1-1.5.fastq

因此,应用于目标的通配符组合过滤步骤似乎并未传播到它所依赖的所有上游规则。有人知道如何做到这一点吗?

(使用 Snakemake 版本 4.4.0。

提前非常感谢

我想

我解决了这个问题,希望这对其他人有用。

aggReadlength规则中,我替换了

input: expand("{{techname}}_{{capDesign}}_{sizeFrac}.readlength.tsv", sizeFrac=SIZEFRACS)

input: lambda wildcards: expand("{techname}_{capDesign}_{sizeFrac}.readlength.tsv", filtered_product, techname=wildcards.techname, capDesign=wildcards.capDesign, sizeFrac=SIZEFRACS)

最新更新