我在snakemaker中编写了这个管道来处理我的fastq文件并获得原始计数,但由于某种原因,我不理解最后一条规则(featureaccounts),我得到了这个错误:
/mnt/c/Users/manso/Desktop/hel/pe.py第175行出现WildcardError:输入文件中的通配符无法从输出文件中确定:'sample'
其他规则使用与featurets规则相同的输入,所以我不明白为什么它为特定规则返回此错误。
我真的很感激你的帮助。
这是我的蛇文件:
(SAMPLE,FRR) = glob_wildcards("rawReads/{sample}_{frr}.fastq.gz")
rule all:
input:
#raw_FASTQC
expand("rawQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
#raw_MultiQC
"rawQC/multiqc_report.html",
#FASTP
expand("trimmedReads/{sample}_1.fastq.gz", sample=SAMPLE),
expand("trimmedReads/{sample}_2.fastq.gz", sample=SAMPLE),
expand("trimmedReads/{sample}_fastp_report.html", sample=SAMPLE),
#trimmed_FASTQC
expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.html", sample=SAMPLE, frr=FRR),
expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
#trimmed_MultiQC
"trimmedQC/multiqc_report.html",
#get fa and gtf files
"genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa",
"genome/Homo_sapiens.GRCh38.106.gtf.gz",
#HISAT2_index
["index." + str(i) + ".ht2" for i in range(1,9)],
#HISAT_align
expand("aligned/{sample}.bam", sample=SAMPLE),
#samtools
expand("aligned/{sample}.sorted.bam", sample=SAMPLE),
expand("samtools_stats/{sample}.stats.txt", sample=SAMPLE),
expand("samtools_stats/{sample}.flagstat.txt", sample=SAMPLE),
#rawCounts
"raw_Counts"
rule raw_FASTQC:
input:
"rawReads/{sample}_{frr}.fastq.gz",
output:
html="rawQC/fastqc/{sample}_{frr}_fastqc.html",
zip= "rawQC/fastqc/{sample}_{frr}_fastqc.zip", # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: "--quiet"
log:
"logs/fastqc/{sample}_{frr}.log"
threads: 16
wrapper:
"v1.7.0/bio/fastqc"
rule raw_MultiQC:
input:
expand("rawQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
params:
path="rawQC/fastqc"
output:
"rawQC/multiqc_report.html"
shell:
"multiqc --force -n {output} {params.path}"
rule FASTP:
input:
read1="rawReads/{sample}_1.fastq.gz",
read2="rawReads/{sample}_2.fastq.gz",
output:
trimmed1="trimmedReads/{sample}_1.fastq.gz",
trimmed2="trimmedReads/{sample}_2.fastq.gz",
report_html= "trimmedReads/{sample}_fastp_report.html",
threads: 16
shell:
" fastp --thread {threads} -i {input.read1} -I {input.read2} -o {output.trimmed1} -O {output.trimmed2} -h {output.report_html} "
rule trimmed_FASTQC:
input:
"trimmedReads/{sample}_{frr}.fastq.gz"
output:
html="trimmedQC/fastqc/{sample}_{frr}_fastqc.html",
zip="trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: "--quiet"
log:
"logs/fastqc/{sample}_{frr}.log"
threads: 16
wrapper:
"v1.7.0/bio/fastqc"
rule trimmed_MultiQC:
input:
expand("trimmedQC/fastqc/{sample}_{frr}_fastqc.zip", sample=SAMPLE, frr=FRR),
params:
path="trimmedQC/fastqc"
output:
"trimmedQC/multiqc_report.html"
shell:
"multiqc --force -n {output} {params.path} "
#Get annotation GTF
rule get_genome_gtf:
"Downloading Genome annotation file from Ensemble, Homo sapiens primary assembly (GRCh38)"
output:
gtf = "genome/Homo_sapiens.GRCh38.106.gtf.gz"
shell:
"cd genome"
" && wget ftp://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz"
" && gunzip -k Homo_sapiens.GRCh38.106.gtf.gz "
# Get genome fa
rule get_genome_fa:
"Downloading Genome sequence, Homo sapiens primary assembly (GRCh38)"
output:
fa = "genome/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa"
shell:
"cd genome"
" && wget ftp://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"
" && gunzip -k Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa "
rule HISAT2_index:
input:
fa = rules.get_genome_fa.output.fa
output:
["index." + str(i) + ".ht2" for i in range(1,9)],
message:
"indexing genome"
threads: 16
shell:
" hisat2-build -p {threads} {input.fa} index --quiet"
rule HISAT2_align:
input:
read1=rules.FASTP.output.trimmed1,
read2=rules.FASTP.output.trimmed2,
index=rules.HISAT2_index.output
output:
bam="aligned/{sample}.bam",
metrics="logs/{sample}_HISATmetrics.txt"
threads: 16
shell:
" hisat2 --threads {threads} -x index -1 {input.read1} -2 {input.read2} 2> {output.metrics}"
" | samtools view -Sbh -o {output.bam} "
rule samtools_sort:
input:
aligned=rules.HISAT2_align.output.bam
#"aligned/{sample}.bam"
output:
"aligned/{sample}.sorted.bam"
threads: 8
shell:
"samtools sort {input.aligned} -o {output}"
rule samtools_stats:
input:
"aligned/{sample}.sorted.bam",
output:
"samtools_stats/{sample}.stats.txt",
shell:
"samtools stats {input} > {output} "
rule samtools_flagstat:
input:
"aligned/{sample}.sorted.bam",
output:
"samtools_stats/{sample}.flagstat.txt",
shell:
"samtools flagstat {input} > {output} "
rule featureCounts:
input:
samples="aligned/{sample}.sorted.bam",
gtf=rules.get_genome_gtf.output.gtf
output:
"raw_Counts"
threads:
16
shell:
"featureCounts -T {threads} -a {input.gtf} -o {output} {input.samples}"
´´´
Snakemake使用输出中的模式来推断要使用哪些输入。在最后一条规则中,输出是raw_Counts
,它没有指示{sample}
通配符使用什么。将其更改为以下内容可能适用于您的用例:
rule featureCounts:
input:
samples="aligned/{sample}.sorted.bam",
gtf=rules.get_genome_gtf.output.gtf
output:
"raw_Counts_{sample}.txt"
这需要将扩展版本添加到规则all
:
# add this target to rule all
expand("raw_Counts_{sample}.txt", sample=SAMPLE),
编辑:如果这个规则是一个聚合,那么在输入指令中,你需要通过替换所有的值来删除通配符搜索。
rule featureCounts:
input:
samples=expand("aligned/{sample}.sorted.bam", sample=SAMPLE),
gtf=rules.get_genome_gtf.output.gtf
output:
"raw_Counts"
编辑2:注意glob_wildcards
不返回每个通配符的唯一值,而是返回与每个全局文件相关联的通配符。如果您想要唯一的值,那么实现这一目标的一个简单方法是将SAMPLE
转换为一个集合(专门针对此规则)。
rule featureCounts:
input:
samples=expand("aligned/{sample}.sorted.bam", sample=set(SAMPLE)),
gtf=rules.get_genome_gtf.output.gtf
output:
"raw_Counts"