使用snakemake进行映射读取



我正在尝试使用snakemake运行hisat2映射。基本上,我使用的是这样的config.yaml文件:

reads:
set1: /path/to/set1/samplelist.tab
hisat2:
database: genome
genome: genome.fa
nodes: 2
memory: 8G
arguments: --dta
executables:
hisat2: /Tools/hisat2-2.1.0/hisat2
samtools: /Tools/samtools-1.3/samtools

然后Snakefile:

configfile: "config.yaml"
workdir: "/path/to/working_dir/"
# Hisat2
rule hisat2:
input:
reads = lambda wildcards: config["reads"][wildcards.sample]
output:
bam = "{sample}/{sample}.bam"
params:
idx=config["hisat2"]["database"],
executable = config["executables"]["hisat2"],
nodes = config["hisat2"]["nodes"],
memory = config["hisat2"]["memory"],
executable2 = config["executables"]["samtools"]
run:
shell("{params.executable} --dta -p {params.nodes} -x {params.idx} {input.reads} |"
"{params.executable2} view -Sbh -o {output.bam} -")
# all
rule all:
input:
lambda wildcards: [sample + "/" + sample + ".bam"
for sample in config["reads"].keys()]

我的samplelist.tab是这样的:

id  reads1  reads2
set1a set1a_R1.fastq.gz set1a_R2.fastq.gz
set1b set1b_R1.fastq.gz set1b_R2.fastq.gz

有什么提示可以让它发挥作用吗?我不喜欢一个乱七八糟的剧本,刚开始用蛇制作。

您必须执行以下操作:

import pandas as pd
reads = pd.read_csv(config["reads"]['set1'], sep='t', index_col=0)
def get_fastq(wildcards):
return list(reads.loc[wildcards.sample].values)
rule hisat2:
input:
get_fastq
...

首先,您需要加载samplelist并存储它(我是作为pandas数据帧来完成的(。然后,您可以查找哪些文件属于该示例名称。

编辑

把代码重写成这样可读性更强(在我看来(。

rule hisat2:
input:
[{sample}_R1.fastq.gz,
{sample}_R2.fastq.gz]
...

最新更新