Snakemake - Using config file when having multiple inputs per sample - snakemake

I want to do this:
configfile: "samples.yaml"
rule all:
input:
in1="../INPUT/TEST_A_1.fa",
in2="../INPUT/TEST_A_2.fa"
output:
out1="../OUTPUT/TEST_A_1.fa",
out2="../OUTPUT/TEST_A_2.fa",
json="../OUTPUT/TEST_A.json"
shell:
"fastp --in1 {input.in1} --in2 {input.in2} --out1 {output.out1} --out2 {output.out2} --json {output.json}"
but for all the samples in samples.yaml:
samples:
TEST_A
TEST_B
Feels like it should be simple, but I am really struggling with wrapping my head around how to do it.
Edit:
I think I figured it out, thanks to this: Snakemake using multi inputs.
The code below seems to work how I intended it:
configfile: "samples.yaml"
SAMPLES = config["samples"].split()
rule all:
input:
expand("../OUT_0001_IN_0002_clean_fasta_files/{sample}_{n}.fa", sample=SAMPLES, n=["1", "2"])
rule fastp:
input:
in1="../IN_0001_raw_fasta_files/{sample}_1.fa",
in2="../IN_0001_raw_fasta_files/{sample}_2.fa"
output:
out1="../OUT_0001_IN_0002_clean_fasta_files/{sample}_1.fa",
out2="../OUT_0001_IN_0002_clean_fasta_files/{sample}_2.fa",
json="../OUT_0001_IN_0002_clean_fasta_files/{sample}.json"
shell:
"fastp --in1 {input.in1} --in2 {input.in2} --out1 {output.out1} --out2 {output.out2} --json {output.json}"

Related

Snakemake input two variables and output one variable

I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. The problem I have is the second variable S which I do not know how to specify in the code as I do not need it in my output filename.
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
S=["S1","S2"]
rule all:
input:
expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)
rule rename:
input:
fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
output:
fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
shell:
"""
mv {input.fastq1} {output.fastq1}
mv {input.fastq2} {output.fastq2}
"""
There are several problems in your code. First of all, the {dir} in your output and {dir} in your input are two different variables. Actually the {dir} in the output is a wildcard, while the {dir} in the input is a parameter for the expand function (moreover, you even forgot to call this function, and that is the second problem).
The third problem is that the shell section shall contain only a single command. You may try mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2}, but this is not an idiomatic solution. Much better would be to create a rule that produces a single file, letting Snakemake to do the rest of the work.
Finally the S value fully depend on the DIR value, so it becomes a function of {dir}, and that can be solved with a lambda in input:
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}
rule all:
input:
expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])
rule rename:
input:
lambda wildcards:
"fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
output:
"NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
shell:
"""
mv {input} {output}
"""

snakemake - Missing input files for rule all

I am trying to create a pipeline that will take a user-configured directory in config.yml (where they have downloaded a project directory of .fastq.gz files from BaseSpace), to run fastqc on sequence files. I already have the downstream steps of merging the fastqs by lane and running fastqc on the merged files.
However, the wildcards are giving me problems running fastqc on the original basespace files. The following is my error when I try running snakemake.
Missing input files for rule all:
qc/fastqc_premerge/DEX-13_S9_L001_ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b_r1_fastqc.zip
qc/fastqc_premerge/BOMB-3-2-19D_S8_L002_ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e_r1_fastqc.zip
qc/fastqc_premerge/DEX-13_S9_L002_ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59_r1_fastqc.zip
Any suggestions would be greatly appreciated. Below is minimal code to reproduce this problem.
import glob
configfile: "config.yaml"
wildcard_constraints:
bsdir = '\w+_L\d+_ds.\w+',
lanenum = '\d+'
inputdirectory=config["directory"]
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R1_001.fastq.gz")
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R2_001.fastq.gz")
##### target rules #####
rule all:
input:
#expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS) ##Changed to this from commenters suggestion, however, snakemake still wont run
rule fastqc_premerge_r1:
input:
f"{config['directory']}/{{bsdir}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
output:
html="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.html",
zip="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: ""
log:
"logs/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.log"
threads: 1
wrapper:
"v0.69.0/bio/fastqc"
Directory structure:
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R1_001.fastq.gz
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R2_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R1_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R2_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R1_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R2_001.fastq.gz
In this above case, I would like to run fastqc on all 6 input R1/R2 files, then downstream, create a merged file for DEX_13_S9 (for the two inputs to merge) and BOMB-3_2_19D (which will be a copy of the 1 input). Then create 4 fastqc reports on these resulting R1 and R2 files.
EDIT: I had to change the following to get snakemake to run
inputdirectory=config["directory"]
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R1_001.fastq.gz", followlinks=True)
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R2_001.fastq.gz", followlinks=True)
##### target rules #####
rule all:
input:
"qc/multiqc_report_premerge.html"
rule fastqc_premerge_r1:
input:
f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
output:
html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.html",
zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
params: ""
log:
"logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.log"
threads: 1
wrapper:
"v0.69.0/bio/fastqc"
rule fastqc_premerge_r2:
input:
f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R2_001.fastq.gz"
output:
html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.html",
zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
params: ""
log:
"logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.log"
threads: 1
wrapper:
"v0.69.0/bio/fastqc"
rule multiqc_pre:
input:
expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS),
expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS)
output:
"qc/multiqc_report_premerge.html"
log:
"logs/multiqc_premerge.log"
wrapper:
"0.62.0/bio/multiqc"
In your rule all you have:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
This should generate all combinations of SAMPLES, DIRECTORY, and LANENUMS. Is this what you want? I suspect not since it means that all samples are in all directories and they are on all lanes. Maybe you want the zip function to expand the list:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
It's telling you what files are missing; that's what the lines under "missing input files for rule all" are.
That being said, to answer your original question, if you do a dry run, that should tell you what the input/output files are for each planned rule you want to run (use flags -n -r) in your run command.

Define input files from csv

I would like to define input file names from different varialbles extracted from a csv. I have built the following simplified example:
I have a file test.csv:
data/samples/A.fastq
data/samples/B.fastq
I give the path to test.csv in a json config file:
{
"samples": {
"summaryFile": "somepath/test.csv"
}
}
Now I want to run bwa on each file within a rule. My feeling is that I have to use lambda wildcards but I am not sure. My Snakefile looks like this:
#only for bcf_tools
import pandas
input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table)
def returnSamples(table):
# Have tried different things here but nothing worked
return table
rule all:
input:
expand("mapped_reads/{sample}.bam", sample= samplesData)
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: returnSamples(wildcards.sample)
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
I have tried a million things including using expand (which is working but the rule is not called on each file).
Any help will be tremendously appreciated.
Snakemake works by defining which output you want (like you do in rule all). You are very close to a working solution, however there were some small things that went wrong:
Reading the pandas dataframe does not do what you expect (try printing the samplesData to see what it did/does). Therefore the expand in rule all does not work properly.
You do not need to use lambdas for the input, you can reuse the wildcard.
This should work for your example:
import pandas
import re
input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table, header=None).loc[:, 0].tolist()
samples = [re.findall("[^/]+\.", sample)[0][:-1] for sample in samplesData] # overly complicated regex
rule all:
input:
expand("mapped_reads/{sample}.bam", sample=samples)
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
However I think it would be easiest to change the description in test.csv. Now we have to do some weird magic to get the sample name from the file, it would probably be best to just store the sample names there.

Snakemake: How to use config file efficiently

I'm using the following config file format in snakemake for a some sequencing analysis practice (I have loads of samples each containing 2 fastq files:
samples:
Sample1_XY:
- fastq_files/SRR4356728_1.fastq.gz
- fastq_files/SRR4356728_2.fastq.gz
Sample2_AB:
- fastq_files/SRR6257171_1.fastq.gz
- fastq_files/SRR6257171_2.fastq.gz
I'm using the following rules at the start of my pipeline to run fastqc and for alignment of the fastqc files:
import os
# read config info into this namespace
configfile: "config.yaml"
rule all:
input:
expand("FastQC/{sample}_fastqc.zip", sample=config["samples"]),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/fastq_multiqc.html"
rule fastqc:
input:
sample=lambda wildcards: config['samples'][wildcards.sample]
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/{sample}_fastqc.html",
zip="FastQC/{sample}_fastqc.zip"
params: ""
wrapper:
"0.21.0/bio/fastqc"
rule bowtie2:
input:
sample=lambda wildcards: config['samples'][wildcards.sample]
output:
"bam_files/{sample}.bam"
log:
"logs/bowtie2/{sample}.txt"
params:
index=config["index"], # prefix of reference genome index (built with bowtie2-build),
extra=""
threads: 8
wrapper:
"0.21.0/bio/bowtie2/align"
rule multiqc_fastq:
input:
expand("FastQC/{sample}_fastqc.html", sample=config["samples"])
output:
"FastQC/fastq_multiqc.html"
params:
log:
"logs/multiqc.log"
wrapper:
"0.21.0/bio/multiqc"
My issue is with the fastqc rule.
Currently both the fastqc rule and the bowtie2 rule create one output file generated using two inputs SRRXXXXXXX_1.fastq.gz and SRRXXXXXXX_2.fastq.gz.
I need the fastq rule to generate two files, a separate one for each of the fastq.gz files but I'm unsure how to index the config file correctly from the fastqc rule input statement, or how to combine the the expand and wildcards commands to solve this. I can get an individual fastq file by adding [0] or [1] to the end of the input statement, but not both run individually/separately.
I've been messing around trying to get the correct indexing format to access each file separately. The current format is the only one I've managed that allows snakemake -np to generate a job list.
Any tips would be greatly appreciated.
It appears each sample would have two fastq files, and they are named in format ***_1.fastq.gz and ***_2.fastq.gz. In that case, config and code below would work.
config.yaml:
samples:
Sample_A: fastq_files/SRR4356728
Sample_B: fastq_files/SRR6257171
Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
input:
expand("FastQC/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/fastq_multiqc.html"
rule fastqc:
input:
sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/{sample}_{num}_fastqc.html",
zip="FastQC/{sample}_{num}_fastqc.zip"
wrapper:
"0.21.0/bio/fastqc"
rule bowtie2:
input:
sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
output:
"bam_files/{sample}.bam"
wrapper:
"0.21.0/bio/bowtie2/align"
rule multiqc_fastq:
input:
expand("FastQC/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/fastq_multiqc.html"
wrapper:
"0.21.0/bio/multiqc"

Snakemake best practice for demultiplexing

I have a question about best practices. Specifically, about the best snakemake pattern for demultiplexing reads from illumina sequencing. Our workflow needs to demultiplex multiple lanes of sequencing, and then combine these in a single analysis. Obviously, we know lane and sample names, however sample names are not the same across lanes. With only one lane, one can do something like:
SAMPLES = [...]
rule demux:
input:
reads="lanes/lanename.fastq.gz",
key="keys/lanename.txt"
output:
reads=expand("reads/{sample}.fastq.gz", sample=SAMPLES)
...
However with multiple lanes, I'm stuck wanting to use a function as an output rule. How would the following translate to something possible:
LANES = {
"lane1": ["S1", "S2"],
"lane2": ["S3", "S4"],
"lane3": ["S5", "S6"],
}
rule demux:
input:
reads="lanes/{lane}.fastq.gz",
key="keys/{lane}.txt"
output:
reads=lambda wc: expand("reads/{sample}.fastq.gz", sample=LANES[wc.lane])
...
Forgive me if this has be answered previously, or if there is some obvious approach I'm missing.
Cheers,
Kevin
I realise this is an old question, but I had the same problem, and came up with the following solution: Split the demux rule into two. The first creates a temporary directory which holds the demultiplexed reads, and then the second moves a single readset from that directory.
Here's a relatively minimal example:
LANES = {
"lane1": ["S1", "S2"],
"lane2": ["S3", "S4"],
"lane3": ["S5", "S6"],
}
SAMPLES = {}
for lane, samples in LANES.items():
for sample in samples:
SAMPLES[sample] = lane
rule all:
input:
expand("reads/{sample}.fastq.gz", sample=SAMPLES.keys())
rule get_reads:
output:
reads="lanes/{lane}.fastq.gz",
key="keys/{lane}.txt",
shell:
"touch {output}"
rule demux:
input:
reads="lanes/{lane}.fastq.gz",
key="keys/{lane}.txt",
output:
directory(temp("demux_tmp_{lane}"))
params:
f=lambda w: expand("{sample}.fastq.gz", sample=LANES[w.lane]),
shell:
"mkdir -p {output} ; cd {output} ; touch {params.f}"
rule demux_files:
input:
lambda w: f"demux_tmp_{SAMPLES[w.sample]}",
output:
"reads/{sample}.fastq.gz"
shell:
"mv {input}/{wildcards.sample}.fastq.gz {output}"