snakemake wrapper Gatk/Mutect2 - snakemake

In the Snakemake Wrapper documentation
rule mutect2_bam:
input:
fasta="genome/genome.fasta",
map="mapped/{sample}.bam",
output:
vcf="variant_bam/{sample}.vcf",
bam="variant_bam/{sample}.bam",
message:
"Testing Mutect2 with {wildcards.sample}"
threads: 1
resources:
mem_mb=1024,
log:
"logs/mutect_{sample}.log",
wrapper:
"v1.10.0/bio/gatk/mutect"
Is just only one input bam file, but if I have to use two bam files. What should I do?

Related

snakemake - Missing input files for rule all

I am trying to create a pipeline that will take a user-configured directory in config.yml (where they have downloaded a project directory of .fastq.gz files from BaseSpace), to run fastqc on sequence files. I already have the downstream steps of merging the fastqs by lane and running fastqc on the merged files.
However, the wildcards are giving me problems running fastqc on the original basespace files. The following is my error when I try running snakemake.
Missing input files for rule all:
qc/fastqc_premerge/DEX-13_S9_L001_ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b_r1_fastqc.zip
qc/fastqc_premerge/BOMB-3-2-19D_S8_L002_ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e_r1_fastqc.zip
qc/fastqc_premerge/DEX-13_S9_L002_ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59_r1_fastqc.zip
Any suggestions would be greatly appreciated. Below is minimal code to reproduce this problem.
import glob
configfile: "config.yaml"
wildcard_constraints:
bsdir = '\w+_L\d+_ds.\w+',
lanenum = '\d+'
inputdirectory=config["directory"]
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R1_001.fastq.gz")
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R2_001.fastq.gz")
##### target rules #####
rule all:
input:
#expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS) ##Changed to this from commenters suggestion, however, snakemake still wont run
rule fastqc_premerge_r1:
input:
f"{config['directory']}/{{bsdir}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
output:
html="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.html",
zip="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params: ""
log:
"logs/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.log"
threads: 1
wrapper:
"v0.69.0/bio/fastqc"
Directory structure:
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R1_001.fastq.gz
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R2_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R1_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R2_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R1_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R2_001.fastq.gz
In this above case, I would like to run fastqc on all 6 input R1/R2 files, then downstream, create a merged file for DEX_13_S9 (for the two inputs to merge) and BOMB-3_2_19D (which will be a copy of the 1 input). Then create 4 fastqc reports on these resulting R1 and R2 files.
EDIT: I had to change the following to get snakemake to run
inputdirectory=config["directory"]
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R1_001.fastq.gz", followlinks=True)
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R2_001.fastq.gz", followlinks=True)
##### target rules #####
rule all:
input:
"qc/multiqc_report_premerge.html"
rule fastqc_premerge_r1:
input:
f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
output:
html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.html",
zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
params: ""
log:
"logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.log"
threads: 1
wrapper:
"v0.69.0/bio/fastqc"
rule fastqc_premerge_r2:
input:
f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R2_001.fastq.gz"
output:
html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.html",
zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
params: ""
log:
"logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.log"
threads: 1
wrapper:
"v0.69.0/bio/fastqc"
rule multiqc_pre:
input:
expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS),
expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS)
output:
"qc/multiqc_report_premerge.html"
log:
"logs/multiqc_premerge.log"
wrapper:
"0.62.0/bio/multiqc"
In your rule all you have:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
This should generate all combinations of SAMPLES, DIRECTORY, and LANENUMS. Is this what you want? I suspect not since it means that all samples are in all directories and they are on all lanes. Maybe you want the zip function to expand the list:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
It's telling you what files are missing; that's what the lines under "missing input files for rule all" are.
That being said, to answer your original question, if you do a dry run, that should tell you what the input/output files are for each planned rule you want to run (use flags -n -r) in your run command.

How to read the config.yaml file and feed it in snakemake

I have a ref.fasta file that contains scaffolds. In order to parallelise the variant calling process I grouped scaffolds by chromosomes and created a config.yaml file as below:
samples:
chr1A: scaffold26096,scaffold40476
chr1B: scaffold11969,scaffold83281,scaffold43483
chr1D: scaffold4701,scaffold102360
And a script as below.
configfile: "config.yaml"
rule all:
input:
expand("scaffolds/{sample}.vcf", sample=config["samples"])
rule gatk:
input:
"/path/to/ref.fasta",
"/path/to/bam.list",
lambda wildcards: config["samples"][wildcards.sample]
output:
outf ="scaffolds/{sample}.vcf"
shell:
"""
/Tools/gatk/gatk --java-options "-Xmx16g -XX:ParallelGCThreads=10" HaplotypeCaller -L {input[2]} -R {input[0]} -I {input[1]} -O {output.outf}
"""
I would like to get results as chr1A.vcf, chr1B.vcf and chr1D.vcf.
This is giving me an error:
Missing input files for rule gatk:
scaffold4701,scaffold102360
What is wrong?
I think your yaml file does not contain the data you think it does. I guess you want each chromosome to contain a list of scaffolds, like:
samples:
chr1A:
- scaffold26096
- scaffold40476
chr1B:
- scaffold11969
- scaffold83281
- scaffold43483
chr1D:
- scaffold4701
- scaffold102360
(I haven't checked the rest of your code)

Input vs Output: Job completed successfully, but some output files are missing

My pipeline is failing I believe due to a conflict between the expected out put of rule all vs the actual final output. I believe snakemake is waiting for the file kma/{sample} without an extension to appear instead it is getting a directory that has multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz") and I am having trouble getting them to play well.
configfile: "config.yaml"
rule all:
input:
expand("kma/{sample}", sample = config["samples"])
#multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz", sample = config["samples"])
rule seqtk_qualtiy_filter:
input:
lambda wildcards: "S5_Raw/" + config["samples"][wildcards.sample]
output:
temp("qtrim/{sample}.qtrim.fq")
shell:
"seqtk trimfq -b 0.01 {input} > {output}"
rule seqtk_clip:
input:
"qtrim/{sample}.qtrim.fq"
output:
temp("clip/{sample}.clip.fq")
shell:
"seqtk trimfq -b20 -L 350 {input} > {output}"
rule bbnorm:
input:
"clip/{sample}.clip.fq"
output:
"S5_processed/{sample}.norm.fq"
shell:
"bbnorm.sh in={input} out={output} target=100"
rule kma_map:
input:
"S5_processed/{sample}.norm.fq"
params:
ref = "ref/consensus.fasta"
output:
directory("kma/{sample}")
#multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz")
shell:
"kma -i {input} -t_db {params.ref} -o {output}"
The error if you run it the way that kma would like to see the handling done
Waiting at most 5 seconds for missing files.
MissingOutputException in line 33 of /home/sean/Desktop/reo/antisera project/ReovirusS1AmpliconS5.smk:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
kma/BA8359-19
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
File "/home/sean/.local/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
File "/home/sean/.local/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 231, in handle_job_success
Shutting down, this might take some time.
I have tried increasing the latency time as well, however since the expected file is never actually created it doesn't matter how long you wait.
the error you receive if you using the multiext function
Error in rule kma_map:
jobid: 11
output: kma/BA8359-19.res, kma/BA8359-19.aln, kma/BA8359-19.fsa, kma/BA8359-19.frag.gz
shell:
kma -i S5_processed/BA8359-19.norm.fq -t_db ref/consensus.fasta -o kma/BA8359-19.res kma/BA8359-19.aln kma/BA8359-19.fsa kma/BA8359-19.frag.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
This error happens, I believe, because snakemake lists all the file types in the -o flag and kma says no.
There may be an inconsistency between the input files in rule all and the output files in rule kma_map. Showing the error you get would help.
Anyway, the command multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz", sample = config["samples"]) that you seem you have tried should not compile since multiext does not accept named arguments. Maybe what you want is this:
rule all:
input:
expand("kma/{sample}{ext}", sample= config['samples'], ext= [".res", ".aln", ".fsa", ".gz"]),
rule kma_map:
...
output:
multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz")
shell:
"kmap ... {input} {output}"
EDIT 21SEP2020
I have figured out the correct way to run this application.
first I need the rule all like this
KMA_OUTPUTS = [".res",".aln",".fsa",".frag.gz"]
rule all:
input:
expand("kma/{sample}{ext}", sample = config["samples"], ext = KMA_OUTPUTS)
And then the rule for kma like this
rule kma_map:
input:
"S5_processed/{sample}.norm.fq"
params:
ref = "ref/consensus.fasta",
prefix = "kma/{sample}"
output:
multiext("kma/{sample}", ".res", ".aln", ".fsa", ".frag.gz")
shell:
"kma -i {input} -t_db {params.ref} -o {params.prefix}"
by using the parama.prefix as the output file input I can get the desired output.

Snakemake: MissingInputException with inconsistent naming scheme

I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error telling me, I just understand why its being its not trying to make the rule before it. Porechop is being used to check for all the possible barcodes and will output more than one fastq file if it finds more than barcode in the directory. However since I know what barcode I am looking for I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2 but I do not know how to correct this problem as there can be multiple outputs from porechop.
I thought I was building a path for the input file for the minimap2 rule and when snakemake discovered that the porechop output was not there it would make it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"
rule all:
input:
expand("{sample}.bam", sample = config["samples"])
rule porechop_strict:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
directory("{sample}_strict/")
shell:
"porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"
rule minimap2:
input:
lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
'8413_19': relabeled_reads/8413_19.raw.fastq.gz,
'8417_19': relabeled_reads/8417_19.raw.fastq.gz,
'8445_19': relabeled_reads/8445_19.raw.fastq.gz,
'8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
'8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
'8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}
barcodes: {
'8413_19': BC01.fastq.gz,
'8417_19': BC02.fastq.gz,
'8445_19': BC03.fastq.gz,
'8466_19_104': BC04.fastq.gz,
'8466_19_105': BC05.fastq.gz,
'8467_20': BC06.fastq.gz,
}
First of all, you can always debug the problems like that specifying the flag --printshellcmds. That would print all shell commands that Snakemake runs under the hood; you may try to run them manually and locate the problem.
As for why your rule doesn't produce any output, my guess is that samtools requires explicit filenames or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try that:
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
So I am not 100% sure why this way works, I imagine it has to do with the way snakemake looks at the targets however here is the solution I found for it.
rule minimap2:
input:
"{sample}_strict"
params:
suffix=lambda wildcards: config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../consensus.fasta\
{input}/{params.suffix} | samtools sort -o {output}"
by using the params feature in snakemake I was able to match up the correct barcode to the sample name. I am not sure why I could just do that as the input itself, but when I returned the input to the match the output of the previous rule it works.

Snakemake: How to use config file efficiently

I'm using the following config file format in snakemake for a some sequencing analysis practice (I have loads of samples each containing 2 fastq files:
samples:
Sample1_XY:
- fastq_files/SRR4356728_1.fastq.gz
- fastq_files/SRR4356728_2.fastq.gz
Sample2_AB:
- fastq_files/SRR6257171_1.fastq.gz
- fastq_files/SRR6257171_2.fastq.gz
I'm using the following rules at the start of my pipeline to run fastqc and for alignment of the fastqc files:
import os
# read config info into this namespace
configfile: "config.yaml"
rule all:
input:
expand("FastQC/{sample}_fastqc.zip", sample=config["samples"]),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/fastq_multiqc.html"
rule fastqc:
input:
sample=lambda wildcards: config['samples'][wildcards.sample]
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/{sample}_fastqc.html",
zip="FastQC/{sample}_fastqc.zip"
params: ""
wrapper:
"0.21.0/bio/fastqc"
rule bowtie2:
input:
sample=lambda wildcards: config['samples'][wildcards.sample]
output:
"bam_files/{sample}.bam"
log:
"logs/bowtie2/{sample}.txt"
params:
index=config["index"], # prefix of reference genome index (built with bowtie2-build),
extra=""
threads: 8
wrapper:
"0.21.0/bio/bowtie2/align"
rule multiqc_fastq:
input:
expand("FastQC/{sample}_fastqc.html", sample=config["samples"])
output:
"FastQC/fastq_multiqc.html"
params:
log:
"logs/multiqc.log"
wrapper:
"0.21.0/bio/multiqc"
My issue is with the fastqc rule.
Currently both the fastqc rule and the bowtie2 rule create one output file generated using two inputs SRRXXXXXXX_1.fastq.gz and SRRXXXXXXX_2.fastq.gz.
I need the fastq rule to generate two files, a separate one for each of the fastq.gz files but I'm unsure how to index the config file correctly from the fastqc rule input statement, or how to combine the the expand and wildcards commands to solve this. I can get an individual fastq file by adding [0] or [1] to the end of the input statement, but not both run individually/separately.
I've been messing around trying to get the correct indexing format to access each file separately. The current format is the only one I've managed that allows snakemake -np to generate a job list.
Any tips would be greatly appreciated.
It appears each sample would have two fastq files, and they are named in format ***_1.fastq.gz and ***_2.fastq.gz. In that case, config and code below would work.
config.yaml:
samples:
Sample_A: fastq_files/SRR4356728
Sample_B: fastq_files/SRR6257171
Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
input:
expand("FastQC/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
expand("bam_files/{sample}.bam", sample=config["samples"]),
"FastQC/fastq_multiqc.html"
rule fastqc:
input:
sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
output:
# Output needs to end in '_fastqc.html' for multiqc to work
html="FastQC/{sample}_{num}_fastqc.html",
zip="FastQC/{sample}_{num}_fastqc.zip"
wrapper:
"0.21.0/bio/fastqc"
rule bowtie2:
input:
sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
output:
"bam_files/{sample}.bam"
wrapper:
"0.21.0/bio/bowtie2/align"
rule multiqc_fastq:
input:
expand("FastQC/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
output:
"FastQC/fastq_multiqc.html"
wrapper:
"0.21.0/bio/multiqc"