snakemake with prefix as output including a path - snakemake

How can I make sure in rule all that the output folder was created correctly?
Should I add each expected result file?
This is somewhat related to "snakemake define folder as output", but in my case the specified 'output' is a combination of a directory path and a prefix shared by all result files (there will be several).
The following command creates the folder Analysis/MosDepth and writes these files into it:
gt0.mosdepth.global.dist.txt
gt0.mosdepth.region.dist.txt
gt0.per-base.bed.gz
gt0.per-base.bed.gz.csi
gt0.regions.bed.gz
gt0.regions.bed.gz.csi
rule MosDepth:
    input:
        bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
        bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
    output:
        pfx = "Analysis/MosDepth/gt0"
    threads: config["threads"]
    shell:
        "mosdepth -t {threads} -b {input.bed} {output.pfx} {input.bam}"
I currently have only one of the files in rule all. Is this enough, or is there a better way to ensure that mosdepth has run properly and is not redone in a later re-run?
rule all:
    input:
        "Analysis/MosDepth/gt0.regions.bed.gz"

I would recommend something like this:
mos_out = ['gt0.mosdepth.global.dist.txt', 'gt0.mosdepth.region.dist.txt', 'gt0.per-base.bed.gz', 'gt0.per-base.bed.gz.csi', 'gt0.regions.bed.gz', 'gt0.regions.bed.gz.csi']
rule MosDepth:
    input:
        bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
        bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
    output:
        expand("Analysis/MosDepth/{mos_out}", mos_out=mos_out)
    params:
        pfx = "Analysis/MosDepth/gt0"
    threads: config["threads"]
    shell:
        "mosdepth -t {threads} -b {input.bed} {params.pfx} {input.bam}"
If one of the output files is not created by the rule, snakemake will remove all the output files for you, and throw an error.
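As a minimal sketch, the list of expected files can also be derived from the prefix in plain Python instead of being typed out by hand. The helper name mosdepth_outputs and the constant MOSDEPTH_SUFFIXES are hypothetical; the suffixes are the six file endings listed in the question.

```python
# Hypothetical helper (not part of Snakemake or mosdepth): derive the files
# that mosdepth is expected to write for a given output prefix.
MOSDEPTH_SUFFIXES = [
    "mosdepth.global.dist.txt",
    "mosdepth.region.dist.txt",
    "per-base.bed.gz",
    "per-base.bed.gz.csi",
    "regions.bed.gz",
    "regions.bed.gz.csi",
]


def mosdepth_outputs(prefix):
    """Return one path per expected mosdepth result file."""
    return [f"{prefix}.{suffix}" for suffix in MOSDEPTH_SUFFIXES]


outputs = mosdepth_outputs("Analysis/MosDepth/gt0")
for path in outputs:
    print(path)
```

The resulting list could be used both as the rule's output and as the input of rule all, so the two cannot drift apart.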

Related

snakemake - Missing input files for rule all

I am trying to create a pipeline that will take a user-configured directory in config.yml (where they have downloaded a project directory of .fastq.gz files from BaseSpace), to run fastqc on sequence files. I already have the downstream steps of merging the fastqs by lane and running fastqc on the merged files.
However, the wildcards are giving me problems running fastqc on the original basespace files. The following is my error when I try running snakemake.
Missing input files for rule all:
qc/fastqc_premerge/DEX-13_S9_L001_ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b_r1_fastqc.zip
qc/fastqc_premerge/BOMB-3-2-19D_S8_L002_ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e_r1_fastqc.zip
qc/fastqc_premerge/DEX-13_S9_L002_ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59_r1_fastqc.zip
Any suggestions would be greatly appreciated. Below is minimal code to reproduce this problem.
import glob
configfile: "config.yaml"
wildcard_constraints:
    bsdir = '\w+_L\d+_ds.\w+',
    lanenum = '\d+'
inputdirectory=config["directory"]
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R1_001.fastq.gz")
DIRECTORY, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{bsdir}/{sample}_L{lanenum}_R2_001.fastq.gz")
##### target rules #####
rule all:
    input:
        #expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
        expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS) ##Changed to this from commenters suggestion, however, snakemake still wont run
rule fastqc_premerge_r1:
    input:
        f"{config['directory']}/{{bsdir}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
    output:
        html="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.html",
        zip="qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: ""
    log:
        "logs/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1.log"
    threads: 1
    wrapper:
        "v0.69.0/bio/fastqc"
Directory structure:
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R1_001.fastq.gz
ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b/DEX-13_S9_L001_R2_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R1_001.fastq.gz
ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59/DEX-13_S9_L002_R2_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R1_001.fastq.gz
ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e/BOMB-3-2-19D_S8_L002_R2_001.fastq.gz
In the case above, I would like to run fastqc on all 6 input R1/R2 files, then, downstream, create a merged file for DEX_13_S9 (merging its two inputs) and one for BOMB-3_2_19D (which will be a copy of its single input), and finally create 4 fastqc reports on the resulting R1 and R2 files.
EDIT: I had to change the following to get snakemake to run:
inputdirectory=config["directory"]
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R1_001.fastq.gz", followlinks=True)
PROJECTDIR, RANDOMINT, LANENUM1, BSSTRINGS, SAMPLES, LANENUMS = glob_wildcards(inputdirectory+"/{proj}-{randint}_L{lanenum1}_ds.{bsstring}/{sample}_L{lanenum}_R2_001.fastq.gz", followlinks=True)
##### target rules #####
rule all:
    input:
        "qc/multiqc_report_premerge.html"
rule fastqc_premerge_r1:
    input:
        f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R1_001.fastq.gz"
    output:
        html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.html",
        zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
    params: ""
    log:
        "logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1.log"
    threads: 1
    wrapper:
        "v0.69.0/bio/fastqc"
rule fastqc_premerge_r2:
    input:
        f"{config['directory']}/{{proj}}-{{randint}}_L{{lanenum1}}_ds.{{bsstring}}/{{sample}}_L{{lanenum}}_R2_001.fastq.gz"
    output:
        html="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.html",
        zip="qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc
    params: ""
    log:
        "logs/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2.log"
    threads: 1
    wrapper:
        "v0.69.0/bio/fastqc"
rule multiqc_pre:
    input:
        expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r1_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS),
        expand("qc/fastqc_premerge/{sample}_L{lanenum}_{proj}-{randint}_L{lanenum1}_ds.{bsstring}_r2_fastqc.zip", zip, sample=SAMPLES, lanenum=LANENUMS, proj=PROJECTDIR, randint=RANDOMINT, lanenum1=LANENUM1, bsstring=BSSTRINGS)
    output:
        "qc/multiqc_report_premerge.html"
    log:
        "logs/multiqc_premerge.log"
    wrapper:
        "0.62.0/bio/multiqc"
In your rule all you have:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
This should generate all combinations of SAMPLES, DIRECTORY, and LANENUMS. Is this what you want? I suspect not since it means that all samples are in all directories and they are on all lanes. Maybe you want the zip function to expand the list:
expand('qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip', zip, sample=SAMPLES, bsdir=DIRECTORY, lanenum=LANENUMS)
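The difference between the two calls can be pictured in plain Python, without Snakemake. This is only a sketch: the three lists hold values copied from the directory listing in the question, and the order in which glob_wildcards would return them is an assumption. itertools.product stands in for the default behaviour of expand and the built-in zip for the zip variant.

```python
from itertools import product

# Illustrative values taken from the question's directory listing;
# their order is an assumption.
SAMPLES = ["DEX-13_S9", "DEX-13_S9", "BOMB-3-2-19D_S8"]
LANENUMS = ["001", "002", "002"]
DIRECTORY = [
    "ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b",
    "ngc1838-10_L002_ds.6369bc71fac44f00931eecb9b0a45d59",
    "ngc1838-8_L002_ds.b81c308d62ba447b8caf074ffb27917e",
]

TEMPLATE = "qc/fastqc_premerge/{sample}_L{lanenum}_{bsdir}_r1_fastqc.zip"

# Full product: every sample is combined with every lane and every directory.
all_combinations = [
    TEMPLATE.format(sample=s, lanenum=l, bsdir=d)
    for s, l, d in product(SAMPLES, LANENUMS, DIRECTORY)
]

# zip: the i-th sample is paired with the i-th lane and the i-th directory.
paired = [
    TEMPLATE.format(sample=s, lanenum=l, bsdir=d)
    for s, l, d in zip(SAMPLES, LANENUMS, DIRECTORY)
]

print(len(all_combinations), "paths from the full product")
print(len(paired), "paths from zip")
```

Most of the paths in the full product describe files that cannot exist, for example a sample from one directory combined with another directory's name.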
It's telling you which files are missing; that's what the lines under "Missing input files for rule all" are.
That being said, to answer your original question: a dry run (add the flags -n -r to your run command) should tell you what the input/output files are for each rule that is planned to run.
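One possible contributing factor, which this thread does not confirm, is the bsdir constraint itself: in Python regular expressions \w does not match a hyphen, while the directory names in the question contain one. The sketch below checks this with the standard re module. It assumes the constraint has to match the whole directory name, and RELAXED_CONSTRAINT is a hypothetical alternative, not something proposed in the thread.

```python
import re

# Constraint copied from the question's wildcard_constraints block.
BSDIR_CONSTRAINT = r"\w+_L\d+_ds.\w+"

# Directory name copied from the question's directory listing.
bsdir = "ngc1838-10_L001_ds.9fd1f6dff0df47ab821125aab07be69b"

# \w covers letters, digits and underscore, but not the "-" in "ngc1838-10".
strict_match = re.fullmatch(BSDIR_CONSTRAINT, bsdir)

# Hypothetical relaxed constraint that also allows hyphens before "_L".
RELAXED_CONSTRAINT = r"[\w-]+_L\d+_ds\.\w+"
relaxed_match = re.fullmatch(RELAXED_CONSTRAINT, bsdir)

print("original constraint matches:", strict_match is not None)
print("relaxed constraint matches:", relaxed_match is not None)
```

This would be consistent with the edit in the question, where splitting the directory name into {proj}-{randint} made the workflow run.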

Snakemake: MissingInputException with inconsistent naming scheme

I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error is telling me; I just don't understand why Snakemake is not trying to run the rule before it. Porechop is being used to check for all the possible barcodes and will output more than one fastq file if it finds more than one barcode in the directory. However, since I know which barcode I am looking for, I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2, but I do not know how to correct this problem as there can be multiple outputs from Porechop.
I thought I was building a path for the input file of the minimap2 rule, and that when Snakemake discovered that the Porechop output was not there it would create it, but that is not what is happening.
Here is my pipeline so far:
configfile: "config.yaml"
rule all:
    input:
        expand("{sample}.bam", sample = config["samples"])
rule porechop_strict:
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        directory("{sample}_strict/")
    shell:
        "porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"
rule minimap2:
    input:
        lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
    output:
        "{sample}.bam"
    shell:
        "minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
    '8413_19': relabeled_reads/8413_19.raw.fastq.gz,
    '8417_19': relabeled_reads/8417_19.raw.fastq.gz,
    '8445_19': relabeled_reads/8445_19.raw.fastq.gz,
    '8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
    '8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
    '8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}
barcodes: {
    '8413_19': BC01.fastq.gz,
    '8417_19': BC02.fastq.gz,
    '8445_19': BC03.fastq.gz,
    '8466_19_104': BC04.fastq.gz,
    '8466_19_105': BC05.fastq.gz,
    '8467_20': BC06.fastq.gz,
}
First of all, you can always debug problems like this by specifying the flag --printshellcmds. That prints all the shell commands that Snakemake runs under the hood; you can then run them manually to locate the problem.
As for why your rule doesn't produce any output, my guess is that samtools requires explicit filenames or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try this:
shell:
    "minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
I am not 100% sure why this works; I imagine it has to do with the way Snakemake looks at the targets. Here is the solution I found:
rule minimap2:
    input:
        "{sample}_strict"
    params:
        suffix=lambda wildcards: config["barcodes"][wildcards.sample]
    output:
        "{sample}.bam"
    shell:
        "minimap2 -ax map-ont -t8 ../consensus.fasta "
        "{input}/{params.suffix} | samtools sort -o {output}"
By using the params feature in Snakemake I was able to match the correct barcode to the sample name. I am not sure why I could not just do that in the input itself, but when I changed the input back to match the output of the previous rule, it worked.
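One way to picture why the directory works as an input while the file inside it does not is a simplified model of how a requested path is matched against a rule's output pattern. The sketch below is not Snakemake's actual implementation: pattern_to_regex is a hypothetical helper that assumes each wildcard matches ".+" and that the whole path must match, and the trailing slash of the directory output is dropped for simplicity.

```python
import re


def pattern_to_regex(pattern):
    """Rough model: turn each {name} into a named group matching '.+'.

    Assumes every wildcard name occurs only once in the pattern.
    """
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)      # literal text between wildcards
        else:
            regex += f"(?P<{part}>.+)"    # wildcard name captured by re.split
    return re.compile(regex + "$")


# Output pattern of porechop_strict, without the trailing slash.
porechop_output = pattern_to_regex("{sample}_strict")

# What the original minimap2 rule asked for, and what the fixed rule asks for.
file_match = porechop_output.match("8413_19_strict/BC01.fastq.gz")
dir_match = porechop_output.match("8413_19_strict")

print("file inside the directory matches:", file_match is not None)
print("directory itself matches:", dir_match is not None)
```

In this model no rule declares the file inside the directory as an output, so it is reported as a missing input, whereas the directory name matches the Porechop rule and triggers it.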

Snakemake shell command should only take one file at a time, but it's trying to do multiple files at once

First off, I'm sorry if I'm not explaining my problem clearly, English is not my native language.
I'm trying to make a snakemake rule that takes a fastq file and filters it with a program called Filtlong. I have multiple fastq files on which I want to run this rule and it should output a filtered file per fastq file but apparently it takes all of the fastq files as input for a single Filtlong command.
The fastq files are in separate directories and the snakemake rule should write the filtered files to separate directories as well.
This is how my code looks right now:
from os import listdir
configfile: "config.yaml"
DATA = config["DATA"]
SAMPLES = listdir(config["RAW_DATA"])
RAW_DATA = config["RAW_DATA"]
FILT_DIR = config["FILTERED_DIR"]
rule all:
    input:
        expand("{FILT_DIR}/{sample}/{sample}_filtered.fastq.gz", FILT_DIR=FILT_DIR, sample=SAMPLES)
rule filter_reads:
    input:
        expand("{RAW_DATA}/{sample}/{sample}.fastq", sample=SAMPLES, RAW_DATA=RAW_DATA)
    output:
        "{FILT_DIR}/{sample}/{sample}_filtered.fastq.gz"
    shell:
        "filtlong --keep_percent 90 --target_bases 300000000 {input} | gzip > {output}"
And this is the config file:
DATA:
    all_samples
RAW_DATA:
    all_samples/raw_samples
FILTERED_DIR:
    all_samples/filtered_samples
The separate directories with the fastq files are in RAW_DATA and the directories with the filtered files should be in FILTERED_DIR.
When I try to run this, I get an error that looks something like this:
Error in rule filter_reads:
jobid: 30
output: all_samples/filtered_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10_filtered.fastq.gz
shell:
filtlong --keep_percent 90 --target_bases 300000000 all_samples/raw_samples/cell3_barcode11/cell3_barcode11.fastq all_samples/raw_samples/barcode01/barcode01.fastq all_samples/raw_samples/barcode03/barcode03.fastq all_samples/raw_samples/barcode04/barcode04.fastq all_samples/raw_samples/barcode05/barcode05.fastq all_samples/raw_samples/barcode06/barcode06.fastq all_samples/raw_samples/barcode07/barcode07.fastq all_samples/raw_samples/barcode08/barcode08.fastq all_samples/raw_samples/barcode09/barcode09.fastq all_samples/raw_samples/cell3_barcode01/cell3_barcode01.fastq all_samples/raw_samples/cell3_barcode02/cell3_barcode02.fastq all_samples/raw_samples/cell3_barcode03/cell3_barcode03.fastq all_samples/raw_samples/cell3_barcode04/cell3_barcode04.fastq all_samples/raw_samples/cell3_barcode05/cell3_barcode05.fastq all_samples/raw_samples/cell3_barcode06/cell3_barcode06.fastq all_samples/raw_samples/cell3_barcode07/cell3_barcode07.fastq all_samples/raw_samples/cell3_barcode08/cell3_barcode08.fastq all_samples/raw_samples/cell3_barcode09/cell3_barcode09.fastq all_samples/raw_samples/cell3_barcode10/cell3_barcode10.fastq all_samples/raw_samples/cell3_barcode12/cell3_barcode12.fastq all_samples/raw_samples/cell_18-07-19_barcode01/cell_18-07-19_barcode01.fastq all_samples/raw_samples/cell_18-07-19_barcode02/cell_18-07-19_barcode02.fastq all_samples/raw_samples/cell_18-07-19_barcode03/cell_18-07-19_barcode03.fastq all_samples/raw_samples/cell_18-07-19_barcode04/cell_18-07-19_barcode04.fastq all_samples/raw_samples/cell_18-07-19_barcode05/cell_18-07-19_barcode05.fastq all_samples/raw_samples/cell_18-07-19_barcode06/cell_18-07-19_barcode06.fastq all_samples/raw_samples/cell_18-07-19_barcode07/cell_18-07-19_barcode07.fastq all_samples/raw_samples/cell_18-07-19_barcode08/cell_18-07-19_barcode08.fastq all_samples/raw_samples/cell_18-07-19_barcode09/cell_18-07-19_barcode09.fastq all_samples/raw_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10.fastq 
all_samples/raw_samples/cell_18-07-19_barcode11/cell_18-07-19_barcode11.fastq all_samples/raw_samples/cell_18-07-19_barcode12/cell_18-07-19_barcode12.fastq all_samples/raw_samples/cell_18-07-19_barcode13/cell_18-07-19_barcode13.fastq all_samples/raw_samples/cell_18-07-19_barcode14/cell_18-07-19_barcode14.fastq all_samples/raw_samples/cell_18-07-19_barcode15/cell_18-07-19_barcode15.fastq all_samples/raw_samples/cell_18-07-19_barcode16/cell_18-07-19_barcode16.fastq all_samples/raw_samples/cell_18-07-19_barcode17/cell_18-07-19_barcode17.fastq all_samples/raw_samples/cell_18-07-19_barcode18/cell_18-07-19_barcode18.fastq all_samples/raw_samples/cell_18-07-19_barcode19/cell_18-07-19_barcode19.fastq | gzip > all_samples/filtered_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10_filtered.fastq.gz
(exited with non-zero exit code)
As far as I can tell, the rule takes all of the fastq files as input for a single Filtlong command, but I don't quite understand why.
You shouldn't use the expand function in the input section of the filter_reads rule. What you are doing now is requiring all your samples as the input of each filtered file: that is what you can observe in your error message.
You also introduce an unnecessary complication: you mix wildcards and variables. In your example {FILT_DIR} is just a predefined value, while {sample} is a wildcard that Snakemake uses to match the rules. Try the following (pay special attention to the single/double braces and to the formatted string, the one of the form f""):
rule filter_reads:
    input:
        f"{RAW_DATA}/{{sample}}/{{sample}}.fastq"
    output:
        f"{FILT_DIR}/{{sample}}/{{sample}}_filtered.fastq.gz"
    shell:
        "filtlong --keep_percent 90 --target_bases 300000000 {input} | gzip > {output}"

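The effect of the single and double braces in the f-strings above can be checked in plain Python. In this small sketch the two directory values are taken from the question's config file, and barcode01 is one of the sample names from the error message.

```python
# Values taken from the question's config file.
RAW_DATA = "all_samples/raw_samples"
FILT_DIR = "all_samples/filtered_samples"

# Single braces are filled in immediately by the f-string;
# doubled braces survive as literal {sample} for Snakemake to match later.
input_pattern = f"{RAW_DATA}/{{sample}}/{{sample}}.fastq"
output_pattern = f"{FILT_DIR}/{{sample}}/{{sample}}_filtered.fastq.gz"

print(input_pattern)
print(output_pattern)

# Filling the wildcard by hand for one sample, as Snakemake would per job.
one_input = input_pattern.format(sample="barcode01")
print(one_input)
```

Each job then receives exactly one input path, the one whose sample matches the requested output.
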
How to pass variable value as input in snakemake?

I want to download fastq files from the SRA database by SRR ID using Snakemake. I read a file with Python code to get the SRR IDs.
I want to pass the IDs one by one as input. My code is below.
I want to run a command like:
fastq-dump SRR390728
#SAMPLES = ['SRR390728','SRR400816']
SAMPLES = [line.strip() for line in open("./srrList", 'r')]
rule all:
    input:
        expand("fastq/{sample}.fastq.log",sample=SAMPLES)
rule download_fastq:
    input:
        "{sample}"
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {input} > {output}"
Skip input and just use the wildcard in the shell command. An input needs to be a file path that either already exists or is created as part of the pipeline; neither is true in your case.
rule download_fastq:
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {wildcards.sample} > {output}"

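The flow from the ID list to the commands can be sketched in plain Python. This is an illustration under assumptions: io.StringIO stands in for the ./srrList file, its contents are the two IDs from the commented-out line in the question plus a blank line, and skipping blank lines is an addition that the original list comprehension does not do.

```python
import io

# Stand-in for the contents of ./srrList (assumed: one ID per line).
srr_list = io.StringIO("SRR390728\nSRR400816\n\n")

# Strip whitespace and skip empty lines so no empty sample name is produced.
SAMPLES = [line.strip() for line in srr_list if line.strip()]

# The targets that rule all would request, one log file per SRR ID.
targets = [f"fastq/{sample}.fastq.log" for sample in SAMPLES]

# The command each job would run; the ID comes from the wildcard, not a file.
commands = [f"fastq-dump {sample} > fastq/{sample}.fastq.log" for sample in SAMPLES]

for command in commands:
    print(command)
```
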
Snakemake: Target rules may not contain wildcards

I am trying to supply a bunch of files as input to snakemake and wildcards are not working for some reason:
rule cluster:
    input:
        script = '/Users/nikitavlasenko/python_scripts/python/dbscan.py',
        path = '/Users/nikitavlasenko/python_scripts/data_files/umap/{sample}.csv'
    output:
        path = '/Users/nikitavlasenko/python_scripts/output/{sample}'
    shell:
        "python {input.script} -data {input.path} -eps '0.3' -min_samples '10' -path {output.path}"
I want snakemake to read the files in the umap directory, get their names, and pass them to the python script, so that each result gets a unique name. How can this be achieved without the error I am getting right now?
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or
a rule without wildcards.
Update
I found that most probably the rule all is required at the top:
https://bioinformatics.stackexchange.com/questions/2761/how-to-resolve-in-snakemake-error-target-rules-may-not-contain-wildcards
So I added it like that:
samples='SCID_WT_CCA'
rule all:
    input:
        expand('/Users/nikitavlasenko/python_scripts/data_files/umap/{sample}_umap.csv', sample=samples.split(' '))
However, I am getting the following weird message:
Building DAG of jobs...
Nothing to be done.
So, it is not running.
Update
I thought that it could be related to the fact that I had just one sample name at the top, so I changed it to:
samples='SCID_WT_CCA WT SCID plus_1 minus_1'
And added the respective files, of course, but it did not fix this error.
Actually, if I run snakemake cluster I get the same error as at the very top, but if I just run snakemake, I get the "Nothing to be done" message. I tried replacing the absolute paths with relative ones, but it did not help:
samples='SCID_WT_CCA WT SCID plus_1 minus_1'
rule all:
    input:
        expand('data_files/umap/{sample}_umap.csv', sample=samples.split(' '))
rule cluster:
    input:
        script = 'python/dbscan.py',
        path = 'data_files/umap/{sample}_umap.csv'
    output:
        path = 'output/{sample}'
    shell:
        "python {input.script} -data {input.path} -eps '0.3' -min_samples '10' -path {output.path}"
The "all" rule should have as input the list of files you want the other rule(s) to generate as output. Here, you seem to be using the list of your starting files instead.
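To picture the difference, here is a small plain-Python sketch. The function produced_by_cluster is a hypothetical stand-in for "does this path match the output pattern of rule cluster"; it simply checks for the output/ prefix and is not how Snakemake resolves targets internally.

```python
samples = "SCID_WT_CCA WT SCID plus_1 minus_1".split(" ")

# What rule all requested in the question: the starting files.
requested_in_question = [f"data_files/umap/{s}_umap.csv" for s in samples]

# What rule all should request: the files rule cluster writes.
requested_as_outputs = [f"output/{s}" for s in samples]


def produced_by_cluster(path):
    """Hypothetical check: rule cluster only declares outputs under output/."""
    return path.startswith("output/")


print(any(produced_by_cluster(p) for p in requested_in_question))
print(all(produced_by_cluster(p) for p in requested_as_outputs))
```

When none of the requested files is an output of any rule and all of them already exist, there is no job to schedule, which fits the "Nothing to be done" message.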
Try the following:
samples = 'SCID_WT_CCA WT SCID plus_1 minus_1'
rule all:
    input:
        expand('output/{sample}', sample=samples.split(' '))
rule cluster:
    input:
        script = 'python/dbscan.py',
        path = 'data_files/umap/{sample}_umap.csv'
    output:
        path = 'output/{sample}'
    shell:
        "python {input.script} -data {input.path} -eps '0.3' -min_samples '10' -path {output.path}"
Following bli's answer, I was able to solve the issue. However, one additional modification was needed. I had passed output/{sample} to the python script, which generated two files from this path. That does not work: snakemake reported that it could not find output/file_name. It can only see the files if all output paths are declared explicitly in the rule rather than being derived by the script on the fly, so I did that. Here is the final Snakefile that worked well:
samples='SCID_WT_CCA WT SCID plus_1 minus_1'
rule all:
    input:
        expand('output/{sample}_umap.png', sample=samples.split(' ')),
        expand('output/{sample}_clusters.csv', sample=samples.split(' '))
rule cluster:
    input:
        script = 'python/dbscan.py',
        path = 'data_files/umap/{sample}_umap.csv'
    output:
        path_to_umap = 'output/{sample}_umap.png',
        path_to_clusters = 'output/{sample}_clusters.csv'
    shell:
        "python {input.script} -data {input.path} -eps '0.3' -min_samples '10' -path_to_umap {output.path_to_umap} -path_to_clusters {output.path_to_clusters}"