Snakemake basic issue - snakemake

I tried to run a Snakemake command on my local computer. It didn't work even when I used the simplest code structure, like so:
rule fastqc_raw:
    input:
        "raw/A.fastq"
    output:
        "output/fastqc_raw/A.html"
    shell:
        "fastqc {input} -o {output} -t 4"
It displayed this error:
Error in rule fastqc_raw:
jobid: 1
output: output/fastqc_raw/A.html RuleException: CalledProcessError in line 13 of
/Users/01/Desktop/Snakemake/Snakefile: Command ' set -euo pipefail;
fastqc raw/A.fastq -o output/fastqc_raw/A.html -t 4 ' returned
non-zero exit status 2. File
"/Users/01/Desktop/Snakemake/Snakefile", line 13, in __rule_fastqc_raw
File "/Users/01/miniconda3/lib/python3.6/concurrent/futures/thread.py",line 56, in run
However, Snakemake did create a DAG file that looks normal, and when I used the “snakemake --np” command, it didn't display any errors.
I also ran fastqc locally without Snakemake using the same command, and it worked perfectly.
I hope someone can help me with this.
Thanks!

It looks like Snakemake did its job. It ran the command:
fastqc raw/A.fastq -o output/fastqc_raw/A.html -t 4
But the command returned an error:
Command ' set -euo pipefail;
fastqc raw/A.fastq -o output/fastqc_raw/A.html -t 4 ' returned
non-zero exit status 2.
The next step in debugging is to run the fastqc command manually to see if it gives an error.

I hope you have gotten an answer by now but I had the exact same issue so I will offer my solution.
The error is in the shell directive:
    shell:
        "fastqc {input} -o {output} -t 4"
The FastQC flag -o expects an output directory, and you have given it an output file. Your code should be:
    shell:
        "fastqc {input} -o output/fastqc_raw/ -t 4"
Your error relates to the fact that the output files have been output in a different location (most likely the input directory) and the rule all: has failed as a result.
Additionally, FastQC will give an error if the directories are not already created, so you will need to do that first.
It is strange as I have seen Snakemake scripts that have no -o flag in the fastqc shell and it worked fine, but I haven't been so lucky.
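As a rough illustration of the point about -o, here is a minimal Python sketch that derives the directory from the declared output file and creates it before calling FastQC. The helper name fastqc_command is hypothetical and not part of Snakemake or FastQC; it only builds the command string. Note also that FastQC normally names its report after the input (for example A_fastqc.html), so the declared output may need adjusting as well.

```python
import os
import shlex

def fastqc_command(input_fastq, output_html, threads=1):
    """Build a shell command that passes the output *directory* to fastqc -o.

    Hypothetical helper for illustration; it only formats a string.
    """
    out_dir = os.path.dirname(output_html)  # e.g. "output/fastqc_raw"
    return "mkdir -p {d} && fastqc {i} -o {d} -t {t}".format(
        d=shlex.quote(out_dir),
        i=shlex.quote(input_fastq),
        t=int(threads),
    )

cmd = fastqc_command("raw/A.fastq", "output/fastqc_raw/A.html", threads=4)
print(cmd)
```

Inside a Snakefile the same idea could be expressed with a params entry computed from the output path, but that is left out here to keep the sketch runnable as plain Python.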

An additional note: I can see you're using 4 threads there with the '-t 4' argument. You should specify this so Snakemake gives it 4 threads, otherwise I believe it will run with 1 thread and may fail due to lack of memory. This can be done like so:
rule fastqc_raw:
    input:
        "raw/A.fastq"
    output:
        "output/fastqc_raw/A.html"
    threads: 4
    shell:
        "fastqc {input} -o output/fastqc_raw/ -t {threads}"

Related

Snakemake Megahit output issue

A few days ago I started using Snakemake for the first time. I am having an issue when I am trying to run the megahit rule in my pipeline.
It gives me the following error "Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). ......"
So initially it runs and then crashes with the above error. I implemented the solution with the directory() option in my pipeline, but I think it's not good practice since, for various reasons, you can lose files without even knowing it.
Is there a way to run the rule without using the directory() ?
I would appreciate any help on the issue!
Thanking you in advance
sra = []
with open("run_ids") as f:
    for line in f:
        sra.append(line.strip())
rule all:
    input:
        expand("raw_reads/{sample}/{sample}.fastq", sample=sra),
        expand("trimmo/{sample}/{sample}.trimmed.fastq", sample=sra),
        expand("megahit/{sample}/final.contigs.fa", sample=sra)
rule download:
    output:
        "raw_reads/{sample}/{sample}.fastq"
    params:
        "--split-spot --skip-technical"
    log:
        "logs/fasterq-dump/{sample}.log"
    benchmark:
        "benchmarks/fastqdump/{sample}.fasterq-dump.benchmark.txt"
    threads: 8
    shell:
        """
        fasterq-dump {params} --outdir /home/raw_reads/{wildcards.sample} {wildcards.sample} -e {threads}
        """
rule trim:
    input:
        "raw_reads/{sample}/{sample}.fastq"
    output:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    params:
        "HEADCROP:15 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
    log:
        "logs/trimmo/{sample}.log"
    benchmark:
        "benchmarks/trimmo/{sample}.trimmo.benchmark.txt"
    threads: 6
    shell:
        """
        trimmomatic SE -phred33 -threads {threads} {input} trimmo/{wildcards.sample}/{wildcards.sample}.trimmed.fastq {params}
        """
rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        "megahit/{sample}/final.contigs.fa"
    params:
        "-m 0.7 -t"
    log:
        "logs/megahit/{sample}.log"
    benchmark:
        "benchmarks/megahit/{sample}.megahit.benchmark.txt"
    threads: 10
    shell:
        """
        megahit -r {input} -o {output} -t {threads}
        """
IMHO it is a bad design of the megahit software that it takes a directory as a parameter and writes to a file with a hardcoded name inside that directory. Flagging the filename with directory() doesn't solve the issue: megahit then treats what you expect to be a file with the .fa extension as a directory, and the rest of the pipeline breaks.
But this issue can be solved in Snakemake like this:
rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        "megahit/{sample}/final.contigs.fa"
    # ...
    shell:
        """
        megahit -r {input} -o megahit/{wildcards.sample} -t {threads}
        """
A better design of the megahit rule would look as follows:
rule megahit:
    input:
        "trimmo/{sample}/{sample}.trimmed.fastq"
    output:
        out_dir = directory("megahit/{sample}/"),
        fasta = "megahit/{sample}/final.contigs.fa"
    log:
        "logs/megahit/{sample}.log"
    benchmark:
        "benchmarks/megahit/{sample}.megahit.benchmark.txt"
    threads:
        10
    shell:
        "megahit -r {input} -f -o {output.out_dir} -t {threads}"
This guarantees that the output directory is removed upon failure, while the -f argument to megahit tells it to ignore the fact that the output folder exists (it is created by Snakemake automatically because one of the outputs is a file inside it: final.contigs.fa).
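BTW, the -m (--memory) parameter is best implemented as a resource. The only problem is that Snakemake's default memory resource, mem_mb, is in megabytes, whereas megahit's -m expects bytes (or a fraction of total memory when the value is between 0 and 1). One workaround would be as follows: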
    resources:
        mem_mb = mem_mb_limit_for_megahit # could be a fraction of a global constant
    params:
        mem_bytes = lambda w, resources: round(resources.mem_mb * 1e6)
    shell:
        "megahit ... -m {params.mem_bytes}"

snakemake - replacing command line parameters with wildcards by cluster profile

I am writing a snakemake pipeline to eventually identify corona virus variants.
Below is a minimal example with three steps:
LOGDIR = '/path/to/logDir'
barcodes = ['barcode49', 'barcode50', 'barcode51']
rule all:
    input:
        expand([
            # guppyplex
            "out/guppyplex/{barcode}/{barcode}.fastq",
            # catFasta
            "out/catFasta/cat_consensus.fasta",
        ], barcode = barcodes)
rule guppyplex:
    input:
        FQ = f"fastq/{{barcode}}" # FASTQ_PATH is parsed from config.yaml
    output:
        "out/guppyplex/{barcode}/{barcode}.fastq"
    shell:
        "touch {output}" # variables in CAPITALS are parsed from config.yaml
rule minion:
    input:
        INFQ = rules.guppyplex.output,
        FAST5 = f"fasta/{{barcode}}"
    params:
        OUTDIR = "out/nanopolish/{barcode}"
    output:
        "out/nanopolish/{barcode}/{barcode}.consensus.fasta"
    shell:
        """
        touch {output} && echo {wildcards.barcode} > {output}
        """
rule catFasta:
    input:
        expand("out/nanopolish/{barcode}/{barcode}.consensus.fasta", barcode = barcodes)
    output:
        "out/catFasta/cat_consensus.fasta"
    shell:
        "cat {input} > {output}"
If I run Snakemake locally by calling snakemake -p --cores 1 all, everything works. Yet my ultimate goal is to use qsub to run the jobs on a cluster. I also want the stderr and stdout from qsub to have meaningful names, which include wildcards and the rule names for each job.
However, if I call snakemake with
snakemake -p --cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}" all
I will get the following error:
AttributeError: 'Wildcards' object has no attribute 'barcode'
I have recently read the Snakemake documentation, where it appears that I could replace the command line parameters (--cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}") with a YAML file, although the documentation is not all that clear to me.
I have created a config.yaml file at /home/user/.config/snakemake which looks like so:
cluster: 'qsub'
q: 'onlybngs05b'
e: '/home/ngs/tempOutSnakemake'
o: '/home/ngs/tempOutSnakemake'
j: 5
jobname: "{wildcards.barcode}.{rule}.{jobid}"
But then it appears that snakemake is not properly parsing the config.yaml. I am getting
snakemake: error: ambiguous option: --o=/home/ngs/tempOutSnakemake could match --omit-from, --output-wait, --overwrite-shellcmd
I also tried replacing o in the config file with stdout (a sort of long version of the parameter, like -h vs --help in many programs), but that does not work either.
Therefore my question is how I can replace the command line parameters --cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}" by a config.yaml file that accepts wildcards?
I think the problem is that rule catFasta doesn't contain the wildcard barcode. If you think about it, what job name would you expect in {wildcards.barcode}.{rule}.{jobid}?
Maybe a solution could be to add to each rule a jobname parameter that could be {barcode} for guppyplex and minion and 'all_barcodes' for catFasta. Then use --jobname "{params.jobname}.{rule}.{jobid}"
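To make the suggestion concrete, here is a small Python sketch of the naming logic only. The function job_label and the plain dict standing in for a rule's wildcards are hypothetical; this is not how Snakemake builds job names internally, just an illustration of falling back to a fixed label when a rule has no barcode wildcard.

```python
def job_label(wildcards, rule, jobid, fallback="all_barcodes"):
    """Return '<barcode>.<rule>.<jobid>', using a fallback label when the
    rule has no 'barcode' wildcard (as is the case for catFasta)."""
    prefix = wildcards.get("barcode", fallback)
    return f"{prefix}.{rule}.{jobid}"

# A per-barcode rule has the wildcard; an aggregating rule does not.
per_barcode = job_label({"barcode": "barcode49"}, "minion", 3)
aggregated = job_label({}, "catFasta", 7)
print(per_barcode)
print(aggregated)
```

In the Snakefile itself, the equivalent would be the per-rule jobname parameter described above, referenced as {params.jobname} in --jobname.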

running metabat2 with snakemake but not getting the bin files

I have been trying to run metabat2 with Snakemake. It runs, but the output files in metabat2/ are missing. The CheckM step that runs after it does use the data and works; I just can't find the files afterwards. Files with numbered names should be created, but it is impossible to predict how many there will be. Is there a way to specify the output so that the files are created in that directory?
rule all:
    input:
        [f"metabat2/" for sample in samples],
        [f"checkm/" for sample in samples]
rule metabat2:
    input:
        "input/consensus.fasta"
    output:
        directory("metabat2/")
    conda:
        "envs/metabat2.yaml"
    shell:
        "metabat2 -i {input} -o {output} -v"
rule checkM:
    input:
        "metabat2/"
    output:
        c = "bacteria/CheckM.txt",
        d = directory("checkm/")
    conda:
        "envs/metabat2.yaml"
    shell:
        "checkm lineage_wf -f {output.c} -t 10 -x fa {input} {output.d}"
The normal command to run metabat2 would be
metabat2 -i path/to/consensus.fasta -o /outputdir/bin -v
This creates files named bin.[number].fa in outputdir.
I can't tell what the problem is but I have a couple of suggestions...
[f"metabat2/" for sample in samples]: I doubt this will do what you expect, as it simply creates a list with the string metabat2/ repeated len(samples) times. Maybe you want [f"metabat2/{sample}" for sample in samples]? The same goes for [f"checkm/" for sample in samples].
The samples variable is not used anywhere in the rules following all. I suspect somewhere it should be used and/or you should use something like output: directory("metabat2/{sample}")
Execute snakemake with -p option to see what commands are executed. It may be useful to post the stdout from it.
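The first suggestion can be checked in plain Python. The sample names below are made up for illustration; the two list comprehensions are the ones quoted above.

```python
# Hypothetical sample names, standing in for the real `samples` list.
samples = ["s1", "s2", "s3"]

# The f-string has no placeholder, so every element is the same string.
repeated = [f"metabat2/" for sample in samples]

# With {sample} in the f-string, each element is a per-sample path.
per_sample = [f"metabat2/{sample}" for sample in samples]

print(repeated)
print(per_sample)
```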

Does Snakefile location matter?

I am an absolute beginner with Snakemake and am building a pipeline as I learn. My question: if the Snakefile is placed alongside the data files I want to process, a NameError occurs, but if I move the Snakefile to the parent directory and edit the paths in input: and output: accordingly, the code works. What am I missing?
rule sra_convert:
    input:
        "rna/{id}.sra"
    output:
        "rna/fastq/{id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
The above code works fine when I run it with
snakemake -p rna/fastq/SRR873382.fastq
However, if I move the file to "rna" directory where the SRR873382.sra file is and edit the code as below
rule sra_convert:
    input:
        "{id}.sra"
    output:
        "fastq/{id}.fastq"
    message:
        "Converting from {id}.sra to {id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
and run
snakemake -p fastq/SRR873382.fastq
I get the following error
Building DAG of jobs...
Job counts:
count jobs
1 sra_convert
1
RuleException in line 7 of /home/sarc/Data/rna/Snakefile:
NameError: The name 'id' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
Solution
rule sra_convert:
    input:
        "{id}.sra"
    output:
        "fastq/{id}.fastq"
    message:
        "Converting from {wildcards.id}.sra to {wildcards.id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
The above code runs fine without error.
I believe that the best source that answers your actual question is:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
If the rule’s output matches a requested file, the substrings matched
by the wildcards are propagated to the input files and to the variable
wildcards, that is here also used in the shell command. The wildcards
object can be accessed in the same way as input and output, which is
described above.
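The quoted passage can be illustrated with a simplified Python emulation. This is only a sketch of the idea, not Snakemake's actual implementation: the helper match_wildcards is hypothetical, and it assumes every wildcard matches one or more arbitrary characters. It shows why {wildcards.id} resolves in a message while a bare {id} does not.

```python
import re
from types import SimpleNamespace

def match_wildcards(pattern, target):
    """Match a requested file against an output pattern such as
    'fastq/{id}.fastq' and return the wildcard values, or None."""
    regex, pos, seen = "", 0, set()
    for m in re.finditer(r"\{(\w+)\}", pattern):
        name = m.group(1)
        regex += re.escape(pattern[pos:m.start()])
        # A repeated wildcard must match the same text as its first use.
        regex += f"(?P={name})" if name in seen else f"(?P<{name}>.+)"
        seen.add(name)
        pos = m.end()
    regex += re.escape(pattern[pos:])
    found = re.fullmatch(regex, target)
    return SimpleNamespace(**found.groupdict()) if found else None

wildcards = match_wildcards("fastq/{id}.fastq", "fastq/SRR873382.fastq")
message = "Converting from {wildcards.id}.sra to {wildcards.id}.fastq".format(
    wildcards=wildcards
)
print(message)

# A bare {id} is not defined in this formatting context, which is
# analogous to the NameError reported in the question.
try:
    "Converting from {id}.sra".format(wildcards=wildcards)
    bare_name_ok = True
except KeyError:
    bare_name_ok = False
```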

Error: "attribute "m_numa_nodes" is not a integer value." when running a qsub snakemake

I am trying to run a snakemake with cluster submission for RNAseq analysis. Here is my script:
#path to gff
GFF = "RNASeq/data/ref_GRCh38.p2_top_level.gff3"
#sample names and classes
CTHP = 'CTHP1 CTHP2'.split()
CYP = 'CYP1 CYP2'.split()
samples = CTHP + CYP
rule all:
    input:
        'CTHP1/mapping_results/out_summary.gtf',
        'CTHP2/mapping_results/out_summary.gtf',
        'CYP2/mapping_results/out_summary.gtf',
        'CYP1/mapping_results/out_summary.gtf',
rule order_sam:
    input:
        '{samples}/mapping_results/mapped.sam'
    output:
        '{samples}/mapping_results/ordered.mapped.bam'
    threads: 12
    params: ppn="nodes=1:ppn=12"
    shell:
        'samtools view -Su {input} | samtools sort > {output}'
rule count_sam:
    input:
        bam='{samples}/mapping_results/ordered.mapped.bam'
    output:
        summary='{samples}/mapping_results/out_summary.gtf',
        abun='{samples}/mapping_results/abun_results.tab',
        cover='{samples}/mapping_results/coveraged.gtf'
    threads: 12
    params: ppn="nodes=1:ppn=12"
    shell:
        'stringtie -o {output.summary} -G {GFF} -C {output.cover} '
        '-A {output.abun} -p {threads} -l {samples} {input.bam}'
I want to submit each rule to a cluster. So, in the Terminal from the working directory, I do this:
snakemake --cluster "qsub -V -l {params.ppn}" -j 6
However, the jobs are not submitted and I get following error:
Unable to run job: attribute "m_numa_nodes" is not a integer value.
Exiting.
Error submitting jobscript (exit code 1):
I have also tried to set the nodes variable directly when running the snake file like this:
snakemake --cluster "qsub -V -l nodes=1:ppn=16" -j 6
and as expected, it gave me the same error. At this point I am not sure if it's the local cluster setup or something that I am not doing right in the Snakefile. Any help would be great.
Thanks
The error does not look Snakemake related. I am not an SGE/Univa expert so I cannot really help you, but m_numa_nodes is a parameter of the engine. Snakemake does not set it in any way, so it must be either your local configuration or one of the arguments you provide to qsub.
EDIT: 2017/04/12 -- Caught one of the errors in the Google Groups Post. Remove the comma from the last line of input in your "all" rule.
EDIT: 2017/04/13 -- Was advised the comma is not an issue.
The beauty of Snakemake is that sending it to the cluster just requires additional arguments. To determine whether it's a cluster issue or a Snakemake issue, I recommend running a dry run via
snakemake -n
Dryrun will not submit any jobs, but it will return the list of jobs. This is a strong indicator if it's a Snakemake issue or a submission issue. I always perform dryruns while in development, to ensure my Snakemake code works before I start trying to submit it to the cluster, because cluster submissions can be a whole different basket of issues.
As per your submission problems, I use the "--drmaa" flag within Snakemake to handle my submissions to the cluster. I realize this is not what you asked for, but I really enjoy its functionality, and I guess I am just suggesting it as a robust alternative to your current approach.
https://pypi.python.org/pypi/drmaa OR https://anaconda.org/anaconda/drmaa
snakemake --jobs 10 --cluster-config input/config.json --drmaa "{cluster.clusterSpec}"
Inside config.json, most of my rules use this parameter set:
{
"__default__": {
"clusterSpec": "-V -S /bin/bash -o log/varScan -e log/varScan -l h_vmem=10G -pe ncpus 1"
}
}
SGE Cluster Arguments = "-V -S /bin/bash -l h_vmem=10G -pe ncpus 1"
DRMAA Arguments = "-o log/varScan -e log/varScan"
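As a sketch of how a placeholder like {cluster.clusterSpec} could be filled from such a file, the Python below merges the "__default__" entry with any rule-specific entry and formats the argument string. The helper cluster_settings and the rule name "varScan" are assumptions made for illustration, and this only mirrors the idea of the lookup rather than reproducing Snakemake's code.

```python
import json
from types import SimpleNamespace

CONFIG_JSON = """
{
    "__default__": {
        "clusterSpec": "-V -S /bin/bash -o log/varScan -e log/varScan -l h_vmem=10G -pe ncpus 1"
    }
}
"""

def cluster_settings(config, rule_name):
    """Start from '__default__' and let a rule-specific entry override it."""
    settings = dict(config.get("__default__", {}))
    settings.update(config.get(rule_name, {}))
    return SimpleNamespace(**settings)

config = json.loads(CONFIG_JSON)
# "varScan" is a hypothetical rule name; it has no entry, so defaults apply.
cluster = cluster_settings(config, "varScan")
drmaa_args = "{cluster.clusterSpec}".format(cluster=cluster)
print(drmaa_args)
```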
P.S. I think you should also post the operating system (e.g. CentOS 5) and the cluster type (e.g. SGE) you are using.