Passing parameters to cwl from snakemake

I'm trying to execute some cwl pipelines using snakemake. I need to pass parameters to cwltool, which snakemake is using to execute the pipeline.
cwltool has a number of options. Currently, when snakemake calls it, the only option that I can figure out how to pass is --singularity, and that is easy, because if you call snakemake with the --use-singularity flag, it automatically inserts it into the call to cwltool.
snakemake --jobs 999 --printshellcmds --rerun-incomplete --use-singularity
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 999
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 align
1
[Tue Aug 11 06:55:18 2020]
rule align:
input: /project/uoa00571/projects/rapidAutopsy/data/raw/dna_batch2/BA1_1.fastq.gz, /nesi/project/uoa00571/projects/rapidAutopsy/data/raw/dna_batch2/BA1_2.fastq.gz
output: /nobackup/uoa00571/data/intermediate/rapidAutopsy/BA1.unaligned.bam
jobid: 0
threads: 8
cwltool --singularity file:/project/uoa00571/projects/rapidAutopsy/src/gridss/gridssAlignment.cwl /tmp/tmpfgsz5uvr
Unfortunately, I can't figure out how to add additional arguments to the cwltool call. When snakemake calls cwltool, it doesn't appear to pass on the working directory, so if I do:
snakemake --jobs 999 --printshellcmds --rerun-incomplete --use-singularity --directory "/nesi/nobackup/uoa00571/data/intermediate/rapidAutopsy/"
instead of landing the intermediate files in the specified directory, singularity appears to bind a directory under /tmp for the intermediate files, which, on the system I am working on, is not big enough, resulting in a disk-quota-exceeded error.
singularity \
--quiet \
exec \
--contain \
--pid \
--ipc \
--home \
/tmp/xq5co13r:/sSyVnR \
--bind \
/tmp/vbjabhqx:/tmp:rw \
Or at least, that's what I think is happening. If I run the CWL pipeline with cwltool directly, I can add the --cacheDir option and the pipeline runs to completion, so I'd like to be able to pass that option from snakemake to the cwltool call, if that's possible.
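For example, a direct call along these lines works when I give it a cache directory on a filesystem with enough space (the cache path and inputs file here are just placeholders):
# Direct cwltool call with an explicit cache directory
# (cache path and inputs file are placeholders)
cwltool --singularity \
    --cacheDir /nesi/nobackup/uoa00571/cwl-cache \
    file:/project/uoa00571/projects/rapidAutopsy/src/gridss/gridssAlignment.cwl \
    inputs.yml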

Related

Is it possible to see qsub job scripts and command line options etc. that will be issued by Snakemake in the dry run mode?

I am new to Snakemake and planning to use Snakemake for qsub in my cluster environment. In order to avoid critical mistakes that may disturb the cluster, I would like to check the qsub job scripts and the qsub command that Snakemake will generate before actually submitting the jobs to the queue.
Is it possible to see qsub job script files etc. in the dry run mode or in some other ways? I searched for relevant questions but could not find the answer. Thank you for your kind help.
Best,
Schuma
Using --printshellcmds (or its short version -p) together with --dry-run will let you see the commands snakemake will feed to qsub, but you won't see the qsub options.
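For example, a minimal sketch (the qsub options shown are only an example):
# Dry run: print each job's shell command without submitting anything;
# the qsub options themselves are not shown in this output.
snakemake -n -p --cluster "qsub -V"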
I don't know of any option that shows which parameters are given to qsub, but snakemake follows a simple set of rules, about which you can find detailed information here and here. As you'll see, you can feed arguments to qsub in multiple ways:
With default values, --default-resources resource_name1=<value1> resource_name2=<value2>, when invoking snakemake.
On a per-rule basis, using resources in rules (prioritized over default values).
With explicitly set values, either for the whole pipeline using --set-resources resource_name1=<value1> or for a specific rule using --set-resources rule_name:resource_name1=<value1> (prioritized over default and per-rule values); see the sketch after this list.
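For example (the rule and resource names below are just placeholders):
# Default resource values for every rule:
snakemake all --default-resources mem_mb=1000 runtime_min=60

# Explicit value for one specific rule (takes precedence over defaults and the rule's own resources):
snakemake all --set-resources myrule:mem_mb=4000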
Suppose you have the following pipeline:
rule all:
    input:
        "input.txt"
    output:
        "output.txt"
    resources:
        mem_mb=2000,
        runtime_min=240
    shell:
        """
        some_command {input} {output}
        """
If you call qsub via the --cluster option, you can access all the keywords of your rules. Your command could then look like this:
snakemake all --cluster "qsub --runtime {resources.runtime_min} -l mem={resources.mem_mb}mb"
This means snakemake will submit the job to the cluster just as if you had typed the following directly on your command line:
qsub --runtime 240 -l mem=2000mb some_command input.txt output.txt
It is up to you to decide which parameters you define where. You might want to check your cluster's documentation, or ask its administrator, which parameters are required and which to avoid.
Also note that for cluster use, the Snakemake documentation recommends setting up a profile, which you can then use with snakemake --profile myprofile instead of having to specify arguments and default values each time.
Such a profile can be written in a ~/.config/snakemake/profile_name/config.yaml file. Here is an example of such a profile:
cluster: "qsub -l mem={resources.mem_mb}mb other_resource={resources.other_resource_name}"
jobs: 256
printshellcmds: true
rerun-incomplete: true
default-resources:
  - mem_mb=1000
  - other_resource_name="foo"
Invoking snakemake all --profile profile_name corresponds to invoking
snakemake all --cluster "qsub -l mem={resources.mem_mb}mb other_resource={resources.other_resource_name}" --jobs 256 --printshellcmds --rerun-incomplete --default-resources mem_mb=1000 other_resource_name="foo"
You may also want to define test rules, for instance a minimal example of your pipeline, and try these first to verify that all goes well before running the full pipeline, as sketched below.
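For example (assuming a hypothetical small test target, here called test_small_sample, defined in your workflow):
# Dry-run the test target first, then run it for real using the profile
snakemake -n -p test_small_sample
snakemake --profile profile_name test_small_sample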

Snakemake Error with MissingOutputException

I am trying to run STAR with Snakemake on a server.
My .smk file is this one:
import pandas as pd

configfile: 'config.yaml'

# Read the sample-to-batch dataframe mapping batch to sample (here with zip)
sample_to_batch = pd.read_csv("/mnt/DataArray1/users/zisis/STAR_mapping/snakemake_STAR_index/all_samples_test.csv", sep = '\t')

# rule specifying the output
rule all_STAR:
    input:
        #expand("{sample}/Aligned.sortedByCoord.out.bam", sample = sample_to_batch['sample'])
        expand(config['path_to_output']+"{sample}/Aligned.sortedByCoord.out.bam", sample = sample_to_batch['sample'])

rule STAR_align:
    # specify the input fastq files
    input:
        fq1 = config['path_to_output']+"{sample}_1.fastq.gz",
        fq2 = config['path_to_output']+"{sample}_2.fastq.gz"
    params:
        # location of the indexed genome and location to save the output
        genome = directory(config['path_to_reference']+config['ref_assembly']+".STAR_index"),
        prefix_outdir = directory(config['path_to_output']+"{sample}/")
    threads: 12
    output:
        config['path_to_output']+"{sample}/Aligned.sortedByCoord.out.bam"
    log:
        config['path_to_output']+"logs/{sample}.log"
    message:
        "--- Mapping STAR ---"
    shell:
        """
        STAR --runThreadN {threads} \
        --readFilesCommand zcat \
        --readFilesIn {input} \
        --genomeDir {params.genome} \
        --outSAMtype BAM SortedByCoordinate \
        --outSAMunmapped Within \
        --outSAMattributes Standard
        """
While STAR starts normally, at the end I get this error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 14 of /mnt/DataArray1/users/zisis/STAR_mapping/snakemake/STAR_snakefile_align.smk:
Job Missing files after 5 seconds:
/mnt/DataArray1/users/zisis/STAR_mapping/snakemake/001_T1/Aligned.sortedByCoord.out.bam
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job id: 1 completed successfully, but some output files are missing. 1
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I tried --latency-wait but it is not working.
In order to execute Snakemake I run the command
users/zisis/STAR_mapping/snakemake_STAR_index$ snakemake --snakefile STAR_new_snakefile.smk --cores all --printshellcmds
Technically I am in my directory with full access and permissions.
Do you think this is happening due to strange permissions in the execution of Snakemake, or when it tries to create the directories?
It creates the directory and the files, but I can see that there is a file named Aligned.sortedByCoord.out.bamAligned.sortedByCoord.out.bam.
Is this the problem?
I think your STAR command is missing the option that says which file and directory to write to; presumably it is writing the default filename to the current directory. Try something like:
rule STAR_align:
    input: ...
    output: ...
    ...
    shell:
        r"""
        # Derive the output directory from the output path and tell STAR to write there.
        # Note the trailing slash: STAR prepends the prefix verbatim to its output filenames.
        outprefix=`dirname {output}`
        STAR --outFileNamePrefix $outprefix/ \
            --runThreadN {threads} \
            etc...
        """
I am running the command from my directory, in which I am a sudo user.
I don't think that is the problem, but it is strongly recommended to work as a regular user and use sudo only in special circumstances (e.g. installing system-wide programs, though if you use conda you shouldn't need that).

How can I run multiple runs of pipeline with different config files - issue with lock on .snakemake directory

I am running a Snakemake pipeline from the same working directory but with different config files, and the input/output are in different directories too. The issue seems to be that, although both runs use data in different folders, Snakemake creates the lock on the pipeline folder because of the .snakemake folder and the lock folder within it. Is there a way to force separate .snakemake folders? Code example below.
Both runs are run from within /home/pipelines/qc_pipeline:
run 1:
/home/apps/miniconda3/bin/snakemake -p -k -j 999 --latency-wait 10 --restart-times 3 --use-singularity --singularity-args "-B /pipelines_test/QC_pipeline/PE_trimming/,/clusterTMP/testingQC/,/home/www/codebase/references" --configfile /clusterTMP/testingQC/config.yaml --cluster-config QC_slurm_roadsheet.json --cluster "sbatch --job-name {cluster.name} --mem-per-cpu {cluster.mem-per-cpu} -t {cluster.time} --output {cluster.output}"
run 2:
/home/apps/miniconda3/bin/snakemake -p -k -j 999 --latency-wait 10 --restart-times 3 --use-singularity --singularity-args "-B /pipelines_test/QC_pipeline/SE_trimming/,/clusterTMP/testingQC2/,/home/www/codebase/references" --configfile /clusterTMP/testingQC2/config.yaml --cluster-config QC_slurm_roadsheet.json --cluster "sbatch --job-name {cluster.name} --mem-per-cpu {cluster.mem-per-cpu} -t {cluster.time} --output {cluster.output}"
error:
Directory cannot be locked. Please make sure that no other Snakemake process is trying to create the same files in the following directory:
/home/pipelines/qc_pipeline
If you are sure that no other instances of snakemake are running on this directory, the remaining lock was likely caused by a kill signal or a power loss. It can be removed with the --unlock argument.
Maarten-vd-Sande correctly points to the --nolock option (+1), but in my opinion it's a very bad idea to use --nolock routinely.
As the error says, two snakemake processes are trying to create the same file. Unless the error is a bug in snakemake, I wouldn't blindly proceed and overwrite files.
I think it would be safer to assign to each snakemake execution its own execution directory and working directory, like:
topdir=`pwd`
mkdir -p run1
cd run1
snakemake --configfile /path/to/config1.yaml ...
cd $topdir
mkdir -p run2
cd run2
snakemake --configfile /path/to/config2.yaml ...
cd $topdir
mkdir -p run3
etc...
EDIT
Actually, it should be less clunky and probably better to use the --directory/-d option:
snakemake -d run1 --configfile /path/to/config1.yaml ...
snakemake -d run2 --configfile /path/to/config2.yaml ...
...
As long as the different pipelines do not generate the same output files, you can do it with the --nolock option:
snakemake --nolock [rest of the command]
Take a look here for a short doc about nolock.

Is it possible to print commands instead of rules in snakemake dry run?

Dry runs are a super important functionality of workflow languages. What I am looking for is mostly what would be executed if I ran the command, which is exactly what one sees when running make -n.
However, the analogous functionality, snakemake -n, prints something like
Building DAG of jobs...
rule produce_output:
output: my_output
jobid: 0
wildcards: var=something
Job counts:
count jobs
1 produce_output
1
The log contains pretty much everything except the commands that get executed. Is there a way to get the commands from snakemake?
snakemake -p --quiet -n
-p to print shell commands
-n for a dry run
--quiet to remove the rest
EDIT 2019-Jan
This solution seems broken in the latest versions of snakemake; use
snakemake -p -n
Avoid the --quiet flag suggested in eric-c's answer: at least in some situations the combination -p -n -q does not print the commands that would be executed, while -p -n does.

Error: "attribute "m_numa_nodes" is not a integer value." when running a qsub snakemake

I am trying to run a Snakemake workflow with cluster submission for RNA-seq analysis. Here is my script:
#path to gff
GFF = "RNASeq/data/ref_GRCh38.p2_top_level.gff3"

#sample names and classes
CTHP = 'CTHP1 CTHP2'.split()
CYP = 'CYP1 CYP2'.split()
samples = CTHP + CYP

rule all:
    input:
        'CTHP1/mapping_results/out_summary.gtf',
        'CTHP2/mapping_results/out_summary.gtf',
        'CYP2/mapping_results/out_summary.gtf',
        'CYP1/mapping_results/out_summary.gtf',

rule order_sam:
    input:
        '{samples}/mapping_results/mapped.sam'
    output:
        '{samples}/mapping_results/ordered.mapped.bam'
    threads: 12
    params: ppn="nodes=1:ppn=12"
    shell:
        'samtools view -Su {input} | samtools sort > {output}'

rule count_sam:
    input:
        bam='{samples}/mapping_results/ordered.mapped.bam'
    output:
        summary='{samples}/mapping_results/out_summary.gtf',
        abun='{samples}/mapping_results/abun_results.tab',
        cover='{samples}/mapping_results/coveraged.gtf'
    threads: 12
    params: ppn="nodes=1:ppn=12"
    shell:
        'stringtie -o {output.summary} -G {GFF} -C {output.cover} '
        '-A {output.abun} -p {threads} -l {samples} {input.bam}'
I want to submit each rule to a cluster. So, in the Terminal from the working directory, I do this:
snakemake --cluster "qsub -V -l {params.ppn}" -j 6
However, the jobs are not submitted and I get the following error:
Unable to run job: attribute "m_numa_nodes" is not a integer value.
Exiting.
Error submitting jobscript (exit code 1):
I have also tried to set the nodes variable directly when running the snakefile, like this:
snakemake --cluster "qsub -V -l nodes=1:ppn=16" -j 6
and, as expected, it gave me the same error. At this point I am not sure if it's the local cluster setup or something that I am not doing right in the snakefile. Any help would be great.
Thanks
The error does not look Snakemake-related. I am not an SGE/Univa expert so I cannot really help you, but m_numa_nodes is a parameter of the engine. Snakemake does not set it in any way, so it must come either from your local configuration or from one of the arguments you provide to qsub.
EDIT: 2017/04/12 -- Caught one of the errors in the Google Groups Post. Remove the comma from the last line of input in your "all" rule.
EDIT: 2017/04/13 -- Was advised the comma is not an issue.
The beauty of Snakemake is that sending it to the cluster just requires additional arguments. To determine whether it's a cluster issue or a Snakemake issue, I recommend running a dry run via
snakemake -n
A dry run will not submit any jobs, but it will return the list of jobs. This is a strong indicator of whether it's a Snakemake issue or a submission issue. I always perform dry runs during development to ensure my Snakemake code works before I start trying to submit it to the cluster, because cluster submission can be a whole different basket of issues.
As for your submission problems, I use the --drmaa flag within Snakemake to handle my submissions to the cluster. I realize this is not what you asked for, but I really enjoy its functionality, and I am just suggesting it as a robust alternative to your current approach.
https://pypi.python.org/pypi/drmaa OR https://anaconda.org/anaconda/drmaa
snakemake --jobs 10 --cluster-config input/config.json --drmaa "{cluster.clusterSpec}"
Inside config.json, my rules mostly all use this parameter set:
{
    "__default__": {
        "clusterSpec": "-V -S /bin/bash -o log/varScan -e log/varScan -l h_vmem=10G -pe ncpus 1"
    }
}
SGE Cluster Arguments = "-V -S /bin/bash -l h_vmem=10G -pe ncpus 1"
DRMAA Arguments = "-o log/varScan -e log/varScan"
P.S. I think you should also post the operating system (e.g. CentOS 5) and the cluster type (e.g. SGE) you are using.