How to pass a list or dictionary using Snakemake's command line config option

I want to pass a list of filenames to produce through my Snakemake workflow by using the --config CLI option. What's the syntax I need for that?

Just specify the list or dict in YAML (~Python) syntax on the command line:
snakemake -c1 --config 'foo={"a":"a.txt", "b":"b.txt"}' 'bar=["file1.txt","file2.csv"]'
You can then access this list in a sample Snakemake rule as follows:
rule all:
    input:
        baz=config["foo"]["a"],
        qux=config["foo"]["b"],
        quux=config["bar"],
As usual, the name before the = will be how you access the data after the = through the config variable. Multiple config items can be passed through space separation as shown above.
Here is the full parsing order of the --config CLI option (as of November '21):
--config key=value [key=value […]] sets top-level keys to the given
value, which is parsed, in order, as either: int(value), float(value),
literal True or False (a bool), a YAML-encoded value, or finally
str(value). Only the last occurrence of the option is used if it is
given more than once.
https://github.com/tsibley/blab-standup/blob/master/2021-11-04.md#specifying-config
The full post is a good reference of how Snakemake config handling works (the documentation is a bit lacking at times).
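The parsing order quoted above can be mimicked in plain Python. This is only a sketch of the described behaviour, not Snakemake's own code: the function names parse_bool, parse_value and parse_config are made up for illustration, and json.loads stands in for the YAML loader (an assumption that works here because the JSON-style lists and dicts used on the command line are also valid YAML).

```python
import json


def parse_bool(text):
    """Accept only the literal strings True and False."""
    if text == "True":
        return True
    if text == "False":
        return False
    raise ValueError(f"not a bool literal: {text!r}")


def parse_value(text):
    """Try each parser in the quoted order and return the first success."""
    # json.loads is a stand-in (assumption) for the YAML loader.
    for parser in (int, float, parse_bool, json.loads, str):
        try:
            return parser(text)
        except ValueError:
            continue  # fall through to the next, more permissive parser


def parse_config(items):
    """Turn ['key=value', ...] into a dict of top-level config keys."""
    config = {}
    for item in items:
        key, _, value = item.partition("=")
        config[key] = parse_value(value)
    return config


config = parse_config([
    'foo={"a":"a.txt", "b":"b.txt"}',
    'bar=["file1.txt","file2.csv"]',
    "threads=4",
    "label=run123",
])
print(config["foo"]["a"])  # a.txt
print(config["bar"])       # ['file1.txt', 'file2.csv']
```

Because int and float are tried first, a value such as 4 arrives as a number rather than a string, while anything no earlier parser accepts falls through to str.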

Related

Using SnakeMake, how to pass SLURM flags with dashes in their names?

I'm using Snakemake to execute rules on a SLURM cluster.
One of the mandatory flags for this cluster is ntasks-per-node, which in a batch script would be specified as e.g. #SBATCH --ntasks-per-node=5. My understanding is that I need to specify this in a snakemake rule as
rule rule_name:
    ...
    resources:
        time='00:00:30', #30 sec
        ntasks-per-node=1
    ...
However, running this Snakefile I get
SyntaxError in line 14 of .../Snakefile:
keyword can't be an expression
because there are dashes in the name. But as far as I can tell, replacing the dashes with underscores doesn't work. What should I do here?
(I'm using the SLURM profile here if that matters)

Try quoting. But more importantly, only the resources that are defined in the RESOURCE_MAPPING variable in the slurm_submit.py will be picked up, and the default cookiecutter does not include an ntasks-per-node argument. Hence, quoting alone won't solve the issue.
There are multiple options.
Edit the slurm_submit.py. Add the ntasks-per-node argument and provide whatever alias(es) you would like to use.
RESOURCE_MAPPING = {
    "time": ("time", "runtime", "walltime"),
    "mem": ("mem", "mem_mb", "ram", "memory"),
    "mem-per-cpu": ("mem-per-cpu", "mem_per_cpu", "mem_per_thread"),
    "nodes": ("nodes", "nnodes"),
    # some suggested aliases
    "ntasks-per-node": ("ntasks-per-node", "ntasks_per_node", "ntasks")
}
I would only do this if there actually are situations where you might change this value.
Define an invocation-level configuration. Snakemake's --cluster-config parameter can still be used to provide additional configuration settings. In this case, create a file like
# myslurm.yaml
__default__:
    ntasks-per-node: 1
Then use it with
snakemake --profile slurm --cluster-config myslurm.yaml
This is likely the least work to get going.
Define a global value in the profile. The cookiecutter profile generator provides multiple options to define global options that don't often need to change for the profile.
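To see why the first option lets you write ntasks_per_node (with underscores) in a rule, here is a rough sketch of how an alias table like RESOURCE_MAPPING could be applied. It is not the profile's actual code: to_sbatch_options and format_sbatch_flags are hypothetical helpers, and the real slurm_submit.py may resolve aliases differently.

```python
# Alias table in the same shape as RESOURCE_MAPPING above (shortened).
RESOURCE_MAPPING = {
    "time": ("time", "runtime", "walltime"),
    "ntasks-per-node": ("ntasks-per-node", "ntasks_per_node", "ntasks"),
}


def to_sbatch_options(resources, mapping=RESOURCE_MAPPING):
    """Map rule resources to sbatch option names via the alias table."""
    options = {}
    for sbatch_name, aliases in mapping.items():
        for alias in aliases:
            if alias in resources:
                options[sbatch_name] = resources[alias]
                break  # the first matching alias wins
    return options


def format_sbatch_flags(options):
    """Render the options as --name=value flags."""
    return " ".join(f"--{name}={value}" for name, value in options.items())


# Resources as they might be written in a rule, using a valid Python name.
rule_resources = {"time": "00:00:30", "ntasks_per_node": 1, "unmapped": 3}
options = to_sbatch_options(rule_resources)
print(format_sbatch_flags(options))  # --time=00:00:30 --ntasks-per-node=1
```

Resources without an entry in the table (unmapped above) are dropped in this sketch, which matches the observation that only mapped resources are picked up.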

Can I add a file to rule all: which is not defined in output

A number of commands silently produce extra files that are not defined in the rule's output section.
When I try to make sure these are produced by adding them to 'rule all:', a re-run of the workflow fails because the files are not found in any rule's output list.
Can I add a supplementary file (not present as {output}) to the 'rule all:'?
Thanks
E.g. STAR index produces a number of files in a folder defined by command arguments; checking for the presence of the folder does not mean that indexing has worked out normally.
Added for clarity: the STAR index example takes 'star_idx_75' as its output argument and creates a folder of that name in which all the following files are stored (their number may vary depending on the index type).
chrLength.txt
chrName.txt
chrNameLength.txt
chrStart.txt
exonGeTrInfo.tab
exonInfo.tab
geneInfo.tab
Genome
genomeParameters.txt
SA
SAindex
sjdbInfo.txt
sjdbList.fromGTF.out.tab
sjdbList.out.tab
transcriptInfo.tab
What I wanted was to check that they are all present, BUT none of them is used to build the command itself, and if I require them in the rule all:, a rerun breaks because they are not in any Snakemake {output} definition.
This is why I asked whether I could create 'fake' output variables that are not 'used' for running a command but allow placing the corresponding items in the 'rule all:'. Am I clearer now? :-)

Can I add a supplementary file (not present as {output}) to the 'rule all:'?
I don't think so, at least not without resorting to some convoluted solution. Every file in rule all (or, more precisely, the first rule) must have a rule that lists it in output.
If you don't want to repeat a long list, why not do something like this?
star_index= ['ref.idx1', 'ref.idx2', ...]
rule all:
    input:
        star_index
rule make_index:
    input:
        ...
    output:
        star_index
    shell:
        ...

It's probably better to list them all in the rule's output, but only use the relevant ones in subsequent rules. You could also look into using directory() which could possibly fit here.
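One way to list every index file without typing the paths twice is to build the list once in plain Python and reuse it in both rule all and the indexing rule's output. The sketch below assumes the file names from the question and the star_idx_75 folder; INDEX_DIR and STAR_INDEX_FILES are illustrative names, and the exact set of files may differ for other index types.

```python
# Folder and file names taken from the question; adjust for your index type.
INDEX_DIR = "star_idx_75"
STAR_INDEX_FILES = [
    "chrLength.txt", "chrName.txt", "chrNameLength.txt", "chrStart.txt",
    "exonGeTrInfo.tab", "exonInfo.tab", "geneInfo.tab", "Genome",
    "genomeParameters.txt", "SA", "SAindex", "sjdbInfo.txt",
    "sjdbList.fromGTF.out.tab", "sjdbList.out.tab", "transcriptInfo.tab",
]

# Full paths, usable as the shared list for both input and output.
star_index = [f"{INDEX_DIR}/{name}" for name in STAR_INDEX_FILES]

print(len(star_index))  # 15
print(star_index[0])    # star_idx_75/chrLength.txt
```

With every file declared as output, Snakemake itself reports an error if the indexing command finishes without creating one of them.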

How to overwrite a parameter from the configfile that is not at the first level in a snakemake call?

I can't figure out the syntax. For example:
snakemake --configfile myconfig.yml --config myparam="new value"
This will overwrite the value of config["myparam"] from the yaml file upon workflow execution.
But what if I want to overwrite config["myparam"]["otherparam"]?
Thanks!

This is currently not possible. A general remark: Note that --config should be used as little as possible, because it defeats the goal of reproducibility and data provenance (you would have to remember the command line with which you invoked snakemake).

How to gather files from subdirectories to run jobs in Snakemake?

I am currently working on a project where I am struggling with the following issue.
My current directory structure is
/shared/dir1/file1.bam
/shared/dir2/file2.bam
/shared/dir3/file3.bam
I want to convert various .bam files to fastq in the results directory
results/file1_1.fastq.gz
results/file1_2.fastq.gz
results/file2_1.fastq.gz
results/file2_2.fastq.gz
results/file3_1.fastq.gz
results/file3_2.fastq.gz
I have the following code:
END=["1","2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")
rule all:
    input: expand( "/results/{sample}_{end}.fastq.gz",sample=files, end=END)
rule bam_to_fq:
    input: "{dir}/{sample}.bam"
    output: left="/results/{sample}_1.fastq", right="/results/{sample}_2.fastq"
    shell: "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"
This outputs the following error:
Wildcards in input files cannot be determined from output files:
'dir'
Any help would be appreciated

You're just missing an assignment for "dir" in the input directive of the rule bam_to_fq. In your code, Snakemake has to determine "{dir}" from the output of the same rule, because you have it set up as a wildcard. Since "dir" does not appear anywhere in your output directive, you received an error.
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq",
Rule of thumb: every wildcard used in input must also appear in output.
rule all:
    input:
        expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)
rule bam_to_fq:
    input:
        expand("{dir}/{{sample}}.bam", dir=dirs)
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
NOTES
The sample variable in the input directive now requires double {}, because doubled braces are how you keep a wildcard unexpanded inside an expand.
dir is no longer a wildcard; it is explicitly set to the list of directories determined by the glob_wildcards call and assigned to the variable "dirs", which I am assuming you create earlier in your script, since the assignment of one of the variables already succeeds in your rule all input ("sample=files").
I like and recommend easily differentiable variable names. I'm not a huge fan of the variable names "dir" and "dirs". They make you prone to small spelling errors. Consider changing them to "dirLIST" and "dir", or anything else really. I just fear that one day someone will miss an 's' somewhere and it will be frustrating to debug. I'm personally guilty, and thus a slight hypocrite, as I do use "sample=samples" in my core Snakefile. It has caused me minor stress, which is why I make this recommendation. It also makes it easier for others to read your code.
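The brace doubling in the first note follows ordinary Python string formatting, so it can be checked without Snakemake. The sketch below uses str.format as a stand-in for expand (an approximation, not Snakemake's implementation), with hypothetical directory names.

```python
# Hypothetical directory list, as glob_wildcards might return it.
dirs = ["dir1", "dir2", "dir3"]

# Single braces are filled in now; doubled braces survive as a wildcard.
pattern = "{dir}/{{sample}}.bam"
inputs = [pattern.format(dir=d) for d in dirs]
print(inputs)
# ['dir1/{sample}.bam', 'dir2/{sample}.bam', 'dir3/{sample}.bam']

# Later, the remaining wildcard is filled in per job.
resolved = [path.format(sample="file1") for path in inputs]
print(resolved)
# ['dir1/file1.bam', 'dir2/file1.bam', 'dir3/file1.bam']
```

Note that every sample ends up requesting a BAM file in every directory, so this form only fits when each sample really exists in all of them.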
EDIT 1: Adding to my response, as I had initially missed the requirement for key-value matching of dir and sample.
I recommend keeping the path and the sample name in separate variables. Two approaches I can think of:
Keep using glob_wildcards to make a blanket search for all possible variables, and then use a Python function to validate which path+file combinations are legitimate.
Drop the usage of glob_wildcards. Propagate the directory name as a wildcard variable, {dir}, throughout your rules. Just set it as a sub-directory of "results". Use pandas to pass known key-value pairs listed in a file to the rule all. Initially I suggest generating the key-value pairs file manually, but eventually its generation could just be a rule upstream of the others.
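The first approach could look roughly like the sketch below. It assumes that sample names are unique across directories and that the two lists are index-aligned, as the lists returned by glob_wildcards are. SAMPLE_TO_DIR and bam_for_sample are illustrative names, and SimpleNamespace merely imitates the wildcards object Snakemake would pass in.

```python
from types import SimpleNamespace

# Hypothetical values, shaped like the result of
# glob_wildcards("/shared/{dir}/{file}.bam"): two index-aligned lists.
dirs = ["dir1", "dir2", "dir3"]
files = ["file1", "file2", "file3"]

# Pair each sample with the directory it was found in.
SAMPLE_TO_DIR = dict(zip(files, dirs))


def bam_for_sample(wildcards):
    """Return the one BAM path that belongs to wildcards.sample."""
    sample = wildcards.sample
    return f"/shared/{SAMPLE_TO_DIR[sample]}/{sample}.bam"


# Stand-in for the wildcards object of a single job.
print(bam_for_sample(SimpleNamespace(sample="file2")))
# /shared/dir2/file2.bam
```

In a Snakefile, such a function can be given directly as the input of bam_to_fq (input: bam_for_sample), so that only {sample} has to appear in the output.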
Generalizing bam_to_fq a little bit... utilizing an external config, something like....
from pandas import read_table
rule all:
    input:
        expand("/results/{sample[1][dir]}/{sample[1][file]}_{end}.fastq.gz", sample=read_table(config["sampleFILE"], sep=" ").iterrows(), end=['1','2'])
rule bam_to_fq:
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{dir}/{sample}_1.fastq",
        right="/results/{dir}/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
sampleFILE
dir file
dir1 file1
dir2 file2
dir3 file3

how can I pass a string in config file into the output section?

I'm new to Snakemake and have been trying to convert my shell-script-based pipeline to Snakemake today, and I ran into a lot of syntax issues. I think most of my trouble is around getting all the files in a particular directory and inferring output names from input names, since that's how my shell script works (a for loop). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? What if I want to pass a prefix defined in the config.yaml file as part of the output file name, and that prefix cannot be inferred from the input file names? How can I achieve that, just like what I did below for the log section, where {runid} is my prefix?
Second question, about syntax: I tried to pass a user-defined ID from the configuration file (config.yaml) into the log section, and it seems to me that I have to use expand in the following form. Is there a better way of passing strings defined in the config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt",runid=config["run"])
where in the config.yaml
run:
    "run123"
Third question: I initially tried the following 2 methods but they gave me errors. Does that mean that Python syntax is not followed inside the log (and probably input and output) section?
log:
    "fastq/fastqc/"+config["run"]+"_fastqc_log.txt"
log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"

Here is an example of small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]
rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)
rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
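What expand contributes to the final target can be approximated in plain Python: it fills the pattern with every combination of the given values and returns concrete file names. The helper expand_pairs below is a hypothetical stand-in built on itertools.product, not Snakemake's implementation.

```python
from itertools import product

TREATMENT = ["sample2"]
CONTROL = ["sample1"]


def expand_pairs(pattern, treatment, control):
    """Fill the pattern with every treatment/control combination."""
    return [
        pattern.format(treatment=t, control=c)
        for t, c in product(treatment, control)
    ]


targets = expand_pairs("{treatment}_vs_{control}.bed", TREATMENT, CONTROL)
print(targets)  # ['sample2_vs_sample1.bed']
```

The result is a list of concrete names with no wildcards left, which is why it suits a target rule: Snakemake matches each name against the output patterns of the other rules to work out their wildcard values.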
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!

You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
    "folder_with_configs/some_config.yaml"
rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Remember Python's rules for nesting different kinds of quotes inside an f-string.
config in a Snakemake rule is a plain Python dictionary.
If you need to use additional variables in a path, e.g. a wildcard some_param, double the curly brackets so the f-string leaves them in place.
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
enjoy
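The two f-string forms above can be checked in plain Python, with an ordinary dictionary standing in for Snakemake's config (the value is the one from the example config file).

```python
# Stand-in for the dictionary Snakemake builds from some_config.yaml.
config = {"var": "value"}

# The config value is substituted immediately by the f-string.
static_path = f"results/{config['var']}.file"

# Doubled braces survive as single braces, leaving a wildcard in the path.
wildcard_path = f"results/{config['var']}.{{some_param}}"

print(static_path)    # results/value.file
print(wildcard_path)  # results/value.{some_param}
```

The second string is what Snakemake receives as the output pattern, so some_param is still available as a wildcard.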
