Use wildcard in python shell rule - snakemake

I would like to do something like the following in Snakemake:
rule RULE:
    input: ...
    output: "{tool}/{file}"
    shell: lambda wildcards: command_for_tool[wildcards.tool]
possibly with the shell command wrapped in format(..., file=wildcards.file) to expand the {file} placeholder that will be inside the command from command_for_tool.
Currently I can do this with a run: block that calls shell(...), but I can't use that because I'm benchmarking the memory usage of the rule, and going via Python adds 30+ MB of overhead.
It is possible to put Python code that returns a string inside the shell: directive, but in that case I cannot figure out how to access the wildcards.
It is also possible to use wildcards directly in a string value, where they are substituted automatically, but that doesn't allow for the map lookup I need.
Is there a clean solution to this? (Currently I'm trying to work around it using params:.) To me it seems like an omission/inconsistency in how Snakemake works.

Following your own suggestion, a solution using params seems quite clean:
rule RULE:
    input:
        'in.txt',
    output:
        '{foo}.txt',
    params:
        cmd=lambda wc: command_for_tool[wc.foo],
    shell:
        """
        {params.cmd}
        """
although I agree that, for consistency with the input and params directives, shell: lambda wildcards: command_for_tool[wildcards.tool] ought to work as well.
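If the stored commands themselves contain a {file} placeholder, as suggested in the question, the lookup and the formatting can be combined inside the params function. A minimal sketch, assuming command_for_tool maps tool names to command templates such as "mytool --in {file}":

rule RULE:
    input: ...
    output: "{tool}/{file}"
    params:
        # look up the template for this tool and fill in the file wildcard
        cmd=lambda wc: command_for_tool[wc.tool].format(file=wc.file),
    shell:
        "{params.cmd}"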

Snakemake: Pass all wildcards to a single shell command

Snakemake supports generalizing rules with wildcards, like so:
rule conversion:
    input:
        "file_{name}.txt"
    output:
        "file_{name}.csv"
    shell:
        "python process.py {wildcards.name}"
Now say I have another rule like:
rule all:
    input:
        expand("file_{output_types}.csv", output_types=["A", "B"])
Calling the all rule, Snakemake will see that conversion can produce the requested files file_A.csv and file_B.csv (assuming the input files exist).
What ends up running are two shell commands:
python process.py A
python process.py B
Is it possible to run just a single shell command using wildcards?
What I would like to happen is to use the wildcards to run the command:
python process.py A B
(In my use-case, process.py has a long spin-up, so I want to avoid running it multiple times in a row.)
What about this?
names = ['A', 'B']

rule conversion:
    input:
        expand("file_{output_types}.txt", output_types=names),
    output:
        expand("file_{output_types}.csv", output_types=names),
    params:
        names=names,
    shell:
        "python process.py {params.names}"

How to get a rule that would work the same on a directory and its sub-directories

I am trying to make a rule that works the same on a directory and on any of its sub-directories (to avoid having to repeat the rule several times). I would also like access to the name of the sub-directory if there is one.
My approach was to make the sub-directory optional. Since wildcards can be made to accept an empty string by explicitly giving the ".*" pattern, I tried the following rule:
rule test_optional_sub_dir:
    input:
        "{adir}/{bdir}/a.txt"
    output:
        "{adir}/{bdir,.*}/b.txt"
    shell:
        "cp {input} {output}"
I was hoping that this rule would match both A/b.txt and A/B/b.txt.
However, A/b.txt doesn't match the rule. (Neither does A//b.txt, which would be the literal omission of bdir; I guess the double slash gets normalized away before the matching happens.)
The following rule works with both A/b.txt and A/B/b.txt:
rule test_optional_sub_dir2:
    input:
        "{path}/a.txt"
    output:
        "{path,.*}/b.txt"
    shell:
        "cp {input} {output}"
but the problem in this case is that I don't have easy access to the names of the directories in path. I could use pathlib.Path to break {path} up, but that seems overly complicated.
Is there a better way to accomplish what I am trying to do?
Thanks a lot for your help.
How exactly you want to use the sub-directory in your rule might determine the best way to do this. Maybe something like:
def get_subdir(path):
    dirs = path.split('/')
    if len(dirs) > 1:
        return dirs[1]
    else:
        return ''

rule myrule:
    input:
        "{dirpath}/a.txt"
    output:
        "{dirpath}/b.txt"
    params:
        subdir = lambda wildcards: get_subdir(wildcards.dirpath)
    shell:
        # use {params.subdir}
Of course, if your rule uses "run" or "script" instead of "shell" you don't even need that function and the subdir param, and can just figure out the subdir from the wildcard that gets passed into the script.
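For illustration, a minimal sketch of that variant (same file layout as above; the rule name is just for the example), where the sub-directory is derived from the wildcard inside a run: block:

rule myrule_run:
    input:
        "{dirpath}/a.txt"
    output:
        "{dirpath}/b.txt"
    run:
        # dirs[1] is the sub-directory if the matched path has one, else ''
        dirs = wildcards.dirpath.split('/')
        subdir = dirs[1] if len(dirs) > 1 else ''
        shell("cp {input} {output}")  # subdir is available here as a plain Python variable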
With some further fiddling, I found something that is close to what I want:
Let's say I want at least one directory and no more than two optional ones below it. The following works. The only downside is that opt_dir1 and opt_dir2 contain the trailing slash rather than just the name of the directory.
rule test_optional_sub_dir3:
    input:
        "{mand_dir}/{opt_dir1}{opt_dir2}a.txt"
    output:
        "{mand_dir}/{opt_dir1}{opt_dir2}b.txt"
    wildcard_constraints:
        mand_dir="[^/]+",
        opt_dir1="([^/]+/)?",
        opt_dir2="([^/]+/)?"
    shell:
        "cp {input} {output}"
Still interested in better approaches if anyone has one.
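If the main annoyance is just the trailing slash, one possible refinement of the rule above (an untested sketch) is to strip it in a params function so that the bare directory names are available in the shell command:

    params:
        # empty string when the optional directory is absent, otherwise its bare name
        subdir1=lambda wc: wc.opt_dir1.rstrip('/'),
        subdir2=lambda wc: wc.opt_dir2.rstrip('/')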

How to make Snakemake wrappers work with the {threads} variable

EDIT 27-07-2018: The wrapper does not account for threads. Furthermore, the syntax I'm trying here won't work, and as far as I can find, similar syntax is not supported. The answer comes from a cross-post on the Snakemake Google group and from meono's answer below.
I am using Snakemake and I'm quite happy with it. For some processes I'm using wrappers (e.g. FastQC and Trimmomatic). However, I notice that these wrappers do not take the {threads} variable into account. Can someone explain the proper syntax to make this work?
I've tried setting threads: 4 and then specifying {threads} at the proper place in the code (e.g. for FastQC: params: "--threads {threads}"). Likewise, I've tested setting {wildcards.threads} and also {snakemake.threads}. It looks like the wrapper code block is unable to "see" the value of the threads variable.
Please see the example below.
Note: I've looked at the Bitbucket snakemake-wrapper repo and readthedocs readme, but could not find an answer.
rule FastQC_preTrim:
    input:
        join(RAW_DATA, PATTERN_ANY)
    output:
        html="FastQC_pretrim/{sample}.html",
        zip="FastQC_pretrim/{sample}_fastqc.zip"
    threads: 4
    params:
        "--threads {wildcards.threads}"  # Also tried {threads}
    wrapper:
        "0.20.1/bio/fastqc"
(I would put this in a comment but don't have the rep.)
The fastqc wrapper doesn't account for threads in the rule. I think
params:
    "--threads 4"
would work for you.
I ran into the same issue while trying to use Snakemake wrappers, but failed when putting {threads} or any other wildcards into params.
A workaround is to explicitly define a "thread" parameter in your config.yaml and to read it from there. For example:
# In config.yaml
thread_use: 32
And
# In rule
rule some_rule_with_wrapper:
    input:
        "path/to/input"
    output:
        "path/to/output"
    params:
        extra="-t " + str(config["thread_use"])  # Remember to coerce int to string
    threads: 32
    wrapper:
        "some/wrapper"

How can one access Snakemake config variables inside `shell` section?

In snakemake I would like to access keys from the config from within the shell: directive. I can use {input.foo}, {output.bar}, and {params.baz}, but {config.quux} is not supported. Is there a way to achieve this?
rule do_something:
    input: "source.txt"
    output: "target.txt"
    params:
        # access config[] here. parameter tracking is a side effect
        tmpdir = config['tmpdir']
    shell:
        # using {config.tmpdir} or {config['tmpdir']} here breaks the build
        "./scripts/do.sh --tmpdir {params.tmpdir} {input} > {output}; "
I could assign the parts of the config I want to a key under params and then use a {params.x} replacement, but this has unwanted side effects (e.g. the parameter is saved in the Snakemake metadata under .snakemake/params_tracking). Using run: instead of shell: would be another workaround, but accessing the config value directly from the shell block would be most desirable.
"./scripts/do.sh --tmpdir {config[tmpdir]} {input} > {output}; "
should work here.
It is stated in the documentation:
http://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#standard-configuration
"For adding config placeholders into a shell command, Python string formatting syntax requires you to leave out the quotes around the key name, like so:"
shell:
    "mycommand {config[foo]} ..."

how can I pass a string in config file into the output section?

I'm new to Snakemake and have been trying to turn my shell-script-based pipeline into a Snakemake one today, and I've run into a lot of syntax issues. I think most of my trouble is around collecting all the files in particular directories and inferring output names from input names, since that's how I use shell scripts (for loops). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? What if I want to pass a prefix defined in the config.yaml file as part of the output file, where that prefix cannot be inferred from the input file names? How can I achieve that, like what I did below for the log section where {runid} is my prefix?
Second question, about syntax: I tried to pass a user-defined ID from the configuration file (config.yaml) into the log section, and it seems I have to use expand in the following form. Is there a better way of passing strings defined in the config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt", runid=config["run"])
where in the config.yaml
run:
    "run123"
Third question: I initially tried the following two methods, but they gave me errors. Does this mean that inside the log (and probably the input and output) sections, Python syntax is not followed?
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of a small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile: "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember Python's rules about nesting different types of quotes.
config in the Snakemake rule is a simple Python dictionary.
If you need to keep additional wildcards in the path, e.g. some_param, use doubled curly braces, so the f-string leaves a single {some_param} for Snakemake to fill in.
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
enjoy