How to make Snakemake wrappers work with the {threads} variable - snakemake

EDIT 27-07-2018: Wrapper does not account for threads. Furthermore, the syntax I`m trying here won't work and as far as I can find similar syntax is not supported. Answer is from cross-post on Snakemake Google groups and meono below.
I am using Snakemake and I'm quite happy with it. However, for some processes I`m using wrappers (i.e. FastQC and Trimmomatic). However, I notice that these wrappers do not take the {threads} variable into account. Can someone explain what the proper syntax is to make this work?
I've tried setting threads: 4 and then specifying {threads} at the proper place in the code (e.g. for FastQC: params: "--threads {threads}"). Likewise, I've tested setting {wildcards.threads} and also {snakemake.threads}. It looks like that wrapper codeblock is unable to "see" the value of the threads variable.
Please see the example below.
Note: I've looked at the Bitbucket snakemake-wrapper repo and readthedocs readme, but could not find an answer.
rule FastQC_preTrim:
input:
join(RAW_DATA, PATTERN_ANY)
output:
html="FastQC_pretrim/{sample}.html",
zip="FastQC_pretrim/{sample}_fastqc.zip"
threads: 4
params:
"--threads {wildcards.threads}" # Also tried {threads}
wrapper:
"0.20.1/bio/fastqc"

(would put this in a comment but don't have the reps)
fastqc wrapper doesn't account for threads in rule. I think
params:
"--threads 4"
would work for you.

I ran into the same issue while trying to use snakemake wrappers but failed when specifying {threads} or any other wildcards into params.
A workaround is to explicitly define a "thread" parameter in your config.yaml and calling it from there. For example:
# In config.yaml
thread_use: 32
And
# In rule
rule some_rule_with_wrapper:
input:
"path/to/input"
output:
"path/to/output"
params:
extra="-t "+str(config["thread_use"]) # Remember to coerce int to string
threads: 32
wrapper:
"some/wrapper"

Related

Use wildcard in python shell rule

I would like to do something like the following in Snakemake:
rule RULE:
input: ...
output: "{tool}/{file}"
shell: lambda wildcards: command_for_tool[wildcards.tool]
possibly with the shell command wrapped in a format(.., file=wildcards.file) to expand the {file} that will be inside the command_for_tool.
Currently I can do this using a run: calling shell(..), but I can't use this because I'm benchmarking the memory usage of the rule and going via python adds 30+MB overhead.
It is possible to use some python code inside the shell: rule that returns a string, but in this case I cannot figure out how to use the wildcards.
It is also possible to use wildcards directly in a string value, where they will be substituted automatically, but this doesn't allow for the map-lookup I need.
Is there a clean solution to this? (Currently I'm trying to work around it using params:.) To me it seems like an omission/inconsistency in how snakemake works.
Using your own suggestion, a solution using params seems quite clean:
rule RULE:
input:
'in.txt',
output:
'{foo}.txt',
params:
cmd= lambda wc: command_for_tool[wc.foo],
shell:
"""
{params.cmd}
"""
although I can see that for consistency with the input and params directive, also shell: lambda wildcards: command_for_tool[wildcards.tool] should work.

Snakemake expand function alternative

I have been having some difficulty for some time producing a workflow with many inputs and a single output, such as is shown below. The code below works fine to some extent, however when there are too many input files the concatenate step invariably fails:
rule generate_text:
input:
"data/{name}.csv"
output:
"text_files/{name}.txt"
shell:
"somecommand {input} -o {output}"
rule concatenate_text :
input:
expand("text_files/{name}.txt", name=names)
output:
"summaries/summary.txt"
shell:
"cat {input} > {output}"
I have done some digging and found that this is attributable to a limitation on the number of characters that can be put in a single command. I am working with increasingly large numbers of inputs and therefore the above solution is not scalable.
Can anybody please propose any solutions to this issue? I haven't been able to find any online.
Ideally the solution wouldn't be one limited to just cat or other shell commands and could be employed within the structure of a rule in cases where --use-conda can be employed. My current fix involves using an onsuccess script as follows, but this doesn't allow use of --use-conda and rule specific conda environments.
One handy thing about the shell command is that you can feed it snakemake variables, but its not quite flexible enough for my purposes due to the aforementioned conda issue.
onsuccess:
shell("cat text_files/*.txt > summaries/summary.txt")

Accessing file path from a config.yaml in Snakemake

I'm working with Snakemake for NGS analysis. I have a list of input files, stored in a YAML file as follows:
DATASETS:
sample1: /path/to/input/bam
.
.
A very simplified skeleton of my Snakemake file, as described earlier in Snakemake: How to use config file efficiently and https://www.biostars.org/p/406452/, is as follows:
rule all:
input:
expand("report/{sample}.xlsx", sample = config["DATASETS"])
rule call:
input:
lambda wildcards: config["DATASETS"][wildcards.sample]
output:
"tmp/{sample}.vcf"
shell:
"some mutect2 script"
rule summarize:
input:
"tmp/{sample}.vcf"
output:
"report/{sample}.xlsx"
shell:
"processVCF.py"
This complains about missing input files for rule all. I'm really not too sure what I am missing out here: Could someone perhaps point out where I can start looking to try to solve my problem?
This problem persists even when I execute snakemake -n tmp/sample1.vcf, so it seems the problem is related to the inability to pass the input file to the rule call. I have a nagging feeling that I'm really missing something trivial here.

Snakemake › Access multiple keys from config file

I have the question about the proper handling of the config file. I'm trying to solve my issue for a couple of days now but with the best will, I just can't find out how to do it. I know that this question is maybe quite similar with all the others here and I really tried to use them - but I didn't really get it. I hope that some things about how snakemake works will be more clear when I solved this problem.
I'm just switching to snakemake and I thought I just can easily convert my bash script. To get familiar with snakemake I started trying a simple Data-Processing pipeline. I know I could solve my case while defining every variable within the snakefile. But I want to use an external config file.
First is to say, for better understanding I decided just to post the code which I thought would work somehow. I already played around with different versions for a "rule all" and the "lambda" functions, but nothing worked so far and it just would be confusing. I'm really a bit embarrassed and confused about why I can't get this working. The variable differs from the key because I aways had a version where I redefine the variable, like:
$ sample=config["samples"]
I would be incredibly thankful for an example code.
What I'd like to have is:
The config file:
samples:
- SRX1232390
- SRX2312380
names:
- SomeData
- SomeControl
adapters:
- GATCGTAGC
- GATCAGTCG
And then I thought I can just call the keys like different variables.
rule download_fastq:
output:
"fastq/{name}.fastq.gz"
shell:
"fastq-dump {wildcards.sample} > {output}"
later there will be more rules, so I thought for them I also just need a key:
rule trimming_cutadapt:
input:
"fastq/{name}.fastq"
output:
"ctadpt_{name}.fastq"
shell:
"cutadapt -a {adapt}"
I also tried something with a config file like this:
samples:
Somedata: SRX1232131
SomeControl: SRX12323
But in the end I also didn't find the final solution nor would I know how to add a third "variable" then.
I hope it is somehow understandable what I want to have. It would be very awesome if someone could help me.
EDIT:
Ok - I reworked my code and tried to dig into everything. I fear my understanding lacks in connecting the things I read in this case. I would really appreciate some tips which will probably help me to understand my confusion.
First of all: Rather than try to download data from a pipeline I decided to do this in a config step. I tried out two different versions now:
Based on this answer I tried version one. I like the version with the two files. But I'm stuck in how to deal with the variables now in things like using them with the lambda function or everything where you normally would write "config["sample"]".
So my problem here is that I don't knwo ho to proceed or how the correct syntax is now to call the variables.
#version one
configfile: "config.yaml"
sample_file = config["sample_file"]
import pandas as pd
sample = pd.read_table(sample_file)['samples']
adapt = pd.read_table(sample_file)['adapters']
rule trimming_cutadapt:
input:
data=expand("../data/{sample}.fastq", name = pd.read_table(sample_file)['names']),
lambda wildcards: ???
output:
"trimmed/ctadpt_{sample}.fastq"
shell:
"cutadapt -a {adapt}"
So I went back to try to understand using and defining the wildcards. So (among other things) I looked into the example snakefile and the example rules of Johannes. And of course into the man. Oh and the Thing about the zip function.
At least I don't get an error anymore that it can't deal with wildcards or whatever. Now it's just doing nothing. And I can't find out why because I don't get any information. Additionaly I marked some points which I don't understand.
#version two
configfile: "config_ChIP_Seq_Pipeline.yaml"
rule all:
input:
expand("../data/{sample}.fastq", sample=config["samples"])
#when to write the lambda or the expand in a rule all and when into the actual rule?
rule trimming_cutadapt:
input:
"../data/{sample}.fastq"
params:
adapt=lambda wildcards: config[wildcards.sample]["adapt"] #why do I have to write .samle? when I have to use wildcard.XXX in the shell part?
output:
"trimmed/ctadpt_{sample}.fastq"
shell:
"cutadapt -a {params.adapt}"
As a testfile I used this one.
My configfile in version 1:
sample_file: "sample.tab"
and the tab file:
samples names adapters
test_1 input GACCTA
and the configfile from version two:
samples:
- test_1
adapt:
- GTACGTAG
Thanks for your help and patients!
Cheers
You can look at this post to see how to store and access sample information.
Then you can look at Snakemake documentation here, more specifically at the zip function, which might help you as well.

how can I pass a string in config file into the output section?

New to snakemake and I've been trying to transform my shell script based pipeline into snakemake based today and run into a lot of syntax issues.. I think most of the trouble I have is around getting all the files in a particular directories and infer output names from input names since that's how I use shell script (for loop), in particular, I tried to use expand function in the output section and it always gave me an error.
After checking some example Snakefile, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used and if so, why? What if I want to pass a prefix defined in config.yaml file as part of the output file and that prefix can not be inferred from input file names, how can I achieve that, just like what I did below for the log section where {runid} is my prefix?
Second question about syntax: I tried to pass a user defined id in the configuration file (config.yaml) into the log section and it seems to me that here I have to use expand in the following form, is there a better way of passing strings defined in config.yaml file?
log:
expand("fastq/fastqc/{runid}_fastqc_log.txt",runid=config["run"])
where in the config.yaml
run:
"run123"
Third question: I initially tried the following 2 methods but they gave me errors so does it mean that inside log (probably input and output) section, Python syntax is not followed?
log:
"fastq/fastqc/"+config["run"]+"_fastqc_log.txt"
log:
"fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]
rule all:
input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)
rule peak_calling:
input: control="{control}.sam", treatment="{treatment}.sam"
output: "{treatment}_vs_{control}.bed"
shell: "touch {output}"
rule mapping:
input: "{samples}.fastq"
output: "{samples}.sam"
shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
As for the last part, the right way to put it would be the first one:
log:
"fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
input: "{samples}.fastq"
output: "{samples}.sam"
log: "{samples}.log"
shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is you folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
"folder_with_configs/some_config.yaml"
rule one_to_rule_all:
output:
f"results/{config['var']}.file"
shell:
"touch {output[0]}"
Do remember about python rules related to nesting different types of apostrophes.
config in the smake rule is a simple python dictionary.
If you need to use additional variables in a path, e.g. some_param, use more curly brackets.
rule one_to_rule_all:
output:
f"results/{config['var']}.{{some_param}}"
shell:
"touch {output[0]}"
enjoy