Snakemake: dependencies which are not input

I'd like to know if there is a way in Snakemake to define a dependency which is not actually an input file.
What I mean is that some programs expect certain files to exist even though they are not provided on the command line.
Let's consider bwa as an example.
This is a rule from Johannes Köster's mapping rules:
rule bwa_mem_map:
    input:
        lambda wildcards: config["references"][wildcards.reference],
        lambda wildcards: config["units"][wildcards.unit]
    output:
        "mapping/{reference}/units/{unit}.bam"
    params:
        sample=lambda wildcards: UNIT_TO_SAMPLE[wildcards.unit],
        custom=config.get("params_bwa_mem", "")
    log:
        "mapping/log/{reference}/{unit}.log"
    threads: 8
    shell:
        "bwa mem {params.custom} "
        r"-R '@RG\tID:{wildcards.unit}\t"
        "SM:{params.sample}\tPL:{config[platform]}' "
        "-t {threads} {input} 2> {log} "
        "| samtools view -Sbh - > {output}"
Here, bwa expects the genome index files to exist even though they are not given as command-line arguments (their paths are deduced from the genome path).
Is there a way to tell Snakemake that the index file is a dependency, so that Snakemake will look in its rules to see if it knows how to generate this file?
I suppose you could still rewrite your rule inputs as:
rule bwa_mem_map:
    input:
        genome=lambda wildcards: config["references"][wildcards.reference],
        fastq=lambda wildcards: config["units"][wildcards.unit],
        index="foo.idx"
and adapt the run part of the rule accordingly.
Is it the best solution?
Thanks in advance.
Benoist

I think the only way snakemake handles dependencies between rules is through files, so I'd say you are doing it correctly when you put the index file explicitly as an input of your mapping rule, even though this file does not appear in the mapping command.
For what it's worth, I do the same for bam index files, which are an implicit dependency for some tools: I put both the sorted bam file and its index as input, but only use the bam file in the shell or run part, and I have a rule that generates both, with the two files as output.
Input and output files do not need to appear in the shell / run parts of a rule.
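For illustration, here is a minimal sketch of that pattern (the file names, the sort/index commands, and the chr1 region query are my own assumptions, not from the original workflow):

rule sort_and_index:
    input:
        "mapping/{prefix}.bam"
    output:
        bam="mapping/{prefix}.sorted.bam",
        bai="mapping/{prefix}.sorted.bam.bai"
    shell:
        "samtools sort -o {output.bam} {input} && samtools index {output.bam}"

rule extract_chr1:
    input:
        bam="mapping/{prefix}.sorted.bam",
        # listed only so Snakemake builds the index first;
        # samtools finds the .bai next to the bam on its own
        bai="mapping/{prefix}.sorted.bam.bai"
    output:
        "chr1/{prefix}.bam"
    shell:
        "samtools view -b {input.bam} chr1 > {output}"

Note that {input.bai} never appears in the shell command; it is there purely to express the dependency.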

Related

How to parallelize jobs for a list of files using snakemake (beginner question)

I am struggling with a very simple thing. As input to my snakemake pipeline I would like to take a directory, list its contents, and process each file from that directory in parallel. Naively, I thought something like this should work:
rule all:
    input:
        "in/{test}.txt"
    output:
        "out/{test}.txt"
    shell:
        "echo {input} >> {output}"
This ends with the error
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
All the resources I could find start with hard-coding the list of jobs in the script, which is something I want to avoid to keep the pipeline generic. The idea is to just point the pipeline to a directory with a list of files and let it do its job. Is this possible? It seems fairly simple and intuitive, but I couldn't find an example showing it.
I don't know what command you used for this rule, but the following workflow should achieve your purpose:
rule all:
    input:
        expand("out/{prefix}.txt", prefix=glob_wildcards("in/{test}.txt").test)

rule test:
    input:
        "in/{test}.txt"
    output:
        "out/{test}.txt"
    shell:
        "echo {input} >> {output}"
glob_wildcards is a snakemake function that finds all the files matching the specified pattern (in/{test}.txt in this case); .test then gives the list of strings that matched {test} in the filenames (for example, "ab" for "in/ab.txt").
expand then fills each of these values into the placeholder wrapped in curly brackets, generating the list of input file names for rule all.
So rule all asks for a list of files corresponding to all the txt files in the in folder, which makes snakemake execute rule test for every one of them.
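To see what the two functions return, you can try them in a plain Python session; this is a sketch assuming in/ contains ab.txt and cd.txt (both functions live in snakemake.io):

from snakemake.io import glob_wildcards, expand

# with in/ab.txt and in/cd.txt on disk:
matches = glob_wildcards("in/{test}.txt")
print(matches.test)
# ['ab', 'cd']
print(expand("out/{prefix}.txt", prefix=matches.test))
# ['out/ab.txt', 'out/cd.txt']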

Eliminating snakemake temporary directories

To save space on my PC, I am using the temp() function in snakemake. It eliminates all the {sample}.dup.bam files inside the dup directory, but not the directory itself. How can I improve this?
rule all:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES),
        "dup/bam_list"

rule samtools_markdup:
    input:
        sortbam="rg/{sample}.rg.bam"
    output:
        dupbam=temp("dup/{sample}.dup.bam")
    threads: 5
    shell:
        """
        samtools markdup -@ {threads} {input.sortbam} {output.dupbam}
        """

rule bam_list:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES)
    output:
        outlist="dup/bam_list"
    shell:
        """
        ls dup/*.bam > {output.outlist}
        """
The temp() function deletes all files that are no longer needed in the workflow.
Since you specify in rule all that you need to create the file dup/bam_list, snakemake will not delete this file, and thus not the dup directory either. I'm even surprised all the bam files get deleted, since you're asking for them in rule all.
Tips
You're defining a dependency between your rules:
rule samtools_markdup has to run before rule bam_list. Therefore, you do not need to ask for expand("dup/{sample}.dup.bam", sample=SAMPLES) in rule all. The latter files will be created (and deleted, since they are marked as temporary) in order to create the file dup/bam_list.
If you need to delete a directory, you can (probably) mark it as temp too, combined with the directory() function:
output: temp(directory("dup"))
but once more, if any file in this folder is given to rule all, it won't be deleted. Working with directories is always a bit tricky since snakemake uses files (and their timestamps) to define the DAG.
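Putting the first tip into practice, here is a minimal sketch of the trimmed-down workflow (assuming SAMPLES is defined as before); only the final list is requested, and the bam files exist just long enough to build it:

rule all:
    input:
        "dup/bam_list"

rule samtools_markdup:
    input:
        sortbam="rg/{sample}.rg.bam"
    output:
        dupbam=temp("dup/{sample}.dup.bam")
    threads: 5
    shell:
        "samtools markdup -@ {threads} {input.sortbam} {output.dupbam}"

rule bam_list:
    input:
        # requesting the bam files here (rather than in rule all) lets
        # temp() remove them as soon as bam_list has been written
        expand("dup/{sample}.dup.bam", sample=SAMPLES)
    output:
        "dup/bam_list"
    shell:
        "ls dup/*.bam > {output}"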

A WorkflowError with wildcards

I want to use snakemake to QC the fastq files, but it shows:
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files
or a rule without wildcards.
The code I wrote is like this:
SAMPLE = ["A","B","C"]

rule trimmomatic:
    input:
        "/data/samples/{sample}.fastq"
    output:
        "/data/samples/{sample}.clean.fastq"
    shell:
        "trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"
I'm a novice, so if anyone knows, please tell me. Thanks so much!
You could do one of the following, but chances are you want the latter one.
Explicitly specify output filenames on the command line:
snakemake data/samples/A.clean.fastq
This would run the rule to create the file data/samples/A.clean.fastq.
Specify the target output files to be created in the Snakefile itself using rule all. See the snakemake documentation to learn more about adding targets via rule all.
SAMPLE_NAMES = ["A","B", "C"]

rule all:
    input:
        expand("data/samples/{sample}.clean.fastq", sample=SAMPLE_NAMES)

rule trimmomatic:
    input:
        "data/samples/{sample}.fastq"
    output:
        "data/samples/{sample}.clean.fastq"
    shell:
        "trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"

Varying (known) number of outputs in Snakemake

I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
    input: '{id}.archive'
    output:
        lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
    input: '{id}.archive'
    output: '{id}/{outfile}'
    run:
        if os.path.isfile(output[0]):
            return
        ...
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.
Here it comes in handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
for id, contents in ARCHIVE_CONTENTS.items():
    rule:
        input:
            '{id}.tar.gz'.format(id=id)
        output:
            expand('{id}/{outfile}', id=id, outfile=contents)
        params:
            id=id
        shell:
            'tar -C {params.id} -xf {input}'
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
    input:
        '{id}.tar.gz'
    output:
        '{id}/{outfile}'
    shell:
        'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'
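The snippets above assume an ARCHIVE_CONTENTS dictionary. Since the file lists are known before the workflow starts, one possible way to build it (a sketch using Python's tarfile module; the .tar.gz naming is an assumption) is at parse time, before the rules:

import glob
import tarfile

# map each archive id to the list of member files it contains,
# built once when the Snakefile is parsed
ARCHIVE_CONTENTS = {}
for path in glob.glob('*.tar.gz'):
    archive_id = path[:-len('.tar.gz')]
    with tarfile.open(path) as tar:
        ARCHIVE_CONTENTS[archive_id] = tar.getnames()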

How can I pass a string in the config file into the output section?

I'm new to snakemake, and I've been trying to transform my shell-script-based pipeline into a snakemake-based one today, and I've run into a lot of syntax issues. I think most of the trouble I have is around getting all the files in particular directories and inferring output names from input names, since that's how I use shell scripts (for loops). In particular, I tried to use the expand function in the output section, and it always gave me an error.
After checking some example Snakefiles, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used, and if so, why? What if I want to pass a prefix defined in the config.yaml file as part of the output file, and that prefix cannot be inferred from the input file names; how can I achieve that, just like what I did below for the log section, where {runid} is my prefix?
Second question, about syntax: I tried to pass a user-defined id from the configuration file (config.yaml) into the log section, and it seems I have to use expand in the following form. Is there a better way of passing strings defined in the config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt", runid=config["run"])
where config.yaml contains
run:
    "run123"
Third question: I initially tried the following two methods, but they gave me errors. Does that mean that Python syntax is not followed inside the log (and probably the input and output) sections?
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of a small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]

rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)

rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"

rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
    "folder_with_configs/some_config.yaml"

rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember the Python rules about nesting different types of quotation marks.
config in the snakemake rule is a simple Python dictionary.
If you need to use an additional wildcard in a path, e.g. some_param, use doubled curly brackets so the f-string leaves the wildcard in place:
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
enjoy