An WorkflowError with wildcards

An WorkflowError with wildcards - error-handling

I want to use snakemake to QC the fastq file, but it show that :
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files
or a rule without wildcards.
The code I wrote is like this
SAMPLE = ["A","B","C"]
rule trimmomatic:
input:
"/data/samples/{sample}.fastq"
output:
"/data/samples/{sample}.clean.fastq"
shell:
"trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"
I'm a novice, if anyone know that, please tell me. Thanks so much!

You could do one of the following, but chances are you want to do the latter one.
Explicitly specifiy output filenames via commandline:
snakemake data/samples/A.clean.fastq
This would run rule to create file data/samples/A.clean.fastq
Specify target output files to be created in Snakefile itself using rule all. See here to learn more about adding targets via rule all
SAMPLE_NAMES = ["A","B", "C"]
rule all:
input:
expand("data/samples/{sample}.clean.fastq", sample=SAMPLE_NAMES)
rule trimmomatic:
input:
"data/samples/{sample}.fastq"
output:
"data/samples/{sample}.clean.fastq"
shell:
"trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"

Related

Possible to have wildcard as dictionary key in snakemake rule output?

I need my output to be dynamic to input, the best way I thought to do is this by having output based on a dictionary. He is a stripped down example:
config.yaml:
{'names' : ['foo', 'bar']}
Snakefile:
configfile: "config.yaml"
rule all:
input:
expand("{name}",name=config['names'])
rule make_file:
output:
lambda wildcards: config[wildcards.name]
shell:
"""
touch {output}
"""
I get Only input files can be specified as functions
I also tried adding making output of rule make_file config["{name}"]

Snakemake has the opposite logic: that is the output that is used to define the actual wildcards values which then are used in other sections.
If you need the output to depend on the input, you may define several rules, where each one would define one possible pattern of how an input is being converted to output. There are other possibilities to add a dynamic behavior like checkpoints and dynamic files. Anyway there is no one size fits all solution for every problem, so you need to clarify what you are trying to achieve.
Anyway, when designing a Snakemake pipeline you should think in terms of "what target is my goal", but not in terms of "what could I make out of what I have".

Maybe you are making things more complicated than necessary or you simplified the example too much (or I'm misunderstanding the question...).
This will create files foo and bar by running rule make_file on each item in config['names']:
rule all:
input:
expand("{name}", name= config['names'])
rule make_file:
output:
'{name}'
shell:
"""
touch {output}
"""

I need my output to be dynamic to input
I'll add that you can use multiple rules to have snakemake select what to run based on the input to generate a single output file. This may be what you are getting at, but do provide more details if not:
ruleorder:
rule1 > rule2 > rule3
rule rule1:
output: 'foo'
input: 'bar'
shell: 'echo rule1 > {output}'
rule rule2:
output: 'foo'
input: 'baz'
shell: 'echo rule2 > {output}'
rule rule3:
output: 'foo'
shell: 'echo rule3 > {output}'
The rule order resolves ambiguity if bar and baz are both present. If neither are present than rule3 runs as a default (with no necessary input).
I've used something similar in cases where say a fasta file can be generated from a vcf, a bam file, or downloaded as a fallback.

Eliminating snakemake temporary directories

In order to save the space on PC, I am working using a temp() function in snakemake. This is eliminating all the files {sample}.dup.bam inside the dup directory,but not the directory itself. How to improve this?
rule all:
input:
expand("dup/{sample}.dup.bam", sample=SAMPLES),
"dup/bam_list"
rule samtools_markdup:
input:
sortbam ="rg/{sample}.rg.bam"
output:
dupbam = temp("dup/{sample}.dup.bam")
threads: 5
shell:
"""
samtools markdup -# {threads} {input.sortbam} {output.dupbam}
"""
rule bam_list:
input:
expand("dup/{sample}.dup.bam", sample=SAMPLES)
output:
outlist = "dup/bam_list"
shell:
"""
ls dup/*.bam > {output.outlist}
"""

The temp() function deletes all files which are not needed anymore in the workflow.
Since you specify in rule all that you need to create the file dup/bam_list, snakemake will not the delete this file, and thus, the dup directory. I'm even surprised all the bam files get deleted since you're asking for them in rule all.
Tips
You're defining a dependency between your rules:
rule samtools_markdup is needed before running rule bam_list. Therefore, you do not need to ask for expand("dup/{sample}.dup.bam", sample=SAMPLES) in rule all. The lasts will be created (and deleted as marked as temporary files) in order to create the file dup/bam_list.
If you need to delete a directory you can (probably) mark it as temp too as well as the directory() function:
output: temp(directory("dup"))
but once more, if any file in this folder is given to rule all, it won't be deleted. Working with directories is always a bit tricky since snakemake uses files (and their timestamps) to define the DAG.

Specify input and output files in Snakefile

I'm new to Snakemake and I want to make a pipeline that takes a given input text file and concatenates its content to a given output file. However I want to be able to specify the names of both the input and output files at run time, so neither file names are hardcoded in the Snakefile. Right now all I can come up with is:
rule all:
input:
"{input}.txt",
"{output}.txt"
rule output_files:
input:
"{input}.txt"
output:
"{output}.txt"
shell:
"cat {input}.txt > {output}.txt"
I tried running this with "snakemake input1.txt output.txt" but I got the error:
Building DAG of jobs...
WildcardError in line 6 of Snakefile:
Wildcards in input files cannot be determined from output files:
'input'
Any suggestions would be greatly appreciated.

In your example you actually copy a single input file into an output file using a cat shell command. That could be understood as an intention to concatenate several inputs into one output:
rule concatenate:
input:
"input1.txt",
"input2.txt"
output:
"output.txt"
shell:
"cat {input} > {output}"
takes a given input text file and concatenates its content to a given output file
Another way to understand the question is that you are trying to append an input file to the end of the output. That is more challenging: Snakemake "thinks" in terms of goals where each goal is a distinct file. How would Snakemake know if the output file is a raw one or if it is a concatenated version? One way to do that is to have "flag" files: the presence of such file would mean that the goal is achieved and no concatenation is needed. One more problem: Snakemake clears the output file before running the rule. Than means that you need to specify it as a input:
rule append:
input:
in = "input.txt",
out = "output.txt"
output:
flag = "flag"
shell:
"cat {input.in} >> {input.out} && touch {output.flag}"
Now back to your question regarding the error and the way to specify the filenames in runtime. You get this error because the wildcards should be fully inferred from the output section, and both your rules are ill-formed. Let's start with the rule all.
You need to say Snakemake what goal you are building. No wildcards in the input, everything should be disambigued:
def getInput():
pass
# form the actual goal (you may query the database, service, hardcode, etc.)
rule all:
input: getInput
Let's say you decided that the goal should be 3 files: ["output1.txt", "output3.txt", "output3.txt"]:
def getInput():
magic_numbers_from_oracle = ["1", "2", "3"]
return magic_numbers_from_oracle
rule all:
input: expand("output{number}.txt", number=getInput())
Ok, now Snakemake knows the goal. The next step is to write a rule that says how to create a single output{number}.txt file. For simplicity I'm taking your initial approach with cat/copying:
rule cat_copy:
input:
"input{n}.txt"
output:
"output{n}.txt"
shell:
"cat {input} > {output}"
That's it. As long as you have files input1.txt, input2.txt, input3.txt you would get the corresponding outputs.

Varying (known) number of outputs in Snakemake

I have a Snakemake rule that works on a data archive and essentially unpacks the data in it. The archives contain a varying number of files that I know before my rule starts, so I would like to exploit this and do something like
rule unpack:
input: '{id}.archive'
output:
lambda wildcards: ARCHIVE_CONTENTS[wildcards.id]
but I can't use functions in output, and for good reason. However, I can't come up with a good replacement. The rule is very expensive to run, so I cannot do
rule unpack:
input: '{id}.archive'
output: '{id}/{outfile}'
and run the rule several times for each archive. Another alternative could be
rule unpack:
input: '{id}.archive'
output: '{id}/{outfile}'
run:
if os.path.isfile(output[0]):
return
...
but I am afraid that would introduce a race condition.
Is marking the rule output with dynamic really the only option? I would be fine with auto-generating a separate rule for every archive, but I haven't found a way to do so.

Here, it becomes handy that Snakemake is an extension of plain Python. You can generate a separate rule for each archive:
for id, contents in ARCHIVE_CONTENTS.items():
rule:
input:
'{id}.tar.gz'.format(id=id)
output:
expand('{id}/{outfile}', outfile=contents)
shell:
'tar -C {wildcards.id} -xf {input}'
Depending on what kind of archive this is, you could also have a single rule that just extracts the desired file, e.g.:
rule unpack:
input:
'{id}.tar.gz'
output:
'{id}/{outfile}'
shell:
'tar -C {wildcards.id} -xf {input} {wildcards.outfile}'

Snakemake: dependencies which are not input

I'd like to know if there is a way in Snakemake to define a dependency which is actually not an input file.
What I mean by that is that there are programs that expect some files to exists while there are not provided on the command line.
Let's consider bwa as an example.
This is a rule from Johannes Köster mapping rules:
rule bwa_mem_map:
input:
lambda wildcards: config["references"][wildcards.reference],
lambda wildcards: config["units"][wildcards.unit]
output:
"mapping/{reference}/units/{unit}.bam"
params:
sample=lambda wildcards: UNIT_TO_SAMPLE[wildcards.unit],
custom=config.get("params_bwa_mem", "")
log:
"mapping/log/{reference}/{unit}.log"
threads: 8
shell:
"bwa mem {params.custom} "
r"-R '#RG\tID:{wildcards.unit}\t"
"SM:{params.sample}\tPL:{config[platform]}' "
"-t {threads} {input} 2> {log} "
"| samtools view -Sbh - > {output}"
Here, bwa expect that the genome index file exists while it is not a command-line argument (the path to the index file is deduced from the genome path).
Is there a way to tell Snakemake that the index file is a dependency, and Snakemake will look in its rule if he knows how to generate this file ?
I suppose you could still rewrite your rule inputs as:
rule bwa_mem_map:
input:
genome=lambda wildcards: config["references"][wildcards.reference],
fastq=lambda wildcards: config["units"][wildcards.unit]
index=foo.idx
And adapt the rule run part consequently.
Is it the best solution?
Thanks in advance.
Benoist

I think the only way snakemake handles dependencies between rules is through files, so I'd say you are doing it correctly when you put the index file explicitly as an input for your mapping rule, even though this file does not appear in the mapping command.
For what its worth, I do the same for bam index files, which are an implicit dependency for some tools: I put both the sorted bam file and its index as input, but only use the bam file in the shell or run part. And I have a rule generating both having the two files as output.
input and output files do not need to appear in shell / run parts of a rule.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

An WorkflowError with wildcards - error-handling

Related

Possible to have wildcard as dictionary key in snakemake rule output?

Eliminating snakemake temporary directories

Specify input and output files in Snakefile

Varying (known) number of outputs in Snakemake

Snakemake: dependencies which are not input

Categories

Resources