Snakemake: 'Missing input files' due to wrong wildcard expansion - snakemake

I am new to Snakemake and I want to write a very simple Snakefile with a rule that processes each input file separately to an output file, but somehow my wildcards aren't interpreted correctly.
I have set up a minimal, reproducible example environment in Ubuntu 18.04 with the input files "test/test1.txt", "test/test2.txt", and a Snakefile. (snakemake version 5.5.4)
Snakefile:
ins = glob_wildcards("test/{f}.txt")
rule all:
input: expand("out/{f}.txt", f=ins)
rule test:
input: "test/{f}.txt"
output: "out/{f}.txt"
shell: "touch {output}"
This Snakefile throws the following error while building the DAG of jobs:
Missing input files for rule test:
test/['test1', 'test2'].txt
Any ideas how to fix this error?

I think you need to use ins.f or something similar:
expand("out/{f}.txt", f= ins.f)
The reason is explained in the FAQ
[glob_wildcards returns] a named tuple that contains a list of values
for each wildcard.

Related

How to parallelize jobs for a list of files using snakemake (beginner question)

I am struggling with a very simple thing. On input of my snakemake pipeline I would like to have a directory, list its content, and process each file from that directory in parallel. Naively I thought something like this should work:
rule all:
input:
"in/{test}.txt"
output:
"out/{test}.txt"
shell:
"echo {input} >> {output}"
This ends with the error
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
All the resources I could find start with hard-coding the list of jobs in the script, which is something I want to avoid to keep the pipeline generic. The idea is to just point the pipeline to a directory with a list of files and let it do its job. Is this possible? Seems fairly simple and intuitive, but couldn't find an example showing that.
I don't know what command you used for this rule, but the following workflow should suffice your purpose
rule all:
input:
expand("out/{prefix}.txt", prefix=glob_wildcards("in/{test}.txt").test)
rule test:
input:
"in/{test}.txt"
output:
"out/{test}.txt"
shell:
"echo {input} >> {output}"
glob_wildcards is a function by snakemake to find out all the files that match the specified pattern (in/{test}.txt in this case), then .text is to get the list of strings that match {test} in filenames (example: "ab" in "in/ab.txt").
Then expand can fill the string to the placeholder variable that wrapped by curly bracket, then generate a list of input file names.
So rule all wants a list of input files correspond to all txt files in in folder, then it would let snakemake execute rule test for every file

Accessing file path from a config.yaml in Snakemake

I'm working with Snakemake for NGS analysis. I have a list of input files, stored in a YAML file as follows:
DATASETS:
sample1: /path/to/input/bam
.
.
A very simplified skeleton of my Snakemake file, as described earlier in Snakemake: How to use config file efficiently and https://www.biostars.org/p/406452/, is as follows:
rule all:
input:
expand("report/{sample}.xlsx", sample = config["DATASETS"])
rule call:
input:
lambda wildcards: config["DATASETS"][wildcards.sample]
output:
"tmp/{sample}.vcf"
shell:
"some mutect2 script"
rule summarize:
input:
"tmp/{sample}.vcf"
output:
"report/{sample}.xlsx"
shell:
"processVCF.py"
This complains about missing input files for rule all. I'm really not too sure what I am missing out here: Could someone perhaps point out where I can start looking to try to solve my problem?
This problem persists even when I execute snakemake -n tmp/sample1.vcf, so it seems the problem is related to the inability to pass the input file to the rule call. I have a nagging feeling that I'm really missing something trivial here.

Specify input and output files in Snakefile

I'm new to Snakemake and I want to make a pipeline that takes a given input text file and concatenates its content to a given output file. However I want to be able to specify the names of both the input and output files at run time, so neither file names are hardcoded in the Snakefile. Right now all I can come up with is:
rule all:
input:
"{input}.txt",
"{output}.txt"
rule output_files:
input:
"{input}.txt"
output:
"{output}.txt"
shell:
"cat {input}.txt > {output}.txt"
I tried running this with "snakemake input1.txt output.txt" but I got the error:
Building DAG of jobs...
WildcardError in line 6 of Snakefile:
Wildcards in input files cannot be determined from output files:
'input'
Any suggestions would be greatly appreciated.
In your example you actually copy a single input file into an output file using a cat shell command. That could be understood as an intention to concatenate several inputs into one output:
rule concatenate:
input:
"input1.txt",
"input2.txt"
output:
"output.txt"
shell:
"cat {input} > {output}"
takes a given input text file and concatenates its content to a given output file
Another way to understand the question is that you are trying to append an input file to the end of the output. That is more challenging: Snakemake "thinks" in terms of goals where each goal is a distinct file. How would Snakemake know if the output file is a raw one or if it is a concatenated version? One way to do that is to have "flag" files: the presence of such file would mean that the goal is achieved and no concatenation is needed. One more problem: Snakemake clears the output file before running the rule. Than means that you need to specify it as a input:
rule append:
input:
in = "input.txt",
out = "output.txt"
output:
flag = "flag"
shell:
"cat {input.in} >> {input.out} && touch {output.flag}"
Now back to your question regarding the error and the way to specify the filenames in runtime. You get this error because the wildcards should be fully inferred from the output section, and both your rules are ill-formed. Let's start with the rule all.
You need to say Snakemake what goal you are building. No wildcards in the input, everything should be disambigued:
def getInput():
pass
# form the actual goal (you may query the database, service, hardcode, etc.)
rule all:
input: getInput
Let's say you decided that the goal should be 3 files: ["output1.txt", "output3.txt", "output3.txt"]:
def getInput():
magic_numbers_from_oracle = ["1", "2", "3"]
return magic_numbers_from_oracle
rule all:
input: expand("output{number}.txt", number=getInput())
Ok, now Snakemake knows the goal. The next step is to write a rule that says how to create a single output{number}.txt file. For simplicity I'm taking your initial approach with cat/copying:
rule cat_copy:
input:
"input{n}.txt"
output:
"output{n}.txt"
shell:
"cat {input} > {output}"
That's it. As long as you have files input1.txt, input2.txt, input3.txt you would get the corresponding outputs.

Chain dynamic input/output rules in snakemake

I am trying to create a pipeline where a small chain of rules are ran on a dynamic number of files output by an earlier rule using output. However, I am getting the following error: "wildcards in input files cannot be determined from output files:".
This suggests to me that what I am trying to do is not currently supported. Here is a pseudo example of what I am trying to do:
rule a:
input: "my static file.txt"
output: dynamic('my/path/{id}.txt')
rule b:
input: dynamic('my/path/{id}.txt')
output: dynamic('my/path/{id}.reprocessed.txt')
rule c:
input: dynamic('my/path/{id}.reprocessed.txt')
output: 'gather.txt'
Running snakemake with
rule all:
input: dynamic('my/path/{id}.txt')
Works without any issues, but when I run snakemake with:
rule all:
input: dynamic('my/path/{id}.reprocessed.txt')
I get the error: "wildcards in input files cannot be determined from output files:"
Is this feature supported? Has anyone successfully made such a chain? Any considerations I need to take into account?
Thanks!
This was resolved by removing the dynamic statement from rule b.

how can I pass a string in config file into the output section?

New to snakemake and I've been trying to transform my shell script based pipeline into snakemake based today and run into a lot of syntax issues.. I think most of the trouble I have is around getting all the files in a particular directories and infer output names from input names since that's how I use shell script (for loop), in particular, I tried to use expand function in the output section and it always gave me an error.
After checking some example Snakefile, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used and if so, why? What if I want to pass a prefix defined in config.yaml file as part of the output file and that prefix can not be inferred from input file names, how can I achieve that, just like what I did below for the log section where {runid} is my prefix?
Second question about syntax: I tried to pass a user defined id in the configuration file (config.yaml) into the log section and it seems to me that here I have to use expand in the following form, is there a better way of passing strings defined in config.yaml file?
log:
expand("fastq/fastqc/{runid}_fastqc_log.txt",runid=config["run"])
where in the config.yaml
run:
"run123"
Third question: I initially tried the following 2 methods but they gave me errors so does it mean that inside log (probably input and output) section, Python syntax is not followed?
log:
"fastq/fastqc/"+config["run"]+"_fastqc_log.txt"
log:
"fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]
rule all:
input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)
rule peak_calling:
input: control="{control}.sam", treatment="{treatment}.sam"
output: "{treatment}_vs_{control}.bed"
shell: "touch {output}"
rule mapping:
input: "{samples}.fastq"
output: "{samples}.sam"
shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
As for the last part, the right way to put it would be the first one:
log:
"fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
input: "{samples}.fastq"
output: "{samples}.sam"
log: "{samples}.log"
shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is you folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
"folder_with_configs/some_config.yaml"
rule one_to_rule_all:
output:
f"results/{config['var']}.file"
shell:
"touch {output[0]}"
Do remember about python rules related to nesting different types of apostrophes.
config in the smake rule is a simple python dictionary.
If you need to use additional variables in a path, e.g. some_param, use more curly brackets.
rule one_to_rule_all:
output:
f"results/{config['var']}.{{some_param}}"
shell:
"touch {output[0]}"
enjoy