Accessing file path from a config.yaml in Snakemake - snakemake

I'm working with Snakemake for NGS analysis. I have a list of input files, stored in a YAML file as follows:
DATASETS:
    sample1: /path/to/input/bam
    .
    .
A very simplified skeleton of my Snakemake file, as described earlier in Snakemake: How to use config file efficiently and https://www.biostars.org/p/406452/, is as follows:
rule all:
    input:
        expand("report/{sample}.xlsx", sample = config["DATASETS"])
rule call:
    input:
        lambda wildcards: config["DATASETS"][wildcards.sample]
    output:
        "tmp/{sample}.vcf"
    shell:
        "some mutect2 script"
rule summarize:
    input:
        "tmp/{sample}.vcf"
    output:
        "report/{sample}.xlsx"
    shell:
        "processVCF.py"
This complains about missing input files for rule all. I'm really not too sure what I am missing out here: Could someone perhaps point out where I can start looking to try to solve my problem?
This problem persists even when I execute snakemake -n tmp/sample1.vcf, so it seems the problem is related to the inability to pass the input file to the rule call. I have a nagging feeling that I'm really missing something trivial here.
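One way to narrow this down is to replay, in plain Python, the two lookups the Snakefile performs. The sketch below is only an illustration: it assumes the config parses to the nested dictionary shown in the question, and `call_input` and `SimpleNamespace` are stand-ins for the lambda and for Snakemake's wildcards object. If a resolved path is reported as missing, that would be one plausible cause of the missing-input message.

```python
import os
from types import SimpleNamespace

# Assumed result of loading the YAML from the question.
config = {"DATASETS": {"sample1": "/path/to/input/bam"}}

def call_input(wildcards):
    """Stand-in for the lambda used as the input of rule call."""
    return config["DATASETS"][wildcards.sample]

# Iterating over a dict yields its keys, so targets are built from sample names.
targets = [f"report/{sample}.xlsx" for sample in config["DATASETS"]]

# Resolve every sample to its configured path.
resolved = {s: call_input(SimpleNamespace(sample=s)) for s in config["DATASETS"]}

# Report whether each configured path exists on this machine.
for sample, path in resolved.items():
    print(sample, path, "exists" if os.path.exists(path) else "MISSING")
```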

Related

How to parallelize jobs for a list of files using snakemake (beginner question)

I am struggling with a very simple thing. On input of my snakemake pipeline I would like to have a directory, list its content, and process each file from that directory in parallel. Naively I thought something like this should work:
rule all:
    input:
        "in/{test}.txt"
    output:
        "out/{test}.txt"
    shell:
        "echo {input} >> {output}"
This ends with the error
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
All the resources I could find start with hard-coding the list of jobs in the script, which is something I want to avoid to keep the pipeline generic. The idea is to just point the pipeline to a directory with a list of files and let it do its job. Is this possible? Seems fairly simple and intuitive, but couldn't find an example showing that.
I don't know what command you used for this rule, but the following workflow should serve your purpose:
rule all:
    input:
        expand("out/{prefix}.txt", prefix=glob_wildcards("in/{test}.txt").test)
rule test:
    input:
        "in/{test}.txt"
    output:
        "out/{test}.txt"
    shell:
        "echo {input} >> {output}"
glob_wildcards is a Snakemake function that finds all files matching the given pattern (in/{test}.txt in this case); .test then gives the list of strings matched by {test} in the file names (for example, "ab" in "in/ab.txt").
expand then substitutes each of those values into the placeholder wrapped in curly brackets and generates a list of file names.
So rule all requests one output file for every txt file in the in folder, which makes Snakemake execute rule test once per file.
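To make the two steps concrete, here is a rough plain-Python imitation of what glob_wildcards and expand do in this workflow. It is a sketch only: `wildcard_values` and `expand_targets` are hypothetical helper names, not Snakemake functions, and the temporary directory stands in for the in folder.

```python
import os
import re
import tempfile

def wildcard_values(directory, suffix=".txt"):
    """Return the part of each file name that a {test} wildcard would capture."""
    values = []
    for name in sorted(os.listdir(directory)):
        match = re.fullmatch(r"(.+)" + re.escape(suffix), name)
        if match:
            values.append(match.group(1))
    return values

def expand_targets(pattern, values):
    """Fill the {prefix} placeholder once per value, like expand() with one wildcard."""
    return [pattern.format(prefix=value) for value in values]

# Build a throwaway "in" directory with two matching files and one non-matching file.
with tempfile.TemporaryDirectory() as in_dir:
    for name in ("ab.txt", "cd.txt", "notes.log"):
        open(os.path.join(in_dir, name), "w").close()
    found = wildcard_values(in_dir)
    targets = expand_targets("out/{prefix}.txt", found)

print(found)
print(targets)
```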

Snakemake: 'Missing input files' due to wrong wildcard expansion

I am new to Snakemake and I want to write a very simple Snakefile with a rule that processes each input file separately to an output file, but somehow my wildcards aren't interpreted correctly.
I have set up a minimal, reproducible example environment in Ubuntu 18.04 with the input files "test/test1.txt", "test/test2.txt", and a Snakefile. (snakemake version 5.5.4)
Snakefile:
ins = glob_wildcards("test/{f}.txt")
rule all:
    input: expand("out/{f}.txt", f=ins)
rule test:
    input: "test/{f}.txt"
    output: "out/{f}.txt"
    shell: "touch {output}"
This Snakefile throws the following error while building the DAG of jobs:
Missing input files for rule test:
test/['test1', 'test2'].txt
Any ideas how to fix this error?
I think you need to use ins.f or something similar:
expand("out/{f}.txt", f= ins.f)
The reason is explained in the FAQ
[glob_wildcards returns] a named tuple that contains a list of values for each wildcard.
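The odd file name in the error message can be reproduced without Snakemake. The sketch below uses a hand-built named tuple as a stand-in for the value returned by glob_wildcards (the name `Wildcards` is made up for illustration) and uses str.format to imitate expand.

```python
from collections import namedtuple

# Stand-in for the named tuple returned by glob_wildcards("test/{f}.txt").
Wildcards = namedtuple("Wildcards", ["f"])
ins = Wildcards(f=["test1", "test2"])

# Iterating over the tuple itself yields one element: the whole list.
wrong = ["out/{f}.txt".format(f=value) for value in ins]

# Iterating over the field yields one element per matched file.
right = ["out/{f}.txt".format(f=value) for value in ins.f]

print(wrong)
print(right)
```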

Chain dynamic input/output rules in snakemake

I am trying to create a pipeline where a small chain of rules is run on a dynamic number of files produced by an earlier rule using dynamic output. However, I am getting the following error: "wildcards in input files cannot be determined from output files:".
This suggests to me that what I am trying to do is not currently supported. Here is a pseudo example of what I am trying to do:
rule a:
    input: "my static file.txt"
    output: dynamic('my/path/{id}.txt')
rule b:
    input: dynamic('my/path/{id}.txt')
    output: dynamic('my/path/{id}.reprocessed.txt')
rule c:
    input: dynamic('my/path/{id}.reprocessed.txt')
    output: 'gather.txt'
Running snakemake with
rule all:
    input: dynamic('my/path/{id}.txt')
works without any issues, but when I run snakemake with:
rule all:
    input: dynamic('my/path/{id}.reprocessed.txt')
I get the error: "wildcards in input files cannot be determined from output files:"
Is this feature supported? Has anyone successfully made such a chain? Any considerations I need to take into account?
Thanks!
This was resolved by removing the dynamic statement from rule b.

How to gather files from subdirectories to run jobs in Snakemake?

I am currently working on a project where I am struggling with the following issue.
My current directory structure is
/shared/dir1/file1.bam
/shared/dir2/file2.bam
/shared/dir3/file3.bam
I want to convert various .bam files to fastq in the results directory
results/file1_1.fastq.gz
results/file1_2.fastq.gz
results/file2_1.fastq.gz
results/file2_2.fastq.gz
results/file3_1.fastq.gz
results/file3_2.fastq.gz
I have the following code:
END=["1","2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")
rule all:
    input: expand( "/results/{sample}_{end}.fastq.gz",sample=files, end=END)
rule bam_to_fq:
    input: "{dir}/{sample}.bam"
    output: left="/results/{sample}_1.fastq", right="/results/{sample}_2.fastq"
    shell: "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"
This outputs the following error:
Wildcards in input files cannot be determined from output files:
'dir'
Any help would be appreciated
You're just missing an assignment for "dir" in the input directive of rule bam_to_fq. In your code, Snakemake has to determine "{dir}" from the output of the same rule, because you have set it up as a wildcard. Since "{dir}" does not appear anywhere in your output directive, you received an error.
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq",
Rule of thumb: input and output wildcards must match
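The rule of thumb can be checked mechanically. The following is a simplified sketch, not Snakemake's own logic: `wildcard_names` and `undetermined` are hypothetical helpers, and the regular expression only recognises plain `{name}` wildcards (no constraints such as `{name,regex}`).

```python
import re

def wildcard_names(pattern):
    """Collect the names of plain {name} wildcards in a file pattern."""
    return set(re.findall(r"\{(\w+)\}", pattern))

def undetermined(inputs, outputs):
    """Return input wildcards that no output pattern can provide a value for."""
    known = set().union(*(wildcard_names(p) for p in outputs))
    needed = set().union(*(wildcard_names(p) for p in inputs))
    return needed - known

# The rule from the question: {dir} appears only in the input.
question_rule = undetermined(
    ["{dir}/{sample}.bam"],
    ["/results/{sample}_1.fastq", "/results/{sample}_2.fastq"],
)
print(question_rule)
```

An empty result means every input wildcard also appears in some output pattern.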
rule all:
    input:
        expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)
rule bam_to_fq:
    input:
        expand("{dir}/{{sample}}.bam", dir=dirs)
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq"
    shell:
        # inside shell, wildcards are referenced as {wildcards.<name>}
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
NOTES
The sample variable in the input directive now requires double {}, because that is how a wildcard is kept intact inside an expand.
dir is no longer a wildcard; it is explicitly set to the list of directories determined by the glob_wildcards call and assigned to the variable "dirs", which I am assuming you create earlier in your script, since the assignment of the other variable already succeeds in your rule all input ("sample=files").
I like and recommend easily distinguishable variable names. I'm not a huge fan of the variable names "dir" and "dirs", because they invite spelling errors. Consider changing them to "dirLIST" and "dir", or anything else really. I just fear that one day someone will miss an 's' somewhere and it will be frustrating to debug. I'm personally guilty, and thus a slight hypocrite, as I do use "sample=samples" in my core Snakefile. It has caused me minor stress, which is why I make this recommendation. It also makes it easier for others to read your code.
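The double-brace note can be tried in plain Python, because the escaping follows the rules of str.format. The sketch below uses str.format as a stand-in for expand; the second step, filling in `sample="file1"`, imitates what happens once the wildcard is resolved for one job.

```python
dirs = ["dir1", "dir2", "dir3"]
pattern = "{dir}/{{sample}}.bam"

# First pass: {dir} is filled in, {{sample}} collapses to a literal {sample}.
expanded = [pattern.format(dir=d) for d in dirs]

# Second pass: the remaining wildcard is resolved for one sample.
resolved = [p.format(sample="file1") for p in expanded]

print(expanded)
print(resolved)
```

Note that the second list pairs the sample with every directory, which is the key-value matching problem taken up in the EDIT that follows.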
EDIT 1; Adding to response as I had initially missed the requirement for key-value matching of the dir and sample
I recommend keeping separate the path and the sample name in different variables. Two approaches I can think of:
Keep using glob_wildcards to make a blanket search for all possible variables, and then use a python function to validate which path+file combinations are legit.
Drop the usage of glob_wildcards. Propagate the directory name as a wildcard variable, {dir}, throughout your rules. Just set it as a sub-directory of "results". Use pandas to pass known key-value pairs listed in a file to the rule all. Initially I suggest generating the key-value pairs file manually, but eventually its generation could just be a rule upstream of the others.
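For the first approach, one possible shape of such a Python function is sketched below. It assumes the two lists from glob_wildcards are aligned by position (entry i of each list coming from the same matched file) and that sample names are unique across directories; `sample_to_dir` and `bam_for` are illustrative names, and the lists are hard-coded from the directory layout in the question.

```python
# Values as glob_wildcards("/shared/{dir}/{file}.bam") would report them
# for the layout in the question (hard-coded here for illustration).
dirs = ["dir1", "dir2", "dir3"]
files = ["file1", "file2", "file3"]

# Keep only the combinations that really exist: one directory per sample.
sample_to_dir = dict(zip(files, dirs))

def bam_for(sample):
    """Return the BAM path for a sample; could back an input function."""
    return f"/shared/{sample_to_dir[sample]}/{sample}.bam"

print(bam_for("file2"))
```

In a Snakefile, a function like this could be wrapped as `lambda wildcards: bam_for(wildcards.sample)` in the input directive.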
Generalizing bam_to_fq a little bit... utilizing an external config, something like....
from pandas import read_table
rule all:
    input:
        # single braces: each field is filled from the (index, row) pair yielded by iterrows()
        expand("/results/{sample[1][dir]}/{sample[1][file]}_{end}.fastq.gz", sample=read_table(config["sampleFILE"], sep=" ").iterrows(), end=['1','2'])
rule bam_to_fq:
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{dir}/{sample}_1.fastq",
        right="/results/{dir}/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
sampleFILE
dir file
dir1 file1
dir2 file2
dir3 file3
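To see which targets a sample file like the one above produces, the pairs can be read with the standard library alone. This sketch swaps pandas for the csv module purely to stay dependency-free, and reads the table from an in-memory string instead of config["sampleFILE"].

```python
import csv
import io

# In-memory copy of the space-separated sample file shown above.
sample_file = io.StringIO("dir file\ndir1 file1\ndir2 file2\ndir3 file3\n")

# Each row becomes a dict such as {"dir": "dir1", "file": "file1"}.
rows = list(csv.DictReader(sample_file, delimiter=" "))

# One target per (row, read end) pair, keeping dir and file matched.
targets = [
    f"/results/{row['dir']}/{row['file']}_{end}.fastq.gz"
    for row in rows
    for end in ("1", "2")
]

for target in targets:
    print(target)
```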

how can I pass a string in config file into the output section?

I'm new to snakemake. Today I have been trying to convert my shell-script-based pipeline to snakemake and ran into a lot of syntax issues. I think most of the trouble is around getting all the files in a particular directory and inferring output names from input names, since that's how my shell script works (a for loop). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefile, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used and if so, why? What if I want to pass a prefix defined in config.yaml file as part of the output file and that prefix can not be inferred from input file names, how can I achieve that, just like what I did below for the log section where {runid} is my prefix?
Second question about syntax: I tried to pass a user defined id in the configuration file (config.yaml) into the log section and it seems to me that here I have to use expand in the following form, is there a better way of passing strings defined in config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt",runid=config["run"])
where in the config.yaml
run:
    "run123"
Third question: I initially tried the following 2 methods but they gave me errors so does it mean that inside log (probably input and output) section, Python syntax is not followed?
log:
    "fastq/fastqc/"+config["run"]+"_fastqc_log.txt"
log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]
rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)
rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
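How a wildcard value gets deduced from a target can be imitated with a regular expression. This is a simplified sketch rather than Snakemake's actual matcher: `pattern_to_regex` and `match_wildcards` are made-up helpers, each wildcard is assumed to match `.+`, and a wildcard name may appear only once per pattern.

```python
import re

def pattern_to_regex(pattern):
    """Turn 'a_{x}_b' into a regex with one named group per wildcard."""
    # re.split with a capture group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        regex += f"(?P<{part}>.+)" if index % 2 else re.escape(part)
    return re.compile(regex)

def match_wildcards(pattern, target):
    """Return the wildcard values that make pattern equal target, or None."""
    match = pattern_to_regex(pattern).fullmatch(target)
    return match.groupdict() if match else None

# The target requested by rule all, matched against the output of peak_calling.
peak = match_wildcards("{treatment}_vs_{control}.bed", "sample2_vs_sample1.bed")
# One of the inputs this implies, matched against the output of mapping.
mapped = match_wildcards("{samples}.sam", "sample2.sam")

print(peak)
print(mapped)
```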
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
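The string concatenation recommended above is ordinary Python, so it can be checked outside Snakemake. In this sketch `config` is a plain dictionary standing in for the loaded config.yaml; the str.format variant is shown only as an equivalent alternative.

```python
# Stand-in for the dictionary Snakemake builds from config.yaml.
config = {"run": "run123"}

# Concatenation, as in the answer.
log_concat = "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"

# The same path built with str.format.
log_format = "fastq/fastqc/{runid}_fastqc_log.txt".format(runid=config["run"])

# The third form from the question nests double quotes inside a double-quoted
# string, so the string ends at the second quote and the line does not parse.

print(log_concat)
```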
You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
    "folder_with_configs/some_config.yaml"
rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember the Python rules for nesting different types of quotation marks.
config in a Snakemake rule is a plain Python dictionary.
If you need additional wildcards in a path, e.g. some_param, escape them with doubled curly brackets.
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
enjoy
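The effect of the doubled brackets in an f-string can be confirmed in plain Python. Here `config` is a stand-in dictionary, and the final `.format` call only imitates the later step in which a wildcard value is filled in; `"txt"` is an arbitrary example value.

```python
# Stand-in for the dictionary loaded from some_config.yaml.
config = {"var": "value"}

# The config value is substituted immediately.
fixed = f"results/{config['var']}.file"

# Doubled braces survive the f-string as a literal {some_param} wildcard.
with_wildcard = f"results/{config['var']}.{{some_param}}"

# Imitation of the wildcard being resolved later.
filled = with_wildcard.format(some_param="txt")

print(fixed)
print(with_wildcard)
print(filled)
```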