Snakemake: Error when trying to run a rule for multiple directories and files

I create a dictionary in Python whose keys are the paths to the directories I want the software to run on, and whose values are lists of the expected outputs for each directory. Right now I have a structure like this:
dirSampleDict = {'/path_to_directory1': ["sample1","sample2","sample3"],
                 '/path_to_directory2': ["sample1","sample2"],
                 '/path_to_directory3': ["sample1","sample2","sample3"]}
# dirSampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the corresponding value dirSampleDict[key] is an array e.g. ["a","b","c"]
def input():
    input=[]
    for key in dirSampleDict:
        input.extend(expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key]))
    return input
rule all:
    input:
        input()
# example should run some software on different directories for each set of directories and their expected output samples
rule example:
    input:
        # the path to each set of samples should be the wildcard
        dir = lambda wildcards: expand("{dir}", dir=dirSampleDict.keys())
    params:
        # some params
    output:
        expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key])
    log:
        log = '{dir}/{sample}.log'
    run:
        cmd = "software {dir}"
        shell(cmd)
Doing this I receive the following error:
No values given for wildcard 'dir'
Edit: Maybe it was not so clear what I actually want to do so I filled in some data.
I also tried using the wildcards I set up in rule all as follows:
dirSampleDict = {'/path_to_directory1': ["sample1","sample2","sample3"],
                 '/path_to_directory2': ["sample1","sample2"],
                 '/path_to_directory3': ["sample1","sample2","sample3"]}
# dirSampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the corresponding value dirSampleDict[key] is an array e.g. ["a","b","c"]
def input():
    input=[]
    for key in dirSampleDict:
        input.extend(expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key]))
    return input
rule all:
    input:
        input()
# example should run some software on different directories for each set of directories and their expected output samples
rule example:
    input:
        # the path to each set of samples should be the wildcard
        dir = "{{dir}}"
    params:
        # some params
    output:
        '{dir}/{sample}*.foo'
    log:
        log = '{dir}/{sample}.log'
    run:
        cmd = "software {dir}"
        shell(cmd)
Doing this I receive the following error:
Not all output, log and benchmark files of rule example contain the
same wildcards. This is crucial though, in order to avoid that two or
more jobs write to the same file.
I'm pretty sure the second version is closer to what I actually want to do, since expand() as output would only run the rule once, but I need to run it for every key-value pair in the dictionary.

First of all, what do you expect from the asterisk in the output?
    output:
        '{dir}/{sample}*.foo'
The output has to be a list of valid filenames, each of which can be formed by substituting every wildcard with some string.
The next problem is that you are using "{dir}" in the run: section. No variable dir is defined in the script used for run. If you want to use the wildcard, you need to refer to it as wildcards.dir. However, the run: section can be replaced with a shell: section:
    shell:
        "software {wildcards.dir}"
Regarding your first script: there is no dir wildcard defined (actually there are no wildcards at all):
    output:
        expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key])
Both {dir} and {sample} are variables in the context of the expand function, and they are fully substituted with the named parameters.
Now the second script. What did you mean by this input?
    input:
        dir = "{{dir}}"
Here "{{dir}}" is not a wildcard, but a reference to a global variable (you haven't provided the rest of your script, so I cannot judge whether it is defined or not). Moreover, what is the input needed for? You never use the {input} variable at all, and there are no dependencies that would connect the rule example to any other rule producing its input.
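To make the point about valid filenames concrete, here is a minimal sketch in plain Python (no Snakemake required) of the target list that rule all could request once the asterisk is dropped. The helper name build_targets is hypothetical, and dir_sample_dict simply mirrors the dictionary from the question; treat it as an illustration under those assumptions rather than the intended solution.

```python
# Mirrors the dictionary from the question (paths are placeholders).
dir_sample_dict = {
    "/path_to_directory1": ["sample1", "sample2", "sample3"],
    "/path_to_directory2": ["sample1", "sample2"],
    "/path_to_directory3": ["sample1", "sample2", "sample3"],
}

def build_targets(mapping, suffix=".foo"):
    """Return one concrete path per (directory, sample) pair.

    Hypothetical helper: each directory is paired only with its own
    samples, so no directory/sample combination is invented.
    """
    return [
        f"{directory}/{sample}{suffix}"
        for directory, samples in mapping.items()
        for sample in samples
    ]

targets = build_targets(dir_sample_dict)
print(len(targets))
print(targets[0])
```

A list like this could be returned by the function that feeds rule all, while rule example keeps a pattern such as '{dir}/{sample}.foo' as its output, so that both wildcards can be inferred from each requested file.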

Related

How to implement splitting of files in snakemake when number of files is known

Context
rule A uses the split command in a shell directive.
The number of files generated by rule A depends on a user specified value from the config and is thus known.
In this question there is a difference because the number of output files is unknown, but there is a reference to the dynamic() keyword. Apparently this has been replaced by the use of checkpoint. Is this really the correct way to go in this scenario? There is also something like scatter-gather, but the example is not clear to me.
Code
chunks = config["chunks"]
sample_list = ["S1", "S2"]
rule all:
    input:
        expand("{sample}_chunk_{chunk}_done_something.tsv", sample=sample_list,
               chunk=[f"{i}".zfill(len(str(chunks))-1) for i in range(0, chunks)])
rule A:
    input:
        "input_file_{sample}.tsv"
    output:
        # the user defined number of chunks, how to specify these?
    params: chunks=chunks
    shell:
        "split -n {params.chunks} --numeric-suffixes=1 --additional-suffix=.tsv {input[0]} some_prefix_{wildcards.sample}_"
rule B:
    input:
        "some_prefix_{sample}_{chunk}.tsv"
    output:
        "{sample}_chunk_{chunk}_done_something.tsv"
    shell:
        "#Do something"
Attempts
I tried using a checkpoint with an input function for rule B and using directory() in rule A. However using directory results in SyntaxError in line 253 of MySnakefile: Unexpected keyword directory in rule definition (Snakefile, line 253) and even if that would not throw an error, I don't know how to get chunks into this input function since it is not a wildcard.
How to implement the splitting of an input file best in Snakemake?
Since the number of chunks is known beforehand, you can define the output files of rule A from the chunks value using a list comprehension:
rule A:
    ...
    output:
        chunks = ["some_prefix_{{sample}}_{:02d}.tsv".format(x+1) for x in range(chunks)]
With chunks = 2, this expands to chunks = ["some_prefix_{sample}_01.tsv", "some_prefix_{sample}_02.tsv"], matching the syntax of the split output. The {sample} wildcard will be filled in by Snakemake's standard wildcard replacement.
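The list comprehension above can be checked outside Snakemake. The following is a small sketch in plain Python; the function name chunk_outputs and its prefix and suffix parameters are illustrative assumptions, and the two-digit width assumes split writes two-digit numeric suffixes as in the example above.

```python
def chunk_outputs(n_chunks, prefix="some_prefix_{sample}_", suffix=".tsv"):
    """Build the output patterns for rule A; {sample} stays a wildcard.

    Hypothetical helper. The prefix is inserted as a value, so its
    braces do not need to be doubled here.
    """
    return [f"{prefix}{i:02d}{suffix}" for i in range(1, n_chunks + 1)]

# The same result via str.format, where {{sample}} escapes the wildcard
# and {:02d} zero-pads the chunk number to two digits.
via_format = ["some_prefix_{{sample}}_{:02d}.tsv".format(x + 1) for x in range(2)]

print(chunk_outputs(2))
print(via_format)
```

Both spellings leave {sample} untouched for Snakemake to fill in; only the chunk number is substituted when the Snakefile is parsed.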

Snakemake copy from several directories

Snakemake is super-confusing to me. I have files of the form:
indir/type/name_1/run_1/name_1_processed.out
indir/type/name_1/run_2/name_1_processed.out
indir/type/name_2/run_1/name_2_processed.out
indir/type/name_2/run_2/name_2_processed.out
where type, name, and the numbers are variable. I would like to aggregate files such that all files with the same "name" end up in a single dir:
outdir/type/name/name_1-1.out
outdir/type/name/name_1-2.out
outdir/type/name/name_2-1.out
outdir/type/name/name_2-2.out
How do I write a snakemake rule to do this? I first tried the following
rule rename:
    input:
        "indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"
    output:
        "outdir/{type}/{name}/{name}_{nameno}-{runno}.out"
    shell:
        "cp {input} {output}"
# example command: snakemake --cores 1 outdir/type/name/name_1-1.out
This worked, but doing it this way doesn't save me any effort because I have to know what the output files are ahead of time, so basically I'd have to pass all the output files as a list of arguments to snakemake, requiring a bit of shell trickery to get the variables.
So then I tried to use directory (as well as give up on preserving runno).
rule rename2:
    input:
        "indir/{type}/{name}_{nameno}"
    output:
        directory("outdir/{type}/{name}")
    shell:
        """
        for d in {input}/run_*; do
            i=0
            for f in ${{d}}/*processed.out; do
                cp ${{f}} {output}/{wildcards.name}_{wildcards.nameno}-${{i}}.out
            done
            let ++i
        done
        """
This gave me the error, Wildcards in input files cannot be determined from output files: 'nameno'. I get it; {nameno} doesn't exist in output. But I don't want it there in the directory name, only in the filename that gets copied.
Also, if I delete {nameno}, then it complains because it can't find the right input file.
What are the best practices here for what I'm trying to do? Also, how does one wrap their head around the fact that in snakemake, you specify outputs, not inputs? I think this latter fact is what is so confusing.
I guess what you need is the expand function:
rule all:
    input: expand("outdir/{type}/{name}/{name}_{nameno}-{runno}.out",
                  type=TYPES,
                  name=NAMES,
                  nameno=NAME_NUMBERS,
                  runno=RUN_NUMBERS)
TYPES, NAMES, NAME_NUMBERS and RUN_NUMBERS are the lists of all possible values for these parameters. You either need to hardcode them or use the glob_wildcards function to collect them:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS, = glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out")
This, however, would give you duplicates. If that is not desirable, remove the duplicates:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS, = map(set, glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"))
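One thing worth keeping in mind with the de-duplicated sets: expand combines every value with every other value, so it can request files for which no input exists. The sketch below illustrates the difference in plain Python with made-up values (names, runs, and the shortened pattern are all hypothetical). Snakemake's expand also accepts zip as its second argument for position-wise pairing, which needs the original, non-de-duplicated lists so that the positions still line up.

```python
from itertools import product

# Hypothetical matches, as parallel lists in the layout glob_wildcards returns:
# position i of each list belongs to the same file found on disk.
names = ["a", "a", "b"]
runs = ["1", "2", "1"]

pattern = "outdir/{name}-{run}.out"

# Every combination of the de-duplicated values (the default behaviour).
all_combinations = sorted(
    pattern.format(name=n, run=r) for n, r in product(set(names), set(runs))
)

# Position-wise pairing: only combinations that were actually observed.
observed = [pattern.format(name=n, run=r) for n, r in zip(names, runs)]

print(all_combinations)
print(observed)
```

Here the product contains outdir/b-2.out even though no matching input was listed, whereas the zipped list contains only the three observed pairs.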

Snakemake: dynamic + non-dynamic output

I have a use case in which a rule generates an arbitrary number of "checkpoint" files and a single output file. For example, "example.input" would produce:
example_000.checkpoint
example_001.checkpoint
...
example_093.checkpoint (arbitrary number here)
example.output (guaranteed non-dynamic output)
The checkpoints are intended to be used to restart from that point in the calculation, but I have additional use for them. However, I only need the first (e.g., example_000.checkpoint) and the last (e.g., example_093.checkpoint). How can I construct a rule such that my outputs are defined as:
rule example:
    input:
        "{id}.input"
    output:
        non_dynamic = "{id}.output",
        first = "{id}_{first}.checkpoint",
        last = "{id}_{last}.checkpoint"
        # OR
        checkpoints = dynamic("{id}_{checkpoint}.checkpoint")
If I define new wildcards, I get the error "Not all output files of rule example contain the same wildcards." If I try to use dynamic output, I get the error "A rule with dynamic output may not define any non-dynamic output files."
Thanks in advance for any help!

How to gather files from subdirectories to run jobs in Snakemake?

I am currently working on a project where I am struggling with this issue.
My current directory structure is
/shared/dir1/file1.bam
/shared/dir2/file2.bam
/shared/dir3/file3.bam
I want to convert various .bam files to fastq in the results directory
results/file1_1.fastq.gz
results/file1_2.fastq.gz
results/file2_1.fastq.gz
results/file2_2.fastq.gz
results/file3_1.fastq.gz
results/file3_2.fastq.gz
I have the following code:
END=["1","2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")
rule all:
    input: expand( "/results/{sample}_{end}.fastq.gz",sample=files, end=END)
rule bam_to_fq:
    input: "{dir}/{sample}.bam"
    output: left="/results/{sample}_1.fastq", right="/results/{sample}_2.fastq"
    shell: "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"
This outputs the following error:
Wildcards in input files cannot be determined from output files:
'dir'
Any help would be appreciated
You're just missing an assignment for "dir" in the input directive of the rule bam_to_fq. In your code, Snakemake tries to determine "{dir}" from the output of the same rule, because you have set it up as a wildcard. Since it does not appear in your output directive, you received an error.
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq",
Rule of thumb: every wildcard used in the input must also appear in the output.
rule all:
    input:
        expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)
rule bam_to_fq:
    input:
        expand("{dir}/{{sample}}.bam", dir=dirs)
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
NOTES
The sample variable in the input directive now requires double {}, because that is how a wildcard is kept as a wildcard inside an expand.
dir is no longer a wildcard; it is explicitly set to the list of directories determined by the glob_wildcards call and assigned to the variable "dirs", which I am assuming you create earlier in your script, since the assignment of the other variable already succeeds in your rule all input, "sample=files".
I like and recommend easily differentiable variable names. I'm not a huge fan of the variable names "dir" and "dirs". This makes you prone to small spelling errors. Consider changing them to "dirLIST" and "dir"... or anything really. I just fear that one day someone will miss an 's' somewhere and it's going to be frustrating to debug. I'm personally guilty, and thus a slight hypocrite, as I do use "sample=samples" in my core Snakefile. It has caused me minor stress, which is why I make this recommendation. It also makes it easier for others to read your code.
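The rule of thumb above (a wildcard that appears in the input must be recoverable from the output) can be expressed as a small check. This is a rough sketch in plain Python; the function names are hypothetical and the regular expression is a simplification that ignores escaped double braces, not Snakemake's own parser.

```python
import re

def wildcard_names(pattern):
    """Return the set of wildcard names in a Snakemake-style pattern.

    Matches {name} or {name,constraint}. Simplified: it does not treat
    doubled braces such as {{name}} specially.
    """
    return set(re.findall(r"\{\s*(\w+)\s*(?:,[^{}]*)?\}", pattern))

def undetermined_wildcards(inputs, outputs):
    """Wildcards used in the inputs that no output pattern supplies."""
    out_names = set().union(*(wildcard_names(o) for o in outputs))
    in_names = set().union(*(wildcard_names(i) for i in inputs))
    return in_names - out_names

# The rule from the question: 'dir' cannot be inferred from the outputs.
missing = undetermined_wildcards(
    ["{dir}/{sample}.bam"],
    ["/results/{sample}_1.fastq", "/results/{sample}_2.fastq"],
)
print(missing)
```

An empty result means every input wildcard also occurs in some output pattern, which is the situation Snakemake needs in order to work backwards from a requested file.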
EDIT 1; Adding to response as I had initially missed the requirement for key-value matching of the dir and sample
I recommend keeping separate the path and the sample name in different variables. Two approaches I can think of:
Keep using glob_wildcards to make a blanket search for all possible variables, and then use a python function to validate which path+file combinations are legit.
Drop the usage of glob_wildcards. Propagate the directory name as a wildcard variable, {dir}, throughout your rules. Just set it as a sub-directory of "results". Use pandas to pass known key-value pairs listed in a file to the rule all. Initially I suggest generating the key-value pairs file manually, but eventually its generation could just be a rule upstream of the others.
Generalizing bam_to_fq a little bit... utilizing an external config, something like....
from pandas import read_table
rule all:
    input:
        expand("/results/{sample[1][dir]}/{sample[1][file]}_{end}.fastq.gz", sample=read_table(config["sampleFILE"], sep=" ").iterrows(), end=['1','2'])
rule bam_to_fq:
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{dir}/{sample}_1.fastq",
        right="/results/{dir}/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
sampleFILE
dir file
dir1 file1
dir2 file2
dir3 file3
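The table-driven approach can be tried without Snakemake or pandas. The sketch below uses only the standard library to turn the space-separated table into the target list; SAMPLE_TABLE and targets_from_table are hypothetical names, and the table contents are copied from the sampleFILE example above.

```python
import csv
import io

# Stand-in for the contents of sampleFILE (space-separated, with a header).
SAMPLE_TABLE = "dir file\ndir1 file1\ndir2 file2\ndir3 file3\n"

def targets_from_table(text, ends=("1", "2")):
    """Build one target per (row, read end), keeping dir and file paired."""
    rows = csv.DictReader(io.StringIO(text), delimiter=" ")
    return [
        f"/results/{row['dir']}/{row['file']}_{end}.fastq.gz"
        for row in rows
        for end in ends
    ]

table_targets = targets_from_table(SAMPLE_TABLE)
print(table_targets)
```

Because each row carries its own dir and file, only the listed pairs are requested, which is the key-value matching the edit above is after.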

how can I pass a string in config file into the output section?

I'm new to Snakemake. I've been trying to convert my shell-script-based pipeline to Snakemake today and ran into a lot of syntax issues. I think most of my trouble is around getting all the files in a particular directory and inferring output names from input names, since that's how I use shell scripts (a for loop). In particular, I tried to use the expand function in the output section and it always gave me an error.
After checking some example Snakefile, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used and if so, why? What if I want to pass a prefix defined in config.yaml file as part of the output file and that prefix can not be inferred from input file names, how can I achieve that, just like what I did below for the log section where {runid} is my prefix?
Second question about syntax: I tried to pass a user defined id in the configuration file (config.yaml) into the log section and it seems to me that here I have to use expand in the following form, is there a better way of passing strings defined in config.yaml file?
log:
    expand("fastq/fastqc/{runid}_fastqc_log.txt",runid=config["run"])
where in the config.yaml
run:
    "run123"
Third question: I initially tried the following 2 methods but they gave me errors so does it mean that inside log (probably input and output) section, Python syntax is not followed?
log:
    "fastq/fastqc/"+config["run"]+"_fastqc_log.txt"
log:
    "fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]
rule all:
    input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)
rule peak_calling:
    input: control="{control}.sam", treatment="{treatment}.sam"
    output: "{treatment}_vs_{control}.bed"
    shell: "touch {output}"
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
As for the last part, the right way to put it would be the first one:
log:
    "fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
    input: "{samples}.fastq"
    output: "{samples}.sam"
    log: "{samples}.log"
    shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is your folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
    "folder_with_configs/some_config.yaml"
rule one_to_rule_all:
    output:
        f"results/{config['var']}.file"
    shell:
        "touch {output[0]}"
Do remember the Python rules about nesting different types of quotes.
config in a Snakemake rule is a simple Python dictionary.
If you need to use additional variables in a path, e.g. some_param, use more curly brackets.
rule one_to_rule_all:
    output:
        f"results/{config['var']}.{{some_param}}"
    shell:
        "touch {output[0]}"
enjoy
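The brace handling in the f-string answer can be verified in plain Python. In this short sketch the config dictionary is a stand-in for the one Snakemake builds from the YAML file, and the variable names are only illustrative.

```python
# Stand-in for the dict Snakemake builds from some_config.yaml.
config = {"var": "value"}

# Single braces are evaluated immediately by the f-string.
plain = f"results/{config['var']}.file"

# Doubled braces survive as single braces, leaving a wildcard for Snakemake.
with_wildcard = f"results/{config['var']}.{{some_param}}"

# Plain concatenation gives the same pattern without any escaping.
concatenated = "results/" + config["var"] + ".{some_param}"

print(plain)
print(with_wildcard)
print(concatenated)
```

Either spelling produces the same string, so the choice between an f-string and concatenation is mostly a matter of readability.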