How to implement splitting of files in Snakemake when the number of files is known

Context
rule A uses the split command in a shell directive.
The number of files generated by rule A depends on a user-specified value from the config and is thus known.
In this question there is a difference because the number of output files is unknown, but there is a reference to the dynamic() keyword. Apparently this has been replaced by the use of checkpoint. Is that really the correct way to go in this scenario? There is also something like scattergather, but the example is not clear to me (see the sketch below).
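For reference, a minimal scatter-gather sketch following the pattern in the Snakemake documentation (the scattergather directive, the scatter./gather. helpers and the 1-of-N scatteritem naming are Snakemake's; the file names here are made up):

scattergather:
    split=8    # number of scatter items; adjustable with --set-scatter split=N

rule split:
    input:
        "data/all.txt"
    output:
        # expands to splitted/1-of-8.txt ... splitted/8-of-8.txt
        scatter.split("splitted/{scatteritem}.txt")
    shell:
        "..."  # command that writes one file per scatter item

rule gather:
    input:
        gather.split("splitted/{scatteritem}.txt")
    output:
        "gathered/all.txt"
    shell:
        "cat {input} > {output}"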
Code
chunks = config["chunks"]
sample_list = ["S1", "S2"]
rule all:
    input:
        expand("{sample}_chunk_{chunk}_done_something.tsv", sample=sample_list,
               chunk=[f"{i}".zfill(len(str(chunks))-1) for i in range(0, chunks)])
rule A:
    input:
        "input_file_{sample}.tsv"
    output:
        # the user-defined number of chunks, how to specify these?
    params:
        chunks=chunks
    shell:
        "split -n {params.chunks} --numeric-suffixes=1 --additional-suffix=.tsv {input[0]} some_prefix_{wildcards.sample}_"
rule B:
    input:
        "some_prefix_{sample}_{chunk}.tsv"
    output:
        "{sample}_chunk_{chunk}_done_something.tsv"
    shell:
        "# Do something"
Attempts
I tried using a checkpoint with an input function for rule B and using directory() in rule A. However, using directory() results in SyntaxError in line 253 of MySnakefile: Unexpected keyword directory in rule definition (Snakefile, line 253), and even if that did not throw an error, I don't know how to get chunks into the input function, since it is not a wildcard.
What is the best way to implement the splitting of an input file in Snakemake?

Since the number of chunks is known beforehand, you can build the output files of rule A from the chunks parameter with a list comprehension:
rule A:
    ...
    output:
        chunks = ["some_prefix_{{sample}}_{:02d}.tsv".format(x+1) for x in range(chunks)]
With chunks = 2, this expands to chunks = ["some_prefix_{sample}_01.tsv", "some_prefix_{sample}_02.tsv"], matching the suffixes produced by split. The doubled braces matter: .format turns {{sample}} into {sample}, which Snakemake then fills in with its standard wildcard replacement.
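Putting it together, a minimal sketch of the whole Snakefile (my assumptions: split's default two-digit numeric suffixes, i.e. at most 99 chunks, and a placeholder command in rule B):

chunks = config["chunks"]  # user-specified number of chunks
sample_list = ["S1", "S2"]

# chunk ids formatted to match split's default two-digit numeric suffixes
chunk_ids = ["{:02d}".format(i + 1) for i in range(chunks)]

rule all:
    input:
        expand("{sample}_chunk_{chunk}_done_something.tsv",
               sample=sample_list, chunk=chunk_ids)

rule A:
    input:
        "input_file_{sample}.tsv"
    output:
        ["some_prefix_{{sample}}_{}.tsv".format(c) for c in chunk_ids]
    params:
        chunks=chunks
    shell:
        "split -n {params.chunks} --numeric-suffixes=1 --additional-suffix=.tsv {input[0]} some_prefix_{wildcards.sample}_"

rule B:
    input:
        "some_prefix_{sample}_{chunk}.tsv"
    output:
        "{sample}_chunk_{chunk}_done_something.tsv"
    shell:
        "cp {input} {output}"  # placeholder for the real command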

Related

How to write a Snakemake rule-all, where expand statements can handle the absence of all particular input files

I want to write a Snakemake pipeline to process short-read sequencing files, long-read files, or both, depending on which file types are provided as input.
First my Snakefile calls a shell script that creates a config file with the names of all short-read files in the input directory under the heading short_reads, and all long-read files under the heading long_reads.
This is followed by my all rule:
rule all:
    input:
        expand("../qc/id/{sample}/fastqc_raw/{sample}_R1_fastqc.html", sample=config["samples_short"]),
        expand("../qc/id/{sample}/nanoplot_raw/NanoPlot-report.html", sample=config["samples_long"])
        ...
However, if one of the file types (long or short reads) is not provided, Snakemake fails with a KeyError.
If I modify the config file so that the heading is still there but without sample names, Snakemake tries to resolve the input with the value None, e.g.
Missing input files for rule nanoplot_raw:
../raw_reads/None_ont.fastq.gz
How can I design the rule all so that it can handle only short reads, only long reads, or both sequence types as input?
Thanks for your help!
Does the following work?
if config["samples_short"]:
    fastqc_short = expand("../qc/id/{sample}/fastqc_raw/{sample}_R1_fastqc.html", sample=config["samples_short"])
else:
    fastqc_short = []

if config["samples_long"]:
    nanoplot_long = expand("../qc/id/{sample}/nanoplot_raw/NanoPlot-report.html", sample=config["samples_long"])
else:
    nanoplot_long = []

rule all:
    input:
        fastqc_short,
        nanoplot_long,
        ...
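A more compact variant of the same idea (my sketch, not part of the original answer): config.get avoids the KeyError when a heading is missing entirely, and `or []` catches the None value that appears when the heading is present but empty.

# empty list when the key is absent (KeyError case) or its value is None
# (heading present but no samples listed)
samples_short = config.get("samples_short") or []
samples_long = config.get("samples_long") or []

rule all:
    input:
        expand("../qc/id/{sample}/fastqc_raw/{sample}_R1_fastqc.html", sample=samples_short),
        expand("../qc/id/{sample}/nanoplot_raw/NanoPlot-report.html", sample=samples_long)

expand over an empty list simply yields no targets, so either branch can be absent.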

Snakemake: Error when trying to run a rule for multiple directories and files

I create a dictionary in Python that maps each directory I want the software to run on (the keys) to a list of the expected output samples for that directory (the values). Right now I have a structure like this:
dirSampleDict = {'/path_to_directory1': ["sample1", "sample2", "sample3"],
                 '/path_to_directory2': ["sample1", "sample2"],
                 '/path_to_directory3': ["sample1", "sample2", "sample3"]}
# dirSampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the
# corresponding value dirSampleDict[key] is a list, e.g. ["a", "b", "c"]
def input():
    input = []
    for key in dirSampleDict:
        input.extend(expand('{dir}/{sample}*.foo', dir=key, sample=dirSampleDict[key]))
    return input

rule all:
    input:
        input()

# example should run some software on different directories for each set of
# directories and their expected output samples
rule example:
    input:
        # the path to each set of samples should be the wildcard
        dir = lambda wildcards: expand("{dir}", dir=dirSampleDict.keys())
    params:
        # some params
    output:
        expand('{dir}/{sample}*.foo', dir=key, sample=dirSampleDict[key])
    log:
        log = '{dir}/{sample}.log'
    run:
        cmd = "software {dir}"
        shell(cmd)
Doing this I receive the following error:
No values given for wildcard 'dir'.
Edit: Maybe it was not so clear what I actually wanted to do, so I have filled in some data.
I also tried using the wildcards I set up in rule all, as follows:
dirSampleDict = {'/path_to_directory1': ["sample1", "sample2", "sample3"],
                 '/path_to_directory2': ["sample1", "sample2"],
                 '/path_to_directory3': ["sample1", "sample2", "sample3"]}
# dirSampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the
# corresponding value dirSampleDict[key] is a list, e.g. ["a", "b", "c"]

def input():
    input = []
    for key in dirSampleDict:
        input.extend(expand('{dir}/{sample}*.foo', dir=key, sample=dirSampleDict[key]))
    return input

rule all:
    input:
        input()

# example should run some software on different directories for each set of
# directories and their expected output samples
rule example:
    input:
        # the path to each set of samples should be the wildcard
        dir = "{{dir}}"
    params:
        # some params
    output:
        '{dir}/{sample}*.foo'
    log:
        log = '{dir}/{sample}.log'
    run:
        cmd = "software {dir}"
        shell(cmd)
Doing this I receive the following error:
Not all output, log and benchmark files of rule example contain the
same wildcards. This is crucial though, in order to avoid that two or
more jobs write to the same file.
I'm pretty sure the second version is closer to what I actually want, since expand() in the output would run the rule only once, but I need it to run for every key-value pair in the dictionary.
First of all, what do you expect from the asterisk in the output?
output:
    '{dir}/{sample}*.foo'
The output has to be a list of valid filenames that can be formed by substituting each wildcard with some string.
The next problem is that you are using "{dir}" in the run: section. There is no variable dir defined in the script used for run:. If you want to use the wildcard, you need to address it as wildcards.dir. Alternatively, the run: section can be replaced with a shell: section:
shell:
    "software {wildcards.dir}"
Regarding your first script: there is no dir wildcard defined (actually there are no wildcards at all):
output:
    expand('{dir}/{sample}*.foo', dir=key, sample=dirSampleDict[key])
Both {dir} and {sample} are variables in the context of the expand function, and they are fully substituted with the named parameters.
Now the second script. What did you mean by this input?
input:
    dir = "{{dir}}"
Here "{{dir}}" is not a wildcard but a reference to a global variable (you haven't provided the rest of your script, so I cannot judge whether it is defined or not). Moreover, what is the input needed for? You never use the {input} variable at all, and there are no dependencies needed to connect rule example with any other rule that would produce its input.
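For completeness, here is one way the rule could be restructured. This is my sketch, not the answerer's, and it assumes the intent is to run the software once per directory, with the per-sample *.foo files produced as side effects (their lists differ per directory, so they cannot all be declared as wildcard outputs); a flag file tracks completion:

dirSampleDict = {'/path_to_directory1': ["sample1", "sample2", "sample3"],
                 '/path_to_directory2': ["sample1", "sample2"],
                 '/path_to_directory3': ["sample1", "sample2", "sample3"]}

rule all:
    input:
        expand('{dir}/done.flag', dir=dirSampleDict.keys())

rule example:
    output:
        # touch() creates the flag once the shell command has succeeded
        touch('{dir}/done.flag')
    log:
        '{dir}/software.log'
    shell:
        "software {wildcards.dir} > {log} 2>&1"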

Specify input and output files in Snakefile

I'm new to Snakemake and I want to make a pipeline that takes a given input text file and concatenates its content to a given output file. However, I want to be able to specify the names of both the input and output files at run time, so neither file name is hardcoded in the Snakefile. Right now all I can come up with is:
rule all:
    input:
        "{input}.txt",
        "{output}.txt"

rule output_files:
    input:
        "{input}.txt"
    output:
        "{output}.txt"
    shell:
        "cat {input}.txt > {output}.txt"
I tried running this with "snakemake input1.txt output.txt" but I got the error:
Building DAG of jobs...
WildcardError in line 6 of Snakefile:
Wildcards in input files cannot be determined from output files:
'input'
Any suggestions would be greatly appreciated.
In your example you actually copy a single input file into an output file using a cat shell command. That could be understood as an intention to concatenate several inputs into one output:
rule concatenate:
    input:
        "input1.txt",
        "input2.txt"
    output:
        "output.txt"
    shell:
        "cat {input} > {output}"
takes a given input text file and concatenates its content to a given output file
Another way to understand the question is that you are trying to append an input file to the end of the output. That is more challenging: Snakemake "thinks" in terms of goals, where each goal is a distinct file. How would Snakemake know whether the output file is a raw one or already a concatenated version? One way is to use "flag" files: the presence of such a file means that the goal is achieved and no concatenation is needed. One more problem: Snakemake clears the output file before running the rule. That means you need to specify it as an input:
rule append:
    input:
        inp = "input.txt",   # 'in' is a reserved word in Python, hence 'inp'
        out = "output.txt"
    output:
        flag = "flag"
    shell:
        "cat {input.inp} >> {input.out} && touch {output.flag}"
Now back to your question regarding the error and the way to specify the filenames at runtime. You get this error because wildcards have to be fully inferred from the output section, and both your rules are ill-formed. Let's start with the rule all.
You need to tell Snakemake what goal you are building. No wildcards in the input; everything should be disambiguated:
def getInput():
    # form the actual goal (you may query a database, a service, hardcode, etc.)
    pass

rule all:
    input: getInput()
Let's say you decided that the goal should be 3 files: ["output1.txt", "output2.txt", "output3.txt"]:
def getInput():
    magic_numbers_from_oracle = ["1", "2", "3"]
    return magic_numbers_from_oracle

rule all:
    input: expand("output{number}.txt", number=getInput())
Ok, now Snakemake knows the goal. The next step is to write a rule that says how to create a single output{number}.txt file. For simplicity I'm taking your initial approach with cat/copying:
rule cat_copy:
    input:
        "input{n}.txt"
    output:
        "output{n}.txt"
    shell:
        "cat {input} > {output}"
That's it. As long as you have files input1.txt, input2.txt, input3.txt you would get the corresponding outputs.
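As for picking the names at run time specifically: positional arguments to snakemake are interpreted as targets, not parameters, which is why snakemake input1.txt output.txt cannot work. One sketch (my addition, with hypothetical config keys) is to pass the names through --config:

# file names come from the config; the defaults here are made up
infile = config.get("infile", "input.txt")
outfile = config.get("outfile", "output.txt")

rule all:
    input: outfile

rule cat_copy:
    input: infile
    output: outfile
    shell:
        "cat {input} > {output}"

Invoked as: snakemake --cores 1 --config infile=a.txt outfile=b.txt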

Snakemake copy from several directories

Snakemake is super-confusing to me. I have files of the form:
indir/type/name_1/run_1/name_1_processed.out
indir/type/name_1/run_2/name_1_processed.out
indir/type/name_2/run_1/name_2_processed.out
indir/type/name_2/run_2/name_2_processed.out
where type, name, and the numbers are variable. I would like to aggregate files such that all files with the same "name" end up in a single dir:
outdir/type/name/name_1-1.out
outdir/type/name/name_1-2.out
outdir/type/name/name_2-1.out
outdir/type/name/name_2-2.out
How do I write a Snakemake rule to do this? I first tried the following:
rule rename:
    input:
        "indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"
    output:
        "outdir/{type}/{name}/{name}_{nameno}-{runno}.out"
    shell:
        "cp {input} {output}"

# example command: snakemake --cores 1 outdir/type/name/name_1-1.out
This worked, but doing it this way doesn't save me any effort because I have to know what the output files are ahead of time, so basically I'd have to pass all the output files as a list of arguments to snakemake, requiring a bit of shell trickery to get the variables.
So then I tried to use directory() (as well as giving up on preserving runno).
rule rename2:
    input:
        "indir/{type}/{name}_{nameno}"
    output:
        directory("outdir/{type}/{name}")
    shell:
        """
        i=0
        for d in {input}/run_*; do
            for f in ${{d}}/*processed.out; do
                cp ${{f}} {output}/{wildcards.name}_{wildcards.nameno}-${{i}}.out
            done
            let ++i
        done
        """
This gave me the error, Wildcards in input files cannot be determined from output files: 'nameno'. I get it; {nameno} doesn't exist in the output. But I don't want it there in the directory name, only in the filename that gets copied.
Also, if I delete {nameno}, then it complains because it can't find the right input file.
What are the best practices here for what I'm trying to do? Also, how does one wrap one's head around the fact that in Snakemake you specify outputs, not inputs? I think this latter fact is what is so confusing.
I guess what you need is the expand function:
rule all:
    input: expand("outdir/{type}/{name}/{name}_{nameno}-{runno}.out",
                  type=TYPES,
                  name=NAMES,
                  nameno=NAME_NUMBERS,
                  runno=RUN_NUMBERS)
TYPES, NAMES, NAME_NUMBERS and RUN_NUMBERS are the lists of all possible values for these parameters. You either need to hardcode them or use the glob_wildcards function to collect these values:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS, = glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out")
This however would give you duplicates. If that is not desirable, remove them:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS, = map(set, glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"))
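One caveat worth adding (my note, not the answerer's): plain expand builds the full cross product of the value lists, so type/name combinations that never existed on disk become missing targets. If the tuples from glob_wildcards should stay paired, expand accepts zip as its second argument:

# keep the wildcard tuples from glob_wildcards paired instead of taking
# the cross product
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS = glob_wildcards(
    "indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out")

rule all:
    input: expand("outdir/{type}/{name}/{name}_{nameno}-{runno}.out", zip,
                  type=TYPES, name=NAMES, nameno=NAME_NUMBERS, runno=RUN_NUMBERS)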

How to gather files from subdirectories to run jobs in Snakemake?

I am currently working on a project where I am struggling with this issue.
My current directory structure is
/shared/dir1/file1.bam
/shared/dir2/file2.bam
/shared/dir3/file3.bam
I want to convert various .bam files to fastq in the results directory
results/file1_1.fastq.gz
results/file1_2.fastq.gz
results/file2_1.fastq.gz
results/file2_2.fastq.gz
results/file3_1.fastq.gz
results/file3_2.fastq.gz
I have the following code:
END = ["1", "2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")

rule all:
    input: expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input: "{dir}/{sample}.bam"
    output: left="/results/{sample}_1.fastq", right="/results/{sample}_2.fastq"
    shell: "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"
This outputs the following error:
Wildcards in input files cannot be determined from output files:
'dir'
Any help would be appreciated
You're just missing an assignment for "dir" in the input directive of rule bam_to_fq. In your code, you are asking Snakemake to determine {dir} from the output of the same rule, because you have it set up as a wildcard. Since it doesn't exist as a variable in your output directive, you received an error.
input:
    "{dir}/{sample}.bam"
output:
    left="/results/{sample}_1.fastq",
    right="/results/{sample}_2.fastq",
Rule of thumb: input and output wildcards must match
rule all:
    input:
        expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input:
        expand("{dir}/{{sample}}.bam", dir=dirs)
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
NOTES
The sample variable in the input directive now requires double braces, {{sample}}, because that is how one marks wildcards inside an expand.
dir is no longer a wildcard; it is explicitly set to the list of directories returned by the glob_wildcards call and assigned to the variable "dirs", which I assume you create earlier in your script, since the assignment of one of the other variables already works in your rule all input ("sample=files").
I like and recommend easily differentiable variable names. I'm not a huge fan of the variable names "dir" and "dirs": they make you prone to pedantic spelling errors. Consider changing them to "dirLIST" and "dir", or anything really; I just fear one day someone will miss an 's' somewhere and it will be frustrating to debug. I'm personally guilty, and thus a slight hypocrite, as I use "sample=samples" in my core Snakefile. It has caused me minor stress, which is why I make this recommendation. It also makes your code easier for others to read.
EDIT 1: Adding to the response, as I had initially missed the requirement for key-value matching between dir and sample.
I recommend keeping the path and the sample name in separate variables. Two approaches I can think of:
1. Keep using glob_wildcards to make a blanket search for all possible values, then use a Python function to validate which path+file combinations are legitimate (see the sketch at the end of this answer).
2. Drop the usage of glob_wildcards. Propagate the directory name as a wildcard variable, {dir}, throughout your rules, and just set it as a sub-directory of "results". Use pandas to pass known key-value pairs, listed in a file, to the rule all. Initially I suggest generating the key-value pairs file manually; eventually its generation could just be a rule upstream of the others.
Generalizing bam_to_fq a little bit, utilizing an external config, something like:
from pandas import read_table

rule all:
    input:
        expand("/results/{sample[1][dir]}/{sample[1][file]}_{end}.fastq.gz",
               sample=read_table(config["sampleFILE"], " ").iterrows(), end=['1', '2'])

rule bam_to_fq:
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{dir}/{sample}_1.fastq",
        right="/results/{dir}/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
sampleFILE
dir file
dir1 file1
dir2 file2
dir3 file3
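And a sketch of approach 1, pairing each directory with its own files via glob_wildcards (my addition; allow_missing keeps {end} unexpanded in the first expand so the second one can fill it in):

dirs, files = glob_wildcards("/shared/{dir}/{file}.bam")

# zip pairs each dir with its own file instead of taking the cross product;
# allow_missing leaves {end} in place for the second expand
pairs = expand("/results/{dir}/{file}_{end}.fastq.gz", zip,
               dir=dirs, file=files, allow_missing=True)

rule all:
    input:
        expand(pairs, end=["1", "2"])

rule bam_to_fq can then keep {dir} and {file} as ordinary wildcards in its input and output, as in the config-based version above.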