Snakemake: exit a rule during execution

Is there a way to print a helpful message and allow Snakemake to exit the workflow without giving an error? I have this example workflow:
def readFile(file):
    with open(file) as f:
        line = f.readline()
    return line.strip()

def isFileEmpty(file):
    with open(file) as f:
        line = f.readline()
    if line.strip() != '':
        return True
    else:
        return False
rule all:
    input: "output/final.txt"

rule step1:
    input: "input.txt"
    output: "out.txt"
    run:
        if readFile(input[0]) == 'a':
            shell("echo 'a' > out.txt")
        else:
            shell("echo '' > out.txt")

rule step2:
    input: "out.txt"
    output: dynamic("output/{files}")
    run:
        i = isFileEmpty(input[0])
        if i:
            shell("echo 'out2' > output/out2.txt")
        else:
            print("Out.txt is empty, workflow ended")

rule step3:
    input: "output/out2.txt"
    output: "output/final.txt"
    run: shell("echo 'final' > output/final.txt")
In step 1, I read the contents of input.txt; if it doesn't contain the letter 'a', an empty out.txt is produced. Step 2 checks whether out.txt is empty. If it's not empty, steps 2 and 3 run and produce final.txt at the end. If it is empty, I want Snakemake to print the message "Out.txt is empty, workflow ended" and exit immediately, without performing step 3 and without giving an error message. Right now my code prints the message at step 2 when input.txt is empty, but it still tries to run step 3 and gives a MissingOutputException because final.txt is not generated. I understand the reason: final.txt is one of the input files of rule all. But I'm having trouble writing this workflow because final.txt may or may not be produced.
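One way around this (a sketch of my own, not from the post, with the hypothetical helper name choose_targets) is to decide the targets before the DAG is built: a plain Python function inspects input.txt and hands rule all only the outputs that can actually be produced, so Snakemake never schedules step 3 when the file lacks 'a'.

```python
# Hypothetical helper: pick workflow targets based on the input file's
# first line. If it is 'a', the full pipeline runs; otherwise only the
# intermediate file is requested and step 3 is never scheduled.
def choose_targets(path):
    with open(path) as f:
        first = f.readline().strip()
    if first == 'a':
        return ["output/final.txt"]
    print("input.txt does not contain 'a', workflow will stop after step 1")
    return ["out.txt"]

# In the Snakefile this would be wired up as:
# rule all:
#     input: choose_targets("input.txt")
```

Because the check runs at DAG-build time, no rule ever fails with a missing output; the trade-off is that input.txt must already exist when Snakemake parses the Snakefile.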

snakemake checkpoint and create a expand list/wildcards based on created output files

Hope someone can guide me in the right direction. See below for a small working example:
import os
from glob import glob

rule all:
    input:
        expand("output/step4/{number}_step4.txt", number=["1", "2", "3", "4"])

checkpoint split_to_fasta:
    input:
        seqfile = "files/seqs.csv"  # just an empty file in this example
    output:
        fastas = directory("output/fasta")
    shell:
        # A python script will create the files below and I don't know them beforehand.
        "mkdir -p {output.fastas} ; "
        "echo test > {output.fastas}/1_LEFT_sample_name_foo.fa ; "
        "echo test > {output.fastas}/1_RIGHT_sample_name_foo.fa ; "
        "echo test > {output.fastas}/2_LEFT_sample_name_spam.fa ; "
        "echo test > {output.fastas}/2_RIGHT_sample_name_bla.fa ; "
        "echo test > {output.fastas}/3_LEFT_sample_name_egg.fa ; "
        "echo test > {output.fastas}/4_RIGHT_sample_name_ham.fa ; "

rule step2:
    input:
        fasta = "output/fasta/{fasta}.fa"
    output:
        step2 = "output/step2/{fasta}_step2.txt",
    shell:
        "cp {input.fasta} {output.step2}"

rule step3:
    input:
        file = rules.step2.output.step2
    output:
        step3 = "output/step3/{fasta}_step3.txt",
    shell:
        "cp {input.file} {output.step3}"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
    ### don't know where to use this line correctly
    ### files = [Path(x).stem.split("_")[0] for x in glob("output/step3/*_step3.txt")]
    return expand("output/step3/{fasta}_step3.txt",
                  fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta)

def get_id_files(wildcards):
    blast = glob("output/step3/" + f"{wildcards.number}*_step3.txt")
    return sorted(blast)

rule step4:
    input:
        step3files = aggregate_input,
        idfiles = get_id_files
    output:
        step4 = "output/step4/{number}_step4.txt",
    run:
        shell("cat {input.idfiles} > {output.step4}")
Because of rule all, Snakemake knows how to "start" the pipeline. I hard-coded the numbers 1, 2, 3 and 4, but in a real situation I don't know these numbers beforehand.
expand("output/step4/{number}_step4.txt", number=["1","2","3","4"])
What I want is to get those numbers from the output filenames of split_to_fasta, step2 or step3, and then use them as a target for wildcards. (I can easily get the numbers with glob and split.)
I want to do it with wildcards as in def get_id_files because I want to execute the next step in parallel. In other words, the following sets of files need to go into the next step:
[1_LEFT_sample_name_foo.fa, 1_RIGHT_sample_name_foo.fa]
[2_LEFT_sample_name_spam.fa, 2_RIGHT_sample_name_bla.fa]
[3_LEFT_sample_name_egg.fa]
[4_RIGHT_sample_name_ham.fa]
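The grouping described above (all files sharing the leading id go into one step4 job) can be sketched in plain Python, assuming filenames always start with `<id>_`:

```python
from collections import defaultdict

# Group fasta filenames by the numeric id before the first underscore.
# Each resulting group corresponds to one step4 job.
def group_by_id(filenames):
    groups = defaultdict(list)
    for name in filenames:
        groups[name.split("_")[0]].append(name)
    return dict(groups)

files = [
    "1_LEFT_sample_name_foo.fa", "1_RIGHT_sample_name_foo.fa",
    "2_LEFT_sample_name_spam.fa", "2_RIGHT_sample_name_bla.fa",
    "3_LEFT_sample_name_egg.fa", "4_RIGHT_sample_name_ham.fa",
]
# group_by_id(files) maps "1" -> the two id-1 files, "3" -> one file, etc.
```

In the Snakemake solution below the same split happens implicitly: the `{number}` wildcard plays the role of the dictionary key.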
I may need a second checkpoint, but I am not sure how to implement that.
EDIT (solution):
I was already worried my question was not so clear, so I made another example; see below. This pipeline generates some fake files (end of step 3). From this point I want to continue and process all files with the same id in parallel. The id is the number at the beginning of the filename. I could make a second pipeline that "starts" with step 4 and execute them after each other, but that sounds like bad practice. I think I need to define a target for the next rule (step 4) but don't know how to do that in this situation. The code to define the target itself is something like:
files = [Path(x).stem.split("_")[0] for x in glob("output/step3/*_step3.txt")]
ids = list(set(files))
expand("output/step4/{number}_step4.txt", number=ids)
The second example (Edited to the solution):
import os
from glob import glob
from pathlib import Path

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
    ids = [Path(x).stem.split("_")[0] for x in glob("output/fasta/*.fa")]
    return expand("output/step3/{fasta}_step3.txt",
                  fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta) + \
           expand("output/step4/{number}_step4.txt", number=ids)

rule all:
    input:
        aggregate_input,

checkpoint split_to_fasta:
    input:
        seqfile = "files/seqs.csv"
    output:
        fastas = directory("output/fasta")
    shell:
        # A python script will create the files below and I don't know them beforehand.
        # I could get the numbers if needed.
        "mkdir -p {output.fastas} ; "
        "echo test1 > {output.fastas}/1_LEFT_sample_name_foo.fa ; "
        "echo test2 > {output.fastas}/1_RIGHT_sample_name_foo.fa ; "
        "echo test3 > {output.fastas}/2_LEFT_sample_name_spam.fa ; "
        "echo test4 > {output.fastas}/2_RIGHT_sample_name_bla.fa ; "
        "echo test5 > {output.fastas}/3_LEFT_sample_name_egg.fa ; "
        "echo test6 > {output.fastas}/4_RIGHT_sample_name_ham.fa ; "

rule step2:
    input:
        fasta = "output/fasta/{fasta}.fa"
    output:
        step2 = "output/step2/{fasta}_step2.txt",
    shell:
        "cp {input.fasta} {output.step2}"

rule step3:
    input:
        file = rules.step2.output.step2
    output:
        step3 = "output/step3/{fasta}_step3.txt",
    shell:
        "cp {input.file} {output.step3}"

def get_id_files(wildcards):
    # blast = glob(f"output/step3/{wildcards.number}*_step3.txt")
    blast = expand(f"output/step3/{wildcards.number}_{{sample}}_step3.txt",
                   sample=glob_wildcards(f"output/fasta/{wildcards.number}_{{sample}}.fa").sample)
    return blast

rule step4:
    input:
        idfiles = get_id_files
    output:
        step4 = "output/step4/{number}_step4.txt",
    run:
        shell("cat {input.idfiles} > {output.step4}")
Replace your blast line in get_id_files in the second example with:
blast = expand(f"output/step3/{wildcards.number}_{{sample}}_step3.txt",
               sample=glob_wildcards(f"output/fasta/{wildcards.number}_{{sample}}.fa").sample)
This is my way of understanding checkpoints: when the input of a rule (say rule a) depends on a checkpoint, everything between the checkpoint and a is left out of the first evaluation of the DAG. Once the checkpoint has been successfully executed, a second round of evaluation starts, this time knowing the outputs of the checkpoint.
So in your case, putting the checkpoint function in rule all hides step2/3/4 at the first evaluation (since these steps are upstream of all). Then the checkpoint gets executed, followed by the second evaluation. At this point you are evaluating a new workflow that knows all outputs of the checkpoint, so you can 1. infer the ids, and 2. infer the corresponding step3 outputs from the split_to_fasta outputs.
1st evaluation: rule all -> checkpoint split_to_fasta (TBD)
2nd evaluation (split_to_fasta executed): rule all -> checkpoint split_to_fasta -> rule step4 -> rule step3 -> rule step2
get_id_files is evaluated for step4, at which point step3 has not yet been executed; this is why you need to infer the ids from the outputs of split_to_fasta instead of directly globbing the outputs of step3.
If I understand the problem correctly, the following line should be changed:
ids = [Path(x).stem.split("_")[0] for x in glob("output/step3/*_step3.txt")]
Right now it is glob-bing for files in step3 (I presume these files do not yet exist when the DAG is built). Instead, the right thing to glob is the output of the rule split_to_fasta, so something like this:
ids = [Path(x).stem.split("_")[0] for x in glob("output/fasta/*.fa")]
And later use these ids to extract the relevant wildcards and use them in the expand("output/step3/{fasta}_step3.txt", ...).
Sorry this is not a functional example, but the original code is a bit hard to read.

How do Snakemake checkpoints work when I do not want to make a folder?

I have a Snakemake file where one rule produces a file from which I would like to extract the headers and use them as wildcards in my rule all.
The Snakemake guide provides an example where it creates new folders named like the wildcards. If I can avoid that it would be nice, since in some cases it would need to create 100-200 folders. Any suggestions on how to make it work?
link to snakemake guide:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
import pandas as pd

rule all:
    input:
        final_report = expand('report_{fruit}.txt', fruit= ???)

rule create_file:
    input:
    output:
        fruit = 'fruit_file.csv'
    run:
        ....

rule next:
    input:
        fruit = 'fruit_file.csv'
    output:
        report = 'report_{fruit}.txt'
    run:
        fruit_file = pd.read_csv(input.fruit, header=0, sep='\t')
        fruits = fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = 'touch report_' + i + '.txt'
            shell(cmd)
This is a simplified workflow; I am actually using a long script to produce both the fruit_file.csv and the report files.
The fruit_file.csv is tab-separated and could look like this:
FID IID Apple Banana Plum
Mouse Mickey 0 0 1
Mouse Minnie 1 0 1
Duck Donnald 0 1 0
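For reference, the wildcard values are just the header columns after the first two; a quick sketch with the standard csv module (pandas would give the same result via df.columns.tolist()[2:]):

```python
import csv
import io

# Tab-separated content matching the example file above.
data = "FID\tIID\tApple\tBanana\tPlum\nMouse\tMickey\t0\t0\t1\n"

# The fruit names are every header column after FID and IID.
header = next(csv.reader(io.StringIO(data), delimiter="\t"))
fruits = header[2:]
# fruits -> ['Apple', 'Banana', 'Plum']
```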
I think you are misreading the snakemake checkpoint example. You only need to create one folder in your case. They have a wildcard (sample) in the folder name, but that part of the output name is known ahead of time.
checkpoint fruit_reports:
    input:
        fruit = 'fruit_file.csv'
    output:
        report_dir = directory('reports')
    run:
        fruit_file = pd.read_csv(input.fruit, header=0, sep='\t')
        fruits = fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = f'touch {output}/report_{i}.txt'
            shell(cmd)
Since you do not know all the names (fruits) ahead of time, you cannot include them in the all rule. You need an intermediate rule to bring everything together. Maybe use a final report file:
rule all:
    input: 'report.txt'
Then after the checkpoint:
def aggregate_fruit(wildcards):
    checkpoint_output = checkpoints.fruit_reports.get(**wildcards).output[0]
    return expand("reports/report_{i}.txt",
                  i=glob_wildcards(os.path.join(checkpoint_output, "report_{i}.txt")).i)

rule report:
    input:
        aggregate_fruit
    output:
        "report.txt"
    shell:
        "ls -1 {input} > {output}"

Snakemake decide which rules to execute during execution

I'm working on a bioinformatics pipeline which must be able to run different rules to produce different outputs based on the contents of an input file:
def foo(file):
    '''
    Read the file contents and return a boolean value based on them.
    '''
    # Code to read file here...
    return bool

rule check_input:
    input: "input.txt"
    run:
        bool = foo("input.txt")

rule bool_is_True:
    input: "input.txt"
    output: "out1.txt"
    run:
        # Some code to generate out1.txt. This rule is supposed to run only if foo("input.txt") is True.

rule bool_is_False:
    input: "input.txt"
    output: "out2.txt"
    run:
        # Some code to generate out2.txt. This rule is supposed to run only if foo("input.txt") is False.
How do I write my rules to handle this situation? Also how do I write my first rule all if the output files are unknown before the rule check_input is executed?
Thanks!
You're right: Snakemake has to know which files to produce before executing the rules. Therefore, I suggest you use a function that reads what you called "the input file" and defines the outputs of the workflow accordingly.
ex:
def getTargetsFromInput():
    targets = list()
    ## read file and add target files to targets
    return targets

rule all:
    input: getTargetsFromInput()
...
You can define the path of the input file with the --config argument on the snakemake command line, or directly use some sort of structured config file (yaml, json) together with the configfile: keyword in the Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html
Thanks Eric. I got it to work with:
def getTargetsFromInput(file):
    with open(file) as f:
        line = f.readline()
    if line.strip() == "out1":
        return "out1.txt"
    else:
        return "out2.txt"

rule all:
    input: getTargetsFromInput("input.txt")

rule out1:
    input: "input.txt"
    output: "out1.txt"
    run: shell("echo 'out1' > out1.txt")

rule out2:
    input: "input.txt"
    output: "out2.txt"
    run: shell("echo 'out2' > out2.txt")
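Because the target-selection function is plain Python, it can be sanity-checked outside Snakemake before wiring it into rule all (a condensed variant of the same function, tested against a throwaway file):

```python
import os
import tempfile

def getTargetsFromInput(file):
    # Pick the workflow target based on the first line of the input file.
    with open(file) as f:
        line = f.readline()
    return "out1.txt" if line.strip() == "out1" else "out2.txt"

# Sanity check with a throwaway input file:
d = tempfile.mkdtemp()
path = os.path.join(d, "input.txt")
with open(path, "w") as f:
    f.write("out1\n")
assert getTargetsFromInput(path) == "out1.txt"
```

Keeping the decision logic in a standalone function like this also makes it trivial to unit-test independently of the Snakefile.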

A curious case of snakemake

I have a similar goal as in "Snakemake: unknown output/input files after splitting by chromosome"; however, as pointed out there, I do know in advance that I have, e.g., 5 chromosomes in my sample.bam file. Using this as a toy example:
$ cat sample.bam
chromosome 1
chromosome 2
chromosome 3
chromosome 4
chromosome 5
I wish to "split" this bam file and then run a bunch of per-chromosome downstream jobs on the resulting chromosomes. The simplest solution I could conjure up was:
chromosomes = '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam', chromosome=chromosomes)

rule chromosome:
    output:
        touch('sample.REF_{chromosome}.bam')
    input: 'split.done'

rule split_bam:
    output:
        touch('split.done')
    input: 'sample.bam'
    run:
        print('splitting bam..')
        chromosome = 1
        for line in open(input[0]):
            outfile = 'sample.REF_{}.bam'.format(chromosome)
            print(line, end='', file=open(outfile, 'w'))
            chromosome += 1
results in empty sample.REF_{chromosome}.bam files. I understand why this happens, and indeed snakemake even warns, e.g.,
Warning: the following output files of rule chromosome were not present when the DAG was created:
{'sample.REF_3.bam'}
Touching output file sample.REF_3.bam.
that is, these files were not in the DAG to begin with, and snakemake touches them with empty versions, erasing what was put there. I guess I am surprised by this behavior, and wonder if there is a good reason for it. Note that this behavior is not limited to snakemake's touch(): should I replace touch('sample.REF_{chromosome}.bam') with simply 'sample.REF_{chromosome}.bam' and then have a shell: 'touch {output}', I get the same result. Now, of course, I have found a perfectly acceptable workaround:
chromosomes = '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam', chromosome=chromosomes)

rule chromosome:
    output: 'sample.REF_{chromosome}.bam'
    input: 'split_dir'
    shell: 'mv {input}/{output} {output}'

rule split_bam:
    output:
        temp(directory('split_dir'))
    input: 'sample.bam'
    run:
        print('splitting bam..')
        shell('mkdir {output}')
        chromosome = 1
        for line in open(input[0]):
            outfile = '{}/sample.REF_{}.bam'.format(output[0], chromosome)
            print(line, end='', file=open(outfile, 'w'))
            chromosome += 1
but I am surprised I have to go through these gymnastics for a seemingly simple task. For this reason, I wonder if there is a better design, or if I am not asking the right question. Any advice/ideas are most welcome.
I think your example is a bit contrived for a couple of reasons. The rule split_bam already produces the final output sample.REF_{chromosome}.bam. Also, the rule master uses the chromosomes taken from the variable chromosomes whereas the rule split_bam iterates through the bam file to get the chromosomes.
My impression is that what you want could be something like:
chromosomes = '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam', chromosome=chromosomes)

rule split_bam:
    input:
        'sample.bam'
    output:
        expand('sample.split.{chromosome}.bam', chromosome=chromosomes)
    run:
        print('splitting bam..')
        for chromosome in chromosomes:
            outfile = 'sample.split.{}.bam'.format(chromosome)
            print(chromosome, end='', file=open(outfile, 'w'))

rule chromosome:
    input:
        'sample.split.{chromosome}.bam'
    output:
        touch('sample.REF_{chromosome}.bam')
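The split itself is ordinary Python. Under the toy assumption that the i-th line of sample.bam belongs to chromosome i, it can be sketched and checked outside Snakemake (split_bam here is a standalone helper, not the rule):

```python
import os
import tempfile

# Toy splitter matching the example: write the i-th line of the "bam"
# file to sample.split.<i>.bam (1-based), one file per chromosome.
def split_bam(path, outdir):
    written = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            outfile = os.path.join(outdir, 'sample.split.{}.bam'.format(i))
            with open(outfile, 'w') as out:
                out.write(line)
            written.append(outfile)
    return written

d = tempfile.mkdtemp()
src = os.path.join(d, 'sample.bam')
with open(src, 'w') as f:
    f.write('chromosome 1\nchromosome 2\nchromosome 3\n')
files = split_bam(src, d)
# files -> three per-chromosome output paths
```

Because the answer's split_bam rule declares every per-chromosome file in its output, all files are in the DAG from the start and nothing gets touched to emptiness afterwards.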

Parallel execution of a snakemake rule with same input and a range of values for a single parameter

I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all parameters and fails.
My original bash script runs a program multiple times for a range of values of the parameter K.
for num in {1..3}
do
    structure.py -K $num --input=fileprefix --output=fileprefix
done
There are multiple input files that start with fileprefix. And there are two main outputs per run, e.g. for K=1 they are fileprefix.1.meanP, fileprefix.1.meanQ. My config and snakemake files are as follows.
Config:
cat config.yaml
infile: fileprefix
K:
  - 1
  - 2
  - 3
Snakemake:
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample=config['infile'],
               K=config['K'],
               ext=['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        """
        structure.py -K {params.K} \
            --input=output/{params.prefix} \
            --output=output/{params.prefix}
        """
This was executed with snakemake --cores 3. The problem persists when I only use one thread.
I expected the outputs described above for each value of K, but the run fails with this error:
RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail; structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
File "Snakefile", line 84, in __rule_Structure
File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
When I set K to a single value such as K = ['1'], everything works. So the problem seems to be that {params.K} is being expanded to all values of K when the shell command is executed. I started teaching myself snakemake today, and it works really well, but I'm hitting a brick wall with this.
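The traceback shows exactly this: a list-valued params entry is substituted into the shell template joined by spaces, which mirrors the observed `-K 1 2 3`. A plain-Python emulation of the two behaviors (using str.format rather than Snakemake's own formatter):

```python
# When params.K holds the whole config list, substituting it into the
# shell template yields all values at once, as in the traceback.
K_values = [1, 2, 3]
cmd = "structure.py -K {K} --input=output/fileprefix".format(
    K=" ".join(str(k) for k in K_values))
# cmd -> 'structure.py -K 1 2 3 --input=output/fileprefix'

# With a per-job wildcard value instead, each command gets a single K:
cmds = ["structure.py -K {K} --input=output/fileprefix".format(K=k)
        for k in K_values]
# cmds[0] -> 'structure.py -K 1 --input=output/fileprefix'
```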
You need to retrieve the argument for -K from the wildcards, not from the config file. The config file is a plain Python dictionary and will simply return your whole list of possible values.
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample=config['infile'],
               K=config['K'],
               ext=['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile']
    threads: 3
    shell:
        "structure.py -K {wildcards.K} "
        "--input=output/{params.prefix} "
        "--output=output/{params.prefix}"
Note that there are more things to improve here. For example, the rule structure does not define any input file, although it uses one.
There is now also an option for parameter space exploration:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration