Conditional execution of multiplexed analysis with snakemake - snakemake

I'm having some trouble with Snakemake, and so far I haven't found relevant information in the documentation (or anywhere else).
I have a big file with different samples (multiplexed analyses), and I would like to stop the execution of the pipeline for some samples depending on the results produced by earlier rules.
I've already tried changing this value outside of a rule definition (using a checkpoint or a def), making the input of the following rules conditional, and treating the wildcards as a simple list from which one item can be deleted.
Below is an example of what I want to do (the if condition is only indicative here):
# Import the config file(s)
configfile: "../PATH/configfile.yaml"

# Wildcards
sample = config["SAMPLE"]
lauch = config["LAUCH"]

# Rules
rule all:
    input:
        expand("PATH_TO_OUTPUT/{lauch}.{sample}.output", lauch=lauch, sample=sample)

rule one:
    input:
        "PATH_TO_INPUT/{lauch}.{sample}.input"
    output:
        temp("PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp")
    shell:
        """
        somescript.sh {input} {output}
        """

rule two:
    input:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp"
    output:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output"
    shell:
        """
        somecheckpoint.sh {input} # Print a message and write in the log file for now
        if [ file_dont_pass_checkpoint ]; then
            # Delete the sample corresponding to the wildcard {sample}
            # to continue the analysis only with samples that passed the validation
        fi
        somescript2.sh {input} {output}
        """
If someone has an idea I'm interested.
Thank you in advance for your answers.

I think this is an interesting situation if I understand it correctly. If a sample passes some checks, then keep analysing it. Otherwise, stop early.
At the end of the pipeline, every sample must have a PATH_TO_OUTPUT/{lauch}.{sample}.output, since this is what the rule all asks for regardless of the check results.
You could have the rule(s) performing the checks write a file containing a flag that indicates whether the checks passed for that sample (say, PASS or FAIL). Then, according to that flag, the rule(s) doing the analysis either run the full analysis (if PASS) or write an empty file (or whatever) if the flag is FAIL. Here's the gist:
rule all:
    input:
        expand('{sample}.output', sample= samples),

rule checker:
    input:
        '{sample}.input',
    output:
        '{sample}.check',
    shell:
        r"""
        if [ some_check_is_ok ]
        then
            echo "PASS" > {output}
        else
            echo "FAIL" > {output}
        fi
        """

rule do_analysis:
    input:
        chk= '{sample}.check',
        smp= '{sample}.input',
    output:
        '{sample}.output',
    shell:
        r"""
        if grep -q "PASS" {input.chk}
        then
            do_long_analysis.sh {input.smp} > {output}
        else
            > {output} # Do nothing: empty file
        fi
        """
If you don't want to see the failed, empty output files at all, you could use the onsuccess directive to get rid of them at the end of the pipeline:
import os

onsuccess:
    for x in expand('{sample}.output', sample= samples):
        if os.path.getsize(x) == 0:
            print('Removing failed sample %s' % x)
            os.remove(x)
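
Outside of Snakemake, the flag-then-cleanup logic can be tried in plain Python. This is only a minimal sketch: the helper names run_if_passed and remove_empty, the sample names s1 and s2, and the upper-casing "analysis" are assumptions made up for illustration, not part of the pipeline above.

```python
import os
import tempfile


def run_if_passed(check_file, sample_file, output_file):
    """Write a real result if the flag file says PASS, else an empty file."""
    with open(check_file) as fh:
        flag = fh.read().strip()
    with open(output_file, "w") as out:
        if flag == "PASS":
            # Stand-in for the long analysis: copy the sample in upper case.
            with open(sample_file) as smp:
                out.write(smp.read().upper())
        # On FAIL nothing is written, so the output exists but is empty.


def remove_empty(paths):
    """Delete zero-byte files and return the paths that were removed."""
    removed = [p for p in paths if os.path.getsize(p) == 0]
    for p in removed:
        os.remove(p)
    return removed


workdir = tempfile.mkdtemp()
outputs = []
for sample, flag in [("s1", "PASS"), ("s2", "FAIL")]:
    chk = os.path.join(workdir, sample + ".check")
    smp = os.path.join(workdir, sample + ".input")
    out = os.path.join(workdir, sample + ".output")
    with open(chk, "w") as fh:
        fh.write(flag + "\n")
    with open(smp, "w") as fh:
        fh.write("acgt\n")
    run_if_passed(chk, smp, out)
    outputs.append(out)

# Mirrors the onsuccess step: drop the empty outputs of failed samples.
removed = remove_empty(outputs)
kept = [p for p in outputs if os.path.exists(p)]
```

Every sample still gets an output file while the loop runs, which is what keeps the rule all satisfied; the empty ones are only removed afterwards.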

The canonical solution to problems like this is to use checkpoints. Consider the following example:
import pandas as pd

def get_results(wildcards):
    # Reading the checkpoint output makes Snakemake wait for the qc checkpoint.
    qc = pd.read_csv(checkpoints.qc.get().output[0].open(), sep="\t")
    return expand(
        "results/processed/{sample}.txt",
        sample=qc[qc["some-qc-criterion"] > config["qc-threshold"]]["sample"]
    )

rule all:
    input:
        get_results

checkpoint qc:
    input:
        expand("results/preprocessed/{sample}.txt", sample=config["samples"])
    output:
        "results/qc.tsv"
    shell:
        "perform-qc {input} > {output}"

rule process:
    input:
        "results/preprocessed/{sample}.txt"
    output:
        "results/processed/{sample}.txt"
    shell:
        "process {input} > {output}"
The idea is the following: at some point in your pipeline, after some (let's say) preprocessing, you add a checkpoint rule, which aggregates over all samples and generates some kind of QC table. Then, downstream of that, there is a rule that aggregates over samples (e.g. the rule all, or some other aggregation inside of the workflow). Let's say in that aggregation you only want to consider samples that pass the QC. For that, you let the required files ("results/processed/{sample}.txt") be determined via an input function, which reads the QC table generated by the checkpoint rule. Snakemake's checkpoint mechanism ensures that this input function is evaluated after the checkpoint has been executed, so that you can actually read the table results and base your decision about the samples on the qc criteria contained in that table. Any intermediate rules (like here the process rule) will then be automatically applied by Snakemake when re-evaluating the DAG.
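
The filtering step that such an input function performs can be tried on its own. The following is a minimal sketch using only the standard library instead of pandas; the table contents, the 0.5 threshold, and the function name passing_targets are assumptions for illustration.

```python
import csv
import io

# Hypothetical QC table, shaped like what the checkpoint might write to results/qc.tsv.
QC_TSV = "sample\tsome-qc-criterion\nA\t0.9\nB\t0.2\nC\t0.7\n"


def passing_targets(qc_handle, threshold):
    """Return the processed-file paths for samples above the QC threshold."""
    reader = csv.DictReader(qc_handle, delimiter="\t")
    return [
        "results/processed/{}.txt".format(row["sample"])
        for row in reader
        if float(row["some-qc-criterion"]) > threshold
    ]


targets = passing_targets(io.StringIO(QC_TSV), threshold=0.5)
print(targets)
```

In the workflow, the handle would come from the checkpoint's output rather than from an in-memory string, and the returned list is what the aggregating rule requests.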

Related

Snakemake: a rule with batched inputs and corresponding outputs

I have the following basic structure of the workflow:
files are downloaded from a remote server,
converted locally and then
analyzed.
One of the analyses is time-consuming, but it scales well if run on multiple input files at a time. The output of this rule is independent of which files are analyzed together as a batch, as long as they all share the same set of settings. Upstream and downstream rules operate on individual files, so from the perspective of the workflow, this rule is an outlier. Which files are to be run together can be told in advance, although ideally, if some of the inputs failed to be produced along the way, the rule should be run on a reduced set of files.
The following example illustrates the problem:
samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]
groups = {
    'A': samples[0:3],
    'B': samples[3:6]
}

rule all:
    input:
        expand("done/{sample}.txt", sample = samples)

rule create:
    output:
        "created/{sample}.txt"
    shell:
        "echo {wildcards.sample} > {output}"

rule analyze:
    input:
        "created/{sample}.txt"
    output:
        "analyzed/{sample}.txt"
    params:
        outdir = "analyzed/"
    shell:
        """
        sleep 1 # or longer
        parallel md5sum {{}} \> {params.outdir}/{{/}} ::: {input}
        """

rule finalize:
    input:
        "analyzed/{sample}.txt"
    output:
        "done/{sample}.txt"
    shell:
        "touch {output}"
The rule analyze is the one to produce multiple output files from multiple inputs according to the assignment in groups. The rules create and finalize operate on individual files upstream and downstream, respectively.
Is there a way to implement such logic? I'd like to avoid splitting the workflow to accommodate this irregularity.
Note: this question is not related to the similar sounding question here.
If I understand correctly, rule analyze takes as input the files created/a.txt, created/b.txt, created/c.txt for group A and produces as output analyzed/a.txt, analyzed/b.txt, analyzed/c.txt. The same goes for group B, so rule analyze runs twice and everything else runs 6 times.
If so, I would make rule analyze output a dummy file signaling that the files in group A (or B, etc.) have been analyzed. Downstream rules will take this dummy file as input and will find the corresponding analyzed/{sample}.txt available.
Here's your example:
samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]
groups = {
    'A': samples[0:3],
    'B': samples[3:6]
}

# Map samples to groups by inverting dict groups
inv_groups = {}
for x in samples:
    for k in groups:
        if x in groups[k]:
            inv_groups[x] = k

rule all:
    input:
        expand("done/{sample}.txt", sample = samples)

rule create:
    output:
        "created/{sample}.txt"
    shell:
        "echo {wildcards.sample} > {output}"

rule analyze:
    input:
        # Collect input for this group (A, B, etc)
        grp= lambda wc: ["created/%s.txt" % x for x in groups[wc.group]]
    output:
        done = touch('created/{group}.done'),
    shell:
        """
        # Code that actually does the job...
        for x in {input.grp}
        do
            sn=`basename $x .txt`
            touch analyzed/$sn.txt
        done
        """

rule finalize:
    input:
        # Get dummy file for this {sample}.
        # If the dummy exists also the corresponding analyzed/{sample}.txt exists.
        done = lambda wc: 'created/%s.done' % inv_groups[wc.sample],
    output:
        fout= "done/{sample}.txt"
    params:
        fin= "analyzed/{sample}.txt",
    shell:
        "cp {params.fin} {output.fout}"

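The two lookups that the batched workflow relies on (group to input files, sample to dummy file) can be checked in plain Python. This is a small sketch under the same samples and groups as above; the helper names analyze_inputs and finalize_input are made up here and stand in for the two lambda input functions.

```python
samples = ["a", "b", "c", "d", "e", "f"]
groups = {"A": samples[0:3], "B": samples[3:6]}

# Invert the mapping with a dict comprehension: sample -> group.
inv_groups = {
    sample: group
    for group, members in groups.items()
    for sample in members
}


def analyze_inputs(group):
    """Files that the batched rule needs for one group."""
    return ["created/%s.txt" % s for s in groups[group]]


def finalize_input(sample):
    """Dummy file that a per-sample downstream rule waits for."""
    return "created/%s.done" % inv_groups[sample]


print(analyze_inputs("A"))
print(finalize_input("e"))
```

The comprehension assumes each sample belongs to exactly one group; if a sample appeared in two groups, the later group would silently win.
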
lambda function in snakemake output

I currently have a snakemake workflow that requires the use of lambda wildcards, set up as follows:
Snakefile:
configfile: "config.yaml"
workdir: config["work"]

rule all:
    input:
        expand("logs/bwa/{ref}.log", ref=config["refs"])

rule bwa_index:
    input:
        lambda wildcards: 'data/'+config["refs"][wildcards.ref]+".fna.gz"
    output:
        "logs/bwa/{ref}.log"
    log:
        "logs/bwa/{ref}.log"
    shell:
        "bwa index {input} > {log} 2>&1"
Config file:
work: /datasets/work/AF_CROWN_RUST_WORK/2020-02-28_GWAS
refs:
  12NC29: GCA_002873275.1_ASM287327v1_genomic
  12SD80: GCA_002873125.1_ASM287312v1_genomic
This works, but I've had to use a hack to get the output of bwa_index to play with the input of all. My hack is to generate a log file as part of bwa_index, set the log to the output of bwa_index, and then set the input of all to these log files. As I said, it works, but I don't like it.
The problem is that the true outputs of bwa_index are of the format, for example, GCA_002873275.1_ASM287327v1_genomic.fna.sa. So, to specify these output files, I would need to use a lambda function for the output, something like:
rule bwa_index:
    input:
        lambda wildcards: 'data/'+config["refs"][wildcards.ref]+".fna.gz"
    output:
        lambda wildcards: 'data/'+config["refs"][wildcards.ref]+".fna.sa"
    log:
        "logs/bwa/{ref}.log"
    shell:
        "bwa index {input} > {log} 2>&1"
and then use a lambda function with expand for the input of rule all. However, snakemake will not accept functions as output, so I'm at a complete loss how to do this (other than my hack). Does anyone have suggestions of a sensible solution? TIA!
You can use a simple python function in the inputs (as the lambda function) so I suggest you use it for the rule all.
configfile: "config.yaml"
workdir: config["work"]

def getTargetFiles():
    targets = list()
    for r in config["refs"]:
        targets.append("data/"+config["refs"][r]+".fna.sa")
    return targets

rule all:
    input:
        getTargetFiles()

rule bwa_index:
    input:
        "data/{ref}.fna.gz"
    output:
        "data/{ref}.fna.sa"
    log:
        "logs/bwa/{ref}.log"
    shell:
        "bwa index {input} > {log} 2>&1"
Be careful here: the wildcard {ref} is now the value and not the key of your dictionary, so your log files will end up being named "logs/bwa/GCA_002873275.1_ASM287327v1_genomic.log", etc.
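
To see which string ends up in {ref}, the target list and the resulting log names can be reproduced in plain Python. This is a rough sketch: the config dictionary is copied from the question, and the regular expression is only a simplified stand-in for Snakemake's wildcard matching, not its actual implementation.

```python
import re

# Same structure as the "refs" section of the config file in the question.
config = {
    "refs": {
        "12NC29": "GCA_002873275.1_ASM287327v1_genomic",
        "12SD80": "GCA_002873125.1_ASM287312v1_genomic",
    }
}

# Targets are built from the dictionary values (the accession strings).
targets = ["data/{}.fna.sa".format(acc) for acc in config["refs"].values()]

# Simplified stand-in for matching the output pattern "data/{ref}.fna.sa".
pattern = re.compile(r"data/(?P<ref>.+)\.fna\.sa$")
refs = [pattern.match(t).group("ref") for t in targets]

# The log pattern "logs/bwa/{ref}.log" is filled with the same wildcard value.
logs = ["logs/bwa/{}.log".format(r) for r in refs]
print(logs)
```

The short keys such as 12NC29 never appear in the file names, because they are never part of any requested target.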

snakemake batch creation of output

Generating output based on changed input files in Snakemake is easy:
rule all:
    input: [f'out_{i}.txt' for i in range(10)]

rule make_input:
    output: 'in_{i}.txt'
    shell: 'touch {output}'

rule make_output_parallel:
    input: 'in_{i}.txt'
    output: 'out_{i}.txt'
    shell: 'touch {output}'

In this case, make_output_parallel will only run for instances where in_{i}.txt has changed.
But suppose the 'out_{i}.txt' cannot be generated in parallel and I want to generate them in a single step, like,
rule make_output_one_step:
    input: [f'in_{i}.txt' for i in range(10)]
    output: [f'out_{i}.txt' for i in range(10)]
    shell: 'touch {output}'

If only one of the in_{i}.txt files has changed, I don't need to regenerate all 10 outputs.
How can I adjust make_output_one_step.output to generate only the needed files?
If you want some parts of the pipeline to not work in parallel for whatever reason (RAM, internet usage, IO, API limit, etc....) you can make use of resources.
rule all:
    input: [f'out_{i}.txt' for i in range(10)]

rule make_input:
    output: 'in_{i}.txt'
    shell: 'touch {output}'

rule make_output:
    input: 'in_{i}.txt'
    output: 'out_{i}.txt'
    resources: max_parallel=1
    shell: 'touch {output}'

And then you can call your pipeline like snakemake --resources max_parallel=1 --cores 10. In this case all the jobs of rule make_input will run in parallel, but only one instance of make_output will run at a time.

Snakemake read input from file

I am trying to use a file that will be written during the run as an input to another rule, but it always gives me the error FileNotFoundError: [Errno 2] No such file or directory:
Is there a way to fix it, or another implementation with the same logic?
def vc_list(wildcards):
    my_list = []
    with open(wildcards.mydir+"/file_B.txt", 'r') as data_in:
        for line in data_in:
            my_list.append(line.strip())
    return my_list

# rule A will process file_A.txt and give me file_B.txt
rule A:
    input: "{mydir}/file_A.txt"
    output: "{mydir}/file_B.txt"
    shell: "seq 1 5 > {output}" # assume that `seq 1 5` is the output from processing the file

rule B:
    input: "{value}"
    output: "{value}.vc"
    shell: "pythoncode.py {input} {output}"

# rule C will process file_B.txt to give me a list of values that will be used to expand the input, then will use rule B to produce it
rule C:
    input:
        processed_file = rules.A.output, #"{mydir}/file_B.txt",
        my_list = lambda wildcards: expand("{mydir}/{value}.vc", mydir=wildcards.mydir, value=vc_list(wildcards))
    output: "{mydir}/done.txt"
    shell: "touch {output}"

# I always have the error that "{mydir}/file_B.txt" does not exist
The error now:
test_loop.snakefile:
FileNotFoundError: [Errno 2] No such file or directory: 'read_file/file_B.txt'
Wildcards:
mydir=read_file
Thanks,
The answer to my question is to use checkpoint as dynamic will be deprecated.
Here is how the logic should be changed:
rule:
    input: 'done.txt'

checkpoint A:
    output: 'B.txt'
    shell: 'seq 1 2 > {output}'

rule N:
    input: "genome.fa"
    output: '{num}.bam'
    shell: "touch {output}"

rule B:
    input: '{num}.bam'
    output: '{num}.vc'
    shell: "touch {output}"

def aggregate_input(wildcards):
    with open(checkpoints.A.get(**wildcards).output[0], 'r') as f:
        return [num.rstrip() + '.vc' for num in f]

rule C:
    input: aggregate_input
    output: touch('done.txt')
Credit goes to Eric Lim
Your script fails even before the workflow starts, during the construction of the pipeline.
So, there is nothing surprising regarding the rules A and B: Snakemake reads their input and output sections and finds no problem with them. Then it reaches rule C, whose input section calls the vc_list() function, which in turn tries to read the file 'read_file/file_B.txt' before the workflow has started. Of course it doesn't find the file, and that produces the error.
As for what to do, you need to clarify the task first. Most probably you are trying to use dynamic information in the input of a rule. In this case you need to use dynamic files or checkpoints.
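
The ordering problem can be reproduced without Snakemake. The following is a toy model, not Snakemake's internals: the function names and the temporary directories are assumptions, and the two driver functions only imitate "resolve inputs before running anything" versus "run the producing step first, then resolve inputs", which is roughly what a checkpoint allows.

```python
import os
import tempfile


def make_producer(workdir):
    """Stand-in for rule A: writes file_B.txt when called."""
    def producer():
        with open(os.path.join(workdir, "file_B.txt"), "w") as fh:
            fh.write("1\n2\n3\n")
    return producer


def make_input_function(workdir):
    """Stand-in for vc_list(): reads file_B.txt and lists the .vc targets."""
    def input_function():
        with open(os.path.join(workdir, "file_B.txt")) as fh:
            return [line.strip() + ".vc" for line in fh]
    return input_function


def plan_then_run(producer, input_function):
    """Plain input function: inputs are resolved before any job has run."""
    inputs = input_function()
    producer()
    return inputs


def run_then_plan(producer, input_function):
    """Checkpoint-like order: the producer runs first, then inputs are resolved."""
    producer()
    return input_function()


dir_a = tempfile.mkdtemp()
try:
    plan_then_run(make_producer(dir_a), make_input_function(dir_a))
    naive_error = None
except FileNotFoundError as err:
    naive_error = err

dir_b = tempfile.mkdtemp()
values = run_then_plan(make_producer(dir_b), make_input_function(dir_b))
print(values)
```

The first order fails with the same FileNotFoundError as in the question, because the file is read before anything had a chance to create it.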

snakemake rules: Passing on variables outside of the file name

So far I have used snakemake to generate individual plots. This has worked great! Now, though, I want to create a rule that creates a combined plot across the topics, without explicitly putting the topic name in the plot file name. See the combined_plot rule below.
topics=["soccer", "football"]
params=[1, 2, 3, 4]

rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),

rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "plot.py --input={input} --output={output}"

rule combined_plot:
    input:
        # all data_p={param}_{topic}.csv files
    output:
        "combined_p={param}_plot.png"
    shell:
        "plot2.py " + # one "--input=" and one "--output" for each csv file
Is there a simple way to do this with snakemake?
If I understand correctly, the code below should be more straightforward as it replaces the lambda and the glob with the expand function. It will execute the two commands:
plot2.py --input=data_p=1_soccer.csv --input=data_p=1_football.csv --output combined_p=1_plot.png
plot2.py --input=data_p=2_soccer.csv --input=data_p=2_football.csv --output combined_p=2_plot.png
topics=["soccer", "football"]
params=[1, 2]

rule all:
    input:
        expand("combined_p={param}_plot.png", param=params),

rule combined_plot:
    input:
        csv= expand("data_p={{param}}_{topic}.csv", topic= topics)
    output:
        "combined_p={param}_plot.png",
    run:
        inputs= ['--input=' + x for x in input.csv]
        shell("plot2.py {inputs} --output {output}")
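
The doubled braces around param are what keep {param} as a wildcard while {topic} is filled in. This can be imitated with Python's str.format, used here as a stand-in for expand on the assumption that both treat doubled braces the same way in this case; the second format call plays the role of Snakemake substituting the wildcard for one job.

```python
topics = ["soccer", "football"]

# Step 1: fill in {topic} only; "{{param}}" collapses to a literal "{param}".
template = "data_p={{param}}_{topic}.csv"
patterns = [template.format(topic=t) for t in topics]

# Step 2: for one job, the wildcard param is substituted (here param=1).
inputs = [p.format(param=1) for p in patterns]
output = "combined_p={param}_plot.png".format(param=1)

# Step 3: build the command line as in the run block.
args = ["--input=" + x for x in inputs]
command = "plot2.py {} --output {}".format(" ".join(args), output)
print(command)
```

The printed command is the first of the two commands listed in the answer.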
I got a working version, by using a function called 'wcs' as input (see here) and I used run instead of shell. In the run section I could first define a variable before executing the result with shell(...).
Instead of referring to the files with glob I could also have directly used the topics in the lambda function.
If anyone with more experience sees this, please tell me if this is the "right" way to do it.
from glob import glob

topics=["soccer", "football"]
params=[1, 2]

rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),

rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "echo plot.py {input} {output}"

rule combined_plot:
    input:
        lambda wcs: glob("data_p={param}_*.csv".format(**wcs))
    output:
        "combined_p={param}_plot.png"
    run:
        inputs=" ".join(["--input " + inp for inp in input])
        shell("echo plot2.py {inputs}")