Snakemake: Data-dependent conditional execution of rules, IndexError - snakemake

When executing the Snakemake pipeline below, I get an error: IndexError: list index out of range. I think it's because fastqc_pretrim is being executed for all samples in SAMPLES. However, not all samples pass basecalling QC, so only some files will need to be processed here. I am trying to use checkpointing to handle this. Looking at the log, you can see it is trying to run fastqc_pretrim for sample "FAQ20773_pass_barcode01_68fda206_1". However, if you look above that line in the log, FAQ20773_fail_barcode03_68fda206_0 is actually the only sample that produced a .fastq.gz file. I'm not sure why the pipeline isn't running only that sample.
LOG:
snakemake --use-conda --jobs 1 -pr
['FAQ20773_fail_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_2', 'FAQ20773_fail_barcode03_68fda206_0', 'FAQ20773_fail_barcode02_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_1']
The flag 'directory' used in rule guppy_basecall_persample is only valid for outputs, not inputs.
Building DAG of jobs...
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode01_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_2
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode03_68fda206_0
['basecall/FAQ20773_fail_barcode03_68fda206_0/pass/fastq_runid_68fda20603fe08e9e2a4eef8718997203b603497_0_0.fastq.gz']
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode02_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_1
[]
Updating job all.
Using shell: /usr/bin/bash
[Thu Aug 26 13:13:51 2021]
rule fastqc_pretrim:
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log
jobid: 19
reason: Missing output files: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
wildcards: sample=FAQ20773_pass_barcode01_68fda206_1
resources: tmpdir=/tmp
/home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py
Activating conda environment: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4
Traceback (most recent call last):
File "/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py", line 41, in <module>
    shell(
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/shell.py", line 130, in __new__
    cmd = format(cmd, *args, stepout=2, **kwargs)
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/utils.py", line 427, in format
    return fmt.format(_pattern, *args, **variables)
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 161, in format
    return self.vformat(format_string, args, kwargs)
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 165, in vformat
    result, _ = self._vformat(format_string, args, kwargs, used_args, 2)
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 205, in _vformat
    obj, arg_used = self.get_field(field_name, args, kwargs)
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 278, in get_field
    obj = obj[i]
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/io.py", line 1536, in __getitem__
    return super().__getitem__(key)
IndexError: list index out of range
[Thu Aug 26 13:13:52 2021]
Error in rule fastqc_pretrim:
jobid: 19
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log (check log file(s) for error message)
conda-env: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4
RuleException:
CalledProcessError in line 60 of /mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile:
Command 'source /home/hvasquezgross/miniconda3/bin/activate '/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4'; /home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py' returned non-zero exit status 1.
File "/mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile", line 60, in __rule_fastqc_pretrim
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Snakefile:
import os
import glob

configfile: "config.yaml"

inputdirectory=config["directory"]
SAMPLES, = glob_wildcards(inputdirectory+"/{sample}.fast5", followlinks=True)
print(SAMPLES)

wildcard_constraints:
    sample=r"\w+\d+_\w+_\w+\d+_.+_\d"

##### target rules #####

rule all:
    input:
        expand('basecall/{sample}/sequencing_summary.txt', sample=SAMPLES),
        "qc/multiqc.html"

rule make_indvidual_samplefiles:
    input:
        inputdirectory+"/{sample}.fast5",
    output:
        "lists/{sample}.txt",
    shell:
        "basename {input} > {output}"

checkpoint guppy_basecall_persample:
    input:
        directory=directory(inputdirectory),
        samplelist="lists/{sample}.txt",
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    params:
        config["basealgo"]
    shell:
        "guppy_basecaller -i {input.directory} --input_file_list {input.samplelist} -s {output.directory} -c {params} --compress_fastq -x \"auto\" --gpu_runners_per_device 3 --num_callers 2 --chunks_per_runner 200"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.guppy_basecall_persample.get(**wildcards).output[1]
    print(checkpoint_output)
    exparr = expand("basecall/{sample}/pass/{runid}.fastq.gz", sample=wildcards.sample,
                    runid=glob_wildcards(os.path.join(checkpoint_output, "pass/", "{runid}.fastq.gz")).runid)
    print(exparr)
    return exparr

rule fastqc_pretrim:
    input:
        aggregate_input
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip"  # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: ""
    log:
        "logs/fastqc_pretrim/{sample}.log"
    threads: 1
    wrapper:
        "0.77.0/bio/fastqc"

rule multiqc:
    input:
        #expand("basecall/{sample}.fastq.gz", sample=SAMPLES)
        expand("qc/fastqc_pretrim/{sample}_fastqc.zip", sample=SAMPLES)
    output:
        "qc/multiqc.html"
    params:
        ""  # Optional: extra parameters for multiqc.
    log:
        "logs/multiqc.log"
    wrapper:
        "0.77.0/bio/multiqc"

I think you are making things more complicated than necessary by using a checkpoint and a wrapper. This is roughly what I would do:
rule guppy_basecall_persample:
    input:
        ...
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    shell:
        r"""
        guppy ...
        """

rule fastqc_pretrim:
    input:
        # plain path here: the directory() flag is only valid for outputs
        directory="basecall/{sample}/",
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip"
    shell:
        r"""
        fastqc {input.directory}/pass/*.fastq.gz
        """

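Alternatively, if you want to keep the checkpoint, you can push the data-dependent decision into the multiqc aggregation, so that fastqc_pretrim is only requested for samples whose basecall output actually contains passing reads. A minimal sketch (the passing_fastqc input function is illustrative and not part of the original post):

def passing_fastqc(wildcards):
    # Query the checkpoint for every sample and only request fastqc reports
    # for samples whose basecall directory holds at least one passing fastq.gz.
    wanted = []
    for sample in SAMPLES:
        ckpt_dir = checkpoints.guppy_basecall_persample.get(sample=sample).output.directory
        runids = glob_wildcards(os.path.join(ckpt_dir, "pass", "{runid}.fastq.gz")).runid
        if runids:
            wanted.append(f"qc/fastqc_pretrim/{sample}_fastqc.zip")
    return wanted

rule multiqc:
    input:
        passing_fastqc
    output:
        "qc/multiqc.html"
    log:
        "logs/multiqc.log"
    wrapper:
        "0.77.0/bio/multiqc"

rule all still requests every sequencing_summary.txt, so every basecall checkpoint runs, but fastqc_pretrim is never scheduled with an empty input list, which matches the IndexError in the log (an empty input being indexed while the wrapper's shell command is formatted).
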
Related

How to use Snakemake for memory management?

I tried to force Snakemake to run a rule (with many jobs) sequentially to avoid memory conflicts.
rule run_eval_all:
    input:
        expand(config["out_model"] + "{iLogit}.rds", iLogit=MODELS)

rule eval_model:
    input:
        script=config["src_est"] + "evals/script.R",
        model=config["out_model"] + "{iLogit}.rds",
    output:
        "out/{iLogit}.rds"
    threads: 5
    resources:
        mem_mb=100000
    shell:
        "{runR} {input.script} "
        "--out {output}"
I run the rule with snakemake --cores all --resources mem_mb=100000 run_eval_all, but I keep getting errors like:
x86_64-conda-linux-gnu % snakemake --resources mem_mb=100000 run_eval_all
Traceback (most recent call last):
File "/local/home/zhakaida/mambaforge/envs/r_snake/bin/snakemake", line 10, in <module>
sys.exit(main())
File "/local/home/zhakaida/mambaforge/envs/r_snake/lib/python3.9/site-packages/snakemake/__init__.py", line 2401, in main
resources = parse_resources(args.resources)
File "/local/home/zhakaida/mambaforge/envs/r_snake/lib/python3.9/site-packages/snakemake/resources.py", line 85, in parse_resources
for res, val in resources_args.items():
AttributeError: 'list' object has no attribute 'items'
If I run snakemake --cores all run_eval_all, it works, but the jobs run in parallel (as expected) and sometimes exhaust memory and crash. How do I properly declare memory requirements in Snakemake?
The error is due to a known issue with parsing the --resources argument in Snakemake 6.5.1, https://github.com/snakemake/snakemake/issues/1069.
Update to snakemake 6.5.3 or later and see if your problem still exists.
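Once on a fixed release, the setup you already have is the intended mechanism: Snakemake only starts jobs whose summed mem_mb fits into the global budget given on the command line. A minimal sketch of the idea (values are illustrative):

rule eval_model:
    input:
        script=config["src_est"] + "evals/script.R",
        model=config["out_model"] + "{iLogit}.rds",
    output:
        "out/{iLogit}.rds"
    threads: 5
    resources:
        mem_mb=100000  # each eval_model job claims 100 GB
    shell:
        "{runR} {input.script} --out {output}"

# The global budget equals a single job's claim, so eval_model jobs run one at
# a time; a budget of 200000 would allow two in parallel:
#   snakemake --cores all --resources mem_mb=100000 run_eval_all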

Running multiple snakemake rules

I would like to run multiple rules one after another using Snakemake. However, when I run this script, the bam_list rule runs before the samtools_markdup rule and gives me an error that it cannot find its input files, which obviously have not been generated yet.
How can I solve this problem?
rule all:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES),
        "dup/bam_list"

rule samtools_markdup:
    input:
        sortbam="rg/{sample}.rg.bam"
    output:
        dupbam="dup/{sample}.dup.bam"
    threads: 5
    shell:
        """
        samtools markdup -@ {threads} {input.sortbam} {output.dupbam}
        """

rule bam_list:
    output:
        outlist="dup/bam_list"
    shell:
        """
        ls dup/*.bam > {output.outlist}
        """
Snakemake is following directions: you ask for dup/bam_list, and it can be produced without any inputs. I think what you mean to have is:
rule all:
    input:
        "dup/bam_list"

rule samtools_markdup:
    input:
        sortbam="rg/{sample}.rg.bam"
    output:
        dupbam="dup/{sample}.dup.bam"
    threads: 5
    shell:
        """
        samtools markdup -@ {threads} {input.sortbam} {output.dupbam}
        """

rule bam_list:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES)
    output:
        outlist="dup/bam_list"
    shell:
        """
        ls dup/*.bam > {output.outlist}
        """
Now bam_list will wait until all the samtools_markdup jobs are completed. As an aside, I expect the contents of dup/bam_list to be identical to expand("dup/{sample}.dup.bam", sample=SAMPLES), so if you use the file later in the workflow you can probably just use the expand output directly.
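Relatedly, if you want dup/bam_list to list exactly those files, a small variant (just a sketch) writes the rule's own input instead of globbing the directory:

rule bam_list:
    input:
        bams=expand("dup/{sample}.dup.bam", sample=SAMPLES)
    output:
        outlist="dup/bam_list"
    shell:
        r"""
        printf '%s\n' {input.bams} > {output.outlist}
        """

printf writes one path per line, so the file lists only the markdup outputs even if dup/ happens to contain other BAM files.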

How are nested checkpoints resolved in snakemake?

I need to use nested checkpoints in Snakemake, since for every dynamically created file I have to create further dynamic files. So far, I am unable to resolve the two checkpoints properly. Below you find a minimal toy example.
It seems that until the first checkpoint is properly resolved, the second checkpoint is not even executed, so a single aggregate rule won't work.
I don't know how to invoke the two checkpoints and resolve the wildcards.
import os.path
import glob

rule all:
    input:
        'collect/all_done.txt'

# generate a number of files
checkpoint create_files:
    output:
        directory('files')
    run:
        import random
        import os
        r = random.randint(1, 10)
        for x in range(r):
            output_dir = output[0] + '/' + str(x+1)
            if not os.path.isdir(output_dir):
                os.makedirs(output_dir, exist_ok=True)
            output_file = output_dir + '/test.txt'
            print(output_file)
            with open(output_file, 'w') as f:
                f.write(str(x+1))

checkpoint create_other_files:
    input: 'files/{i}/test.txt'
    output: directory('other_files/{i}/')
    shell:
        '''
        L=$(( $RANDOM % 10))
        for j in $(seq 1 $L);
        do
            mkdir -p {output}/$j
            cp -f {input} {output}/$j/test2.txt
        done
        '''

def aggregate(wildcards):
    i_wildcard = checkpoints.create_files.get(**wildcards).output[0]
    print('in_def_aggregate')
    print(i_wildcard)
    j_wildcard = checkpoints.create_other_files.get(**wildcards).output[0]
    print(j_wildcard)
    split_files = expand('other_files/{i}/{j}/test2.txt',
                         i=glob_wildcards(os.path.join(i_wildcard, '{i}/test.txt')).i,
                         j=glob_wildcards(os.path.join(j_wildcard, '{j}/test2.txt')).j
                         )
    return split_files

# nonsense collect function
rule collect:
    input: aggregate
    output: touch('collect/all_done.txt')
Currently, I get the following error from snakemake:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 collect
1 create_files
3
[Thu Nov 14 14:45:01 2019]
checkpoint create_files:
output: files
jobid: 2
Downstream jobs will be updated after completion.
Job counts:
count jobs
1 create_files
1
files/1/test.txt
files/2/test.txt
files/3/test.txt
files/4/test.txt
files/5/test.txt
files/6/test.txt
files/7/test.txt
files/8/test.txt
files/9/test.txt
files/10/test.txt
Updating job 1.
in_def_aggregate
files
[Thu Nov 14 14:45:02 2019]
Error in rule create_files:
jobid: 2
output: files
InputFunctionException in line 53 of /TL/stat_learn/work/feldmann/Phd/Projects/HIVImmunoAdapt/HIVIA/playground/Snakefile2:
WorkflowError: Missing wildcard values for i
Wildcards:
Removing output files of failed job create_files since they might be corrupted:
files
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
I am interested in having the files /other_files/{checkpoint_1_wildcard}/{checkpoint_2_wildcard}/test2.txt
I am not entirely sure what you were trying to do, so I rewrote it quite a bit. Does this clarify the problem?
import glob
import random
from pathlib import Path

rule all:
    input:
        'collect/all_done.txt'

checkpoint first:
    output:
        directory('first')
    run:
        for i in range(random.randint(1, 10)):
            Path(f"{output[0]}/{i}").mkdir(parents=True, exist_ok=True)
            Path(f"{output[0]}/{i}/test.txt").touch()

checkpoint second:
    input:
        'first/{i}/test.txt'
    output:
        directory('second/{i}')
    run:
        for j in range(random.randint(1, 10)):
            Path(f"{output[0]}/{j}").mkdir(parents=True, exist_ok=True)
            Path(f"{output[0]}/{j}/test2.txt").touch()

rule copy:
    input:
        'second/{i}/{j}/test2.txt'
    output:
        'copy/{i}/{j}/test2.txt'
    shell:
        """
        cp -f {input} {output}
        """

def aggregate(wildcards):
    outputs_i = glob.glob(f"{checkpoints.first.get().output}/*/")
    outputs_i = [output.split('/')[-2] for output in outputs_i]
    split_files = []
    for i in outputs_i:
        outputs_j = glob.glob(f"{checkpoints.second.get(i=i).output}/*/")
        outputs_j = [output.split('/')[-2] for output in outputs_j]
        for j in outputs_j:
            split_files.append(f"copy/{i}/{j}/test2.txt")
    return split_files

rule collect:
    input:
        aggregate
    output:
        touch('collect/all_done.txt')
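The same aggregation can also be written with Snakemake's glob_wildcards instead of glob.glob plus string splitting; a sketch, assuming the same directory layout as above:

def aggregate(wildcards):
    # Resolve the first checkpoint, then the second checkpoint per value of i,
    # and collect the corresponding copy/ targets.
    first_dir = checkpoints.first.get().output[0]
    split_files = []
    for i in glob_wildcards(f"{first_dir}/{{i}}/test.txt").i:
        second_dir = checkpoints.second.get(i=i).output[0]
        for j in glob_wildcards(f"{second_dir}/{{j}}/test2.txt").j:
            split_files.append(f"copy/{i}/{j}/test2.txt")
    return split_files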

Snakemake read input from file

I am trying to use a file that will be written during the run as an input to another rule, but it always gives me the error FileNotFoundError: [Errno 2] No such file or directory.
Is there a way to fix this, or another implementation with the same logic?
def vc_list(wildcards):
    my_list = []
    with open(wildcards.mydir + "/file_B.txt", 'r') as data_in:
        for line in data_in:
            my_list.append(line.strip())
    return my_list

# rule A will process file_A.txt and give me file_B.txt
rule A:
    input: "{mydir}/file_A.txt"
    output: "{mydir}/file_B.txt"
    shell: "seq 1 5 > {output}"  # assume that `seq 1 5` is the output from processing the file

rule B:
    input: "{value}"
    output: "{value}.vc"
    shell: "pythoncode.py {input} {output}"

# rule C will process file_B.txt to give me a list of values that will be used to expand the input,
# which will then be produced by rule B
rule C:
    input:
        processed_file = rules.A.output,  # "{mydir}/file_B.txt",
        my_list = lambda wildcards: expand("{mydir}/{value}.vc", mydir=wildcards.mydir, value=vc_list(wildcards))
    output: "{mydir}/done.txt"
    shell: "touch {output}"

# I always get the error that "{mydir}/file_B.txt" does not exist
The error now:
test_loop.snakefile:
FileNotFoundError: [Errno 2] No such file or directory: 'read_file/file_B.txt'
Wildcards:
mydir=read_file
Thanks,
The answer to my question is to use a checkpoint, since dynamic() will be deprecated.
Here is how the logic should be changed:
rule:
    input: 'done.txt'

checkpoint A:
    output: 'B.txt'
    shell: 'seq 1 2 > {output}'

rule N:
    input: "genome.fa"
    output: '{num}.bam'
    shell: "touch {output}"

rule B:
    input: '{num}.bam'
    output: '{num}.vc'
    shell: "touch {output}"

def aggregate_input(wildcards):
    with open(checkpoints.A.get(**wildcards).output[0], 'r') as f:
        return [num.rstrip() + '.vc' for num in f]

rule C:
    input: aggregate_input
    output: touch('done.txt')
Credit goes to Eric Lim
Your script fails even before the workflow starts, during the pipeline-construction phase.
So there is nothing surprising regarding rules A and B: Snakemake reads their input and output sections and finds no problem with them. Then it starts reading rule C, whose input section calls the vc_list() function, which in turn tries to read the file 'read_file/file_B.txt' even before the workflow has started! Of course it doesn't find the file and produces the error.
As for what to do, you need to clarify the task first. Most probably you are trying to use dynamic information in the input of a rule. In that case you need to use dynamic files or checkpoints.

Conditional execution of multiplexed analysis with snakemake

I'm having some trouble with Snakemake; so far I haven't found pertinent information in the documentation (or anywhere else).
I have a big file with different samples (multiplexed analyses), and I would like to stop the execution of the pipeline for some samples according to results found after certain rules.
I've already tried changing this value outside of a rule definition (using a checkpoint or a def), making conditional inputs for the following rules, and treating the wildcards as a simple list from which to delete one item.
Below is an example of what I want to do (the conditional if is only indicative here):
# Import the config file(s)
configfile: "../PATH/configfile.yaml"

# Wildcards
sample = config["SAMPLE"]
lauch = config["LAUCH"]

# Rules
rule all:
    input:
        expand("PATH_TO_OUTPUT/{lauch}.{sample}.output", lauch=lauch, sample=sample)

rule one:
    input:
        "PATH_TO_INPUT/{lauch}.{sample}.input"
    output:
        temp("PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp")
    shell:
        """
        somescript.sh {input} {output}
        """

rule two:
    input:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp"
    output:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output"
    shell:
        """
        somecheckpoint.sh {input}  # Print a message and write to the log file for now
        if [ file_dont_pass_checkpoint ]; then
            # Delete the sample corresponding to the wildcard {sample}
            # to continue the analysis only with samples that pass the validation
        fi
        somescript2.sh {input} {output}
        """
If someone has an idea, I'm interested.
Thank you in advance for your answers.
I think this is an interesting situation, if I understand it correctly. If a sample passes some checks, keep analysing it; otherwise, stop early.
At the end of the pipeline, every sample must have a PATH_TO_OUTPUT/{lauch}.{sample}.output, since this is what the rule all asks for, regardless of the check results.
You could have the rule(s) performing the checks write a file containing a flag indicating whether the checks passed or not for that sample (say PASS or FAIL). Then, according to that flag, the rule(s) doing the analysis either go for the full analysis (if PASS) or write an empty file (or whatever) if the flag is FAIL. Here's the gist:
rule all:
    input:
        expand('{sample}.output', sample=samples),

rule checker:
    input:
        '{sample}.input',
    output:
        '{sample}.check',
    shell:
        r"""
        if [ some_check_is_ok ]
        then
            echo "PASS" > {output}
        else
            echo "FAIL" > {output}
        fi
        """

rule do_analysis:
    input:
        chk='{sample}.check',
        smp='{sample}.input',
    output:
        '{sample}.output',
    shell:
        r"""
        if grep -q "PASS" {input.chk}
        then
            do_long_analysis.sh {input.smp} > {output}
        else
            > {output}  # Do nothing: empty file
        fi
        """
If you don't want to see the failed, empty output files at all, you could use the onsuccess directive to get rid of them at the end of the pipeline:
onsuccess:
    for x in expand('{sample}.output', sample=samples):
        if os.path.getsize(x) == 0:
            print('Removing failed sample %s' % x)
            os.remove(x)
The canonical solution to problems like this is to use checkpoints. Consider the following example:
import pandas as pd

def get_results(wildcards):
    qc = pd.read_csv(checkpoints.qc.get().output[0].open(), sep="\t")
    return expand(
        "results/processed/{sample}.txt",
        sample=qc[qc["some-qc-criterion"] > config["qc-threshold"]]["sample"]
    )

rule all:
    input:
        get_results

checkpoint qc:
    input:
        expand("results/preprocessed/{sample}.txt", sample=config["samples"])
    output:
        "results/qc.tsv"
    shell:
        "perform-qc {input} > {output}"

rule process:
    input:
        "results/preprocessed/{sample}.txt"
    output:
        "results/processed/{sample}.txt"
    shell:
        "process {input} > {output}"
"process {input} > {output}"
The idea is the following: at some point in your pipeline, after some (let's say) preprocessing, you add a checkpoint rule, which aggregates over all samples and generates some kind of QC table. Then, downstream of that, there is a rule that aggregates over samples (e.g. the rule all, or some other aggregation inside of the workflow). Let's say in that aggregation you only want to consider samples that pass the QC. For that, you let the required files ("results/processed/{sample}.txt") be determined via an input function, which reads the QC table generated by the checkpoint rule. Snakemake's checkpoint mechanism ensures that this input function is evaluated after the checkpoint has been executed, so that you can actually read the table results and base your decision about the samples on the qc criteria contained in that table. Any intermediate rules (like here the process rule) will then be automatically applied by Snakemake when re-evaluating the DAG.