`errorStrategy` setting to stop current process but continue pipeline - nextflow

I have a lot of samples that go through a process which sometimes fail (deterministically). In such a case, I would want the failing process to stop, but all other samples to still get submitted and processed independently.
If I understand correctly, setting errorStrategy 'ignore' will continue the script within the failing process, which is not what I want. And errorStrategy 'finish' would stop submitting new samples, even though there is no reason for the other samples to fail too. And while errorStrategy 'retry' could technically work (by repeating the failing processes while the good ones get through), that doesn't seem like a good solution.
Am I missing something?

If a process can fail deterministically, it might be better to handle this situation explicitly. Setting the errorStrategy directive to 'ignore' means any process execution errors are ignored, allowing your workflow to continue. For example, you might get a process execution error if a process exits with a non-zero exit status or if one or more expected output files are missing. The pipeline will continue, but downstream processes will not be attempted for the failed tasks.
Contents of test.nf:
nextflow.enable.dsl=2

process foo {

    tag { sample }

    input:
    val sample

    output:
    path "${sample}.txt"

    """
    if [ "${sample}" == "s1" ] ; then
        (exit 1)
    fi
    if [ "${sample}" == "s2" ] ; then
        echo "Hello" > "${sample}.txt"
    fi
    """
}

process bar {

    tag { txt }

    input:
    path txt

    output:
    path "${txt}.gz"

    """
    gzip -c "${txt}" > "${txt}.gz"
    """
}

workflow {
    Channel.of('s1', 's2', 's3') | foo | bar
}
Contents of nextflow.config:
process {
    // this is the default task.shell:
    shell = [ '/bin/bash', '-ue' ]

    errorStrategy = 'ignore'
}
Run with:
nextflow run -ansi-log false test.nf
Results:
N E X T F L O W ~ version 20.10.0
Launching `test.nf` [drunk_bartik] - revision: e2103ea23b
[9b/56ce2d] Submitted process > foo (s2)
[43/0d5c9d] Submitted process > foo (s1)
[51/7b6752] Submitted process > foo (s3)
[43/0d5c9d] NOTE: Process `foo (s1)` terminated with an error exit status (1) -- Error is ignored
[51/7b6752] NOTE: Missing output file(s) `s3.txt` expected by process `foo (s3)` -- Error is ignored
[51/267685] Submitted process > bar (s2.txt)
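Note that the config above applies errorStrategy = 'ignore' to every process. If you only want the known-flaky process to be skipped on failure, while any other failure still stops the run as usual, the directive can be scoped with a process selector. A minimal sketch, assuming the failing process is the foo process above:

process {
    withName: foo {
        errorStrategy = 'ignore'
    }
}

Equivalently, errorStrategy 'ignore' could be declared directly inside the foo process definition.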


Passing list of filenames to nextflow process

I am a newcomer to Nextflow and I am trying to process multiple files in a workflow. There are more than 300 of these files, so I would like to avoid pasting them on the command line as an option. So what I have done is create a file listing the name of every file I need to process, but I am not sure how to pass it into the process. This is what I've tried:
params.SRRs = "srr_ids.txt"

process tmp {

    input:
    file ids

    output:
    path "*.txt"

    script:
    '''
    while read id; do
        touch ${id}.txt;
        echo ${id} > ${id}.txt;
    done < $ids
    '''
}

workflow {
    tmp(params.SRRs)
}
The script is supposed to read in the file srr_ids.txt, and create files that have their ids in it (just testing on a smaller task). The error log says that the id variable is unbound, but I don't understand why. What is the conventional way of passing lots of filenames to a pipeline? Should I write some other process that parses the list?
Maybe there's a typo in your question, but the error is actually that the ids variable is unbound:
Command error:
.command.sh: line 5: ids: unbound variable
The problem is that when you use a single-quoted script string, you will not be able to access Nextflow variables in your script block. You can either define your script using a double-quoted string and escape your shell variables:
params.SRRs = "srr_ids.txt"

process tmp {

    input:
    path ids

    output:
    path "*.txt"

    script:
    """
    while read id; do
        touch "\${id}.txt"
        echo "\${id}" > "\${id}.txt"
    done < "${ids}"
    """
}

workflow {
    SRRs = file(params.SRRs)
    tmp(SRRs)
}
Or, use a shell block which uses the exclamation mark ! character as the variable placeholder for Nextflow variables. This makes it possible to use both Nextflow and shell variables in the same piece of code without having to escape each of the shell variables:
params.SRRs = "srr_ids.txt"

process tmp {

    input:
    path ids

    output:
    path "*.txt"

    shell:
    '''
    while read id; do
        touch "${id}.txt"
        echo "${id}" > "${id}.txt"
    done < "!{ids}"
    '''
}

workflow {
    SRRs = file(params.SRRs)
    tmp(SRRs)
}
What is the conventional way of passing lots of filenames to a pipeline?
The conventional way, I think, is to actually supply one (or more) glob patterns to the fromPath channel factory method. For example:
params.SRRs = "./path/to/files/SRR*.fastq.gz"
workflow {
    Channel
        .fromPath( params.SRRs )
        .view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [sleepy_bernard] DSL2 - revision: 30020008a7
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910483.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910482.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448795.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448793.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448794.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448792.fastq.gz
If instead you would prefer to pass in a list of filenames, like in your example, use either the splitCsv or the splitText operator to get what you want. For example:
params.SRRs = "srr_ids.txt"
workflow {
    Channel
        .fromPath( params.SRRs )
        .splitText() { it.strip() }
        .view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [fervent_ramanujan] DSL2 - revision: 89a1771d50
SRR1448794
SRR1448795
SRR1448792
SRR1448793
SRR1910483
SRR1910482
Should I write some other process that parses the list?
You may not need to. My feeling is that your code might benefit from using the fromSRA factory method, but we don't really have enough details to say one way or the other. If you need to, you could just write a function that returns a channel.
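For what it's worth, a minimal sketch of the fromSRA factory method (this assumes an NCBI API key has been configured; the project accession below is just an example):

workflow {
    Channel
        .fromSRA( 'SRP043510' )
        .view()
}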

Snakemake explicit handling for Out Of Memory (OOM) failures

A Snakemake workflow can re-attempt a job on each restart after any type of failure, including when the error is an Out Of Memory (OOM) error, by doing e.g.:
def get_mem_mb(wildcards, attempt):
    return attempt * 100

rule:
    input: ...
    output: ...
    resources:
        mem_mb=get_mem_mb
    shell:
        "..."
Is there any way in Snakemake to deal explicitly with a memory-related error, as Nextflow does, e.g. when the exit status is memory related (137 on an LSF system)?
process foo {

    memory { 2.GB * task.attempt }
    time { 1.hour * task.attempt }

    errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
    maxRetries 3

    script:
    <your job here>
}
I could not find this information anywhere,
thanks
I am not sure if there is an explicit way for Snakemake to handle out-of-memory errors. However, the memory function you have in your code example is what I've done to handle memory issues with Snakemake.
To make use of the function, you need to let Snakemake retry failed jobs by setting --restart-times to a value greater than zero when executing Snakemake; the attempt value passed to the function increases with each retry. Adding --rerun-incomplete is also useful, so that jobs left incomplete by a previous crash are re-run.
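For example, a minimal sketch (the rule name, filenames and sort command are just placeholders) where the memory request grows with each attempt and failed jobs are retried up to three times:

def get_mem_mb(wildcards, attempt):
    # 4 GB on the first attempt, 8 GB on the second, and so on
    return attempt * 4000

rule sort_bam:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    resources:
        mem_mb=get_mem_mb
    shell:
        "samtools sort -o {output} {input}"

Run with something like:

snakemake --jobs 10 --restart-times 3 --cluster "bsub -M {resources.mem_mb}"

(The exact bsub memory flag and units depend on how your LSF cluster is configured.)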

How to call a process in workflow.onError

I have this small pipeline:
process test {
    """
    echo 'hello'
    exit 1
    """
}

workflow.onError {
    process finish_error {
        script:
        """
        echo 'blablabla'
        """
    }
}
I want to trigger a python script when the pipeline has an error, using the finish_error process, but this process does not seem to be triggered at all, even with the simple echo 'blablabla' example above.
nextflow run test.nf
N E X T F L O W ~ version 20.10.0
Launching `test.nf` [cheesy_banach] - revision: 9020d641ca
executor > local (1)
[56/994298] process > test [100%] 1 of 1, failed: 1 ✘
[- ] process > finish_error [ 0%] 0 of 1
Error executing process > 'test'
Caused by:
Process `test` terminated with an error exit status (1)
Command executed:
echo 'hello'
exit 1
Command exit status:
1
Command output:
hello
Command wrapper:
hello
Work dir:
/home/joost/nextflow/work/56/9942985fc9948fd9bf7797d39c1785
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
How can I trigger this finish_error process, and how can I view its output?
The onError handler is invoked when a process causes pipeline execution to terminate prematurely. Since a Nextflow pipeline is really just a series of processes joined together, launching another pipeline process from within an event handler doesn't make much sense to me. If your python script should be run using the local executor, you can just execute it in the usual way. This example assumes your script is executable and has an appropriate shebang:
process test {
    """
    echo 'hello'
    exit 1
    """
}

workflow.onError {
    def proc = "${baseDir}/test.py".execute()
    proc.waitFor()
    println proc.text
}
Run using:
nextflow run -ansi-log false test.nf
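If the script needs to know what went wrong, the workflow introspection properties available inside the handler (such as workflow.errorMessage and workflow.errorReport) could be passed along. A sketch, again assuming an executable test.py in the project directory:

workflow.onError {
    // forward the error message to the (hypothetical) script as an argument
    def proc = [ "${baseDir}/test.py", "${workflow.errorMessage}" ].execute()
    proc.waitFor()
    println proc.text
}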

Snakemake: variable that defines whether process is submitted cluster job or the snakefile

My current architecture is that at the start of my Snakefile I have a long-running function somefunc which helps decide the "input" to rule all. I realized when running the workflow with slurm that somefunc is being executed by each job. Is there some variable I can access that tells me whether the code is running in a submitted job or in the main process:
if not snakemake.submitted_job:
    config['layout'] = somefunc()
...
A solution which I don't really recommend is to make somefunc write the list of inputs to a tmp file so that slurm jobs will read this tmp file rather than reconstructing the list from scratch. The tmp file is created by whatever job is executed first so the long-running part is done only once.
At the end of the workflow delete the tmp file so that later executions will start fresh with new input.
Here's a sketch:
import os  # needed for os.remove below

def somefunc():
    try:
        all_output = open('tmp.txt').readlines()
        all_output = [x.strip() for x in all_output]
        print('List of input files read from tmp.txt')
    except:
        all_output = ['file1.txt', 'file2.txt'] # Long running part
        with open('tmp.txt', 'w') as fout:
            for x in all_output:
                fout.write(x + '\n')
        print('List of input files created and written to tmp.txt')
    return all_output

all_output = somefunc()

rule all:
    input:
        all_output,

rule one:
    output:
        all_output,
    shell:
        r"""
        touch {output}
        """

onsuccess:
    os.remove('tmp.txt')

onerror:
    os.remove('tmp.txt')
Since jobs will be submitted in parallel, you should make sure that only one job writes tmp.txt and the others read it. I think the try/except above will do it, but I'm not 100% sure. (Probably you want to use a better filename than tmp.txt; see the tempfile module. See also the atexit module for exit handlers.)
As discussed with @dariober, it seems cleanest to check whether the (hidden) .snakemake directory has locks, since they do not seem to be generated until the first rule starts (assuming you are not using the --nolock argument).
import os
locked = len(os.listdir(".snakemake/locks")) > 0
However this results in a problem in my case:
import time
import os

def longfunc():
    time.sleep(10)
    return range(5)

locked = len(os.listdir(".snakemake/locks")) > 0
if not locked:
    info = longfunc()

rule all:
    input:
        expand("test_{sample}", sample=info)

rule test:
    output:
        touch("test_{sample}")
    run:
        """
        sleep 1
        """
Somehow Snakemake lets each job re-interpret the complete Snakefile, with the issue that all the jobs complain that 'info is not defined'. For me it was easiest to store the results and load them for each job (pickle.dump and pickle.load).
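A sketch of that approach, assuming an arbitrary cache file called layout.pkl:

import os
import pickle

locked = len(os.listdir(".snakemake/locks")) > 0

if not locked:
    # main process: do the expensive work once and cache the result
    info = longfunc()
    with open("layout.pkl", "wb") as fh:
        pickle.dump(info, fh)
else:
    # submitted job: reload the cached result instead of recomputing it
    with open("layout.pkl", "rb") as fh:
        info = pickle.load(fh)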

pre-populate associative array keys in awk?

I've written a munin plugin that uses slurm's sacct to monitor job states on a HPC cluster. I've written it in sh + awk (rather than my usual tool of choice, perl).
The script works, but it took me ages to figure out how to pre-populate the associative array of possible states (some/most may not be present in the sacct output, and I want them to default to zero). Google wasn't much help, and the best I could come up with was to use split on a string to produce a temporary array, which I then iterated over.
I came up with this:
BEGIN {
    num = split("cancelled completed completing failed nodefail pending running suspended timeout", statenames, " ");
    for (i = 1; i <= num; i++) {
        states[statenames[i]] = 0
    }
}
This works, but seems clumsy compared to how I'd do it in Perl, like this:
foreach (qw(cancelled completed completing failed nodefail pending running suspended timeout)) {
    $states{$_} = 0;
}
or this
%states = map {$_ => 0} qw(cancelled completed completing failed nodefail pending running suspended timeout);
My question is: is there a way of doing this in awk that is similar to either of the Perl versions?
[ edited ]
To clarify, here's a sample of the sacct output I'm piping into awk. Note that the only states in this output are RUNNING, COMPLETED, and CANCELLED - the others don't appear (because they haven't occurred today), but I want them in my script's output anyway (in a form usable by munin, e.g. "statename.value 0").
# sacct -X -P -o 'state' -n
RUNNING
RUNNING
RUNNING
RUNNING
COMPLETED
RUNNING
COMPLETED
RUNNING
COMPLETED
COMPLETED
CANCELLED by 1000
COMPLETED
[ edited again ]
and here's sample output from my munin plugin:
# ./slurm-sacct
suspended.value 0
pending.value 0
nodefail.value 0
failed.value 0
running.value 6
completing.value 0
completed.value 5
timeout.value 0
cancelled.value 1
The script runs and does what I want, I just wanted to know if there was a better way to initialise the associative array.
You probably don't need to do it at all. Variables in awk are dynamic, which means they're automatically initialized when they are first used (either assigned to or accessed), and this applies to array elements as well.
A variable will be initialized to 0 if it's accessed in a numeric context, or to the empty string otherwise. (At least gawk does this, though I'm not sure if it's implementation-dependent) So if you're doing something like counting the number of jobs that are in each state, the entire program is as simple as something like
{ states[$1]++ }
END {
    for (state in states) print state, states[state]
}
Each time the expression states[$1]++ is executed, it will check for the existence of states[$1] and initialize it to 0 if it doesn't already exist.
EDIT: From your comment I'm guessing you want to print out a line for each possible state, regardless of whether there are any jobs in that state or not. In that case, you need to include all the possible state names, and there is no shortcut notation for doing so as there is in Perl. As far as I know, what you've already found is about as clean as it gets. (Awk is not really designed with that usage in mind)
I'd suggest the following:
{ states[$1]++ }
END {
    split("cancelled completed completing failed nodefail pending running suspended timeout", statenames, " ");
    for (i in statenames) print statenames[i], states[statenames[i]] + 0
}
Perhaps Craig can replace this:
print "Timeout states ", states["timeout"], "."
with this:
print "Timeout states ", int(states["timeout"]), "."
In my case, if there is no timeout state in the awk input, the first print will give:
Timeout states .
While the second will give:
Timeout states 0.
I think a more natural approach in awk would be to have a separate file of keys. Consider a file keys.txt with one key per line. You could then do something like this:
printf "key1\nkey2\nkey2\nkey5" |
awk '
FILENAME == "keys.txt" {
counts[$0] = 0
next
}
{
counts[$0]++
}
END {
for (key in counts) {
print key, counts[key]
}
}' keys.txt -
With five keys in keys.txt, this produces:
key1 1
key2 2
key3 0
key4 0
key5 1
Although the keys are shown in order here, that's just incidental and shouldn't be relied upon.
For the specific example, you could also skip the associative array altogether. Instead, you could minimally process the lines with awk and use sort | uniq -c to tabulate the counts. The presence of all keys could be ensured using join against a file of keys.
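A rough sketch of that idea, assuming keys.txt holds one lower-case state name per line and is sorted:

# count the states seen today (lower-cased, first word only)
sacct -X -P -o 'state' -n |
    awk '{ print tolower($1) }' |
    sort | uniq -c |
    awk '{ print $2, $1 }' > counts.txt

# join against the full key list, filling in 0 for states not seen
join -a 1 -e 0 -o '0,2.2' keys.txt counts.txt |
    awk '{ print $1 ".value", $2 }'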
awk is somewhat clumsier (I would say "less terse") than Perl.
You could write this (similar to @Michael's answer):
pipeline of data |
awk '
    NR == FNR { statenames[$1]=0; next }
    { usual processing }
    END { usual output }
' <(printf "%s\n" cancelled completed completing failed nodefail pending running suspended timeout) -
One tweak to @DavidZaslavsky's answer might be to print the states in the order you specified them on the split() line. That would be:
{ states[tolower($1)]++ }
END {
    n = split("cancelled completed completing failed nodefail pending running suspended timeout", statenames)
    for (i=1; i<=n; i++) {
        state = statenames[i]
        print state, states[state]+0
    }
}
I also converted the input to lower case so it matches your hard-coded values, got rid of the unnecessary 3rd arg to split() and the subsequent null statement (trailing semi-colon).
In case you want to account for finding state names in your input that weren't in your hard-coded set, you could tweak it to:
{ states[tolower($1)]++ }
END {
    n = split("cancelled completed completing failed nodefail pending running suspended timeout", statenames)
    for (i=1; i<=n; i++) {
        state = statenames[i]
        print state, states[state]+0
        delete states[state]
    }
    for (state in states) {
        printf "WARNING: found new state name %s\n", state | "cat>&2"
        print state, states[state]+0
    }
}