Passing list of filenames to nextflow process - nextflow

I am a newcomer to Nextflow and I am trying to process multiple files in a workflow. The number of these files is more than 300, so I would like to not to paste it into a command line as an option. So what I have done is I've created a file with every filename of the files I need to process, but I am not sure how to pass it into the process. This is what I've tried:
params.SRRs = "srr_ids.txt"
process tmp {
input:
file ids
output:
path "*.txt"
script:
'''
while read id; do
touch ${id}.txt;
echo ${id} > ${id}.txt;
done < $ids
'''
}
workflow {
tmp(params.SRRs)
}
The script is supposed to read in the file srr_ids.txt, and create files that have their ids in it (just testing on a smaller task). The error log says that the id variable is unbound, but I don't understand why. What is the conventional way of passing lots of filenames to a pipeline? Should I write some other process that parses the list?

Maybe there's a typo in your question, but the error is actually that the ids variable is unbound:
Command error:
.command.sh: line 5: ids: unbound variable
The problem is that when you use a single-quote script string, you will not be able to access Nextflow variables in your script block. You can either define your script using a double-quote string and escape your shell variables:
params.SRRs = "srr_ids.txt"
process tmp {
input:
path ids
output:
path "*.txt"
script:
"""
while read id; do
touch "\${id}.txt"
echo "\${id}" > "\${id}.txt"
done < "${ids}"
"""
}
workflow {
SRRs = file(params.SRRs)
tmp(SRRs)
}
Or, use a shell block which uses the exclamation mark ! character as the variable placeholder for Nextflow variables. This makes it possible to use both Nextflow and shell variables in the same piece of code without having to escape each of the shell variables:
params.SRRs = "srr_ids.txt"
process tmp {
input:
path ids
output:
path "*.txt"
shell:
'''
while read id; do
touch "${id}.txt"
echo "${id}" > "${id}.txt"
done < "!{ids}"
'''
}
workflow {
SRRs = file(params.SRRs)
tmp(SRRs)
}
What is the conventional way of passing lots of filenames to a
pipeline?
The conventional way, I think, is to actually supply one (or more) glob patterns to the fromPath channel factory method. For example:
params.SRRs = "./path/to/files/SRR*.fastq.gz"
workflow {
Channel
.fromPath( params.SRRs )
.view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [sleepy_bernard] DSL2 - revision: 30020008a7
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910483.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910482.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448795.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448793.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448794.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448792.fastq.gz
If instead you would prefer to pass in a list of filenames, like in your example, use either the splitCsv or the splitText operator to get what you want. For example:
params.SRRs = "srr_ids.txt"
workflow {
Channel
.fromPath( params.SRRs )
.splitText() { it.strip() }
.view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [fervent_ramanujan] DSL2 - revision: 89a1771d50
SRR1448794
SRR1448795
SRR1448792
SRR1448793
SRR1910483
SRR1910482
Should I write some other process that parses the list?
You may not need to. My feeling is that your code might benefit from using the fromSRA factory method, but we don't really have enough details to say one way or the other. If you need to, you could just write a function that returns a channel.

Related

Nextflow input how to declare tuple in tuple

I am working with a nextflow workflow that, at a certain stage, groups a series of files by their sample id using groupTuple(), and resulting in a channel that looks like this:
[sample_id, [file_A, file_B, ... , file_N]]
[sample_id, [file_A, file_B, ... , file_N]]
...
[sample_id, [file_A, file_B, ... , file_N]]
Note that this is the same channel structure that you get from .fromFilePairs().
I want to use these channel items in a process in such a way that, for each item, the process reads the sample_id from the first field and all the files from the inner tuple at once.
The nextflow documentation is somewhat cryptic about this, and it is hard to find how to declare this type of input in a channel, so I thought I'd create a question on stack overflow and then answer it myself for anyone who will ever be looking for this answer.
How does one declare the inner tuple in the input section of a nextflow process?
In the example given above, my inner tuple contains items of only one type (files). I can therefore pass the whole second term of the tuple (i.e. the inner tuple) as a single input item under the file() qualifier. Like this:
input:
tuple \
val(sample_id), \
file(inner_tuple) \
from Input_channel
This will ensure that the tuple content is read as file (one by one), the same way as performing .collect() on a channel of files, in the sense that all files will then be available in the nextflow temp directory where the process is executed.
The question is how you come up with sample_id, but in case they just have different file extensions you might use something like this:
all_files = Channel.fromPath("/path/to/your/files/*")
all_files.map { it -> [it.simpleName, it] }
.groupTuple()
.set { grouped_files }
The path qualifier (previously the file qualifier) can be used to stage a single (file) value or a collection of (file) values into the process execution directory. The note at the bottom of the multiple input files section in the docs also mentions:
The normal file input constructs introduced in the input of files
section are valid for collections of multiple files as well.
This means, you can use a script variable, e.g.:
input:
tuple val(sample_id), path(my_files)
In which case, the variable will hold the list of files (preserving the original filenames). You could use it directly to refer to all of the files in the list, or, you could access specific (file) elements (if you need them) using square bracket (slice) notation.
This is the syntax you will want most of the time. However, if you need predicable filenames or if you need to deal with files with the identical filenames, you may need a different approach:
Alternatively, you could specify a target filename, e.g.:
input:
tuple val(sample_id), path('my_file')
In the case where a single file is received by the process, the file would be staged with the target filename. However, when a collection of files is received by the process, the filename will be appended with a numerical suffix representing its ordinal position in the list. For example:
process test {
tag { sample_id }
debug true
stageInMode 'rellink'
input:
tuple val(sample_id), path('fastq')
"""
echo "${sample_id}:"
ls -g --time-style=+"" fastq*
"""
}
workflow {
readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )
test( readgroups )
}
Results:
$ touch {foo,bar,baz}_{1,2}.fastq
$ nextflow run .
N E X T F L O W ~ version 22.04.4
Launching `./main.nf` [scruffy_caravaggio] DSL2 - revision: 87a80d6d50
executor > local (3)
[65/66f860] process > test (bar) [100%] 3 of 3 ✔
baz:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../baz_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../baz_2.fastq
foo:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../foo_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../foo_2.fastq
bar:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../bar_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../bar_2.fastq
Note that the names of staged files can be controlled using the * and ? wildcards. See the links above for a table that shows how the wildcards are replaced depending on the cardinality of the input collection.

Nextflow: publishDir, output channels, and output subdirectories

I've been trying to learn how to use Nextflow and come across an issue with adding output to a channel as I need the processes to run in an order. I want to pass output files from one of the output subdirectories created by the tool (ONT-Guppy) into a channel, but can't seem to figure out how.
Here is the nextflow process in question:
process GupcallBases {
publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"
executor = 'pbspro'
clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"
output:
path "*.bam" into bams_ch
script:
"""
module load cuda/11.4.2
singularity exec --nv $params.Gup_container \
guppy_basecaller --config $params.P1_gupConf \
--device "cuda:0" \
--bam_out \
--recursive \
--compress \
--align_ref $params.refGen \
-i $params.P1_inDir \
-s $params.P1_outDir \
--gpu_runners_per_device $params.P1_GPU_runners \
--num_callers $params.P1_callers
"""
}
The output of the process is something like this:
$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)
I only want to keep the bam files in $params.P1_outDir/pass/, hence trying to use the pattern = "pass/*.bam, but I've tried a few other patterns to no avail.
The output syntax was chosen since once this process is done, using the following channel works:
// Channel
// .fromPath("${params.P1_outDir}/pass/*.bam")
// .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
// .set { bams_ch }
But the problem is if I don't pass the files into the output channel of the first process, they run in parallel. I could simply be missing something in the extensive documentation in how to order processes, which would be an alternative solution.
Edit: I forgo to add the error message which is here: Missing output file(s) `*.bam` expected by process `GupcallBases` and the $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.
Thanks in advance.
Nextflow processes are designed to run isolated from each other, but this can be circumvented somewhat when the command-line input and/or outputs are specified using params. Using params like this can be problematic because if, for example, a params variable specifies an absolute path but your output declaration expects files in the Nextflow working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing.
The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
Also, best to avoid calling Singularity manually in your script block. Instead just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:
params.publishDir = './results'
input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )
process GuppyBasecaller {
publishDir(
path: "${params.publishDir}/GuppyBasecaller",
mode: 'copy',
saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
)
beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'
container '/path/to/guppy_basecaller.img'
input:
path input_dir
path guppy_config
path ref_genome
output:
path "outdir/pass/*.bam" into bams_ch
"""
mkdir outdir
guppy_basecaller \\
--config "${guppy_config}" \\
--device "cuda:0" \\
--bam_out \\
--recursive \\
--compress \\
--align_ref "${ref_genome}" \\
-i "${input_dir}" \\
-s outdir \\
--gpu_runners_per_device "${params.guppy_gpu_runners}" \\
--num_callers "${params.guppy_callers}"
"""
}

Nextflow: how do you pass an output (multiple files) from the publishdir to the next process?

I have a process generating two files that I am interested in, hitsort.cls and contigs.fasta.
I output these using publishdir:
process RUN_RE {
publishDir "$baseDir/RE_output", mode: 'copy'
input:
file 'interleaved.fq'
output:
file "${params.RE_run}/seqclust/clustering/hitsort.cls"
file "${params.RE_run}/contigs.fasta"
script:
"""
some_code
"""
}
Now, I need these two files to be an input for another process but I don't know how to do that.
I have tried calling this process with
NEXT_PROCESS(params.hitsort, params.contigs)
while specifying the input as:
process NEXT_PROCESS {
input:
path hitsort
path contigs
but it's not working, because only the basename is used instead of the full path. Basically what I want is to wait for RUN_RE to finish, and then use the two files it outputs for the next process.
Best to avoid accessing files in the publishDir, since:
Files are copied into the specified directory in an asynchronous manner, thus they may not be immediately available in the published directory at the end of the process execution. For this reason files published by a process must not be accessed by other downstream processes.
The recommendation is therefore to ensure your processes only access files in the working directory, (i.e. ./work). What this means is: it's best to avoid things like absolute paths in your input and output declarations. This will also help ensure your workflows are portable.
nextflow.enable.dsl=2
params.interleaved_fq = './path/to/interleaved.fq'
params.publish_dir = './results'
process RUN_RE {
publishDir "${params.publish_dir}/RE_output", mode: 'copy'
input:
path interleaved
output:
path "./seqclust/clustering/hitsort.cls", emit: hitsort_cls
path "./contigs.fasta", emit: contigs_fasta
"""
# do something with ${interleaved}...
ls -l "${interleaved}"
# create some outputs...
mkdir -p ./seqclust/clustering
touch ./seqclust/clustering/hitsort.cls
touch ./contigs.fasta
"""
}
process NEXT_PROCESS {
input:
path hitsort
path contigs
"""
ls -l
"""
}
workflow {
interleaved_fq = file( params.interleaved_fq )
NEXT_PROCESS( RUN_RE( interleaved_fq ) )
}
The above workflow block is effectively the same as:
workflow {
interleaved_fq = file( params.interleaved_fq )
RUN_RE( interleaved_fq )
NEXT_PROCESS( RUN_RE.out.hitsort_cls, RUN_RE.out.contigs_fasta )
}

Snakemake: variable that defines whether process is submitted cluster job or the snakefile

My current architecture is that at the start of my Snakefile I have a long running function somefunc which helps decide the "input" to rule all. I realized when I was running the workflow with slurm that somefunc is being executed by each job. Is there some variable I can access that defines whether the code is a submitted job or whether it is the main process:
if not snakemake.submitted_job:
config['layout'] = somefunc()
...
A solution which I don't really recommend is to make somefunc write the list of inputs to a tmp file so that slurm jobs will read this tmp file rather than reconstructing the list from scratch. The tmp file is created by whatever job is executed first so the long-running part is done only once.
At the end of the workflow delete the tmp file so that later executions will start fresh with new input.
Here's a sketch:
def somefunc():
try:
all_output = open('tmp.txt').readlines()
all_output = [x.strip() for x in all_output]
print('List of input files read from tmp.txt')
except:
all_output = ['file1.txt', 'file2.txt'] # Long running part
with open('tmp.txt', 'w') as fout:
for x in all_output:
fout.write(x + '\n')
print('List of input files created and written to tmp.txt')
return all_output
all_output = somefunc()
rule all:
input:
all_output,
rule one:
output:
all_output,
shell:
r"""
touch {output}
"""
onsuccess:
os.remove('tmp.txt')
onerror:
os.remove('tmp.txt')
Since jobs will be submitted in parallel, you should make sure that only one job writes tmp.txt and the others read it. I think the try/except above will do it but I'm not 100% sure. (Probably you want to use some better filename than tmp.txt, see the module tempfile. see also the module atexit) for exit handlers)
As discussed with #dariober it seems the cleanest to check whether the (hidden) snakemake directory has locks since they seem not to be generated until the first rule starts (assuming you are not using the --nolock argument).
import os
locked = len(os.listdir(".snakemake/locks")) > 0
However this results in a problem in my case:
import time
import os
def longfunc():
time.sleep(10)
return range(5)
locked = len(os.listdir(".snakemake/locks")) > 0
if not locked:
info = longfunc()
rule all:
input:
expand("test_{sample}", sample=info)
rule test:
output:
touch("test_{sample}")
run:
"""
sleep 1
"""
Somehow snakemake lets each rule reinterpret the complete snakefile, with the issue that all the jobs will complain that 'info is not defined'. For me it was easiest to store the results and load them for each job (pickle.dump and pickle.load).

How do I store the value returned by either run or shell?

Let's say I have this script:
# prog.p6
my $info = run "uname";
When I run prog.p6, I get:
$ perl6 prog.p6
Linux
Is there a way to store a stringified version of the returned value and prevent it from being output to the terminal?
There's already a similar question but it doesn't provide a specific answer.
You need to enable the stdout pipe, which otherwise defaults to $*OUT, by setting :out. So:
my $proc = run("uname", :out);
my $stdout = $proc.out;
say $stdout.slurp;
$stdout.close;
which can be shortened to:
my $proc = run("uname", :out);
say $proc.out.slurp(:close);
If you want to capture output on stderr separately than stdout you can do:
my $proc = run("uname", :out, :err);
say "[stdout] " ~ $proc.out.slurp(:close);
say "[stderr] " ~ $proc.err.slurp(:close);
or if you want to capture stdout and stderr to one pipe, then:
my $proc = run("uname", :merge);
say "[stdout and stderr] " ~ $proc.out.slurp(:close);
Finally, if you don't want to capture the output and don't want it output to the terminal:
my $proc = run("uname", :!out, :!err);
exit( $proc.exitcode );
The solution covered in this answer is concise.
This sometimes outweighs its disadvantages:
Doesn't store the result code. If you need that, use ugexe's solution instead.
Doesn't store output to stderr. If you need that, use ugexe's solution instead.
Potential vulnerability. This is explained below. Consider ugexe's solution instead.
Documentation of the features explained below starts with the quote adverb :exec.
Safest unsafe variant: q
The safest variant uses a single q:
say qx[ echo 42 ] # 42
If there's an error then the construct returns an empty string and any error message will appear on stderr.
This safest variant is analogous to a single quoted string like 'foo' passed to the shell. Single quoted strings don't interpolate so there's no vulnerability to a code injection attack.
That said, you're passing a single string to the shell which may not be the shell you're expecting so it may not parse the string as you're expecting.
Least safe unsafe variant: qq
The following line produces the same result as the q line but uses the least safe variant:
say qqx[ echo 42 ]
This double q variant is analogous to a double quoted string ("foo"). This form of string quoting does interpolate which means it is subject to a code injection attack if you include a variable in the string passed to the shell.
By default run just passes the STDOUT and STDERR to the parent process's STDOUT and STDERR.
You have to tell it to do something else.
The simplest is to just give it :out to tell it to keep STDOUT. (Short for :out(True))
my $proc = run 'uname', :out;
my $result = $proc.out.slurp(:close);
my $proc = run 'uname', :out;
for $proc.out.lines(:close) {
.say;
}
You can also effectively tell it to just send STDOUT to /dev/null with :!out. (Short for :out(False))
There are more things you can do with :out
{
my $file will leave {.close} = open :w, 'test.out';
run 'uname', :out($file); # write directly to a file
}
print slurp 'test.out'; # Linux
my $proc = run 'uname', :out;
react {
whenever $proc.out.Supply {
.print
LAST {
$proc.out.close;
done; # in case there are other whenevers
}
}
}
If you are going to do that last one, it is probably better to use Proc::Async.