I am trying to have a process that is launched only if a combination of conditions is met, but when checking if a channel has a path to a file, it always returns it as empty. Probably I am doing something wrong, in that case please correct my code. I tried to follow some of the suggestions in this issue but no success.
Consider the following minimal example:
process one {
output:
file("test.txt") into _chProcessTwo
script:
"""
echo "Hello world" > "test.txt"
"""
}
// making a copy so I check first if something in the channel or not
// avoids raising exception of MultipleInputChannel
_chProcessTwo.into{
_chProcessTwoView;
_chProcessTwoCheck;
_chProcessTwoUse
}
//print contents of channel
println "Channel contents: " + _chProcessTwoView.toList().view()
process two {
input:
file(myInput) from _chProcessTwoUse
when:
(!_chProcessTwoCheck.toList().isEmpty())
script:
def test = _chProcessTwoUse.toList().isEmpty() ? "I'm empty" : "I'm NOT empty"
println "The outcome is: " + test
}
I want to have process two run if and only if there is a file in the _chProcessTwo channel.
If I run the above code I obtain:
marius#dev:~/pipeline$ ./bin/nextflow run test.nf
N E X T F L O W ~ version 19.09.0-edge
Launching `test.nf` [infallible_gutenberg] - revision: 9f57464dc1
[c8/bf38f5] process > one [100%] 1 of 1 ✔
[- ] process > two -
[/home/marius/pipeline/work/c8/bf38f595d759686a497bb4a49e9778/test.txt]
where the last line are actually the contents of _chProcessTwoView
If I remove the when directive from the second process I get:
marius#mg-dev:~/pipeline$ ./bin/nextflow run test.nf
N E X T F L O W ~ version 19.09.0-edge
Launching `test.nf` [modest_descartes] - revision: 5b2bbfea6a
[57/1b7b97] process > one [100%] 1 of 1 ✔
[a9/e4b82d] process > two [100%] 1 of 1 ✔
[/home/marius/pipeline/work/57/1b7b979933ca9e936a3c0bb640c37e/test.txt]
with the contents of the second worker .command.log file being: The outcome is: I'm empty
I tried also without toList()
What am I doing wrong? Thank you in advance
Update: a workaround would be to check _chProcessTwoUse.view() != "" but that is pretty dirty
Update 2 as required by #Steve, I've updated the code to reflect a bit more the actual conditions i have in my own pipeline:
def runProcessOne = true
process one {
when:
runProcessOne
output:
file("inputProcessTwo.txt") into _chProcessTwo optional true
file("inputProcessThree.txt") into _chProcessThree optional true
script:
// this would replace the probability that output is not created
def outputSomething = false
"""
if ${outputSomething}; then
echo "Hello world" > "inputProcessTwo.txt"
echo "Goodbye world" > "inputProcessThree.txt"
else
echo "Sorry. Process one did not write to file."
fi
"""
}
// making a copy so I check first if something in the channel or not
// avoids raising exception of MultipleInputChannel
_chProcessTwo.into{
_chProcessTwoView;
_chProcessTwoCheck;
_chProcessTwoUse
}
//print contents of channel
println "Channel contents: " + _chProcessTwoView.view()
println _chProcessTwoView.view() ? "Me empty" : "NOT empty"
process two {
input:
file(myInput) from _chProcessTwoUse
when:
(runProcessOne)
script:
"""
echo "The outcome is: ${myInput}"
"""
}
process three {
input:
file(defaultInput) from _chUpstreamProcesses
file(inputFromProcessTwo) from _chProcessThree
script:
def extra_parameters = _chProcessThree.isEmpty() ? "" : "--extra-input " + inputFromProcessTwo
"""
echo "Hooray! We got: ${extra_parameters}"
"""
}
As #Steve mentioned, I should not even check if a channel is empty, NextFlow should know better to not initiate the process. But I think in this construct I will have to.
Marius
I think part of the problem here is that process 'one' creates only optional outputs. This makes dealing with the optional inputs in process 'three' a bit tricky. I would try to reconcile this if possible. If this can't be reconciled, then you'll need to deal with the optional inputs in process 'three'. To do this, you'll basically need to create a dummy file, pass it into the channel using the ifEmpty operator, then use the name of the dummy file to check whether or not to prepend the argument's prefix. It's a bit of a hack, but it works pretty well.
The first step is to actually create the dummy file. I like shareable pipelines, so I would just create this in your baseDir, perhaps under a folder called 'assets':
mkdir assets
touch assets/NO_FILE
Then pass in your dummy file if your '_chProcessThree' channel is empty:
params.dummy_file = "${baseDir}/assets/NO_FILE"
dummy_file = file(params.dummy_file)
process three {
input:
file(defaultInput) from _chUpstreamProcesses
file(optfile) from _chProcessThree.ifEmpty(dummy_file)
script:
def extra_parameters = optfile.name != 'NO_FILE' ? "--extra-input ${optfile}" : ''
"""
echo "Hooray! We got: ${extra_parameters}"
"""
}
Also, these lines are problematic:
//print contents of channel
println "Channel contents: " + _chProcessTwoView.view()
println _chProcessTwoView.view() ? "Me empty" : "NOT empty"
Calling view() will emit all values from the channel to stdout. You can ignore whatever value it returns. Unless you enable DSL2, the channel will then be empty. I think what you're looking for here is a closure:
_chProcessTwoView.view { "Found: $it" }
Be sure to append -ansi-log false to your nextflow run command so the output doesn't get clobbered. HTH.
Related
How do I go about defining a workflow that executes two initial processes in parallel and then handles both outputs of those processes in a third process? The simple examples I was able to find in tutorials always define sequential flows and often use stdin/stdout to transport information.
In order to illustrate what I want to achieve, here is a diagram of the DAG that I imagine:
I imagine the .nf file to look something like the following but cannot fill in the blanks
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
process produceRandomX {
output:
x // what do I put in these places?
"""
print $RANDOM
"""
}
process produceRandomY {
output:
y
"""
print -$RANDOM
"""
}
process calculateSum {
input:
x
y
output:
sum
"""
print $x+$y
"""
}
process printResult {
input:
sum
output:
stdout
"""
print $sum
"""
}
workflow {
// syntax here is broken and just meant to illustrate what I want to achieve
produceRandomX \
| calculateSum | printResult
produceRandomY /
}
Nextflow processes can define one or more input and output channels. The interaction between these, and ultimately the pipeline execution itself, is implicitly defined by these input and output declarations1. In the following example, produceRandomX and produceRandomY could be run in parallel (assuming there are sufficient system resources available) but even if they're not, calculateSum will wait until it receives a complete input configuration (i.e. until it receives a value from each input channel).
process produceRandomX {
output:
stdout
"""
echo "\$RANDOM"
"""
}
process produceRandomY {
output:
stdout
"""
echo "-\$RANDOM"
"""
}
process calculateSum {
input:
val x
val y
output:
stdout
"""
echo \$(( $x + $y ))
"""
}
workflow {
x = produceRandomX()
y = produceRandomY()
sum = calculateSum( x, y )
sum.view()
}
Results:
$ nextflow run -ansi-log false main.nf
N E X T F L O W ~ version 22.10.0
Launching `main.nf` [lonely_torvalds] DSL2 - revision: 3b752ca2cc
[07/8e1473] Submitted process > produceRandomX
[fd/d9a629] Submitted process > produceRandomY
[15/c88aac] Submitted process > calculateSum
-1331
I've been struggling to identify why a nextflow (v20.10.00) process is not using all the items in a channel. I want the process to run for each sample bam file (10 in total) and for each chromosome (3 in total).
Here is the creation of the channels and the process:
ref_genome = file( params.RefGen, checkIfExists: true )
ref_dir = ref_genome.getParent()
ref_name = ref_genome.getBaseName()
ref_dict = file( "${ref_dir}/${ref_name}.dict", checkIfExists: true )
ref_index = file( "${ref_dir}/${ref_name}.*.fai", checkIfExists: true )
// Handles reading in data if the previous step is skipped
if( params.Skip_BP ){
Channel
.fromFilePairs("${params.ProcBamDir}/*{bam,bai}") { file -> file.name.replaceAll(/.bam|.bai$/,'') }
.ifEmpty { error "No bams found in ${params.ProcBamDir}" }
.map { ID, files -> tuple(ID, files[0], files[1]) }
.set { processed_bams }
}
// Setting up the chromosome channel
if( params.Chroms == "" ){
// Defaulting to using all chromosomes
chromosomes_ch = Channel
.from("AgamP4_2L", "AgamP4_2R", "AgamP4_3L", "AgamP4_3R", "AgamP4_X", "AgamP4_Y_unplaced", "AgamP4_UNKN")
println "No chromosomes specified, using all major chromosomes: AgamP4_2L, AgamP4_2R, AgamP4_3L, AgamP4_3R, AgamP4_X, AgamP4_Y_unplaced, AgamP4_UNKN"
} else {
// User option to choose which chromosome will be used
// This worked with the following syntax nextflow run testing.nf --profile imperial --Chroms "AgamP4_3R,AgamP4_2L"
chrs = params.Chroms.split(",")
chromosomes_ch = Channel
.from( chrs )
println "User defined chromosomes set: ${params.Chroms}"
}
process DNA_HCG {
errorStrategy { sleep(Math.pow(2, task.attempt) * 600 as long); return 'retry' }
maxRetries 3
maxForks params.HCG_Forks
tag { SampleID+"-"+chrom }
executor = 'pbspro'
clusterOptions = "-lselect=1:ncpus=${params.HCG_threads}:mem=${params.HCG_memory}gb:mpiprocs=1:ompthreads=${params.HCG_threads} -lwalltime=${params.HCG_walltime}:00:00"
publishDir(
path: "${params.HCDir}",
mode: 'copy',
)
input:
each chrom from chromosomes_ch
set SampleID, path(bam), path(bai) from processed_bams
path ref_genome
path ref_dict
path ref_index
output:
tuple chrom, path("${SampleID}-${chrom}.vcf") into HCG_ch
path("${SampleID}-${chrom}.vcf.idx") into idx_ch
beforeScript 'module load anaconda3/personal; source activate NF_GATK'
script:
"""
if [ ! -d tmp ]; then mkdir tmp; fi
taskset -c 0-${params.HCG_threads} gatk --java-options \"-Xmx${params.HCG_memory}G -XX:+UseParallelGC -XX:ParallelGCThreads=${params.HCG_threads}\" HaplotypeCaller \\
--tmp-dir tmp/ \\
--pair-hmm-implementation AVX_LOGLESS_CACHING_OMP \\
--native-pair-hmm-threads ${params.HCG_threads} \\
-ERC GVCF \\
-L ${chrom} \\
-R ${ref_genome} \\
-I ${bam} \\
-O ${SampleID}-${chrom}.vcf ${params.GVCF_args}
"""
}
But for reasons I cannot figure out, nextflow only creates 3 jobs: [d8/45499b] process > DNA_HCG (0_wt5_BP-CM029350.1) [ 0%] 0 of 3
I thought maybe it was because it only took the first sample and then one process for each chromosome. Though I doubted this since the code works for a different reference genome correctly. Regardless, I adjusted the input channels:
processed_bams
.combine(chromosomes_ch)
.set { HCG_in }
and
input:
set SampleID, path(bam), path(bai), chrom from HCG_in
But this resulted in only a single job being created: [6e/78b070] process > DNA_HCG (0_wt10_BP-CM029350.1) [ 0%] 0 of 1
Confusingly, when i use HCG_in.view() there are 30 items. And to further confuse me the correct number of jobs comes from the following code:
chrs = params.Chroms.split(",")
chromosomes_ch = Channel
.from(chrs)
Channel
.fromFilePairs("${params.ProcBamDir}/*{bam,bai}") { file -> file.name.replaceAll(/.bam|.bai$/,'') }
.ifEmpty { error "No bams found in ${params.ProcBamDir}" }
.map { ID, files -> tuple(ID, files[0], files[1]) }
.set { processed_bams }
process HCG {
executor 'local'
input:
each chrom from chromosomes_ch
set SampleID, path(bam), path(bai) from processed_bams
//set SampleID, path(bam), path(bai), chrom from HCG_in
script:
"""
echo "${SampleID} - ${chrom}"
"""
}
Output: [75/c1c25a] process > HCG (27) [100%] 30 of 30 ✔
I'm hoping I've just missed something obvious, but I cannot see it at the moment. Thanks in advance for the help.
Issues like this almost always involve the use of multiple input channels:
When two or more channels are declared as process inputs, the process
stops until there’s a complete input configuration ie. it receives an
input value from all the channels declared as input.
Your initial assessment was correct. However, the reason only three processes were run (i.e. one sample for each of the three chromosomes), is because this line (probably) returned a list (i.e. a java LinkedList) containing a single element, and lists behave like queue channels:
ref_index = file( "${ref_dir}/${ref_name}.*.fai", checkIfExists: true )
You might have expected this to return a UnixPath. Ultimately, the solution is to ensure ref_index is value channel.
I am getting a java.nio.file.ProviderMismatchException when I run the following script:
process a {
output:
file _biosample_id optional true into biosample_id
script:
"""
touch _biosample_id
"""
}
process b {
input:
file _biosample_id from biosample_id.ifEmpty{file("_biosample_id")}
script:
def biosample_id_option = _biosample_id.isEmpty() ? '' : "--biosample_id \$(cat _biosample_id)"
"""
echo \$(cat ${_biosample_id})
"""
}
i'm using a slightly modified version of Optional Input pattern.
Any ideas on why I'm getting the java.nio.file.ProviderMismatchException?
In your script block, _biosample_id is actually an instance of the nextflow.processor.TaskPath class. So to check if the file (or directory) is empty you can just call it's .empty() method. For example:
script:
def biosample_id_option = _biosample_id.empty() ? '' : "--biosample_id \$(< _biosample_id)"
I like your solution - I think it's neat. And I think it should be robust (but I haven't tested it). The optional input pattern that is recommended will fail when attempting to stage missing input files to a remote filesystem/object store. There is a solution however, which is to keep an empty file in your $baseDir and point to it in your scripts. For example:
params.inputs = 'prots/*{1,2,3}.fa'
params.filter = "${baseDir}/assets/null/NO_FILE"
prots_ch = Channel.fromPath(params.inputs)
opt_file = file(params.filter)
process foo {
input:
file seq from prots_ch
file opt from opt_file
script:
def filter = opt.name != 'NO_FILE' ? "--filter $opt" : ''
"""
your_commad --input $seq $filter
"""
}
I have problem with nextflow, I have a tuple with 3 elements (id, fastq_File, out_file) and I need join a new file to every tuple element (same file for all tuple element).
Well first I have a fastq, and I split this in chunks, and map with their id, after I have a process (simple process in the example), but this process return the id with other file.
reads = Channel.fromPath( 'data/illumina.fastq' )
.splitFastq(by: 150_000, file:true)
reads.map { it -> [it.name - ~/\.fastq/, it] }
.into{tuple_reads ; tuple_reads2}
process pr1 { /*is an example my real process is more complex*/
echo true
input:
tuple val(id), path(file) from tuple_reads
output:
tuple val(id), file("example${id}.out") into example_test
script:
"""
echo example${id} > example${id}.out
"""
}
readss = tuple_reads2.join(example_test)
I join the channels and I obtain something like this:
[illumina.1, /home/qs/work/../illumina.1.fastq, /home/qs/work/../exampleillumina.1.out]
[illumina.2, /home/qs/work/../illumina.2.fastq, /home/qs/work/../exampleillumina.2.out]
[illumina.3, /home/qs/work/../illumina.3.fastq, /home/qs/work/../exampleillumina.3.out]
Now, I have a channel with my id, the fastq file, and the out from process pr1, perfect for me, but this is the problem now, I need to create other process to run with a static file.
I need that every id run with the static_file but I don't know how do this. I need a new channel with something like this:
[illumina.1, /home/qs/work/../illumina.1.fastq, /home/qs/work/../exampleillumina.1.out,/home/qs/work/../static_file.txt]
[illumina.2, /home/qs/work/../illumina.2.fastq, /home/qs/work/../exampleillumina.2.out,/home/qs/work/../static_file.txt]
[illumina.3, /home/qs/work/../illumina.3.fastq, /home/qs/work/../exampleillumina.3.out,/home/qs/work/../static_file.txt]
or I need a process that repet the static file with every run.
The below code only run with the first element from the tuple :( (I tried with each but doesn't work.
process pr2 {
echo true
input:
tuple val(id), path(fastq_file), path(out_file) from example_test
path(st_file) from static_file
script:
"""
echo ${id} ${st_file}
"""
}
Thanks!!
You just need to make sure your second channel (i.e. the one for your static file) is a value channel. You didn't show how the static_file channel came was created, but you'll get the behaviour you're seeing if it's a regular queue channel, see here: Understand how multiple input channels work. To fix your example, all you need is:
static_file = file(params.static)
process pr2 {
echo true
input:
tuple val(id), path(fastq_file), path(out_file) from example_test
path(st_file) from static_file
script:
"""
echo ${id} ${st_file}
"""
}
Which is the same as:
static_file = file(params.static)
process pr2 {
echo true
input:
tuple val(id), path(fastq_file), path(out_file) from example_test
path static_file
script:
"""
echo ${id} ${static_file}
"""
}
Given the following nextflow.config:
google {
project = "cool-project"
region = "europe-west4"
lifeSciences {
bootDiskSize = "200 GB"
debug = true
preemptible = true
}
}
Is it possible to override one or more of those settings using command line arguments. For example, if I wanted to specify that no preemptible machines should be used, can I do the following:
nextflow run main.nf -c nextflow.config --google.lifeSciences.preemptible false
?
Overriding pipeline parameters can be done using Nextflow's command line interface by prefixing the parameter name with a double dash. For example, put the following in a file called 'test.nf':
#!/usr/bin/env nextflow
params.greeting = 'Hello'
names = Channel.of( "foo", "bar", "baz" )
process greet {
input:
val name from names
output:
stdout result
"""
echo "${params.greeting} ${name}"
"""
}
result.view { it.trim() }
And run it using:
nextflow run -ansi-log false test.nf --greeting 'Bonjour'
Results:
N E X T F L O W ~ version 20.10.0
Launching `test.nf` [backstabbing_cajal] - revision: 431ef92cef
[46/22b4f0] Submitted process > greet (1)
[ca/32992c] Submitted process > greet (3)
[6e/5880b0] Submitted process > greet (2)
Bonjour bar
Bonjour foo
Bonjour baz
This works fine for pipeline params, but AFAIK there's no way to directly override executor config like you describe on the command line. You can however, just parameterize these values and set them on the command line like described above. For example, in your nextflow.config:
params {
gc_region = false
gc_preemptible = true
...
}
profiles {
'test' {
includeConfig 'conf/test.config'
}
'google' {
includeConfig 'conf/google.config'
}
...
}
And in a file called 'conf/google.config':
google {
project = "cool-project"
region = params.gc_region
lifeSciences {
bootDiskSize = "200 GB"
debug = true
preemptible = params.gc_preemptible
}
}
Then you should be able to override these in the usual way:
nextflow run main.nf -profile google --gc_region "europe-west4" --gc_preemptible false
Note that you can also specify multiple configuration profiles by separating the profile names with a comma:
nextflow run main.nf -profile google,test ...