Why is a conditional channel source causing a downstream process to not execute an instance for each value in a different channel? - nextflow

I have a Nextflow DSL2 pipeline where an early process generally takes a very long time (~24 hours) and has intermediate products that occupy a lot of storage (~1 TB). Because of the length and resources required for this process, it would be desirable to be able to set a "checkpoint", i.e. save the (relatively small) final output to a safe location, and on subsequent pipeline executions retrieve the output from that location. This means that the intermediate data can be safely deleted without preventing resumption of the pipeline later.
However, I've found that when I implement this and use the checkpoint, a process further downstream that is supposed to run an instance for every value in a list only runs a single instance. Minimal working example and example outputs below:
// foobarbaz.nf
nextflow.enable.dsl=2
params.publish_dir = "$baseDir/output"
params.nofoo = false
xy = ['x', 'y']
xy_chan = Channel.fromList(xy)
process foo {
    publishDir "${params.publish_dir}/", mode: "copy"

    output:
    path "foo.out"

    """
    touch foo.out
    """
}

process bar {
    input:
    path foo_out

    output:
    path "bar.out"

    script:
    """
    touch bar.out
    """
}

process baz {
    input:
    path bar_out
    val xy

    output:
    tuple val(xy), path("baz_${xy}.out")

    script:
    """
    touch baz_${xy}.out
    """
}

workflow {
    main:
    if( params.nofoo ) {
        foo_out = Channel.fromPath("${params.publish_dir}/foo.out")
    }
    else {
        foo_out = foo() // generally takes a long time and uses lots of storage
    }
    bar_out = bar(foo_out)
    baz_out = baz(bar_out, xy_chan)
    // ... continue to do things with baz_out ...
}
First execution with foo:
$ nextflow foobarbaz.nf
N E X T F L O W ~ version 21.10.6
Launching `foobarbaz.nf` [soggy_gautier] - revision: f4e70a5cd2
executor > local (4)
[77/c65a9a] process > foo [100%] 1 of 1 ✔
[23/846929] process > bar [100%] 1 of 1 ✔
[18/1c4bb1] process > baz (2) [100%] 2 of 2 ✔
(note that baz successfully executes two instances: one where xy==x and one where xy==y)
Later execution using the checkpoint:
$ nextflow foobarbaz.nf --nofoo
N E X T F L O W ~ version 21.10.6
Launching `foobarbaz.nf` [infallible_babbage] - revision: f4e70a5cd2
executor > local (2)
[40/b42ed3] process > bar (1) [100%] 1 of 1 ✔
[d9/76888e] process > baz (1) [100%] 1 of 1 ✔
The checkpointing is successful (bar executes without needing foo), but now baz only executes a single instance where xy==x.
Why is this happening, and how can I get the intended behaviour? I see no reason why whether foo_out comes from foo or is retrieved directly from a file should make any difference to how the xy channel is interpreted by baz.

The problem is that the Channel.fromPath factory method creates a queue channel that provides a single value, whereas the output of process 'foo' is implicitly a value channel:
A value channel is implicitly created by a process when an input
specifies a simple value in the from clause. Moreover, a value channel
is also implicitly created as output for a process whose inputs are
only value channels.
So without --nofoo, 'foo_out' and 'bar_out' are both value channels. Since 'xy_chan' is a queue channel that provides two values, process 'baz' gets executed twice. With --nofoo, 'foo_out' and 'bar_out' are both queue channels that provide a single value. Since there is then only one complete input configuration (i.e. one value from each input channel), process 'baz' gets executed only once. See also: Understand how multiple input channels work.
The solution is to ensure that 'foo_out' is either always a queue channel or always a value channel. Given your 'foo' process declaration, you probably want the latter:
if( params.nofoo ) {
    foo_out = file( "${params.publish_dir}/foo.out" )
}
else {
    foo_out = foo()
}
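Alternatively (just a sketch, not part of the original answer), you could keep the Channel.fromPath factory and convert the resulting single-item queue channel into a value channel with an operator such as first:

if( params.nofoo ) {
    // first() returns a value channel holding the first (and here only) item
    foo_out = Channel.fromPath( "${params.publish_dir}/foo.out" ).first()
}
else {
    foo_out = foo()
}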

In my experience, a process runs as many times as the input channel with the fewest emissions allows (here, the single path emitted by bar).
So, to my mind, the surprising behaviour is actually the run without --nofoo.
If you want baz executed twice, you can pair the channels with the combine operator, e.g. baz_input_ch = bar.out.combine(xy_chan), as sketched below.
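A minimal sketch of that approach (rewriting baz to take a single tuple input is my adaptation, not spelled out in the original answer):

process baz {
    input:
    tuple path(bar_out), val(xy)   // one (path, value) pair per task

    output:
    tuple val(xy), path("baz_${xy}.out")

    script:
    """
    touch baz_${xy}.out
    """
}

workflow {
    // ...
    bar_out = bar(foo_out)
    // combine pairs the single bar output with each value in xy_chan,
    // yielding (bar.out, 'x') and (bar.out, 'y'), so baz runs twice
    baz_input_ch = bar_out.combine(xy_chan)
    baz_out = baz(baz_input_ch)
}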

Related

How to use the Batch block with multiple inputs so that it receives exactly one agent from each input?

I have 8 inputs that I would like to combine into one agent using the Batch block. All the inputs have the same flow rate (1 per minute), and I would like each of them to deliver one and only one agent to the batch, so that the batch is complete once every input has contributed one agent.
I have tried to use a delay and a queue to manually restrict flow, but that has not worked: I got an error saying the flow cannot be restricted, even though the inputs are set so that agents that can't exit are destroyed.
I also looked into trying to use a function but have not come across one that makes sense in my problem. Any help would be appreciated!
In a very primitive way, you can build a model in which each of the 8 inputs feeds its own queue followed by a HOLD block (initially blocked), all of which lead into the batch.
A function then checks that every queue has at least one agent ready and, if so, releases the agents from all of the HOLD blocks:
if (queue.size() > 0 &&
    queue1.size() > 0 &&
    queue2.size() > 0 &&
    queue3.size() > 0 &&
    queue4.size() > 0 &&
    queue5.size() > 0 &&
    queue6.size() > 0 &&
    queue7.size() > 0 )
{
    hold.unblock();
    hold1.unblock();
    hold2.unblock();
    hold3.unblock();
    hold4.unblock();
    hold5.unblock();
    hold6.unblock();
    hold7.unblock();
}
Every time an agent arrives, call this function in the 'On exit' action of your sources.

Why is Python object id different after the Process starts but the pid remains the same?

"""
import time
from multiprocessing import Process, freeze_support

class FileUploadManager(Process):
    """
    WorkerObject which uploads files in background process
    """
    def __init__(self):
        """
        Worker class to upload files in a separate background process.
        """
        super().__init__()
        self.daemon = True
        self.upload_size = 0
        self.upload_queue = set()
        self.pending_uploads = set()
        self.completed_uploads = set()
        self.status_info = {'STOPPED'}
        print(f"Initial ID: {id(self)}")

    def run(self):
        try:
            print("STARTING NEW PROCESS...\n")
            if 'STARTED' in self.status_info:
                print("Upload Manager - Already Running!")
                return True
            self.status_info.add('STARTED')
            print(f"Active Process Info: {self.status_info}, ID: {id(self)}")
            # Upload files
            while True:
                print("File Upload Queue Empty.")
                time.sleep(10)
        except Exception as e:
            print(f"{repr(e)} - Cannot run upload process.")

if __name__ == '__main__':
    upload_manager = FileUploadManager()
    print(f"Object ID: {id(upload_manager)}")
    upload_manager.start()
    print(f"Process Info: {upload_manager.status_info}, ID After: {id(upload_manager)}")
    while 'STARTED' not in upload_manager.status_info:
        print(f"Not Started! Process Info: {upload_manager.status_info}")
        time.sleep(7)
"""
OUTPUT
Initial ID: 2894698869712
Object ID: 2894698869712
Process Info: {'STOPPED'}, ID After: 2894698869712
Not Started! Process Info: {'STOPPED'}
STARTING NEW PROCESS...
Active Process Info: {'STARTED', 'STOPPED'}, ID: 2585771578512
File Upload Queue Empty.
Not Started! Process Info: {'STOPPED'}
File Upload Queue Empty.
Why does the Process object have the same id and attribute values before and after it has started, but a different id when the run method starts?
Initial ID: 2894698869712
Active Process Info: {'STARTED', 'STOPPED'}, ID: 2585771578512
Process Info: {'STOPPED'}, ID After: 2894698869712
I fixed your indentation, and I also removed everything from your script that was not actually being used. It is now a minimal, reproducible example that anyone can run. In the future, please adhere to the site guidelines and proofread your questions. It will save everybody's time and you will get better answers.
I would also like to point out that the question in your title is not at all the same as the question asked in your text. At no point do you retrieve the process ID, which is an operating system value. You are printing out the ID of the object, which is a value that has meaning only within the Python runtime environment.
import time
from multiprocessing import Process
# Removed freeze_support since it was unused

class FileUploadManager(Process):
    """
    WorkerObject which uploads files in background process
    """
    def __init__(self):
        """
        Worker class to upload files in a separate background process.
        """
        super().__init__(daemon=True)
        # The next line probably does not work as intended, so
        # I commented it out. The docs say that the daemon
        # flag must be set by a keyword-only argument
        # self.daemon = True
        # I removed a bunch of unused variables for this test program
        self.status_info = {'STOPPED'}
        print(f"Initial ID: {id(self)}")

    def run(self):
        try:
            print("STARTING NEW PROCESS...\n")
            if 'STARTED' in self.status_info:
                print("Upload Manager - Already Running!")
                return  # Removed True return value (it was unused)
            self.status_info.add('STARTED')
            print(f"Active Process Info: {self.status_info}, ID: {id(self)}")
            # Upload files
            while True:
                print("File Upload Queue Empty.")
                time.sleep(1.0)
        except Exception as e:
            print(f"{repr(e)} - Cannot run upload process.")

if __name__ == '__main__':
    upload_manager = FileUploadManager()
    print(f"Object ID: {id(upload_manager)}")
    upload_manager.start()
    print(f"Process Info: {upload_manager.status_info}",
          f"ID After: {id(upload_manager)}")
    while 'STARTED' not in upload_manager.status_info:
        print(f"Not Started! Process Info: {upload_manager.status_info}")
        time.sleep(0.7)
Your question is, why is the id of upload_manager the same before and after it is started. Simple answer: because it's the same object. It does not become another object just because you called one of its functions. That would not make any sense.
I suppose you might be wondering why the ID of the FileUploadManager object is different when you print it out from its "run" method. It's the same simple answer: because it's a different object. Your script actually creates two instances of FileUploadManager, although it's not obvious. In Python, each Process has its own memory space. When you start a secondary Process (upload_manager.start()), Python makes a second instance of FileUploadManager to execute in this new Process. The two instances are completely separate and "know" nothing about each other.
You did not say that your script doesn't terminate, but it actually does not. It runs forever, stuck in the loop while 'STARTED' not in upload_manager.status_info. That's because 'STARTED' was added to self.status_info in the secondary Process. That Process is working with a different instance of FileUploadManager. The changes you make there do not get automatically reflected in the first instance, which lives in the main Process. Therefore the first instance of FileUploadManager never changes, and the loop never exits.
This all makes perfect sense once you realize that each Process works with its own separate objects. If you need to pass data from one Process to another, that can be done with Pipes, Queues, Managers and shared variables. That is documented in the Concurrent Execution section of the standard library.
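As a minimal sketch (my own illustration, not part of the original answer) of one such mechanism: a multiprocessing.Event created in the parent is shared with the child process, so the worker can signal back that it has started:

import time
from multiprocessing import Process, Event

class FileUploadManager(Process):
    def __init__(self):
        super().__init__(daemon=True)
        # Created in the parent and inherited by the child process, so both
        # sides refer to the same underlying OS-level synchronization object.
        self.started = Event()

    def run(self):
        self.started.set()              # runs in the child: signal the parent
        while True:
            print("File Upload Queue Empty.")
            time.sleep(1.0)

if __name__ == '__main__':
    upload_manager = FileUploadManager()
    upload_manager.start()
    upload_manager.started.wait()       # blocks until the child calls set()
    print("Worker process reports that it has started.")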

How to output a value channel that has paths using Nextflow

Say one has a process that outputs files (e.g., converting SAM to BAM files) and one wants the output of the process to be a value channel so that it can be reused many times. Can one do this when declaring the process output? Or does one have to call an operator (first?) on the queue channel after it has been emitted and the process has completed?
Here is an example:
process sam2bam {
    input:
    path samfile from alignments

    output:
    path '*sorted.bam' into sam2bam //CAN THIS BE A VALUE CHANNEL

    script:
    """
    samtools view -b -o ${samfile}.bam ${samfile}
    """
}
So far, I have output the queue channel and then either duplicated it or tried to convert it to a value channel via first. This seems clunky and I figure there must be some way to directly output a value channel that has paths. I can't seem to find a simple answer in the documentation.
Thanks!
Ultimately this depends on whether the 'alignments' channel is already a value channel or not:
A value channel is implicitly created by a process when an input
specifies a simple value in the from clause. Moreover, a value channel
is also implicitly created as output for a process whose inputs are
only value channels.
Note that this will create a queue channel:
alignments = Channel.fromPath( 'test.sorted.sam' )
And this will create a value channel:
alignments = file( 'test.sorted.sam' )
So if the 'alignments' channel is a value channel, the downstream 'sam2bam' channel will also be a value channel. If the 'alignments' channel is indeed a queue channel, then yes, you'll need to use one of the channel operators that return a single value, such as first, last, collect, count, min, max, reduce or sum. The one you want is almost always collect.
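For example (a small sketch in the question's DSL1 style; the 'bam_files' name and the 'merge_bams' process are hypothetical), collect consumes the queue channel once and returns a value channel emitting a single list, which any number of downstream processes can then read:

// collect() returns a value channel emitting one list with all the BAM paths
bam_files = sam2bam.collect()

process merge_bams {
    input:
    path bams from bam_files   // value channel, so it can feed multiple processes

    """
    echo ${bams}
    """
}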
Note that you may also be able to use the each input repeater to repeat the execution of a process for each item in a collection. Not sure what your downstream code looks like, but something like this might be convenient:
alignments = Channel.fromPath( '*.sorted.sam' )
chromosomes = Channel.of(1..23, 'X', 'Y')

process sam2bam {
    input:
    path samfile from alignments

    output:
    path "${samfile.baseName}.bam" into sam2bam

    """
    touch "${samfile.baseName}.bam"
    """
}

process do_something {
    echo true

    input:
    path bam from sam2bam
    each chrom from chromosomes

    """
    echo "${chrom}: ${bam}"
    """
}

How to run multiple process, sequentially in gem5 se mode?

I followed the gem5 tutorial to build a test config. It executes hello world in SE mode. But now I want to run multiple processes, one by one. How can I do that? So far, I have tried this:
processes = []
processes.append([bzip2_benchmark, bzip2_input])
processes.append([mcf_benchmark, mcf_input])
processes.append([hmmer_benchmark, '--fixed=0', '--mean=325', '--num=45000', '--sd=200', '--seed=0', hmmer_input])
processes.append([sjeng_benchmark, sjeng_input])
processes.append([lbm_benchmark, 20, 'reference.dat', 0, 1, benchmark_dir+'470.lbm/data/100_100_130_cf_a.of'])
for p in processes:
    process = Process()
    process.cmd = p
    system.cpu.workload = process
    system.cpu.createThreads()

    root = Root(full_system=False, system=system)
    m5.instantiate()

    print("Beginning simulation!")
    exit_event = m5.simulate()
    print('Exiting @ tick {} because {}'
          .format(m5.curTick(), exit_event.getCause()))
Assume that all imports are correct and that the system is instantiated correctly. The above code gives "fatal: Attempt to allocate multiple instances of Root." after running the first process. I understand why this happens, but I want to know how to run these benchmark programs one by one.

Why don't all the shell processes in my promises (start blocks) run? (Is this a bug?)

I want to run multiple shell processes, but when I try to run more than 63, they hang. When I reduce max_threads in the thread pool to n, it hangs after running the nth shell command.
As you can see in the code below, the problem is not in start blocks per se, but in start blocks that contain the shell command:
#!/bin/env perl6
my $*SCHEDULER = ThreadPoolScheduler.new( max_threads => 2 );
my @processes;

# The Promises generated by this loop work as expected when awaited
for @*ARGS -> $item {
    @processes.append(
        start { say "Planning on processing $item" }
    );
}

# The nth Promise generated by the following loop hangs when awaited (where n = max_threads)
for @*ARGS -> $item {
    @processes.append(
        start { shell "echo 'processing $item'" }
    );
}

await(@processes);
Running ./process_items foo bar baz gives the following output, hanging after processing bar, which is just after the nth (here 2nd) thread has run using shell:
Planning on processing foo
Planning on processing bar
Planning on processing baz
processing foo
processing bar
What am I doing wrong? Or is this a bug?
Perl 6 distributions tested on CentOS 7:
Rakudo Star 2018.06
Rakudo Star 2018.10
Rakudo Star 2019.03-RC2
Rakudo Star 2019.03
With Rakudo Star 2019.03-RC2, use v6.c versus use v6.d did not make any difference.
The shell and run subs use Proc, which is implemented in terms of Proc::Async. This uses the thread pool internally. By filling up the pool with blocking calls to shell, the thread pool becomes exhausted, and so cannot process events, resulting in the hang.
It would be far better to use Proc::Async directly for this task. The approach with using shell and a load of real threads won't scale well; every OS thread has memory overhead, GC overhead, and so forth. Since spawning a bunch of child processes is not CPU-bound, this is rather wasteful; in reality, just one or two real threads are needed. So, in this case, perhaps the implementation pushing back on you when doing something inefficient isn't the worst thing.
I notice that one of the reasons for using shell and the thread pool is to try and limit the number of concurrent processes. But this isn't a very reliable way to do it; just because the current thread pool implementation sets a default maximum of 64 threads does not mean it always will do so.
Here's an example of a parallel test runner that runs up to 4 processes at once, collects their output, and envelopes it. It's a little more than you perhaps need, but it nicely illustrates the shape of the overall solution:
my $degree = 4;
my @tests = dir('t').grep(/\.t$/);
react {
    sub run-one {
        my $test = @tests.shift // return;
        my $proc = Proc::Async.new('perl6', '-Ilib', $test);
        my @output = "FILE: $test";
        whenever $proc.stdout.lines {
            push @output, "OUT: $_";
        }
        whenever $proc.stderr.lines {
            push @output, "ERR: $_";
        }
        my $finished = $proc.start;
        whenever $finished {
            push @output, "EXIT: {.exitcode}";
            say @output.join("\n");
            run-one();
        }
    }
    run-one for 1..$degree;
}
The key thing here is the call to run-one when a process ends, which means that you always replace an exited process with a new one, maintaining - so long as there are things to do - up to 4 processes running at a time. The react block naturally ends when all processes have completed, due to the fact that the number of events subscribed to drops to zero.