How to output a value channel that has paths using Nextflow

Say one has a process that outputs files (e.g., converting SAM to BAM files) and one wants the output of that process to be a value channel so that it can be reused many times. Can one do this directly in the process's output declaration? Or does one have to call an operator (first?) on the queue channel after it has been emitted and the process has completed?
Here is an example:
process sam2bam {
    input:
    path samfile from alignments

    output:
    path '*sorted.bam' into sam2bam // CAN THIS BE A VALUE CHANNEL?

    script:
    """
    samtools view -b -o ${samfile.baseName}.bam ${samfile}
    """
}
So far, I have output the queue channel and then either duplicated it or tried to convert it to a value channel via first. This seems clunky and I figure there must be some way to directly output a value channel that has paths. I can't seem to find a simple answer in the documentation.
Thanks!

Ultimately this depends on whether the 'alignments' channel is already a value channel or not:
A value channel is implicitly created by a process when an input
specifies a simple value in the from clause. Moreover, a value channel
is also implicitly created as output for a process whose inputs are
only value channels.
Note that this will create a queue channel:
alignments = Channel.fromPath( 'test.sorted.sam' )
And this will create a value channel:
alignments = file( 'test.sorted.sam' )
So if the 'alignments' channel is a value channel, the downstream 'sam2bam' channel will also be a value channel. If the 'alignments' channel is indeed a queue channel, then yes, you'll need to use one of the channel operators that return a single value, such as first, last, collect, count, min, max, reduce or sum. The one you want is almost always collect.
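For example, keeping the question's process as-is, a minimal DSL1-style sketch of that conversion (the bam_files name is just illustrative):
// 'sam2bam' is the queue channel declared with `into sam2bam` above
sam2bam
    .collect()
    .set { bam_files }
After collect, bam_files behaves like a value channel holding the list of BAM paths, so it can be consumed by any number of downstream processes without being exhausted.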
Note that you may also be able to use the each input repeater to repeat the execution of a process for each item in a collection. Not sure what your downstream code looks like, but something like this might be convenient:
alignments = Channel.fromPath( '*.sorted.sam' )
chromosomes = Channel.of(1..23, 'X', 'Y')
process sam2bam {
    input:
    path samfile from alignments

    output:
    path "${samfile.baseName}.bam" into sam2bam

    """
    touch "${samfile.baseName}.bam"
    """
}

process do_something {
    echo true

    input:
    path bam from sam2bam
    each chrom from chromosomes

    """
    echo "${chrom}: ${bam}"
    """
}

Why is a conditional channel source causing a downstream process to not execute an instance for each value in a different channel?

I have a Nextflow DSL2 pipeline where an early process generally takes a very long time (~24 hours) and has intermediate products that occupy a lot of storage (~1 TB). Because of the length and resources required for this process, it would be desirable to be able to set a "checkpoint", i.e. save the (relatively small) final output to a safe location, and on subsequent pipeline executions retrieve the output from that location. This means that the intermediate data can be safely deleted without preventing resumption of the pipeline later.
However, I've found that when I implement this and use the checkpoint, a process further downstream that is supposed to run an instance for every value in a list only runs a single instance. Minimal working example and example outputs below:
// foobarbaz.nf
nextflow.enable.dsl=2

params.publish_dir = "$baseDir/output"
params.nofoo = false

xy = ['x', 'y']
xy_chan = Channel.fromList(xy)

process foo {
    publishDir "${params.publish_dir}/", mode: "copy"

    output:
    path "foo.out"

    """
    touch foo.out
    """
}

process bar {
    input:
    path foo_out

    output:
    path "bar.out"

    script:
    """
    touch bar.out
    """
}

process baz {
    input:
    path bar_out
    val xy

    output:
    tuple val(xy), path("baz_${xy}.out")

    script:
    """
    touch baz_${xy}.out
    """
}

workflow {
    main:
    if( params.nofoo ) {
        foo_out = Channel.fromPath("${params.publish_dir}/foo.out")
    }
    else {
        foo_out = foo() // generally takes a long time and uses lots of storage
    }
    bar_out = bar(foo_out)
    baz_out = baz(bar_out, xy_chan)
    // ... continue to do things with baz_out ...
}
First execution with foo:
$ nextflow foobarbaz.nf
N E X T F L O W ~ version 21.10.6
Launching `foobarbaz.nf` [soggy_gautier] - revision: f4e70a5cd2
executor > local (4)
[77/c65a9a] process > foo [100%] 1 of 1 ✔
[23/846929] process > bar [100%] 1 of 1 ✔
[18/1c4bb1] process > baz (2) [100%] 2 of 2 ✔
(note that baz successfully executes two instances: one where xy==x and one where xy==y)
Later execution using the checkpoint:
$ nextflow foobarbaz.nf --nofoo
N E X T F L O W ~ version 21.10.6
Launching `foobarbaz.nf` [infallible_babbage] - revision: f4e70a5cd2
executor > local (2)
[40/b42ed3] process > bar (1) [100%] 1 of 1 ✔
[d9/76888e] process > baz (1) [100%] 1 of 1 ✔
The checkpointing is successful (bar executes without needing foo), but now baz only executes a single instance where xy==x.
Why is this happening, and how can I get the intended behaviour? I see no reason why whether foo_out comes from foo or is retrieved directly from a file should make any difference to how the xy channel is interpreted by baz.
The problem is that the Channel.fromPath factory method creates a queue channel to provide a single value, whereas the output of process 'foo' implicitly produces a value channel:
A value channel is implicitly created by a process when an input
specifies a simple value in the from clause. Moreover, a value channel
is also implicitly created as output for a process whose inputs are
only value channels.
So without --nofoo, 'foo_out' and 'bar_out' are both value channels. Since 'xy_chan' is a queue channel that provides two values, process 'baz' gets executed twice. With --nofoo, 'foo_out' and 'bar_out' are both queue channels which provide a single value. Since there is then only one complete input configuration (i.e. one value from each input channel), process 'baz' gets executed only once. See also: Understand how multiple input channels work.
The solution is to ensure that 'foo_out' is either always a queue channel or always a value channel. Given your 'foo' process declaration, you probably want the latter:
if( params.nofoo ) {
    foo_out = file( "${params.publish_dir}/foo.out" )
}
else {
    foo_out = foo()
}
In my experience, a process is executed according to the input channel with the lowest number of emissions (which, in your case, is the single path emitted by bar).
So to my mind, the strange behaviour here is actually the example without --nofoo.
If you want baz executed two times, you may try combining the channels with the combine operator, something like baz_input_ch = bar.out.combine(xy_chan), as sketched below.
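A rough sketch of that approach, assuming baz is rewritten to take a single tuple input (the name baz_input_ch is illustrative):
process baz {
    input:
    tuple path(bar_out), val(xy)

    output:
    tuple val(xy), path("baz_${xy}.out")

    script:
    """
    touch baz_${xy}.out
    """
}

workflow {
    // ... foo/bar called as before ...
    baz_input_ch = bar.out.combine(xy_chan) // one tuple per (bar_out, xy) pair
    baz_out = baz(baz_input_ch)
}
Because combine produces the full cross product, baz then runs once per pair regardless of whether bar.out is a queue channel or a value channel.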

What is the best way to communicate among multiple processes in Ubuntu

I have three different machine learning models in Python. To improve performance, I run them in parallel in different terminals. They communicate and share data with one another through files: each model creates batches of files to make available to the others. All the processes run in parallel but each depends on data prepared by another process. Once process A prepares a batch of data, it creates a file to signal to the other process that the data is ready; process B then starts processing it while simultaneously watching for the next batch. How can this large amount of data be shared with the next process without creating files? Is there a better way to communicate among these processes without creating/deleting temporary files in Python?
Thanks
You could consider running up a small Redis instance... a very fast, in-memory data structure server.
It allows you to share strings, lists, queues, hashes, atomic integers, sets, ordered sets between processes very simply.
As it is networked, you can share all these data structures not only within a single machine, but across multiple machines.
As it has bindings for C/C++, Python, bash, Ruby, Perl and so on, it also means you can use the shell, for example, to quickly inject commands/data into your app to change its behaviour, or get debugging insight by looking at how variables are set.
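As a rough illustration of the idea (assuming the redis-py client is installed and a Redis server is running locally; the list name and payload below are made up, and the producer and consumer would normally live in separate scripts):
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Producer side: push a finished batch onto a shared list that acts as a queue
batch = {"batch_id": 1, "rows": [[0.1, 0.2], [0.3, 0.4]]}
r.rpush("ready_batches", json.dumps(batch))

# Consumer side: block until a batch is available, then process it
_key, payload = r.blpop("ready_batches")
received = json.loads(payload)
print("got batch", received["batch_id"], "with", len(received["rows"]), "rows")
The blocking pop replaces the signal-file pattern: the consumer simply waits on the list instead of polling the filesystem, and the data itself never touches disk.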
Here's an example of how to do multiprocessing in Python 3. Instead of being stored in a file, the results are stored in a dictionary (see the output below).
from multiprocessing import Pool, cpu_count

def multi_processor(function_name):
    file_list = []
    # Test: put 6 strings in the list so your_function should run six times
    # in parallel (assuming your CPU has enough cores)
    file_list.append("test1")
    file_list.append("test2")
    file_list.append("test3")
    file_list.append("test4")
    file_list.append("test5")
    file_list.append("test6")
    # Use max number of system processors - 1
    pool = Pool(processes=cpu_count() - 1)
    results = {}
    # For every item in the file_list, start a new process
    for aud_file in file_list:
        results[aud_file] = pool.apply_async(your_function, args=("arg1", "arg2"))
    # Wait for all processes to finish before proceeding
    pool.close()
    pool.join()
    # Results and any errors are returned
    return {file_name: result.get() for file_name, result in results.items()}

def your_function(arg1, arg2):
    try:
        print("put your stuff in this function")
        your_results = ""
        return your_results
    except Exception as e:
        return str(e)

if __name__ == "__main__":
    some_results = multi_processor("your_function")
    print(some_results)
The output is
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
put your stuff in this function
{'test1': '', 'test2': '', 'test3': '', 'test4': '', 'test5': '', 'test6': ''}
Try using a SQLite database to share the data between processes.
I made this for this exact purpose:
https://pypi.org/project/keyvalue-sqlite/
You can use it like this:
from keyvalue_sqlite import KeyValueSqlite
DB_PATH = '/path/to/db.sqlite'
db = KeyValueSqlite(DB_PATH, 'table-name')
# Now use standard dictionary operators
db.set_default('0', '1')
actual_value = db.get('0')
assert '1' == actual_value
db.set_default('0', '2')
assert '1' == db.get('0')

Generating a parameterized number of output files for a Snakemake rule

My workflow needs to be executed on two different clusters. The first cluster schedules jobs to nodes based on resource availability. The second cluster reserves entire nodes for a given job and asks its users to use those multiple cores efficiently within their job script. For the second cluster, it is accepted practice to submit a smaller number of jobs and stack processes in the background.
For a toy example, say I have four files I would like to create:
SAMPLES = [1, 2, 3, 4]

rule all:
    input:
        expand("sample.{sample}", sample=SAMPLES)

rule normal_create_files:
    output:
        "sample.{sample}"
    shell:
        "touch {output}"
This can be run in parallel with one job per sample.
In addition to four jobs creating a single file each, I would like to be able to have two jobs creating two files each.
I've tried a few ideas, but have not gotten very far. The following workflow does the same as above, except it creates batches and launches the jobs as background processes within each batch:
rule all:
    input:
        expand("sample.{sample}", sample=SAMPLES)

rule stacked_create_files:
    output:
        "sample.{sample}"
    run:
        import subprocess as sp

        def chunks(l, n):
            for i in range(0, len(l), n):
                yield l[i:i + n]

        pids = []
        for chunk in chunks(list(output), 2):
            for sample in chunk:
                pids.append(sp.Popen(["touch", sample]))
        exit_codes = [p.wait() for p in pids]
However, this still creates four jobs!
I also came across Karel Brinda's response on the mailing list on a related topic. He pointed to his own project where he does dynamic rule creation in Python. I will try something along these lines next.
The ideal solution would be a single rule that generates a set of output files, but is able to generate those files in batches. The number of batches would be set by a configuration parameter.
Has anyone here encountered a similar situation? Any thoughts, or ideas would be greatly appreciated!
I think the true solution to your problem will be the ability to group Snakemake jobs together. This feature is currently in the planning phase (in fact I have a research grant about this).
Indeed, currently the only solution is to somehow encode this into the rules themselves (e.g. via code generation); one rough way of doing so is sketched below.
In the future, you will be able to specify how the DAG of jobs shall be partitioned/grouped. Each of the resulting groups of jobs is submitted to the cluster as one batch.
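For illustration, here is a sketch of encoding the batching directly into the Snakefile; the batches config value and the per-batch marker files are assumptions of this sketch, not part of the original example, and the individual sample files are no longer tracked as outputs by Snakemake:
SAMPLES = [1, 2, 3, 4]
N_BATCHES = int(config.get("batches", 2))

# Assign samples to batches round-robin, e.g. {0: [1, 3], 1: [2, 4]} for two batches
BATCHES = {b: SAMPLES[b::N_BATCHES] for b in range(N_BATCHES)}

rule all:
    input:
        expand("batch_{b}.done", b=range(N_BATCHES))

rule batched_create_files:
    output:
        touch("batch_{b}.done")
    params:
        files=lambda wc: [f"sample.{s}" for s in BATCHES[int(wc.b)]]
    shell:
        "touch {params.files}"
Running snakemake --config batches=2 then submits two jobs that each create two sample files, while batches=4 reproduces the one-file-per-job behaviour.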

Ryu Controller Drop Packet

How do I send a flow entry to drop a packet using Ryu? I've learned from tutorials how to send a packet-out message:
I define the action:
actions = [ofp_parser.OFPActionOutput(ofp.OFPP_FLOOD)]
Then the message itself:
out = ofp_parser.OFPPacketOut(datapath=dp, buffer_id=msg.buffer_id, in_port=msg.in_port,actions=actions)
Send the message to the switch:
dp.send_msg(out)
I'm trying to find documentation on how to make this code drop the packet instead of flooding, without success. I imagine I'll have to change the actions in the first step and ofp_parser.OFPPacketOut in the second step. I need someone more experienced with Ryu and OpenFlow development to point me in the right direction. Thank you.
The default disposition of a packet in OpenFlow is to drop it. Therefore, if you want a flow rule that drops the packet when it matches, simply give it a CLEAR_ACTIONS instruction and no other instructions: with no actions and no goto-table instruction, no further tables are processed and the packet is dropped.
Keep your flow priorities in mind. If more than one flow rule matches the packet, the one with the highest priority takes effect, so your "drop packet" rule could be hidden behind a higher-priority rule.
Here is some code that I have that will drop all traffic that matches a given EtherType, assuming that no higher priority packet matches. The function is dependent on a couple of instance variables, namely datapath, proto, and parser.
def dropEthType(self, match_eth_type = 0x0800):
    parser = self.parser
    proto = self.proto
    match = parser.OFPMatch(eth_type = match_eth_type)
    instruction = [
        parser.OFPInstructionActions(proto.OFPIT_CLEAR_ACTIONS, [])
    ]
    msg = parser.OFPFlowMod(self.datapath,
                            table_id = OFDPA_FLOW_TABLE_ID_ACL_POLICY,
                            priority = 1,
                            command = proto.OFPFC_ADD,
                            match = match,
                            instructions = instruction
                            )
    self._log("dropEthType : %s" % str(msg))
    reply = api.send_msg(self.ryuapp, msg)
    if reply:
        raise Exception

Is it possible to send a message to an unregistered processes in Erlang?

I am aware that you can perform simple message passing with the following:
self() ! hello.
and you can see the message by calling:
flush().
I can also create simple processes in functions with something like:
spawn(module, function, args).
However, I am not clear on how one can send messages to these processes without registering the Pid.
I have seen examples showing that you can pattern match against this in the shell to get the Pid assigned to a variable, so if I create a gen_server such as:
...
start_link() ->
    gen_server:start_link(?MODULE, init, []).

init(Pid) ->
    {ok, Pid}.
...
I can then call it with the following from the shell:
{ok, Pid} = test_sup:start_link().
{ok,<0.143.0>}
> Pid ! test.
test
So my question is: can you send messages to Pids of the form <0.0.0> without registering them to an atom or assigning them to a variable in the shell? Experimenting and searching have proved fruitless...
If you happen to need to send a message to a Pid based on the textual representation of its Pid, you can do (assuming the string is "<0.42.0>"):
list_to_pid("<0.42.0>") ! Message
This is almost only useful in the shell (where you can see the output of log messages or monitor data from something like Observer); any spawned process should normally be a child of some form of parent process to which it is linked (or monitored).
As for sending a message to something you just spawned, spawn returns a Pid, so you can assign it directly to a variable (which is not the same as registering it):
Pid = spawn(M, F, A),
Pid ! Message.
If you have a string such as "<0.42.0>" to identify a pid, it is either because you are working in the shell, using the representation you see, and forgot to store the pid in a variable (in that case, simply use pid(X,Y,Z) to get it back, as shown below), or because you did something like io_lib:format("~p", [Val]) where Val is the pid or an Erlang term containing it (in that case, simply assign the pid to a variable, directly or by extracting it from the term; it can then be stored in an ETS table or sent to another process without any transformation).
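For instance, in the shell (the pid numbers here are illustrative; use the ones shown for your own process):
1> Pid = pid(0, 143, 0).
<0.143.0>
2> Pid ! hello.
hello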
You should avoid relying on the shell (or string) representation. One reason is that this representation differs when you ask for the pid of the same process from two different nodes.