How can I implement coroutines for a parallel task in Kotlin?

So, I have this piece of code:
for (z in 0 until texture.extent.z) {
    println(z)
    for (y in 0 until texture.extent.y)
        for (x in 0 until texture.extent.x) {
            val v = Vec3(x, y, z) / texture.extent
            var n = when {
                FRACTAL -> FractalNoise().noise(v * noiseScale)
                else -> 20f * glm.perlin(v)
            }
            n -= glm.floor(n)
            data[x + y * texture.extent.x + z * texture.extent.x * texture.extent.y] = glm.floor(n * 255).b
        }
}
That takes over four minutes on the JVM. The original C++ sample uses OpenMP to accelerate the calculation.
I've heard about coroutines and hoped I could take advantage of them in this case.
I first tried wrapping the whole loop nest in runBlocking, because I want all the coroutines to have finished before I move on.
runBlocking {
    for (z in 0 until texture.extent.z) {
        println(z)
        for (y in 0 until texture.extent.y)
            for (x in 0 until texture.extent.x) {
                launch {
                    val v = Vec3(x, y, z) / texture.extent
                    var n = when {
                        FRACTAL -> FractalNoise().noise(v * noiseScale)
                        else -> 20f * glm.perlin(v)
                    }
                    n -= glm.floor(n)
                    data[x + y * texture.extent.x + z * texture.extent.x * texture.extent.y] = glm.floor(n * 255).b
                }
            }
    }
}
But this throws various thread errors and finally crashes the JVM:
[thread 27624 also had an error]
[thread 23784 also had an error]
[thread 14004 also had an error]
[thread 32652 also had an error]
[thread 32616 also had an error]
[thread 21264 also had an error]
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x0000000002d2fd50, pid=23452, tid=0x0000000000007b68
#
# JRE version: Java(TM) SE Runtime Environment (8.0_144-b01) (build 1.8.0_144-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.144-b01 mixed mode windows-amd64 compressed oops)
# Problematic frame:
# J 1431 C2 java.util.concurrent.ForkJoinPool$WorkQueue.runTask(Ljava/util/concurrent/ForkJoinTask;)V (86 bytes) @ 0x0000000002d2fd50 [0x0000000002d2f100+0xc50]
#
# Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
#
# An error report file with more information is saved as:
# C:\Users\gBarbieri\IdeaProjects\Vulkan\hs_err_pid23452.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
Process finished with exit code 1
I also tried collecting all the jobs into an ArrayList and join()ing them at the end, but without success.
Can coroutines be used for a parallel task like this one?
If so, what am I doing wrong?

Instead of coroutines you should consider the parallel computation engine built into the JDK: java.util.stream. What you have here is an embarrassingly parallelizable task, a perfect use case for it.
I'd use something along these lines:
IntStream.range(0, extent.x)
    .boxed()
    .parallel()
    .flatMap { x ->
        IntStream.range(0, extent.y).boxed().flatMap { y ->
            IntStream.range(0, extent.z).mapToObj { z ->
                Vec(x, y, z)
            }
        }
    }
    .forEach { vec ->
        data[vecToArrayIndex(vec)] = computeValue(vec)
    }
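That said, coroutines can also handle CPU-bound work like this; the usual pattern is to keep the parallelism coarse-grained (one job per z-slice rather than one per voxel, which floods the scheduler with millions of tiny tasks) and to make sure everything touched inside the jobs (data, glm, FractalNoise) is safe to use from multiple threads. A rough sketch with a current kotlinx.coroutines release, reusing the names from the question:
import kotlinx.coroutines.*

runBlocking {
    (0 until texture.extent.z).map { z ->
        // one coroutine per z-slice, scheduled on the CPU-bound Default pool
        launch(Dispatchers.Default) {
            for (y in 0 until texture.extent.y)
                for (x in 0 until texture.extent.x) {
                    val v = Vec3(x, y, z) / texture.extent
                    var n = when {
                        FRACTAL -> FractalNoise().noise(v * noiseScale)
                        else -> 20f * glm.perlin(v)
                    }
                    n -= glm.floor(n)
                    data[x + y * texture.extent.x + z * texture.extent.x * texture.extent.y] =
                        glm.floor(n * 255).b
                }
        }
    }.joinAll() // wait for every slice before moving on
}
One job per slice keeps the number of coroutines small and makes each job large enough to amortize the scheduling overhead.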

Related

How do I process files with a matching pattern in Nextflow?

Suppose I have these Nextflow channels:
Channel.fromFilePairs( "test/read*_R{1,2}.fa" )
    .set { reads }
reads.view()

Channel.fromPath( ['test/lib_R1.fa', 'test/lib_R2.fa'] )
    .set { libs }
libs.view()
Which results in:
// reads channel
[read_b, [<path>/test/read_b_R1.fa, <path>/test/read_b_R2.fa]]
[read_a, [<path>/test/read_a_R1.fa, <path>/test/read_a_R2.fa]]
// libs channel
<path>/test/lib_R1.fa
<path>/test/lib_R2.fa
How do I run a process foo on each matching read-lib pair, where the same lib is used for all read pairs? So basically I want to execute foo 4 times:
foo(test/read_b_R1.fa, test/lib_R1.fa)
foo(test/read_b_R2.fa, test/lib_R2.fa)
foo(test/read_a_R1.fa, test/lib_R1.fa)
foo(test/read_a_R2.fa, test/lib_R2.fa)
If you want to use the same library for all read pairs, what you really want is a value channel which can be read an unlimited number of times without being consumed. Note that a value channel is implicitly created by a process when it's invoked with a simple value. This could indeed be a list of files, but it looks like what you want is just one of these to correspond to each of the R1 or R2 reads. I think the simplest solution here is to just include your process using an alias so that you can pass in the required channels/files without too much effort:
params.reads = 'test/read*_R{1,2}.fa'

include { foo as foo_r1 } from './modules/foo.nf'
include { foo as foo_r2 } from './modules/foo.nf'

workflow {
    Channel
        .fromFilePairs( params.reads )
        .multiMap { sample, reads ->
            def (r1, r2) = reads
            read1:
            tuple(sample, r1)
            read2:
            tuple(sample, r2)
        }
        .set { reads }

    lib_r1 = file('test/lib_R1.fa')
    lib_r2 = file('test/lib_R2.fa')

    foo_r1(reads.read1, lib_r1)
    foo_r2(reads.read2, lib_r2)
}
Contents of ./modules/foo.nf:
process foo {
    debug true

    input:
    tuple val(sample), path(fasta)
    path(lib)

    """
    echo $sample, $fasta, $lib
    """
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.10.0
Launching `main.nf` [confident_boyd] DSL2 - revision: 8c81e2d743
executor > local (6)
[a8/e8a752] process > foo_r1 (2) [100%] 3 of 3 ✔
[75/2b32f5] process > foo_r2 (3) [100%] 3 of 3 ✔
readC, readC_R2.fa, lib_R2.fa
readA, readA_R1.fa, lib_R1.fa
readC, readC_R1.fa, lib_R1.fa
readB, readB_R2.fa, lib_R2.fa
readA, readA_R2.fa, lib_R2.fa
readB, readB_R1.fa, lib_R1.fa
Another way to do this is to key each read file and each library file by its R1/R2 suffix, group and join the two channels, and feed the joined tuples to the process:
process FOO {
    debug true

    input:
    tuple val(files), path(lib)

    output:
    stdout

    script:
    file_a = files[0]
    file_b = files[1]
    """
    echo $file_a with $lib
    echo $file_b with $lib
    """
}
workflow {
    Channel
        .of(['read_b', [file('/test/read_b_R1.fa'), file('/test/read_b_R2.fa')]],
            ['read_a', [file('/test/read_a_R1.fa'), file('/test/read_a_R2.fa')]]
        )
        .set { reads }

    Channel
        .of(file('/test/lib_R1.fa'),
            file('/test/lib_R2.fa')
        )
        .set { libs }

    reads
        .map { sample, files -> files }
        .flatten()
        .map { file -> [file.name.split('_')[2].split('.fa')[0], file] }
        .groupTuple()
        .set { reads }

    libs
        .map { file -> [file.name.split('_')[1].split('.fa')[0], file] }
        .set { libs }

    reads
        .join(libs)
        .map { Rx, path, lib -> [path, lib] }
        | FOO
}
The output of the script above is:
N E X T F L O W ~ version 22.10.4
Launching `ex.nf` [elegant_wiles] DSL2 - revision: 00862286fd
executor > local (2)
[58/9b3cf1] process > FOO (2) [100%] 2 of 2 ✔
/test/read_b_R1.fa with lib_R1.fa
/test/read_a_R1.fa with lib_R1.fa
/test/read_b_R2.fa with lib_R2.fa
/test/read_a_R2.fa with lib_R2.fa
EDIT, in reply to a comment:
If you want the process to run once per element in the channel, check the modified version below:
process FOO {
    debug true

    input:
    tuple val(file), path(lib)

    output:
    stdout

    script:
    """
    echo $file with $lib
    """
}

workflow {
    Channel
        .of(['read_b', [file('/test/read_b_R1.fa'), file('/test/read_b_R2.fa')]],
            ['read_a', [file('/test/read_a_R1.fa'), file('/test/read_a_R2.fa')]]
        )
        .set { reads }

    Channel
        .of(file('/test/lib_R1.fa'),
            file('/test/lib_R2.fa')
        )
        .set { libs }

    reads
        .map { sample, files -> files }
        .flatten()
        .map { file -> [file.name.split('_')[2].split('.fa')[0], file] }
        .groupTuple()
        .set { reads }

    libs
        .map { file -> [file.name.split('_')[1].split('.fa')[0], file] }
        .set { libs }

    reads
        .join(libs)
        .map { Rx, path, lib -> [path, lib] }
        .map { x, y -> [[x[0], y], [x[1], y]] }
        .flatMap()
        | FOO
}
Output:
N E X T F L O W ~ version 22.10.4
Launching `ex.nf` [sharp_ekeblad] DSL2 - revision: 1412af632e
executor > local (4)
[a0/416f59] process > FOO (1) [100%] 4 of 4 ✔
/test/read_b_R2.fa with lib_R2.fa
/test/read_a_R2.fa with lib_R2.fa
/test/read_a_R1.fa with lib_R1.fa
/test/read_b_R1.fa with lib_R1.fa

RisingEdge example doesn't work for module input signal in Chisel3

The Chisel documentation gives an example of a rising-edge detection method, defined as follows:
def risingedge(x: Bool) = x && !RegNext(x)
All the example code is available in my GitHub project blp.
If I use it on an Input signal declared as follows:
class RisingEdge extends Module {
  val io = IO(new Bundle{
    val sclk  = Input(Bool())
    val redge = Output(Bool())
    val fedge = Output(Bool())
  })

  // seems to not work with icarus + cocotb
  def risingedge(x: Bool) = x && !RegNext(x)
  def fallingedge(x: Bool) = !x && RegNext(x)

  // works with icarus + cocotb
  //def risingedge(x: Bool) = x && !RegNext(RegNext(x))
  //def fallingedge(x: Bool) = !x && RegNext(RegNext(x))

  io.redge := risingedge(io.sclk)
  io.fedge := fallingedge(io.sclk)
}
With this Icarus/cocotb testbench:
class RisingEdge(object):
    def __init__(self, dut, clock):
        self._dut = dut
        self._clock_thread = cocotb.fork(clock.start())

    @cocotb.coroutine
    def reset(self):
        short_per = Timer(100, units="ns")
        self._dut.reset <= 1
        self._dut.io_sclk <= 0
        yield short_per
        self._dut.reset <= 0
        yield short_per

@cocotb.test()
def test_rising_edge(dut):
    dut._log.info("Launching RisingEdge test")
    redge = RisingEdge(dut, Clock(dut.clock, 1, "ns"))
    yield redge.reset()
    cwait = Timer(10, "ns")
    for i in range(100):
        dut.io_sclk <= 1
        yield cwait
        dut.io_sclk <= 0
        yield cwait
I never get rising pulses on io.redge and io.fedge. To get the pulses I have to change the definition of risingedge as follows:
def risingedge(x: Bool) = x && !RegNext(RegNext(x))
With dual RegNext(): [waveform image]
With a single RegNext(): [waveform image]
Is this normal behavior?
[Edit: I updated the source example to match the GitHub example linked above]
I'm not sure about Icarus, but running a test like this with the default Treadle simulator:
class RisingEdgeTest extends FreeSpec {
  "debug should toggle" in {
    iotesters.Driver.execute(Array("-tiwv"), () => new SlaveSpi) { c =>
      new PeekPokeTester(c) {
        for (i <- 0 until 10) {
          poke(c.io.csn, i % 2)
          println(s"debug is ${peek(c.io.debug)}")
          step(1)
        }
      }
    }
  }
}
I see the output
[info] [0.002] debug is 0
[info] [0.002] debug is 1
[info] [0.002] debug is 0
[info] [0.003] debug is 1
[info] [0.003] debug is 0
[info] [0.003] debug is 1
[info] [0.004] debug is 0
[info] [0.004] debug is 1
[info] [0.005] debug is 0
[info] [0.005] debug is 1
And the waveform looks like this: [waveform image]
Can you explain what you think this should look like?
Do not change the module input value on the rising edge of the clock.
OK, I found my bug. In the cocotb testbench I toggled the input values on the same clock edge that the design is clocked on. If you do that, the input changes right inside the register's setup window, so the behavior is undefined!
So the problem was a cocotb testbench bug and not a Chisel bug. To fix it, we just have to toggle the values on the other clock edge, like this:
@cocotb.test()
def test_rising_edge(dut):
    dut._log.info("Launching RisingEdge test")
    redge = RisingEdge(dut, Clock(dut.clock, 1, "ns"))
    yield redge.reset()
    cwait = Timer(4, "ns")
    yield FallingEdge(dut.clock)  # <--- 'synchronize' on the falling edge
    for i in range(5):
        dut.io_sclk <= 1
        yield cwait
        dut.io_sclk <= 0
        yield cwait

Is there an 'or' notation in qmake?

I am using win32, macx and unix:!macx (i.e. Linux) scopes in my .pro file to specify OS-specific settings, e.g.:
win32 {
    TARGET = myapp
    RC_FILE = myapp.rc
}

macx {
    TARGET = MyApp
    ICON = myapp.icns
    QMAKE_INFO_PLIST = Info.plist
}

unix:!macx { # linux
    CONFIG(debug, debug|release) {
        TARGET = myapp-debug
    }
    CONFIG(release, debug|release) {
        TARGET = myapp
    }
}
This works fine for if/else, if/else-if/else, and if-not constructs where the condition is an OS specifier.
Is there a way to tell qmake that it must compile a block for OS 1 or OS 2?
You can use the | operator for a logical or. For example:
win32|macx {
    HEADERS += debugging.h
}
http://doc.qt.io/qt-4.8/qmake-advanced-usage.html
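A combined scope can also take an else branch, just like the single scopes in the question. A hypothetical sketch (the DEFINES values are placeholders, only there to show the shape):
win32|macx {
    # settings shared by the Windows and macOS builds (placeholder value)
    DEFINES += DESKTOP_UI
} else {
    # everything else, i.e. Linux in this project (placeholder value)
    DEFINES += FALLBACK_UI
}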

CGAL: Is 2D poly partitioning supported with epeck kernel?

I'd like to use CGAL convex partitioning in an application that is based on the Epeck kernel, but trying to compile it throws the following error:
error:
no matching constructor for initialization of 'CGAL::Partition_vertex<CGAL::Partition_traits_2<CGAL::Epeck> >'
A simple test case for this is to take, for example, the greene_approx_convex_partition_2.cpp example from the distribution and try to change the kernel parameterization to epeck.
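For illustration, the kind of minimal case I mean is roughly the following sketch (a trimmed-down variant of that example with the kernel typedef swapped to Epeck, not the verbatim file); this is what fails to compile:
// Sketch: the partition example with Epeck swapped in for the inexact kernel.
#include <CGAL/Exact_predicates_exact_constructions_kernel.h>
#include <CGAL/Partition_traits_2.h>
#include <CGAL/partition_2.h>
#include <list>

typedef CGAL::Exact_predicates_exact_constructions_kernel K;   // Epeck
typedef CGAL::Partition_traits_2<K>                        Traits;
typedef Traits::Point_2                                    Point_2;
typedef Traits::Polygon_2                                  Polygon_2;

int main()
{
    // a simple counterclockwise polygon with one reflex vertex at (2,2)
    Polygon_2 polygon;
    polygon.push_back(Point_2(0, 0));
    polygon.push_back(Point_2(4, 0));
    polygon.push_back(Point_2(4, 4));
    polygon.push_back(Point_2(2, 2));
    polygon.push_back(Point_2(0, 4));

    std::list<Polygon_2> partition;
    CGAL::greene_approx_convex_partition_2(polygon.vertices_begin(),
                                           polygon.vertices_end(),
                                           std::back_inserter(partition));
    return 0;
}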
Are the 2D convex partitioning routines supported on an Epeck kernel (or can they be)? Any pointers or advice much appreciated!
thanks much,
Here is a workaround:
--- a/include/CGAL/Partition_2/Indirect_edge_compare.h
+++ b/include/CGAL/Partition_2/Indirect_edge_compare.h
@@ -69,7 +69,7 @@ class Indirect_edge_compare
else
{
// construct supporting line for edge
- Line_2 line = _construct_line_2(*edge_vtx_1, *edge_vtx_2);
+ Line_2 line = _construct_line_2((Point_2)*edge_vtx_1, (Point_2)*edge_vtx_2);
return _compare_x_at_y_2(*vertex, line) == SMALLER;
}
}
@@ -98,10 +98,10 @@ class Indirect_edge_compare
// else neither endpoint is shared
// construct supporting line
- Line_2 l_p = _construct_line_2(*p, *after_p);
+ Line_2 l_p = _construct_line_2((Point_2)*p, (Point_2)*after_p);
if (_is_horizontal_2(l_p))
{
- Line_2 l_q = _construct_line_2(*q, *after_q);
+ Line_2 l_q = _construct_line_2((Point_2)*q, (Point_2)*after_q);
if (_is_horizontal_2(l_q))
{
@@ -130,7 +130,7 @@ class Indirect_edge_compare
return q_larger_x;
// else one smaller and one larger
// construct the other line
- Line_2 l_q = _construct_line_2(*q, *after_q);
+ Line_2 l_q = _construct_line_2((Point_2)*q, (Point_2)*after_q);
if (_is_horizontal_2(l_q)) // p is not horizontal
{
return _compare_x_at_y_2((*q), l_p) == LARGER;
I have also noticed that while greene_approx_convex_partition_2 with epeck results in the compiler error mentioned above, the alternative approx_convex_partition_2 compiles just fine with epeck right out of the box.

Akka - load balancing and full use of the processor

I wrote a matrix multiplication algorithm that uses parallel collections to speed up the multiplication.
It goes like this:
(0 until M1_ROWS).grouped(PARTITION_ROWS).toList.par.map( i =>
    singleThreadedMultiplicationFAST(i.toArray.map(m1(_)), m2)
).reduce(_ ++ _)
Now I would like to do the same in Akka, so what I did is:
val multiplyer = actorOf[Pool]
multiplyer start

val futures = (0 until M1_ROWS).grouped(PARTITION_ROWS).map( i =>
    multiplyer ? MultiplyMatrix(i.toArray.map(m1(_)), m2)
)
futures.map(_.get match { case res: Array[Array[Double]] => res }).reduce(_ ++ _)

class Multiplyer extends akka.actor.Actor {
    protected def receive = {
        case MultiplyMatrix(m1, m2) => self reply singleThreadedMultiplicationFAST(m1, m2)
    }
}

class Pool extends Actor with DefaultActorPool
    with FixedCapacityStrategy with RoundRobinSelector {

    def receive = _route
    def partialFill = false
    def selectionCount = 1
    def instance = actorOf[Multiplyer]
    def limit = 32 // I tried 256 with no effect either
}
It turned out that the actor-based version of this algorithm uses only 200% CPU on my i7 Sandy Bridge, while the parallel collections version uses 600% and is 4-5x faster.
I thought it might be the dispatcher and tried this:
self.dispatcher = Dispatchers.newThreadBasedDispatcher(self, mailboxCapacity = 100)
and this (I shared this one between the actors):
val messageDispatcher = Dispatchers.newExecutorBasedEventDrivenDispatcher("d1")
    .withNewBoundedThreadPoolWithLinkedBlockingQueueWithUnboundedCapacity(100)
    .setCorePoolSize(16)
    .setMaxPoolSize(128)
    .setKeepAliveTimeInMillis(60000)
    .build
But I didn't observe any change: still only 200% processor usage, and the algorithm is 4-5 times slower than the parallel collections version.
I am sure I am doing something silly, so please help! :)
This expression:
val futures = (0 until M1_ROWS).grouped(PARTITION_ROWS).map( i =>
    multiplyer ? MultiplyMatrix(i.toArray.map(m1(_)), m2)
)
creates a lazy collection (grouped on a Range gives an Iterator), so each _.get blocks on a future as soon as it is created, which makes your entire program serial.
So the solution is to make that expression strict by adding toList or similar.
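For example, keeping everything else from the question as-is, a sketch of just that change (materializing the groups with toList so every MultiplyMatrix message is dispatched before the first blocking get):
// Make the collection strict so all asks are sent up front,
// letting the pool work on them in parallel before we block on results:
val futures = (0 until M1_ROWS).grouped(PARTITION_ROWS).toList.map( i =>
    multiplyer ? MultiplyMatrix(i.toArray.map(m1(_)), m2)
)
futures.map(_.get match { case res: Array[Array[Double]] => res }).reduce(_ ++ _)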