Why HPX requires future's "then" to be part of a DAG (directed acyclic graph)? - hpx

In HPX introduction tutorials you learn that you can make use of future's then() method, that allows you to enqueue some operation to be computed when the future is ready.
In this manual there is a sentence that says "Used to build up dataflow DAGs (directed acyclic graphs)" when explaining how to use thens.
My question is, what does it mean that this queue has to be acyclic? Can I make a function that re-computes a future inside a then? This would look like myfuture.then( recompute myfuture ; myfuture.then() ) ?

You can think of a hpx::future (very similar, if not identical to std::experimental::future, see https://en.cppreference.com/w/cpp/experimental/future) is an one-shot pipeline between an anonymous producer and a consumer. It does not represent the task itself but just the produced result (that might not have been computed yet).
Thus 'recomputing' a future (as you put it) can only mean to reinitialize the future from an asynchronous provider (hpx::async, future<>::then, etc.).
hpx::future<int> f = hpx::async([]{ return 42; });
hpx::future<int> f2 = f.then(
[](hpx::future<int> r)
{
// this is guaranteed not to block as the continuation
// will be called only after `f` has become ready (note:
// `f` has been moved-to `r`)
int result = r.get();
// 'reinitialize' the future
r = hpx::async([]{ return 21; });
// ...do things with 'r'
return result;
});
// f2 now represents the result of executing the chain of the two lambdas
std::cout << f2.get() << '\n'; // prints '42'
I'm not sure if this answers your question and why you would like to do that, but here you go.

Related

Vulkan memory barrier for indirect compute shader dispatch

I have two compute shaders and the first one modifies DispatchIndirectCommand buffer, which is later used by the second one.
// This is the code of the first shader
struct DispatchIndirectCommand{
uint x;
uint y;
uint z;
};
restrict layout(std430, set = 0, binding = 5) buffer Indirect{
DispatchIndirectCommand[1] dispatch_indirect;
};
void main(){
// some stuff
dispatch_indirect[0].x = number_of_groups; // I used debugPrintfEXT to make sure that this number is correct
}
I execute them as follows
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, first_shader);
vkCmdDispatch(cmd_buffer, x, 1, 1);
vkCmdBindPipeline(cmd_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, second_shader);
vkCmdDispatchIndirect(cmd_buffer, indirect_buffer, 0);
The problem is that the changes made by first shader are not reflected by the second one.
// This is the code of the second shader
void main(){
debugPrintfEXT("%d", gl_GlobalInvocationID.x); //this seems to never be called
}
I initialise the indirect_buffer with VkDispatchIndirectCommand{.x=0,.y=1,.z=1}, and it seems that the second shader always executes with x==0, because the debugPrintfEXT never prints anything. I tried to add a memory barrier like
VkBufferMemoryBarrier barrier;
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
barrier.srcQueueFamilyIndex = queue_idx;
barrier.dstQueueFamilyIndex = queue_idx;
barrier.buffer = indirect_buffer;
barrier.offset = 0;
barrier.size = sizeof_indirect_buffer;
However, this does not seem to make any difference. What does seem to work, is when I use
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_INDIRECT_COMMAND_READ;
When I use such access flags, the all compute shaders work properly. However, I get a validation error
Validation Error: [ VUID-vkCmdPipelineBarrier-dstAccessMask-02816 ] Object 0: handle = 0x5561b60356c8, type = VK_OBJECT_TYPE_COMMAND_BUFFER; | MessageID = 0x69c8467d | vkCmdPipelineBarrier(): .pMemoryBarriers[1].dstAccessMask bit VK_ACCESS_INDIRECT_COMMAND_READ_BIT is not supported by stage mask (VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT). The Vulkan spec states: The dstAccessMask member of each element of pMemoryBarriers must only include access flags that are supported by one or more of the pipeline stages in dstStageMask, as specified in the table of supported access types (https://vulkan.lunarg.com/doc/view/1.2.182.0/linux/1.2-extensions/vkspec.html#VUID-vkCmdPipelineBarrier-dstAccessMask-02816)
Vulkan's documentation states that
VK_ACCESS_INDIRECT_COMMAND_READ_BIT specifies read access to indirect command data read as part of an indirect build, trace, drawing or dispatching command. Such access occurs in the VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT pipeline stage
It looks really confusing. It clearly does mention "dispatching command", but at the same time it says that the stage must be VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT and not VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT. Is the specification contradictory/imprecise or am I missing something?
You seem to be employing trial-and-error strategy. Do not use such strategy in low level APIs, and preferrably in computer engineering in general. Sooner or later you will encounter something that will look like it works when you test it, but be invalid code anyway. The spec did tell you the exact thing to do, so you never ever had a legitimate reason to try any other flags or with no barriers at all.
As you discovered, the appropriate access flag for indirect read is VK_ACCESS_INDIRECT_COMMAND_READ_BIT. And as the spec also says, the appropriate stage is VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT.
So the barrier for your case should probably be:
srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
dstStageMask = VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
The name of the stage flag is slightly confusing, but the spec is otherwisely very clear it aplies to compute:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT specifies the stage of the pipeline where VkDrawIndirect* / VkDispatchIndirect* / VkTraceRaysIndirect* data structures are consumed.
and also:
For the compute pipeline, the following stages occur in this order:
VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT
VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
PS: Also relevant GitHub Issue: KhronosGroup/Vulkan-Docs#176

Dafny not verifying trivial assertion

The code below, also at https://rise4fun.com/Dafny/I4wM ,
function abst(bits: array<int>,low:int,high: int): int
requires 0<=low<=high<=8
requires bits.Length==8
decreases high-low
reads bits
{ if low==high
then 0
else 2*abst(bits,low+1,high) + bits[low]
}
method M() {
var byte: int;
var bits:= new int[8];
bits[0]:= 1;
bits[1]:= 1;
bits[2]:= 1;
bits[3]:= 1;
bits[4]:= 1;
bits[5]:= 1;
bits[6]:= 1;
bits[7]:= 1;
// assert abst(bits,2,2) == 0; // Why is this needed?
assert abst(bits,0,2) == 3;
}
fails to verify the assertion on the last line. (I'm using the on-line Dafny at Rise4Fun.) If the assertion just before is uncommented, the verification succeeds.
I'd be grateful for help. Thanks!
The Dafny verifier unrolls functions only once or twice. Specifically, if the function occurs in an assertion you're trying to prove (like abst does in your example), then the function gets unrolled twice. However, to prove your assertion, the definition of abst needs to be unrolled 3 times. Therefore, the verifier can't prove it automatically. You'll have to help it along.
When you uncomment the first assertion, it gets used in the proof of the second assertion. Since the first assertion provides the last instantiation of abst that you need, it helps prove the second assertion. The first assertion itself is also proved, but that's automatic, since it only requires one unrolling of abst.
If you write out the proof in detail, it looks like:
calc {
abst(bits,0,2);
== // def. abst
2*abst(bits,1,2) + bits[0];
== // def. abst
2*(2*abst(bits,2,2) + bits[1]) + bits[0];
== // def. abst
2*(2*0 + bits[1]) + bits[0];
== // def. bits[0] and bits[1]
2*(2*0 + 1) + 1;
== // arithmetic
3;
}
In this proof, you can see the needed instantiations of the function.
Do you really need arrays in your application? Arrays are allocated on the heap and are mutable (which means you need the reads clause and other complications). If you only need an immutable sequence, then I would recommend you change array<int> to seq<int>. This will make things easier (again--if you don't need the mutable qualities of arrays). Moreover, for your assertion above, it brings another benefit: All the arguments you then pass to abst are literals. For literals, Dafny is willing to unroll functions (more or less) without limits. So, then your original assertion verifies automatically.
Rustan

Is it safe, to share an array between threads?

Is it safe, to share an array between promises like I did it in the following code?
#!/usr/bin/env perl6
use v6;
sub my_sub ( $string, $len ) {
my ( $s, $l );
if $string.chars > $len {
$s = $string.substr( 0, $len );
$l = $len;
}
else {
$s = $string;
$l = $s.chars;
}
return $s, $l;
}
my #orig = <length substring character subroutine control elements now promise>;
my $len = 7;
my #copy;
my #length;
my $cores = 4;
my $p = #orig.elems div $cores;
my #vb = ( 0..^$cores ).map: { [ $p * $_, $p * ( $_ + 1 ) ] };
#vb[#vb.end][1] = #orig.elems;
my #promise;
for #vb -> $r {
#promise.push: start {
for $r[0]..^$r[1] -> $i {
( #copy[$i], #length[$i] ) = my_sub( #orig[$i], $len );
}
};
}
await #promise;
It depends how you define "array" and "share". So far as array goes, there are two cases that need to be considered separately:
Fixed size arrays (declared my #a[$size]); this includes multi-dimensional arrays with fixed dimensions (such as my #a[$xs, $ys]). These have the interesting property that the memory backing them never has to be resized.
Dynamic arrays (declared my #a), which grow on demand. These are, under the hood, actually using a number of chunks of memory over time as they grow.
So far as sharing goes, there are also three cases:
The case where multiple threads touch the array over its lifetime, but only one can ever be touching it at a time, due to some concurrency control mechanism or the overall program structure. In this case the arrays are never shared in the sense of "concurrent operations using the arrays", so there's no possibility to have a data race.
The read-only, non-lazy case. This is where multiple concurrent operations access a non-lazy array, but only to read it.
The read/write case (including when reads actually cause a write because the array has been assigned something that demands lazy evaluation; note this can never happen for fixed size arrays, as they are never lazy).
Then we can summarize the safety as follows:
| Fixed size | Variable size |
---------------------+----------------+---------------+
Read-only, non-lazy | Safe | Safe |
Read/write or lazy | Safe * | Not safe |
The * indicating the caveat that while it's safe from Perl 6's point of view, you of course have to make sure you're not doing conflicting things with the same indices.
So in summary, fixed size arrays you can safely share and assign to elements of from different threads "no problem" (but beware false sharing, which might make you pay a heavy performance penalty for doing so). For dynamic arrays, it is only safe if they will only be read from during the period they are being shared, and even then if they're not lazy (though given array assignment is mostly eager, you're not likely to hit that situation by accident). Writing, even to different elements, risks data loss, crashes, or other bad behavior due to the growing operation.
So, considering the original example, we see my #copy; and my #length; are dynamic arrays, so we must not write to them in concurrent operations. However, that happens, so the code can be determined not safe.
The other posts already here do a decent job of pointing in better directions, but none nailed the gory details.
Just have the code that is marked with the start statement prefix return the values so that Perl 6 can handle the synchronization for you. Which is the whole point of that feature.
Then you can wait for all of the Promises, and get all of the results using an await statement.
my #promise = do for #vb -> $r {
start
do # to have the 「for」 block return its values
for $r[0]..^$r[1] -> $i {
$i, my_sub( #orig[$i], $len )
}
}
my #results = await #promise;
for #results -> ($i,$copy,$len) {
#copy[$i] = $copy;
#length[$i] = $len;
}
The start statement prefix is only sort-of tangentially related to parallelism.
When you use it you are saying, “I don't need these results right now, but probably will later”.
That is the reason it returns a Promise (asynchrony), and not a Thread (concurrency)
The runtime is allowed to delay actually running that code until you finally ask for the results, and even then it could just do all of them sequentially in the same thread.
If the implementation actually did that, it could result in something like a deadlock if you instead poll the Promise by continually calling it's .status method waiting for it to change from Planned to Kept or Broken, and only then ask for its result.
This is part of the reason the default scheduler will start to work on any Promise codes if it has any spare threads.
I recommend watching jnthn's talk “Parallelism, Concurrency,
and Asynchrony in Perl 6”.
slides
This answer applies to my understanding of the situation on MoarVM, not sure what the state of art is on the JVM backend (or the Javascript backend fwiw).
Reading a scalar from several threads can be done safely.
Modifying a scalar from several threads can be done without having to fear for a segfault, but you may miss updates:
$ perl6 -e 'my $i = 0; await do for ^10 { start { $i++ for ^10000 } }; say $i'
46785
The same applies to more complex data structures like arrays (e.g. missing values being pushed) and hashes (missing keys being added).
So, if you don't mind missing updates, changing shared data structures from several threads should work. If you do mind missing updates, which I think is what you generally want, you should look at setting up your algorithm in a different way, as suggested by #Zoffix Znet and #raiph.
No.
Seriously. Other answers seem to make too many assumptions about the implementation, none of which are tested by the spec.

Make interpreter execute faster

I've created an interprter for a simple language. It is AST based (to be more exact, an irregular heterogeneous AST) with visitors executing and evaluating nodes. However I've noticed that it is extremely slow compared to "real" interpreters. For testing I've ran this code:
i = 3
j = 3
has = false
while i < 10000
j = 3
has = false
while j <= i / 2
if i % j == 0 then
has = true
end
j = j+2
end
if has == false then
puts i
end
i = i+2
end
In both ruby and my interpreter (just finding primes primitively). Ruby finished under 0.63 second, and my interpreter was over 15 seconds.
I develop the interpreter in C++ and in Visual Studio, so I've used the profiler to see what takes the most time: the evaluation methods.
50% of the execution time was to call the abstract evaluation method, which then casts the passed expression and calls the proper eval method. Something like this:
Value * eval (Exp * exp)
{
switch (exp->type)
{
case EXP_ADDITION:
eval ((AdditionExp*) exp);
break;
...
}
}
I could put the eval methods into the Exp nodes themselves, but I want to keep the nodes clean (Terence Parr saied something about reusability in his book).
Also at evaluation I always reconstruct the Value object, which stores the result of the evaluated expression. Actually Value is abstract, and it has derived value classes for different types (That's why I work with pointers, to avoid object slicing at returning). I think this could be another reason of slowness.
How could I make my interpreter as optimized as possible? Should I create bytecodes out of the AST and then interpret bytecodes instead? (As far as I know, they could be much faster)
Here is the source if it helps understanding my problem: src
Note: I haven't done any error handling yet, so an illegal statement or an error will simply freeze the program. (Also sorry for the stupid "error messages" :))
The syntax is pretty simple, the currently executed file is in OTZ1core/testfiles/test.txt (which is the prime finder).
I appreciate any help I can get, I'm really beginner at compilers and interpreters.
One possibility for a speed-up would be to use a function table instead of the switch with dynamic retyping. Your call to the typed-eval is going through at least one, and possibly several, levels of indirection. If you distinguish the typed functions instead by name and give them identical signatures, then pointers to the various functions can be packed into an array and indexed by the type member.
value (*evaltab[])(Exp *) = { // the order of functions must match
Exp_Add, // the order type values
//...
};
Then the whole switch becomes:
evaltab[exp->type](exp);
1 indirection, 1 function call. Fast.

Reliable clean-up in Mathematica

For better or worse, Mathematica provides a wealth of constructs that allow you to do non-local transfers of control, including Return, Catch/Throw, Abort and Goto. However, these kinds of non-local transfers of control often conflict with writing robust programs that need to ensure that clean-up code (like closing streams) gets run. Many languages provide ways of ensuring that clean-up code gets run in a wide variety of circumstances; Java has its finally blocks, C++ has destructors, Common Lisp has UNWIND-PROTECT, and so on.
In Mathematica, I don't know how to accomplish the same thing. I have a partial solution that looks like this:
Attributes[CleanUp] = {HoldAll};
CleanUp[body_, form_] :=
Module[{return, aborted = False},
Catch[
CheckAbort[
return = body,
aborted = True];
form;
If[aborted,
Abort[],
return],
_, (form; Throw[##]) &]];
This certainly isn't going to win any beauty contests, but it also only handles Abort and Throw. In particular, it fails in the presence of Return; I figure if you're using Goto to do this kind of non-local control in Mathematica you deserve what you get.
I don't see a good way around this. There's no CheckReturn for instance, and when you get right down to it, Return has pretty murky semantics. Is there a trick I'm missing?
EDIT: The problem with Return, and the vagueness in its definition, has to do with its interaction with conditionals (which somehow aren't "control structures" in Mathematica). An example, using my CleanUp form:
CleanUp[
If[2 == 2,
If[3 == 3,
Return["foo"]]];
Print["bar"],
Print["cleanup"]]
This will return "foo" without printing "cleanup". Likewise,
CleanUp[
baz /.
{bar :> Return["wongle"],
baz :> Return["bongle"]},
Print["cleanup"]]
will return "bongle" without printing cleanup. I don't see a way around this without tedious, error-prone and maybe impossible code-walking or somehow locally redefining Return using Block, which is heinously hacky and doesn't actually seem to work (though experimenting with it is a great way to totally wedge a kernel!)
Great question, but I don't agree that the semantics of Return are murky; They are documented in the link you provide. In short, Return exits the innermost construct (namely, a control structure or function definition) in which it is invoked.
The only case in which your CleanUp function above fails to cleanup from a Return is when you directly pass a single or CompoundExpression (e.g. (one;two;three) directly as input to it.
Return exits the function f:
In[28]:= f[] := Return["ret"]
In[29]:= CleanUp[f[], Print["cleaned"]]
During evaluation of In[29]:= cleaned
Out[29]= "ret"
Return exits x:
In[31]:= x = Return["foo"]
In[32]:= CleanUp[x, Print["cleaned"]]
During evaluation of In[32]:= cleaned
Out[32]= "foo"
Return exits the Do loop:
In[33]:= g[] := (x = 0; Do[x++; Return["blah"], {10}]; x)
In[34]:= CleanUp[g[], Print["cleaned"]]
During evaluation of In[34]:= cleaned
Out[34]= 1
Returns from the body of CleanUp at the point where body is evaluated (since CleanUp is HoldAll):
In[35]:= CleanUp[Return["ret"], Print["cleaned"]];
Out[35]= "ret"
In[36]:= CleanUp[(Print["before"]; Return["ret"]; Print["after"]),
Print["cleaned"]]
During evaluation of In[36]:= before
Out[36]= "ret"
As I noted above, the latter two examples are the only problematic cases I can contrive (although I could be wrong) but they can be handled by adding a definition to CleanUp:
In[44]:= CleanUp[CompoundExpression[before___, Return[ret_], ___], form_] :=
(before; form; ret)
In[45]:= CleanUp[Return["ret"], Print["cleaned"]]
During evaluation of In[46]:= cleaned
Out[45]= "ret"
In[46]:= CleanUp[(Print["before"]; Return["ret"]; Print["after"]),
Print["cleaned"]]
During evaluation of In[46]:= before
During evaluation of In[46]:= cleaned
Out[46]= "ret"
As you said, not going to win any beauty contests, but hopefully this helps solve your problem!
Response to your update
I would argue that using Return inside If is unnecessary, and even an abuse of Return, given that If already returns either the second or third argument based on the state of the condition in the first argument. While I realize your example is probably contrived, If[3==3, Return["Foo"]] is functionally identical to If[3==3, "foo"]
If you have a more complicated If statement, you're better off using Throw and Catch to break out of the evaluation and "return" something to the point you want it to be returned to.
That said, I realize you might not always have control over the code you have to clean up after, so you could always wrap the expression in CleanUp in a no-op control structure, such as:
ret1 = Do[ret2 = expr, {1}]
... by abusing Do to force a Return not contained within a control structure in expr to return out of the Do loop. The only tricky part (I think, not having tried this) is having to deal with two different return values above: ret1 will contain the value of an uncontained Return, but ret2 would have the value of any other evaluation of expr. There's probably a cleaner way to handle that, but I can't see it right now.
HTH!
Pillsy's later version of CleanUp is a good one. At the risk of being pedantic, I must point out a troublesome use case:
Catch[CleanUp[Throw[23], Print["cleanup"]]]
The problem is due to the fact that one cannot explicitly specify a tag pattern for Catch that will match an untagged Throw.
The following version of CleanUp addresses that problem:
SetAttributes[CleanUp, HoldAll]
CleanUp[expr_, cleanup_] :=
Module[{exprFn, result, abort = False, rethrow = True, seq},
exprFn[] := expr;
result = CheckAbort[
Catch[
Catch[result = exprFn[]; rethrow = False; result],
_,
seq[##]&
],
abort = True
];
cleanup;
If[abort, Abort[]];
If[rethrow, Throw[result /. seq -> Sequence]];
result
]
Alas, this code is even less likely to be competitive in a beauty contest. Furthermore, it wouldn't surprise me if someone jumped in with yet another non-local control flow that that this code will not handle. Even in the unlikely event that it handles all possible cases now, problematic cases could be introduced in Mathematica X (where X > 7.01).
I fear that there cannot be a definitive answer to this problem until Wolfram introduces a new control structure expressly for this purpose. UnwindProtect would be a fine name for such a facility.
Michael Pilat provided the key trick for "catching" returns, but I ended up using it in a slightly different way, using the fact that Return forces the return value of a named function as well as control structures like Do. I made the expression that is being cleaned up after into the down-value of a local symbol, like so:
Attributes[CleanUp] = {HoldAll};
CleanUp[expr_, form_] :=
Module[{body, value, aborted = False},
body[] := expr;
Catch[
CheckAbort[
value = body[],
aborted = True];
form;
If[aborted,
Abort[],
value],
_, (form; Throw[##]) &]];