Say I have a TensorFlow variable to track the mean of a value. mean can be updated with the following graph snippet:
mean.assign((step * mean.read() + value) / (step + 1))
Unfortunately, those operations are not atomic, so if two different portions of the graph try to update the same mean variable one of the updates may be lost.
If instead I were tracking sum, I could just do
sum.assign_add(value, use_locking=True)
and everything would be great. Unfortunately, in other cases a more complicated update to mean (or std or etc.) may be required, and it may be impossible to use tf.assign_add.
Question: Is there any way to make the first code snippet atomic?
Unfortunately I believe the answer is no, since (1) I don't remember any such mechanism and (2) one of our reasons for making optimizers C++ ops was to get atomic behavior. My main source of hope is XLA, but I do not know whether this kind of atomicity can be guaranteed there.
The underlying problem of the example is that there are two operations - read and subsequent assign - that together need to be executed atomically.
Since the beginning of 2018, the TensorFlow codebase has included the CriticalSection class. However, this only works for resource variables (as pointed out in Geoffrey's comments). Hence, value in the example below needs to be acquired as:
value = tf.get_variable(..., use_resource=True, ...)
Although I did not test this, according to the class' documentation the atomic update issue should then be solvable as follows:
def update_mean(step, value):
    old_value = mean.read_value()
    with tf.control_dependencies([old_value]):
        return mean.assign((step * old_value + value) / (step + 1))

cs = tf.CriticalSection()
mean_update = cs.execute(update_mean, step, value)
session.run(mean_update)
Essentially, it provides a lock from the beginning of execute() till its end, i.e. covering the whole assignment operation including read and assign.
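To make the locking idea concrete outside of TensorFlow, here is a minimal sketch in plain Python threads. It is only an analogy, not TensorFlow code: RunningMean and run_workers are hypothetical names, and the lock plays the role that CriticalSection is described as playing above, covering the read and the assignment as one unit.

```python
import threading


class RunningMean:
    """Running mean whose read-modify-write update is guarded by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self.step = 0      # number of values folded in so far
        self.mean = 0.0

    def update(self, value):
        # The lock spans the read of the old mean, the arithmetic,
        # and the write of the new mean, so no update can be lost.
        with self._lock:
            self.mean = (self.step * self.mean + value) / (self.step + 1)
            self.step += 1
            return self.mean


def run_workers(tracker, values, num_threads=4):
    """Feed `values` to `tracker` from several threads at once."""
    chunks = [values[i::num_threads] for i in range(num_threads)]
    threads = [
        threading.Thread(target=lambda c=chunk: [tracker.update(v) for v in c])
        for chunk in chunks
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


tracker = RunningMean()
values = [float(v) for v in range(1000)]
run_workers(tracker, values)
print(tracker.step, tracker.mean)
```

Without the lock, two threads could both read the same old mean and step, and one of the two writes would overwrite the other, which is the lost update described in the question.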
I'm trying to use the Python interface of the SCIP tool (https://github.com/scipopt/PySCIPOpt) to solve a mixed-integer optimization problem.
I want to define an OR-constraint with three constraints, but only one of them must be satisfied.
For example, I want to minimize a variable x with three constraints x>=1, x>=2, x>=3, but only one of them must be valid, and then minimize the value of x. Of course the result should be x=1.
However, the OR-constraint API addConsOr requires both the constraint list and a result variable (resvar, the resultant variable of the operation). While I can provide the list of constraints, I don't know the meaning of the result variable in the second function parameter. When I set the second parameter to a new variable, the following code fails with a segmentation fault.
from pyscipopt import Model
model = Model()
x = model.addVar(vtype = "I")
b = model.addVar(vtype="B")
model.addConsOr([x>=1, x>=2, x>=3], b)
model.setObjective(x, "minimize")
model.optimize()
print("Optimal value:", model.getObjVal())
Setting the second parameter to True also results in a segmentation fault.
model.addConsOr([x>=1, x>=2, x>=3], True)
What you are describing is not an OR-constraint. An OR-constraint takes a set of binary variables and sets the resultant variable to the logical OR of their values, as explained in the SCIP documentation.
What you want is a general disjunctive constraint. Those exist in SCIP as SCIPcreateConsDisjunction but are not wrapped in the Python API yet. Fortunately, you can extend the API yourself quite easily. Simply add the correct function to scip.pxd and define the wrapper in scip.pyx. Just look at how it is done for the existing constraint types and do it the same way. The people over at the PySCIPOpt GitHub will be happy if you create a pull request with your changes.
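The difference between the two notions can be illustrated in plain Python, without SCIP. This is only a brute-force sketch of the semantics, not PySCIPOpt code: or_resultant and minimize_under_disjunction are hypothetical helper names, and the search range 0..9 is an assumption made for the example.

```python
def or_resultant(binaries):
    """OR-constraint semantics: r = b1 OR b2 OR ... over 0/1 values."""
    return int(any(binaries))


def minimize_under_disjunction(candidates, predicates):
    """Smallest candidate that satisfies at least one predicate (a disjunction)."""
    feasible = [x for x in candidates if any(p(x) for p in predicates)]
    return min(feasible) if feasible else None


# The three constraints from the question: x >= 1, x >= 2, x >= 3.
predicates = [lambda x: x >= 1, lambda x: x >= 2, lambda x: x >= 3]
best = minimize_under_disjunction(range(0, 10), predicates)
print(or_resultant([0, 1, 0]), best)
```

The first function only combines binary values, which is why passing linear constraints such as x >= 1 to addConsOr does not express the intended model. The second function captures what the question asks for: at least one of the constraints must hold.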
My question is:
Re. optimization: Does x += y at the center of a loop always cause a read after write data dependency and so prevent vectorization?
See https://cvw.cac.cornell.edu/vector/coding_dependencies
Read after write ("flow" or "RAW") dependency
This kind of dependency is not vectorizable. It occurs when the values
of variables involved in a particular loop iteration (the "read") are
determined in a previous loop iteration (the "write"). In other words,
a variable is read (used as an operand to a mathematical operation)
after its value has been modified by a previous loop iteration.
This question is very general in that it is basically asking whether using the += operator in the center of a loop precludes vectorization by causing a read after write ("flow" or "RAW") data dependency.
E.g.
for(i...){
    for(j...){
        x(i,j) += y(i,j)
    }
}
See
https://gcc.gnu.org/projects/tree-ssa/vectorization.html
Example 14: Double reduction:
In your example, there is no issue with data dependencies, assuming the arrays do not alias each other: the loop can be easily and safely vectorized since each x(i,j) value depends only on y(i,j), and y(i,j) is only read in the loop. If x and y overlap, then it is much harder to vectorize the loop, since the overlap causes an implicit dependency between the computations of the x(i,j) values (because y(i,j) can alias x(i-1,j), which is computed just before).
In general, automatic vectorization is very hard, if possible at all, when there is a dependency chain (since the sequential order is essentially the only valid one). The linked article mentions such a case (where each operation requires the previous one to be computed first due to a data dependency). This has nothing to do with the += operator specifically. It is all about data dependencies (or any loop-carried dependency).
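A small Python sketch can make the distinction visible. It is an illustration only (Python does not vectorize these loops): elementwise_add and running_sum are hypothetical names. The idea is that a loop without a loop-carried dependency gives the same result in any iteration order, while a loop with one does not.

```python
def elementwise_add(x, y, order):
    """x[i] += y[i]: each iteration touches only its own element."""
    out = list(x)
    for i in order:
        out[i] += y[i]
    return out


def running_sum(x, order):
    """x[i] += x[i - 1]: iteration i reads what iteration i - 1 wrote."""
    out = list(x)
    for i in order:
        if i > 0:
            out[i] += out[i - 1]
    return out


x = [1, 2, 3, 4]
y = [10, 20, 30, 40]
forward = list(range(len(x)))
backward = forward[::-1]

# No loop-carried dependency: the iteration order does not matter.
same = elementwise_add(x, y, forward) == elementwise_add(x, y, backward)
# Read-after-write dependency: reordering changes the result.
differs = running_sum(x, forward) != running_sum(x, backward)
print(same, differs)
```

Both loops use +=, but only the second one carries a value from one iteration to the next, which is what blocks straightforward vectorization.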
Notes
Regarding floating-point (FP) numbers, operations are not associative by default, so a compiler cannot execute something like x(0) + x(1) + x(2) + ... in a vectorized way (if x is an array of FP numbers). Indeed, the only correct order is ((x(0) + x(1)) + x(2)) + ... based on the IEEE-754 standard. Compiler options can be used to treat FP operations as associative so that this kind of computation can be vectorized. This breaks IEEE-754 compliance and any code that requires the ordering not to be changed (e.g. Kahan summation). Also note that while the above expression can be vectorized (see parallel scan), it is generally not very efficient on most mainstream CPUs yet. Still, some hardware can benefit from parallel scan: mainstream discrete GPUs.
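The non-associativity is easy to observe with Python floats, which are IEEE-754 doubles. The sketch below also includes lane_sum, a hypothetical helper that mimics the summation order a 4-lane vectorized reduction would use; it is an illustration of reordering, not of what any particular compiler emits.

```python
# Same three operands, two groupings.
left_to_right = (0.1 + 0.2) + 0.3
regrouped = 0.1 + (0.2 + 0.3)
print(left_to_right == regrouped)


def lane_sum(values, lanes=4):
    """Sum with one partial accumulator per lane, then combine the lanes."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    return sum(partial)


data = [0.1] * 10
print(sum(data), lane_sum(data))
```

The lane-wise result is close to the sequential one but is not guaranteed to be bit-identical, which is why compilers only reorder such reductions when explicitly allowed to.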
A C++ standard library implements std::copy (ignoring all sorts of wrappers, concept checks, etc.) with this simple loop:
for (; __first != __last; ++__result, ++__first)
    *__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
    T* __restrict__ destination,
    const T* __restrict__ source,
    size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that.
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements up to the multiple of 32B, and the last elements (after the last such multiple,) differently.
Slack due to output address mis-alignment - The output is written in transactions of up to 128B (or is it just 32B?); we thus have to handle the first elements up to the multiple of this number, and the last elements (after the last such multiple) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
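The "slack" items in the list above amount to splitting the range into an unaligned head, an aligned body, and a leftover tail. Here is a minimal host-side sketch of that arithmetic in Python (not CUDA device code). split_for_alignment is a hypothetical helper, and it assumes the boundary is a multiple of the element size and the start address is a multiple of the element size.

```python
def split_for_alignment(address, elem_size, length, boundary):
    """Return (head, body, tail) element counts for a copy of `length` elements.

    head: elements before the first address that is a multiple of `boundary`
    body: elements that fill whole `boundary`-sized chunks
    tail: elements after the last whole chunk
    """
    if boundary % elem_size != 0 or address % elem_size != 0:
        raise ValueError("sketch assumes element-aligned address and boundary")
    head_bytes = (-address) % boundary             # bytes up to the next boundary
    head = min(length, head_bytes // elem_size)
    remaining = length - head
    per_chunk = boundary // elem_size              # elements per aligned chunk
    body = (remaining // per_chunk) * per_chunk
    tail = remaining - body
    return head, body, tail


# 100 four-byte elements starting at byte offset 100, with 128-byte chunks.
print(split_for_alignment(address=100, elem_size=4, length=100, boundary=128))
```

In a warp-level copy, the body would be the part handled with full-width coalesced accesses, while the head and tail would need a separate, partially populated path.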
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode
I'm currently working on a project that does not include GSAP (Greensock's JS Tweening library), but since it's super easy to create your own Custom Easing functions with its visual editor, I was wondering if there is a way to break down the desired ease-function so that it can be reused in a CreateJS Tween?
Example:
var myEase = CustomEase.create("myCustomEase", [
{s:0,cp:0.413,e:0.672},{s:0.672,cp:0.931,e:1.036},
{s:1.036,cp:1.141,e:1.036},{s:1.036,cp:0.931,e:0.984},
{s:0.984,cp:1.03699,e:1.004},{s:1.004,cp:0.971,e:0.988},
{s:0.988,cp:1.00499,e:1}
]);
So that it turns it into something like:
var myEase = function(t, b, c, d) {
    // Some magic algorithm performed on the 7 bezier/control points above...
}
(Here is what the graph would look like for this particular easing method.)
I took the time to port and optimize the original GSAP-based CustomEase class... but due to license restrictions / legal matters (basically a grizzly bear that I do not want to poke with a stick...), posting the ported code would violate it.
However, it's fair for my own use. Therefore, I believe it's only fair that I guide you and point you to the resources that made it possible.
The original code (not directly compatible with CreateJS) can be found here:
https://github.com/art0rz/gsap-customease/blob/master/CustomEase.js (looks like the author was also asked to take down the repo on github - sorry if the rest of this post makes no sense at all!)
Note that CreateJS's easing methods only take a "time ratio" value (not time, start, end, duration like GSAP's easing methods do). That time ratio is really all you need, given it goes from 0.0 (your start value) to 1.0 (your end value).
With a little bit of effort, you can discard those parameters from the ease() method and trim down the final returned expression.
Optimizations:
I took a few extra steps to optimize the above code.
1) In the constructor, you can store the segments.length value directly as this.length, a property of the CustomEase instance, to cut down on the number of accessors / property lookups in the ease() method (where qty is set).
2) There are a few redundant calculations done per Segments that can be eliminated in the ease() method. For instance, the s.cp - s.s and s.e - s.s operations can be precalculated and stored in a couple of properties in each Segments (in its constructor).
3) Finally, I'm not sure why it was designed this way, but you can unwrap the function() {...}(); wrappers that return the constructors for each class. Perhaps they were used to trap the scope of some variables, but I don't see why one couldn't have wrapped the entire thing instead of encapsulating each one separately.
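For orientation, here is a generic sketch of the idea in Python, written from the textbook quadratic Bezier formula and not taken from the GSAP source. It assumes each {s, cp, e} entry is a start value, control point, and end value, and that the segments divide the 0..1 time ratio into equal-width slices; make_custom_ease is a hypothetical name. The two differences mentioned in optimization 2 are precomputed once.

```python
import math


def make_custom_ease(segments):
    """Build ease(ratio) from equal-width quadratic Bezier segments."""
    qty = len(segments)
    # Precompute (start, cp - start, end - start) for every segment.
    prepared = [(seg["s"], seg["cp"] - seg["s"], seg["e"] - seg["s"]) for seg in segments]

    def ease(ratio):
        if ratio <= 0.0:
            return segments[0]["s"]
        if ratio >= 1.0:
            return segments[-1]["e"]
        index = min(int(math.floor(qty * ratio)), qty - 1)
        t = qty * ratio - index  # local position inside the segment, 0..1
        start, cp_delta, end_delta = prepared[index]
        # Quadratic Bezier: (1-t)^2*s + 2*(1-t)*t*cp + t^2*e, rearranged.
        return start + t * (2.0 * (1.0 - t) * cp_delta + t * end_delta)

    return ease


segments = [
    {"s": 0, "cp": 0.413, "e": 0.672}, {"s": 0.672, "cp": 0.931, "e": 1.036},
    {"s": 1.036, "cp": 1.141, "e": 1.036}, {"s": 1.036, "cp": 0.931, "e": 0.984},
    {"s": 0.984, "cp": 1.03699, "e": 1.004}, {"s": 1.004, "cp": 0.971, "e": 0.988},
    {"s": 0.988, "cp": 1.00499, "e": 1},
]
my_ease = make_custom_ease(segments)
print(my_ease(0.0), my_ease(0.5), my_ease(1.0))
```

Because the function takes only a ratio and returns a ratio, the same shape translates directly to a single-argument JavaScript function of the kind CreateJS expects.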
Need more info? Leave a comment!
I am working on fairly large Mathematica projects and the problem arises that I have to intermittently check numerical results but want to easily revert to having all my constructs in analytical form.
The code is fairly fluid and I don't want to use scoping constructs everywhere, as they add work overhead. Is there an easy way to identify and clear all assignments that are numerical?
EDIT: I really do know that scoping is the way to do this correctly ;-). However, for my workflow I am really just looking for a dirty trick to nix all numerical assignments after the fact instead of having the foresight to put down a Block.
If your assignments are on the top level, you can use something like this:
a = 1;
b = c;
d = 3;
e = d + b;
Cases[DownValues[In],
HoldPattern[lhs_ = rhs_?NumericQ] |
HoldPattern[(lhs_ = rhs_?NumericQ;)] :> Unset[lhs],
3]
This will work if you have a sufficient history length $HistoryLength (defaults to infinity). Note however that, in the above example, e was assigned 3+c, and 3 here was not undone. So, the problem is really ambiguous in formulation, because some numbers could make it into definitions. One way to avoid this is to use SetDelayed for assignments, rather than Set.
Another alternative would be to analyze the names in, say, the Global` context (if that is the context where your symbols live), and then, say, the OwnValues and DownValues of the symbols, in a fashion similar to the above, and remove definitions with purely numerical r.h.s.
But IMO neither of these approaches is robust. I'd still use scoping constructs and try to isolate numerics. One possibility is to wrap your final code in Block, and assign numerical values inside this Block. This seems a much cleaner approach. The work overhead is minimal - you just have to remember which symbols you want to assign the values to. Block will automatically ensure that outside it, the symbols will have no definitions.
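For readers less familiar with Block, its behavior of temporarily giving symbols values and restoring them afterwards can be mimicked in Python. This is only a loose analogy, not Mathematica semantics: temporary_values is a hypothetical helper, and a plain dict stands in for the symbol table.

```python
from contextlib import contextmanager


@contextmanager
def temporary_values(namespace, **bindings):
    """Bind names in `namespace` inside the with-block, then restore them."""
    missing = object()
    saved = {name: namespace.get(name, missing) for name in bindings}
    namespace.update(bindings)
    try:
        yield namespace
    finally:
        for name, old in saved.items():
            if old is missing:
                namespace.pop(name, None)   # the name had no value before
            else:
                namespace[name] = old       # put the earlier value back


symbols = {"b": "c"}  # b is "analytical", a has no value at all
with temporary_values(symbols, a=3, b=2.0) as env:
    inside = env["a"] + env["b"]
print(inside, symbols)
```

After the with-block ends, the numerical values are gone again, which is the property that makes Block attractive for intermittent numerical checks.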
EDIT
Yet another possibility is to use local rules. For example, one could define rule[a] = a->1; rule[d]=d->3 instead of the assignments above. You could then apply these rules, extracting them as say
DownValues[rule][[All, 2]], whenever you want to test with some numerical arguments.
Building on Andrew Moylan's solution, one can construct a Block-like function that takes rules:
SetAttributes[BlockRules, HoldRest]
BlockRules[rules_, expr_] :=
 Block @@ Append[Apply[Set, Hold@rules, {2}], Unevaluated[expr]]
You can then save your numeric rules in a variable, and use BlockRules[ savedrules, code ], or even define a function that would apply a fixed set of rules, kind of like so:
In[76]:= NumericCheck =
Function[body, BlockRules[{a -> 3, b -> 2`}, body], HoldAll];
In[78]:= a + b // NumericCheck
Out[78]= 5.
EDIT In response to Timo's comment, it might be possible to use NotebookEvaluate (new in 8) to achieve the requested effect.
SetAttributes[BlockRules, HoldRest]
BlockRules[rules_, expr_] :=
 Block @@ Append[Apply[Set, Hold@rules, {2}], Unevaluated[expr]]
nb = CreateDocument[{ExpressionCell[
Defer[Plot[Sin[a x], {x, 0, 2 Pi}]], "Input"],
ExpressionCell[Defer[Integrate[Sin[a x^2], {x, 0, 2 Pi}]],
"Input"]}];
BlockRules[{a -> 4}, NotebookEvaluate[nb, InsertResults -> "True"];]
As the result of this evaluation you get a notebook with your commands evaluated when a was locally set to 4. In order to take it further, you would have to take the notebook with your code, open a new notebook, evaluate Notebooks[] to identify the notebook of interest, and then do:
BlockRules[variablerules,
NotebookEvaluate[NotebookPut[NotebookGet[nbobj]],
InsertResults -> "True"]]
I hope you can make this idea work.