Some questions on DLR call site caching - dynamic-language-runtime

Kindly correct me if my understanding about DLR operation resolving methodology is wrong
DLR has something call as call site caching. Given "a+b" (where a and b are both integers), the first time it will not be able to understand the +(plus) operator as wellas the operand. So it will resolve the operands and then the operators and will cache in level 1 after building a proper rule.
Next time onwards, if a similar kind of match is found (where the operand are of type integers and operator is of type +), it will look into the cache and will return the result and henceforth the performance will be enhanced.
If this understanding of mine is correct(if you fell that I am incorrect in the basic understanding kindly rectify that), I have the below questions for which I am looking answers
a) What kind of operations happen in level 1 , level 2
and level 3 call site caching in general
( i mean plus, minus..what kind)
b) DLR caches those results. Where it stores..i mean the cache location
c) How it makes the rules.. any algorithm.. if so could you please specify(any link or your own answer)
d) Sometime back i read somewhere that in level 1 caching , DLR has 10 rules,
level 2 it has 20 or something(may be) while level 3 has 100.
If a new operation comes, then I makes a new rule. Are these rules(level 1 , 2 ,3) predefined?
e) Where these rules are kept?
f) Instead of passing two integers (a and b in the example),
if we pass two strings or one string and one integer,
whether it will form a new rule?
Thanks

a) If I understand this correctly currently all of the levels work effectively the same - they run a delegate which tests the arguments which matter for the rule and then either performs an operation or notes the failure (the failure is noted by either setting a value on the call site or doing a tail call to the update method). Therefore every rule works like:
public object Rule(object x, object y) {
if(x is int && y is int) {
return (int)x + (int)y;
}
CallSiteOps.SetNotMatched(site);
return null;
}
And a delegate to this method is used in the L0, L1, and L2 caches. But the behavior here could change (and did change many times during development). For example at one point in time the L2 cache was a tree based upon the type of the arguments.
b) The L0 cache and the L1 caches are stored on the CallSite object. The L0 cache is a single delegate which is always the 1st thing to run. Initially this is set to a delegate which just updates the site for the 1st time. Additional calls attempt to perform the last operation the call site saw.
The L1 cache includes the last 10 actions the call site saw. If the L0 cache fails all the delegates in the L1 cache will be tried.
The L2 cache lives on the CallSiteBinder. The CallSiteBinder should be shared across multiple call sites. For example there should generally be one and only one additiona call site binder for the language assuming all additions are the same. If the L0 and L1 all of the available rules in the L2 cache will be searched. Currently the upper limit for the L2 cache is 128.
c) Rules can ultimately be produced in 2 ways. The general solution is to create a DynamicMetaObject which includes both the expression tree of the operation to be performed as well as the restrictions (the test) which determines if it's applicable. This is something like:
public DynamicMetaObject FallbackBinaryOperation(DynamicMetaObject target, DynamicMetaObject arg) {
return new DynamicMetaObject(
Expression.Add(Expression.Convert(target, typeof(int)), Expression.Convert(arg, typeof(int))),
BindingRestrictions.GetTypeRestriction(target, typeof(int)).Merge(
BindingRestrictions.GetTypeRestriction(arg, typeof(int))
)
);
}
This creates the rule which adds two integers. This isn't really correct because if you get something other than integers it'll loop infinitely - so usually you do a bunch of type checks, produce the addition for the things you can handle, and then produce an error if you can't handle the addition.
The other way to make a rule is provide the delegate directly. This is more advanced and enables some advanced optimizations to avoid compiling code all of the time. To do this you override BindDelegate on CallSiteBinder, inspect the arguments, and provide a delegate which looks like the Rule method I typed above.
d) Answered in b)
e) I believe this is the same question as b)
f) If you put the appropriate restrictions on then yes (which you're forced to do for the standard binders). Because the restrictions will fail because you don't have two ints the DLR will probe the caches and when it doesn't find a rule it'll call the binder to produce a new rule. That new rule will come back w/ a new set of restrictions, will be installed in the L0, L1, and L2 caches, and then will perform the operation for strings from then on out.

Related

How do I remember the root of a binary search tree in Haskell

I am new to Functional programming.
The challenge I have is regarding the mental map of how a binary search tree works in Haskell.
In other programs (C,C++) we have something called root. We store it in a variable. We insert elements into it and do balancing etc..
The program takes a break does other things (may be process user inputs, create threads) and then figures out it needs to insert a new element in the already created tree. It knows the root (stored as a variable) and invokes the insert function with the root and the new value.
So far so good in other languages. But how do I mimic such a thing in Haskell, i.e.
I see functions implementing converting a list to a Binary Tree, inserting a value etc.. That's all good
I want this functionality to be part of a bigger program and so i need to know what the root is so that i can use it to insert it again. Is that possible? If so how?
Note: Is it not possible at all because data structures are immutable and so we cannot use the root at all to insert something. in such a case how is the above situation handled in Haskell?
It all happens in the same way, really, except that instead of mutating the existing tree variable we derive a new tree from it and remember that new tree instead of the old one.
For example, a sketch in C++ of the process you describe might look like:
int main(void) {
Tree<string> root;
while (true) {
string next;
cin >> next;
if (next == "quit") exit(0);
root.insert(next);
doSomethingWith(root);
}
}
A variable, a read action, and loop with a mutate step. In haskell, we do the same thing, but using recursion for looping and a recursion variable instead of mutating a local.
main = loop Empty
where loop t = do
next <- getLine
when (next /= "quit") $ do
let t' = insert next t
doSomethingWith t'
loop t'
If you need doSomethingWith to be able to "mutate" t as well as read it, you can lift your program into State:
main = loop Empty
where loop t = do
next <- getLine
when (next /= "quit") $ do
loop (execState doSomethingWith (insert next t))
Writing an example with a BST would take too much time but I give you an analogous example using lists.
Let's invent a updateListN which updates the n-th element in a list.
updateListN :: Int -> a -> [a] -> [a]
updateListN i n l = take (i - 1) l ++ n : drop i l
Now for our program:
list = [1,2,3,4,5,6,7,8,9,10] -- The big data structure we might want to use multiple times
main = do
-- only for shows
print $ updateListN 3 30 list -- [1,2,30,4,5,6,7,8,9,10]
print $ updateListN 8 80 list -- [1,2,3,4,5,6,7,80,9,10]
-- now some illustrative complicated processing
let list' = foldr (\i l -> updateListN i (i*10) l) list list
-- list' = [10,20,30,40,50,60,70,80,90,100]
-- Our crazily complicated illustrative algorithm still needs `list`
print $ zipWith (-) list' list
-- [9,18,27,36,45,54,63,72,81,90]
See how we "updated" list but it was still available? Most data structures in Haskell are persistent, so updates are non-destructive. As long as we have a reference of the old data around we can use it.
As for your comment:
My program is trying the following a) Convert a list to a Binary Search Tree b) do some I/O operation c) Ask for a user input to insert a new value in the created Binary Search Tree d) Insert it into the already created list. This is what the program intends to do. Not sure how to get this done in Haskell (or) is am i stuck in the old mindset. Any ideas/hints welcome.
We can sketch a program:
data BST
readInt :: IO Int; readInt = undefined
toBST :: [Int] -> BST; toBST = undefined
printBST :: BST -> IO (); printBST = undefined
loop :: [Int] -> IO ()
loop list = do
int <- readInt
let newList = int : list
let bst = toBST newList
printBST bst
loop newList
main = loop []
"do balancing" ... "It knows the root" nope. After re-balancing the root is new. The function balance_bst must return the new root.
Same in Haskell, but also with insert_bst. It too will return the new root, and you will use that new root from that point forward.
Even if the new root's value is the same, in Haskell it's a new root, since one of its children has changed.
See ''How to "think functional"'' here.
Even in C++ (or other imperative languages), it would usually be considered a poor idea to have a single global variable holding the root of the binary search tree.
Instead code that needs access to a tree should normally be parameterised on the particular tree it operates on. That's a fancy way of saying: it should be a function/method/procedure that takes the tree as an argument.
So if you're doing that, then it doesn't take much imagination to figure out how several different sections of code (or one section, on several occasions) could get access to different versions of an immutable tree. Instead of passing the same tree to each of these functions (with modifications in between), you just pass a different tree each time.
It's only a little more work to imagine what your code needs to do to "modify" an immutable tree. Obviously you won't produce a new version of the tree by directly mutating it, you'll instead produce a new value (probably by calling methods on the class implementing the tree for you, but if necessary by manually assembling new nodes yourself), and then you'll return it so your caller can pass it on - by returning it to its own caller, by giving it to another function, or even calling you again.
Putting that all together, you can have your whole program manipulate (successive versions of) this binary tree without ever having it stored in a global variable that is "the" tree. An early function (possibly even main) creates the first version of the tree, passes it to the first thing that uses it, gets back a new version of the tree and passes it to the next user, and so on. And each user of the tree can call other subfunctions as needed, with possibly many of new versions of the tree produced internally before it gets returned to the top level.
Note that I haven't actually described any special features of Haskell here. You can do all of this in just about any programming language, including C++. This is what people mean when they say that learning other types of programming makes them better programmers even in imperative languages they already knew. You can see that your habits of thought are drastically more limited than they need to be; you could not imagine how you could deal with a structure "changing" over the course of your program without having a single variable holding a structure that is mutated, when in fact that is just a small part of the tools that even C++ gives you for approaching the problem. If you can only imagine this one way of dealing with it then you'll never notice when other ways would be more helpful.
Haskell also has a variety of tools it can bring to this problem that are less common in imperative languages, such as (but not limited to):
Using the State monad to automate and hide much of the boilerplate of passing around successive versions of the tree.
Function arguments allow a function to be given an unknown "tree-consumer" function, to which it can give a tree, without any one place both having the tree and knowing which function it's passing it to.
Lazy evaluation sometimes negates the need to even have successive versions of the tree; if the modifications are expanding branches of the tree as you discover they are needed (like a move-tree for a game, say), then you could alternatively generate "the whole tree" up front even if it's infinite, and rely on lazy evaluation to limit how much work is done generating the tree to exactly the amount you need to look at.
Haskell does in fact have mutable variables, it just doesn't have functions that can access mutable variables without exposing in their type that they might have side effects. So if you really want to structure your program exactly as you would in C++ you can; it just won't really "feel like" you're writing Haskell, won't help you learn Haskell properly, and won't allow you to benefit from many of the useful features of Haskell's type system.

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy with the following code (ignoring all sorts of wrappers and concept checks etc) with the simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is sizeof(T) is 1 or 2, and we have have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements up to the multiple of 32B, and the last elements (after the last such multiple,) differently.
Slack due to output address mis-alignment - The output is written in transactions of upto 128B (or is it just 32B?); we thus have to handle the first elements up to the multiple of this number, and the last elements (after the last such multiple,) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

How to convert Greensock's CustomEase functions to be usable in CreateJS's Tween system?

I'm currently working on a project that does not include GSAP (Greensock's JS Tweening library), but since it's super easy to create your own Custom Easing functions with it's visual editor - I was wondering if there is a way to break down the desired ease-function so that it can be reused in a CreateJS Tween?
Example:
var myEase = CustomEase.create("myCustomEase", [
{s:0,cp:0.413,e:0.672},{s:0.672,cp:0.931,e:1.036},
{s:1.036,cp:1.141,e:1.036},{s:1.036,cp:0.931,e:0.984},
{s:0.984,cp:1.03699,e:1.004},{s:1.004,cp:0.971,e:0.988},
{s:0.988,cp:1.00499,e:1}
]);
So that it turns it into something like:
var myEase = function(t, b, c, d) {
//Some magic algorithm performed on the 7 bezier/control points above...
}
(Here is what the graph would look like for this particular easing method.)
I took the time to port and optimize the original GSAP-based CustomEase class... but due to license restrictions / legal matters (basically a grizzly bear that I do not want to poke with a stick...), posting the ported code would violate it.
However, it's fair for my own use. Therefore, I believe it's only fair that I guide you and point you to the resources that made it possible.
The original code (not directly compatible with CreateJS) can be found here:
https://github.com/art0rz/gsap-customease/blob/master/CustomEase.js (looks like the author was also asked to take down the repo on github - sorry if the rest of this post makes no sense at all!)
Note that CreateJS's easing methods only takes a "time ratio" value (not time, start, end, duration like GSAP's easing method does). That time ratio is really all you need, given it goes from 0.0 (your start value) to 1.0 (your end value).
With a little bit of effort, you can discard those parameters from the ease() method and trim down the final returned expression.
Optimizations:
I took a few extra steps to optimize the above code.
1) In the constructor, you can store the segments.length value directly as this.length in a property of the CustomEase instance to cut down a bit on the amount of accessors / property lookups in the ease() method (where qty is set).
2) There's a few redundant calculations done per Segments that can be eliminated in the ease() method. For instance, the s.cp - s.s and s.e - s.s operations can be precalculated and stored in a couple of properties in each Segments (in its constructor).
3) Finally, I'm not sure why it was designed this way, but you can unwrap the function() {...}(); that are returning the constructors for each classes. Perhaps it was used to trap the scope of some variables, but I don't see why it couldn't have wrapped the entire thing instead of encapsulating each one separately.
Need more info? Leave a comment!

Clearing numerical values in Mathematica

I am working on fairly large Mathematica projects and the problem arises that I have to intermittently check numerical results but want to easily revert to having all my constructs in analytical form.
The code is fairly fluid I don't want to use scoping constructs everywhere as they add work overhead. Is there an easy way for identifying and clearing all assignments that are numerical?
EDIT: I really do know that scoping is the way to do this correctly ;-). However, for my workflow I am really just looking for a dirty trick to nix all numerical assignments after the fact instead of having the foresight to put down a Block.
If your assignments are on the top level, you can use something like this:
a = 1;
b = c;
d = 3;
e = d + b;
Cases[DownValues[In],
HoldPattern[lhs_ = rhs_?NumericQ] |
HoldPattern[(lhs_ = rhs_?NumericQ;)] :> Unset[lhs],
3]
This will work if you have a sufficient history length $HistoryLength (defaults to infinity). Note however that, in the above example, e was assigned 3+c, and 3 here was not undone. So, the problem is really ambiguous in formulation, because some numbers could make it into definitions. One way to avoid this is to use SetDelayed for assignments, rather than Set.
Another alternative would be to analyze the names in say Global' context (if that is the context where your symbols live), and then say OwnValues and DownValues of the symbols, in a fashion similar to the above, and remove definitions with purely numerical r.h.s.
But IMO neither of these approaches are robust. I'd still use scoping constructs and try to isolate numerics. One possibility is to wrap you final code in Block, and assign numerical values inside this Block. This seems a much cleaner approach. The work overhead is minimal - you just have to remember which symbols you want to assign the values to. Block will automatically ensure that outside it, the symbols will have no definitions.
EDIT
Yet another possibility is to use local rules. For example, one could define rule[a] = a->1; rule[d]=d->3 instead of the assignments above. You could then apply these rules, extracting them as say
DownValues[rule][[All, 2]], whenever you want to test with some numerical arguments.
Building on Andrew Moylan's solution, one can construct a Block like function that would takes rules:
SetAttributes[BlockRules, HoldRest]
BlockRules[rules_, expr_] :=
Block ## Append[Apply[Set, Hold#rules, {2}], Unevaluated[expr]]
You can then save your numeric rules in a variable, and use BlockRules[ savedrules, code ], or even define a function that would apply a fixed set of rules, kind of like so:
In[76]:= NumericCheck =
Function[body, BlockRules[{a -> 3, b -> 2`}, body], HoldAll];
In[78]:= a + b // NumericCheck
Out[78]= 5.
EDIT In response to Timo's comment, it might be possible to use NotebookEvaluate (new in 8) to achieve the requested effect.
SetAttributes[BlockRules, HoldRest]
BlockRules[rules_, expr_] :=
Block ## Append[Apply[Set, Hold#rules, {2}], Unevaluated[expr]]
nb = CreateDocument[{ExpressionCell[
Defer[Plot[Sin[a x], {x, 0, 2 Pi}]], "Input"],
ExpressionCell[Defer[Integrate[Sin[a x^2], {x, 0, 2 Pi}]],
"Input"]}];
BlockRules[{a -> 4}, NotebookEvaluate[nb, InsertResults -> "True"];]
As the result of this evaluation you get a notebook with your commands evaluated when a was locally set to 4. In order to take it further, you would have to take the notebook
with your code, open a new notebook, evaluate Notebooks[] to identify the notebook of interest and then do :
BlockRules[variablerules,
NotebookEvaluate[NotebookPut[NotebookGet[nbobj]],
InsertResults -> "True"]]
I hope you can make this idea work.

What is the standard way to optimise mutual recursion in F#/Scala?

These languages do not support mutually recursive functions optimization 'natively', so I guess it must be trampoline or.. heh.. rewriting as a loop) Do I miss something?
UPDATE: It seems that I did lie about FSharp, but I just didn't see an example of mutual tail-calls while googling
First of all, F# supports mutually recursive functions natively, because it can benefit from the tailcall instruction that's available in the .NET IL (MSDN). However, this is a bit tricky and may not work on some alternative implementations of .NET (e.g. Compact Frameworks), so you may sometimes need to deal with this by hand.
In general, I that there are a couple of ways to deal with it:
Trampoline - throw an exception when the recursion depth is too high and implement a top-level loop that handles the exception (the exception would carry information to resume the call). Instead of exception you can also simply return a value specifying that the function should be called again.
Unwind using timer - when the recursion depth is too high, you create a timer and give it a callback that will be called by the timer after some very short time (the timer will continue the recursion, but the used stack will be dropped).
The same thing could be done using a global stack that stores the work that needs to be done. Instead of scheduling a timer, you would add function to the stack. At the top-level, the program would pick functions from the stack and run them.
To give a specific example of the first technique, in F# you could write this:
type Result<´T> =
| Done of ´T
| Call of (unit -> ´T)
let rec factorial acc n =
if n = 0 then Done acc
else Call(fun () -> factorial (acc * n) (n + 1))
This can be used for mutually recursive functions as well. The imperative loop would simply call the f function stored in Call(f) until it produces Done with the final result. I think this is probably the cleanest way to implement this.
I'm sure there are other sophisticated techniques for dealing with this problem, but those are the two I know about (and that I used).
On Scala 2.8, scala.util.control.TailCalls:
import scala.util.control.TailCalls._
def isEven(xs: List[Int]): TailRec[Boolean] = if (xs.isEmpty)
done(true)
else
tailcall(isOdd(xs.tail))
def isOdd(xs: List[Int]): TailRec[Boolean] = if (xs.isEmpty)
done(false)
else
tailcall(isEven(xs.tail))
isEven((1 to 100000).toList).result
Just to have the code handy for when you Bing for F# mutual recursion:
let rec isOdd x =
if x = 1 then true else isEven (x-1)
and isEven x =
if x = 0 then true else isOdd (x-1)
printfn "%A" (isEven 10000000)
This will StackOverflow if you compile without tail calls (the default in "Debug" mode, which preserves stacks for easier debugging), but run just fine when compiled with tail calls (the default in "Release" mode). The compiler does tail calls by default (see the --tailcalls option), and .NET implementations on most platforms honor it.