Properly allocating short / long jumps in a VM with only absolute jumps - virtual-machine

Using relative jumps and assuming the code is being lowered from a structured language (so jumps are nested), the code can be generated "inside out": the inner jumps are resolved (to short or long) leading to precise offset allowing outer jumps to be resolved immediately.
If jumps are absolute though that seems like an issue: it's obviously impossible to resolve inner jumps before outer ones (as the destination offsets are not yet known), but resolving outer jumps require assuming a size for the inner jumps, and any mis-prediction requires updating the outer jump which can cascade into updating the inner jumps, which then requires re-updating the outer jumps, ....
Is there a better solution than this iteration until the jumps converge?
This assumes long jumps have a longer encoding and are more expensive than short jumps, so are undesirable (or possibly forbidden entirely e.g. the ISA forbids "overlong" encodings).

Related

Determining a program's execution time by its length in bits?

This is a question popped into my mind while reading the halting problem, collatz conjecture and Kolmogorov complexity. I have tried to search for something similar but I was unable to find a particular topic maybe because it is not of great value or it could just be a trivial question.
For the sake of simplicity I will give three examples of programs/functions.
function one(s):
return s
function two(s):
while (True):
print s
function three(s):
for i from 0 to 10^10:
print(s)
So my questions is, if there is a way to formalize the length of a program (like the bits used to describe it) and also the internal memory used by the program, to determine the minimum/maximum number of time/steps needed to decide whether the program will terminate or run forever.
For example, in the first function the program doesn't alter its internal memory and halts after some time steps.
In the second example, the program runs forever but the program also doesn't alter its internal memory. For example, if we considered all the programs with the same length as with the program two that do not alter their state, couldn't we determine an upper bound of steps, which if surpassed we could conclude that this program will never terminate ? (If not why ?)
On the last example, the program alters its state (variable i). So, at each step the upper bound may change.
[In short]
Kolmogorov complexity suggests a way of finding the (descriptive) complexity of an object such as a piece of text. I would like to know, given a formal way of describing the memory-space used by a program (computed in runtime), if we could compute a maximum number of steps, which if surpassed would allow us to know whether this program will terminate or run forever.
Finally, I would like to suggest me any source that I might find useful and help me figure out what I am exactly looking for.
Thank you. (sorry for my English, not my native language. I hope I was clear)
If a deterministic Turing machine enters precisely the same configuration twice (which we can detect b keeping a trace of configurations seen so far), then we immediately know the TM will loop forever.
If it known in advance that a deterministic Turing machine cannot possibly use more than some fixed constant amount of its input tape, then the TM must explicitly halt or eventually enter some configuration it has already visited. Suppose the TM can use at most k tape cells, the tape alphabet is T and the set of states is Q. Then there are (|T|+1)^k * |Q| unique configurations (the number of strings over (T union blank) of length k times the number of states) and by the pigeonhole principle we know that a TM that takes that many steps must enter some configuration it has already been to before.
one: because we are given that this function does not use internal memory, we know that it either halts or loops forever.
two: because we are given that this function does not use internal memory, we know that it either halts or loops forever.
three: because we are given that this function only uses a fixed amount of internal memory (like 34 bits) we can tell in fewer than 2^34 iterations of the loop whether the TM will halt or not for any given input s, guaranteed.
Now, knowing how much tape a TM is going to use, or how much memory a program is going to use, is not a problem a TM can solve. But if you have an oracle (like a person who was able to do a proof) that tells you a correct fixed upper bound on memory, then the halting problem is solvable.

Negamax: what to do with "partial" results after canceling a search?

I'm implementing negamax with alpha/beta transposition table based on the pseudo code here, with roughly this algorithm:
NegaMax():
1. Transposition Table lookup
2. Loop through moves
2a. **Bail if I'm out of time**
2b. Make move, call -NegaMax, undo move
2c. Update bestvalue, alpha/beta but if appropriate
3. Transposition table store/update
4. Return bestvalue
I'm also using iterative deepening, calling NegaMax with progressively higher depths.
My question is: when I determine I've run out of time (2a. in the beginning of move loop) what is the right thing to do? Do I bail immediately (not updating the transposition table) or do I just break the loop (saving whatever partial work I've done)?
Currently, I return null at that point, signifying that the search was canceled before "completing" that node (whether by trying every move or the alpha/beta cut). The null gets propagated up and up the stack, and each node on the way up bails by return, so step 3 never runs.
Essentially, I only store values in the TT if the node "completed". The scenario I keep seeing with the iterative deepening:
I get through depths 1-5 really quick, so the TT has a depth = 5, type = Exact entry.
The depth = 6 search is taking a long time, so I bail.
I ultimately return the best move in the transposition table, which is the move I found during the depth = 5 search. The problem is, if I start a new depth = 6 search, it feels like I'm starting it from scratch. However, if I save whatever partial results I found, I worry that I'll have corrupted my TT, potentially by overwriting the completed depth = 5 entry with an incomplete depth = 6 entry.
If the search wasn't completed, the score is inaccurate and should likely not be added to the TT. If you have a best move from the previous ply and it is still best and the score hasn't dropped significantly, you might play that.
On the other hand, if at depth 6 you discover that the opponent has a mate in 3 (oops!) or could win your queen, you might have to spend even more time to try to resolve that.
That would leave you with less time for the remaining moves (if any...), but it might be better to be slightly short on time than to get mated with plenty of time remaining. :-)

Amortized runtime for insertion in scapegoat tree

I am working on the following problem, from a problem set for a course I am self studying.
I have solved the first part. I'm stuck on the second. These are my thoughts so far. I think that the proper way to rebuild the subtree rooted at v would be to traverse it once to copy the values into an array in sorted order, and then, traverse it once again to build it into a balanced binary tree. Thus, this would be linear in v.size. However, I don't see where the potential and the constant can turn this into a O(1), let alone how such a constant could depend upon alpha. As I thought the rebuild operation was independent of alpha, and alpha simply affects how often you have to rebuild? So would the alpha come out of the potential function? And then the c just serves to cancel the alpha? If so, could I have some guidance as to how to rewrite the potential function?
You don't need to rewrite the potential function. The way that c and alpha interact is in the part of (2) in which "a subtree that is not alpha-balanced". That should help you derive a lower bound on the potential of that subtree. Part (1) helps you derive an upper bound on the potential of that subtree after the rebuilding. The resulting difference in potential should help you pay for the rebuilding.
In particular, the lower bound will be something like f(c,alpha) * m for some function f. This problem wants you to find an expression for c in terms of alpha so that f(c,alpha) >= 1.

Memory efficiency in If statements

I'm thinking more about how much system memory my programs will use nowadays. I'm currently doing A level Computing at college and I know that in most programs the difference will be negligible but I'm wondering if the following actually makes any difference, in any language.
Say I wanted to output "True" or "False" depending on whether a condition is true. Personally, I prefer to do something like this:
Dim result As String
If condition Then
Result = "True"
Else
Result = "False"
EndIf
Console.WriteLine(result)
However, I'm wondering if the following would consume less memory, etc.:
If condition Then
Console.WriteLine("True")
Else
Console.WriteLine("False")
EndIf
Obviously this is a very much simplified example and in most of my cases there is much more to be outputted, and I realise that in most commercial programs these kind of statements are rare, but hopefully you get the principle.
I'm focusing on VB.NET here because that is the language used for the course, but really I would be interested to know how this differs in different programming languages.
The main issue making if's fast or slow is predictability.
Modern CPU's (anything after 2000) use a mechanism called branch prediction.
Read the above link first, then read on below...
Which is faster?
The if statement constitutes a branch, because the CPU needs to decide whether to follow or skip the if part.
If it guesses the branch correctly the jump will execute in 0 or 1 cycle (1 nanosecond on a 1Ghz computer).
If it does not guess the branch correctly the jump will take 50 cycles (give or take) (1/200th of a microsecord).
Therefore to even feel these differences as a human, you'd need to execute the if statement many millions of times.
The two statements above are likely to execute in exactly the same amount of time, because:
assigning a value to a variable takes negligible time; on average less than a single cpu cycle on a multiscalar CPU*.
calling a function with a constant parameter requires the use of an invisible temporary variable; so in all likelihood code A compiles to almost the exact same object code as code B.
*) All current CPU's are multiscalar.
Which consumes less memory
As stated above, both versions need to put the boolean into a variable.
Version A uses an explicit one, declared by you; version B uses an implicit one declared by the compiler.
However version A is guaranteed to only have one call to the function WriteLine.
Whilst version B may (or may not) have two calls to the function WriteLine.
If the optimizer in the compiler is good, code B will be transformed into code A, if it's not it will remain with the redundant calls.
How bad is the waste
The call takes about 10 bytes for the assignment of the string (Unicode 2 bytes per char).
But so does the other version, so that's the same.
That leaves 5 bytes for a call. Plus maybe a few extra bytes to set up a stackframe.
So lets say due to your totally horrible coding you have now wasted 10 bytes.
Not much to worry about.
From a maintainability point of view
Computer code is written for humans, not machines.
So from that point of view code A is clearly superior.
Imagine not choosing between 2 options -true or false- but 20.
You only call the function once.
If you decide to change the WriteLine for another function you only have to change it in one place, not two or 20.
How to speed this up?
With 2 values it's pretty much impossible, but if you had 20 values you could use a lookup table.
Obviously that optimization is not worth it unless code gets executed many times.
If you need to know the precise amount of memory the instructions are going to take, you can use ildasm on your code, and see for yourself. However, the amount of memory consumed by your code is much less relevant today, when the memory is so cheap and abundant, and compilers are smart enough to see common patterns and reduce the amount of code that they generate.
A much greater concern is readability of your code: if a complex chain of conditions always leads to printing a conditionally set result, your first code block expresses this idea in a cleaner way than the second one does. Everything else being equal, you should prefer whatever form of code that you find the most readable, and let the compiler worry about optimization.
P.S. It goes without saying that Console.WriteLine(condition) would produce the same result, but that is of course not the point of your question.

HLSL branch avoidance

I have a shader where I want to move half of the vertices in the vertex shader. I'm trying to decide the best way to do this from a performance standpoint, because we're dealing with well over 100,000 verts, so speed is critical. I've looked at 3 different methods: (pseudo-code, but enough to give you the idea. The <complex formula> I can't give out, but I can say that it involves a sin() function, as well as a function call (just returns a number, but still a function call), as well as a bunch of basic arithmetic on floating point numbers).
if (y < 0.5)
{
x += <complex formula>;
}
This has the advantage that the <complex formula> is only executed half the time, but the downside is that it definitely causes a branch, which may actually be slower than the formula. It is the most readable, but we care more about speed than readability in this context.
x += step(y, 0.5) * <complex formula>;
Using HLSL's step() function (which returns 0 if the first param is greater and 1 if less), you can eliminate the branch, but now the <complex formula> is being called every time, and its results are being multiplied by 0 (thus wasted effort) half of the time.
x += (y < 0.5) ? <complex formula> : 0;
This I don't know about. Does the ?: cause a branch? And if not, are both sides of the equation evaluated or only the one that is relevant?
The final possibility is that the <complex formula> could be offloaded back to the CPU instead of the GPU, but I worry that it will be slower in calculating sin() and other operations, which might result in a net loss. Also, it means one more number has to be passed to the shader, and that could cause overhead as well. Anyone have any insight as to which would be the best course of action?
Addendum:
According to http://msdn.microsoft.com/en-us/library/windows/desktop/bb509665%28v=vs.85%29.aspx
the step() function uses a ?: internally, so it's probably no better than my 3rd solution, and potentially worse since <complex formula> is definitely called every time, whereas it may be only called half the time with a straight ?:. (Nobody's answered that part of the question yet.) Though avoiding both and using:
x += (1.0 - y) * <complex formula>;
may well be better than any of them, since there's no comparison being made anywhere. (And y is always either 0 or 1.) Still executes the <complex formula> needlessly half the time, but might be worth it to avoid branches altogether.
Perhaps look at this answer.
My guess (this is a performance question: measure it!) is that you are best off keeping the if statement.
Reason number one: The shader compiler, in theory (and if invoked correctly), should be clever enough to make the best choice between a branch instruction, and something similar to the step function, when it compiles your if statement. The only way to improve on it is to profile[1]. Note that it's probably hardware-dependent at this level of granularity.
[1] Or if you have specific knowledge about how your data is laid out, read on...
Reason number two is the way shader units work: If even one fragment or vertex in the unit takes a different branch to the others, then the shader unit must take both branches. But if they all take the same branch - the other branch is ignored. So while it is per-unit, rather than per-vertex - it is still possible for the expensive branch to be skipped.
For fragments, the shader units have on-screen locality - meaning you get best performance with groups of nearby pixels all taking the same branch (see the illustration in my linked answer). To be honest, I don't know how vertices are grouped into units - but if your data is grouped appropriately - you should get the desired performance benefit.
Finally: It's worth pointing out that your <complex formula> - if you're saying that you can hoist it out of your HLSL manually - it may well get hoisted into a CPU-based pre-shader anyway (on PC at least, from memory Xbox 360 doesn't support this, no idea about PS3). You can check this by decompiling the shader. If it is something that you only need to calculate once per-draw (rather than per-vertex/fragment) it probably is best for performance to do it on the CPU.
I got tired of my conditionals being ignored so I just made a another kernel and did an override in c execution.
If you need it to be accurate all the time I suggest this fix.