Measuring performance after Tail Call Optimization (TCO)

I have an idea about what it is. My questions are:
1. If I write code that is amenable to tail call optimization (the last statement in a recursive function being a function call only, with no other operation there), do I need to set any optimization level so that the compiler does TCO? In which mode of optimization will a compiler perform TCO: optimizing for space, or for time?
2. How do I find out which compilers (MSVC, gcc, ARM RVCT) support TCO?
3. Assuming some compiler does TCO and we enable it, what is the way to find out that the compiler has actually done it? Will code size tell, or the cycles taken to execute, or both?

Most compilers support TCO; it is a relatively old technique. As for how to enable it with a specific compiler, check its documentation. gcc enables the optimization at -O2 and above (and at -Os), but not at -O1; the specific option for it is -foptimize-sibling-calls. As for how to tell whether the compiler has done TCO, look at the assembler output (gcc -S, for example) or disassemble the object code, as in the sketch below.
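For instance, a minimal sketch (the file and function names are just illustrative): compile a tail-recursive function with "gcc -O2 -S tco_check.c" and inspect tco_check.s; with TCO the recursive call is emitted as a jump (or folded into a loop) rather than a "call" instruction.

/* tco_check.c -- a tail-recursive accumulator sum */
static unsigned long sum_acc(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    return sum_acc(n - 1, acc + n);   /* tail call: nothing happens after it */
}

unsigned long sum(unsigned long n)
{
    return sum_acc(n, 0);
}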

1. Optimization is compiler specific; consult the documentation of the various optimization flags.
2. You will find that in the compilers' documentation too. If you are curious, you can write a tail-recursive function and pass it a big argument, then watch for a stack overflow (though checking the generated assembler might be a better choice, if you understand the generated code).
3. Just use the debugger and look at the addresses of function arguments/local variables. If they increase/decrease on each logical frame that the debugger shows (or if it actually shows only one frame, even though you made several calls), you know whether TCO was done.

If you want your compiler to do tail call optimization, just check either
a) the compiler's documentation for the optimization level at which it is performed, or
b) the generated assembly, to see whether the function calls itself (you don't even need much assembly knowledge to spot the symbol of the function again).
If you really, really want tail recursion, my question would be:
Why don't you perform the tail call removal yourself? It means nothing else than removing the recursion, and if it is removable then that is possible not only for the compiler at the low level but also for you at the algorithmic level, so you can program it directly into your code (it means nothing more than using a loop instead of a call to yourself), as in the sketch below.
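A minimal sketch of that hand conversion (hypothetical function names):

/* tail-recursive form: the recursive call is the last thing that happens */
long fact_acc(long n, long acc)
{
    if (n <= 1)
        return acc;
    return fact_acc(n - 1, acc * n);   /* tail call */
}

/* the same algorithm with the tail call removed by hand: each "call"
   becomes one more trip around the loop, so no new stack frame is needed */
long fact_loop(long n)
{
    long acc = 1;
    while (n > 1) {
        acc *= n;
        n -= 1;
    }
    return acc;
}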

One way to determine whether tail-call optimization is happening is to see if you can force a stack overflow. The following program does not produce one when built with VC++ 2005 Express Edition and, even though its results exceed the capacity of long double rather quickly, you can tell that all of the iterations are being processed when TCO is happening:
/* FibTail.c 0.00                 UTF-8                  dh:2008-11-23
 * --|----1----|----2----|----3----|----4----|----5----|----6----|----*
 *
 * Demonstrate Fibonacci computation by tail call to see whether it
 * is eliminated through compiler optimization.
 */

#include <stdio.h>

long double fibcycle(long double f0, long double f1, unsigned i)
{   /* accumulate successive fib(n-i) values by tail calls */
    if (i == 0) return f1;
    return fibcycle(f1, f0+f1, --i);
}

long double fib(unsigned n)
{   /* the basic fib(n) setup and return. */
    return fibcycle(1.0, 0.0, n);
}

int main(int argc, char* argv[])
{   /* compute some fibs until something breaks */
    int i;

    printf("\n           i                         fib(i)\n\n");
    for (i = 1; i > 0; i += i)
    {   /* Do for powers of 2 until i flips negative
           or stack overflow, whichever comes first */
        printf("%12d %30.20LG \n", i, fib((unsigned) i));
    }
    printf("\n\n");
    return 0;
}
Notice, however, that the simplifications needed to make a pure tail call in fibcycle are tantamount to figuring out an iterative version that doesn't do a tail call at all (and that will work with or without TCO in the compiler).
It might be interesting to experiment to see how well TCO can find optimizations that are not already near-optimal and easily replaced by iteration.

Related

What is the order of local variables on the stack?

I'm currently trying to do some tests with the buffer overflow vulnerability.
Here is the vulnerable code
void win()
{
    printf("code flow successfully changed\n");
}

int main(int argc, char **argv)
{
    volatile int (*fp)();
    char buffer[64];

    fp = 0;
    gets(buffer);
    if (fp) {
        printf("calling function pointer, jumping to 0x%08x\n", fp);
        fp();
    }
}
The exploit is quite simple and very basic: all I need here is to overflow the buffer and overwrite the fp value so that it holds the address of the win() function.
While trying to debug the program, I figured out that fp is placed below the buffer (i.e. at a lower address in memory), and thus I am not able to modify its value.
I thought that once we declare a local variable x before y, x will be higher in memory (i.e. at the bottom of the stack), so x could overwrite y if it exceeded its boundaries, which is not the case here.
I'm compiling the program with gcc version 5.2.1, no special flags (only tested -O0).
Any clue?
The order of local variables on the stack is unspecified.
It may change between different compilers, different versions or different optimization options. It may even depend on the names of the variables or other seemingly unrelated things.
The order of local variables on the stack is not determined until compile/link (build) time. I'm no expert, certainly, but I think you'd have to do some sort of "hex dump", or run the code in a debugger environment, to find out where a variable is allocated. And I'd also guess that this would be a relative address, not an absolute one.
And as @RalfFriedl has indicated in his answer, the location could change under any number of compiler options invoked for different platforms, etc.
But we do know it's possible to do buffer overflows. Although you hear less about them now due to defensive measures such as address space layout randomization (ASLR), they're still around and paying the bills for someone, I'd guess. There are of course many, many online articles on the subject; here's one that seems fairly current (https://www.synopsys.com/blogs/software-security/detect-prevent-and-mitigate-buffer-overflow-attacks/).
Good luck (should you even say that to someone practicing buffer overflow attacks?). At any rate, I hope you learn some things, and use it for good :)
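If you just want to see the layout your particular build chose, a minimal sketch is to print the addresses of the locals (the output is compiler-, version-, and flag-dependent):

/* layout.c -- show where the locals ended up for this build */
#include <stdio.h>

int main(void)
{
    volatile int (*fp)(void) = 0;
    char buffer[64];

    printf("fp     at %p\n", (void *)&fp);
    printf("buffer at %p\n", (void *)buffer);
    /* if fp sits below buffer, overflowing buffer upward cannot reach fp */
    return 0;
}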

Is inline PTX more efficient than C/C++ code?

I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc).
Is it more efficient to use them explicitly rather than write code with their intended semantics in C/C++?
For example: "population count", or popc, means counting the one bits. So should I write:
__device__ int popc(int a) {
    /* shift an unsigned copy: for a negative int, a >> 1 sign-extends
       and the loop would never terminate */
    unsigned int u = (unsigned int)a;
    int d = 0;
    while (u != 0) {
        if (u & 0x1) d++;
        u = u >> 1;
    }
    return d;
}
for that functionality, or should I, rather, use:
__device__ int popc(int a) {
    int d;
    asm("popc.b32 %0, %1;" : "=r"(d) : "r"(a));
    return d;
}
? Will the inline PTX be more efficient? Should we write inline PTX to get peak performance?
Also, does the GPU have some extra "magic" instructions corresponding to PTX instructions?
The compiler may identify what you're doing and use a fancy instruction to do it, or it may not. The only way to know in the general case is to look at the output of the compilation as PTX assembly, by adding the -ptx flag to nvcc. If the compiler generates the instruction for you, there is no need to hand-code the inline assembly yourself (or use an intrinsic).
Also, whether or not it makes a performance difference in the general case depends on whether or not the code path is used in a significant way, and on other factors such as the current performance limiters of your kernel (e.g. compute-bound or memory-bound).
A few more points in addition to @RobertCrovella's answer:
Even if you do use PTX for something, that should happen rarely. Limit it to small functions of no more than a few PTX lines, which you can then reuse for multiple purposes as you see fit, with most of your coding being in C/C++.
An example of this principle are the intrinsics @njuffa mentioned (that's not an official copy of the file, I think). Please read it through to see which intrinsics are available to you. That doesn't mean you should use them all, of course.
For your specific example: you do want the PTX version over the first one; it certainly won't do any harm. But, again, it is also an example of how you do not need to actually write PTX, since popc has a corresponding __popc intrinsic (again, as @njuffa has noted); a sketch follows below.
You might also want to have a look at the source code of some CUDA-based libraries to see what kind of PTX snippets they've chosen to use.
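For instance, a minimal sketch of that last point (the wrapper name is just illustrative; __popc() is the documented CUDA device intrinsic):

/* the intrinsic compiles down to the popc instruction; no inline PTX needed */
__device__ int popc_via_intrinsic(int a)
{
    return __popc(a);   /* counts the set bits of the 32-bit argument */
}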

What can a compiler do with branching information?

On a modern Pentium it is no longer possible to give branching hints to the processor it seems. Assuming that a profiling compiler such as gcc with profile-guided optimization gains information about likely branching behavior, what can it do to produce code that will execute more quickly?
The only option I know of is to move unlikely branches to the end of a function. Is there anything else?
Update.
http://download.intel.com/products/processor/manual/325462.pdf volume 2a, section 2.1.1 says
"Branch hint prefixes (2EH, 3EH) allow a program to give a hint to the processor about the most likely code path for
a branch. Use these prefixes only with conditional branch instructions (Jcc). Other use of branch hint prefixes
and/or other undefined opcodes with Intel 64 or IA-32 instructions is reserved; such use may cause unpredictable
behavior."
I don't know if these actually have any effect however.
On the other hand section 3.4.1. of http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf says
"
Compilers generate code that improves the efficiency of branch prediction in Intel processors. The Intel
C++ Compiler accomplishes this by:
keeping code and data on separate pages
using conditional move instructions to eliminate branches
generating code consistent with the static branch prediction algorithm
inlining where appropriate
unrolling if the number of iterations is predictable
With profile-guided optimization, the compiler can lay out basic blocks to eliminate branches for the most
frequently executed paths of a function or at least improve their predictability. Branch prediction need
not be a concern at the source level. For more information, see Intel C++ Compiler documentation.
"
http://cache-www.intel.com/cd/00/00/40/60/406096_406096.pdf says in "Performance Improvements with PGO "
"
PGO works best for code with many frequently executed branches that are difficult to
predict at compile time. An example is the code with intensive error-checking in which
the error conditions are false most of the time.
The infrequently executed (cold) errorhandling code can be relocated so the branch is rarely predicted incorrectly. Minimizing
cold code interleaved into the frequently executed (hot) code improves instruction cache
behavior."
There are two possible sources for the information you want:
There's Intel 64 and IA-32 Architectures Software Developer's Manual (3 volumes). This is a huge work which has evolved for decades. It's the best reference I know on a lot of subjects, including floating-point. In this case, you want to check volume 2, the instruction set reference.
There's the Intel 64 and IA-32 Architectures Optimization Reference Manual. This will tell you in somewhat brief terms what to expect from each microarchitecture.
Now, I don't know what you mean by a "modern Pentium" processor; this is 2013, right? There aren't any Pentiums anymore...
The instruction set does support telling the processor whether a branch is expected to be taken or not taken, by a prefix to the conditional branch instructions (such as JC, JZ, etc). See volume 2A of (1), section 2.1.1 (of the version I have), Instruction Prefixes. There are the 2E and 3E prefixes for not taken and taken, respectively.
As to whether these prefixes actually have any effect, if we can get that information, it will be on Optimization Reference Manual, the section for the microarchitecture you want (and I'm sure it won't be the Pentium).
Apart from using those, there is an entire section on the Optimization Reference Manual on that subject, that's section 3.4.1 (of the version I have).
It makes no sense to reproduce that here, since you can download the manual for free.
Briefly:
Eliminate branches by using conditional instructions (CMOV, SETcc)
Consider the static prediction algorithm (3.4.1.3)
Inline where appropriate
Unroll loops
Also, some compilers (GCC, for instance), even when CMOV is not possible, often perform bitwise arithmetic to select one of two distinctly computed values, thus avoiding branches; a scalar sketch of the idea follows. GCC does this particularly with SSE instructions when vectorizing loops.
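A minimal scalar sketch (a hypothetical helper, not actual GCC output; the vectorized SSE form works the same way, lane by lane):

/* branch-free select: mask is all ones when the condition holds, all zeros otherwise */
int select_masked(int cond, int c, int d)
{
    int mask = -(cond != 0);            /* -1 (all ones) or 0 */
    return (c & mask) | (d & ~mask);    /* c if cond, else d */
}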
Basically, the static conditions are:
Unconditional branches are predicted to be taken (... kind of expectable...)
Indirect branches are predicted not to be taken (because of a data dependency)
Backward conditionals are predicted to be taken (good for loops)
Forward conditionals are predicted not to be taken
You probably want to read the entire section 3.4.1.
If it's clear that a loop is rarely entered, or that it normally iterates very few times, then the compiler might avoid unrolling the loop, as unrolling can add a lot of harmful complexity to handle edge conditions (an odd number of iterations, etc.). Vectorisation, in particular, should be avoided in such cases.
The compiler might rearrange nested tests, so that the one that most frequently results in a short-cut can be used to avoid performing a test on something with a 50% pass rate.
Register allocation can be optimised to avoid having a rarely-used block force a register spill in the common case.
These are just some examples. I'm sure there are others I haven't thought of.
Off the top of my head, you have two options.
Option #1: Inform the compiler of the hints and let the compiler organize the code appropriately. For example, GCC supports the following ...
__builtin_expect((long)!!(x), 1L) /* GNU C to indicate that <x> will likely be TRUE */
__builtin_expect((long)!!(x), 0L) /* GNU C to indicate that <x> will likely be FALSE */
If you put them in macro form such as ...
#if <some condition to indicate support>
#define LIKELY(x) __builtin_expect((long)!!(x), 1L)
#define UNLIKELY(x) __builtin_expect((long)!!(x), 0L)
#else
#define LIKELY(x) (x)
#define UNLIKELY(x) (x)
#endif
... you can now use them as ...
if (LIKELY(x != 0)) {
    /* DO SOMETHING */
} else {
    /* DO SOMETHING ELSE */
}
This leaves the compiler free to organize the branches according to static branch prediction algorithms, and/or if the processor and compiler support it, to use instructions that indicate which branch is more likely to be taken.
Option #2: Use math to avoid branching.
if (a < b)
    y = C;
else
    y = D;
This could be re-written as ...
x = -(a < b); /* x = -1 if a < b, x = 0 if a >= b */
x &= (C - D); /* x = C - D if a < b, x = 0 if a >= b */
x += D; /* x = C if a < b, x = D if a >= b */
Hope this helps.
It can make the fall-through (i.e. the case where a branch is not taken) the most used path. That has two big effects:
only one branch can be taken per clock, or on some processors even one per two clocks, so if there are any other branches (there usually are; most code that matters is in a loop), a taken branch is bad news, a non-taken branch less so.
when the branch predictor is wrong, the code that it does have to execute is more likely to be in the code cache (or µop cache, where applicable). If it weren't, that would be a double whammy of restarting the pipeline and waiting for a cache miss. This is less of an issue in most loops, since both sides of the branch are likely to be in the cache, but it comes into play in big loops and other code.
It can also decide whether to do if-conversion based on better data than a heuristic guess. If-conversion may seem like "always a good idea", but it isn't; it's only "often a good idea". If the branch in the branching implementation is very well predicted, the if-converted code can well be slower, as in the sketch below.
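As a minimal sketch of that trade-off (illustrative function names; compilers commonly turn the second form into CMOV, though nothing guarantees it), both functions compute the same select:

/* branching form: nearly free when the branch predicts well,
   expensive when it mispredicts */
int select_branch(int a, int b, int c, int d)
{
    if (a < b)
        return c;
    return d;
}

/* if-converted form: no misprediction possible, but the result always
   waits on the comparison (a data dependency) */
int select_cmov(int a, int b, int c, int d)
{
    return (a < b) ? c : d;
}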

Input setting using Registers

I have a simple C program that prints n Fibonacci numbers, and I would like to compile it to an ELF object file. Instead of setting the number of Fibonacci numbers (n) directly in my C code, I would like to set it in the registers, since I am simulating it for an ARM processor. How can I do that?
Here is the code snippet
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 3

static float fib(float i) {
    return (i > 1) ? fib(i - 1) + fib(i - 2) : i;
}

int main(int argc, char **argv) {
    float i;
    printf("starting...\n");
    for (i = 0; i < ITERATIONS; i++) {
        printf("fib(%f) = %f\n", i, fib(i));
    }
    printf("finishing...\n");
    return 0;
}
I would like to set the ITERATIONS counter in my Registers rather than in the code.
Thanks in advance
The register keyword can be used to suggest to the compiler that it use registers for the iterator and the number of iterations:
register float i;
register int numIterations = ITERATIONS;
but that will not help much. First of all, the compiler may or may not use your suggestion. Next, values will still need to be placed on the stack for the call to fib(), and, finally, depending on what functions you call within your loop, code in the procedures you are calling could save your register contents in the stack frame at procedure entry, and restore them as part of the code implementing the procedure return.
If you really need to make every instruction count, then you will need to write machine code (using an assembly language). That way, you have direct control over your register usage. Assembly language programming is not for the faint of heart, though: assembly development is several times slower than using higher-level languages, your risk of inserting bugs is greater, and they are much more difficult to track down. High-level languages were developed for a reason, and the C language was developed to help write Unix. The minicomputers that ran the first Unix systems were extremely slow, but even then it was more important to have code that took less time to write, had fewer bugs, and was easier to debug than assembler.
If you want to try this, here are the answers to a previous question on stackoverflow about resources for ARM programming that might be helpful.
One tactic you might take is to isolate your performance-critical code into a procedure, write the procedure in C, then capture the generated assembly-language representation. Then rewrite the assembly to be more efficient. Test thoroughly, and get at least one other set of eyeballs to look over the resulting code.
Good Luck!
Make ITERATIONS a variable rather than a literal constant; then you can set its value directly in your debugger/simulator's watch or locals window just before the loop executes.
Alternatively, as it appears you have stdio support, why not just accept the value via console input? Both options are sketched below.
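A minimal sketch of those two suggestions combined (variable names are just illustrative):

#include <stdio.h>

static float fib(float i) {
    return (i > 1) ? fib(i - 1) + fib(i - 2) : i;
}

int main(void) {
    /* volatile so a write from the debugger/simulator is not optimized away */
    volatile unsigned iterations = 3;
    float i;

    /* or accept the count via console input instead: */
    /* scanf("%u", (unsigned *)&iterations); */

    for (i = 0; i < iterations; i++) {
        printf("fib(%f) = %f\n", i, fib(i));
    }
    return 0;
}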

Is it possible to compare two Objective-C blocks by content?

float pi = 3.14;
float (^piSquare)(void) = ^(void){ return pi * pi; };
float (^piSquare2)(void) = ^(void){ return pi * pi; };
[piSquare isEqualTo: piSquare2]; // -> want it to behave like -isEqualToString...
To expand on Laurent's answer.
A Block is a combination of implementation and data. For two blocks to be equal, they would need to have both the exact same implementation and have captured the exact same data. Comparison, thus, requires comparing both the implementation and the data.
One might think comparing the implementation would be easy. It actually isn't because of the way the compiler's optimizer works.
While comparing simple data is fairly straightforward, blocks can capture objects-- including C++ objects (which might actually work someday)-- and comparison may or may not need to take that into account. A naive implementation would simply do a byte level comparison of the captured contents. However, one might also desire to test equality of objects using the object level comparators.
Then there is the issue of __block variables. A block, itself, doesn't actually have any metadata related to __block captured variables as it doesn't need it to fulfill the requirements of said variables. Thus, comparison couldn't compare __block values without significantly changing compiler codegen.
All of this is to say that, no, it isn't currently possible to compare blocks and to outline some of the reasons why. If you feel that this would be useful, file a bug via http://bugreport.apple.com/ and provide a use case.
Putting aside issues of compiler implementation and language design, what you're asking for is provably undecidable (unless you only care about detecting 100% identical programs). Deciding if two programs compute the same function is equivalent to solving the halting problem. This is a classic consequence of Rice's Theorem: Any "interesting" property of Turing machines is undecidable, where "interesting" just means that it's true for some machines and false for others.
Just for fun, here's the proof. Assume we can create a function to decide if two blocks are equivalent, called EQ(b1, b2). Now we'll use that function to solve the halting problem. We create a new function HALT(M, I) that tells us if Turing machine M will halt on input I like so:
BOOL HALT(M, I) {
    return EQ(
        ^(int) { return 0; },
        ^(int) { M(I); return 0; }
    );
}
If M(I) halts then the blocks are equivalent, so HALT(M,I) returns YES. If M(I) doesn't halt then the blocks are not equivalent, so HALT(M,I) returns NO. Note that we don't have to execute the blocks -- our hypothetical EQ function can compute their equivalence just by looking at them.
We have now solved the halting problem, which we know is not possible. Therefore, EQ cannot exist.
I don't think this is possible. Blocks can be roughly seen as advanced functions (with access to global or local variables). The same way you cannot compare functions' content, you cannot compare blocks' content.
All you can do is to compare their low-level implementation, but I doubt that the compiler will guarantee that two blocks with the same content share their implementation.
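The same limitation is easy to see in plain C, with function pointers standing in for blocks (a sketch; the function names are made up):

#include <stdio.h>

static int zero_a(void) { return 0; }
static int zero_b(void) { return 0; }   /* same behavior, distinct function */

int main(void)
{
    int (*f)(void) = zero_a;
    int (*g)(void) = zero_b;

    printf("identical pointers: %d\n", f == g);     /* 0: identity differs */
    printf("identical results:  %d\n", f() == g()); /* 1: behavior matches */
    /* no general procedure can decide whether f and g compute the same function */
    return 0;
}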