gcc (I tried 4.7.2 on Mac and Linux with the -O3 flag) optimizes the Ackermann function into a single recursive call with a big local stack frame. Example Ackermann code below:
int ack(int m, int n) {
    if (m == 0) return n + 1;
    if (n == 0) return ack(m - 1, 1);
    return ack(m - 1, ack(m, n - 1));
}
When disassembled, there is only one recursive call to the ack function instead of two (I couldn't parse what is going on: ack is now transformed by gcc into a function with 8 arguments and a local stack of 49 ints and 9 chars). I tried to look up what kind of transformation passes gcc performed to optimize the Ackermann function down to a single call, but didn't find anything of interest. I would appreciate pointers on what major transformation passes gcc performed to convert the deeply recursive Ackermann function into a single recursive call. LLVM gcc (I tried v4.2 on Mac) doesn't reduce it to a single recursive call yet, and is 4x slower with the -O3 flag. This optimization seems very interesting.
The first pass is tail-call elimination. GCC does this at most optimization levels. Essentially, all function calls in tail position are transformed into gotos, like this:
int ack(int m, int n) {
begin:
    if (m == 0) return n + 1;
    if (n == 0) { m -= 1; n = 1; goto begin; }
    n = ack(m, n - 1); m -= 1; goto begin;
}
Now only one recursive call remains, and GCC, at the -O3 level only, inlines it for a couple of iterations, resulting in the huge monster you saw.
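For intuition, here is a rough hand-written sketch (not GCC's actual output) of what a single round of inlining the remaining call produces; each additional round duplicates the inner block again, which is where the dozens of extra locals come from:
int ack(int m, int n) {
begin:
    if (m == 0) return n + 1;
    if (n == 0) { m -= 1; n = 1; goto begin; }
    {
        /* inlined copy of ack(m, n - 1); tail calls are again loops */
        int m2 = m, n2 = n - 1, r;
    inner:
        if (m2 == 0) { r = n2 + 1; goto done; }
        if (n2 == 0) { m2 -= 1; n2 = 1; goto inner; }
        n2 = ack(m2, n2 - 1); m2 -= 1; goto inner;
    done:
        n = r;
    }
    m -= 1;
    goto begin;
}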
I am having trouble correctly accessing a variable in a Fortran DLL from a Fortran EXE when the variable is part of a COMMON block.
I have a trivial source file simple.f90 which I compile into a DLL using MSYS64/MinGW-w64 gfortran 9.2 as
x86_64-w64-mingw32-gfortran simple.f90 -o simple.dll -shared
! simple.f90
module m
implicit none
integer :: a, b
!common /numbers/ a, b
end module
subroutine init_vals
use m
implicit none
a = 1
b = 2
end subroutine
This library is used from an even simpler program prog.f90, compiled as
x86_64-w64-mingw32-gfortran prog.f90 -o prog -L. -lsimple
! prog.f90
program p
use m
implicit none
print *, 'Before', a, b
call init_vals
print *, 'After', a, b
end program
When the COMMON block /numbers/ is commented out, the code works and prints the expected result:
Before 0 0
After 1 2
However, when I uncomment the COMMON block, the output becomes
Before 0 0
After 0 0
as if the variables used by the program were suddenly distinct from those used in the library.
Both variants work equally well in a Linux-based OS with gfortran 9.1.
I am aware that "On some systems, procedures and global variables (module variables and COMMON blocks) need special handling to be accessible when they are in a shared library," as mentioned here: https://gcc.gnu.org/onlinedocs/gcc-4.9.0/gfortran/GNU-Fortran-Compiler-Directives.html. However, I was not able to insert a statement of the type
!GCC$ ATTRIBUTES DLLIMPORT :: numbers
or
!GCC$ ATTRIBUTES DLLEXPORT :: numbers
anywhere in the code without being snapped at by the compiler.
As pointed out by M. Chinoune in the comment, current gfortran lacks the ability to import common blocks from DLLs. Even though there has been a patch for some time, it is not yet merged. In the end, I needed two things to make the above code work:
First, apply the following patch to GCC 9.2 and compile the compiler manually in MSYS2:
--- gcc/fortran/trans-common.c.org 2019-03-11 14:58:44.000000000 +0100
+++ gcc/fortran/trans-common.c 2019-09-26 08:31:16.243405900 +0200
@@ -102,6 +102,7 @@
#include "trans.h"
#include "stringpool.h"
#include "fold-const.h"
+#include "attribs.h"
#include "stor-layout.h"
#include "varasm.h"
#include "trans-types.h"
@@ -423,6 +424,9 @@
/* If there is no backend_decl for the common block, build it. */
if (decl == NULL_TREE)
{
+ unsigned id;
+ tree attribute, attributes;
+
if (com->is_bind_c == 1 && com->binding_label)
decl = build_decl (input_location, VAR_DECL, identifier, union_type);
else
@@ -454,6 +458,23 @@
gfc_set_decl_location (decl, &com->where);
+ /* Add extension attributes to COMMON block declaration. */
+ if (com->head)
+ {
+ attributes = NULL_TREE;
+ for (id = 0; id < EXT_ATTR_NUM; id++)
+ {
+ if (com->head->attr.ext_attr & (1 << id))
+ {
+ attribute = build_tree_list (
+ get_identifier (ext_attr_list[id].middle_end_name),
+ NULL_TREE);
+ attributes = chainon (attributes, attribute);
+ }
+ }
+ decl_attributes (&decl, attributes, 0);
+ }
+
if (com->threadprivate)
set_decl_tls_model (decl, decl_default_tls_model (decl));
Second, only the line
!GCC$ ATTRIBUTES DLLIMPORT :: a, b
was needed in the main program (right after implicit none), but no exports anywhere. This is apparently a different syntactic approach than in Intel Fortran, where one imports the COMMON block rather than its constituents. I also found out that I needed to import both a and b even if I only needed b. (When only a was needed, importing a alone was enough.)
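For reference, the working main program then looks like this (the same prog.f90 as above, with only the directive added):
! prog.f90 (built with the patched gfortran)
program p
use m
implicit none
!GCC$ ATTRIBUTES DLLIMPORT :: a, b
print *, 'Before', a, b
call init_vals
print *, 'After', a, b
end program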
I'm working with a custom kernel char device which sometimes returns large negative values (in the thousands, say -2000) from its ioctl().
In userspace, I don't get these values back from the ioctl call. Instead I get a return value of -1 with errno set to the negated value from the kernel module (+2000).
As far as I can read and google, __syscall_return() is the macro that is supposed to interpret negative return values as errors. But it only seems to treat values between -1 and -125 as errors, so I didn't expect these large negative values to be translated.
Where are these return values translated? Is it expected behaviour?
I am on Linux 2.6.35.10 with EGLIBC 2.11.3-4+deb6u6.
The translation and the move to errno occur at the libc level. Both GNU libc and μClibc treat negative numbers down to at least -4095 as error conditions; see http://www.makelinux.net/ldd3/chp-6-sect-1
See https://github.molgen.mpg.de/git-mirror/glibc/blob/85b290451e4d3ab460a57f1c5966c5827ca807ca/sysdeps/unix/sysv/linux/aarch64/ioctl.S for the Gnu libc implementation of ioctl.
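A minimal sketch of what the wrapper effectively does after the kernel returns (a hand-written illustration of the convention, not glibc's actual code):
#include <errno.h>

/* Linux convention: a raw syscall return in [-4095, -1] encodes an error. */
long libc_wrap(long kernel_ret)
{
    if ((unsigned long)kernel_ret >= (unsigned long)-4095L) {
        errno = -kernel_ret;  /* e.g. a raw -2000 becomes errno == 2000 */
        return -1;            /* the caller sees -1, not -2000 */
    }
    return kernel_ret;        /* anything else is a valid result */
}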
So, with the help of BRPocock I will report my findings here.
The Linux kernel does an error check for all syscalls along the lines of (from unistd.h):
#define __syscall_return(type, res) \
do { \
if ((unsigned long)(res) >= (unsigned long)(-125)) { \
errno = -(res); \
res = -1; \
} \
return (type) (res); \
} while (0)
Libc also does an error check for all syscalls, along the lines of (from syscall.S):
.text
ENTRY (syscall)
PUSHARGS_6 /* Save register contents. */
_DOARGS_6(44) /* Load arguments. */
movl 20(%esp), %eax /* Load syscall number into %eax. */
ENTER_KERNEL /* Do the system call. */
POPARGS_6 /* Restore register contents. */
cmpl $-4095, %eax /* Check %eax for error. */
jae SYSCALL_ERROR_LABEL /* Jump to error handler if error. */
ret /* Return to caller. */
PSEUDO_END (syscall)
Glibc gives a reason for the -4095 value (from sysdep.h):
/* Linux uses a negative return value to indicate syscall errors,
unlike most Unices, which use the condition codes' carry flag.
Since version 2.1 the return value of a system call might be
negative even if the call succeeded. E.g., the `lseek' system call
might return a large offset. Therefore we must not anymore test
for < 0, but test for a real error by making sure the value in %eax
is a real error number. Linus said he will make sure the no syscall
returns a value in -1 .. -4095 as a valid result so we can savely
test with -4095. */
__syscall_return seems to be missing from newer kernels; I haven't researched that yet.
I'm trying to figure out how to run LLVM's built-in loop vectorizer. I have a small program containing an extremely simple loop (I had some output at one point, which is why stdio.h is still included despite never being used):
#include <stdio.h>

unsigned NUM_ELS = 10000;

int main() {
    int A[NUM_ELS];

#pragma clang loop vectorize(enable)
    for (int i = 0; i < NUM_ELS; ++i) {
        A[i] = i*2;
    }

    return 0;
}
As you can see, it does nothing at all useful; I just need the for loop to be vectorizable. I'm compiling it to LLVM bitcode with
clang -emit-llvm -O0 -c loop1.c -o loop1.bc
llvm-dis -f loop1.bc
Then I'm applying the vectorizer with
opt -loop-vectorize -force-vector-width=4 -S -debug loop1.ll
However, the debug output gives me this:
LV: Checking a loop in "main" from loop1.bc
LV: Loop hints: force=? width=4 unroll=0
LV: Found a loop: for.cond
LV: SCEV could not compute the loop exit count.
LV: Not vectorizing: Cannot prove legality.
I've dug around in the LLVM source a bit, and it looks like SCEV comes from the ScalarEvolution pass, whose job (among other things) is to count the number of back edges to the loop condition, which in this case (if I'm not mistaken) should be the trip count minus the first trip (so 9,999 here). I've run this pass on a much larger benchmark and it gives the exact same error at every loop, so I'm guessing the problem isn't the loop itself, but that I'm not giving it enough information.
I've spent quite a bit of time combing through the documentation and Google results to find an example of a full opt command using this transformation, but have been unsuccessful so far; I'd appreciate any hints as to what I may be missing (I'm new to vectorizing code so it could be something very obvious).
Thank you,
Stephen
Vectorization depends on a number of other optimizations which need to run first. They are not run at all at -O0, so you cannot expect your code to be 'just' vectorized there.
Adding -O2 before -loop-vectorize on the opt command line helps here (make sure your 'A' array is external or otherwise used, otherwise everything will be optimized away); see the sketch below.
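For example, something along these lines (a command sketch; exact flag spellings vary between LLVM versions):
clang -emit-llvm -O0 -c loop1.c -o loop1.bc
llvm-dis -f loop1.bc
opt -O2 -loop-vectorize -force-vector-width=4 -S -debug loop1.ll
One simple way to keep A from being optimized away is to end main with return A[NUM_ELS - 1]; instead of return 0;.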
I wanted to check whether g++ supports tail calls, so I wrote this simple program to check it: http://ideone.com/hnXHv
#include <cstddef>
#include <iostream>
#include <string>

using namespace std;
size_t st;
void PrintStackTop(const std::string &type)
{
int stack_top;
if(st == 0) st = (size_t) &stack_top;
cout << "In " << type << " call version, the stack top is: " << (st - (size_t) &stack_top) << endl;
}
int TailCallFactorial(int n, int a = 1)
{
PrintStackTop("tail");
if(n < 2)
return a;
return TailCallFactorial(n - 1, n * a);
}
int NormalCallFactorial(int n)
{
PrintStackTop("normal");
if(n < 2)
return 1;
return NormalCallFactorial(n - 1) * n;
}
int main(int argc, char *argv[])
{
st = 0;
cout << TailCallFactorial(5) << endl;
st = 0;
cout << NormalCallFactorial(5) << endl;
return 0;
}
When I compiled it normally it seems g++ doesn't really notice any difference between the two versions:
> g++ main.cpp -o TailCall
> ./TailCall
In tail call version, the stack top is: 0
In tail call version, the stack top is: 48
In tail call version, the stack top is: 96
In tail call version, the stack top is: 144
In tail call version, the stack top is: 192
120
In normal call version, the stack top is: 0
In normal call version, the stack top is: 48
In normal call version, the stack top is: 96
In normal call version, the stack top is: 144
In normal call version, the stack top is: 192
120
The stack difference is 48 in both of them, while the tail call version needs one more int. (Why?)
So I thought optimization might be handy:
> g++ -O2 main.cpp -o TailCall
> ./TailCall
In tail call version, the stack top is: 0
In tail call version, the stack top is: 80
In tail call version, the stack top is: 160
In tail call version, the stack top is: 240
In tail call version, the stack top is: 320
120
In normal call version, the stack top is: 0
In normal call version, the stack top is: 64
In normal call version, the stack top is: 128
In normal call version, the stack top is: 192
In normal call version, the stack top is: 256
120
The stack size increased in both cases, and while the compiler might think my CPU is slower than my memory (which it's not anyway), I don't know why 80 bytes are necessary for a simple function. (Why is it?)
The tail call version also takes more space than the normal version, which would be completely logical if an int were 16 bytes. (No, I don't own a 128-bit CPU.)
Now, thinking about what reason the compiler might have not to make tail calls, I thought it might be exceptions, because they depend tightly on the stack. So I tried without exceptions:
> g++ -O2 -fno-exceptions main.cpp -o TailCall
> ./TailCall
In tail call version, the stack top is: 0
In tail call version, the stack top is: 64
In tail call version, the stack top is: 128
In tail call version, the stack top is: 192
In tail call version, the stack top is: 256
120
In normal call version, the stack top is: 0
In normal call version, the stack top is: 48
In normal call version, the stack top is: 96
In normal call version, the stack top is: 144
In normal call version, the stack top is: 192
120
That cut the normal version back to the non-optimized stack size, while the tail call version is 8 bytes over it. Still, an int is not 8 bytes.
I thought there was something I missed in C++ that needs the stack arranged in some way, so I tried C: http://ideone.com/tJPpc
Still no tail calls, but the stack is much smaller (32 bytes per frame in both versions).
Then I tried with optimization:
> gcc -O2 main.c -o TailCall
> ./TailCall
In tail call version, the stack top is: 0
In tail call version, the stack top is: 0
In tail call version, the stack top is: 0
In tail call version, the stack top is: 0
In tail call version, the stack top is: 0
120
In normal call version, the stack top is: 0
In normal call version, the stack top is: 0
In normal call version, the stack top is: 0
In normal call version, the stack top is: 0
In normal call version, the stack top is: 0
120
Not only did it tail-call optimize the first version, it also tail-call optimized the second!
Why doesn't g++ do tail call optimization when it's clearly available on the platform? Is there any way to force it?
Because you're passing a temporary std::string object to the PrintStackTop(std::string) function. This object is allocated on the stack and thus prevents the tail call optimization.
I modified your code:
void PrintStackTopStr(char const*const type)
{
int stack_top;
if(st == 0) st = (size_t) &stack_top;
cout << "In " << type << " call version, the stack top is: " << (st - (size_t) &stack_top) << endl;
}
int RealTailCallFactorial(int n, int a = 1)
{
PrintStackTopStr("tail");
if(n < 2)
return a;
return RealTailCallFactorial(n - 1, n * a);
}
Compile with: g++ -O2 -fno-exceptions -o tailcall tailcall.cpp
And it now uses the tail call optimisation. You can see it in action if you use the -S flag to produce the assembly:
L39:
imull %ebx, %esi
subl $1, %ebx
L38:
movl $LC2, (%esp)
call __Z16PrintStackTopStrPKc
cmpl $1, %ebx
jg L39
You can see that the recursive call has been turned into a loop (jg L39).
I don't find the other answer satisfying because a local object has no effect on the stack once it's gone.
Here is a good article which mentions that the lifetime of local objects extends into the tail-called function. Tail call optimization requires destroying locals before relinquishing control, so GCC will not apply it unless it is sure that no local object will be accessed by the tail call.
Lifetime analysis is hard though, and it looks like it's being done too conservatively. Setting a global pointer to reference a local disables TCO even if the local's lifetime (scope) ends before the tail call.
{
int x;
static int * p;
p = & x;
} // x is dead here, but the enclosing function still has TCO disabled.
This still doesn't seem to model what's happening, so I found another bug: passing a local to a parameter with a user-defined or otherwise non-trivial destructor also disables TCO. (Defining the destructor as = delete allows TCO.)
std::string has a nontrivial destructor, so that's causing the issue here.
The workaround is to do these things in a nested function call, because lifetime analysis will then be able to tell that the object is dead by the tail call. But there's no need to forgo all C++ objects.
The original code with the temporary std::string object is still tail recursive, since the destructor for that object is executed immediately after PrintStackTop returns, so nothing needs to be executed after the recursive return statement.
However, there are two issues that confuse tail call optimization (TCO):
the argument is passed by reference to the PrintStackTop function
the non-trivial destructor of std::string
It can be verified with a custom class that each of these two issues on its own is able to break TCO; see the sketch below.
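For instance, a minimal sketch of such a check (the probe types are hypothetical; same g++ -O2 setup as above), where each variant reintroduces exactly one of the two issues:
// Variant 1: temporary with a non-trivial destructor, passed by value.
struct NonTrivial { ~NonTrivial() {} };
void SinkByValue(NonTrivial) {}

// Variant 2: trivially destructible, but the temporary is bound to a reference.
struct Trivial {};
void SinkByRef(const Trivial &) {}

int FactorialV1(int n, int a = 1)
{
    SinkByValue(NonTrivial()); // non-trivial destructor alone breaks TCO
    if (n < 2) return a;
    return FactorialV1(n - 1, n * a);
}

int FactorialV2(int n, int a = 1)
{
    SinkByRef(Trivial());      // pass-by-reference alone breaks TCO
    if (n < 2) return a;
    return FactorialV2(n - 1, n * a);
}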
As noted in the previous answer by @Potatoswatter, there is a workaround for both of these issues: it is enough to wrap the call to PrintStackTop in another function to help the compiler perform TCO even with a temporary std::string:
void PrintStackTopTail()
{
PrintStackTop("tail");
}
int TailCallFactorial(int n, int a = 1)
{
PrintStackTopTail();
//...
}
Note that it is not enough to limit the scope by enclosing { PrintStackTop("tail"); } in curly braces; the call must be moved into a separate function.
Now it can be verified with g++ version 4.7.2 (compilation options -O2) that tail recursion is replaced by a loop.
A similar issue is observed in Pass-by-reference hinders gcc from tail call elimination.
Note that printing (st - (size_t) &stack_top) is not enough to be sure that TCO is performed: with the optimization option -O3, for example, the function TailCallFactorial is inlined into itself five times, so TailCallFactorial(5) executes as a single function call, and the issue is only revealed for larger argument values (for example TailCallFactorial(15);). TCO is better verified by reviewing the assembly output generated with the -S flag, for example:
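A command sketch:
g++ -O2 -fno-exceptions -S main.cpp -o main.s
grep -n 'call.*Factorial' main.s
The calls from main will still show up in the grep output; what matters is whether each factorial function's own body contains a call to itself or only a backward jump.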
execve(prog[0], prog, env);
return 0;
execve() does not return on success, and the text, data, bss, and stack of the calling process are overwritten by that of the program loaded.
What's return 0; for?
I suggest it is there to silence this compiler warning:
$ cat | gcc -W -Wall -x c -
int main(){}
^D
<stdin>: In function 'main':
<stdin>:1:1: warning: control reaches end of non-void function
This will also keep static analyzers and IDEs happy, since they warn about the same thing.
That line is there in case execve() somehow fails and does return. Theoretically that should never happen, but it does sometimes. Often, the return value is set to some nonzero number to signify that there was an error.
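A minimal sketch of the usual idiom (the program vector here is hypothetical); everything after execve() only runs if the call failed:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char *prog[] = { "/bin/ls", NULL };  /* hypothetical program to exec */
    char *env[] = { NULL };

    execve(prog[0], prog, env);  /* replaces this process on success */
    perror("execve");            /* only reached if execve failed */
    return 1;                    /* nonzero exit signals the error */
}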