Why isn't g++ tail call optimizing while gcc is?

I wanted to check whether g++ supports tail calling so I wrote this simple program to check it: http://ideone.com/hnXHv
#include <cstddef>
#include <iostream>
#include <string>
using namespace std;

size_t st;

void PrintStackTop(const std::string &type)
{
    int stack_top;
    // Remember the address of the first frame, then print the distance to it.
    if(st == 0) st = (size_t) &stack_top;
    cout << "In " << type << " call version, the stack top is: " << (st - (size_t) &stack_top) << endl;
}

int TailCallFactorial(int n, int a = 1)
{
    PrintStackTop("tail");
    if(n < 2)
        return a;
    return TailCallFactorial(n - 1, n * a);
}

int NormalCallFactorial(int n)
{
    PrintStackTop("normal");
    if(n < 2)
        return 1;
    return NormalCallFactorial(n - 1) * n;
}

int main(int argc, char *argv[])
{
    st = 0;
    cout << TailCallFactorial(5) << endl;
    st = 0;
    cout << NormalCallFactorial(5) << endl;
    return 0;
}
When I compiled it normally it seems g++ doesn't really notice any difference between the two versions:
> g++ main.cpp -o TailCall
> ./TailCall
In tail call version, the stack top is: 0
In tail call version, the stack top is: 48
In tail call version, the stack top is: 96
In tail call version, the stack top is: 144
In tail call version, the stack top is: 192
120
In normal call version, the stack top is: 0
In normal call version, the stack top is: 48
In normal call version, the stack top is: 96
In normal call version, the stack top is: 144
In normal call version, the stack top is: 192
120
The stack difference is 48 in both of them, while the tail call version needs one more int. (Why?)
So I thought optimization might be handy:
> g++ -O2 main.cpp -o TailCall
> ./TailCall
In tail call version, the stack top is: 0
In tail call version, the stack top is: 80
In tail call version, the stack top is: 160
In tail call version, the stack top is: 240
In tail call version, the stack top is: 320
120
In normal call version, the stack top is: 0
In normal call version, the stack top is: 64
In normal call version, the stack top is: 128
In normal call version, the stack top is: 192
In normal call version, the stack top is: 256
120
The stack size increased in both cases, and while the compiler might think my CPU is slower than my memory (which it's not anyway), I don't know why 80 bytes are necessary for a simple function. (Why is that?)
The tail call version also takes more space than the normal version, and it's completely logical if an int has the size of 16 bytes. (No, I don't own a 128 bit CPU.)
Now, wondering what reason the compiler has not to tail call, I thought it might be exceptions, because they depend tightly on the stack. So I tried without exceptions:
> g++ -O2 -fno-exceptions main.cpp -o TailCall
> ./TailCall
In tail call version, the stack top is: 0
In tail call version, the stack top is: 64
In tail call version, the stack top is: 128
In tail call version, the stack top is: 192
In tail call version, the stack top is: 256
120
In normal call version, the stack top is: 0
In normal call version, the stack top is: 48
In normal call version, the stack top is: 96
In normal call version, the stack top is: 144
In normal call version, the stack top is: 192
120
That cut the normal version back to the non-optimized stack size, while the tail call version is 16 bytes over it. Still, an int is not 16 bytes.
I thought there was something I had missed in C++ that needs the stack arranged this way, so I tried C: http://ideone.com/tJPpc
Still no tail calling, but the stack is much smaller (32 bytes per frame in both versions).
Then I tried with optimization:
> gcc -O2 main.c -o TailCall
> ./TailCall
In tail call version, the stack top is: 0
In tail call version, the stack top is: 0
In tail call version, the stack top is: 0
In tail call version, the stack top is: 0
In tail call version, the stack top is: 0
120
In normal call version, the stack top is: 0
In normal call version, the stack top is: 0
In normal call version, the stack top is: 0
In normal call version, the stack top is: 0
In normal call version, the stack top is: 0
120
Not only did it tail call optimize the first, it also tail call optimized the second!
Why doesn't g++ do tail call optimization while it's clearly available on the platform? Is there any way to force it?

Because you're passing a temporary std::string object to the PrintStackTop(std::string) function. This object is allocated on the stack and thus prevents the tail call optimization.
I modified your code:
void PrintStackTopStr(char const*const type)
{
    int stack_top;
    if(st == 0) st = (size_t) &stack_top;
    cout << "In " << type << " call version, the stack top is: " << (st - (size_t) &stack_top) << endl;
}
int RealTailCallFactorial(int n, int a = 1)
{
    PrintStackTopStr("tail");
    if(n < 2)
        return a;
    return RealTailCallFactorial(n - 1, n * a);
}
Compile with: g++ -O2 -fno-exceptions -o tailcall tailcall.cpp
And it now uses the tail call optimisation. You can see it in action if you use the -S flag to produce the assembly:
L39:
imull %ebx, %esi
subl $1, %ebx
L38:
movl $LC2, (%esp)
call __Z16PrintStackTopStrPKc
cmpl $1, %ebx
jg L39
You can see that the recursive call has been turned into a loop (jg L39).
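For readers who prefer C++ to assembly, the loop in the listing above corresponds roughly to the sketch below. The name FactorialLoop is hypothetical and the print call is left out; this is a hand-written illustration of what tail call elimination does, not GCC's actual output.

```cpp
// Tail-recursive form, as in the question (printing omitted).
int TailCallFactorial(int n, int a = 1)
{
    if (n < 2)
        return a;
    return TailCallFactorial(n - 1, n * a);
}

// Hypothetical hand-written equivalent of the optimized code: the recursive
// call becomes an update of the parameters followed by a jump back to the top.
int FactorialLoop(int n, int a = 1)
{
    while (n > 1) {   // cmpl $1, %ebx / jg L39
        a = n * a;    // imull %ebx, %esi
        n = n - 1;    // subl $1, %ebx
    }
    return a;
}
```

Both functions compute the same result, but only the second is guaranteed to run in a single stack frame regardless of optimization settings.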

I don't find the other answer satisfying because a local object has no effect on the stack once it's gone.
Here is a good article which mentions that the lifetime of local objects extends into the tail-called function. Tail call optimization requires destroying locals before relinquishing control, so GCC will not apply it unless it is sure that no local object will be accessed by the tail call.
Lifetime analysis is hard though, and it looks like it's being done too conservatively. Setting a global pointer to reference a local disables TCO even if the local's lifetime (scope) ends before the tail call.
{
    int x;
    static int * p;
    p = & x;
} // x is dead here, but the enclosing function still has TCO disabled.
This still doesn't seem to model what's happening, so I found another bug. Passing a local to a parameter with a user-defined or non-trivial destructor also disables TCO. (Defining the destructor = delete allows TCO.)
std::string has a nontrivial destructor, so that's causing the issue here.
The workaround is to do these things in a nested function call, because lifetime analysis will then be able to tell that the object is dead by the tail call. But there's no need to forgo all C++ objects.

The original code with temporary std::string object is still tail recursive, since the destructor for that object is executed immediately after exit from PrintStackTop("");, so nothing should be executed after the recursive return statement.
However, there are two issues that lead to confusion of tail call optimization (TCO):
the argument is passed by reference to the PrintStackTop function
non-trivial destructor of std::string
It can be verified with a custom class that each of these two issues alone is able to break TCO.
As noted in the previous answer by @Potatoswatter, there is a workaround for both of these issues. It is enough to wrap the call to PrintStackTop in another function to help the compiler perform TCO even with a temporary std::string:
void PrintStackTopTail()
{
    PrintStackTop("tail");
}
int TailCallFactorial(int n, int a = 1)
{
    PrintStackTopTail();
    //...
}
Note that it is not enough to limit the scope by enclosing { PrintStackTop("tail"); } in curly braces. The call must be wrapped in a separate function.
Now it can be verified with g++ version 4.7.2 (compilation option -O2) that the tail recursion is replaced by a loop.
A similar issue is discussed in "Pass-by-reference hinders gcc from tail call elimination".
Note that printing (st - (size_t) &stack_top) is not enough to be sure that TCO is performed, for example with the optimization option -O3 the function TailCallFactorial is self inlined five times, so TailCallFactorial(5) is executed as a single function call, but the issue is revealed for larger argument values (for example for TailCallFactorial(15);). So, the TCO may be verified by reviewing assembly output generated with the -S flag.

Related

Valgrind reports "invalid write" at "X bytes below stack pointer"

I'm running some code under Valgrind, compiled with gcc 7.5 targeting an aarch64 (ARM 64 bits) architecture, with optimizations enabled.
I get the following error:
==3580== Invalid write of size 8
==3580== at 0x38865C: ??? (in ...)
==3580== Address 0x1ffeffdb70 is on thread 1's stack
==3580== 16 bytes below stack pointer
This is the assembly dump in the vicinity of the offending code:
388640: a9bd7bfd stp x29, x30, [sp, #-48]!
388644: f9000bfc str x28, [sp, #16]
388648: a9024ff4 stp x20, x19, [sp, #32]
38864c: 910003fd mov x29, sp
388650: d1400bff sub sp, sp, #0x2, lsl #12
388654: 90fff3f4 adrp x20, 204000 <_IO_stdin_used-0x4f0>
388658: 3dc2a280 ldr q0, [x20, #2688]
38865c: 3c9f0fe0 str q0, [sp, #-16]!
I'm trying to ascertain whether this is a possible bug in my code (note that I've thoroughly reviewed my code and I'm fairly confident it's correct), or whether Valgrind will blindly report any writes below the stack pointer as an error.
Assuming the latter, it looks like a Valgrind bug since the offending instruction at 0x38865c uses the pre-decrement addressing mode, so it's not actually writing below the stack pointer.
Furthermore, at address 0x388640 a similar access (and again with pre-decrement addressing mode) is performed, yet this isn't reported by Valgrind; the main difference being the use of an x register at address 0x388640 versus a q register at address 0x38865c.
I'd also like to draw attention to the large stack pointer subtraction at 0x388650, which may or may not have anything to do with the issue (note this subtraction makes sense, given that the offending C code declares a large array on the stack).
So, will anyone help me make sense of this, and whether I should worry about my code?

The code looks fine, and the write is certainly not below the stack pointer. The message seems to be a valgrind bug, possibly #432552, which is marked as fixed. OP confirms that the message is not produced after upgrading valgrind to 3.17.0.

"code declares a large array on the stack"
"should [I] worry about my code?"
I think it depends upon your desire for your code to be more portable.
Take this bit of code that I believe represents at least one important thing you mentioned in your post:
#include <stdio.h>
#include <stdlib.h>

long long foo (long long sz, long long v) {
    long long arr[sz]; // variable-length array allocated on the stack
    arr[sz-1] = v;
    return arr[sz-1];
}

int main (int argc, char *argv[]) {
    if (argc < 2) return 1; // expects the array size as the first argument
    long long n = atoll(argv[1]);
    long long v = foo(n, n);
    printf("v = %lld\n", v);
    return 0;
}
$ uname -mprsv
Darwin 20.5.0 Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64 x86_64 i386
$ gcc test.c
$ a.out 1047934
v = 1047934
$ a.out 1047935
Segmentation fault: 11
$ uname -snrvmp
Linux localhost.localdomain 3.19.8-100.fc20.x86_64 #1 SMP Tue May 12 17:08:50 UTC 2015 x86_64 x86_64
$ gcc test.c
$ ./a.out 2147483647
v = 2147483647
$ ./a.out 2147483648
v = 2147483648
There are at least some minor portability concerns with this code. The amount of allocatable stack memory for these two environments differs significantly. And that's only for two platforms. Haven't tried it on my Windows 10 vm but I don't think I need to because I got bit by this one a long time ago.
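One way to sidestep the platform-dependent stack limit is to put the array on the heap instead. The sketch below is a hypothetical C++ variant of foo() (the name foo_heap and the handling of non-positive sizes are assumptions made for illustration, not part of the original program):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical heap-based variant of foo(): the storage comes from the heap,
// so the usable size no longer depends on the platform's stack limit.
long long foo_heap(long long sz, long long v)
{
    if (sz <= 0)
        return 0;  // assumption: treat non-positive sizes as "nothing to store"
    std::vector<long long> arr(static_cast<std::size_t>(sz));
    arr[arr.size() - 1] = v;
    return arr[arr.size() - 1];
}
```

If the allocation cannot be satisfied, std::vector reports it by throwing std::bad_alloc, which can be caught and handled, rather than overflowing the stack.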

Beyond the OP's issue, which was due to a Valgrind bug, the title of this question is bound to attract more people (like me) who are getting "invalid write at X bytes below stack pointer" as a legitimate error.
My piece of advice: check that the address you're writing to is not a local variable of another function (not present in the call stack)!
I stumbled upon this issue while attempting to write into the address returned by yyget_lloc(yyscanner) while outside of function yyparse (the former returns the address of a local variable in the latter).

Logical NOT operation in JVM

I'm trying to mimic the behavior of a NOT gate using Jasmin. The behavior is as follows:
Pop an integer off the stack
if the integer is 0, push 1 back onto the stack
else push 0 back onto the stack
I've tried two different attempts at this but to no avail.
Attempt 1:
...(other code1)
ifeq 3 ; if the top of stack is 0, jump 3 lines down to "iconst_1"
iconst_0 ; top of stack was not 0, so we push 0
goto 2 ; jump 2 lines down to first line of (other code2)
iconst_1
...(other code2)
Of course, the above example doesn't work because ifeq <offset> takes in a Label rather than a hard-coded integer as its offset. Is there a similar operation to ifeq that does accept integers as a parameter?
Attempt 2:
...
ifeq Zero ; top of stack is 0, so jump to Zero
iconst_0 ; top of stack was nonzero, so we push 0
...
... (some code in between)
...
ifeq Zero ; top of stack is 0, so jump to Zero
iconst_0 ; top of stack was nonzero, so we push 0
...
Zero:
iconst_1 ; top of stack was 0, so push 1 to stack
goto <???> ; How do I know which "ifeq Zero" called this label?
The problem with this is that I've more than one place in my code making use of the NOT operation. I tried using ifeq with labels, but after I'm done how do I know which line to return to using goto? Is there a way to dynamically determine which "ifeq Zero" made the jump?
Any insight would be greatly appreciated.

"Is there a similar operation to ifeq that does accept integers as a parameter?"
Yes, you can specify relative offsets using the $ sign.
But the relative offsets are counted in bytes, not in lines.
ifeq $+7 ; 0: jump 7 bytes forward from this instruction
iconst_0 ; +3
goto $+4 ; +4: jump 4 bytes forward, past iconst_1
iconst_1 ; +7
... ; +8
"Is there a way to dynamically determine which "ifeq Zero" made the jump?"
No. Use multiple different labels instead of a single Zero.
Well, there is actually a pair of bytecodes (jsr/ret) that support a dynamic return address. But these bytecodes are deprecated and are not allowed in class files of version 51.0 (Java 7) and above.

Linux ioctl return value interpreted by who?

I'm working with a custom kernel char device which sometimes returns large negative values (around the thousands, say -2000) for its ioctl().
In userspace, I don't get these values returned from the ioctl call. Instead I get a return value of -1 back with errno set to the negated value from the kernel module (+2000).
As far as I can read and google, __syscall_return() is the macro which is supposed to interpret negative return values as errors. But, it only seems to look for values between -1 and -125. So I didn't expect these large negative values to be translated.
Where are these return values translated? Is it expected behaviour?
I am on Linux 2.6.35.10 with EGLIBC 2.11.3-4+deb6u6.

The translation and the move to errno occur at the libc level. Both GNU libc and μClibc treat negative numbers down to at least -4095 as error conditions, per http://www.makelinux.net/ldd3/chp-6-sect-1
See https://github.molgen.mpg.de/git-mirror/glibc/blob/85b290451e4d3ab460a57f1c5966c5827ca807ca/sysdeps/unix/sysv/linux/aarch64/ioctl.S for the GNU libc implementation of ioctl.

So, with the help of BRPocock, I will report my findings here.
The Linux kernel does an error check for all syscalls along the lines of (from unistd.h):
#define __syscall_return(type, res) \
do { \
    if ((unsigned long)(res) >= (unsigned long)(-125)) { \
        errno = -(res); \
        res = -1; \
    } \
    return (type) (res); \
} while (0)
Libc will also do an error check for all syscalls along the lines of (from syscall.S):
.text
ENTRY (syscall)
PUSHARGS_6 /* Save register contents. */
_DOARGS_6(44) /* Load arguments. */
movl 20(%esp), %eax /* Load syscall number into %eax. */
ENTER_KERNEL /* Do the system call. */
POPARGS_6 /* Restore register contents. */
cmpl $-4095, %eax /* Check %eax for error. */
jae SYSCALL_ERROR_LABEL /* Jump to error handler if error. */
ret /* Return to caller. */
PSEUDO_END (syscall)
Glibc gives a reason for the 4095 value (from sysdep.h):
/* Linux uses a negative return value to indicate syscall errors,
unlike most Unices, which use the condition codes' carry flag.
Since version 2.1 the return value of a system call might be
negative even if the call succeeded. E.g., the `lseek' system call
might return a large offset. Therefore we must not anymore test
for < 0, but test for a real error by making sure the value in %eax
is a real error number. Linus said he will make sure the no syscall
returns a value in -1 .. -4095 as a valid result so we can savely
test with -4095. */
__syscall_return seems to be missing from newer kernels; I haven't researched that yet.
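To make the libc-side check concrete, here is a minimal C++ model of it. The function name TranslateSyscallResult is hypothetical; real libc does this in assembly or inline macros, as the syscall.S excerpt above shows, so treat this only as a sketch of the logic.

```cpp
#include <cerrno>

// Hypothetical model of the libc-side check: a raw kernel return value in
// the range [-4095, -1] is treated as an error code, anything else is passed
// through unchanged.
long TranslateSyscallResult(long raw)
{
    // The unsigned comparison is true exactly for raw in [-4095, -1],
    // mirroring "cmpl $-4095, %eax; jae SYSCALL_ERROR_LABEL".
    if (static_cast<unsigned long>(raw) >= static_cast<unsigned long>(-4095L)) {
        errno = static_cast<int>(-raw);
        return -1;
    }
    return raw;
}
```

Under this model, a driver returning -2000 from its ioctl handler shows up in userspace as a return value of -1 with errno set to 2000, which matches the behaviour described in the question.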

gcc transformation pass for ackermann

gcc (I tried 4.7.2 on Mac and Linux with the -O3 flag) optimizes the Ackermann function to a single call with a big local stack. Example Ackermann code is below:
int ack(int m,int n){
    if(m == 0) return n+1;
    if(n == 0) return ack(m-1,1);
    return ack(m-1,ack(m,n-1));
}
When disassembled, there is only one recursive call to the ack function instead of two. (I couldn't parse what is going on: ack is now transformed by gcc into a function with 8 arguments and a local stack of 49 ints and 9 chars.) I tried to look up which transformation passes gcc used to optimize the Ackermann function to a single call, but didn't find anything of interest. I would appreciate pointers on the major transformation passes gcc performed to convert the deeply recursive Ackermann function to a single recursive call. LLVM gcc (I tried v4.2 on Mac) doesn't reduce it to a single recursive call yet, and is 4x slower with the -O3 flag. This optimization seems very interesting.

The first pass is tail-call elimination. GCC does this at most optimization levels. Essentially, all function calls in tail position are transformed into gotos, like this:
int ack(int m, int n) {
begin:
    if (m == 0) return n + 1;
    if (n == 0) { m -= 1; n = 1; goto begin; }
    n = ack(m, n-1); m -= 1; goto begin;
}
Now there is only one recursive call remaining, and GCC, at the -O3 level only, inlines this for a couple of iterations, resulting in the huge monster you saw.

What's the line after execve for since it doesn't return on success?

26: execve(prog[0],prog,env);
27: return 0;
execve() does not return on success, and the text, data, bss, and
stack of the calling process are overwritten by that of the program
loaded.
what's return 0; for?

I suspect it is there to silence this compiler warning.
$ cat | gcc -W -Wall -x c -
int main(){}
^D
<stdin>: In function 'main':
<stdin>:1:1: warning: control reaches end of non-void function
It also keeps static analyzers and IDEs from warning about the same thing.

That line is there in case execve() fails and does return. On success it never returns, but it can fail (for example, when the program file does not exist), in which case it returns -1 and sets errno. Often, the code after execve() returns some nonzero value rather than 0 to signify that there was an error.
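The failure path can be sketched as follows. The helper name ExecOrErrno is hypothetical, and the empty environment is an assumption made to keep the example short; the only library call used is the POSIX execve() itself.

```cpp
#include <cerrno>
#include <unistd.h>

// Hypothetical helper: tries to replace the process image with `path`.
// It only returns if execve() failed, and then reports the errno value.
int ExecOrErrno(const char *path)
{
    char *const argv[] = { const_cast<char *>(path), nullptr };
    char *const envp[] = { nullptr };  // assumption: empty environment
    execve(path, argv, envp);
    // Reached only on failure: execve() returned -1 and set errno.
    return errno;
}
```

A caller would typically report the error (for example with perror) and exit with a nonzero status at the point where the original snippet has return 0;.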
