How to read the assembly code dump by JIT C2 - jvm

Below is the assembly code dump output from JIT C2.
It performs a function call (callq), but in the comment section the JIT prints a call stack.
Does this imply that inlining is only applied up to SomeClass::SomeMethod? Thanks for answering.
0x00007f4a9f4f269f: callq 0x00007f4a9d0453e0 ; OopMap{rbp=Oop [288]=Oop [312]=Oop [112]=Oop [120]=Oop [128]=Oop [136]=Oop [176]=Oop [192]=Oop off=4132}
;*if_icmpeq
; - org.apache.spark.xyz.abc.SomeClass::SomeMethod#178 (line 87)
; - org.apache.spark.abc.xyz.OtherClass::OtherMethod#575 (line 561)
; {runtime_call}

The comments show the current inlining stack corresponding to the given machine instruction.
;*if_icmpeq
; - org.apache.spark.xyz.abc.SomeClass::SomeMethod#178 (line 87)
; - org.apache.spark.abc.xyz.OtherClass::OtherMethod#575 (line 561)
In particular, the above comment means that the corresponding call instruction was generated as a result of compiling if_icmpeq bytecode at SomeClass.SomeMethod, which was inlined into the compilation of OtherClass.OtherMethod.
Here OtherClass.OtherMethod is the root method being compiled, and SomeClass.SomeMethod is a method at the first inlining level.
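To make the reading order concrete, here is a minimal sketch that parses the comment lines quoted above. The regular expression and the helper name inlining_stack are my own and only assume the "; - Class::Method#bci (line N)" shape seen in this dump; other JVM builds may print the frames differently.

```python
import re

# Hypothetical helper: matches the "; - Class::Method#bci (line N)" comment
# lines exactly as they appear in the dump quoted above.
FRAME = re.compile(
    r";\s*-\s*(?P<cls>[\w.$]+)::(?P<method>[\w$<>]+)#(?P<bci>\d+)\s*\(line (?P<line>\d+)\)"
)

def inlining_stack(comment_lines):
    """Return the frames in printed order: innermost first, root last."""
    frames = []
    for text in comment_lines:
        m = FRAME.search(text)
        if m:
            frames.append((m["cls"], m["method"], int(m["bci"]), int(m["line"])))
    return frames

dump = [
    ";*if_icmpeq",
    "; - org.apache.spark.xyz.abc.SomeClass::SomeMethod#178 (line 87)",
    "; - org.apache.spark.abc.xyz.OtherClass::OtherMethod#575 (line 561)",
]

frames = inlining_stack(dump)
root = frames[-1]        # last line: the method being compiled
inlined = frames[:-1]    # everything printed above it was inlined into the root

print("root:", root[0], root[1])
for level, frame in enumerate(reversed(inlined), start=1):
    print("inlining level", level, ":", frame[0], frame[1])
```

The last frame printed is the compilation root, and each line above it is one inlining level deeper.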

Related

Valgrind reports "invalid write" at "X bytes below stack pointer"

I'm running some code under Valgrind, compiled with gcc 7.5 targeting an aarch64 (ARM 64 bits) architecture, with optimizations enabled.
I get the following error:
==3580== Invalid write of size 8
==3580== at 0x38865C: ??? (in ...)
==3580== Address 0x1ffeffdb70 is on thread 1's stack
==3580== 16 bytes below stack pointer
This is the assembly dump in the vicinity of the offending code:
388640: a9bd7bfd stp x29, x30, [sp, #-48]!
388644: f9000bfc str x28, [sp, #16]
388648: a9024ff4 stp x20, x19, [sp, #32]
38864c: 910003fd mov x29, sp
388650: d1400bff sub sp, sp, #0x2, lsl #12
388654: 90fff3f4 adrp x20, 204000 <_IO_stdin_used-0x4f0>
388658: 3dc2a280 ldr q0, [x20, #2688]
38865c: 3c9f0fe0 str q0, [sp, #-16]!
I'm trying to ascertain whether this is a possible bug in my code (note that I've thoroughly reviewed my code and I'm fairly confident it's correct), or whether Valgrind will blindly report any writes below the stack pointer as an error.
Assuming the latter, it looks like a Valgrind bug since the offending instruction at 0x38865c uses the pre-decrement addressing mode, so it's not actually writing below the stack pointer.
Furthermore, at address 0x388640 a similar access (again with pre-decrement addressing mode) is performed, yet this isn't reported by Valgrind; the main difference is the use of an x register at address 0x388640 versus a q register at address 0x38865c.
I'd also like to draw attention to the large stack pointer subtraction at 0x388650, which may or may not have anything to do with the issue (note this subtraction makes sense, given that the offending C code declares a large array on the stack).
So, will anyone help me make sense of this, and whether I should worry about my code?
The code looks fine, and the write is certainly not below the stack pointer. The message seems to be a Valgrind bug, possibly #432552, which is marked as fixed. The OP confirms that the message is no longer produced after upgrading Valgrind to 3.17.0.
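A small sketch of the address arithmetic may help here. It models only the pre-indexed form str q0, [sp, #-16]! (base register updated by the offset, store performed at the updated address). The value of sp before the instruction is inferred from the Valgrind message and is an assumption, not something shown in the dump.

```python
# Model of an AArch64 pre-indexed store such as: str q0, [sp, #-16]!
# The address used for the store is base + offset, and the base register
# is written back with that same value.
def pre_indexed_store(sp, offset, size):
    new_sp = sp + offset
    written = range(new_sp, new_sp + size)   # bytes touched by the store
    return new_sp, written

reported_addr = 0x1ffeffdb70       # "Address 0x1ffeffdb70" from the report
sp_before = reported_addr + 16     # assumption, from "16 bytes below stack pointer"

sp_after, written = pre_indexed_store(sp_before, -16, 16)
below_new_sp = any(addr < sp_after for addr in written)

print(hex(sp_after), hex(written.start), hex(written.stop - 1), below_new_sp)
```

The write is 16 bytes below the old stack pointer but starts exactly at the new one, which is why the report looks like a false positive.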
"code declares a large array on the stack"
"should [I] worry about my code?"
I think it depends upon your desire for your code to be more portable.
Take this bit of code that I believe represents at least one important thing you mentioned in your post:
#include <stdio.h>
#include <stdlib.h>
long long foo (long long sz, long long v) {
    long long arr[sz]; // allocating a variable-length array on the stack
    arr[sz-1] = v;
    return arr[sz-1];
}
int main (int argc, char *argv[]) {
    long long n = atoll(argv[1]);
    long long v = foo(n, n);
    printf("v = %lld\n", v);
}
$ uname -mprsv
Darwin 20.5.0 Darwin Kernel Version 20.5.0: Sat May 8 05:10:33 PDT 2021; root:xnu-7195.121.3~9/RELEASE_X86_64 x86_64 i386
$ gcc test.c
$ a.out 1047934
v = 1047934
$ a.out 1047935
Segmentation fault: 11
$ uname -snrvmp
Linux localhost.localdomain 3.19.8-100.fc20.x86_64 #1 SMP Tue May 12 17:08:50 UTC 2015 x86_64 x86_64
$ gcc test.c
$ ./a.out 2147483647
v = 2147483647
$ ./a.out 2147483648
v = 2147483648
There are at least some minor portability concerns with this code. The amount of allocatable stack memory differs significantly between these two environments, and that's only two platforms. I haven't tried it on my Windows 10 VM, but I don't think I need to, because I got bitten by this one a long time ago.
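As a rough sanity check on the Darwin numbers, here is the arithmetic for the array size. The 8 MiB figure is an assumption (the post doesn't state the stack limit; ulimit -s would show the real one), and sizeof(long long) is taken as 8 bytes.

```python
ELEM_SIZE = 8                         # sizeof(long long) on x86_64
ASSUMED_STACK_LIMIT = 8 * 1024 ** 2   # assumption: 8 MiB main-thread stack

def vla_bytes(n, elem_size=ELEM_SIZE):
    """Bytes requested on the stack by `long long arr[n]`."""
    return n * elem_size

for n in (1047934, 1047935):          # last working and first failing sizes
    need = vla_bytes(n)
    print(n, need, ASSUMED_STACK_LIMIT - need)
```

Under that assumption the failing size leaves only about 5 KiB of headroom, so the array alone nearly fills the stack.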
Beyond the OP's issue, which was due to a Valgrind bug, the title of this question is bound to attract more people (like me) who are getting "invalid write at X bytes below stack pointer" as a legitimate error.
My piece of advice: check that the address you're writing to is not a local variable of another function (not present in the call stack)!
I stumbled upon this issue while attempting to write into the address returned by yyget_lloc(yyscanner) while outside of function yyparse (the former returns the address of a local variable in the latter).

Yosys logic loop falsely detected

I've been testing yosys for some use cases.
Version: Yosys 0.7+200 (git sha1 155a80d, gcc-6.3 6.3.0 -fPIC -Os)
I wrote a simple block which converts gray code to binary:
module gray2bin (gray, bin);
    parameter WDT = 3;
    input [WDT-1:0] gray;
    output [WDT-1:0] bin;
    assign bin = {gray[WDT-1], bin[WDT-1:1]^gray[WDT-2:0]};
endmodule
This is valid Verilog, and there is no loop in it.
It passes compilation and synthesis without any warnings in other tools.
But when I run the following commands in yosys:
read_verilog gray2bin.v
scc
I get that a logic loop was found:
Found an SCC: $xor$gray2bin.v:11$1
Found 1 SCCs in module gray2bin.
Found 1 SCCs.
The following code, which is equivalent, passes the check:
module gray2bin2 (
    gray,
    bin
);
    parameter WDT = 3;
    input [WDT-1:0] gray;
    output [WDT-1:0] bin;
    assign bin[WDT-1] = gray[WDT-1];
    genvar i;
    generate
        for (i = WDT-2; i>=0; i=i-1) begin : gen_serial_xor
            assign bin[i] = bin[i+1]^gray[i];
        end
    endgenerate
endmodule
Am I missing a flag or synthesis option of some kind?
Using word-wide operators, this circuit clearly has a loop. The schematic generated with yosys -p 'prep; show' gray2bin.v shows it.
You have to synthesize the circuit to a gate-level representation to get a loop-free version. The schematic generated with yosys -p 'synth; splitnets -ports; show' gray2bin.v shows this; the call to splitnets is just there for better visualization.
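To see why the bit-level view has no loop, here is a small Python model of the same logic (not yosys output, just a sketch). In the word-level form, the single XOR cell takes bin[WDT-1:1] as input and drives bin[WDT-2:0], so the two vectors overlap. Bit by bit, each bin[i] depends only on bin[i+1]. The helper bin2gray is added only to check the result.

```python
WDT = 3  # same default width as the Verilog parameter

def gray2bin(gray, width=WDT):
    """Bit-serial form: bin[MSB] = gray[MSB], bin[i] = bin[i+1] ^ gray[i]."""
    bits = [0] * width
    bits[width - 1] = (gray >> (width - 1)) & 1
    for i in range(width - 2, -1, -1):   # each bit needs only the bit above it
        bits[i] = bits[i + 1] ^ ((gray >> i) & 1)
    return sum(b << i for i, b in enumerate(bits))

def bin2gray(n):
    """Standard binary-to-Gray encoding, used here only for checking."""
    return n ^ (n >> 1)

for n in range(1 << WDT):
    g = bin2gray(n)
    print(n, format(g, "03b"), gray2bin(g))
```

The loop runs from the MSB down and never reads a bit it has not already computed, which is the ordering a gate-level netlist exposes.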
The answer given by CliffordVienna indeed gives a solution, but I also want to clarify that it's not suitable for all purposes.
My analysis was done for the purpose of formal verification. Since I replaced prep with synth to get rid of the falsely identified logic loops, my formal code got optimized. Wires I had created that were driven only by the assume property pragma were removed, which made many assertions redundant.
It's not correct to reduce any logic for the purpose of behavioral verification.
Therefore, if the purpose is to prepare a verification database, I suggest not to use the synth command, but to use a subset of commands the synth command executes.
You can find those commands under:
http://www.clifford.at/yosys/cmd_synth.html
In general, I've used all the commands specified in the above link that do not optimize logic:
hierarchy -check
proc
check
wreduce
alumacc
fsm
memory -nomap
memory_map
techmap
abc -fast
hierarchy -check
stat
check
And everything works as expected.

Linux ioctl return value interpreted by who?

I'm working with a custom kernel char device which sometimes returns large negative values (around the thousands, say -2000) for its ioctl().
In userspace, I don't get these values returned from the ioctl call. Instead I get a return value of -1 back with errno set to the negated value from the kernel module (+2000).
As far as I can read and google, __syscall_return() is the macro which is supposed to interpret negative return values as errors. But, it only seems to look for values between -1 and -125. So I didn't expect these large negative values to be translated.
Where are these return values translated? Is it expected behaviour?
I am on Linux 2.6.35.10 with EGLIBC 2.11.3-4+deb6u6.
The translation and the move to errno occur at the libc level. Both GNU libc and μClibc treat negative numbers down to at least -4095 as error conditions, per http://www.makelinux.net/ldd3/chp-6-sect-1
See https://github.molgen.mpg.de/git-mirror/glibc/blob/85b290451e4d3ab460a57f1c5966c5827ca807ca/sysdeps/unix/sysv/linux/aarch64/ioctl.S for the GNU libc implementation of ioctl.
So, with the help of BRPocock I will report my findings here.
The Linux kernel will do an error check for all syscalls along the lines of (from unistd.h):
#define __syscall_return(type, res) \
do { \
    if ((unsigned long)(res) >= (unsigned long)(-125)) { \
        errno = -(res); \
        res = -1; \
    } \
    return (type) (res); \
} while (0)
Libc will also do an error check for all syscalls along the lines of (from syscall.S):
.text
ENTRY (syscall)
PUSHARGS_6 /* Save register contents. */
_DOARGS_6(44) /* Load arguments. */
movl 20(%esp), %eax /* Load syscall number into %eax. */
ENTER_KERNEL /* Do the system call. */
POPARGS_6 /* Restore register contents. */
cmpl $-4095, %eax /* Check %eax for error. */
jae SYSCALL_ERROR_LABEL /* Jump to error handler if error. */
ret /* Return to caller. */
PSEUDO_END (syscall)
Glibc gives a reason for the 4096 value (from sysdep.h):
/* Linux uses a negative return value to indicate syscall errors,
unlike most Unices, which use the condition codes' carry flag.
Since version 2.1 the return value of a system call might be
negative even if the call succeeded. E.g., the `lseek' system call
might return a large offset. Therefore we must not anymore test
for < 0, but test for a real error by making sure the value in %eax
is a real error number. Linus said he will make sure the no syscall
returns a value in -1 .. -4095 as a valid result so we can savely
test with -4095. */
__syscall_return seems to be missing from newer kernels; I haven't researched that yet.
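The two checks above differ only in the threshold. Here is a minimal Python model of the unsigned comparison, assuming 32-bit registers as in the i386 syscall.S shown. The function name translate and the (return value, errno) tuple are my own illustration, not libc API.

```python
WORD = 1 << 32  # assumption: 32-bit registers, as in the i386 listing

def as_unsigned(value):
    """Reinterpret a signed value as an unsigned machine word."""
    return value % WORD

def translate(raw, lowest_error):
    """Mimic `if ((unsigned long)res >= (unsigned long)lowest_error)`.

    Returns (value seen by the caller, errno or None).
    """
    if as_unsigned(raw) >= as_unsigned(lowest_error):
        return -1, -raw
    return raw, None

print(translate(-2000, -4095))   # libc-style threshold
print(translate(-2000, -125))    # threshold from the __syscall_return macro
print(translate(-4096, -4095))   # just outside the error range
```

With the -4095 threshold, -2000 is turned into -1 with errno 2000, matching what the question observed. With -125 it would be passed through unchanged.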

smulwb assembly instruction

I'm trying to understand this code:
inline SInt32 smul32by16(SInt32 i32, SInt16 i16)
{
    register SInt32 r;
    asm volatile("smulwb %0, %1, %2" : "=r"(r) : "r"(i32), "r"(i16));
    return r;
}
Does anybody know what this assembly instruction does?
Update:
P.S. I work in Objective-C and need to understand some assembly code, which is why this code is difficult for me.
It does a signed 32-bit by signed 16-bit multiplication and returns the top 32 bits of the 48-bit result. The b specifies that the bottom 16 bits of the third operand are used.
So, translating it into pseudo code:
int_48 temp;
temp = i32*i16;
result = temp >> 16;
See here for the description of the ARM SMUL and SMULW instructions:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/CHDIABBH.html
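For experimenting without ARM hardware, the pseudo code above can be emulated in a few lines of Python. This is only a sketch of the arithmetic described in this answer. The helper sign_extend is mine, and Python's unbounded integers stand in for the 48-bit intermediate.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def smulwb(rn, rm):
    """Emulate SMULWB: (signed 32-bit rn) * (signed bottom halfword of rm), top 32 of 48 bits."""
    a = sign_extend(rn, 32)
    b = sign_extend(rm, 16)    # the "b" suffix: bottom 16 bits of rm
    return (a * b) >> 16       # Python's >> on ints is an arithmetic shift

print(smulwb(0x10000, 3))
print(smulwb(-65536, 3))
print(smulwb(100000, 0x18000))   # bottom halfword 0x8000 is -32768
```

Note that only the bottom halfword of the second operand matters, so 0x18000 behaves like -32768.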
The asm keyword lets you embed assembler instructions in C code.
volatile is used for this reason:
"volatile for the asm construct, to prevent GCC from deleting the asm statement as unused"
See this link for a better understanding.
The instruction inside the asm statement means:
SMULWB R4, R5, R3 ; Multiplies R5 with the bottom halfword of R3,
; extracts top 32 bits and writes to R4.

Can a partially tail-recursive function still gain the optimization advantages of a fully tail-recursive function?

I realise the answer to this question could be different for different languages, and the language I am most interested in is C++. If the tag needs be changed because this can't be answered in a language-agnostic manner, feel free.
Is it possible to have a function be partially tail-recursive and still get any advantage that being tail-recursive would get you?
As I understand it, tail-recursion is where instead of doing a full function call, the compiler will optimise the function to just change the arguments in place to the new arguments and jump to the beginning of the function.
If you have a function like this:
def example(arg):
    if arg == 0:
        return 0 # base case
    if arg % 2 == 0:
        return example(arg - 1) # should be tail recursive
    return 3 + example(arg - 1) # isn't tail recursive because 3 is added to the result
When an optimiser encounters something like that (where the function is tail-recursive in some cases and not in others) will it turn the one into a jump and the other into a call, or will some fact of optimisation reality (if I knew it I wouldn't be asking) make it have to turn everything into a call and lose all the efficiency you would have had if the function were tail-recursive?
In Scheme, the first language that comes to mind when I think of tail calls, the second case is guaranteed to be a tail call by the language specification. (Terminology note: it is preferred to refer to such function calls as 'tail calls'.)
The Scheme specification defines exactly what tail calls are in Scheme and mandates that compilers support them specially. You can see the definition in 11.20. Tail calls and tail contexts of R6RS (source).
Note that in Scheme, the specification says nothing about optimization of tail calls. Rather, it says that an implementation must support an unbounded number of active tail calls — a semantic property of the language runtime. They can be implemented as normal calls, but usually aren't.
Example, in C:
Take a C version of your example.
int example(int arg)
{
    if (arg == 0)
        return 0;
    if ((arg % 2) == 0)
        return example(arg - 1);
    return 3 + example(arg - 1);
}
Compile it using gcc's usual optimization settings (-O2) for i386:
_example:
pushl %ebp
xorl %eax, %eax
movl %esp, %ebp
movl 8(%ebp), %edx
testl %edx, %edx
jne L5
jmp L15
.align 4,0x90
L14:
decl %edx
testl %edx, %edx
je L7
L5:
testb $1, %dl
je L14
decl %edx
addl $3, %eax
testl %edx, %edx
jne L5
L7:
leave
ret
L15:
leave
xorl %eax, %eax
ret
Note that there are no function calls in the assembly code. GCC has not only optimized your tail call into a jump, it optimized the non-tail call into a jump as well.
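Read back into a high-level form, the listing is just a loop. The sketch below is my own transcription of the assembly into Python, with variables named after the registers, next to the original recursive function for comparison.

```python
def example(arg):
    """Recursive version, as in the question."""
    if arg == 0:
        return 0
    if arg % 2 == 0:
        return example(arg - 1)
    return 3 + example(arg - 1)

def example_loop(arg):
    """Loop the listing computes: eax holds the result, edx holds arg."""
    eax, edx = 0, arg
    while edx != 0:        # testl %edx, %edx
        if edx & 1:        # testb $1, %dl
            eax += 3       # addl $3, %eax
        edx -= 1           # decl %edx
    return eax

print([example(n) for n in range(8)])
print([example_loop(n) for n in range(8)])
```

Both forms add 3 once for every odd value between 1 and arg. That is how the non-tail call could be folded into the loop: the pending addition is kept in a register instead of on the stack.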
As far as I understand it, a smart compiler could apply tail recursion to your first call by just jumping to the example entry point instead of setting up a new stack frame. A following return will unwind the stack to the original caller, effectively "ending" both calls in one step, even if it cannot do that for the other call.
And you could optimize your function by moving the addition of 3 inside the call, carrying the running total in an extra parameter:
def example(arg, acc=0):
    ....
    return example(arg - 1, acc + 3) # tail call now too
Another technique would be to create a second function and have both call each other.
I don't know if Python or C++ compilers can handle that, though; you can check the assembly output for C++. Strangely, I think checking the bytecode output for Python may be harder.
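One way to make the "move the 3 inside the call" idea concrete is to carry the pending sum in an accumulator for the result, leaving arg itself alone. The name example_acc and the acc parameter are my own. CPython does not eliminate tail calls, so this only illustrates the shape a C++ compiler could turn into a jump.

```python
def example_acc(arg, acc=0):
    """Every recursive call is now in tail position; acc carries the pending sum."""
    if arg == 0:
        return acc                          # base case returns the accumulated total
    if arg % 2 == 0:
        return example_acc(arg - 1, acc)    # tail call, nothing added
    return example_acc(arg - 1, acc + 3)    # tail call, the 3 travels in acc

print([example_acc(n) for n in range(6)])
```

Since nothing is left to do after either recursive call returns, both branches qualify for the same jump-based optimization.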