LLVM insertvalue badly optimized? - optimization

Should I avoid using the 'insertvalue' instruction combined with load and store when I emit LLVM code?
I always get badly optimized native code when I use it. Look at the following example:
; ModuleID = 'mod'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-pc-linux-gnu"
%A = type { i64, i64, i64, i64, i64, i64, i64, i64 }
@aa = external global %A*
define void @func() {
entry:
%a1 = load %A** @aa
%a2 = load %A* %a1
%a3 = insertvalue %A %a2, i64 3, 3
store %A %a3, %A* %a1
ret void
}
When I run "llc -o - -O3 mod.ll", I get this horrible code:
func: # @func
.Ltmp0:
.cfi_startproc
# BB#0: # %entry
movq aa(%rip), %rax
movq (%rax), %r8
movq 8(%rax), %r9
movq 16(%rax), %r10
movq 32(%rax), %rdi
movq 40(%rax), %rcx
movq 48(%rax), %rdx
movq 56(%rax), %rsi
movq %rsi, 56(%rax)
movq %rdx, 48(%rax)
movq %rcx, 40(%rax)
movq %rdi, 32(%rax)
movq %r10, 16(%rax)
movq %r9, 8(%rax)
movq %r8, (%rax)
movq $3, 24(%rax)
ret
But what I would like to see is this:
func: # @func
.Ltmp0:
.cfi_startproc
# BB#0: # %entry
movq aa(%rip), %rax
movq $3, 24(%rax)
ret
Of course I can use getelementptr or something, but sometimes it is easier to generate insertvalue and extractvalue instructions, and I want these to be optimized...
I think it would be quite easy for the codegen to see that load/store pairs like these are redundant:
movq 56(%rax), %rsi
movq %rsi, 56(%rax)

First, note that llc does not do any IR-level optimizations, so you should run opt to apply the set of IR-level optimizers.
However, opt does not help here either; I'd expect the standard IR-level optimizers to canonicalize this into a gep somehow.
Please file an LLVM PR, as this looks like a missed optimization!
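For what it's worth, the two-instruction sequence the asker wants is what you get when the frontend emits a getelementptr plus a scalar store instead of a whole-aggregate load/insertvalue/store. A minimal C analogue (struct and field names invented here to mirror the IR) is:
/* Hypothetical C analogue of @func above. Clang lowers the field assignment
 * to a getelementptr plus a single i64 store, so with optimization enabled it
 * should produce just "movq aa(%rip), %rax" followed by "movq $3, 24(%rax)". */
struct A { long f0, f1, f2, f3, f4, f5, f6, f7; };
extern struct A *aa;

void func(void)
{
    aa->f3 = 3;   /* field index 3 -> byte offset 24 on x86-64 */
}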

Related

EXC_BAD_ACCESS (code=1, address=0x0) when trying to create NSURLCredential in Kotlin/Native

Does anyone know why I get Thread 5: EXC_BAD_ACCESS (code=1, address=0x0) when trying to create an NSURLCredential in the iOS-specific code of a Kotlin multiplatform project? I get the error when this code is executed:
val credentials = NSURLCredential.credentialWithUser(this.username, this.password, NSURLCredentialPersistence.NSURLCredentialPersistenceForSession)
And here is a bit more details about the error:
0x10d8f83cb <+123>: leaq 0x87f02e(%rip), %rdi ; __KonanTlsKey
0x10d8f83d2 <+130>: movl $0xe0, %esi
0x10d8f83d7 <+135>: callq 0x10d1662e0 ; LookupTLS
-> 0x10d8f83dc <+140>: leaq 0x87da7d(%rip), %rdi ; kobjref:platform.Foundation.NSURLCredentialPersistence.$OBJECT#internal
0x10d8f83e3 <+147>: leaq 0x6b8746(%rip), %rdx ; ktype:platform.Foundation.NSURLCredentialPersistence.$OBJECT
0x10d8f83ea <+154>: leaq -0x691(%rip), %rcx ; kfun:platform.Foundation.NSURLCredentialPersistence.$OBJECT.<init>()platform.Foundation.NSURLCredentialPersistence.$OBJECT at Foundation.kt
0x10d8f83f1 <+161>: movq %rax, %rsi
0x10d8f83f4 <+164>: movq -0x40(%rbp), %r8

What is the meaning of %rsp, %rbp in my bomblab problem?

I am currently working on the famous systems programming problem, "Bomb Lab", phase 2. When I disassembled phase_2 with gdb, the code looked like this...
Phase_2
0x00005555555555cb <+0>: endbr64
0x00005555555555cf <+4>: push %rbp
0x00005555555555d0 <+5>: push %rbx
0x00005555555555d1 <+6>: sub $0x28,%rsp
0x00005555555555d5 <+10>: mov %fs:0x28,%rax
0x00005555555555de <+19>: mov %rax,0x18(%rsp)
0x00005555555555e3 <+24>: xor %eax,%eax
0x00005555555555e5 <+26>: mov %rsp,%rsi
0x00005555555555e8 <+29>: callq 0x555555555bd5 <read_six_numbers>
0x00005555555555ed <+34>: cmpl $0x0,(%rsp)
0x00005555555555f1 <+38>: js 0x5555555555fd <phase_2+50>
0x00005555555555f3 <+40>: mov %rsp,%rbp
0x00005555555555f6 <+43>: mov $0x1,%ebx
0x00005555555555fb <+48>: jmp 0x555555555615 <phase_2+74>
0x00005555555555fd <+50>: callq 0x555555555ba9 <explode_bomb>
0x0000555555555602 <+55>: jmp 0x5555555555f3 <phase_2+40>
0x0000555555555604 <+57>: callq 0x555555555ba9 <explode_bomb>
0x0000555555555609 <+62>: add $0x1,%ebx
0x000055555555560c <+65>: add $0x4,%rbp
0x0000555555555610 <+69>: cmp $0x6,%ebx
0x0000555555555613 <+72>: je 0x555555555621 <phase_2+86>
0x0000555555555615 <+74>: mov %ebx,%eax
0x0000555555555617 <+76>: add 0x0(%rbp),%eax
0x000055555555561a <+79>: cmp %eax,0x4(%rbp)
0x000055555555561d <+82>: je 0x555555555609 <phase_2+62>
0x000055555555561f <+84>: jmp 0x555555555604 <phase_2+57>
0x0000555555555621 <+86>: mov 0x18(%rsp),%rax
0x0000555555555626 <+91>: xor %fs:0x28,%rax
0x000055555555562f <+100>: jne 0x555555555638 <phase_2+109>
0x0000555555555631 <+102>: add $0x28,%rsp
0x0000555555555635 <+106>: pop %rbx
0x0000555555555636 <+107>: pop %rbp
0x0000555555555637 <+108>: retq
0x0000555555555638 <+109>: callq 0x555555555220 <__stack_chk_fail#plt>
I guess that in line <+62>, %ebx acts as an index, and it is incremented until it reaches 6 (checked in line <+69>). But I don't really understand lines such as
0x000055555555560c <+65>: add $0x4,%rbp
or
0x0000555555555615 <+74>: mov %ebx,%eax
0x0000555555555617 <+76>: add 0x0(%rbp),%eax
0x000055555555561a <+79>: cmp %eax,0x4(%rbp)
I guess it is probably related to properties of the sequence that I have to find, but I don't understand why a value such as 0x4(%rbp) is compared with %eax, etc. Can someone explain?
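For what it's worth, here is a rough, hypothetical C reconstruction of what the loop above appears to check (every name is a guess, not taken from the binary): %rbp walks the array of six ints on the stack four bytes at a time, %ebx is the index, and each element must equal the previous element plus that index.
/* Hypothetical reconstruction of phase_2; all names are guesses. */
extern void read_six_numbers(const char *input, int *nums);
extern void explode_bomb(void);

void phase_2(const char *input)
{
    int a[6];
    read_six_numbers(input, a);       /* %rsi = %rsp: the six ints live on the stack */

    if (a[0] < 0)                     /* cmpl $0x0,(%rsp); js <explode_bomb> */
        explode_bomb();

    for (int i = 1; i != 6; i++) {    /* mov $0x1,%ebx ... cmp $0x6,%ebx */
        /* mov %ebx,%eax; add 0x0(%rbp),%eax; cmp %eax,0x4(%rbp) */
        if (a[i] != a[i - 1] + i)
            explode_bomb();
    }
}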

@autoreleasepool and Loops (for, while, do) Syntax

clang allows the following loop syntax:
for (...) @autoreleasepool { ... }
while (...) @autoreleasepool { ... }
do @autoreleasepool { ... } while (...);
I haven't found any documentation on that syntax so far (Apple doesn't use it in their guides, at least not in the guides introducing the @autoreleasepool construct), but is it reasonable to assume that the three statements above are equivalent to the three statements below?
for (...) { @autoreleasepool { ... } }
while (...) { @autoreleasepool { ... } }
do { @autoreleasepool { ... } } while (...);
That is what I would expect them to be (going by standard C syntax rules), yet I'm not entirely sure that's really the case. It could also be some "special syntax" where the autorelease pool is not renewed for every loop iteration.
The reason that the first syntax example works is clear when you consider that any conditional statement can omit the { ... } block, resulting in only the following statement being executed.
For example:
if (something == YES)
NSLog(#"Something is yes");
is equivalent to
if (something == YES)
{
NSLog(#"Something is yes");
}
The @autoreleasepool { ... } block is simply the next statement following the conditional.
Personally I use the second syntax, as it's less error-prone when making changes and I find it easier to read. If you add a statement between the conditional and the @autoreleasepool { ... } block, the result is considerably different from the original. See this naive example...
int i = 1;
while (i <= 10)
@autoreleasepool
{
NSLog(@"Iteration %d", i);
++i;
}
Will output "Iteration 1" through "Iteration 10". However:
int i = 1;
int total = 0;
while (i <= 10)
total += i;
@autoreleasepool
{
NSLog(@"Iteration %d", i);
++i;
}
Will actually cause an infinite loop because the ++i statement is never reached as it is syntactically equivalent to:
int i = 1;
int total = 0;
while (i <= 10)
{
total += i;
}
@autoreleasepool
{
NSLog(@"Iteration %d", i);
++i;
}
Both syntaxes are the same.
-(void)aFunc
{
int i=0;
for(;i<5;)
@autoreleasepool {
++i;
}
}
-(void)bFunc
{
int i=0;
for(;i<5;)
{
@autoreleasepool {
++i;
}
}
}
Assembly code
"-[AppDelegate aFunc]": ## #"\01-[AppDelegate aFunc]"
.cfi_startproc
Lfunc_begin0:
.loc 1 12 0 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:12:0
## BB#0:
pushq %rbp
Ltmp2:
.cfi_def_cfa_offset 16
Ltmp3:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp4:
.cfi_def_cfa_register %rbp
subq $32, %rsp
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
.loc 1 14 12 prologue_end ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:14:12
Ltmp5:
movl $0, -20(%rbp)
LBB0_1: ## =>This Inner Loop Header: Depth=1
.loc 1 15 5 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:15:5
Ltmp6:
cmpl $5, -20(%rbp)
jge LBB0_3
## BB#2: ## in Loop: Header=BB0_1 Depth=1
.loc 1 16 26 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:16:26
Ltmp7:
callq _objc_autoreleasePoolPush
.loc 1 17 13 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:17:13
movl -20(%rbp), %ecx
addl $1, %ecx
movl %ecx, -20(%rbp)
.loc 1 18 9 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:18:9
movq %rax, %rdi
callq _objc_autoreleasePoolPop
jmp LBB0_1
Ltmp8:
LBB0_3:
.loc 1 19 1 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:19:1
addq $32, %rsp
popq %rbp
ret
Ltmp9:
Lfunc_end0:
.file 2 "/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/System/Library/Frameworks/Foundation.framework/Headers/NSObject.h"
.file 3 "/Users/Parag/Desktop/Test/Test/AppDelegate.h"
.cfi_endproc
.align 4, 0x90
"-[AppDelegate bFunc]": ## #"\01-[AppDelegate bFunc]"
.cfi_startproc
Lfunc_begin1:
.loc 1 20 0 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:20:0
## BB#0:
pushq %rbp
Ltmp12:
.cfi_def_cfa_offset 16
Ltmp13:
.cfi_offset %rbp, -16
movq %rsp, %rbp
Ltmp14:
.cfi_def_cfa_register %rbp
subq $32, %rsp
movq %rdi, -8(%rbp)
movq %rsi, -16(%rbp)
.loc 1 22 12 prologue_end ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:22:12
Ltmp15:
movl $0, -20(%rbp)
LBB1_1: ## =>This Inner Loop Header: Depth=1
.loc 1 23 5 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:23:5
Ltmp16:
cmpl $5, -20(%rbp)
jge LBB1_3
## BB#2: ## in Loop: Header=BB1_1 Depth=1
.loc 1 25 26 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:25:26
Ltmp17:
callq _objc_autoreleasePoolPush
.loc 1 26 14 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:26:14
movl -20(%rbp), %ecx
addl $1, %ecx
movl %ecx, -20(%rbp)
.loc 1 27 9 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:27:9
movq %rax, %rdi
callq _objc_autoreleasePoolPop
Ltmp18:
.loc 1 28 5 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:28:5
jmp LBB1_1
Ltmp19:
LBB1_3:
.loc 1 29 1 ## /Users/Parag/Desktop/Test/Test/AppDelegate.m:29:1
addq $32, %rsp
popq %rbp
ret
Ltmp20:
Lfunc_end1:
I have tried the following code:
@interface Foo : NSObject
@end
@implementation Foo
- (void) dealloc
{
NSLog(@"Deallocating %@.", self);
[super dealloc];
}
@end
for (;;) @autoreleasepool {
[[[Foo alloc] init] autorelease];
sleep(1);
}
The console starts to fill with logs of deallocated Foo instances, so the syntax appears to work as expected.
This is just the normal C syntax for blocks and statements. When if, else, for, while, etc. do not have braces, they take the following statement, which could be a compound statement.
For example, you can do:
for (...) if (...) { ... }
if (...) while (...) { ... }
and so on... @autoreleasepool blocks are no different.

Why do we need a constant time *single byte* comparison function?

Looking at the Go standard library, there's a ConstantTimeByteEq function that looks like this:
func ConstantTimeByteEq(x, y uint8) int {
z := ^(x ^ y)
z &= z >> 4
z &= z >> 2
z &= z >> 1
return int(z)
}
Now, I understand the need for constant time string (array, etc.) comparison, as a regular algorithm could short-circuit after the first unequal element. But in this case, isn't a regular comparison of two fixed-sized integers a constant time operation at the CPU level already?
The point is likely to avoid data-dependent branches (and the timing variation and mispredictions they can cause), in addition to having the result as 1 or 0 instead of true or false, which allows follow-ups to be done as bitwise operations.
Compare how this compiles:
var a, b, c, d byte
_ = a == b && c == d
=>
0017 (foo.go:15) MOVQ $0,BX
0018 (foo.go:15) MOVQ $0,DX
0019 (foo.go:15) MOVQ $0,CX
0020 (foo.go:15) MOVQ $0,AX
0021 (foo.go:16) JMP ,24
0022 (foo.go:16) MOVQ $1,AX
0023 (foo.go:16) JMP ,30
0024 (foo.go:16) CMPB BX,DX
0025 (foo.go:16) JNE ,29
0026 (foo.go:16) CMPB CX,AX
0027 (foo.go:16) JNE ,29
0028 (foo.go:16) JMP ,22
0029 (foo.go:16) MOVQ $0,AX
With this:
var a, b, c, d byte
_ = subtle.ConstantTimeByteEq(a, b) & subtle.ConstantTimeByteEq(c, d)
=>
0018 (foo.go:15) MOVQ $0,DX
0019 (foo.go:15) MOVQ $0,AX
0020 (foo.go:15) MOVQ $0,DI
0021 (foo.go:15) MOVQ $0,SI
0022 (foo.go:16) XORQ AX,DX
0023 (foo.go:16) XORQ $-1,DX
0024 (foo.go:16) MOVQ DX,BX
0025 (foo.go:16) SHRB $4,BX
0026 (foo.go:16) ANDQ BX,DX
0027 (foo.go:16) MOVQ DX,BX
0028 (foo.go:16) SHRB $2,BX
0029 (foo.go:16) ANDQ BX,DX
0030 (foo.go:16) MOVQ DX,AX
0031 (foo.go:16) SHRB $1,DX
0032 (foo.go:16) ANDQ DX,AX
0033 (foo.go:16) MOVBQZX AX,DX
0034 (foo.go:16) MOVQ DI,BX
0035 (foo.go:16) XORQ SI,BX
0036 (foo.go:16) XORQ $-1,BX
0037 (foo.go:16) MOVQ BX,AX
0038 (foo.go:16) SHRB $4,BX
0039 (foo.go:16) ANDQ BX,AX
0040 (foo.go:16) MOVQ AX,BX
0041 (foo.go:16) SHRB $2,BX
0042 (foo.go:16) ANDQ BX,AX
0043 (foo.go:16) MOVQ AX,BX
0044 (foo.go:16) SHRB $1,BX
0045 (foo.go:16) ANDQ BX,AX
0046 (foo.go:16) MOVBQZX AX,BX
Although the latter version is longer, it is straight-line code: there are no branches.
Not necessarily, and it is hard to tell what the compiler will emit after doing its optimizations. You could end up with different machine code for the high-level "compare one byte". Leaking just a tiny bit through a side channel might change your encryption from "basically unbreakable" to "hopefully not worth the money needed for a crack".
If the code which called the function were to immediately branch based upon the result, the use of the constant-time method wouldn't provide much extra security. On the other hand, if one were to call the function on a bunch of different pairs of bytes, keeping a running total of the results, and only branch based upon the final result, then an outside snooper might be able to determine whether that last branch was taken, but wouldn't know which of the previous byte comparisons was responsible for it.
That having been said, I'm not sure I see a whole lot of advantage in most use cases in going to the trouble of distilling the method's output to a zero or one; simply keeping a running tally of notEqual = (A0 ^ B0); notEqual |= (A1 ^ B1); notEqual |= (A2 ^ B2); ... would achieve the same effect and be much faster.
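As a sketch of that running-tally idea in plain C (hypothetical names; Go's subtle.ConstantTimeCompare works along the same lines, assuming both buffers share the same, public length):
#include <stddef.h>
#include <stdint.h>

/* Accumulate XOR differences over the whole buffer and reduce to 0/1 only
 * once at the end, so no individual byte comparison feeds a branch. */
int constant_time_equal(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)    /* the loop bound n is public, not secret */
        diff |= (uint8_t)(a[i] ^ b[i]);

    /* Map diff == 0 to 1 and anything else to 0 without a data-dependent branch. */
    return (int)(((uint32_t)diff - 1u) >> 31);
}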

printf("%%%s","hello")

If I write printf("%%%s", "hello"), how is it interpreted by the compiler? Enlighten me, someone.
The compiler simply interprets this as calling printf with two strings as arguments (but see Zack's comment).
This happens at compile time (i.e. the compiler does this):
The strings ("%%%s" and "hello") are copied directly into the executable, the compiler leaves them as-is.
This happens at runtime (i.e. the C standard library does this when the app is running):
printf stands for 'print formatted'. When this function is called, it needs at least one argument: the format string. Any further arguments are the values for that format, and they are printed as the conversion specifications in the format string dictate.
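For instance, a trivial sketch (standard C, nothing beyond the example from the question) of how that plays out at runtime:
#include <stdio.h>

int main(void)
{
    /* "%%" is an escape for a literal '%' and consumes no argument;
       "%s" consumes the next argument, "hello". */
    printf("%%%s\n", "hello");   /* prints: %hello */
    return 0;
}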
About optimization
I wrote an example and ran Clang/LLVM with -S:
$ emacs printftest.c
$ clang printftest.c -S -o printftest_unopt.s # not optimized
$ clang printftest.c -S -O -o printftest_opt.s # optimized: -O flag
C code
#include <stdio.h>
int main() {
printf("%%%s", "hello");
return 0;
}
printftest_unopt.s (not optimized)
; ...
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $0, %eax
movl $0, -4(%rbp)
movl %eax, -8(%rbp)
xorb %al, %al
leaq L_.str(%rip), %rdi
leaq L_.str1(%rip), %rsi
callq _printf ; printf called here <----------------
movl %eax, -12(%rbp)
movl -8(%rbp), %eax
addq $16, %rsp
popq %rbp
ret
.section __TEXT,__cstring,cstring_literals
L_.str:
.asciz "%%%s"
L_.str1:
.asciz "hello"
; ...
printftest_opt.s (optimized)
; ...
_main:
pushq %rbp
movq %rsp, %rbp
leaq L_.str(%rip), %rdi
leaq L_.str1(%rip), %rsi
xorb %al, %al
callq _printf ; printf called here <----------------
xorl %eax, %eax
popq %rbp
ret
.section __TEXT,__cstring,cstring_literals
L_.str:
.asciz "%%%s"
L_.str1:
.asciz "hello"
; ...
Conclusion
As you can see (in the __TEXT,__cstring,cstring_literals section and the callq to printf), LLVM (a very, very good compiler) does not optimize printf("%%%s", "hello");. :)
"%%" means to print an actual "%" character; %s means to print a string from the argument list. So you'll see "%hello".
%% will print a literal '%' character.