Why do we need a constant time *single byte* comparison function? - cryptography

Looking at the Go standard library (package crypto/subtle), there's a ConstantTimeByteEq function that looks like this:
func ConstantTimeByteEq(x, y uint8) int {
	z := ^(x ^ y)
	z &= z >> 4
	z &= z >> 2
	z &= z >> 1
	return int(z)
}
Now, I understand the need for constant time string (array, etc.) comparison, as a regular algorithm could short-circuit after the first unequal element. But in this case, isn't a regular comparison of two fixed-sized integers a constant time operation at the CPU level already?

The point is likely to avoid branches (and with them timing differences from branch mispredictions), in addition to returning the result as 1 or 0 instead of true or false, which lets follow-up steps be plain bitwise operations.
Compare how this compiles:
var a, b, c, d byte
_ = a == b && c == d
=>
0017 (foo.go:15) MOVQ $0,BX
0018 (foo.go:15) MOVQ $0,DX
0019 (foo.go:15) MOVQ $0,CX
0020 (foo.go:15) MOVQ $0,AX
0021 (foo.go:16) JMP ,24
0022 (foo.go:16) MOVQ $1,AX
0023 (foo.go:16) JMP ,30
0024 (foo.go:16) CMPB BX,DX
0025 (foo.go:16) JNE ,29
0026 (foo.go:16) CMPB CX,AX
0027 (foo.go:16) JNE ,29
0028 (foo.go:16) JMP ,22
0029 (foo.go:16) MOVQ $0,AX
With this:
var a, b, c, d byte
_ = subtle.ConstantTimeByteEq(a, b) & subtle.ConstantTimeByteEq(c, d)
=>
0018 (foo.go:15) MOVQ $0,DX
0019 (foo.go:15) MOVQ $0,AX
0020 (foo.go:15) MOVQ $0,DI
0021 (foo.go:15) MOVQ $0,SI
0022 (foo.go:16) XORQ AX,DX
0023 (foo.go:16) XORQ $-1,DX
0024 (foo.go:16) MOVQ DX,BX
0025 (foo.go:16) SHRB $4,BX
0026 (foo.go:16) ANDQ BX,DX
0027 (foo.go:16) MOVQ DX,BX
0028 (foo.go:16) SHRB $2,BX
0029 (foo.go:16) ANDQ BX,DX
0030 (foo.go:16) MOVQ DX,AX
0031 (foo.go:16) SHRB $1,DX
0032 (foo.go:16) ANDQ DX,AX
0033 (foo.go:16) MOVBQZX AX,DX
0034 (foo.go:16) MOVQ DI,BX
0035 (foo.go:16) XORQ SI,BX
0036 (foo.go:16) XORQ $-1,BX
0037 (foo.go:16) MOVQ BX,AX
0038 (foo.go:16) SHRB $4,BX
0039 (foo.go:16) ANDQ BX,AX
0040 (foo.go:16) MOVQ AX,BX
0041 (foo.go:16) SHRB $2,BX
0042 (foo.go:16) ANDQ BX,AX
0043 (foo.go:16) MOVQ AX,BX
0044 (foo.go:16) SHRB $1,BX
0045 (foo.go:16) ANDQ BX,AX
0046 (foo.go:16) MOVBQZX AX,BX
Although the latter version is longer, it's also linear -- there are no branches.
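For readers who don't write Go, here is a rough C rendering of the same bit-folding trick. It is only a sketch: the names ct_byte_eq and ct_both_eq are made up for illustration, and the C standard does not promise that a compiler keeps this branch-free, so the generated assembly would still need to be inspected.

```c
#include <stdint.h>

/* Hypothetical C port of the Go function above.
 * Returns 1 if x == y and 0 otherwise, using only XOR, NOT, shifts and ANDs. */
int ct_byte_eq(uint8_t x, uint8_t y) {
    uint8_t z = (uint8_t)~(x ^ y); /* all eight bits set iff x == y */
    z &= z >> 4;                   /* fold the high nibble onto the low nibble */
    z &= z >> 2;                   /* fold four bits down to two */
    z &= z >> 1;                   /* bit 0 is now the AND of all eight bits */
    return (int)z;
}

/* Combines two comparisons with a bitwise AND instead of a short-circuit &&. */
int ct_both_eq(uint8_t a, uint8_t b, uint8_t c, uint8_t d) {
    return ct_byte_eq(a, b) & ct_byte_eq(c, d);
}
```

Because the result is an integer 0 or 1, several comparisons can be combined with & without introducing a conditional jump at the source level.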

Not necessarily. It is hard to tell what the compiler will emit after doing its optimizations, so the high-level operation "compare one byte" could end up as different machine code in different places. Leaking just a tiny bit through a side channel might change your encryption from "basically unbreakable" to "hopefully not worth the money needed for a crack".

If the code which called the function were to immediately branch based upon the result, the use of the constant-time method wouldn't provide much extra security. On the other hand, if one were to call the function on a bunch of different pairs of bytes, keeping a running total of the results, and only branch based upon the final result, then an outside snooper might be able to determine whether that last branch was taken, but wouldn't know which of the previous byte comparisons was responsible for it.
That having been said, I'm not sure I see a whole lot of advantage, in most use cases, in going through the trouble of distilling the method's output to a zero or one; simply keeping a running tally of notEqual = (A0 ^ B0); notEqual |= (A1 ^ B1); notEqual |= (A2 ^ B2); ... would achieve the same effect and be much faster.
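The running-tally idea can be sketched in C roughly as follows. The function name ct_diff is hypothetical, and both buffers are assumed to have the same, non-secret length n; as before, this is a sketch rather than a vetted constant-time routine.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if the two n-byte buffers match; otherwise the OR of all
 * per-byte XOR differences. The loop always visits every byte. */
uint8_t ct_diff(const uint8_t *a, const uint8_t *b, size_t n) {
    uint8_t not_equal = 0;
    for (size_t i = 0; i < n; i++) {
        not_equal |= (uint8_t)(a[i] ^ b[i]); /* accumulate, never exit early */
    }
    return not_equal;
}
```

The caller branches only once, on the final value, so an observer may learn whether the buffers differed but not at which position.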

Related

What is the meaning of %rsp, %rbp in my bomblab problem?

I am currently working on the famous systems programming problem, "Bomb lab", phase 2. When I disassembled phase_2 with gdb, the code looked like this...
Phase_2
0x00005555555555cb <+0>: endbr64
0x00005555555555cf <+4>: push %rbp
0x00005555555555d0 <+5>: push %rbx
0x00005555555555d1 <+6>: sub $0x28,%rsp
0x00005555555555d5 <+10>: mov %fs:0x28,%rax
0x00005555555555de <+19>: mov %rax,0x18(%rsp)
0x00005555555555e3 <+24>: xor %eax,%eax
0x00005555555555e5 <+26>: mov %rsp,%rsi
0x00005555555555e8 <+29>: callq 0x555555555bd5 <read_six_numbers>
0x00005555555555ed <+34>: cmpl $0x0,(%rsp)
0x00005555555555f1 <+38>: js 0x5555555555fd <phase_2+50>
0x00005555555555f3 <+40>: mov %rsp,%rbp
0x00005555555555f6 <+43>: mov $0x1,%ebx
0x00005555555555fb <+48>: jmp 0x555555555615 <phase_2+74>
0x00005555555555fd <+50>: callq 0x555555555ba9 <explode_bomb>
0x0000555555555602 <+55>: jmp 0x5555555555f3 <phase_2+40>
0x0000555555555604 <+57>: callq 0x555555555ba9 <explode_bomb>
0x0000555555555609 <+62>: add $0x1,%ebx
0x000055555555560c <+65>: add $0x4,%rbp
0x0000555555555610 <+69>: cmp $0x6,%ebx
0x0000555555555613 <+72>: je 0x555555555621 <phase_2+86>
0x0000555555555615 <+74>: mov %ebx,%eax
0x0000555555555617 <+76>: add 0x0(%rbp),%eax
0x000055555555561a <+79>: cmp %eax,0x4(%rbp)
0x000055555555561d <+82>: je 0x555555555609 <phase_2+62>
0x000055555555561f <+84>: jmp 0x555555555604 <phase_2+57>
0x0000555555555621 <+86>: mov 0x18(%rsp),%rax
0x0000555555555626 <+91>: xor %fs:0x28,%rax
0x000055555555562f <+100>: jne 0x555555555638 <phase_2+109>
0x0000555555555631 <+102>: add $0x28,%rsp
0x0000555555555635 <+106>: pop %rbx
0x0000555555555636 <+107>: pop %rbp
0x0000555555555637 <+108>: retq
0x0000555555555638 <+109>: callq 0x555555555220 <__stack_chk_fail@plt>
I guess that %ebx is an index: it is incremented at line <+62> until it reaches 6 (checked at line <+69>). But I don't really understand lines such as
0x000055555555560c <+65>: add $0x4,%rbp
or
0x0000555555555615 <+74>: mov %ebx,%eax
0x0000555555555617 <+76>: add 0x0(%rbp),%eax
0x000055555555561a <+79>: cmp %eax,0x4(%rbp)
I guess it is probably related to properties of the sequence that I have to find, but I don't understand why a value such as 0x4(%rbp) is compared with %eax, and so on. Can someone explain?
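One way to read the listing, written as a C sketch. It assumes that read_six_numbers stores six ints starting at (%rsp) and that explode_bomb does not return; the function name phase_2_passes and its 1/0 return convention are invented for illustration. In this reading %ebx is the counter i, and %rbp is a pointer that walks the array one int (4 bytes) at a time, which is what add $0x4,%rbp does; 0x0(%rbp) and 0x4(%rbp) are then the current and the next element.

```c
/* Hypothetical reconstruction of the check in phase_2.
 * Returns 1 where the listing falls through to the end,
 * 0 where it would call explode_bomb.
 * Assumes the additions do not overflow int. */
int phase_2_passes(const int nums[6]) {
    if (nums[0] < 0) {                   /* <+34> cmpl $0x0,(%rsp); <+38> js */
        return 0;
    }
    const int *p = nums;                 /* <+40> mov %rsp,%rbp */
    for (int i = 1; i != 6; i++, p++) {  /* %ebx counts 1..5; p++ is add $0x4,%rbp */
        if (p[1] != p[0] + i) {          /* <+74>..<+79>: next == current + i ? */
            return 0;
        }
    }
    return 1;
}
```

Under this reading, an input such as 0 1 3 6 10 15 would pass, because each number equals the previous one plus its index.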

Under ARC, is it legal/safe to assign to an object-type ivar using runtime methods?

Based on the technique described here I'm setting ivars in object instances using the ivar_getOffset() method.
Now I have the case where the ivar is a NSString*:
NSString* _name;
UPDATE:
I was on the wrong track (can't use ivar_getOffset in this case), I should be using:
object_setIvar(target, _ivar, [stringValue copy]);
to assign object values to ivars.
My question remains: is this safe to do under ARC? Will the ivar's existing object be properly released by ARC?
I'm worried that ARC might not properly release whatever string the ivar was set to before the assignment. Though I doubt it, otherwise the compiler wouldn't let me like in the case of object_setInstanceVariable ("not available in automatic reference counting mode") - right?
Yes, doing this with ARC is legal, at least on Mac OS X 10.8.2.
If you disassemble libobjc, you will find this:
================ B E G I N O F P R O C E D U R E ================
; Basic Block Input Regs: rdx rsi rdi - Killed Regs: rbx rbp r15
_object_setIvar:
000000000000c1f5 55 push rbp ; XREF=0xc1e1
000000000000c1f6 4889E5 mov rbp, rsp
000000000000c1f9 4157 push r15
000000000000c1fb 4156 push r14
000000000000c1fd 4155 push r13
000000000000c1ff 4154 push r12
000000000000c201 53 push rbx
000000000000c202 4883EC18 sub rsp, 0x18
000000000000c206 488955C8 mov qword [ss:rbp-0x40+var_8], rdx
000000000000c20a 4889F3 mov rbx, rsi
000000000000c20d 4989FF mov r15, rdi
000000000000c210 4D85FF test r15, r15
000000000000c213 0F8449010000 je 0xc362
; Basic Block Input Regs: rbx - Killed Regs: <nothing>
000000000000c219 4885DB test rbx, rbx
000000000000c21c 0F8440010000 je 0xc362
; Basic Block Input Regs: r15 - Killed Regs: rax
000000000000c222 41F6C701 test r15L, 0x1
000000000000c226 4C89F8 mov rax, r15
000000000000c229 740F je 0xc23a
; Basic Block Input Regs: <nothing> - Killed Regs: rax
000000000000c22b 4883E00F and rax, 0xf
000000000000c22f 48C1E003 shl rax, 0x3
000000000000c233 4803057ECE1000 add rax, qword [ds:0x1190b8]
; Basic Block Input Regs: rax rbx - Killed Regs: rdx rbp rsi rdi r12
000000000000c23a 4C8B20 mov r12, qword [ds:rax] ; XREF=0xc229
000000000000c23d 48C745D000000000 mov qword [ss:rbp-0x40+var_16], 0x0
000000000000c245 4889DF mov rdi, rbx
000000000000c248 E805F5FFFF call _ivar_getName
000000000000c24d 488D55D0 lea rdx, qword [ss:rbp-0x40+var_16]
000000000000c251 4C89E7 mov rdi, r12
000000000000c254 4889C6 mov rsi, rax
000000000000c257 E8C9E7FFFF call __class_getVariable
000000000000c25c 4885C0 test rax, rax
000000000000c25f 7433 je 0xc294
; Basic Block Input Regs: <nothing> - Killed Regs: r14
000000000000c261 4C8D75D0 lea r14, qword [ss:rbp-0x40+var_16]
000000000000c265 EB1F jmp 0xc286
; Basic Block Input Regs: rax rbx r12 r14 - Killed Regs: rdx rbp rsi rdi
000000000000c267 E88EC3FFFF call __class_getSuperclass ; XREF=0xc292
000000000000c26c 488945D0 mov qword [ss:rbp-0x40+var_16], rax
000000000000c270 4889DF mov rdi, rbx
000000000000c273 E8DAF4FFFF call _ivar_getName
000000000000c278 4C89E7 mov rdi, r12
000000000000c27b 4889C6 mov rsi, rax
000000000000c27e 4C89F2 mov rdx, r14
000000000000c281 E89FE7FFFF call __class_getVariable
; Basic Block Input Regs: rax rbx - Killed Regs: <nothing>
000000000000c286 4839D8 cmp rax, rbx ; XREF=0xc265
000000000000c289 7409 je 0xc294
; Basic Block Input Regs: rbp - Killed Regs: rdi
000000000000c28b 488B7DD0 mov rdi, qword [ss:rbp-0x40+var_16]
000000000000c28f 4885FF test rdi, rdi
000000000000c292 75D3 jne 0xc267
; Basic Block Input Regs: rax rbx rbp - Killed Regs: rbx rdi r12 r13
000000000000c294 4C8B6DD0 mov r13, qword [ss:rbp-0x40+var_16] ; XREF=0xc25f, 0xc289
000000000000c298 4889DF mov rdi, rbx
000000000000c29b E8B3F2FFFF call _ivar_getOffset
000000000000c2a0 4889C3 mov rbx, rax
000000000000c2a3 4D8D241F lea r12, qword [ds:r15+rbx]
000000000000c2a7 4C89EF mov rdi, r13
000000000000c2aa E8B3F2FFFF call __class_usesAutomaticRetainRelease
000000000000c2af 84C0 test al, al
000000000000c2b1 746B je 0xc31e
; Basic Block Input Regs: rax r13 - Killed Regs: rdi r14
000000000000c2b3 4C89EF mov rdi, r13
000000000000c2b6 E840A00000 call __class_getInstanceStart
000000000000c2bb 4189C6 mov r14d, eax
000000000000c2be 4C89EF mov rdi, r13
000000000000c2c1 E80EA80000 call _class_getWeakIvarLayout
000000000000c2c6 4885C0 test rax, rax
000000000000c2c9 7423 je 0xc2ee
; Basic Block Input Regs: rax rbx r14 - Killed Regs: rcx rsi rdi
000000000000c2cb 4489F1 mov ecx, r14d
000000000000c2ce 4889DF mov rdi, rbx
000000000000c2d1 4829CF sub rdi, rcx
000000000000c2d4 4889C6 mov rsi, rax
000000000000c2d7 E8483C0000 call __ZL17is_scanned_offsetlPKh ; is_scanned_offset(long, unsigned char const*)
000000000000c2dc 84C0 test al, al
000000000000c2de 740E je 0xc2ee
; Basic Block Input Regs: rbp r12 - Killed Regs: rsi rdi
000000000000c2e0 4C89E7 mov rdi, r12
000000000000c2e3 488B75C8 mov rsi, qword [ss:rbp-0x40+var_8]
000000000000c2e7 E82B440100 call _objc_storeWeak
000000000000c2ec EB74 jmp 0xc362
; Basic Block Input Regs: rax r13 - Killed Regs: rdi
000000000000c2ee 4C89EF mov rdi, r13 ; XREF=0xc2c9, 0xc2de
000000000000c2f1 E8C5A70000 call _class_getIvarLayout
000000000000c2f6 4885C0 test rax, rax
000000000000c2f9 7423 je 0xc31e
; Basic Block Input Regs: rax rbx r14 - Killed Regs: rcx rsi rdi
000000000000c2fb 4489F1 mov ecx, r14d
000000000000c2fe 4889DF mov rdi, rbx
000000000000c301 4829CF sub rdi, rcx
000000000000c304 4889C6 mov rsi, rax
000000000000c307 E8183C0000 call __ZL17is_scanned_offsetlPKh ; is_scanned_offset(long, unsigned char const*)
000000000000c30c 84C0 test al, al
000000000000c30e 740E je 0xc31e
; Basic Block Input Regs: rbp r12 - Killed Regs: rsi rdi
000000000000c310 4C89E7 mov rdi, r12
000000000000c313 488B75C8 mov rsi, qword [ss:rbp-0x40+var_8]
000000000000c317 E8044C0100 call _objc_storeStrong
000000000000c31c EB44 jmp 0xc362
; Basic Block Input Regs: <nothing> - Killed Regs: <nothing>
000000000000c31e 803DEB49110000 cmp byte [ds:_UseGC], 0x0 ; XREF=0xc2b1, 0xc2f9, 0xc30e
000000000000c325 7428 je 0xc34f
; Basic Block Input Regs: rax r13 - Killed Regs: rdi
000000000000c327 4C89EF mov rdi, r13
000000000000c32a E8A5A70000 call _class_getWeakIvarLayout
000000000000c32f 4885C0 test rax, rax
000000000000c332 741B je 0xc34f
; Basic Block Input Regs: rax rbx - Killed Regs: rsi rdi
000000000000c334 4889DF mov rdi, rbx
000000000000c337 4889C6 mov rsi, rax
000000000000c33a E8E53B0000 call __ZL17is_scanned_offsetlPKh ; is_scanned_offset(long, unsigned char const*)
000000000000c33f 84C0 test al, al
000000000000c341 740C je 0xc34f
; Basic Block Input Regs: rbp r12 - Killed Regs: rsi rdi
000000000000c343 488B7DC8 mov rdi, qword [ss:rbp-0x40+var_8]
000000000000c347 4C89E6 mov rsi, r12
000000000000c34a E831230000 call _objc_assign_weak
; Basic Block Input Regs: rbx rbp r15 - Killed Regs: rax rdx rsi rdi
000000000000c34f 488D055AE51000 lea rax, qword [ds:_objc_assign_ivar_internal] ; XREF=0xc325, 0xc332, 0xc341
000000000000c356 488B7DC8 mov rdi, qword [ss:rbp-0x40+var_8]
000000000000c35a 4C89FE mov rsi, r15
000000000000c35d 4889DA mov rdx, rbx
000000000000c360 FF10 call qword [ds:rax]
; Basic Block Input Regs: <nothing> - Killed Regs: rbx rsp rbp r12 r13 r14 r15
000000000000c362 4883C418 add rsp, 0x18 ; XREF=0xc213, 0xc21c, 0xc2ec, 0xc31c
000000000000c366 5B pop rbx
Note the call to _class_usesAutomaticRetainRelease at 000000000000c2aa.
The implementation of this function is available at opensource.apple.com.
The header comment for this function is:
/***********************************************************************
* _class_usesAutomaticRetainRelease
* Returns YES if class was compiled with -fobjc-arc
**********************************************************************/
This means that this particular function of the ObjC runtime is aware of ARC.
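The branch structure after that call can be summarised with a small C model. This is only a reading of the listing above, not the runtime's source: the enum, the function name pick_store and its three flag parameters are hypothetical stand-ins for the results of _class_usesAutomaticRetainRelease and the two is_scanned_offset checks, and the garbage-collection path at 0xc31e is folded into the "plain" case.

```c
/* Hypothetical model of which store routine the listing ends up calling. */
enum store_kind {
    STORE_WEAK,   /* call _objc_storeWeak   (0xc2e7) */
    STORE_STRONG, /* call _objc_storeStrong (0xc317) */
    STORE_PLAIN   /* non-ARC path starting at 0xc31e */
};

enum store_kind pick_store(int class_uses_arc, int slot_is_weak, int slot_is_strong) {
    if (class_uses_arc) {
        if (slot_is_weak) {
            return STORE_WEAK;   /* weak layout is checked first in the listing */
        }
        if (slot_is_strong) {
            return STORE_STRONG; /* strong layout is checked second */
        }
    }
    return STORE_PLAIN;
}
```

For a strong object ivar such as NSString* _name in an ARC-compiled class, this reading ends in objc_storeStrong, which retains the new value and releases the one previously stored in the slot.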
I don't know the answer, but I know how to determine the answer...
Create an ivar that points to an instance of a class of your own, like LC2DFoo. Put a breakpoint in -[LC2DFoo dealloc]. Create an instance of LC2DFoo and assign it to the ivar using the runtime method you describe. Next, set the same ivar to something else, like nil. Do you hit your breakpoint?

GNU inline assembly optimisation

I am trying to write a small library for highly optimised x86-64 bit operation code and am fiddling with inline asm.
While testing, this particular case caught my attention:
unsigned long test = 0;
unsigned long bsr;
// bit test and set 39th bit
__asm__ ("btsq\t%1, %0 " : "+rm" (test) : "rJ" (39) );
// bit scan reverse (get most significant bit id)
__asm__ ("bsrq\t%1, %0" : "=r" (bsr) : "rm" (test) );
printf("test = %lu, bsr = %lu\n", test, bsr);
compiles and runs fine in both gcc and icc, but when I inspect the assembly I get differences
gcc -S -fverbose-asm -std=gnu99 -O3
movq $0, -8(%rbp)
## InlineAsm Start
btsq $39, -8(%rbp)
## InlineAsm End
movq -8(%rbp), %rax
movq %rax, -16(%rbp)
## InlineAsm Start
bsrq -16(%rbp), %rdx
## InlineAsm End
movq -8(%rbp), %rsi
leaq L_.str(%rip), %rdi
xorb %al, %al
callq _printf
I am wondering why this is so complicated. I am writing high performance code in which the number of instructions is critical. I am especially wondering why gcc makes a copy of my variable test before passing it to the second inline asm.
Same code compiled with icc gives far better results:
xorl %esi, %esi # test = 0
movl $.L_2__STRING.0, %edi # has something to do with printf
orl $32832, (%rsp) # part of function initiation
xorl %eax, %eax # has something to do with printf
ldmxcsr (%rsp) # part of function initiation
btsq $39, %rsi #106.0
bsrq %rsi, %rdx #109.0
call printf #111.2
Leaving aside the fact that gcc decides to keep my variables on the stack rather than in registers, what I do not understand is why it makes a copy of test before passing it to the second asm.
If I put test in as an input/output variable in the second asm
__asm__ ("bsrq\t%1, %0" : "=r" (bsr) , "+rm" (test) );
then those lines disappear.
movq $0, -8(%rbp)
## InlineAsm Start
btsq $39, -8(%rbp)
## InlineAsm End
## InlineAsm Start
bsrq -8(%rbp), %rdx
## InlineAsm End
movq -8(%rbp), %rsi
leaq L_.str(%rip), %rdi
xorb %al, %al
callq _printf
Is this a botched gcc optimisation, or am I missing some vital compiler switches? I do have icc for my production system, but if I decide to distribute the source code at some point then it will have to compile with gcc too.
compilers used:
gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.1.00)
icc Version 12.0.2
I've tried your example on Linux like this (making it "evil" by forcing a stack reference/location for test by using &test in the printf):
#include <stdio.h>
int main(int argc, char **argv)
{
    unsigned long test = 0;
    unsigned long bsr;
    // bit test and set 39th bit
    asm ("btsq\t%1, %0 " : "+rm" (test) : "rJ" (39) );
    // bit scan reverse (get most significant bit id)
    asm ("bsrq\t%1, %0" : "=r" (bsr) : "rm" (test) );
    printf("test = %lu, bsr = %lu, &test = %p\n", test, bsr, (void *)&test);
    return 0;
}
and compiled it with various versions of gcc -O3 ... to the following results:
code generated gcc version
================================================================================
400630: 48 83 ec 18 sub $0x18,%rsp 4.7.2,
400634: 31 c0 xor %eax,%eax 4.6.2,
400636: bf 50 07 40 00 mov $0x400750,%edi 4.4.6
40063b: 48 8d 4c 24 08 lea 0x8(%rsp),%rcx
400640: 48 0f ba e8 27 bts $0x27,%rax
400645: 48 89 44 24 08 mov %rax,0x8(%rsp)
40064a: 48 89 c6 mov %rax,%rsi
40064d: 48 0f bd d0 bsr %rax,%rdx
400651: 31 c0 xor %eax,%eax
400653: e8 68 fe ff ff callq 4004c0
[ ... ]
---------------------------------------------------------------------------------
4004f0: 48 83 ec 18 sub $0x18,%rsp 4.1
4004f4: 31 c0 xor %eax,%eax
4004f6: bf 28 06 40 00 mov $0x400628,%edi
4004fb: 48 8d 4c 24 10 lea 0x10(%rsp),%rcx
400500: 48 c7 44 24 10 00 00 00 00 movq $0x0,0x10(%rsp)
400509: 48 0f ba e8 27 bts $0x27,%rax
40050e: 48 89 44 24 10 mov %rax,0x10(%rsp)
400513: 48 89 c6 mov %rax,%rsi
400516: 48 0f bd d0 bsr %rax,%rdx
40051a: 31 c0 xor %eax,%eax
40051c: e8 c7 fe ff ff callq 4003e8
[ ... ]
---------------------------------------------------------------------------------
400500: 48 83 ec 08 sub $0x8,%rsp 3.4.5
400504: bf 30 06 40 00 mov $0x400630,%edi
400509: 31 c0 xor %eax,%eax
40050b: 48 c7 04 24 00 00 00 00 movq $0x0,(%rsp)
400513: 48 89 e1 mov %rsp,%rcx
400516: 48 0f ba 2c 24 27 btsq $0x27,(%rsp)
40051c: 48 8b 34 24 mov (%rsp),%rsi
400520: 48 0f bd 14 24 bsr (%rsp),%rdx
400525: e8 fe fe ff ff callq 400428
[ ... ]
---------------------------------------------------------------------------------
4004e0: 48 83 ec 08 sub $0x8,%rsp 3.2.3
4004e4: bf 10 06 40 00 mov $0x400610,%edi
4004e9: 31 c0 xor %eax,%eax
4004eb: 48 c7 04 24 00 00 00 00 movq $0x0,(%rsp)
4004f3: 48 0f ba 2c 24 27 btsq $0x27,(%rsp)
4004f9: 48 8b 34 24 mov (%rsp),%rsi
4004fd: 48 89 e1 mov %rsp,%rcx
400500: 48 0f bd 14 24 bsr (%rsp),%rdx
400505: e8 ee fe ff ff callq 4003f8
[ ... ]
and while there's a significant difference in the created code (including whether the bsr accesses test as register or memory), none of the tested revisions recreate the assembly that you've shown. I'd suspect a bug in the 4.2.x version you used on Mac OS X, but then I have neither your testcase nor that specific compiler version available.
Edit: The code above is obviously different in the sense that it forces test into the stack; if that is not done, then all "plain" gcc versions I've tested do a direct pair bts $39, %rsi / bsr %rsi, %rdx.
I have found, though, that clang creates different code there:
140: 50 push %rax
141: 48 c7 04 24 00 00 00 00 movq $0x0,(%rsp)
149: 31 f6 xor %esi,%esi
14b: 48 0f ba ee 27 bts $0x27,%rsi
150: 48 89 34 24 mov %rsi,(%rsp)
154: 48 0f bd d6 bsr %rsi,%rdx
158: bf 00 00 00 00 mov $0x0,%edi
15d: 30 c0 xor %al,%al
15f: e8 00 00 00 00 callq printf@plt
so the difference seems to be indeed between the code generators of clang/llvm and "gcc proper".
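If the goal is only "set a bit, then find the index of the highest set bit", one way to sidestep operand-constraint differences between code generators is to let the compiler choose the instructions itself. A minimal sketch, assuming a GCC-compatible compiler (GCC and Clang provide __builtin_clzll; check your icc version); the helper names set_bit and msb_index are made up for illustration, and nothing here has been measured against the inline asm versions above.

```c
#include <assert.h>

/* Returns x with the given bit set; bit must be in 0..63. */
static inline unsigned long long set_bit(unsigned long long x, unsigned bit) {
    assert(bit < 64); /* shifting by 64 or more would be undefined */
    return x | (1ULL << bit);
}

/* Index of the most significant set bit, or -1 for x == 0.
 * __builtin_clzll is undefined for 0, hence the explicit check. */
static inline int msb_index(unsigned long long x) {
    if (x == 0) {
        return -1;
    }
    return 63 - __builtin_clzll(x);
}
```

With optimisation enabled, a compiler is free to turn these into bts/bsr (or or/lzcnt) on x86-64 and keep everything in registers, but that is its decision; inspect the output with -S as in the question.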

LLVM insertvalue bad optimized?

Should I avoid using the 'insertvalue' instruction combined with load and store when I emit LLVM code?
I always get badly optimized native code when I use it. Look at the following example:
; ModuleID = 'mod'
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-pc-linux-gnu"
%A = type { i64, i64, i64, i64, i64, i64, i64, i64 }
@aa = external global %A*
define void @func() {
entry:
%a1 = load %A** @aa
%a2 = load %A* %a1
%a3 = insertvalue %A %a2, i64 3, 3
store %A %a3, %A* %a1
ret void
}
When I run "llc -o - -O3 mod.ll", I get this horrible code:
func: # @func
.Ltmp0:
.cfi_startproc
# BB#0: # %entry
movq aa(%rip), %rax
movq (%rax), %r8
movq 8(%rax), %r9
movq 16(%rax), %r10
movq 32(%rax), %rdi
movq 40(%rax), %rcx
movq 48(%rax), %rdx
movq 56(%rax), %rsi
movq %rsi, 56(%rax)
movq %rdx, 48(%rax)
movq %rcx, 40(%rax)
movq %rdi, 32(%rax)
movq %r10, 16(%rax)
movq %r9, 8(%rax)
movq %r8, (%rax)
movq $3, 24(%rax)
ret
But what I would like to see is this:
func: # @func
.Ltmp0:
.cfi_startproc
# BB#0: # %entry
movq aa(%rip), %rax
movq $3, 24(%rax)
ret
Of course I can use getelementptr or something, but sometimes it is easier to generate insertvalue and extractvalue instructions, and I want these to be optimized...
I think it would be quite easy for the codegen to see that things like these are bad:
movq 56(%rax), %rsi
movq %rsi, 56(%rax)
First, note that llc does not do any IR-level optimizations, so you should run opt to apply the set of IR-level optimizers.
However, opt does not help here. I'd have expected the standard IR-level optimizers to canonicalize this into getelementptr form somehow.
Please file an LLVM PR; this looks like a missed optimization!
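In C terms, the two shapes being compared look roughly like this. It is an analogy only: struct A with an eight-element array stands in for the %A type from the question, and the function names are invented for illustration. The first function mirrors the load / insertvalue / store sequence, the second mirrors a getelementptr plus a single store.

```c
#include <stdint.h>

/* Stand-in for %A = type { i64 x 8 }; same size and layout as eight i64 fields. */
struct A {
    int64_t f[8];
};

/* Mirrors the IR in the question: load the whole aggregate,
 * replace element 3, store the whole aggregate back. */
void set_via_copy(struct A *p) {
    struct A tmp = *p;
    tmp.f[3] = 3;
    *p = tmp;
}

/* Mirrors the getelementptr form: address one field and store to it. */
void set_via_field(struct A *p) {
    p->f[3] = 3;
}
```

Both functions leave the struct in the same state, which is why the seven load/store pairs in the llc output are redundant and could in principle be removed.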

printf("%%%s","hello")

If I write printf("%%%s","hello"), how is it interpreted by the compiler? Enlighten me, someone.
The compiler simply interprets this as calling printf with two strings as arguments (but see Zack's comment).
This happens at compile time (i.e. the compiler does this):
The strings ("%%%s" and "hello") are copied directly into the executable, the compiler leaves them as-is.
This happens at runtime (i.e. the C standard library does this when the app is running):
printf stands for 'print formatted'. When this function is called, it needs at least one argument. The first argument is the format string. The following arguments are the values for that format; they are formatted as specified in the first argument.
About optimization
I have written an example and ran Clang/LLVM with -S:
$ emacs printftest.c
$ clang printftest.c -S -o printftest_unopt.s # not optimized
$ clang printftest.c -S -O -o printftest_opt.s # optimized: -O flag
C code
#include <stdio.h>
int main() {
printf("%%%s", "hello");
return 0;
}
printftest_unopt.s (not optimized)
; ...
_main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl $0, %eax
movl $0, -4(%rbp)
movl %eax, -8(%rbp)
xorb %al, %al
leaq L_.str(%rip), %rdi
leaq L_.str1(%rip), %rsi
callq _printf ; printf called here <----------------
movl %eax, -12(%rbp)
movl -8(%rbp), %eax
addq $16, %rsp
popq %rbp
ret
.section __TEXT,__cstring,cstring_literals
L_.str:
.asciz "%%%s"
L_.str1:
.asciz "hello"
; ...
printftest_opt.s (optimized)
; ...
_main:
pushq %rbp
movq %rsp, %rbp
leaq L_.str(%rip), %rdi
leaq L_.str1(%rip), %rsi
xorb %al, %al
callq _printf ; printf called here <----------------
xorl %eax, %eax
popq %rbp
ret
.section __TEXT,__cstring,cstring_literals
L_.str:
.asciz "%%%s"
L_.str1:
.asciz "hello"
; ...
Conclusion
As you can see (in the __TEXT,__cstring,cstring_literals section and the callq to printf), Clang/LLVM (a very, very good compiler) does not fold printf("%%%s", "hello") at compile time: both string literals and the call are still there. :)
"%%" means to print an actual "%" character; %s means to print a string from the argument list. So you'll see "%hello".
%% will print a literal '%' character.
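To check this without looking at terminal output, the same format string can be run through snprintf, which follows the same formatting rules as printf but writes into a buffer. A small sketch; the wrapper name format_percent_s is made up for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Formats s with the format string from the question into out.
 * Returns what snprintf returns: the number of characters the full
 * result needs, not counting the terminating '\0'. */
int format_percent_s(char *out, size_t cap, const char *s) {
    return snprintf(out, cap, "%%%s", s);
}
```

Called with "hello" and a buffer of at least 7 bytes, this leaves the six characters %hello in the buffer: the %% pair produces one literal percent sign and %s consumes the string argument.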