Using LLVM to optimize programs that use large structs

I made a toy Brainfuck compiler. It works, but given the known initial state, the output is far less optimized than I hoped.
I have this state structure:
struct state {
    unsigned char mem[0x1000];
    unsigned long ip;
    unsigned index;
};
The state structure (which looks like type { [4096 x i8], i64, i32 } in LLVM IR) is allocated with an alloca instruction, and then zeroed with a memset call (the intrinsic version).
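In 3.7-era IR, that prologue looks something like this (a sketch; the value names and the exact size, here the padded sizeof of the struct, are illustrative):
%state = alloca %"state", align 8
%0 = bitcast %"state"* %state to i8*
call void @llvm.memset.p0i8.i64(i8* %0, i8 0, i64 4112, i32 8, i1 false)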
And my operations are implemented as you would expect:
< as state.index--
> as state.index++
- as state.mem[state.index]--
+ as state.mem[state.index]++
. as putchar(state.mem[state.index])
, as state.mem[state.index] = getchar()
[ as the beginning of a while (state.mem[state.index] != 0) { loop
] as the end of a loop
For each operation, I emit the simplest matching LLVM IR I can think of. For instance, + is implemented as:
; %index = &state.index
%index = getelementptr inbounds %"state", %"state"* %state, i64 0, i32 1
; %0 = *%index
%0 = load i64, i64* %index, align 8
; %arrayidx = &state.mem[%0]
%arrayidx = getelementptr inbounds %"state", %"state"* %state, i64 0, i32 0, i64 %0
; %1 = *%arrayidx
%1 = load i8, i8* %arrayidx, align 1
; %inc = %1 + 1
%inc = add i8 %1, 1
; *%arrayidx = %inc
store i8 %inc, i8* %arrayidx, align 1
I thought that this would be enough information to let LLVM optimize programs so hard that there would barely be anything left. The initial state is known, no pointer to it is shared, and sequential increments are easy to detect. Obviously, loops are harder to optimize, but I could understand that.
Much to my disappointment, however, the resulting code is still an ugly mess of getelementptr, load and store. None of these were elided in favor of something simpler.
I wasn't sure if I was just doing something wrong, so I took a hello world program, converted it to C by essentially replacing each Brainfuck character with its matching C code as shown above, compiled it with Clang at -O3, and dumped the resulting IR. I found it largely equivalent to mine: it appears that Clang isn't any better at coping with this than my poor toy compiler.
However, if I take index out of the struct and make it a local, Clang is able to optimize most of its uses into IR registers. So what's the deal here? Why is LLVM unable to optimize this pattern of access to a struct? Is there a way I can tell LLVM that this memory is 100% private, so that it can optimize its uses any way it wants?
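For reference, the C translation simply follows the mapping above; a minimal sketch (not the actual hello world, which is longer) looks like this:
#include <stdio.h>

struct state {
    unsigned char mem[0x1000];
    unsigned long ip;
    unsigned index;
};

int main(void) {
    struct state state = {0};          /* known initial state, fully private */
    ++state.mem[state.index];          /* + */
    ++state.index;                     /* > */
    while (state.mem[state.index]) {   /* [ */
        --state.mem[state.index];      /* - */
    }                                  /* ] */
    putchar(state.mem[state.index]);   /* . */
    return 0;
}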
In case it makes an important difference: I'm on LLVM 3.7 (svn), up to date as of sometime last week.

Most probably you're not providing a data layout string. Without one, many optimization passes are unable to produce decent results: they cannot know the size of a pointer, among other things.
See http://llvm.org/docs/LangRef.html#data-layout for more information. As a first step, I would suggest grabbing the data layout string that Clang generates on your platform and pasting it into your .ll.
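For example, Clang on x86-64 Linux emits something like the following at the top of its .ll output; the exact string varies by target and LLVM version, so copy the one your Clang generates rather than this one:
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"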

Related

Why does the Rust compiler perform a copy when moving an immutable value?

My intuition must be wrong about moves and copies. I would expect the Rust compiler to optimize away moves of an immutable value as a no-op: since the value is immutable, we can safely reuse it after the move. But Rust 1.65.0 on Godbolt compiles it to assembly that copies the value to a new position in memory. The Rust code that I am studying:
pub fn f_int() {
    let x = 3;
    let y = x;
    println!("{}, {}", x, y);
}
The resulting assembly with -C opt-level=3:
; pub fn f_int() {
    sub rsp, 88
; let x = 3;
    mov dword ptr [rsp], 3
; let y = x;
    mov dword ptr [rsp + 4], 3
    mov rax, rsp
    ...
Why does let y = x; result in mov dword ptr [rsp + 4], 3 and mov rax, rsp? Why doesn't the compiler treat y as the same variable as x in the assembly?
(This question looks similar but it is about strings which are not Copy. My question is about integers which are Copy. It looks like what I am describing is not a missed optimization opportunity but a fundamental mistake in my understanding.)
I would not call it a fundamental mistake in your understanding, but there are some interesting observations here.
First, println!() (and the formatting machinery in particular) is surprisingly hard to optimize, due to its design. So the fact that with println!() it was not optimized is not surprising.
Second, it is generally not obvious that it is OK to perform this optimization, because it would observably make the addresses equivalent, and println!() takes the addresses of the printed values (and passes them to an opaque function). In fact, Copy types are harder to justify than non-Copy types in this regard: with Copy types the original variable may still be used after a move, while with non-Copy types it cannot be.
If you change your example like this:
pub fn f_int() -> i32 {
    let x = 3;
    let y = x;
    // println!("{}, {}", x, y);
    x + y
}
the optimisation takes place:
example::f_int:
    mov eax, 6
    ret
The println!() macro (as well as write!()...) takes references on its parameters and provides the formatting machinery with these references.
Presumably, the compiler deduces that passing references to functions (that are not inlined) requires the data to be stored somewhere in memory so that it has an address.
Because the type is Copy, the semantics imply that we have two distinct storages; otherwise, sharing the storage would have been an optimisation of a move operation (not a copy).
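As a rough analogy in C (a sketch, not from the thread; opaque() is a hypothetical stand-in for the formatting machinery), escaping addresses pin both copies in memory:
/* Hypothetical opaque function the compiler cannot inline or inspect. */
extern void opaque(const int *a, const int *b);

void f_int(void)
{
    int x = 3;
    int y = x;       /* x and y are distinct objects, so &x != &y is guaranteed */
    opaque(&x, &y);  /* both addresses escape; merging the two stores would be observable */
}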

Nim sequence assignment value semantics

I was under the impression that sequences and strings always get deeply copied on assignment. Today I got burned when interfacing with a C library to which I pass unsafeAddr of a Nim sequence. The C library writes into the memory area starting at the passed pointer.
Since I don't want the original Nim sequence to be changed by the library I thought I'll simply copy the sequence by assigning it to a new variable named copy and pass the address of the copy to the library.
Lo and behold, the modifications showed up in the original Nim sequence nevertheless. What's even more weird is that this behavior depends on whether the copy is declared via let copy = ... (changes do show up) or via var copy = ... (changes do not show up).
The following code demonstrates this in a very simplified Nim example:
proc changeArgDespiteCopyAssignment(x: seq[int], val: int): seq[int] =
  let copy = x
  let copyPtr = unsafeAddr(copy[0])
  copyPtr[] = val
  result = copy

proc dontChangeArgWhenCopyIsDeclaredAsVar(x: seq[int], val: int): seq[int] =
  var copy = x
  let copyPtr = unsafeAddr(copy[0])
  copyPtr[] = val
  result = copy
let originalSeq = @[1, 2, 3]
var ret = changeArgDespiteCopyAssignment(originalSeq, 9999)
echo originalSeq
echo ret
ret = dontChangeArgWhenCopyIsDeclaredAsVar(originalSeq, 7777)
echo originalSeq
echo ret
This prints:
@[9999, 2, 3]
@[9999, 2, 3]
@[9999, 2, 3]
@[7777, 2, 3]
So the first call changes originalSeq while the second doesn't. Can someone explain what is going on under the hood?
I'm using Nim 1.6.6 and am a total Nim newbie.
Turns out there are a lot of issues concerned with this behavior in the nim-lang issue tracker. For example:
let semantics gives 3 different results depends on gc, RT vs VM, backend, type, global vs local scope
Seq assignment does not perform a deep copy
Let behaves differently in proc for default gc
assigning var to local let does not properly copy in default gc
clarify spec/implementation for let: move or copy?
RFC give default gc same semantics as --gc:arc as much as possible
Long story short, whether a copy is made depends on a lot of factors, for sequences especially on the scope (global vs. local) and the GC (refc, arc, orc) in use.
More generally, the type involved (seq vs. array), the code generation backend (C vs. JS) and whatnot can also be relevant.
This behavior has tricked a lot of beginners and is not well received by some of the contributors. It doesn't happen with the newer GCs --gc:arc or --gc:orc where the latter is expected to become the default GC in an upcoming Nim version.
It has never been fixed in the current default gc because of performance concerns, backward compatibility risks and the expectation that it will disappear anyway once we transition to the newer GCs.
Personally, I would have expected that it at least gets clearly singled out in the Nim language manual. Well, it isn't.
I took a quick look at the generated C code, a highly edited and simplified version is below.
The essential thing missing in the let copy = ... variant is the call to genericSeqAssign(), which makes a copy of the argument and assigns that to copy; instead, the argument is assigned to copy directly. So there is definitely no deep copy on assignment in that case.
I don't know if that is intentional or if it is a bug in the code generation (my first impression was astonishment). Any ideas?
tySequence_A* changeArgDespiteCopyAssignment(tySequence_A* x, NI val) {
    tySequence_A* result = NIM_NIL;
    tySequence_A* copy;
    NI* copyPtr;
    copy = x; /* NO genericSeqAssign() call! */
    copyPtr = &copy->data[(NI) 0];
    *copyPtr = val;
    genericSeqAssign(&result, copy, &NTIseqLintT_B);
    return result;
}

tySequence_A* dontChangeArgWhenCopyIsDeclaredAsVar(tySequence_A* x, NI val) {
    tySequence_A* result = NIM_NIL;
    tySequence_A* copy;
    NI* copyPtr;
    genericSeqAssign(&copy, x, &NTIseqLintT_B);
    copyPtr = &copy->data[(NI) 0];
    *copyPtr = val;
    genericSeqAssign(&result, copy, &NTIseqLintT_B);
    return result;
}

RISC-V inline assembly using memory not behaving correctly

This system call code is not working at all. The compiler is optimizing things out and generally behaving strangely:
template <typename... Args>
inline void print(Args&&... args)
{
    char buffer[1024];
    auto res = strf::to(buffer) (std::forward<Args> (args)...);
    const size_t size = res.ptr - buffer;
    register const char* a0 asm("a0") = buffer;
    register size_t a1 asm("a1") = size;
    register long syscall_id asm("a7") = ECALL_WRITE;
    register long a0_out asm("a0");
    asm volatile ("ecall" : "=r"(a0_out)
        : "m"(*(const char(*)[size]) a0), "r"(a1), "r"(syscall_id) : "memory");
}
This is a custom system call that takes a buffer and a length as arguments.
If I write this using global assembly it works as expected, but the generated code has generally been extraordinarily good when I write the wrappers inline.
A function that calls the print function with a constant string produces invalid machine code:
0000000000120f54 <start>:
start():
120f54: fa1ff06f j 120ef4 <public_donothing-0x5c>
-->
120ef4: 747367b7 lui a5,0x74736
120ef8: c0010113 addi sp,sp,-1024
120efc: 55478793 addi a5,a5,1364 # 74736554 <add_work+0x74615310>
120f00: 00f12023 sw a5,0(sp)
120f04: 00a00793 li a5,10
120f08: 00f10223 sb a5,4(sp)
120f0c: 000102a3 sb zero,5(sp)
120f10: 00500593 li a1,5
120f14: 06600893 li a7,102
120f18: 00000073 ecall
120f1c: 40010113 addi sp,sp,1024
120f20: 00008067 ret
It's not loading a0 with the buffer at sp.
What am I doing wrong?
It's not loading a0 with the buffer at sp.
Because you didn't ask for a pointer as an "r" input in a register. The one and only guaranteed/supported behaviour of T foo asm("a0") is to make an "r" constraint (including +r or =r) pick that register.
But you used "m" to let it pick an addressing mode for that buffer, not necessarily 0(a0), so it probably picked an SP-relative mode. If you add asm comments inside the template like "ecall # 0 = %0 1 = %1 2 = %2" you can look at the compiler's asm output and see what it picked. (With clang, use -no-integrated-as so asm comments in the template come through in the -S output.)
Wrapping a system call does need the pointer in a specific register, i.e. using "r" or "+r":
asm volatile ("ecall # 0=%0 1=%1 2=%2 3=%3 4=%4"
: "=r"(a0_out)
: "r"(a0), "r"(a1), "r"(syscall_id), "m"(*(const char(*)[size]) a0)
: // "memory" unneeded; the "m" input tells the compiler which memory is read
);
That "m" input can be used instead of the "memory" clobber, not instead of an "r" pointer input. (For write specifically, because it only reads that one area of pointed-to memory and has no other side-effects on memory user-space can see, only on kernel write write buffers and file-descriptor positions which aren't C objects this program can access directly. For a read call, you'd need the memory to be an output operand.)
With optimization disabled, compilers do typically pick another register as the base for the "m" input (e.g. 0(a5) for GCC), but with optimization enabled GCC picks 0(a0) so it doesn't cost extra instructions. Clang still picks 0(a2), wasting an instruction to set up that pointer, even though the "=r"(a0_out) is not early-clobber. (Godbolt, with a very cut-down version of the function that doesn't call strf::to, whatever that is, just copies a byte into the buffer.)
Interestingly, with optimization enabled for my cut-down stand-alone version of the function without fixing the bug, GCC and clang do happen to put a pointer to buffer into a0, picking 0(a0) as the template expansion for that operand (see the Godbolt link above). This seems to be a missed optimization vs. using 16(sp); I don't see why they'd need the buffer address in a register at all.
But without optimization, GCC picks ecall # 0 = a0 1 = 0(a5) 2 = a1. (In my simplified version of the function, it sets a5 with mv a5,a0, so it did actually have the address in a0 as well. So it's a good thing you had more code in your function to make it not happen to work by accident, so you could find the bug in your code.)
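Putting it together, a corrected stand-alone wrapper along these lines might look like the following (a sketch: the syscall number 102 is taken from the li a7,102 in the question's disassembly, and the VLA-style cast is a GNU extension, as in the question):
#include <stddef.h>

#define ECALL_WRITE 102  /* from the question's disassembly (li a7,102) */

static inline long sys_write(const char* buf, size_t size)
{
    register const char* a0 asm("a0") = buf;
    register size_t a1 asm("a1") = size;
    register long syscall_id asm("a7") = ECALL_WRITE;
    register long a0_out asm("a0");
    asm volatile ("ecall"
        : "=r"(a0_out)
        : "r"(a0), "r"(a1), "r"(syscall_id),
          "m"(*(const char(*)[size]) a0)
        : /* no "memory" clobber needed: the "m" input covers the buffer */);
    return a0_out;
}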

LLVM GEP and store vs load and insertvalue: Storing value to a pointer to an aggregate

What is the difference between getelementptr and store vs. load and insertvalue when storing a value to a pointer to an aggregate type? Is one preferred in certain circumstances? And if so why? Or am I headed entirely in the wrong direction?
Example:
; A contrived example
%X = type { i32, i64 }

define i32 @main(i32 %argc, i8** %argv) {
entry:
  %sX.0 = alloca %X
  ; change the first value with GEP + store
  %val1 = getelementptr %X, %X* %sX.0, i32 0, i32 0
  store i32 42, i32* %val1
  ; change the second value with load + insertvalue
  %sX.1 = load %X, %X* %sX.0
  %sX.2 = insertvalue %X %sX.1, i64 42, 1
  store %X %sX.2, %X* %sX.0 ; I suppose this could be considered less than ideal,
                            ; however in some cases it is nice to have the
                            ; struct `load`ed
  ret i32 0
}
Interestingly, using llc -O=0 ... they both compile to the same instructions, which amount to the following; this is what I had hoped:
    movl $42, -16(%rsp) # GEP + store
    movq $42, -8(%rsp)  # load + insertvalue
Background:
I was reading about insertvalue in the LLVM Language Reference. The reference notes the extractvalue instruction's similarity to GEP, and the following differences.
The major differences to getelementptr indexing are:
Since the value being indexed is not a pointer, the first index is omitted and assumed to be zero.
At least one index must be specified.
Not only struct indices but also array indices must be in bounds.
The following question on Stack Overflow also mentions the use of getelementptr and insertvalue, but for different reasons: LLVM insertvalue bad optimized?
Semantically, loading and later storing the entire object is more wasteful. What if it's a huge struct? What if it's an array of structs? GEP allows you to access the exact location in memory where you want to load/store, without the need to load/store anything else.
While the two forms were lowered to the same instructions in your example, it isn't generally guaranteed.
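As a rough C analogy (a sketch, not from the thread), the two forms correspond to something like this, which makes the cost difference for large aggregates obvious:
struct X { char mem[4096]; long b; };

/* GEP + store: write just the one field */
void via_gep(struct X *p, long v) {
    p->b = v;              /* touches 8 bytes */
}

/* load + insertvalue + store: copy the whole aggregate */
void via_insertvalue(struct X *p, long v) {
    struct X tmp = *p;     /* copies the full 4 KiB struct */
    tmp.b = v;
    *p = tmp;              /* copies it all back */
}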

ROL / ROR on variable using inline assembly only in Objective-C [duplicate]

This question already has answers here:
ROL / ROR on variable using inline assembly in Objective-C
(2 answers)
Closed 9 years ago.
A few days ago, I asked the question below. Because I was in need of a quick answer, I added:
The code does not need to use inline assembly. However, I haven't found a way to do this using Objective-C / C++ / C instructions.
Today, I would like to learn something. So I ask the question again, looking for an answer using inline assembly.
I would like to perform ROR and ROL operations on variables in an Objective-C program. However, I can't manage it – I am not an assembly expert.
Here is what I have done so far:
uint8_t v1 = ....;
uint8_t v2 = ....; // v2 is either 1, 2, 3, 4 or 5
asm("ROR v1, v2");
the error I get is:
Unknown use of instruction mnemonic with unknown size suffix
How can I fix this?
A rotate is just two shifts (some bits go left, the others right); once you see this, rotating is easy without assembly. The pattern is recognised by some compilers and compiled using the rotate instructions. See Wikipedia for the code.
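For example, a portable 8-bit rotate right in plain C (a minimal sketch of that pattern):
#include <stdint.h>

static inline uint8_t ror8(uint8_t value, unsigned shift)
{
    shift &= 7;  /* keep the count in range; also makes shift == 0 safe */
    return (uint8_t)((value >> shift) | (value << ((8 - shift) & 7)));
}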
Update: Xcode 4.6.2 (others not tested) on x86-64 compiles the double shift + or to a rotate for 32- and 64-bit operands; for 8- and 16-bit operands the double shift + or is kept. Why? Maybe the compiler understands something about the performance of these instructions, maybe they just didn't optimise. But in general, if you can avoid assembler, do so; the compiler invariably knows best! Also, using static inline on the functions, or using macros defined in the same way as the standard macro MAX (a macro has the advantage of adapting to the type of its operands), can be used to inline the operations.
Addendum after OP comment
Here is the x86-64 assembler as an example; for full details of how to use the asm construct, start here.
First the non-assembler version:
static inline uint32 rotl32_i64(uint32 value, unsigned shift)
{
    // assume shift is in range 0..31 or subtraction would be wrong
    // however we know the compiler will spot the pattern and replace
    // the expression with a single roll and there will be no subtraction
    // so if the compiler changes this may break without:
    // shift &= 0x1f;
    return (value << shift) | (value >> (32 - shift));
}
void test_rotl32(uint32 value, unsigned shift)
{
    uint32 shifted = rotl32_i64(value, shift);
    NSLog(@"%8x <<< %u -> %8x", value & 0xFFFFFFFF, shift, shifted & 0xFFFFFFFF);
}
If you look at the assembler output for profiling (so the optimiser kicks in) in Xcode (Product > Generate Output > Assembly File, then select Profiling in the pop-up menu at the bottom of the window) you will see that rotl32_i64 is inlined into test_rotl32 and compiles down to a rotate (roll) instruction.
Now producing the assembler directly yourself is a bit more involved than for the ARM code FrankH showed. This is because to take a variable shift value a specific register, cl, must be used, so we need to give the compiler enough information to do that. Here goes:
static inline uint32 rotl32_i64_asm(uint32 value, unsigned shift)
{
    // i64 - shift must be in register cl so create a register local assigned to cl
    // no need to mask as i64 will do that
    register uint8 cl asm ( "cl" ) = shift;
    uint32 shifted;
    // emit the rotate left long
    // %n values are replaced by args:
    //   0: "=r" (shifted) - any register (r), result (=), store in var (shifted)
    //   1: "0" (value)    - *same* register as %0 (0), load from var (value)
    //   2: "r" (cl)       - any register (r), load from var (cl - which is the cl register so this one is used)
    __asm__ ("roll %2,%0" : "=r" (shifted) : "0" (value), "r" (cl));
    return shifted;
}
Change test_rotl32 to call rotl32_i64_asm and check the assembly output again - it should be the same, i.e. the compiler did as well as we did.
Further note that if the commented-out masking line in rotl32_i64 is included, it essentially becomes rotl32; the compiler will do the right thing for any architecture, all for the cost of a single and instruction in the i64 version.
So asm is there if you need it, using it can be somewhat involved, and the compiler will invariably do as well or better by itself...
HTH
The 32bit rotate in ARM would be:
__asm__("MOV %0, %1, ROR %2\n" : "=r"(out) : "r"(in), "M"(N));
where N is required to be a compile-time constant.
But the output of the barrel shifter, whether used on a register or an immediate operand, is always full register width; you can shift a constant 8-bit quantity to any position within a 32bit word, or (as here) shift/rotate the value in a 32bit register any which way.
But you cannot rotate 16bit or 8bit values within a register using a single ARM instruction. None such exists.
That's why the compiler, on ARM targets, when you use the "normal" (portable [Objective-]C/C++) code (in << xx) | (in >> (w - xx)) will create you one assembler instruction for a 32bit rotate, but at least two (a normal shift followed by a shifted or) for 8/16bit ones.