Which one is better, gcc or armcc for NEON optimizations? - embedded

Referring to auselen's answer here: Using ARM NEON intrinsics to add alpha and permute, it looks like the armcc compiler is far better than gcc for NEON optimizations. Is this really true? I haven't actually tried the armcc compiler, but I got pretty well-optimized code using gcc with the -O3 optimization flag. Now I'm wondering whether armcc is really that much better. So which of the two compilers is better, considering all the factors?

Compilers are software as well; they tend to improve over time. Any generic claim like "armcc is better than GCC at NEON" (or, put better, at vectorization) can't hold true forever, since either developer group can close the gap with enough attention. Initially, though, it is logical to expect compilers developed by hardware companies to be superior, because they need to demonstrate/market these features.
One recent example I saw was here on Stack Overflow, in an answer about branch prediction. Quoting from the last line of the updated section: "This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...".
I am a big fan of GCC, but I wouldn't bet on the quality of the code it produces against compilers from Intel or ARM. I expect any mainstream commercial compiler to produce code at least as good as GCC's.
One empirical answer to this question could be to use hilbert-space's neon optimization example and see how different compilers optimize it.
void neon_convert (uint8_t * __restrict dest, uint8_t * __restrict src, int n)
{
    int i;
    uint8x8_t rfac = vdup_n_u8 (77);
    uint8x8_t gfac = vdup_n_u8 (151);
    uint8x8_t bfac = vdup_n_u8 (28);
    n /= 8;

    for (i=0; i<n; i++)
    {
        uint16x8_t temp;
        uint8x8x3_t rgb = vld3_u8 (src);
        uint8x8_t result;

        temp = vmull_u8 (rgb.val[0], rfac);
        temp = vmlal_u8 (temp, rgb.val[1], gfac);
        temp = vmlal_u8 (temp, rgb.val[2], bfac);
        result = vshrn_n_u16 (temp, 8);

        vst1_u8 (dest, result);
        src += 8*3;
        dest += 8;
    }
}
This is armcc 5.01
20: f421140d vld3.8 {d1-d3}, [r1]!
24: e2822001 add r2, r2, #1
28: f3810c04 vmull.u8 q0, d1, d4
2c: f3820805 vmlal.u8 q0, d2, d5
30: f3830806 vmlal.u8 q0, d3, d6
34: f2880810 vshrn.i16 d0, q0, #8
38: f400070d vst1.8 {d0}, [r0]!
3c: e1520003 cmp r2, r3
40: bafffff6 blt 20 <neon_convert+0x20>
This is GCC 4.4.3-4.7.1
1e: f961 040d vld3.8 {d16-d18}, [r1]!
22: 3301 adds r3, #1
24: 4293 cmp r3, r2
26: ffc0 4ca3 vmull.u8 q10, d16, d19
2a: ffc1 48a6 vmlal.u8 q10, d17, d22
2e: ffc2 48a7 vmlal.u8 q10, d18, d23
32: efc8 4834 vshrn.i16 d20, q10, #8
36: f940 470d vst1.8 {d20}, [r0]!
3a: d1f0 bne.n 1e <neon_convert+0x1e>
These look extremely similar, so we have a draw. After seeing this I tried the "add alpha and permute" example mentioned in the question again.
void neonPermuteRGBtoBGRA(unsigned char* src, unsigned char* dst, int numPix)
{
    numPix /= 8; //process 8 pixels at a time
    uint8x8_t alpha = vdup_n_u8 (0xff);

    for (int i=0; i<numPix; i++)
    {
        uint8x8x3_t rgb = vld3_u8 (src);
        uint8x8x4_t bgra;

        bgra.val[0] = rgb.val[2]; //these lines are slow
        bgra.val[1] = rgb.val[1]; //these lines are slow
        bgra.val[2] = rgb.val[0]; //these lines are slow
        bgra.val[3] = alpha;

        vst4_u8(dst, bgra);
        src += 8*3;
        dst += 8*4;
    }
}
Compiling with gcc...
$ arm-linux-gnueabihf-gcc --version
arm-linux-gnueabihf-gcc (crosstool-NG linaro-1.13.1-2012.05-20120523 - Linaro GCC 2012.05) 4.7.1 20120514 (prerelease)
$ arm-linux-gnueabihf-gcc -std=c99 -O3 -c ~/temp/permute.c -marm -mfpu=neon-vfpv4 -mcpu=cortex-a9 -o ~/temp/permute_gcc.o
00000000 <neonPermuteRGBtoBGRA>:
0: e3520000 cmp r2, #0
4: e2823007 add r3, r2, #7
8: b1a02003 movlt r2, r3
c: e92d01f0 push {r4, r5, r6, r7, r8}
10: e1a021c2 asr r2, r2, #3
14: e24dd01c sub sp, sp, #28
18: e3520000 cmp r2, #0
1c: da000019 ble 88 <neonPermuteRGBtoBGRA+0x88>
20: e3a03000 mov r3, #0
24: f460040d vld3.8 {d16-d18}, [r0]!
28: eccd0b06 vstmia sp, {d16-d18}
2c: e59dc014 ldr ip, [sp, #20]
30: e2833001 add r3, r3, #1
34: e59d6010 ldr r6, [sp, #16]
38: e1530002 cmp r3, r2
3c: e59d8008 ldr r8, [sp, #8]
40: e1a0500c mov r5, ip
44: e59dc00c ldr ip, [sp, #12]
48: e1a04006 mov r4, r6
4c: f3c73e1f vmov.i8 d19, #255 ; 0xff
50: e1a06008 mov r6, r8
54: e59d8000 ldr r8, [sp]
58: e1a0700c mov r7, ip
5c: e59dc004 ldr ip, [sp, #4]
60: ec454b34 vmov d20, r4, r5
64: e1a04008 mov r4, r8
68: f26401b4 vorr d16, d20, d20
6c: e1a0500c mov r5, ip
70: ec476b35 vmov d21, r6, r7
74: f26511b5 vorr d17, d21, d21
78: ec454b34 vmov d20, r4, r5
7c: f26421b4 vorr d18, d20, d20
80: f441000d vst4.8 {d16-d19}, [r1]!
84: 1affffe6 bne 24 <neonPermuteRGBtoBGRA+0x24>
88: e28dd01c add sp, sp, #28
8c: e8bd01f0 pop {r4, r5, r6, r7, r8}
90: e12fff1e bx lr
Compiling with armcc...
$ armcc
ARM C/C++ Compiler, 5.01 [Build 113]
$ armcc --C99 --cpu=Cortex-A9 -O3 -c permute.c -o permute_arm.o
00000000 <neonPermuteRGBtoBGRA>:
0: e1a03fc2 asr r3, r2, #31
4: f3870e1f vmov.i8 d0, #255 ; 0xff
8: e0822ea3 add r2, r2, r3, lsr #29
c: e1a031c2 asr r3, r2, #3
10: e3a02000 mov r2, #0
14: ea000006 b 34 <neonPermuteRGBtoBGRA+0x34>
18: f420440d vld3.8 {d4-d6}, [r0]!
1c: e2822001 add r2, r2, #1
20: eeb01b45 vmov.f64 d1, d5
24: eeb02b46 vmov.f64 d2, d6
28: eeb05b40 vmov.f64 d5, d0
2c: eeb03b41 vmov.f64 d3, d1
30: f401200d vst4.8 {d2-d5}, [r1]!
34: e1520003 cmp r2, r3
38: bafffff6 blt 18 <neonPermuteRGBtoBGRA+0x18>
3c: e12fff1e bx lr
In this case armcc produces much better code. I think this justifies fgp's answer below. Most of the time GCC will produce good enough code, but you should keep an eye on the critical parts, and most importantly: first measure / profile.

If you use NEON intrinsics, the compiler shouldn't matter that much. Most (if not all) NEON intrinsics translate to a single NEON instruction, so the only thing left to the compiler is register allocation and instruction scheduling. In my experience, both GCC 4.2 and Clang 3.1 do reasonably well at those tasks.
Note, however, that the NEON instructions are a bit more expressive than the NEON intrinsics. For example, NEON load/store instructions have post-increment addressing modes which combine a load or store with an increment of the address register, thus saving you one instruction. The NEON intrinsics don't provide an explicit way to do that, but instead rely on the compiler to combine a regular NEON load/store intrinsic and an address increment into a load/store instruction with post-increment. Similarly, some load/store instructions allow you to specify the alignment of the memory address, and execute faster if you specify stricter alignment guarantees. The NEON intrinsics, again, don't allow you to specify alignment explicitly, but instead rely on the compiler to deduce the correct alignment specifier. In theory, you can use "align" attributes on your pointers to provide suitable hints to the compiler, but at least Clang seems to ignore those...
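As an illustration of the alignment-hint point, here is a minimal sketch (mine, not from this answer) using GCC/Clang's __builtin_assume_aligned. Whether this actually results in an alignment-qualified vld1 is compiler-dependent, and as noted, Clang in particular may ignore the hint:
#include <arm_neon.h>

/* Hypothetical helper for illustration: promise the compiler that p is
   8-byte aligned, then do a plain NEON load. The compiler *may* use the
   hint to emit something like vld1.8 {d0}, [r0:64], but it is free not to. */
static inline uint8x8_t load8_hinted(const uint8_t *p)
{
    const uint8_t *ap = (const uint8_t *)__builtin_assume_aligned(p, 8);
    return vld1_u8(ap);
}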
In my experience, neither Clang nor GCC is very bright when it comes to those kinds of optimizations. Fortunately, the additional performance benefit of these kinds of optimizations usually isn't all that high - it's more like 10% than 100%.
Another area where those two compilers aren't particularly smart is avoidance of stack spilling. If your code uses more vector-valued variables than there are NEON registers, I've seen both compilers produce horrible code. Basically, what they seem to do is schedule instructions based on the assumption that there are enough registers available. Register allocation seems to come afterwards, and seems to simply spill values to the stack once it runs out of registers. So make sure your code has a working set of fewer than 16 128-bit vectors or 32 64-bit vectors at any time!
Overall, I've gotten pretty good results from both GCC and Clang, but I regularly had to reorganize the code a bit to avoid compiler idiosyncrasies. My advice would be to stick with GCC or Clang, but check the output regularly with the disassembler of your choice.
So, overall, I'd say sticking with GCC is fine. You might want to look at the disassembly of the performance-critical parts, though, and check if it looks reasonable.

Count integers in [1..N] with K zero bits below the leading 1? (popcount for a contiguous range without HW POPCNT)

I have the following task:
Count how many numbers between 1 and N will have exactly K zero non-leading bits. (e.g. 7₁₀ = 111₂ will have 0 of them, 4 will have 2)
N and K satisfy condition 0 ≤ K, N ≤ 1000000000
This version uses POPCNT and is fast enough on my machine:
%include "io.inc"
section .bss
n resd 1
k resd 1
ans resd 1
section .text
global CMAIN
CMAIN:
GET_DEC 4,n
GET_DEC 4,k
mov ecx,1
mov edx,0
;ecx is counter from 1 to n
loop_:
mov eax, ecx
popcnt eax,eax;in eax now amount of bits set
mov edx, 32
sub edx, eax;in edx now 32-bits set=bits not set
mov eax, ecx;count leading bits
bsr eax, eax;
xor eax, 0x1f;
sub edx, eax
mov eax, edx
; all this lines something like (gcc):
; eax=32-__builtin_clz(x)-_mm_popcnt_u32(x);
cmp eax,[k];is there k non-leading bits in ecx?
jnz notk
;if so, then increment ans
mov edx,[ans]
add edx,1
mov [ans],edx
notk:
;increment counter, compare to n and loop
inc ecx
cmp ecx,dword[n]
jna loop_
;print ans
PRINT_DEC 4,ans
xor eax, eax
ret
It should be okay in terms of speed (~0.8 sec), but it wasn't accepted because (I guess) the CPU used on the testing server is too old, so it reports that a runtime error happened.
I tried using a precounting trick with a 64K * 4-byte lookup table, but it wasn't fast enough:
%include "io.inc"
section .bss
n resd 1
k resd 1
ans resd 1
wordbits resd 65536; bits set in numbers from 0 to 65536
section .text
global CMAIN
CMAIN:
mov ebp, esp; for correct debugging
mov ecx,0
;mov eax, ecx
;fill in wordbits, ecx is wordbits array index
precount_:
mov eax,ecx
xor ebx,ebx
;c is ebx, v is eax
;for (c = 0; v; c++){
; v &= v - 1; // clear the least significant bit set
;}
lloop_:
mov edx,eax
dec edx
and eax,edx
inc ebx
test eax,eax
jnz lloop_
;computed bits set
mov dword[wordbits+4*ecx],ebx
inc ecx
cmp ecx,65536
jna precount_
;0'th element should be 0
mov dword[wordbits],0
GET_DEC 4,edi;n
GET_DEC 4,esi;k
mov ecx,1
xor edx,edx
xor ebp,ebp
loop_:
mov eax, ecx
;popcnt eax,eax
mov edx,ecx
and eax,0xFFFF
shr edx,16
mov eax,dword[wordbits+4*eax]
add eax,dword[wordbits+4*edx]
;previous lines are to implement absent instruction popcnt.
; they simply do eax=wordbits[x & 0xFFFF] + wordbits[x >> 16]
mov edx, 32
sub edx, eax
;and the same as before:
;non-leading zero bits=32-bits set-__builtin_clz(x)
mov eax, ecx
bsr eax, eax
xor eax, 0x1f
sub edx, eax
mov eax, edx
;compare to k again to see if this number has exactly k
;non-leading zero bits
cmp edx, esi
jnz notk
;increment ebp (answer) if so
mov edx, ebp
add edx, 1
mov ebp, edx
;and (or) go to the next iteration
notk:
inc ecx
cmp ecx, edi
jna loop_
;print answer what is in ebp
PRINT_DEC 4, ebp
xor eax, eax
ret
(>1 sec)
Should I speed up the second program (if so, how?), or somehow replace POPCNT with some other (which?) instructions (I guess SSE2 and older should be available)?
First of all, a server too old to have popcnt will be significantly slower in other ways and have different bottlenecks. Given that it has pshufb but not popcnt, it's a Core 2 first or second-gen (Conroe or Penryn). See Agner Fog's microarch PDF (on https://agner.org/optimize/). Also lower clock speeds, so the best you can do on that CPU might not be enough to let brute-force work.
There are probably algorithmic improvements that could save huge amounts of time, like noting that every 4 increments cycle the low 2 bits through a 00, 01, 10, 11 pattern: 2 zeros happens once per four increments, 1 zero happens twice, no zeros happens once. For every number >= 4, these 2 bits are below the leading bit and thus part of the count. Generalizing this into a combinatorics formula for each MSB-position between 1 and log2(N) might be a way to do vastly less work. Handling the numbers between 2^M and N is less obvious.
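To make the combinatorial idea concrete, here is a hedged sketch (my own illustration, not code from this answer) of counting, for every possible MSB position below floor(log2(N)), how many numbers have exactly k zero bits under the leading 1. The last answer in this thread develops the full approach, including the trickier [2^floor(log2(N)), N] part:
#include <stdint.h>

/* C(n, k) by repeated multiply/divide; n <= 31 here, so uint64_t never overflows. */
static uint64_t binom(unsigned n, unsigned k)
{
    if (k > n) return 0;
    uint64_t r = 1;
    for (unsigned j = 1; j <= k; ++j)
        r = r * (n - k + j) / j;        /* stays an integer at every step */
    return r;
}

/* Count of x in [1, 2^top) with exactly k zero bits below the leading 1:
   a number whose leading 1 is at bit position m has m lower bit positions,
   and C(m, k) of those numbers have exactly k zeros among them. */
static uint64_t count_below_pow2(unsigned top, unsigned k)
{
    uint64_t total = 0;
    for (unsigned m = 0; m < top; ++m)
        total += binom(m, k);
    return total;
}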
Versions here:
cleaned up popcnt version, 536 ms on i7-6700k @ 3.9GHz, no algorithmic optimization across iterations. For k=8, N=1000000000.
Naive LUT version (2 loads per iteration, no inter-iteration optimization): ~595 ms on a good run, more often ~610 ms for k=8, N=1000000000. Core2Duo (Conroe) @ 2.4GHz: 1.69 s. (A couple of worse versions of that are in the edit history, the first having partial-register stalls on Core 2.)
(unfinished, cleanup code not written) Optimized LUT version (unrolled, and high-half/MSB BSR work hoisted, leaving only 1 lookup (cmp/jcc) per iteration), 210 ms on Skylake, 0.58 s on Core 2 @ 2.4GHz. The time should be realistic; we're doing all the work, just missing the last 2^16 iterations where the MSB is in the low 16 bits. Handling any necessary corner cases in the outer loop, and cleanup, shouldn't affect speed by more than 1%.
(even more unfinished): vectorize the optimized LUT version with pcmpeqb / psubb (with psadbw in an outer loop, like How to count character occurrences using SIMD shows - the inner loop reduces to counting byte elements in a fixed-size array that match a value calculated in the outer loop, just like the scalar version). 18 ms on Skylake, ~0.036 s on Core 2. Those times probably now include a considerable amount of startup overhead, but as expected/hoped, about 16x faster on both.
Histogram the wordbits table once (perhaps as you generate it). Instead of searching 64kiB to find matching bytes, just look up the answer for every outer-loop iteration! That should let you go thousands of times faster for large N. (Although you still need to handle the low 1..64K and the partial range when N isn't a multiple of 64K.)
To usefully measure the faster versions, you could slap a repeat loop around the whole thing so the whole process still takes some measurable time, like half a second. (Since it's asm, no compiler will optimize away the work from doing the same N,k repeatedly.) Or you could do the timing inside the program, with rdtsc if you know the TSC frequency. But being able to use perf stat on the whole process easily is nice, so I'd keep doing that (take out the printf and make a static executable to further minimize startup overhead).
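For the in-program timing option, here is a minimal sketch in C, assuming GCC/Clang's __rdtsc from x86intrin.h (the question's code is asm, but the idea carries over):
#include <stdint.h>
#include <x86intrin.h>

/* Time an arbitrary function in TSC reference cycles. Converting the result
   to seconds requires knowing the (constant) TSC frequency for your CPU. */
static uint64_t time_region(void (*fn)(void))
{
    uint64_t t0 = __rdtsc();
    fn();
    return __rdtsc() - t0;
}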
You seem to be asking about micro-optimizing the brute-force approach that still checks every number separately. (There are significant optimizations possible in how you implement the 32 - clz - popcnt == k check, though.)
There are other ways to do popcnt that are generally faster, e.g. bithacks like in How to count the number of set bits in a 32-bit integer?. But when you have a lot of popcounting to do in a tight loop (enough to keep a lookup table hot in cache), the LUT can be good.
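For reference, the classic SWAR bithack for a 32-bit popcount looks like this (the standard, well-known version, shown here only for illustration):
#include <stdint.h>

static inline unsigned popcount32_bithack(uint32_t v)
{
    v = v - ((v >> 1) & 0x55555555u);                   /* 2-bit sums   */
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);   /* 4-bit sums   */
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;                   /* 8-bit sums   */
    return (v * 0x01010101u) >> 24;                     /* sum of bytes */
}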
If you have fast SSSE3 pshufb, it could be worth using it to do a SIMD popcount for four dwords in parallel in an XMM register (auto-vectorizing the loop), or even better in a YMM register with AVX2. (First-gen Core2 has pshufb but it's not single uop until 2nd-gen Core2. Still possibly worth it.)
Or much better, using SIMD to count LUT elements that match what we're looking for, for a given high-half of a number.
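Here is a hedged sketch of the pshufb nibble-LUT popcount idea from two paragraphs up (the generic per-byte technique, not the match-counting variant that the SSE2 section below implements), written with SSE intrinsics:
#include <tmmintrin.h>   /* SSSE3 */

/* Per-byte popcount of 16 bytes: look up each nibble in a 16-entry table. */
static inline __m128i popcount_epi8(__m128i v)
{
    const __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m128i mask = _mm_set1_epi8(0x0f);
    __m128i lo = _mm_and_si128(v, mask);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                        _mm_shuffle_epi8(lut, hi));
}
A horizontal sum of those byte counts (e.g. _mm_sad_epu8 against zero) then gives totals per 64-bit half.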
The brute force checking contiguous ranges of numbers opens up a major optimization for the LUT strategy: the upper n bits of the number only change once per 2^n increments. So you can hoist the count of those bits out of an inner-most loop. This also can make it worth using a smaller table (that fits in L1d cache).
Speaking of which, your 64k * 4 table is 256KiB, the size of your L2 cache. This means it's probably having to come in from L3 every time you loop through it. Your desktop CPU should have enough L3 bandwidth for that (and the access pattern is contiguous thanks to the increments), and modern servers have larger L2, but there's very little reason not to use a byte LUT (popcnt(-1) is only 32). Modern Intel CPUs (since about Haswell) don't rename AL separately from the rest of EAX/RAX, and a movzx byte load is just as cheap as a mov dword load.
; General LUT lookup with two 16-bit halves
movzx edx, cx ; low 16 bits
mov eax, ecx
shr eax, 16 ; high 16 bits
movzx edx, byte [wordbits + edx]
add dl, [wordbits + eax]
; no partial-reg stall for reading EDX after this, on Intel Sandybridge and later
; on Core 2, set up so you can cmp al,dl later to avoid it
On an Intel CPU so old that it doesn't support popcnt, that will cause a partial-register stall. Do the next compare with cmp al, dl instead. (Use lea or add or sub on the bsr result, instead of the popcount LUT load, so you can avoid a partial-register stall.)
Normally you'd want to use a smaller LUT, like maybe 11 bits per step, so 3 steps handle a whole 32-bit number (2^11 = 2048 bytes, a small fraction of the 32k L1d). But with this sequential access pattern, hardware prefetch can handle it and fully hide the latency, especially when the L1d prefetches hit in L2. Again, this is fine because this loop touches no memory other than this lookup table. Lookup tables are a lot worse in the normal case where significant amounts of other work happen between each popcount, or when you have any other valuable data in cache you'd rather not evict.
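A hedged sketch of that smaller-LUT layout (an 11-bit table, 2048 bytes, three lookups per 32-bit number; the names bits11 and init_bits11 are made up for illustration):
#include <stdint.h>

static unsigned char bits11[1 << 11];

static void init_bits11(void)
{
    for (unsigned i = 0; i < (1u << 11); ++i) {
        unsigned v = i, c = 0;
        while (v) { v &= v - 1; ++c; }   /* same counting loop the question uses */
        bits11[i] = (unsigned char)c;
    }
}

static inline unsigned popcount32_lut11(uint32_t v)
{
    /* 11 + 11 + 10 bits covers all 32 bits */
    return bits11[v & 0x7ff] + bits11[(v >> 11) & 0x7ff] + bits11[v >> 22];
}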
Optimized for Skylake (i7-6700k): even with 2 LUT accesses per iteration: 0.600 seconds at 3.9GHz
vs. 0.536 seconds with popcnt. Hoisting the high-half LUT lookup (and maybe the 32 constant) might even let that version be faster.
Note: a CPU so old that it doesn't have popcnt will be significantly different from Skylake. Optimizing this for Skylake is a bit silly, unless you take this further and wind up beating the popcnt version on Skylake, which is possible if we can hoist the BSR work by having nested loops, with an inner loop that uses the same BSR result for the whole range of numbers from 2^m .. 2^(m+1)-1 (clamped to a 64k range so you can also hoist the high half popcnt LUT lookup). popcnt_low(i) == some constant calculated from k, popcnt_high(i), and clz(i).
3 major things were quite important for Skylake (some of them relevant for older CPUs, including avoiding taken branches for front-end reasons):
Avoid having a cmp/jcc touching a 32-byte boundary on Intel Skylake-derived CPUs with up-to-date microcode, because Intel mitigated the JCC erratum by disabling the uop cache for such lines: 32-byte aligned routine does not fit the uops cache
This means looking at the disassembly and deciding whether to make instructions longer (e.g. with lea eax, [dword -1 + edx] to force a 4-byte displacement instead of the smaller disp8) and whether to use align 32 at the top of the loop.
No-increment is much more common than increment, and Intel CPUs can only run taken branches at 1/clock. (But CPUs since Haswell have a 2nd execution unit on another port that can run predicted-not-taken branches.) So change jne notk to je yesk, targeting a block below the function that jumps back. Tail-duplicating the dec ecx / jnz .loop / else fall through to a jmp print_and_exit helped a tiny amount vs. just jumping back to after the je yesk.
It's taken so rarely (and has a consistent enough pattern) that it doesn't mispredict often, so setnz al / add ebx, eax would probably be worse.
Optimize the 32 - clz - popcnt == k check, taking advantage of the fact that bsr gives you 31-clz. So 31-clz - (popcnt-1) = 32-clz-popcnt.
Since we're comparing that for == k, that can be further rearranged to popcnt-1 + k == 31-clz.
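In C terms, the rearrangement is just this identity (a sanity-check sketch of my own, not part of the asm; valid for non-zero x, where BSR returns 31-clz):
#include <assert.h>

static void check_rearrangement(unsigned x, unsigned k)
{
    unsigned bsr = 31 - (unsigned)__builtin_clz(x);   /* what BSR returns */
    int original   = (32 - __builtin_clz(x) - __builtin_popcount(x) == (int)k);
    int rearranged = (__builtin_popcount(x) - 1 + (int)k == (int)bsr);
    assert(original == rearranged);
}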
When we're using a LUT for popcount, instead of a popcnt instruction that has to run on port 1, we can afford to use a 3-component (slow) LEA like lea edx, [edx + esi - 1] to do the popcnt-1+k. Since it has 3 components (2 registers and a displacement, 2 + signs in the addressing mode), it can only run on port 1 (with 3 cycle latency), competing with bsr (and popcnt if we were using it).
Taking advantage of lea saved instructions in general, even in the popcnt version. So did counting the loop down towards 0, with a macro-fused 1-uop dec/jnz instead of inc + cmp/jne. (I haven't tried counting up to see if L1d HW prefetch works better in that direction; the popcnt version won't care but the LUT version might.)
Ported to work without io.inc, just using hard-coded N and k with printf for output. This is not "clean" code, e.g. nasty hacks like %define wordbits edi that I put in to test changing alignment of branches by using indexed addressing modes instead of [reg + disp32] for every access to the array. That happened to do the trick, getting almost all of the uops to come from DSB (the uop cache) instead of MITE, i.e. avoided the JCC erratum slowdown. The other way to do it would be making instructions longer, to push the cmp/je and dec/jnz past a 32-byte boundary. (Or to change the alignment of the start of the loop.) Uop-cache fetch happens in lines of up-to-6 uops and can be a bottleneck if you end up with a line with only a couple uops. (Skylake's loop-buffer aka LSD is also disabled by microcode to fix an earlier erratum; Intel had more big bugs with Skylake than most designs.)
%use SMARTALIGN
alignmode p6, 64
section .bss
wordbits: resb 65536
; n resd 1
; k resd 1
ans resd 1
section .rodata
n: dd 1000000000
k: dd 8
print_fmt: db `ans: %d\n`, 0
section .text
global main
main: ; no popcnt version
push ebp
push edi ; save some call-preserved registers
push esi
push ebx
mov edi, wordbits
%define wordbits edi ; dirty hack, use indexed addressing modes instead of reg+disp32.
; Avoids Skylake JCC erratum problems, and is slightly better on Core2 with good instruction scheduling
;fill in wordbits, ecx is wordbits array index
mov ecx, 1 ; leave wordbits[0] = 0
.init_loop:
mov eax,ecx
xor ebx,ebx
.popc_loop:
lea edx, [eax-1]
inc ebx
and eax,edx ; v &= v - 1; // blsr
jnz .popc_loop
;computed bits set
mov [wordbits + ecx], bl
inc ecx
cmp ecx,65536
jb .init_loop ; bugfix: array out of bounds with jna: stores to wordbits[65536]
; GET_DEC 4,n
; GET_DEC 4,k
mov ecx, [n] ; ecx counts from n down to 1
; mov esi, [k]
xor ebx, ebx ; ebx = ans
mov esi, 1
sub esi, [k] ; 1-k
align 32
.loop:
;popcnt eax, ecx
movzx eax, cx
mov ebp, ecx ; using an extra register (EBP) to schedule instructions better(?) for Core2 decode
movzx edx, byte [wordbits + eax]
shr ebp, 16
; xor eax, eax ; break false dependency, or just let OoO exec hide it after breaking once per iter
bsr eax, ecx ; eax = 31-lzcnt for non-zero ecx
; sub edx, esi ; sub now avoids partial-reg stuff. Could have just used EBX to allow BL.
add eax, esi ; Add to BSR result seems slightly better on Core2 than sub from popcnt
add dl, [wordbits + ebp] ; we don't read EDX, no partial-register stall even on P6-family
;; want: k == 32-__builtin_clz(x)-_mm_popcnt_u32(x)
cmp al, dl ; 31-clz+(1-k) == popcount. or 31-clz == popcnt - (1-k)
je .yesk ; not-taken is the more common fast path
.done_inc:
dec ecx
jnz .loop ; }while(--n >= 0U)
.print_and_exit:
;print ans
; PRINT_DEC 4,ans
push ebx
push print_fmt
extern printf
call printf
add esp, 8
pop ebx
pop esi
pop edi
pop ebp
xor eax, eax
ret
align 8
.yesk:
inc ebx
; jmp .done_inc ; tail duplication is a *tiny* bit faster
dec ecx
jnz .loop
jmp .print_and_exit
This is version 3, updated to avoid partial-register penalties on Core 2 (Conroe). It runs in 1.69s (previously 1.78s) vs. 3.18s. It's still sometimes as fast on Skylake, but more often 610 ms instead of 594 ms. I don't have perf-counter access on my Core 2; it's too old for perf to fully support, and I don't have a perf build for the kernel it last booted.
(disassembly and perf results for version 1 on Godbolt: https://godbolt.org/z/ox7e8G)
On my Linux desktop, i7-6700k at 3.9GHz. (EPP = balance_performance, not full performance, so it doesn't want to turbo to 4.2GHz apparently.) I don't need sudo to use perf because I set /proc/sys/kernel/perf_event_paranoid = 0. I use taskset -c 3 just to avoid CPU migrations for single-threaded workloads.
# Results from version 1, not the Core2-friendly version.
# Version 3 sometimes runs this fast, but more often ~610ms
# Event counts are near identical for both, except cycles, but uops_issued and uops_executed are mysteriously lower, like 9,090,858,203 executed.
$ nasm -felf32 foo.asm -l/dev/stdout &&
gcc -m32 -no-pie -fno-pie -fno-plt foo.o
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,branch-misses,instructions,uops_issued.any,uops_executed.thread -r 2 ./a.out
ans: 12509316
ans: 12509316
Performance counter stats for './a.out' (2 runs):
597.78 msec task-clock # 0.999 CPUs utilized ( +- 0.12% )
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
62 page-faults # 0.103 K/sec ( +- 0.81% )
2,328,038,096 cycles # 3.894 GHz ( +- 0.12% )
2,000,637,322 branches # 3346.789 M/sec ( +- 0.00% )
1,719,724 branch-misses # 0.09% of all branches ( +- 0.02% )
11,015,217,584 instructions # 4.73 insn per cycle ( +- 0.00% )
9,148,164,159 uops_issued.any # 15303.609 M/sec ( +- 0.00% )
9,102,818,982 uops_executed.thread # 15227.753 M/sec ( +- 0.00% )
(from a separate run):
9,204,430,548 idq.dsb_uops # 15513.249 M/sec ( +- 0.00% )
1,008,922 idq.mite_uops # 1.700 M/sec ( +- 20.51% )
0.598156 +- 0.000760 seconds time elapsed ( +- 0.13% )
This is about 3.93 fused-domain (front-end) uops/clock. So we're pretty close to the 4/clock front-end width.
With popcnt:
Your original (with GET_DEC replaced by loading a constant) ran in 1.3 sec on my desktop, for k=8 N=1000000000. This version runs in about 0.54 sec. My version of your original wasn't even the worst possible case for alignment of branches (another version was about 1.6 sec), although since I did have to change things it could be different from your machine.
I used mostly the same optimizations as above to save uops and help out the front-end inside the loop. (But I did this first, so it's missing some optimizations.)
align 32
.loop:
mov eax, ecx
popcnt eax,eax
lea edx, [dword eax - 32 + 31] ; popcnt - 32 = -(bits not set)
; dword displacement pads the cmp/jnz location to avoid the JCC erratum penalty on Intel
; xor eax, eax ; break false dependency, or just let OoO exec hide it after breaking once per iter
bsr eax, ecx ; eax = 31-lzcnt
; xor eax, 0x1f ; eax = lzcnt (for non-zero x)
; want: 32-__builtin_clz(x)-_mm_popcnt_u32(x) = (31-clz) + 1-popcnt = (31-clz) - (popcnt-1)
sub eax, edx
cmp eax, esi ;is there k non-leading bits in ecx?
%if 0
jnz .notk
inc ebx ;if so, then increment ans
.notk:
%else
jz .yesk ; not-taken is the more common fast path
.done_inc:
%endif
dec ecx
jnz .loop ; }while(--n >= 0U)
;print ans
; PRINT_DEC 4,ans
push ebx
push print_fmt
extern printf
call printf
add esp, 8
pop ebx
pop esi
xor eax, eax
ret
.yesk:
inc ebx
jmp .done_inc ;; TODO: tail duplication
(Unfinished) Inner loop with invariant clz(x) and high-half popcount
This version runs in only 0.58 sec on my 2.4GHz Core 2 Duo E6600 (Conroe), same microarchitecture as your Xeon 3050 2.13GHz.
And in 210ms on my Skylake.
It does most of the work, only missing cleanup for N < 65536 (or the low 65536 of a larger N, where the MSB is in the low half), and maybe missing the handling of a couple of other corner cases in the outer loop. But the inner loop totally dominates the run-time, and it doesn't have to run any more than it would in a finished version, so these times should be realistic.
It still brute-force checks every single number, but most of the per-number work that depends on the high half is loop-invariant and hoisted out. This assumes non-zero high halves, but only 2^16 numbers have their MSB in the low 16 bits. And narrowing to only the low 12 or 14 bits means less cleanup, as well as a smaller part of the LUT to loop over, which can stay hot in L1d.
%use SMARTALIGN
alignmode p6, 64
section .bss
align 4096
wordbits: resb 65536
; n resd 1
; k resd 1
; ans resd 1
section .rodata
;n: dd 0x40000000 ; low half zero, maybe useful to test correctness for a version that doesn't handle that.
n: dd 1000000000 ; = 0x3b9aca00
k: dd 8
print_fmt: db `ans: %d\n`, 0
section .text
global main
align 16
main:
main_1lookup:
push ebp
push edi ; save some call-preserved registers
push esi
push ebx
mov edi, wordbits
;%define wordbits edi ; dirty hack, use indexed addressing modes instead of reg+disp32.
; actually slightly worse on Skylake: causes un-lamination of cmp bl, [reg+reg],
; although the front-end isn't much of a bottleneck anymore
; also seems pretty much neutral to use disp32+reg on Core 2, maybe reg-read stalls or just not a front-end bottleneck
;fill in wordbits, ecx is wordbits array index
mov ecx, 1 ; leave wordbits[0] = 0
.init_loop:
mov eax,ecx
xor ebx,ebx
.popc_loop:
lea edx, [eax-1]
inc ebx
and eax,edx ; v &= v - 1; // blsr
jnz .popc_loop
;computed bits set
mov [wordbits + ecx], bl
inc ecx
cmp ecx,65536
jb .init_loop
; GET_DEC 4,n
; GET_DEC 4,k
mov ecx, [n] ; ecx counts from n down to 1
; mov esi, [k]
xor esi, esi ; ans
mov ebp, 1
sub ebp, [k] ; 1-k
align 32
.outer:
mov eax, ecx ; using an extra register (EBP) to schedule instructions better(?) for Core2 decode
shr eax, 16
; xor eax, eax ; break false dependency, or just let OoO exec hide it after breaking once per iter
bsr ebx, ecx ; ebx = 31-lzcnt for non-zero ecx
;; want: k == 32-__builtin_clz(x)-_mm_popcnt_u32(x)
; 31-clz+(1-k) == popcount. or 31-clz == popcnt - (1-k)
; 31-clz+(1-k) - popcount(hi(x)) == popcount(lo(x))
add ebx, ebp
sub bl, byte [wordbits + eax]
;movzx edx, cx
lea edx, [ecx - 4] ; TODO: handle cx < 4 making this wrap
movzx edx, dx
and ecx, -65536 ; clear low 16 bits, which we're processing with the inner loop.
align 16
.low16:
cmp bl, [wordbits + edx + 0]
je .yesk0
.done_inc0:
cmp bl, [wordbits + edx + 1]
je .yesk1
.done_inc1:
cmp bl, [wordbits + edx + 2]
je .yesk2
.done_inc2:
cmp bl, [wordbits + edx + 3]
je .yesk3
.done_inc3:
; TODO: vectorize with pcmpeqb / psubb / psadbw!!
; perhaps over fewer low bits to only use 16kiB of L1d cache
sub edx, 4
jae .low16 ; }while(lowhalf-=4 doesn't wrap)
sub ecx, 65536
ja .outer
; TODO: handle ECX < 65536 initially or after handling leading bits. Probably with BSR in the inner loop
.print_and_exit:
;print ans
; PRINT_DEC 4,ans
push esi
push print_fmt
extern printf
call printf
add esp, 8
pop ebx
pop esi
pop edi
pop ebp
xor eax, eax
ret
align 16
%assign i 0
%rep 4
;align 4
.yesk%+i:
inc esi
jmp .done_inc%+i
%assign i i+1
%endrep
; could use a similar %rep block for the inner loop
; attempt tail duplication?
; TODO: skip the next cmp/jcc when jumping back.
; Two in a row will never both be equal
; dec ecx
; jnz .loop
; jmp .print_and_exit
Skylake perf results:
(update after outer-loop over-count on first iter bugfix, ans: 12497876)
ans: 12498239 # This is too low by a bit vs. 12509316
# looks reasonable given skipping cleanup
209.46 msec task-clock # 0.992 CPUs utilized
0 context-switches # 0.000 K/sec
0 cpu-migrations # 0.000 K/sec
62 page-faults # 0.296 K/sec
813,311,333 cycles # 3.883 GHz
1,263,086,089 branches # 6030.123 M/sec
824,103 branch-misses # 0.07% of all branches
2,527,743,287 instructions # 3.11 insn per cycle
1,300,567,770 uops_issued.any # 6209.065 M/sec
2,299,321,355 uops_executed.thread # 10977.234 M/sec
(from another run)
37,150,918 idq.dsb_uops # 174.330 M/sec
1,266,487,977 idq.mite_uops # 5942.976 M/sec
0.211235157 seconds time elapsed
0.209838000 seconds user
0.000000000 seconds sys
Note that uops_issued.any is about the same as idq.DSB_uops + idq.MITE_uops - if we'd used EDI as a pointer to save code-size, uops_issued.any would be much higher because of unlamination of the indexed addressing modes from micro + macro-fused cmp+jcc.
Also interesting that branch misses is even lower; perhaps the unrolling helped distribute the history better in the IT-TAGE predictor table.
SSE2 SIMD
Also unfinished, not handling corner cases or cleanup, but I think doing approximately the right amount of work.
Unlike in How to count character occurrences using SIMD, the array we're matching against has known limits on how often matches can occur, so it happens to be (mostly?) safe to not do nested loops, just a 2^14 (16384) iteration loop unrolled by 2 before widening the byte counters out to dword. At least for k=8.
This gets a total count of 12507677, just slightly lower than 12509316 (correct for N=1000000000, k=8), but I haven't checked if that's all due to not doing 1..16384, or if I'm losing any counts anywhere.
You could unroll over outer-loop iterations to make use of each XMM vector twice or 4 times for each load. (With sequential access to an array in L1d cache, that could possibly let us go slightly faster by doing more ALU work per load, but not much faster.) By setting up 2 or 4 vectors to match against for 2 or 4 different high halves, you can spend longer in the inner loop. Possibly we could go a bit faster than 1 compare/accumulate per clock. But that might run into (cold) register-read bottlenecks on Core 2.
The version below just does a standard unroll.
;;;;; Just the loop from main_SSE2, same init stuff and print as main_1lookup
align 32
.outer:
mov eax, ecx ; using an extra register (EBP) to schedule instructions better(?) for Core2 decode
shr eax, 16-2
; xor eax, eax ; break false dependency, or just let OoO exec hide it after breaking once per iter
bsr ebx, ecx ; ebx = 31-lzcnt for non-zero ecx
;; want: k == 32-__builtin_clz(x)-_mm_popcnt_u32(x)
; 31-clz+(1-k) == popcount. or 31-clz == popcnt - (1-k)
; 31-clz+(1-k) - popcount(hi(x)) == popcount(lo(x))
add ebx, ebp
movzx edx, al
; movzx edx, byte [wordbits + edx]
sub bl, byte [wordbits + edx]
shr eax, 8 ; high part is more than 16 bits if low is 14, needs to be broken up
sub bl, byte [wordbits + eax]
; movzx eax, byte [wordbits + eax]
; add eax, edx
; sub ebx, eax
movzx eax, bl
movd xmm7, eax
pxor xmm0, xmm0
pxor xmm1, xmm1 ; 2 accumulators
pshufb xmm7, xmm0 ; broadcast byte to search for.
;; Actually SSSE3, but it only takes a few more insns to broadcast a byte with just SSE2.
;; e.g. imul eax, 0x01010101 / movd / pshufd
;movzx edx, cx
; lea edx, [ecx - 4] ; TODO: handle cx < 4 making this wrap
; movzx edx, dx
and ecx, -16384 ; clear low bits, which we're processing with the inner loop.
mov edx, wordbits ; quick and dirty, just loop forward over the array
;; FIXME: handle non-zero CX on first outer loop iteration, maybe loop backwards so we can go downwards toward 0,
;; or calculate an end-pointer if we can use that without register-read stalls on Core 2.
;; Also need to handle the leftover part not being a multiple of 32 in size
;; So maybe just make a more-flexible copy of this loop and peel the first outer iteration (containing that inner loop)
;; if the cleanup for that slows down the common case of doing exactly 16K
align 16
.low14:
movdqa xmm2, [edx]
movdqa xmm3, [edx + 16]
ds pcmpeqb xmm2, xmm7 ; extra prefixes for padding for Skylake JCC erratum: 18ms vs. 25ms
ds psubb xmm0, xmm2
ds add edx, 32
cs pcmpeqb xmm3, xmm7
cs psubb xmm1, xmm3
; hits are rare enough to not wrap counters?
; TODO: may need an inner loop to accumulate after 256 steps if every other 32nd element is a match overflowing some SIMD element
cmp edx, wordbits + 16384
jb .low14
pxor xmm7, xmm7
psadbw xmm0, xmm7
psadbw xmm1, xmm7 ; byte -> qword horizontal sum
paddd xmm0, xmm1 ; reduce to 1 vector
movhlps xmm1, xmm0
paddd xmm0, xmm1 ; hsum the low/high counts
movd eax, xmm0
add esi, eax ; sum in scalar (could sink this out)
sub ecx, 16384
ja .outer
; TODO: handle ECX < 65536 initially or after handling leading bits. Probably with BSR in the inner loop
Probably can just PADDD into a vector accumulator and only hsum to scalar outside the loop, but we might want more free vector regs?
Here's an attempt at algorithmic optimization.
I. Number of desired integers within the range [0; 2 ** floor(log2(N)))
All of these integers are less than N, therefore we only need to check how many of them have exactly K zero-bits below the leading one bit.
For an integer of bit-length n, there are n - 1 possible positions to place our zeros (bits below the leading one bit). Therefore the number of desired integers of bit-length n is the number of ways to pick k zeros out of n - 1 places (without repetition, unordered). We can compute that using the binomial coefficient formula:
n! / (k! * (n - k)!)
If we're using 32-bit integers, then the max possible value of n is 31 (and the same for k). The factorial of 31 is still huge and won't fit even in a 64-bit number, so we have to perform repeated division (which can be constexpr-precomputed at compile time).
To get the total number of integers, we compute the binomial coefficient for each n from 1 up to floor(log2(N)) and sum them up.
II. Number of desired integers within the range [2 ** floor(log2(N)); N]
Start with the bit right after the leading one bit and apply the following algorithm:
If the current bit is zero, then we can't do anything about that bit (it has to stay zero; if we changed it to one, the integer value would become larger than N), so we simply decrement our zero-bit budget K and move to the next bit.
If the current bit is one, then we can pretend that it is zero. Now any combination of the remaining lower-significance bits will fit in the range below N. Fetch the binomial coefficient value to figure out how many ways there are to pick the remaining number of zeros from the remaining number of bits, and add that to the total.
The algorithm stops once we run out of bits or K becomes zero. At this point, if K equals the remaining number of bits, that means we can zero them out to get a desired integer, so we increment the total count by one (counting N itself towards the total). Or, if K is zero and all the remaining bits are one, then we can also count N towards the total.
Code:
#include <stdio.h>
#include <chrono>

template<typename T>
struct Coefficients {
    static constexpr unsigned size_v = sizeof(T) * 8;

    // Zero-initialize.
    // Indexed by [number_of_zeros][number_of_bits]
    T value[size_v][size_v] = {};

    constexpr Coefficients() {
        // How many different ways we can choose k items from n items
        // without order and without repetition.
        //
        // n! / k! (n - k)!
        value[0][0] = 1;
        value[0][1] = 1;
        value[1][1] = 1;
        for(unsigned i = 2; i < size_v; ++i) {
            value[0][i] = 1;
            value[1][i] = i;
            T r = i;
            for(unsigned j = 2; j < i; ++j) {
                r = (r * (i - j + 1)) / j;
                value[j][i] = r;
            }
            value[i][i] = 1;
        }
    }
};

template<typename T>
__attribute__((noinline)) // To make it easier to benchmark
T count_combinations(T max_value, T zero_bits) {
    if( max_value == 0 )
        return 0;

    constexpr int size = sizeof(T) * 8;
    constexpr Coefficients<T> coefs;

    // assert(zeros_bits < size)
    int bits = size - __builtin_clz(max_value);

    T total = 0;

    // Count all-ones count.
#pragma clang loop vectorize(disable)
    for(int i = 0; i < bits - 1; ++i) {
        total += coefs.value[zero_bits][i];
    }

    // Count interval [2**bits, max_value]
    bits -= 1;
    T mask = T(1) << bits;
    max_value &= ~mask; // Remove leading bit
    mask = mask >> 1;
#pragma clang loop vectorize(disable)
    while( zero_bits && zero_bits < bits ) {
        if( max_value & mask ) {
            // If current bit is one, then we can pretend that it is zero
            // (which would only make the value smaller, which means that
            // it would still be < max_value) and grab all combinations of
            // zeros within the remaining bits.
            total += coefs.value[zero_bits - 1][bits - 1];
            // And then stop pretending it's zero and continue as normal.
        } else {
            // If current bit is zero, we can't do anything about it, just
            // have to spend a zero from our budget.
            zero_bits--;
        }
        max_value &= ~mask;
        mask = mask >> 1;
        bits--;
    }

    // At this point we don't have any more zero bits, or we don't
    // have any more bits at all.
    if( (zero_bits == bits) ||
        (zero_bits == 0 && max_value == ((mask << 1) - 1)) ) {
        total++;
    }

    return total;
}

int main() {
    using namespace std::chrono;

    unsigned count = 0;
    time_point t0 = high_resolution_clock::now();
    for(int i = 0; i < 1000; ++i) {
        count |= count_combinations<unsigned>(1'000'000'000, 8);
    }
    time_point t1 = high_resolution_clock::now();
    auto duration = duration_cast<nanoseconds>(t1 - t0).count();

    printf("result = %u, time = %lld ns\n", count, duration / 1000);

    return 0;
}
Results (for N=1'000'000'000, K=8, running on i7-9750H):
result = 12509316, time = 35 ns
If the binomial coefficients are computed at runtime instead, it takes ~3.2 µs.

BIOS Interrupts in protected mode

I'm working on an operating system project, using isolinux (syslinux 4.5) as the bootloader, loading my kernel with a multiboot header, organised at 0x200000.
As far as I know, the kernel is already in 32-bit protected mode. My question: is there any easier way to get access to BIOS interrupts? (Basically I want int 0x10 :D)
After loading, my kernel sets up its own GDT and IDT entries and further remaps the IRQs. So, is it possible to jump into real mode just after the kernel is loaded, set up a VGA/SVGA mode (a VBE 2.0 mode), and then proceed with my kernel and jump back into protected mode, where I use the VBE 2.0 physical framebuffer address to write onto the screen? If yes, how? I've tried a lot but haven't had any success :(
Side note:
I searched a lot on the internet and found that syslinux 1.x+ provides an _intcall API, but I'm not 100% sure about it.
Refer to "syslinux 4.5\com32\lib\sys\initcall.c"
The BIOS was designed for 16-bit machines. However, you still have three options for calling BIOS interrupts from protected mode.
Switch back to real mode and re-enter protected mode (Easiest approach).
Use v86 mode (not available in 64-bit long mode).
Write your own 16-bit x86 processor emulator (Hardest approach).
I used the first approach in my operating system, for VBE and for disk access through the BIOS.
Code used for this purpose in my operating system:
;______________________________________________________________________________________________________
;Switch to 16-bit real Mode
;IN/OUT: nothing
go16:
[BITS 32]
cli ;Clear interrupts
pop edx ;save return location in edx
jmp 0x20:PM16 ;Load CS with selector 0x20
;For go to 16-bit real mode, first we have to go to 16-bit protected mode
[BITS 16]
PM16:
mov ax, 0x28 ;0x28 is 16-bit protected mode selector.
mov ss, ax
mov ds, ax
mov es, ax
mov gs, ax
mov fs, ax
mov sp, 0x7c00+0x200 ;Stack has base at 0x7c00+0x200
mov eax, cr0
and eax, 0xfffffffe ;Clear protected enable bit in cr0
mov cr0, eax
jmp 0x50:realMode ;Load CS and IP
realMode:
;Load segment registers with 16-bit Values.
mov ax, 0x50
mov ds, ax
mov fs, ax
mov gs, ax
mov ax, 0
mov ss, ax
mov ax, 0
mov es, ax
mov sp, 0x7c00+0x200
cli
lidt[.idtR] ;Load real mode interrupt vector table
sti
push 0x50 ;New CS
push dx ;New IP (saved in edx)
retf ;Load CS, IP and Start real mode
;Real mode interrupt vector table
.idtR: dw 0xffff ;Limit
dd 0 ;Base
The short answer is no. BIOS calls are designed to operate in real mode and don't respect the constraints set by protected mode, so you're not allowed to use them, and the CPU will triple-fault if you try.
However, x86 processors provide Virtual 8086 mode, which can be used to emulate an x86 processor running in 16-bit real mode. The OSDev wiki and forums provide a wealth of information on this topic. If you go this route, it is generally a good idea to map the kernel to the higher half (Linux uses 0xC0000000) to avoid interfering with VM86 code.

Julia vs Python compiler/interpreter optimization

I recently asked the following question about Python:
Interpreter optimization in Python
Say I have a string in x, is the Python interpreter smart enough to
know that: string.replace(x, x) should be converted to a NOP?
The answer seems to be No (although the Python interpreter is able to do some optimizations through the peephole optimiser).
I don't know what the equivalent expression in Julia would be, but is Julia capable of optimizing these types of relatively obvious
statements?
Relying on the compiler
The question is whether Julia can provide enough information to LLVM so that the compiler can optimize things.
For your example, yes, and you can verify this with code_native. In the example below, the answer is premultiplied, the unnecessary assignment to x is optimized away, and the function always returns a constant:
julia> f()=(x=7*24*60*60)
f (generic function with 1 method)
julia> code_native(f,())
.section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
push RBP
mov RBP, RSP
mov EAX, 604800
Source line: 1
pop RBP
ret
Optimization from typing
And it can go a bit further at times, because more knowledge is available from type information. The flip side is that the Any type and globals are to be avoided if possible.
Example
In case I, the comparison needs to be made because y might be 256 or greater, but in the second case, since y is only 1 byte, its value can't reach 256, so the function is optimized to always return 1.
Case I
julia> g(y::Int16)=(y<256?1:0)
g (generic function with 1 method)
julia> code_native(g,(Int16,))
.section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
push RBP
mov RBP, RSP
cmp DI, 256
Source line: 1
setl AL
movzx EAX, AL
pop RBP
ret
Case II
julia> g(y::Int8)=(y<256?1:0)
g (generic function with 2 methods)
julia> code_native(g,(Int8,))
.section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
push RBP
mov RBP, RSP
mov EAX, 1
Source line: 1
pop RBP
ret

addressing mode efficiency

Can someone tell me whether 'immediate' addressing mode is more efficient than addressing through [eax] or any other way?
Let's say I have a long function with some reads and some writes (say 5 reads, 5 writes) to some int value in memory:
mov some_addr, 12 //immediate addressing
//other code
mov eax, some_addr
//other code
mov some_addr, eax // and so on
versus
mov eax, some_addr
mov [eax], 12 // addressing thru [eax]
//other code
mov ebx, [eax]
//other code
mov [eax], ebx // and so on
which one is faster ?
Probably the register indirect access is slightly faster, but for sure it is shorter in its encoding, for example (warning — gas syntax)
67 89 18 mov %ebx, (%eax)
67 8b 18 mov (%eax), %ebx
vs.
89 1c 25 00 00 00 00 mov %ebx, some_addr
8b 1c 25 00 00 00 00 mov some_addr, %ebx
So it has some implications for instruction fetch, cache usage, etc., so it is probably a bit faster; but in a long function with some reads and writes I don't think it's of much importance...
(The zeros in the hex code are supposed to be filled in by the linker (just to have said this).)
[update date="2012-09-30 ~21h30 CEST":
I have run some tests, and I really wondered at what they revealed; so much so that I didn't investigate further :-)
48 8D 04 25 00 00 00 00 leaq s_var,%rax
C7 00 CC BA ED FE movl $0xfeedbacc,(%rax)
performs in most runs better than
C7 04 25 00 00 00 00 CC movl $0xfeedbacc,s_var
BA ED FE
I'm really surprised, and now I'm wondering how Maratyszcza would explain this. I have an idea already, but I'll hold it back for now... anyway, seeing these (example) results
movl to s_var
All 000000000E95F890 244709520
Avg 00000000000000E9 233
Min 00000000000000C8 200
Max 0000000000276B22 2583330
leaq s_var, movl to (reg)
All 000000000BF77C80 200768640
Avg 00000000000000BF 191
Min 00000000000000AA 170
Max 00000000001755C0 1529280
might certainly support his statement that the instruction decoder takes at most 8 bytes per cycle, but it doesn't show how many bytes are really decoded.
In the leaq/movl pair, each instruction is (incl. operands) less than 8 bytes, so it is likely the case that each instruction is dispatched within one cycle, while the single movl is to be divided into two. Still I'm convinced that it is not the decoder slowing things down, since even with the 11 byte movl its work is done after the third byte — then it just has to wait for the pipeline streaming in the address and the immediate, both of which need no decoding.
Since this is 64 bit mode code, I also tested with the 1 byte shorter rip-relative addressing — with (almost) the same result.
Note: These measurements might heavily depend on the (micro-) architecture which they are run on. The values above are given running the testing code on an Atom N450 (constant TSC, boot#1.6GHz, fixed at 1.0GHz during test run), which is unlikely to be representative for the whole x86(-64) platform.
Note: The measurements are taken at runtime, with no further analysis such as occurring task/context switches or other intervening interrupts!
/update]
Addressing using registers is fastest.
Please see Register addressing mode vs Direct addressing mode

Memory leak in si_user_byuid/getpwuid originating from CPSharedResourcesDirectory in iOS

I am at the optimizing/analysing phase of a product that will go live in a couple of weeks, and I am astonished to find some leaks that do not (I believe) originate from my code. One of them is the strdup/malloc leak present in iOS 5.1.1, for which I can do nothing other than wait for an update.
However, today I discovered a new one, and I'm currently trying to pinpoint where and how it's leaking.
Instruments reports:
Bytes Used # Leaks Symbol Name
16 Bytes 100.0% 1 start
16 Bytes 100.0% 1 main
16 Bytes 100.0% 1 UIApplicationMain
16 Bytes 100.0% 1 -[UIApplication _run]
16 Bytes 100.0% 1 _UIAccessibilityInitialize
16 Bytes 100.0% 1 -[UIApplication(UIKitApplicationAccessibility) _accessibilityInit]
16 Bytes 100.0% 1 -[UIApplication(UIKitApplicationAccessibility) _updateAccessibilitySettingsLoader]
16 Bytes 100.0% 1 _AXSAccessibilityEnabled
16 Bytes 100.0% 1 _getBooleanPreference
16 Bytes 100.0% 1 CPCopySharedResourcesPreferencesDomainForDomain
16 Bytes 100.0% 1 CPSharedResourcesDirectory
16 Bytes 100.0% 1 getpwuid
16 Bytes 100.0% 1 si_user_byuid
16 Bytes 100.0% 1 search_item_bynumber
16 Bytes 100.0% 1 si_user_byuid
16 Bytes 100.0% 1 _fsi_get_user
16 Bytes 100.0% 1 0x119048
16 Bytes 100.0% 1 0x1180e4
16 Bytes 100.0% 1 malloc
From a little digging I did, I found out that getpwuid is a unix/linux function imported by <pwd.h>.
Double-clicking on si_user_byuid in Instruments comes up with a "No Source" sign, and double-clicking on getpwuid comes up with armv7 assembly (which, I'm sorry to say, I'm not familiar with):
+0x0 push {r4, r5, r7, lr}
+0x2 add r7, sp, #8
+0x4 movw.w r5, #18506 ; 0x484a
+0x8 movt.w r5, #3265 ; 0xcc1
+0xc mov r4, r0
+0xe add r5, r15
+0x10 ldr r0, [r5]
+0x12 cbnz getpwuid
+0x14 movw.w r0, #47748 ; 0xba84
+0x18 movt.w r0, #1 ; 0x1
+0x1c add r0, r15
+0x1e bl.w getpwuid
+0x22 str r0, [r5]
+0x24 mov r1, r4
+0x26 bl.w getpwuid
+0x2a mov r4, r0
+0x2c movs r0, #201 ; 0xc9
+0x2e mov r1, r4
+0x30 bl.w getpwuid
+0x34 movs r0, #0 ; 0x0
+0x36 cmp r4, #0 ; 0x0
+0x38 it ne
+0x3a addne.w r0, r4, #28 ; 0x1c
+0x3e pop {r4, r5, r7, pc}
So:
Has anyone seen this before?
Could it be a false positive?
Could it be limited to iOS 5.1.1?
Here's the source of the whole thing starting from getpwuid. From a quick look, it seems there's some caching going on, so I wouldn't worry too much unless the leak is big.
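For context, getpwuid() itself hands back a pointer to storage that the library manages (and may cache), and the caller never frees it, so a one-time internal allocation can show up in Instruments as a 16-byte "leak" even though it's reused. A minimal usage sketch (ordinary libc usage, not Apple's internal code):
#include <pwd.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct passwd *pw = getpwuid(getuid());  /* pointer to library-owned data */
    if (pw)
        printf("home dir: %s\n", pw->pw_dir);
    /* no free(pw): the buffer belongs to the library and may be cached/reused */
    return 0;
}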