How do you make code which only gets compiled for platforms which can perform unaligned reads? - optimization

The purpose of the function below is to speed up (possibly unaligned) u64 reads from slices.
The optimised function compiles to mov rax, qword ptr [rdi] on x86_64 and ldr x0, [x0] on aarch64. (The unoptimised version (when used on a little-endian platform) gets compiled to the same assembly, but often explodes into more than sixteen instructions when inlined at -O3.)
This code is not yet correct (see the FIXME):
// Unoptimised version, suitable for both endiannesses and for targets without unaligned reads.
#[cfg(target_endian="big")]
fn u64_from_slice(slice: &[u8]) -> u64 {
debug_assert!(slice.len() >= size_of::<u64>());
unsafe {
*slice.get_unchecked(0) as u64 |
((*slice.get_unchecked(1) as u64) << 8) |
((*slice.get_unchecked(2) as u64) << 16) |
((*slice.get_unchecked(3) as u64) << 24) |
((*slice.get_unchecked(4) as u64) << 32) |
((*slice.get_unchecked(5) as u64) << 40) |
((*slice.get_unchecked(6) as u64) << 48) |
((*slice.get_unchecked(7) as u64) << 56)
}
}
// FIXME: This is only valid on architectures which can perform unaligned reads.
#[cfg(target_endian="little")]
pub fn u64_from_slice(slice: &[u8]) -> u64 {
debug_assert!(slice.len() >= size_of::<u64>());
unsafe {
let r = &*(slice as *const [u8] as *const [u8; size_of::<u64>()]);
*mem::transmute::<&[u8; size_of::<u64>()], &u64>(r)
}
}
Many years ago I worked on ARM architectures where unaligned reads caused an aligned read followed by a rearrangement of the bytes so that the u8 or u16 at that address was moved to the lowest bits of the register.
In this case, my target_endian="little" isn't sufficient to make the code above correct.
How do I make sure that those ARM architectures (and possibly others) are excluded from running the optimised version?

The optimised function compiles to mov rax, qword ptr [rdi] on x86_64 and ldr x0, [x0] on aarch64.
Is it really such a useful gain when from_le_bytes is so very close, and an unreachable_unchecked() in the failure branch basically gets you there? The only thing both retain is a branch on the size of the slice, but that should be a ridiculously well predicted branch.
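For reference, a minimal sketch of the from_le_bytes route with no unsafe code at all. The function name u64_from_prefix is made up for this example; it returns None rather than panicking when the slice is too short:

```rust
/// Reads the first eight bytes of `slice` as a little-endian u64.
/// `u64_from_prefix` is a hypothetical name used only for this sketch.
pub fn u64_from_prefix(slice: &[u8]) -> Option<u64> {
    use std::convert::TryInto;
    // `get(..8)` is the single length check; the conversion to [u8; 8]
    // cannot fail afterwards, but `.ok()?` keeps the code panic-free.
    let bytes: [u8; 8] = slice.get(..8)?.try_into().ok()?;
    Some(u64::from_le_bytes(bytes))
}
```

The length check is the one branch mentioned above. Whether it survives inlining depends on what the caller already knows about the slice length.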
Many years ago I worked on ARM architectures where unaligned reads caused an aligned read followed by a rearrangement of the bytes so that the u8 or u16 at that address was moved to the lowest bits of the register.
You might be thinking of ARMv6 and older, especially ARMv5 and earlier, which would round the address down to a multiple of 4 and then possibly do weird rotations.
ARMv8 supports unaligned reads just fine, at least for most operations, though there may be a perf hit.
How do I make sure that those ARM architectures (and possibly others) are excluded from running the optimised version?
I think explicitly enumerating architectures with target_arch is your least bad bet. It will probably leave some performance on the table, as the handling of unaligned reads is not always an ISA property (especially when it comes to the performance profile).
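A rough sketch of what that enumeration could look like. The architecture list (x86, x86_64, aarch64) is only an assumption for illustration, not a verified set of targets with fast unaligned loads. The sketch also swaps the reference transmute from the question for ptr::read_unaligned, because creating a &u64 to a misaligned address is undefined behaviour in Rust no matter what the hardware tolerates:

```rust
/// Fast path, compiled only for targets named in the list below.
/// The list is an illustrative assumption, not an exhaustive or verified set.
#[cfg(all(
    target_endian = "little",
    any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")
))]
pub fn u64_from_slice(slice: &[u8]) -> u64 {
    // A hard assert (not debug_assert) keeps the unsafe read in bounds.
    assert!(slice.len() >= std::mem::size_of::<u64>());
    // read_unaligned makes no alignment assumption about the pointer,
    // unlike dereferencing a `&u64` built from the slice.
    unsafe { (slice.as_ptr() as *const u64).read_unaligned() }
}

/// Portable fallback for every other target, with no unsafe code.
#[cfg(not(all(
    target_endian = "little",
    any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")
)))]
pub fn u64_from_slice(slice: &[u8]) -> u64 {
    use std::convert::TryInto;
    // Panics if the slice is shorter than eight bytes.
    let bytes: [u8; 8] = slice[..8].try_into().unwrap();
    u64::from_le_bytes(bytes)
}
```

Both variants return the little-endian interpretation of the first eight bytes, so callers see the same result whichever one gets compiled in.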

While not resolving the original question, I was able to satisfy my requirements with no unsafe code, by not using slices at all:
#![feature(slice_as_chunks)]
...
pub fn u64_from_first_eight(buf: &[u8; 9]) -> u64 {
let parts: (&[[u8; 8]], &[u8]) = buf.as_chunks();
u64::from_le_bytes(parts.0[0])
}
pub fn u64_from_last_eight(buf: &[u8; 9]) -> u64 {
let parts: (&[u8], &[[u8; 8]]) = buf.as_rchunks();
u64::from_le_bytes(parts.1[0])
}
These generate efficient assembly code.
x86_64:
example::u64_from_first_eight:
mov rax, qword ptr [rdi]
ret
example::u64_from_last_eight:
mov rax, qword ptr [rdi + 1]
ret
aarch64:
example::u64_from_first_eight:
ldr x0, [x0]
ret
example::u64_from_last_eight:
ldur x0, [x0, #1]
ret
Update: Thanks to Chayim Friedman, I now have the following code which does not depend on nightly features, and yet compiles to the same single assembly instructions.
pub fn u64_from_low_eight(buf: &[u8; 9]) -> u64 {
let bytes: &[u8; size_of::<u64>()] = buf[..size_of::<u64>()].try_into().unwrap();
u64::from_le_bytes(*bytes)
}
pub fn u64_from_high_eight(buf: &[u8; 9]) -> u64 {
let bytes: &[u8; size_of::<u64>()] = buf[1..(size_of::<u64>()+1)].try_into().unwrap();
u64::from_le_bytes(*bytes)
}

Related

Which is the fastest way to find the last N bits of an integer?

Which algorithm is fastest for returning the last n bits in an unsigned integer?
1.
return num & ((1 << bits) - 1)
2.
return num % (1 << bits)
3.
let shift = num.bitWidth - bits
return (num << shift) >> shift
(where bitWidth is the width of the integer, in bits)
Or is there another, faster algorithm?
This is going to depend heavily on what compiler you have, what the optimization settings are, and what size of integers you're working with.
My hypothesis going into this section was that the answer would be "the compiler will be smart enough to optimize all of these in a way that's better than whatever you'd choose to write." And in some sense, that's correct. Consider the following three pieces of code:
#include <stdint.h>
#include <limits.h>
uint32_t lastBitsOf_v1(uint32_t number, uint32_t howManyBits) {
return number & ((1 << howManyBits) - 1);
}
uint32_t lastBitsOf_v2(uint32_t number, uint32_t howManyBits) {
return number % (1 << howManyBits);
}
uint32_t lastBitsOf_v3(uint32_t number, uint32_t howManyBits) {
uint32_t shift = sizeof(number) * CHAR_BIT - howManyBits;
return (number << shift) >> shift;
}
Over at the godbolt compiler explorer with optimization turned up to -Ofast with -march=native enabled, we get this code generated for the three functions:
lastBitsOf_v1(unsigned int, unsigned int):
bzhi eax, edi, esi
ret
lastBitsOf_v2(unsigned int, unsigned int):
bzhi eax, edi, esi
ret
lastBitsOf_v3(unsigned int, unsigned int):
mov eax, 32
sub eax, esi
shlx edi, edi, eax
shrx eax, edi, eax
ret
Notice that the compiler recognized what you were trying to do with the first two versions of this function and completely rewrote the code to use the bzhi x86 instruction. This instruction copies the lower bits of one register into another. In other words, the compiler was able to generate a single assembly instruction! On the other hand, the compiler didn't recognize what the last version was trying to do, so it actually generated the code as written and actually did the shifts and subtraction.
But that's not the end of the story. Imagine that the number of bits to extract is known in advance. For example, suppose we want the lower 13 bits. Now, watch what happens with this code:
#include <stdint.h>
#include <limits.h>
uint32_t lastBitsOf_v1(uint32_t number) {
return number & ((1 << 13) - 1);
}
uint32_t lastBitsOf_v2(uint32_t number) {
return number % (1 << 13);
}
uint32_t lastBitsOf_v3(uint32_t number) {
return (number << 19) >> 19;
}
These are literally the same functions, just with the bit amount hardcoded. Now look at what gets generated:
lastBitsOf_v1(unsigned int):
mov eax, edi
and eax, 8191
ret
lastBitsOf_v2(unsigned int):
mov eax, edi
and eax, 8191
ret
lastBitsOf_v3(unsigned int):
mov eax, edi
and eax, 8191
ret
All three versions get compiled to the exact same code. The compiler saw what we're doing in each case and replaced it with this much simpler code that's basically the first version.
After seeing all of this, what should you do? My recommendation would be the following:
Unless this code is an absolute performance bottleneck - as in, you've measured your code's runtime and you're absolutely certain that the code for extracting the low bits of numbers is what's actually slowing you down - I wouldn't worry too much about this at all. Pick the most readable code that you can. I personally find option (1) the cleanest, but that's just me.
If you absolutely must get every ounce of performance out of this that you can, rather than taking my word for it, I'd recommend tinkering around with different versions of the code and seeing what assembly gets generated in each case and running some performance experiments. After all, if something like this is really important, you'd want to see it for yourself!
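One thing worth checking while experimenting is that the three formulas do not accept the same range of bit counts. Below is a small sketch, transliterated to Rust (the function names are made up for this example), where the valid ranges are written down explicitly. In C, shifting a 32-bit value by 32 or more is undefined behaviour; in Rust the same shift panics in debug builds.

```rust
/// Option 1: mask. Valid for `bits` in 0..=31 (`1 << 32` would overflow).
pub fn low_bits_mask(num: u32, bits: u32) -> u32 {
    num & ((1u32 << bits) - 1)
}

/// Option 2: remainder. Valid for `bits` in 0..=31, same reason as above.
pub fn low_bits_rem(num: u32, bits: u32) -> u32 {
    num % (1u32 << bits)
}

/// Option 3: shift up, then down. Valid for `bits` in 1..=32
/// (`bits == 0` would shift by the full width of the type).
pub fn low_bits_shift(num: u32, bits: u32) -> u32 {
    let shift = u32::BITS - bits;
    (num << shift) >> shift
}
```

For bit counts from 1 to 31 all three agree, so the choice only matters at the edges (0 or the full width) and for readability.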
Hope this helps!

Direct2D COM calls returning 64-bit structs and C++Builder 2010

I'm trying to get the size of a Direct2D Bitmap and getting an immediate crash.
// props and target etc all set up beforehand.
CComPtr<ID2D1Bitmap> b;
target->CreateBitmap(D2D1::SizeU(1024,1024), frame.p_data, 1024 * 4, &props, &b);
D2D_SIZE_U sz = b->GetPixelSize(); // Crashes here.
All other operations using the bitmap (including drawing it) work correctly. It's just returning the size that seems to be the problem.
Based on articles like this one by Rudy V, my suspicion is that it's some incompatibility between C++Builder 2010 and how COM functions return 64-bit structures. http://rvelthuis.de/articles/articles-convert.html
The Delphi declaration of GetPixelSize looks like this: (from D2D1.pas)
// Returns the size of the bitmap in resolution dependent units, (pixels).
procedure GetPixelSize(out pixelSize: TD2D1SizeU); stdcall;
... and in D2D1.h it's
//
// Returns the size of the bitmap in resolution dependent units, (pixels).
//
STDMETHOD_(D2D1_SIZE_U, GetPixelSize)(
) CONST PURE;
Can I fix this without rewriting the D2D headers?
All suggestions welcome - except upgrading from C++Builder 2010 which is more of a task than I'm ready for at the moment.
„getInfo“ is a function, derived from Delphi code, that can work around the problem.
void getInfo(void* itfc, void* info, int vmtofs)
{
asm {
push info // pass pointer to return result
mov eax,itfc // eax points to interface
push eax // pass pointer to interface
mov eax,[eax] // eax points to VMT
add eax,vmtofs // eax points to address of virtual function
call dword ptr [eax] // call function
}
}
Disassembly of code generated by CBuilder, which results in a crash:
Graphics.cpp.162: size = bmp->GetSize();
00401C10 8B4508 mov eax,[ebp+$08]
00401C13 FF7004 push dword ptr [eax+$04]
00401C16 8D55DC lea edx,[ebp-$24]
00401C19 52 push edx
00401C1A 8B4D08 mov ecx,[ebp+$08]
00401C1D 8B4104 mov eax,[ecx+$04]
00401C20 8B10 mov edx,[eax]
00401C22 FF5210 call dword ptr [edx+$10]
00401C25 8B4DDC mov ecx,[ebp-$24]
00401C28 894DF8 mov [ebp-$08],ecx
00401C2B 8B4DE0 mov ecx,[ebp-$20]
00401C2E 894DFC mov [ebp-$04],ecx
„bmp“ is declared as
ID2D1Bitmap* bmp;
Code to call „getInfo“:
D2D1_SIZE_F size;
getInfo(bmp,&size,0x10);
You get 0x10 (vmtofs) from disassembly line „call dword ptr [edx+$10]“
You can call „GetPixelSize“, „GetPixelFormat“ and others by calling „getInfo“
D2D1_SIZE_U ps;// = bmp->GetPixelSize();
getInfo(bmp,&ps,0x14);
D2D1_PIXEL_FORMAT pf;// = bmp->GetPixelFormat();
getInfo(bmp,&pf,0x18);
„getInfo“ works with methods „STDMETHOD_ ... CONST PURE;“, which return a result.
STDMETHOD_(D2D1_SIZE_F, GetSize)(
) CONST PURE;
For this method CBuilder generates malfunctioning code.
In case of
STDMETHOD_(void, GetDpi)(
__out FLOAT *dpiX,
__out FLOAT *dpiY
) CONST PURE;
the CBuilder code works fine, because „GetDpi“ returns void.

Speeding up the loop

I have the following piece of code:
for chunk in imagebuf.chunks_mut(4) {
let temp = chunk[0];
chunk[0] = chunk[2];
chunk[2] = temp;
}
For an array of 40000 u8s, it takes about 2.5 ms on my machine, compiled using cargo build --release.
The following C++ code takes about 100 us for the exact same data (verified by implementing it and using FFI to call it from rust):
for(;imagebuf!=endbuf;imagebuf+=4) {
char c=imagebuf[0];
imagebuf[0]=imagebuf[2];
imagebuf[2]=c;
}
I'm thinking it should be possible to speed up the Rust implementation to perform as fast as the C++ version.
The Rust program was built using cargo --release, the C++ program was built without any optimization flags.
Any hints?
I cannot reproduce the timings you are getting. You probably have an error in how you measure (or I have 😉). On my machine both versions run in exactly the same time.
In this answer, I will first compare the assembly output of both, the C++ and the Rust version. Afterwards I will describe how to reproduce my timings.
Assembly comparison
I generated the assembly code with the amazing Compiler Explorer (Rust code, C++ Code). I compiled the C++ code with optimizations activated (-O3), too, to make it a fair game (C++ compiler optimizations had no impact on the measured timings though). Here is the resulting assembly (Rust left, C++ right):
example::foo_rust: | foo_cpp(char*, char*):
test rsi, rsi | cmp rdi, rsi
je .LBB0_5 | je .L3
mov r8d, 4 |
.LBB0_2: | .L5:
cmp rsi, 4 |
mov rdx, rsi |
cmova rdx, r8 |
test rdi, rdi |
je .LBB0_5 |
cmp rdx, 3 |
jb .LBB0_6 |
movzx ecx, byte ptr [rdi] | movzx edx, BYTE PTR [rdi]
movzx eax, byte ptr [rdi + 2] | movzx eax, BYTE PTR [rdi+2]
| add rdi, 4
mov byte ptr [rdi], al | mov BYTE PTR [rdi-2], al
mov byte ptr [rdi + 2], cl | mov BYTE PTR [rdi-4], dl
lea rdi, [rdi + rdx] |
sub rsi, rdx | cmp rsi, rdi
jne .LBB0_2 | jne .L5
.LBB0_5: | .L3:
| xor eax, eax
ret | ret
.LBB0_6: |
push rbp +-----------------+
mov rbp, rsp |
lea rdi, [rip + panic_bounds_check_loc.3] |
mov esi, 2 |
call core::panicking::panic_bounds_check#PLT |
You can immediately see that C++ does in fact produce a lot less assembly (without optimization, C++ produced nearly as many instructions as Rust does). I am not sure about all of the additional instructions Rust produces, but at least half of them are for bounds checking. As far as I understand, though, this bounds checking is not for the actual accesses via [] but happens just once per loop iteration, to cover the case where the slice's length is not divisible by 4. But I guess the Rust assembly could still be better (even with bounds checks).
As mentioned in the comments, you can remove bound checking by using get_unchecked() and get_unchecked_mut(). Note however, that this did not influence the performance in my measurements!
Lastly: you should use [T]::swap(i, j) here.
for chunk in imagebuf.chunks_mut(4) {
chunk.swap(0, 2);
}
This, again, did not notably influence performance. But it's shorter and better code.
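If the per-iteration length check bothers you, one option is chunks_exact_mut, which only yields chunks of exactly the requested length and leaves any remainder untouched. A minimal sketch (the function name is made up for this example, and I have not measured it against the versions above):

```rust
/// Swaps bytes 0 and 2 in every complete 4-byte group of `imagebuf`.
/// `swap_first_and_third` is a hypothetical name for this sketch.
pub fn swap_first_and_third(imagebuf: &mut [u8]) {
    // Every chunk has length exactly 4, so indices 0 and 2 are always
    // in bounds; a trailing partial group (1 to 3 bytes) is skipped.
    for chunk in imagebuf.chunks_exact_mut(4) {
        chunk.swap(0, 2);
    }
}
```

Because the chunk length is known to be 4, the optimizer typically has what it needs to drop the panic path, but check the generated assembly for your compiler version rather than taking that for granted.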
Measuring
I used this C++ code (in foocpp.cpp):
extern "C" void foo_cpp(char *imagebuf, char *endbuf);
void foo_cpp(char* imagebuf, char* endbuf) {
for(;imagebuf!=endbuf;imagebuf+=4) {
char c=imagebuf[0];
imagebuf[0]=imagebuf[2];
imagebuf[2]=c;
}
}
I compiled it with:
gcc -c -O3 foocpp.cpp && ar rvs libfoocpp.a foocpp.o
Then I used this Rust code to measure everything:
#![feature(test)]
extern crate libc;
extern crate test;
use test::black_box;
use std::time::Instant;
#[link(name = "foocpp")]
extern {
fn foo_cpp(start: *mut libc::c_char, end: *const libc::c_char);
}
pub fn foo_rust(imagebuf: &mut [u8]) {
for chunk in imagebuf.chunks_mut(4) {
let temp = chunk[0];
chunk[0] = chunk[2];
chunk[2] = temp;
}
}
fn main() {
let mut buf = [0u8; 40_000];
let before = Instant::now();
foo_rust(black_box(&mut buf));
black_box(buf);
println!("rust: {:?}", Instant::now() - before);
// ----------------------------------
let mut buf = [0u8 as libc::c_char; 40_000];
let before = Instant::now();
let ptr = buf.as_mut_ptr();
let end = unsafe { ptr.offset(buf.len() as isize) };
unsafe { foo_cpp(black_box(ptr), black_box(end)); }
black_box(buf);
println!("cpp: {:?}", Instant::now() - before);
}
The black_box() all over the place prevents the compiler from optimizing where it isn't supposed to. I executed it with (nightly compiler):
LIBRARY_PATH=.:$LIBRARY_PATH cargo run --release
Giving me (i7-6700HQ) values like these:
rust: Duration { secs: 0, nanos: 30583 }
cpp: Duration { secs: 0, nanos: 30810 }
The times fluctuate a lot (way more than the difference between both versions). I am not exactly sure why the additional assembly generated by Rust does not result in a slower execution, though.
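Given how much single measurements fluctuate, a slightly more robust approach is to repeat the measurement and keep the best run. Here is a rough sketch on stable Rust using std::hint::black_box; the helper names best_of and measure_swap are made up for this example, and this is not a replacement for a real benchmark harness:

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `f` `runs` times and returns the shortest observed duration.
/// Returns Duration::MAX if `runs` is zero.
pub fn best_of<F: FnMut()>(runs: u32, mut f: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..runs {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}

/// Times the byte-swapping loop from the question on a 40_000-byte buffer.
pub fn measure_swap(runs: u32) -> Duration {
    let mut buf = vec![0u8; 40_000];
    let best = best_of(runs, || {
        // black_box hides the buffer from the optimizer so the loop
        // cannot be removed as dead code.
        for chunk in black_box(&mut buf[..]).chunks_mut(4) {
            let temp = chunk[0];
            chunk[0] = chunk[2];
            chunk[2] = temp;
        }
    });
    black_box(&buf);
    best
}
```

Calling measure_swap(100) and printing the result with {:?} gives one number that is less sensitive to a cold cache or a scheduler hiccup on the first run.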

How unwind ARM Cortex M3 stack

On an ARM Cortex STM32, the HardFault_Handler can only get the values of a few registers (r0, r1, r2, r3, lr, pc, xPSR) at the time of the crash. But there is no FP or SP on the stack, so I cannot unwind the stack.
Is there any solution for this? Thanks a lot.
[update]
Following instructions I found on the web, I made ARMGCC (Keil uVision IDE) generate an FP by adding the compiler option "--use_frame_pointer", but I could not find the FP on the stack. I am a real newbie here. Below is my demo code:
int test2(int i, int j)
{
return i/j;
}
int main()
{
SCB->CCR |= 0x10;
int a = 10;
int b = 0;
int c;
c = test2(a,b);
}
enum { r0 = 0, r1, r2, r3, r11, r12, lr, pc, psr};
void Hard_Fault_Handler(uint32_t *faultStackAddress)
{
uint32_t r0_val = faultStackAddress[r0];
uint32_t r1_val = faultStackAddress[r1];
uint32_t r2_val = faultStackAddress[r2];
uint32_t r3_val = faultStackAddress[r3];
uint32_t r12_val = faultStackAddress[r12];
uint32_t r11_val = faultStackAddress[r11];
uint32_t lr_val = faultStackAddress[lr];
uint32_t pc_val = faultStackAddress[pc];
uint32_t psr_val = faultStackAddress[psr];
}
I have two questions here:
1. I am not sure what the index of FP (r11) on the stack is, or whether it is pushed onto the stack at all. I assume it comes before r12, because I compared the assembly source before and after adding the option "--use_frame_pointer". I also compared the values read in Hard_Fault_Handler, and it seems that r11 is not on the stack, because the r11 value I read points to a place that is not in my code.
[update] I have confirmed that FP is pushed into the stack. The second question still needs to be answered.
See below snippet code:
Without the option "--use_frame_pointer"
test2 PROC
MOVS r0,#3
BX lr
ENDP
main PROC
PUSH {lr}
MOVS r0,#0
BL test2
MOVS r0,#0
POP {pc}
ENDP
with the option "--use_frame_pointer"
test2 PROC
PUSH {r11,lr}
ADD r11,sp,#4
MOVS r0,#3
MOV sp,r11
SUB sp,sp,#4
POP {r11,pc}
ENDP
main PROC
PUSH {r11,lr}
ADD r11,sp,#4
MOVS r0,#0
BL test2
MOVS r0,#0
MOV sp,r11
SUB sp,sp,#4
POP {r11,pc}
ENDP
2. It seems that FP is not in the input parameter faultStackAddress of Hard_Fault_Handler(). Where can I get the caller's FP to unwind the stack?
[update again]
Now I understand that the last FP (r11) is not stored on the stack. All I need to do is read the value of the r11 register; then I can unwind the whole stack.
So now my final question is how to read it using the C inline assembler. I tried the code below, following the reference at http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0472f/Cihfhjhg.html, but it failed to read the correct value from r11:
volatile int top_fp;
__asm
{
mov top_fp, r11
}
r11's value is 0x20009DCC
top_fp's value is 0x00000004
[update 3] Below is my whole code.
int test5(int i, int j, int k)
{
char a[128] = {0} ;
a[0] = 'a';
return i/j;
}
int test2(int i, int j)
{
char a[18] = {0} ;
a[0] = 'a';
return test5(i, j, 0);
}
int main()
{
SCB->CCR |= 0x10;
int a = 10;
int b = 0;
int c;
c = test2(a,b); //create a divide by zero crash
}
/* The fault handler implementation calls a function called Hard_Fault_Handler(). */
#if defined(__CC_ARM)
__asm void HardFault_Handler(void)
{
TST lr, #4
ITE EQ
MRSEQ r0, MSP
MRSNE r0, PSP
B __cpp(Hard_Fault_Handler)
}
#else
void HardFault_Handler(void)
{
__asm("TST lr, #4");
__asm("ITE EQ");
__asm("MRSEQ r0, MSP");
__asm("MRSNE r0, PSP");
__asm("B Hard_Fault_Handler");
}
#endif
void Hard_Fault_Handler(uint32_t *faultStackAddress)
{
volatile int top_fp;
__asm
{
mov top_fp, r11
}
//TODO: use top_fp to unwind the whole stack.
}
[update 4] Finally, I figured it out. My solution:
Note: To access r11, we have to use the embedded assembler (see here), which took me a long time to figure out.
//we have to use embedded assembler.
__asm int getRegisterR11()
{
mov r0,r11
BX LR
}
//call it from Hard_Fault_Handler function.
/*
Function call stack frame:
FP1(r11) -> | lr |(High Address)
| FP2|(prev FP)
| ...|
Current FP(r11) ->| lr |
| FP1|(prev FP)
| ...|(Low Address)
With FP, we can access lr(link register) which is the address to return when the current functions returns(where you were).
Then (current FP - 1) points to prev FP.
Thus we can unwind the stack.
*/
void unwindBacktrace(uint32_t topFp, uint16_t* backtrace)
{
uint32_t nextFp = topFp;
int j = 0;
//#define BACK_TRACE_DEPTH 5
//loop backtrace using FP(r11), save lr into an uint16_t array.
for(int i = 0; i < BACK_TRACE_DEPTH; i++)
{
uint32_t lr = *((uint32_t*)nextFp);
if ((lr >= 0x08000000) && (lr <= 0x08FFFFFF))
{
backtrace[j*2] = LOW_16_BITS(lr);
backtrace[j*2 + 1] = HIGH_16_BITS(lr);
j += 1;
}
nextFp = *((uint32_t*)nextFp - 1);
if (nextFp == 0)
{
break;
}
}
}
#if defined(__CC_ARM)
__asm void HardFault_Handler(void)
{
TST lr, #4
ITE EQ
MRSEQ r0, MSP
MRSNE r0, PSP
B __cpp(Hard_Fault_Handler)
}
#else
void HardFault_Handler(void)
{
__asm("TST lr, #4");
__asm("ITE EQ");
__asm("MRSEQ r0, MSP");
__asm("MRSNE r0, PSP");
__asm("B Hard_Fault_Handler");
}
#endif
void Hard_Fault_Handler(uint32_t *faultStackAddress)
{
//get back trace
int topFp = getRegisterR11();
unwindBacktrace(topFp, persistentData.faultStack.back_trace);
}
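To make the frame-pointer walk easier to follow (and to test off-target), here is a host-side simulation in Rust. It is only a sketch under the layout described in the comment above: the word at FP holds the saved lr, and the word just below it holds the previous FP. The stack is modelled as a slice of 32-bit words indexed by word number instead of real addresses, and the function name and the bounds check on fp are additions of this sketch, not part of the code above.

```rust
/// Walks a frame-pointer chain in a simulated stack and collects the
/// saved lr values that fall inside the flash range used above.
/// `stack[i]` models the 32-bit word at word index `i`; `top_fp` is a
/// word index, not a byte address (an assumption of this simulation).
pub fn unwind_simulated(stack: &[u32], top_fp: u32, max_depth: usize) -> Vec<u32> {
    let mut trace = Vec::new();
    let mut fp = top_fp as usize;
    for _ in 0..max_depth {
        // Stop on the end-of-chain marker or on a pointer outside the stack.
        if fp == 0 || fp >= stack.len() {
            break;
        }
        let lr = stack[fp];
        // Keep only return addresses that point into flash.
        if (0x0800_0000..=0x08FF_FFFF).contains(&lr) {
            trace.push(lr);
        }
        // The previous frame pointer sits one word below the saved lr.
        fp = stack[fp - 1] as usize;
    }
    trace
}
```

Validating fp before dereferencing matters more inside a fault handler than anywhere else, since a corrupted frame pointer would otherwise cause a second fault while the first one is being reported.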
A very primitive method to unwind the stack in such a case is to read all stack memory above the SP seen at the time of HardFault_Handler and process it using arm-none-eabi-addr2line. All link register entries saved on the stack will be translated into source lines (remember that the actual code path goes through the line before the one LR points to). Note that if functions in between were called using a branch instruction (b) instead of branch and link (bl), you will not see them using this method.
(I don't have enough reputation points to write comments, so I'm editing my answer):
UPDATE for question 2:
Why do you expect Hard_Fault_Handler to have any arguments? Hard_Fault_Handler is usually a function whose address is stored in the vector (exception) table. When the processor exception happens, Hard_Fault_Handler is executed. There is no argument passing involved in this. Still, all registers at the time the fault happens are preserved. Specifically, if you compiled without omit-frame-pointer, you can just read the value of R11 (or R7 in Thumb-2 mode). However, to be sure that the Hard_Fault_Handler in your code is actually the real hard fault handler, look into the startup.s code and see whether Hard_Fault_Handler is the third entry in the vector table. If another function is there, it means Hard_Fault_Handler is just called from that function explicitly. See this article for details. You can also read my blog :) There is a chapter about the stack which is based on an Android example, but a lot of things are the same in general.
Also note, most probably in faultStackAddress should be stored a stack pointer, not a frame pointer.
UPDATE 2
OK, let's clarify some things. Firstly, please paste the code from which you call Hard_Fault_Handler. Secondly, I guess you call it from within the real HardFault exception handler. In that case you cannot expect R11 to be at faultStackAddress[r11]. You already mentioned this in the first sentence of your question. There will be only r0-r3, r12, lr, pc and psr.
You've also written:
But there is no FP and SP in the stack. Thus I could not unwind the
stack. Is there any solution for this?
The SP is not "in the stack" because you already have it in one of the stack registers (msp or psp). See again THIS ARTICLE. Also, FP is not crucial for unwinding the stack because you can do it without it (by "navigating" through saved Link Registers). Another thing is that if you dump memory below your SP, you can expect FP to be right next to the saved LR if you really need it.
Answering your last question: I don't know how you're verifying this code or how you're calling it (you need to paste the full code). You can look into the assembly of that function and see what's happening under the hood. Another thing you can do is follow this post as a template.

are 2^n exponent calculations really less efficient than bit-shifts?

if I do:
int x = 4;
pow(2, x);
Is that really that much less efficient than just doing:
1 << 4
?
Yes. An easy way to show this is to compile the following two functions that do the same thing and then look at the disassembly.
#include <stdint.h>
#include <math.h>
uint32_t foo1(uint32_t shftAmt) {
return pow(2, shftAmt);
}
uint32_t foo2(uint32_t shftAmt) {
return (1 << shftAmt);
}
cc -arch armv7 -O3 -S -o - shift.c (I happen to find ARM asm easier to read but if you want x86 just remove the arch flag)
_foo1:
# BB#0:
push {r7, lr}
vmov s0, r0
mov r7, sp
vcvt.f64.u32 d16, s0
vmov r0, r1, d16
blx _exp2
vmov d16, r0, r1
vcvt.u32.f64 s0, d16
vmov r0, s0
pop {r7, pc}
_foo2:
# BB#0:
movs r1, #1
lsl.w r0, r1, r0
bx lr
You can see foo2 only takes 2 instructions, whereas foo1 takes several. It has to move the data to the FP HW registers (vmov), convert the integer to a floating-point value (vcvt.f64.u32), call the exp2 function, then convert the answer back to a uint (vcvt.u32.f64) and move it from the FP HW back to the GP registers.
Yes. Though by how much I can't say. The easiest way to determine that is to benchmark it.
The pow function uses doubles... At least, if it conforms to the C standard. Even if that function used bitshift when it sees a base of 2, there would still be testing and branching to reach that conclusion, by which time your simple bitshift would be completed. And we haven't even considered the overhead of a function call yet.
For equivalency, I assume you meant to use 1 << x instead of 1 << 4.
Perhaps a compiler could optimize both of these, but it's far less likely to optimize a call to pow. If you need the fastest way to compute a power of 2, do it with shifting.
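To illustrate that the results are interchangeable for in-range exponents, here is a small sketch in Rust rather than C (so it says nothing about how a particular C library implements pow). The function names are made up for this example. It compares a floating-point power, Rust's integer pow, and a plain shift:

```rust
/// Power of two via floating point, similar in spirit to calling pow(2, x).
pub fn pow2_float(x: u32) -> u32 {
    2f64.powi(x as i32) as u32
}

/// Power of two via integer exponentiation.
pub fn pow2_int(x: u32) -> u32 {
    2u32.pow(x)
}

/// Power of two via a single shift. Valid for x in 0..=31.
pub fn pow2_shift(x: u32) -> u32 {
    1u32 << x
}
```

All three give the same values for x from 0 to 31; the difference is only in how much work is done to get there.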
Update... Since I mentioned it's easy to benchmark, I decided to do just that. I happen to have Windows and Visual C++ handy so I used that. Results will vary. My program:
#include <Windows.h>
#include <cstdio>
#include <cmath>
#include <ctime>
LARGE_INTEGER liFreq, liStart, liStop;
inline void StartTimer()
{
QueryPerformanceCounter(&liStart);
}
inline double ReportTimer()
{
QueryPerformanceCounter(&liStop);
double milli = 1000.0 * double(liStop.QuadPart - liStart.QuadPart) / double(liFreq.QuadPart);
printf( "%.3f ms\n", milli );
return milli;
}
int main()
{
QueryPerformanceFrequency(&liFreq);
const size_t nTests = 10000000;
int x = 4;
int sumPow = 0;
int sumShift = 0;
double powTime, shiftTime;
// Make an array of random exponents to use in tests.
const size_t nExp = 10000;
int e[nExp];
srand( (unsigned int)time(NULL) );
for( int i = 0; i < nExp; i++ ) e[i] = rand() % 31;
// Test power.
StartTimer();
for( size_t i = 0; i < nTests; i++ )
{
int y = (int)pow(2, (double)e[i%nExp]);
sumPow += y;
}
powTime = ReportTimer();
// Test shifting.
StartTimer();
for( size_t i = 0; i < nTests; i++ )
{
int y = 1 << e[i%nExp];
sumShift += y;
}
shiftTime = ReportTimer();
// The compiler shouldn't optimize out our loops if we need to display a result.
printf( "Sum power: %d\n", sumPow );
printf( "Sum shift: %d\n", sumShift );
printf( "Time ratio of pow versus shift: %.2f\n", powTime / shiftTime );
system("pause");
return 0;
}
My output:
379.466 ms
15.862 ms
Sum power: 157650768
Sum shift: 157650768
Time ratio of pow versus shift: 23.92
That depends on the compiler, but in general (when the compiler is not totally braindead) yes: the shift is one CPU instruction, while the other is a function call, which involves saving the current state and setting up a stack frame, and that requires many instructions.
Generally yes, as bit shift is very basic operation for the processor.
On the other hand, many compilers optimise the code so that raising 2 to a power is in fact just a bit shift.