Arm loop code optimization for calculating mean standard deviation - objective-c

I want to optimize some code that uses 25% of my app's CPU time. The code is the following:
for (int i=0; i<8; i++) {
    unsigned f = *p++;
    sum += f;
    sqsum += f*f;
}
I wrote some ARM assembly, but it does not work; it does not even compile. Here it is:
void loop(uint8_t * p , int * sum ,int * qsum)
{
    __asm__ volatile("vld4.8 {d0}, [%0]! \n"
                     "mov r4, #0 \n"
                     "vmlal.u8 [%1]!, [%1]!, d0 \n"
                     "vmull.u8 r4, d0 , d0 \n"
                     "vmlal.u8 [%2]!, [%2]!, r4\n"
                     :
                     : "r"(p), "r"(sum), "r"(qsum)
                     : "r4"
    );
}
Any help?
Here is the function I want to improve:
void calculateMeanStDev8x8(cv::Mat* patch, int sx, int sy, int& mean, float& stdev)
{
    unsigned sum=0;
    unsigned sqsum=0;
    for (int j=0; j< 8; j++) {
        const unsigned char* p = (const unsigned char*)(patch->data + (j+sy)*patch->step + sx); // Pointer to the start of the matrix row
        //The code to improve
        for (int i=0; i<8; i++) {
            unsigned f = *p++;
            sum += f;
            sqsum += f*f;
        }
    }
    mean = sum >> 6;
    int r = (sum*sum) >> 6;
    stdev = sqrtf(sqsum - r);
    if (stdev < .1) {

That loop is a perfect candidate for NEON optimization. Your 8 unsigned 8-bit values fit into a single 64-bit NEON register. There is no "sum all elements of a vector" instruction, but you can use the pairwise add to compute the sum of the 8 elements in 3 steps. Since we can't see the rest of your application, it's hard to know what the big picture is, but NEON is your best bet for improving the speed. All recent Apple products support NEON instructions, and in Xcode you can mix the NEON intrinsics with your C++ code.
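A minimal sketch of the pairwise-add idea for one row of 8 pixels, written with NEON intrinsics. The function name sum_sqsum8 is made up for illustration, and the scalar fallback is only there so the snippet also builds on targets without NEON; treat it as a starting point rather than tuned code.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds the sum and the sum of squares of 8 bytes at p to *sum and *sqsum.
   sum_sqsum8 is a hypothetical helper name, not part of any library. */
static void sum_sqsum8(const uint8_t *p, unsigned *sum, unsigned *sqsum)
{
#if defined(__ARM_NEON)
    uint8x8_t v = vld1_u8(p);              /* load 8 bytes into one D register */

    /* Sum: three widening pairwise adds, 8 -> 4 -> 2 -> 1 lanes. */
    uint16x4_t s16 = vpaddl_u8(v);
    uint32x2_t s32 = vpaddl_u16(s16);
    uint64x1_t s64 = vpaddl_u32(s32);
    *sum += (unsigned)vget_lane_u64(s64, 0);

    /* Squares: 8 x (u8 * u8) -> 8 x u16 (255 * 255 fits in 16 bits). */
    uint16x8_t sq  = vmull_u8(v, v);
    uint32x4_t q32 = vpaddlq_u16(sq);      /* 8 -> 4 lanes */
    uint64x2_t q64 = vpaddlq_u32(q32);     /* 4 -> 2 lanes */
    *sqsum += (unsigned)(vgetq_lane_u64(q64, 0) + vgetq_lane_u64(q64, 1));
#else
    /* Scalar fallback with the same result. */
    for (int i = 0; i < 8; i++) {
        unsigned f = p[i];
        *sum += f;
        *sqsum += f * f;
    }
#endif
}
```

Calling it once per row replaces the inner loop. Whether it actually beats the compiler's own vectorization is something to measure on the target device.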


SMHasher setup?

The SMHasher test suite for hash functions is touted as the best of the lot. But the latest version I've got (from rurban) gives absolutely no clue on how to check your proposed hash function (it does include an impressive battery of hash functions, but some of interest --if only for historic value-- are missing). Add that I'm a complete CMake newbie.
It's actually quite simple. You just need to install CMake.
Building SMHasher
To build SMHasher on a Linux/Unix machine:
git clone
cd smhasher/
git submodule init
git submodule update
cmake .
Adding a new hash function
To add a new function, you can edit just three files: Hashes.cpp, Hashes.h and main.cpp.
For example, I will add the ElfHash:
unsigned long ElfHash(const unsigned char *s)
{
    unsigned long h = 0, high;
    while (*s)
    {
        h = (h << 4) + *s++;
        if (high = h & 0xF0000000)
            h ^= high >> 24;
        h &= ~high;
    }
    return h;
}
First, we need to modify it slightly to take a seed and a length:
uint32_t ElfHash(const void *key, int len, uint32_t seed)
{
    unsigned long h = seed, high;
    const uint8_t *data = (const uint8_t *)key;
    for (int i = 0; i < len; i++)
    {
        h = (h << 4) + *data++;
        if (high = h & 0xF0000000)
            h ^= high >> 24;
        h &= ~high;
    }
    return h;
}
Add this function definition to Hashes.cpp. Also add the following to Hashes.h:
uint32_t ElfHash(const void *key, int len, uint32_t seed);
inline void ElfHash_test(const void *key, int len, uint32_t seed, void *out) {
    *(uint32_t *) out = ElfHash(key, len, seed);
}
In file main.cpp add the following line into array g_hashes:
{ ElfHash_test, 32, 0x0, "ElfHash", "ElfHash 32-bit", POOR, {0x0} },
(The third value is the self-verification value. You will only learn the correct value after running the test once.)
Finally, rebuild and run the test:
./SMHasher ElfHash
It will show you all the tests that this hash function fails. (It is very bad.)

Basic group arithmetic in libsodium

I am trying to implement a simple cryptographic primitive.
In the following code, given sa, sk and hn, I want to compute sb such that sa*G = (sb + sk.hn)*G.
However, after finding sb, the following equality does not hold: sb*G + (sk.hn)*G = sa*G.
My understanding is that arithmetic in the exponent is done modulo the order of the group instead of L.
However, I have a few questions relating to their implementation:
Why does the scalar have to be chosen from [0,L], where L is the order of the subgroup?
Is there a "helper" function that multiplies two large scalars without reducing modulo L?
#include <sodium.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (sodium_init() < 0) {
        /* panic! the library couldn't be initialized, it is not safe to use */
        return -1;
    }
    uint8_t sb[crypto_core_ed25519_SCALARBYTES];
    uint8_t sa[crypto_core_ed25519_SCALARBYTES];
    uint8_t hn[crypto_core_ed25519_SCALARBYTES];
    uint8_t sk[crypto_core_ed25519_SCALARBYTES];
    crypto_core_ed25519_scalar_random(sa); // s_a <- [0,l]
    crypto_core_ed25519_scalar_random(sk); // sk <- [0,l]
    crypto_core_ed25519_scalar_random(hn); // hn <- [0,l]
    uint8_t product[crypto_core_ed25519_SCALARBYTES];
    crypto_core_ed25519_scalar_mul(product, sk,hn); // sk*hn
    crypto_core_ed25519_scalar_sub(sb, sa, product); // sb = sa-hn*sk
    uint8_t point1[crypto_core_ed25519_BYTES];
    crypto_scalarmult_ed25519_base(point1, sa); // sa*G
    uint8_t point2[crypto_core_ed25519_BYTES];
    uint8_t sum[crypto_core_ed25519_BYTES];
    // equal
    // crypto_core_ed25519_scalar_add(sum, sb, product);
    // crypto_scalarmult_ed25519_base(point2, sum);
    // is not equal
    uint8_t temp1[crypto_core_ed25519_BYTES];
    uint8_t temp2[crypto_core_ed25519_BYTES];
    crypto_scalarmult_ed25519_base(temp1, sb); // sb*G
    crypto_scalarmult_ed25519_base(temp2, product); // (sk*hn)*G
    crypto_core_ed25519_add(point2, temp1, temp2);
    if(memcmp(point1, point2, 32) != 0)
    {
        printf("[-] Not equal ");
        return -1;
    }
    printf("[+] equal");
    return 0;
}
I got the answer from jedisct1, the author of libsodium, and I will post it here:
crypto_scalarmult_ed25519_base() clamps the scalar (clears the 3 lower bits, sets the high bit) before performing the multiplication.
Use crypto_scalarmult_ed25519_base_noclamp() to prevent this.
Or, even better, use the Ristretto group instead.
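Why clamping breaks the check can be written out explicitly. This is a sketch under the assumption that the clamping is the usual Ed25519 one (clear the three low bits, clear bit 255, set bit 254); c(s) for the clamped scalar and p for the reduced product are my own notation, not libsodium's.

```latex
\begin{align*}
c(s) &= 2^{254} + 8\lfloor s/8 \rfloor, \qquad 0 \le s < 2^{254} \\
c(s_b)\,G + c(p)\,G &= \bigl(2^{255} + 8\lfloor s_b/8 \rfloor + 8\lfloor p/8 \rfloor\bigr)\,G,
    \qquad p = s_k h_n \bmod L \\
c(s_a)\,G &= \bigl(2^{254} + 8\lfloor s_a/8 \rfloor\bigr)\,G \\
s_b\,G + p\,G &= (s_b + p)\,G = s_a\,G
    \qquad \text{(no clamping; } s_b \equiv s_a - p \pmod{L} \text{ and } G \text{ has order } L\text{)}
\end{align*}
```

The two clamped expressions are different multiples of G in general, which is why the comparison fails. The last line is the identity the code was meant to check, and it only holds when the scalars are used unclamped.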

Simulating a card game. degenerate suits

The title might be a bit cryptic, but I have a very specific problem. First, my current setup:
In my card simulator I deal 32 cards to 4 players in sets of 8, so 8 cards per player,
with the 4 standard suits (spades, hearts, etc.).
My current implementation cycles through all combinations of 8 out of 32,
which gives me a large number of possibilities.
Namely, the first player can be dealt 10518300 different hands.
The second can then be dealt 735471 different hands.
The third player can then be dealt 12870 different hands,
and finally the fourth can be dealt only 1,
giving me a grand total of 9.9561092e+16 unique ways to deal a deck of 32 cards to 4 players, if the order of the cards within a hand doesn’t matter.
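The counts above are the binomial coefficients C(32,8), C(24,8), C(16,8) and C(8,8), and their product still fits in 64 bits. A small sketch that reproduces them (binom and count_deals are illustrative names, not part of my simulator):

```c
#include <stdint.h>

/* C(n, k) computed iteratively. After step i the running value equals
   C(n, i), so every division is exact. */
static uint64_t binom(unsigned n, unsigned k)
{
    if (k > n)
        return 0;
    if (k > n - k)
        k = n - k;
    uint64_t result = 1;
    for (unsigned i = 1; i <= k; ++i)
        result = result * (n - i + 1) / i;
    return result;
}

/* Number of ways to deal 32 cards to 4 players, 8 cards each,
   ignoring the order of cards within a hand. */
static uint64_t count_deals(void)
{
    return binom(32, 8) * binom(24, 8) * binom(16, 8) * binom(8, 8);
}
```

Using a 64-bit accumulator matters here: the grand total is far beyond the range of a 32-bit unsigned.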
On a 4 GHz processor, even at 1 tick per possibility, going through all of them would take about 288 days, roughly nine months.
However, I would like to simplify the dealing by treating diamonds, hearts and spades as interchangeable, meaning that dealing 8 hearts to player 1 is equivalent to dealing 8 spades. (Note that this doesn’t apply to clubs.)
I am looking for a way to generate only the distinct deals, because this will cut down the possibilities for the first hand by up to a factor of 6 (there are 3! = 6 ways to relabel the three interchangeable suits). My current implementation is in C++,
but feel free to answer in a different language.
/** */
unsigned cjasMain::nChoosek( unsigned n, unsigned k )
{
    //assert(k < n);
    if (k > n) return 0;
    if (k * 2 > n) k = n-k;
    if (k == 0) return 1;
    int result = n;
    for( int i = 2; i <= k; ++i ) {
        result *= (n-i+1);
        result /= i;
    }
    return result;
}
/** [combination c n r x]
 * get the [x]th lexicographically ordered set of [r] elements in [n]
 * output is in [c], and should be sizeof(int)*[r]
 * */
void cjasMain::Combination(int8_t* c,unsigned n,unsigned r, unsigned x){
    int i,p,k = 0;
    for (i = 0; i < (int)r - 1; i++) {
        c[i] = (i != 0) ? c[i-1] : 0;
        do {
            c[i]++;
            p = nChoosek(n-c[i],r-(i+1));
            k = k + p;
        } while(k < x);
        k = k - p;
    }
    c[r-1] = c[r-2] + x - k;
}
/** */
template <unsigned n,std::size_t r>
void cjasMain::Combinations()
{
    static_assert(n>=r,"error n needs to be larger than r");
    std::vector<bool> v(n);
    std::fill(v.begin() + r, v.end(), true);
    do {
        for (int i = 0; i < n; ++i) {
            if (!v[i]) {
                COUT << (i+1) << " ";
            }
        }
        static int j=0;
        COUT <<'\t'<< j++<< "\n";
    } while (std::next_permutation(v.begin(), v.end()));
}
A requirement is that I can get back the original array from its lexicographic index.
Even the slightest optimization would help my Monte Carlo simulation, I hope.

Determine Position of Most Significant Set Bit in a Byte

I have a byte I am using to store bit flags. I need to compute the position of the most significant set bit in the byte.
Example Byte: 00101101 => 6 is the position of the most significant set bit
Compact Hex Mapping:
[0x00] => 0x00
[0x01] => 0x01
[0x02,0x03] => 0x02
[0x04,0x07] => 0x03
[0x08,0x0F] => 0x04
[0x10,0x1F] => 0x05
[0x20,0x3F] => 0x06
[0x40,0x7F] => 0x07
[0x80,0xFF] => 0x08
TestCase in C:
#include <stdio.h>
unsigned char check(unsigned char b) {
    unsigned char c = 0x08;
    unsigned char m = 0x80;
    do {
        if(m&b) { return c; }
        else { c -= 0x01; }
    } while(m>>=1);
    return 0; //reached only when b == 0
}
int main() {
unsigned char input[256] = {
0xf0,0xf1,0xf2,0xf3,0xf4,0xf5,0xf6,0xf7,0xf8,0xf9,0xfa,0xfb,0xfc,0xfd,0xfe,0xff };
unsigned char truth[256] = {
int i,r;
int f = 0;
for(i=0; i<256; ++i) {
if(r !=(truth[i])) {
printf("failed %d : 0x%x : %d\n",i,0x000000FF & ((int)input[i]),r);
f += 1;
if(!f) { printf("passed all\n"); }
else { printf("failed %d\n",f); }
return 0;
I would like to simplify my check() function to not involve looping (or branching preferably). Is there a bit twiddling hack or hashed lookup table solution to compute the position of the most significant set bit in a byte?
Your question is about an efficient way to compute log2 of a value. And because you seem to want a solution that is not limited to the C language I have been slightly lazy and tweaked some C# code I have.
You want to compute log2(x) + 1, and for x = 0 (where log2 is undefined) you define the result as 0 (i.e. you create a special case where log2(0) = -1).
static readonly Byte[] multiplyDeBruijnBitPosition = new Byte[] {
    7, 2, 3, 4,
    6, 1, 5, 0
};

public static Byte Log2Plus1(Byte value) {
    if (value == 0)
        return 0;
    var roundedValue = value;
    roundedValue |= (Byte) (roundedValue >> 1);
    roundedValue |= (Byte) (roundedValue >> 2);
    roundedValue |= (Byte) (roundedValue >> 4);
    var log2 = multiplyDeBruijnBitPosition[((Byte) (roundedValue*0xE3)) >> 5];
    return (Byte) (log2 + 1);
}
This bit twiddling hack is taken from Find the log base 2 of an N-bit integer in O(lg(N)) operations with multiply and lookup where you can see the equivalent C source code for 32 bit values. This code has been adapted to work on 8 bit values.
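Since the question's test case is in C, here is a sketch of how the 8-bit routine above could look in C. The table and the 0xE3 multiplier are the ones from the C# code; the names debruijn_pos and log2_plus1 are just my choice for the port.

```c
#include <stdint.h>

/* Lookup table indexed by the top 3 bits of (smeared value * 0xE3). */
static const uint8_t debruijn_pos[8] = { 7, 2, 3, 4, 6, 1, 5, 0 };

/* Returns the 1-based position of the most significant set bit,
   or 0 when value is 0. */
static uint8_t log2_plus1(uint8_t value)
{
    if (value == 0)
        return 0;

    /* Smear the highest set bit downwards, giving a value of the form 2^k - 1. */
    uint8_t v = value;
    v |= (uint8_t)(v >> 1);
    v |= (uint8_t)(v >> 2);
    v |= (uint8_t)(v >> 4);

    /* Multiply, keep the low 8 bits, and use the top 3 of those as the index. */
    return (uint8_t)(debruijn_pos[(uint8_t)(v * 0xE3u) >> 5] + 1);
}
```

For the example byte 00101101 this returns 6. The one remaining branch is the zero check.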
However, you may be able to get the result from a very efficient built-in function (on many CPUs this maps to a single instruction such as Bit Scan Reverse). An answer to the question Bit twiddling: which bit is set? has some information about this. A quote from that answer gives one possible reason why there is low-level support for solving this problem:
Things like this are the core of many O(1) algorithms such as kernel schedulers which need to find the first non-empty queue signified by an array of bits.
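As an illustration of the built-in route, here is a sketch in C that assumes GCC or Clang, where __builtin_clz is available as a compiler extension (other compilers have their own equivalents). The function name msb_position is made up for this example.

```c
#include <limits.h>

/* 1-based position of the most significant set bit of b, or 0 when b == 0.
   Assumes GCC or Clang. __builtin_clz has an undefined result for 0,
   so the zero case is handled explicitly. */
static unsigned char msb_position(unsigned char b)
{
    if (b == 0)
        return 0;
    return (unsigned char)(sizeof(unsigned) * CHAR_BIT
                           - (unsigned)__builtin_clz((unsigned)b));
}
```

The byte is widened to unsigned first, so the leading-zero count is taken over the full width of unsigned and then subtracted from that width.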
That was a fun little challenge. I don't know if this one is completely portable since I only have VC++ to test with, and I certainly can't say for sure if it's more efficient than other approaches. This version was coded with a loop but it can be unrolled without too much effort.
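static unsigned char check(unsigned char b)
{
    unsigned char r = 8;
    unsigned char sub = 1;
    unsigned char s = 7;
    for (char i = 0; i < 8; i++)
    {
        // sub stays 1 until the first set bit is seen, then becomes 0 for good
        sub = sub & (((b & (1 << s)) >> s) - 1);
        s--; // decremented in its own statement: reading and modifying s in one expression is undefined
        r -= sub;
    }
    return r;
}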
I'm sure everyone else has long since moved on to other topics but there was something in the back of my mind suggesting that there had to be a more efficient branch-less solution to this than just unrolling the loop in my other posted solution. A quick trip to my copy of Warren put me on the right track: Binary search.
Here's my solution based on that idea:
static unsigned char check(unsigned char b)
{
    unsigned char offset;
    // see if there's a bit set in the upper half
    if ((b >> 4) != 0)
    {
        offset = 4;
        b >>= 4;
    }
    else
        offset = 0;
    // see if there's a bit set in the upper half of what's left
    if ((b & 0x0C) != 0)
    {
        offset += 2;
        b >>= 2;
    }
    // see if there's a bit set in the upper half of what's left
    if ((b & 0x02) != 0)
    {
        offset += 1;
        b >>= 1;
    }
    return b + offset;
}
Branch-less C++ implementation:
static unsigned char check(unsigned char b)
{
    unsigned char adj = 4 & ((((unsigned char) - (b >> 4) >> 7) ^ 1) - 1);
    unsigned char offset = adj;
    b >>= adj;
    adj = 2 & (((((unsigned char) - (b & 0x0C)) >> 7) ^ 1) - 1);
    offset += adj;
    b >>= adj;
    adj = 1 & (((((unsigned char) - (b & 0x02)) >> 7) ^ 1) - 1);
    return (b >> adj) + offset + adj;
}
Yes, I know that this is all academic :)
It is not possible in plain C. The best I would suggest is the following implementation of check. Although it is quite "ugly", I think it runs faster than the check version in the question.
int check(unsigned char b)
{
    if(b&128) return 8;
    if(b&64) return 7;
    if(b&32) return 6;
    if(b&16) return 5;
    if(b&8) return 4;
    if(b&4) return 3;
    if(b&2) return 2;
    if(b&1) return 1;
    return 0;
}
Edit: I found a link to the actual code:
The algorithm below is named nlz8 in that file. You can choose your favorite hack.
From last comment of:
> Hacker's Delight explains how to correct for the error in 32-bit floats
> in 5-3 Counting Leading 0's. Here's their code, which uses an anonymous
> union to overlap asFloat and asInt: k = k & ~(k >> 1); asFloat =
> (float)k + 0.5f; n = 158 - (asInt >> 23); (and yes, this relies on
> implementation-defined behavior) - Derrick Coetzee Jan 3 '12 at 8:35
unsigned char check (unsigned char b) {
    union {
        float asFloat;
        int asInt;
    } u;
    unsigned k = b & ~(b >> 1);
    u.asFloat = (float)k + 0.5f;
    return 32 - (158 - (u.asInt >> 23));
}
Edit: I'm not exactly sure what the asker means by language independent, but below is the equivalent code in Python.
import ctypes

class Anon(ctypes.Union):
    _fields_ = [
        ("asFloat", ctypes.c_float),
        ("asInt", ctypes.c_int)
    ]

def check(b):
    k = int(b) & ~(int(b) >> 1)
    a = Anon(asFloat=(float(k) + float(0.5)))
    return 32 - (158 - (a.asInt >> 23))

are 2^n exponent calculations really less efficient than bit-shifts?

if I do:
int x = 4;
pow(2, x);
Is that really that much less efficient than just doing:
1 << 4
Yes. An easy way to show this is to compile the following two functions that do the same thing and then look at the disassembly.
#include <stdint.h>
#include <math.h>
uint32_t foo1(uint32_t shftAmt) {
    return pow(2, shftAmt);
}

uint32_t foo2(uint32_t shftAmt) {
    return (1 << shftAmt);
}
cc -arch armv7 -O3 -S -o - shift.c (I happen to find ARM asm easier to read but if you want x86 just remove the arch flag)
# BB#0:
push {r7, lr}
vmov s0, r0
mov r7, sp
vcvt.f64.u32 d16, s0
vmov r0, r1, d16
blx _exp2
vmov d16, r0, r1
vcvt.u32.f64 s0, d16
vmov r0, s0
pop {r7, pc}
# BB#0:
movs r1, #1
lsl.w r0, r1, r0
bx lr
You can see that foo2 takes only 2 instructions (plus the return), whereas foo1 takes several. foo1 has to move the data to the FP hardware registers (vmov), convert the integer to a double (vcvt.f64.u32), call the exp2 function, convert the answer back to an unsigned integer (vcvt.u32.f64), and move it from the FP hardware back to the general-purpose registers.
Yes. Though by how much I can't say. The easiest way to determine that is to benchmark it.
The pow function uses doubles... At least, if it conforms to the C standard. Even if that function used bitshift when it sees a base of 2, there would still be testing and branching to reach that conclusion, by which time your simple bitshift would be completed. And we haven't even considered the overhead of a function call yet.
For equivalency, I assume you meant to use 1 << x instead of 1 << 4.
Perhaps a compiler could optimize both of these, but it's far less likely to optimize a call to pow. If you need the fastest way to compute a power of 2, do it with shifting.
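One thing to watch with the shift: shifting by the full width of the type or more is undefined behaviour in C, so a helper should check the exponent. A minimal sketch (pow2_u32 is a made-up name, and returning 0 for an out-of-range exponent is just one possible convention):

```c
#include <stdint.h>

/* Returns 2^x for 0 <= x <= 31, and 0 when x is out of range.
   The range check is needed because shifting a 32-bit value by
   32 or more is undefined behaviour. */
static uint32_t pow2_u32(unsigned x)
{
    return (x < 32u) ? (UINT32_C(1) << x) : 0u;
}
```

Using an unsigned 1 also avoids the signed overflow that 1 << 31 would cause with a plain int.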
Update... Since I mentioned it's easy to benchmark, I decided to do just that. I happen to have Windows and Visual C++ handy so I used that. Results will vary. My program:
#include <Windows.h>
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <ctime>

LARGE_INTEGER liFreq, liStart, liStop;

inline void StartTimer()
{
    QueryPerformanceFrequency(&liFreq);
    QueryPerformanceCounter(&liStart);
}

inline double ReportTimer()
{
    QueryPerformanceCounter(&liStop);
    double milli = 1000.0 * double(liStop.QuadPart - liStart.QuadPart) / double(liFreq.QuadPart);
    printf( "%.3f ms\n", milli );
    return milli;
}

int main()
{
    const size_t nTests = 10000000;
    int x = 4;
    int sumPow = 0;
    int sumShift = 0;
    double powTime, shiftTime;

    // Make an array of random exponents to use in tests.
    const size_t nExp = 10000;
    int e[nExp];
    srand( (unsigned int)time(NULL) );
    for( int i = 0; i < nExp; i++ ) e[i] = rand() % 31;

    // Test power.
    StartTimer();
    for( size_t i = 0; i < nTests; i++ )
    {
        int y = (int)pow(2, (double)e[i%nExp]);
        sumPow += y;
    }
    powTime = ReportTimer();

    // Test shifting.
    StartTimer();
    for( size_t i = 0; i < nTests; i++ )
    {
        int y = 1 << e[i%nExp];
        sumShift += y;
    }
    shiftTime = ReportTimer();

    // The compiler shouldn't optimize out our loops if we need to display a result.
    printf( "Sum power: %d\n", sumPow );
    printf( "Sum shift: %d\n", sumShift );
    printf( "Time ratio of pow versus shift: %.2f\n", powTime / shiftTime );
    return 0;
}
My output:
379.466 ms
15.862 ms
Sum power: 157650768
Sum shift: 157650768
Time ratio of pow versus shift: 23.92
That depends on the compiler, but in general (when the compiler is not totally braindead) yes: the shift is one CPU instruction, while the other is a function call, which involves saving the current state and setting up a stack frame, and that requires many instructions.
Generally yes, as bit shift is very basic operation for the processor.
On the other hand, a compiler may optimise the code so that raising 2 to a power is in fact just a bit shift.