Initial value for the Newton-Raphson division algorithm when the input is close to 0

I am implementing a Newton-Raphson division algorithm for values in [-16:16] in S15.16 on an FPGA. For values in |[1:16]| I achieved an MSE of 10e-9 with 3 iterations. The way I have initialized the value a0 is by taking the reciprocal of the midpoint of each range:
Some examples are:
From range 1 <= x < 2: a0 = 2/3
From range 6 <= x < 7: a0 = 2/13
From range 15 <= x < 16: a0 = 2/31
This approximation works well, as can be seen in the following plot:
So, the problem is the range [0:1]. How can I find the optimal initial value, or a good approximation to it, for that range?
The Wikipedia article says:
For the subproblem of choosing an initial estimate X0, it is convenient to apply a bit-shift to the divisor D to scale it so that 0.5 ≤ D ≤ 1; by applying the same bit-shift to the numerator N, one ensures the quotient does not change. Then one could use a linear approximation of the form
X0 = T1 + T2 * D
to initialize Newton-Raphson. To minimize the maximum of the absolute value of the error of this approximation on the interval [0.5, 1], one should use
X0 = 48/17 - 32/17 * D
OK, this approximation works well for the range [0.5:1], but:
What happens when the value gets smaller, close to 0? E.g.
0.1, 0.01, 0.001, 0.00001... and so on? I see a problem here because I think an initial value would be needed for each range [0.001:0.01], [0.01:0.1], ... etc.
For smaller values, is it better to apply another algorithm, such as the Goldschmidt division algorithm?
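For what it's worth, here is a rough C sketch (my own, not from the question's code) of the normalization idea in the quote: shift the divisor so it lies in [0.5, 1), seed Newton-Raphson with the minimax linear estimate 48/17 - 32/17*D, iterate, and undo the shift at the end. One seed then covers every input magnitude, including values close to 0; __builtin_clz stands in for the FPGA's priority encoder.
#include <stdint.h>
static int clz32(uint32_t x) { return x ? __builtin_clz(x) : 32; }
/* Reciprocal of a u16.16 value, result in u16.16 (valid while 1/b fits). */
uint32_t recip_u16_16(uint32_t b)
{
    if (b == 0) return 0xFFFFFFFFu;                  /* saturate on divide-by-zero */
    int lz = clz32(b);
    uint64_t d = (uint64_t)b << lz;                  /* d / 2^32 lies in [0.5, 1) */
    /* seed x0 = 48/17 - 32/17 * (d / 2^32), kept in Q30 */
    uint64_t x = (uint64_t)(48.0 / 17 * (1u << 30))
               - (((uint64_t)(32.0 / 17 * (1u << 30)) * d) >> 32);
    for (int i = 0; i < 3; i++) {                    /* x <- x * (2 - d*x), all in Q30 */
        uint64_t dx = (d * x) >> 32;
        x = (x * ((2ull << 30) - dx)) >> 30;
    }
    /* x ~ 2^(62-lz) / b, so the u16.16 reciprocal 2^32 / b is x >> (30 - lz);
       for b < 2 (values below 2^-15) the true reciprocal overflows u16.16 anyway */
    int sh = 30 - lz;
    return (uint32_t)(sh >= 0 ? (x >> sh) : (x << -sh));
}
For example, b = 655 (about 0.01 in S15.16) yields roughly 100.0 in u16.16 without any extra table entries, which is the case the question is worried about.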
This is the code I have implemented to emulate Newton-Raphson division in fixed point:
import numpy as np

i = 0
# 16 fractional bits
SHIFT = 2 ** 16
# LUT with "optimal?" values to init the NRD algorithm
LUT = np.round(1 / np.arange(0.5, 16, 1) * SHIFT).astype(np.int64)
LUT_f = 1 / np.arange(0.5, 16, 1)

# Function that simulates the NRD algorithm in S15.16 on an FPGA
def FIXED_RECIPROCAL(x):
    # Address the LUT with the integer part of x
    address = x >> 16
    # Get the initial value from the LUT
    a0 = LUT[address]
    # Algorithm with only 3 iterations
    for i in range(3):
        s1 = (a0 * a0) >> 16
        s2 = (x * s1) >> 16
        a0 = (a0 << 1) - s2
    # Return the rescaled value (only for analysis purposes)
    return a0 / SHIFT

# ----- TEST ----- #
t = np.arange(1, 16, 0.1)
teor = 1 / t
t_fixed = (t * SHIFT).astype(np.int32)
prac = np.zeros(len(t))
for value in t_fixed:
    prac[i] = FIXED_RECIPROCAL(value)
    i = i + 1

# Get and print the errors
errors = abs(prac - teor)
mse = ((prac - teor)**2).mean(axis=None)
print("Max Error : %s" % ('{:.3E}'.format(np.max(errors))))
print("MSE: : %s" % ('{:.3E}'.format(mse)))

# Plot the obtained values:
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.plot(t, teor, label='Theoretical division')
plt.plot(t, prac, '.', label='Newton-Raphson Division')
plt.legend(fontsize=16)
plt.title('Float 32 theoretical division Vs. S15.16 Newton-Raphson division', fontsize=22)
plt.xlabel('x', fontsize=20)
plt.ylabel('1 / x', fontsize=20)
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)

The ISO C99 code below demonstrates how one can achieve an almost correctly rounded implementation of Newton-Raphson based s15.16 division, using a 2-kilobit lookup table for the initial reciprocal approximation and a number of 32x32-bit multipliers capable of delivering both the low 32 bits and the high 32 bits of the full product. For ease of implementation, the signed s15.16 division is mapped to unsigned u16.16 division for operands in [0, 2^31].
We need to make full use of the 32-bit data path throughout by keeping operands normalized. In essence we are converting the computation into a quasi floating-point format. This requires a priority encoder to find the most significant set bit in both the dividend and the divisor in the initial normalization step. For software convenience, this is mapped to a CLZ (count leading zeros) operation, present in many processors, in the code below.
After computing the reciprocal of the divisor b, we multiply by the dividend a to determine the raw quotient q = (1/b)*a. In order to round correctly to nearest-or-even, we need to compute the remainder for the quotient as well as for its increment and decrement. The correctly rounded quotient corresponds to the candidate quotient whose remainder has the smallest magnitude.
In order for this to work perfectly, we would need a raw quotient that is within 1 ulp of the mathematical result. Unfortunately, this is not the case here, since the raw quotient is occasionally off by ±2 ulps. We would effectively need 33 bits in some of the intermediate computations, which could be simulated in software, but I don't have time to puzzle this out right now. The code "as-is" delivers the correctly rounded result in more than 99.999% of random test cases.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#define TAB_BITS_IN (8) /* 256 entry LUT */
#define TAB_BITS_OUT (9) /* 9 bits effective, 8 bits stored */
#define TRUNC_COMP (1) /* compensate truncation in fixed-point multiply */
int clz (uint32_t a); // count leading zeros: a priority encoder
uint32_t umul32_hi (uint32_t a, uint32_t b); // upper half of 32x32-bit product
/* i in [0,255]: (int)(1.0 / (1.0 + 1.0/512.0 + i / 256.0) * 512 + .5) & 0xff
In a second step tuned to minimize the number of incorrect results with the
specific implementation of the two refinement steps chosen.
*/
static uint8_t rcp_tab[256] =
{
0xff, 0xfd, 0xfb, 0xf9, 0xf7, 0xf5, 0xf3, 0xf1,
0xf0, 0xee, 0xec, 0xea, 0xe8, 0xe6, 0xe5, 0xe3,
0xe1, 0xdf, 0xdd, 0xdc, 0xda, 0xd8, 0xd7, 0xd5,
0xd3, 0xd2, 0xd0, 0xce, 0xcd, 0xcb, 0xc9, 0xc8,
0xc6, 0xc5, 0xc3, 0xc2, 0xc0, 0xbf, 0xbd, 0xbc,
0xba, 0xb9, 0xb7, 0xb6, 0xb4, 0xb3, 0xb1, 0xb0,
0xae, 0xad, 0xac, 0xaa, 0xa9, 0xa7, 0xa6, 0xa5,
0xa4, 0xa2, 0xa1, 0x9f, 0x9e, 0x9d, 0x9c, 0x9a,
0x99, 0x98, 0x96, 0x95, 0x94, 0x93, 0x91, 0x90,
0x8f, 0x8e, 0x8c, 0x8b, 0x8a, 0x89, 0x88, 0x87,
0x86, 0x84, 0x83, 0x82, 0x81, 0x80, 0x7f, 0x7e,
0x7c, 0x7b, 0x7a, 0x79, 0x78, 0x77, 0x76, 0x74,
0x74, 0x73, 0x71, 0x71, 0x70, 0x6f, 0x6e, 0x6d,
0x6b, 0x6b, 0x6a, 0x68, 0x67, 0x67, 0x66, 0x65,
0x64, 0x63, 0x62, 0x61, 0x60, 0x5f, 0x5e, 0x5d,
0x5c, 0x5b, 0x5b, 0x59, 0x58, 0x58, 0x56, 0x56,
0x55, 0x54, 0x53, 0x52, 0x51, 0x51, 0x50, 0x4f,
0x4e, 0x4e, 0x4c, 0x4b, 0x4b, 0x4a, 0x48, 0x48,
0x48, 0x46, 0x46, 0x45, 0x44, 0x43, 0x43, 0x42,
0x41, 0x40, 0x3f, 0x3f, 0x3e, 0x3d, 0x3c, 0x3b,
0x3b, 0x3a, 0x39, 0x38, 0x38, 0x37, 0x36, 0x36,
0x35, 0x34, 0x34, 0x33, 0x32, 0x31, 0x30, 0x30,
0x2f, 0x2e, 0x2e, 0x2d, 0x2d, 0x2c, 0x2b, 0x2a,
0x2a, 0x29, 0x28, 0x27, 0x27, 0x26, 0x26, 0x25,
0x24, 0x23, 0x23, 0x22, 0x21, 0x21, 0x21, 0x20,
0x1f, 0x1f, 0x1e, 0x1d, 0x1d, 0x1c, 0x1c, 0x1b,
0x1a, 0x19, 0x19, 0x19, 0x18, 0x17, 0x17, 0x16,
0x16, 0x15, 0x14, 0x13, 0x13, 0x12, 0x12, 0x11,
0x11, 0x10, 0x0f, 0x0f, 0x0e, 0x0e, 0x0e, 0x0d,
0x0c, 0x0c, 0x0b, 0x0b, 0x0a, 0x0a, 0x09, 0x08,
0x08, 0x07, 0x07, 0x07, 0x06, 0x05, 0x05, 0x04,
0x04, 0x03, 0x03, 0x02, 0x02, 0x01, 0x01, 0x01
};
/* Divide two u16.16 fixed-point operands, each in [0, 2**31]. Attempt to round
the result to nearest or even. Currently this does not always succeed. We
would need effectively 33 bits in the intermediate computation for that, so
that the raw quotient is within +/- 1 ulp of the mathematical result.
*/
uint32_t div_core (uint32_t a, uint32_t b)
{
/* normalize dividend and divisor to [1,2); bit 31 is the integer bit */
uint8_t lza = clz (a);
uint8_t lzb = clz (b);
uint32_t na = a << lza;
uint32_t nb = b << lzb;
/* LUT is addressed by most significant fraction bits of divisor */
uint32_t idx = (nb >> (32 - 1 - TAB_BITS_IN)) & 0xff;
uint32_t rcp = rcp_tab [idx] | 0x100; // add implicit msb
/* first NR iteration */
uint32_t f = (rcp * rcp) << (32 - 2*TAB_BITS_OUT);
uint32_t p = umul32_hi (f, nb);
rcp = (rcp << (32 - TAB_BITS_OUT)) - p;
/* second NR iteration */
rcp = rcp << 1;
p = umul32_hi (rcp, nb);
rcp = umul32_hi (rcp, 0 - p);
/* compute raw quotient as (1/b)*a; off by at most +/- 2ulps */
rcp = (rcp << 1) | TRUNC_COMP;
uint32_t quot = umul32_hi (rcp, na);
uint8_t shift = lza - lzb + 15;
quot = (shift > 31) ? 0 : (quot >> shift);
/* round quotient using 4:1 mux */
uint32_t ah = a << 16;
uint32_t prod = quot * b;
uint32_t rem1 = abs (ah - prod);
uint32_t rem2 = abs (ah - prod - b);
uint32_t rem3 = abs (ah - prod + b);
int sel = (((rem2 < rem1) << 1) | ((rem3 < rem1) & (quot != 0)));
switch (sel) {
case 0:
default:
quot = quot;
break;
case 1:
quot = quot - 1;
break;
case 2: /* fall through */
case 3:
quot = quot + 1;
break;
}
return quot;
}
int32_t div_s15p16 (int32_t a, int32_t b)
{
uint32_t aa = abs (a);
uint32_t ab = abs (b);
uint32_t quot = div_core (aa, ab);
quot = ((a ^ b) & 0x80000000) ? (0 - quot) : quot;
return (int32_t)quot;
}
uint64_t umul32_wide (uint32_t a, uint32_t b)
{
return ((uint64_t)a) * b;
}
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
return (uint32_t)(umul32_wide (a, b) >> 32);
}
#define VARIANT (1)
int clz (uint32_t a)
{
#if VARIANT == 1
static const uint8_t clz_tab[32] = {
31, 22, 30, 21, 18, 10, 29, 2, 20, 17, 15, 13, 9, 6, 28, 1,
23, 19, 11, 3, 16, 14, 7, 24, 12, 4, 8, 25, 5, 26, 27, 0
};
a |= a >> 16;
a |= a >> 8;
a |= a >> 4;
a |= a >> 2;
a |= a >> 1;
return clz_tab [0x07c4acddu * a >> 27] + (!a);
#elif VARIANT == 2
uint32_t b;
int n = 31 + (!a);
if ((b = (a & 0xffff0000u))) { n -= 16; a = b; }
if ((b = (a & 0xff00ff00u))) { n -= 8; a = b; }
if ((b = (a & 0xf0f0f0f0u))) { n -= 4; a = b; }
if ((b = (a & 0xccccccccu))) { n -= 2; a = b; }
if (( (a & 0xaaaaaaaau))) { n -= 1; }
return n;
#elif VARIANT == 3
int n = 0;
if (!(a & 0xffff0000u)) { n |= 16; a <<= 16; }
if (!(a & 0xff000000u)) { n |= 8; a <<= 8; }
if (!(a & 0xf0000000u)) { n |= 4; a <<= 4; }
if (!(a & 0xc0000000u)) { n |= 2; a <<= 2; }
if ((int32_t)a >= 0) n++;
if ((int32_t)a == 0) n++;
return n;
#elif VARIANT == 4
uint32_t b;
int n = 32;
if ((b = (a >> 16))) { n = n - 16; a = b; }
if ((b = (a >> 8))) { n = n - 8; a = b; }
if ((b = (a >> 4))) { n = n - 4; a = b; }
if ((b = (a >> 2))) { n = n - 2; a = b; }
if ((b = (a >> 1))) return n - 2;
return n - a;
#endif
}
uint32_t div_core_ref (uint32_t a, uint32_t b)
{
int64_t quot = ((int64_t)a << 16) / b;
/* round to nearest or even */
int64_t rem1 = ((int64_t)a << 16) - quot * b;
int64_t rem2 = rem1 - b;
if (llabs (rem2) < llabs (rem1)) quot++;
if ((llabs (rem2) == llabs (rem1)) && (quot & 1)) quot &= ~1;
return (uint32_t)quot;
}
// George Marsaglia's KISS PRNG, period 2**123. Newsgroup sci.math, 21 Jan 1999
// Bug fix: Greg Rose, "KISS: A Bit Too Simple" http://eprint.iacr.org/2011/007
static uint32_t kiss_z=362436069, kiss_w=521288629;
static uint32_t kiss_jsr=123456789, kiss_jcong=380116160;
#define znew (kiss_z=36969*(kiss_z&65535)+(kiss_z>>16))
#define wnew (kiss_w=18000*(kiss_w&65535)+(kiss_w>>16))
#define MWC ((znew<<16)+wnew )
#define SHR3 (kiss_jsr^=(kiss_jsr<<13),kiss_jsr^=(kiss_jsr>>17), \
kiss_jsr^=(kiss_jsr<<5))
#define CONG (kiss_jcong=69069*kiss_jcong+1234567)
#define KISS ((MWC^CONG)+SHR3)
int main (void)
{
uint64_t count = 0ULL, stats[3] = {0ULL, 0ULL, 0ULL};
uint32_t a, b, res, ref;
do {
/* random dividend and divisor, avoiding overflow and division by zero */
do {
a = KISS % 0x80000001u;
b = KISS % 0x80000001u;
} while ((b == 0) || ((((uint64_t)a << 16) / b) > 0x80000000ULL));
/* compute function under test and reference result */
ref = div_core_ref (a, b);
res = div_core (a, b);
if (llabs ((int64_t)res - (int64_t)ref) > 1) {
printf ("\nerror: a=%08x b=%08x res=%08x ref=%08x\n", a, b, res, ref);
break;
} else {
stats[(int64_t)res - (int64_t)ref + 1]++;
}
count++;
if (!(count & 0xffffff)) {
printf ("\r[-1]=%llu [0]=%llu [+1]=%llu", stats[0], stats[1], stats[2]);
}
} while (count);
return EXIT_SUCCESS;
}
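For reference, a small stand-alone sketch that regenerates the stored rcp_tab entries from the formula quoted in the comment above. It reproduces only the base formula, not the second tuning step mentioned there, so a few entries may differ.
#include <stdio.h>
int main (void)
{
    for (int i = 0; i < 256; i++) {
        /* base formula from the rcp_tab comment */
        int e = (int)(1.0 / (1.0 + 1.0/512.0 + i / 256.0) * 512 + .5) & 0xff;
        printf ("0x%02x,%c", e, (i % 8 == 7) ? '\n' : ' ');
    }
    return 0;
}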

Related

Sha256 opencl kernel needed [Help needed]

I need a SHA-256 kernel file. I am using Cloo as my OpenCL library, and it will be included in a WPF project.
I am calculating a hash value several times.
The program needs about 30 minutes or so to do that, but my search results claimed OpenCL would reduce that time to under 3 minutes or less.
Thanks in advance.
[Edit]
OK, now I managed to do it using this:
https://searchcode.com/file/45893396/src/opencl/sha256_kernel.cl/
It works fine with strings, yet when I send my byte-array header to be hashed it returns a very different value than expected.
[Edit2]
It cannot handle large arrays; any array longer than 32 returns messy results.
I found this and modified it to calculate a double hash, if anyone needs it:
#ifndef uint8_t
#define uint8_t unsigned char
#endif
#ifndef uint32_t
#define uint32_t unsigned int
#endif
#ifndef uint64_t
#define uint64_t unsigned long int
#endif
#define rotlFixed(x, n) (((x) << (n)) | ((x) >> (32 - (n))))
#define rotrFixed(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
typedef struct
{
uint32_t state[8];
uint64_t count;
uint8_t buffer[64];
} CSha256;
inline void Sha256_Init(CSha256 *p)
{
p->state[0] = 0x6a09e667;
p->state[1] = 0xbb67ae85;
p->state[2] = 0x3c6ef372;
p->state[3] = 0xa54ff53a;
p->state[4] = 0x510e527f;
p->state[5] = 0x9b05688c;
p->state[6] = 0x1f83d9ab;
p->state[7] = 0x5be0cd19;
p->count = 0;
}
#define S0(x) (rotrFixed(x, 2) ^ rotrFixed(x,13) ^ rotrFixed(x, 22))
#define S1(x) (rotrFixed(x, 6) ^ rotrFixed(x,11) ^ rotrFixed(x, 25))
#define s0(x) (rotrFixed(x, 7) ^ rotrFixed(x,18) ^ (x >> 3))
#define s1(x) (rotrFixed(x,17) ^ rotrFixed(x,19) ^ (x >> 10))
#define blk0(i) (W[i] = data[i])
#define blk2(i) (W[i&15] += s1(W[(i-2)&15]) + W[(i-7)&15] + s0(W[(i-15)&15]))
#define Ch2(x,y,z) (z^(x&(y^z)))
#define Maj(x,y,z) ((x&y)|(z&(x|y)))
#define sha_a(i) T[(0-(i))&7]
#define sha_b(i) T[(1-(i))&7]
#define sha_c(i) T[(2-(i))&7]
#define sha_d(i) T[(3-(i))&7]
#define sha_e(i) T[(4-(i))&7]
#define sha_f(i) T[(5-(i))&7]
#define sha_g(i) T[(6-(i))&7]
#define sha_h(i) T[(7-(i))&7]
#ifdef _SHA256_UNROLL2
#define R(a,b,c,d,e,f,g,h, i) h += S1(e) + Ch2(e,f,g) + K[i+j] + (j?blk2(i):blk0(i));\
d += h; h += S0(a) + Maj(a, b, c)
#define RX_8(i) \
R(a,b,c,d,e,f,g,h, i); \
R(h,a,b,c,d,e,f,g, i+1); \
R(g,h,a,b,c,d,e,f, i+2); \
R(f,g,h,a,b,c,d,e, i+3); \
R(e,f,g,h,a,b,c,d, i+4); \
R(d,e,f,g,h,a,b,c, i+5); \
R(c,d,e,f,g,h,a,b, i+6); \
R(b,c,d,e,f,g,h,a, i+7)
#else
#define R(i) sha_h(i) += S1(sha_e(i)) + Ch2(sha_e(i),sha_f(i),sha_g(i)) + K[i+j] + (j?blk2(i):blk0(i));\
sha_d(i) += sha_h(i); sha_h(i) += S0(sha_a(i)) + Maj(sha_a(i), sha_b(i), sha_c(i))
#ifdef _SHA256_UNROLL
#define RX_8(i) R(i+0); R(i+1); R(i+2); R(i+3); R(i+4); R(i+5); R(i+6); R(i+7);
#endif
#endif
static const uint32_t K[64] = {
0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};
inline static void Sha256_Transform(uint32_t *state, const uint32_t *data)
{
uint32_t W[16];
unsigned j;
#ifdef _SHA256_UNROLL2
uint32_t a,b,c,d,e,f,g,h;
a = state[0];
b = state[1];
c = state[2];
d = state[3];
e = state[4];
f = state[5];
g = state[6];
h = state[7];
#else
uint32_t T[8];
for (j = 0; j < 8; j++)
T[j] = state[j];
#endif
for (j = 0; j < 64; j += 16)
{
#if defined(_SHA256_UNROLL) || defined(_SHA256_UNROLL2)
RX_8(0); RX_8(8);
#else
unsigned i;
for (i = 0; i < 16; i++) { R(i); }
#endif
}
#ifdef _SHA256_UNROLL2
state[0] += a;
state[1] += b;
state[2] += c;
state[3] += d;
state[4] += e;
state[5] += f;
state[6] += g;
state[7] += h;
#else
for (j = 0; j < 8; j++)
state[j] += T[j];
#endif
/* Wipe variables */
/* memset(W, 0, sizeof(W)); */
/* memset(T, 0, sizeof(T)); */
}
#undef S0
#undef S1
#undef s0
#undef s1
inline static void Sha256_WriteByteBlock(CSha256 *p)
{
uint32_t data32[16];
unsigned i;
for (i = 0; i < 16; i++)
data32[i] =
((uint32_t)(p->buffer[i * 4 ]) << 24) +
((uint32_t)(p->buffer[i * 4 + 1]) << 16) +
((uint32_t)(p->buffer[i * 4 + 2]) << 8) +
((uint32_t)(p->buffer[i * 4 + 3]));
Sha256_Transform(p->state, data32);
}
inline void Sha256_Update(CSha256 *p, __global const uint8_t *data, size_t size)
{
uint32_t curBufferPos = (uint32_t)p->count & 0x3F;
while (size > 0)
{
p->buffer[curBufferPos++] = *data++;
p->count++;
size--;
if (curBufferPos == 64)
{
curBufferPos = 0;
Sha256_WriteByteBlock(p);
}
}
}
inline void Sha256_Final(CSha256 *p, __global uint8_t *digest)
{
uint64_t lenInBits = (p->count << 3);
uint32_t curBufferPos = (uint32_t)p->count & 0x3F;
unsigned i;
p->buffer[curBufferPos++] = 0x80;
while (curBufferPos != (64 - 8))
{
curBufferPos &= 0x3F;
if (curBufferPos == 0)
Sha256_WriteByteBlock(p);
p->buffer[curBufferPos++] = 0;
}
for (i = 0; i < 8; i++)
{
p->buffer[curBufferPos++] = (uint8_t)(lenInBits >> 56);
lenInBits <<= 8;
}
Sha256_WriteByteBlock(p);
for (i = 0; i < 8; i++)
{
*digest++ = (uint8_t)(p->state[i] >> 24);
*digest++ = (uint8_t)(p->state[i] >> 16);
*digest++ = (uint8_t)(p->state[i] >> 8);
*digest++ = (uint8_t)(p->state[i]);
}
Sha256_Init(p);
}
inline void Sha256_Update1(CSha256 *p, const uint8_t *data, uint32_t size)
{
uint32_t curBufferPos = (uint32_t)p->count & 0x3F;
while (size > 0)
{
p->buffer[curBufferPos++] = *data++;
p->count++;
size--;
if (curBufferPos == 64)
{
curBufferPos = 0;
Sha256_WriteByteBlock(p);
}
}
}
inline void Sha256_Final1(CSha256 *p, uint8_t *digest)
{
uint64_t lenInBits = (p->count << 3);
uint32_t curBufferPos = (uint32_t)p->count & 0x3F;
unsigned i;
p->buffer[curBufferPos++] = 0x80;
while (curBufferPos != (64 - 8))
{
curBufferPos &= 0x3F;
if (curBufferPos == 0)
Sha256_WriteByteBlock(p);
p->buffer[curBufferPos++] = 0;
}
for (i = 0; i < 8; i++)
{
p->buffer[curBufferPos++] = (uint8_t)(lenInBits >> 56);
lenInBits <<= 8;
}
Sha256_WriteByteBlock(p);
for (i = 0; i < 8; i++)
{
*digest++ = (uint8_t)(p->state[i] >> 24);
*digest++ = (uint8_t)(p->state[i] >> 16);
*digest++ = (uint8_t)(p->state[i] >> 8);
*digest++ = (uint8_t)(p->state[i]);
}
Sha256_Init(p);
}
__kernel void Sha256_1(__global uint8_t *header,__global uint8_t *toRet)
{
uint8_t tempHdr[80];
uint8_t tempDigest[32]={0};
uint startNon=toRet[0] + (toRet[1] << 8) + (toRet[2] << 16) + (toRet[3] << 24);
uint maxNon=toRet[4] + (toRet[5] << 8) + (toRet[6] << 16) + (toRet[7] << 24);
uint nonce =startNon;
uint32_t finalNon=0;
uint8_t match=0;
for(int x=0;x<80;x++)
tempHdr[x]=header[x];
tempHdr[76] = (char)(nonce);
tempHdr[77] = (char)(nonce >> 8);
tempHdr[78] = (char)(nonce >> 16);
tempHdr[79] = (char)(nonce >> 24);
while(finalNon<1)
{
CSha256 p;
Sha256_Init(&p);
Sha256_Update1(&p, tempHdr, 80);
Sha256_Final1(&p, tempDigest);
CSha256 p1;
Sha256_Init(&p1);
Sha256_Update1(&p1, tempDigest, 32);
Sha256_Final1(&p1, tempDigest);
for(int x=31;x>21;x--)
{
if(tempDigest[x]<1) match++;
}
if(match>8)
{
finalNon=nonce;
toRet[8] = (char)(nonce);
toRet[9] = (char)(nonce >> 8);
toRet[10] = (char)(nonce >> 16);
toRet[11] = (char)(nonce >> 24);
}
else
{
nonce++;
tempHdr[76] = (char)(nonce);
tempHdr[77] = (char)(nonce >> 8);
tempHdr[78] = (char)(nonce >> 16);
tempHdr[79] = (char)(nonce >> 24);
}
match=0;
if(nonce>maxNon) break;
if(nonce<=startNon) break;
}
}

CUDA: Why are calculations not hidden by memory latency?

I have created a program to test GPU RAM bandwidth and have observed several strange effects with it.
One is that the straightforward implementation is not purely memory bound. If I unroll the
for (int *p = pStart; p < pEnd; p += shift)
sum += *p;
loop by adding
p += shift; sum += *p;
p += shift; sum += *p;
p += shift; sum += *p;
into the loop body, the speed rises by 20%, from 183 to 222 GB/s on a GeForce RTX 2060 (theoretical bandwidth 264 GB/s).
Shouldn't such things be hidden by memory latency?
Or, in such a program in the ideal case, should all warps just be waiting for data from memory, so that every additional calculation a warp does between such waits leaves the memory bus underloaded?
I analyzed the executables with NVIDIA Nsight Compute 2021.3.0; it reports lower compute throughput and higher memory throughput. Cache usage is negligible. This is all logical.
The results in the scheduler statistics are more interesting, and I don't understand the meaning of these numbers. Could someone explain them, please? I am not a newbie in CUDA; I understand that warps can move through the pipeline or stall while awaiting data from RAM, but I don't understand well what is meant by eligible warps or warp cycles per issued instruction (in the unrolled version, did each warp really only issue an instruction once per 114 GPU clock cycles on average?).
"Base" is without unrolling, the main values are with it. So, e.g., with unrolling we have 7.88 active warps per scheduler, without it 7.96 (isn't more better?).
Full code (look at __global__ void gpuReadMemory):
#include <assert.h>
#include <conio.h>
#include <stdio.h>
#include <cuda.h>
#define VC_EXTRALEAN
#include <windows.h>
#define CUDA_CHECK(err) __cudaSafeCall(err, __FILE__, __LINE__)
inline void __cudaSafeCall(cudaError err, const char *file, const int line)
{
if (err != cudaSuccess)
{
fprintf(stderr, "%s(%i): CUDA error %d (%s)\n",
file, line, int(err), cudaGetErrorString(err));
throw "CUDA error";
}
}
int getMultiprocessorCount()
{
int num;
CUDA_CHECK(cudaDeviceGetAttribute(&num, cudaDevAttrMultiProcessorCount, 0));
return num;
}
__global__ void gpuWriteMemory(int *gpuArr, int dataSizeInMBs,
int packetShift, int passCount, int *gpuDebugArr)
{
int *pStart = gpuArr + ((long long)packetShift *
(blockDim.y * blockIdx.x + threadIdx.y)) / sizeof(int) + threadIdx.x;
int *pEnd = gpuArr + (((long long)dataSizeInMBs) << 20) / sizeof(int);
int shift = gridDim.x * blockDim.y * packetShift / sizeof(int);
for (int passInd = 0; passInd < passCount; passInd++)
for (int *p = pStart; p < pEnd; p += shift)
*p = blockIdx.x * 10000000 + threadIdx.y * 1000 + threadIdx.x;
}
__global__ void gpuReadMemory(int *gpuArr, int dataSizeInMBs,
int packetShift, int passCount, int *gpuDebugArr)
{
int *pStart = gpuArr + ((long long)packetShift *
(blockDim.y * blockIdx.x + threadIdx.y)) / sizeof(int) + threadIdx.x;
int *pEnd = gpuArr + (((long long)dataSizeInMBs) << 20) / sizeof(int);
int shift = gridDim.x * blockDim.y * packetShift / sizeof(int);
int sum = 0;
int accessCount = 0;
for (int passInd = 0; passInd < passCount; passInd++)
{
#pragma unroll // - doesn't have effect
for (int *p = pStart; p < pEnd; p += shift)
{
sum += *p;
p += shift; sum += *p;
p += shift; sum += *p;
p += shift; sum += *p;
}
*pStart = sum; // Without it bandwidth reported is 3 times bigger than theoretical
}
// Suspiciously fast code:
//for (int passInd = 0; passInd < passCount; passInd++)
// 0x0000000500999d10 IADD3 R8, R8, 0x1, RZ
// 0x0000000500999d20 BSSY B0, 0x500999de0
// for (int *p = pStart; p < pEnd; p += shift)
// 0x0000000500999d30 ISETP.GE.AND.EX P0, PT, R7, UR6, PT, P0
// for (int passInd = 0; passInd < passCount; passInd++)
// 0x0000000500999d40 ISETP.GE.AND P1, PT, R8, c[0x0][0x170], PT
// for (int *p = pStart; p < pEnd; p += shift)
// 0x0000000500999d50 #P0 BRA 0x500999dd0
// 0x0000000500999d60 IMAD.MOV.U32 R3, RZ, RZ, R2
// 0x0000000500999d70 IMAD.MOV.U32 R5, RZ, RZ, R4
// 0x0000000500999d80 LEA R3, P0, R0, R3, 0x2
// 0x0000000500999d90 LEA.HI.X R5, R0, R5, RZ, 0x2, P0
// 0x0000000500999da0 ISETP.GE.U32.AND P0, PT, R3, UR4, PT
// 0x0000000500999db0 ISETP.GE.U32.AND.EX P0, PT, R5, UR5, PT, P0
// 0x0000000500999dc0 #!P0 BRA 0x500999d80
// 0x0000000500999dd0 BSYNC B0
// Normal code:
//for (int *p = pStart; p < pEnd; p += shift)
// 0x0000000500999da0 IMAD.MOV.U32 R8, RZ, RZ, R6
// sum += *p;
//0x0000000500999db0 IMAD.MOV.U32 R4, RZ, RZ, R8
// 0x0000000500999dc0 LDG.E.SYS R4, [R4] // Reading from memory
// for (int *p = pStart; p < pEnd; p += shift)
// 0x0000000500999dd0 IADD3 R11, P0, R11, UR5, RZ
// 0x0000000500999de0 IADD3 R8, P1, R8, UR5, RZ
// 0x0000000500999df0 IADD3.X R13, R13, UR4, RZ, P0, !PT
// 0x0000000500999e00 ISETP.GE.U32.AND P0, PT, R11, R2, PT
// 0x0000000500999e10 IADD3.X R5, R5, UR4, RZ, P1, !PT
// 0x0000000500999e20 ISETP.GE.U32.AND.EX P0, PT, R13, R3, PT, P0
// sum += *p;
//0x0000000500999e30 IMAD.IADD R9, R4, 0x1, R9
// for (int *p = pStart; p < pEnd; p += shift)
// 0x0000000500999e40 #!P0 BRA 0x500999db0
// 0x0000000500999e50 BSYNC B0
// *pStart = sum; // Lowers bandwidth being reported on GT 520M from 33 to 9.7 GB/s
//0x0000000500999e60 STG.E.SYS [R6], R9
}
class CMemorySpeedTester
{
public:
CMemorySpeedTester()
{
m_multiprocessorCount = getMultiprocessorCount();
CUDA_CHECK(cudaMalloc((void**)&gpuArr, ((long long)dataSizeInMBs) << 20));
CUDA_CHECK(cudaMalloc((void**)&gpuDebugArr, debugArrLen * sizeof(int)));
CUDA_CHECK(cudaMemset(gpuArr, -1, ((long long)dataSizeInMBs) << 20));
debugArr = (int*)malloc(debugArrLen * sizeof(int));
debugArr2D = (int (*)[1000])debugArr;
QueryPerformanceFrequency(&timerFreq);
}
void testBandwidth()
{
int threadPerBlock = 256;
int blockCount = m_multiprocessorCount * 24;
dim3 blocks(blockCount);
dim3 threads1(32, threadPerBlock / 32);
gpuWriteMemory<<<blocks, threads1>>>(gpuArr, dataSizeInMBs,
32 * sizeof(int), 1, gpuDebugArr);
CUDA_CHECK(cudaDeviceSynchronize());
for (int passCount = 10; passCount <= 100; passCount *= 10)
{
int threadPerPacket = 32;
for (int packetShiftMult = 1; packetShiftMult <= 16; packetShiftMult *= 16)
{
int packetShift = threadPerPacket * sizeof(int) * packetShiftMult;
dim3 threads(threadPerPacket, threadPerBlock / threadPerPacket);
QueryPerformanceCounter(&t0);
gpuReadMemory<<<blocks, threads>>>(gpuArr, dataSizeInMBs,
packetShift, passCount, gpuDebugArr);
CUDA_CHECK(cudaDeviceSynchronize());
QueryPerformanceCounter(&t);
double dt = double(t.QuadPart - t0.QuadPart) / timerFreq.QuadPart;
printf(" %2d th./packet, packet shift %4d, %d pass(es): %.3f ms, %.2f GB/s\n",
threadPerPacket, packetShift, passCount,
dt * 1000, (double)(dataSizeInMBs) / packetShiftMult / (1 << 10) * passCount / dt);
}
}
}
void copyToHostDebugArr()
{
CUDA_CHECK(cudaMemcpy(debugArr, gpuDebugArr, debugArrLen * sizeof(int), cudaMemcpyDeviceToHost));
CUDA_CHECK(cudaDeviceSynchronize());
}
protected:
static const int dataSizeInMBs = 800;
static const int debugArrLen = 4000000;
int m_multiprocessorCount;
int *gpuArr, *gpuDebugArr;
int *debugArr;
int (*debugArr2D)[1000];
LARGE_INTEGER timerFreq, t, t0;
};
int main(int argc, char **argv)
{
printf("Started\n");
CMemorySpeedTester runner;
for (int runInd = 0; runInd < 5; runInd++)
{
printf("%d.\n", runInd);
runner.testBandwidth();
}
printf("Finished. Press Enter...");
getch();
}
Windows 10 x64, Visual Studio 2019, CUDA Toolkit 10.2; it can also be compiled and reproduced on a GeForce GT 520M under Windows 7 x64 with Visual Studio 2010 and CUDA 7.5.

strange Kinect acceleration values with OpenNI and avin SensorKinect

According to [1], I should be able to access the Kinect accelerometer data with request 0x32, providing a buffer of 10 bytes. The accelerometer vector xyz values should be short ints at bytes 3 through 8. As stated in the text (and as expected), with a horizontal, stationary camera I should get values near 0 for both x and z, and about 981 for y. This would be g and would make sense. Instead, while the y values are as expected, I get x and z values near 0xffff. Here's the code (I skipped error-checking for better readability):
const unsigned short VENDOR_ID_MSFT = 0x045e;
const unsigned short PRODUCT_ID_KINECT360_MOTOR = 0x02b0;
XN_USB_DEV_HANDLE deviceHandle = NULL;
const XnUSBConnectionString *paths = NULL;
XnUInt32 count;
xnUSBInit();
xnUSBEnumerateDevices( VENDOR_ID_MSFT, PRODUCT_ID_KINECT360_MOTOR, &paths, &count );
xnUSBOpenDeviceByPath( paths[0], &this->deviceHandle );
XnStatus res;
XnUChar buf[10] = { 0 };
XnUInt32 size = 0;
//init motor
xnUSBSendControl( this->deviceHandle, (XnUSBControlType) 0xc0, 0x10, 0x00, 0x00, buf, sizeof( buf ), 0 );
//query motor data
xnUSBReceiveControl( this->deviceHandle, XN_USB_CONTROL_TYPE_VENDOR, 0x32, 0, 0, buf, sizeof( buf ), &size, 0 );
int accelCountX = (int) ( ( (short) buf[2] << 8 ) | buf[3] );
int accelCountY = (int) ( ( (short) buf[4] << 8 ) | buf[5] );
int accelCountZ = (int) ( ( (short) buf[6] << 8 ) | buf[7] );
std::cout << accelCountX << "/" << accelCountY << "/" << accelCountZ << std::endl;
The output shows values kind of like these:
65503/847/65516
Any idea what the problem could be? Thanks!
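As a side note on the interpretation: since [1] says the counts are short ints, a reading like 65503 taken as an unsigned 16-bit number corresponds to -33 as a signed 16-bit number, i.e. close to 0. A minimal sketch of such a signed decode (byte order assumed to match the code above):
#include <stdint.h>
static int16_t accelCount( const unsigned char *buf, int offset )
{
    /* combine two bytes and reinterpret as a signed 16-bit count */
    return (int16_t)( ( (uint16_t) buf[offset] << 8 ) | buf[offset + 1] );
}
/* e.g. accelCount( buf, 2 ) turns the 65503 above into -33 */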
[1] http://fivedots.coe.psu.ac.th/~ad/jg/nui16/motorControl.pdf

Setup the Accelerate framework for FFT on the iPhone

I have written a function to set up the Accelerate framework, after reading:
Using the Apple FFT and Accelerate Framework
iPhone FFT with Accelerate framework vDSP
and the Apple docs.
I did this:
void fftSetup()
{
COMPLEX_SPLIT A;
FFTSetup setupReal;
uint32_t log2n;
uint32_t n, nOver2;
int32_t stride;
uint32_t i;
float *originalReal, *obtainedReal;
float scale;
uint32_t L = 1024;
float *mag = new float[L/2];
log2n = 10 ;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
printf("1D real FFT of length log2 ( %d ) = %d\n\n", n, log2n);
for (i = 0; i < n; i++)
originalReal[i] = (float) (i + 1);
vDSP_ctoz((COMPLEX *) originalReal,2,&A,1,nOver2);
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_INVERSE);
//get magnitude;
for(i = 1; i < L/2; i++){
mag[i] = sqrtf(A.realp[i]*A.realp[i] + A.imagp[i] * A.imagp[i]);
}
scale = (float) 1.0 / (2 * n);
vDSP_vsmul(A.realp, 1, &scale, A.realp, 1, nOver2);
vDSP_vsmul(A.imagp, 1, &scale, A.imagp, 1, nOver2);
}
Questions:
My app always crashes with no error (BAD ACCESS) on one of these 2 lines:
originalReal[i] = (float) (i + 1); // or
vDSP_ctoz((COMPLEX *) originalReal,2,&A,1,nOver2);
I guess I did not set a good value for log2n? (10 to get a 1024 window?)
How do I get the real magnitude of the bins? With my actual FFT? The same one I wrote here?
Where do I input MY data buffer array? (Exactly where in my code? Instead of originalReal?)
Thanks a lot.
I actually managed to make it work when I feed it a sine wave of a certain frequency.
This is the code:
COMPLEX_SPLIT A;
FFTSetup setupReal;
uint32_t log2n;
uint32_t n, nOver2;
int32_t stride;
uint32_t i;
float *originalReal, *obtainedReal;
float scale;
uint32_t L = 1024;
float *mag = new float[L/2];
log2n = 10 ;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
//printf("1D real FFT of length log2 ( %d ) = %d\n\n", n, log2n);
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
originalReal = (float *) malloc(n * sizeof(float));
obtainedReal = (float *) malloc(n * sizeof(float));
for (i = 0; i < n; i++)
originalReal[i] = cos(2*3.141592*11000*i/44100);//(float) (i + 1);
vDSP_ctoz((COMPLEX *) originalReal,2,&A,1,nOver2);
setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);
vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
//vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_INVERSE);
scale = (float) 1.0 / (2 * n);
vDSP_vsmul(A.realp, 1, &scale, A.realp, 1, nOver2);
vDSP_vsmul(A.imagp, 1, &scale, A.imagp, 1, nOver2);
//get magnitude;
for(i = 1; i < L/2; i++)
{
mag[i] = sqrtf(A.realp[i]*A.realp[i] + A.imagp[i] * A.imagp[i]);
NSLog(@"%d:%f",i,mag[i]);
}
Actually, it's not 44 Hz between bins, as the guy wrote in the post above, but 43! 22050/512 = 43. This is critical, because in the higher bins, such as bin[300], you get a completely different result for 44 vs. 43 (it's a 300 Hz drift). So take care of that.
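A quick worked check of that bin spacing (assuming a 44.1 kHz sample rate and the 1024-point FFT from the code above):
#include <stdio.h>
int main(void)
{
    float sampleRate = 44100.0f;
    int n = 1024;                        /* FFT length, log2n = 10 */
    float binHz = sampleRate / n;        /* = 22050 / 512 = 43.066... Hz per bin */
    printf("bin spacing = %.3f Hz\n", binHz);
    printf("bin[300] = %.1f Hz (vs. 13200 Hz if you assume 44 Hz/bin)\n", 300 * binHz);
    return 0;
}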

Looking for an efficient integer square root algorithm for ARM Thumb2

I am looking for a fast, integer only algorithm to find the square root (integer part thereof) of an unsigned integer.
The code must have excellent performance on ARM Thumb 2 processors. It could be assembly language or C code.
Any hints welcome.
Integer Square Roots by Jack W. Crenshaw could be useful as another reference.
The C Snippets Archive also has an integer square root implementation. This one goes beyond just the integer result, and calculates extra fractional (fixed-point) bits of the answer. (Update: unfortunately, the C snippets archive is now defunct. The link points to the web archive of the page.) Here is the code from the C Snippets Archive:
#define BITSPERLONG 32
#define TOP2BITS(x) ((x & (3L << (BITSPERLONG-2))) >> (BITSPERLONG-2))
struct int_sqrt {
unsigned sqrt, frac;
};
/* usqrt:
ENTRY x: unsigned long
EXIT returns floor(sqrt(x) * pow(2, BITSPERLONG/2))
Since the square root never uses more than half the bits
of the input, we use the other half of the bits to contain
extra bits of precision after the binary point.
EXAMPLE
suppose BITSPERLONG = 32
then usqrt(144) = 786432 = 12 * 65536
usqrt(32) = 370727 = 5.66 * 65536
NOTES
(1) change BITSPERLONG to BITSPERLONG/2 if you do not want
the answer scaled. Indeed, if you want n bits of
precision after the binary point, use BITSPERLONG/2+n.
The code assumes that BITSPERLONG is even.
(2) This is really better off being written in assembly.
The line marked below is really a "arithmetic shift left"
on the double-long value with r in the upper half
and x in the lower half. This operation is typically
expressible in only one or two assembly instructions.
(3) Unrolling this loop is probably not a bad idea.
ALGORITHM
The calculations are the base-two analogue of the square
root algorithm we all learned in grammar school. Since we're
in base 2, there is only one nontrivial trial multiplier.
Notice that absolutely no multiplications or divisions are performed.
This means it'll be fast on a wide range of processors.
*/
void usqrt(unsigned long x, struct int_sqrt *q)
{
unsigned long a = 0L; /* accumulator */
unsigned long r = 0L; /* remainder */
unsigned long e = 0L; /* trial product */
int i;
for (i = 0; i < BITSPERLONG; i++) /* NOTE 1 */
{
r = (r << 2) + TOP2BITS(x); x <<= 2; /* NOTE 2 */
a <<= 1;
e = (a << 1) + 1;
if (r >= e)
{
r -= e;
a++;
}
}
memcpy(q, &a, sizeof(long));
}
I settled on the following code. It's essentially from the Wikipedia article on square-root computing methods. But it has been changed to use stdint.h types uint32_t etc. Strictly speaking, the return type could be changed to uint16_t.
/**
* \brief Fast Square root algorithm
*
* Fractional parts of the answer are discarded. That is:
* - SquareRoot(3) --> 1
* - SquareRoot(4) --> 2
* - SquareRoot(5) --> 2
* - SquareRoot(8) --> 2
* - SquareRoot(9) --> 3
*
* \param[in] a_nInput - unsigned integer for which to find the square root
*
* \return Integer square root of the input value.
*/
uint32_t SquareRoot(uint32_t a_nInput)
{
uint32_t op = a_nInput;
uint32_t res = 0;
uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
return res;
}
The nice thing, I discovered, is that a fairly simple modification can return the "rounded" answer. I found this useful in a certain application for greater accuracy. Note that in this case, the return type must be uint32_t because the rounded square root of 2^32 - 1 is 2^16.
/**
* \brief Fast Square root algorithm, with rounding
*
* This does arithmetic rounding of the result. That is, if the real answer
* would have a fractional part of 0.5 or greater, the result is rounded up to
* the next integer.
* - SquareRootRounded(2) --> 1
* - SquareRootRounded(3) --> 2
* - SquareRootRounded(4) --> 2
* - SquareRootRounded(6) --> 2
* - SquareRootRounded(7) --> 3
* - SquareRootRounded(8) --> 3
* - SquareRootRounded(9) --> 3
*
* \param[in] a_nInput - unsigned integer for which to find the square root
*
* \return Integer square root of the input value.
*/
uint32_t SquareRootRounded(uint32_t a_nInput)
{
uint32_t op = a_nInput;
uint32_t res = 0;
uint32_t one = 1uL << 30; // The second-to-top bit is set: use 1u << 14 for uint16_t type; use 1uL<<30 for uint32_t type
// "one" starts at the highest power of four <= than the argument.
while (one > op)
{
one >>= 2;
}
while (one != 0)
{
if (op >= res + one)
{
op = op - (res + one);
res = res + 2 * one;
}
res >>= 1;
one >>= 2;
}
/* Do arithmetic rounding to nearest integer */
if (op > res)
{
res++;
}
return res;
}
If exact accuracy isn't required, I have a fast approximation for you, that uses 260 bytes of RAM (you could halve that, but don't).
int ftbl[33]={0,1,1,2,2,4,5,8,11,16,22,32,45,64,90,128,181,256,362,512,724,1024,1448,2048,2896,4096,5792,8192,11585,16384,23170,32768,46340};
int ftbl2[32]={ 32768,33276,33776,34269,34755,35235,35708,36174,36635,37090,37540,37984,38423,38858,39287,39712,40132,40548,40960,41367,41771,42170,42566,42959,43347,43733,44115,44493,44869,45241,45611,45977};
int fisqrt(int val)
{
int cnt=0;
int t=val;
while (t) {cnt++;t>>=1;}
if (6>=cnt) t=(val<<(6-cnt));
else t=(val>>(cnt-6));
return (ftbl[cnt]*ftbl2[t&31])>>15;
}
Here's the code to generate the tables:
ftbl[0]=0;
for (int i=0;i<32;i++) ftbl[i+1]=sqrt(pow(2.0,i));
printf("int ftbl[33]={0");
for (int i=0;i<32;i++) printf(",%d",ftbl[i+1]);
printf("};\n");
for (int i=0;i<32;i++) ftbl2[i]=sqrt(1.0+i/32.0)*32768;
printf("int ftbl2[32]={");
for (int i=0;i<32;i++) printf("%c%d",(i)?',':' ',ftbl2[i]);
printf("};\n");
Over the range 1 → 2^20, the maximum error is 11, and over the range 1 → 2^30, it's about 256. You could use larger tables and minimise this. It's worth mentioning that the error will always be negative - i.e. when it's wrong, the value will be LESS than the correct value.
You might do well to follow this with a refining stage.
The idea is simple enough: (ab)^0.5 = a^0.5 × b^0.5.
So, we take the input X = A×B where A = 2^N and 1 ≤ B < 2.
Then we have a lookup table for sqrt(2^N), and a lookup table for sqrt(1 ≤ B < 2). We store the lookup table for sqrt(2^N) as integer, which might be a mistake (testing shows no ill effects), and we store the lookup table for sqrt(1 ≤ B < 2) as 15-bit fixed point.
We know that 1 ≤ sqrt(2^N) < 65536, so that's 16-bit, and we know that we can really only multiply 16-bit × 15-bit on an ARM, without fear of reprisal, so that's what we do.
In terms of implementation, the while(t) {cnt++;t>>=1;} is effectively a count-leading-bits instruction (CLB), so if your version of the chipset has that, you're winning! Also, the shift instruction would be easy to implement with a bidirectional shifter, if you have one?
There's a Lg[N] algorithm for counting the highest set bit here.
In terms of magic numbers, for changing table sizes, THE magic number for ftbl2 is 32, though note that 6 (Lg[32]+1) is used for the shifting.
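As a concrete example of such a refining stage (my sketch, not part of the original answer): one Newton step on top of fisqrt, followed by a floor correction. Note that it uses division, which may be slow on cores without a hardware divider.
int fisqrt_refined(int val)
{
    if (val <= 0) return 0;
    int y = fisqrt(val);                       /* table-based underestimate */
    if (y == 0) y = 1;                         /* guard the divisions below */
    y = (y + val / y) >> 1;                    /* one Newton-Raphson step */
    while (y > val / y) y--;                   /* ensure y*y <= val, without overflow */
    while ((y + 1) <= val / (y + 1)) y++;      /* ensure (y+1)*(y+1) > val */
    return y;
}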
One common approach is bisection.
hi = number
lo = 0
mid = ( hi + lo ) / 2
mid2 = mid*mid
while( lo < hi-1 and mid2 != number ) {
    if( mid2 < number ) {
        lo = mid
    } else {
        hi = mid
    }
    mid = ( hi + lo ) / 2
    mid2 = mid*mid
}
Something like that should work reasonably well. It makes log2(number) tests, doing
log2(number) multiplies and divides. Since the divide is a divide by 2, you can replace it with a >>.
The terminating condition may not be spot on, so be sure to test a variety of integers to be sure that the division by 2 doesn't incorrectly oscillate between two even values; they would differ by more than 1.
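A compact C rendering of the bisection idea, with the divide-by-2 written as a shift as suggested and 64-bit intermediates so mid*mid cannot overflow; the bounds and termination test differ slightly from the pseudocode above so as to avoid the oscillation concern just mentioned.
#include <stdint.h>
uint32_t bisect_isqrt(uint32_t n)
{
    uint64_t lo = 0, hi = (uint64_t)n + 1;     /* invariant: lo*lo <= n < hi*hi */
    while (hi - lo > 1) {
        uint64_t mid = (lo + hi) >> 1;         /* divide by 2 as a shift */
        if (mid * mid <= n)
            lo = mid;
        else
            hi = mid;
    }
    return (uint32_t)lo;                       /* floor(sqrt(n)) */
}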
I find that most algorithms are based on simple ideas, but are implemented in a more complicated way than necessary. I've taken the idea from here: http://ww1.microchip.com/downloads/en/AppNotes/91040a.pdf (by Ross M. Fosler) and made it into a very short C function:
uint16_t int_sqrt32(uint32_t x)
{
uint16_t res= 0;
uint16_t add= 0x8000;
int i;
for(i=0;i<16;i++)
{
uint16_t temp=res | add;
uint32_t g2= (uint32_t)temp * temp;
if (x>=g2)
{
res=temp;
}
add>>=1;
}
return res;
}
This compiles to 5 cycles/bit on my Blackfin. I believe your compiled code will in general be faster if you use for loops instead of while loops, and you get the added benefit of deterministic time (although that to some extent depends on how your compiler optimizes the if statement).
It depends on the usage of the sqrt function. I often use some approximation to make fast versions. For example, when I need to compute the module of a vector:
Module = SQRT( x^2 + y^2)
I use:
Module = MAX( x,y) + Min(x,y)/2
Which can be coded in 3 or 4 instructions as:
If (x > y )
Module = x + y >> 1;
Else
Module = y + x >> 1;
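Written out as real C (a sketch; note the parentheses, since in C x + y >> 1 would parse as (x + y) >> 1):
unsigned approxModule(unsigned x, unsigned y)
{
    return (x > y) ? (x + (y >> 1))      /* MAX(x,y) + MIN(x,y)/2 */
                   : (y + (x >> 1));
}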
It's not fast but it's small and simple:
int isqrt(int n)
{
int b = 0;
while(n >= 0)
{
n = n - b;
b = b + 1;
n = n - b;
}
return b - 1;
}
I have settled on something similar to the binary digit-by-digit algorithm described in this Wikipedia article.
I recently encountered the same task on an ARM Cortex-M3 (STM32F103CBT6), and after searching the Internet I came up with the following solution. It's not the fastest compared with the solutions offered here, but it has good accuracy (the maximum error is 1, i.e. 1 LSB, over the entire uint32 input range) and relatively good speed (about 1.3M square roots per second on a 72-MHz ARM Cortex-M3, or about 55 cycles per root including the function call).
// FastIntSqrt is based on Wikipedia article:
// https://en.wikipedia.org/wiki/Methods_of_computing_square_roots
// Which involves Newton's method which gives the following iterative formula:
//
// X(n+1) = (X(n) + S/X(n))/2
//
// Thanks to ARM CLZ instruction (which counts how many bits in a number are
// zeros starting from the most significant one) we can very successfully
// choose the starting value, so just three iterations are enough to achieve
// maximum possible error of 1. The algorithm uses division, but fortunately
// it is fast enough here, so square root computation takes only about 50-55
// cycles with maximum compiler optimization.
uint32_t FastIntSqrt (uint32_t value)
{
if (!value)
return 0;
uint32_t xn = 1 << ((32 - __CLZ (value))/2);
xn = (xn + value/xn)/2;
xn = (xn + value/xn)/2;
xn = (xn + value/xn)/2;
return xn;
}
I'm using IAR and it produces the following assembler code:
SECTION `.text`:CODE:NOROOT(1)
THUMB
_Z11FastIntSqrtj:
MOVS R1,R0
BNE.N ??FastIntSqrt_0
MOVS R0,#+0
BX LR
??FastIntSqrt_0:
CLZ R0,R1
RSB R0,R0,#+32
MOVS R2,#+1
LSRS R0,R0,#+1
LSL R0,R2,R0
UDIV R3,R1,R0
ADDS R0,R3,R0
LSRS R0,R0,#+1
UDIV R2,R1,R0
ADDS R0,R2,R0
LSRS R0,R0,#+1
UDIV R1,R1,R0
ADDS R0,R1,R0
LSRS R0,R0,#+1
BX LR ;; return
The most cleverly coded bit-wise integer square root implementations for ARM achieve 3 cycles per result bit, which comes out to a lower bound of 50 cycles for the square root of a 32-bit unsigned integer. An example is shown in Andrew N. Sloss, Dominic Symes, Chris Wright, "ARM System Developer's Guide", Morgan Kaufman 2004.
Since most ARM processors also have very fast integer multipliers, and most even provide a very fast implementation of the wide multiply instruction UMULL, an alternative approach that can achieve execution times on the order of 35 to 45 cycles is computation via the reciprocal square root 1/√x using fixed-point computation. For this it is necessary to normalize the input with the help of a count-leading-zeros instruction, which on most ARM processors is available as an instruction CLZ.
The computation starts with an initial low-accuracy reciprocal square root approximation from a lookup table indexed by some most significant bits of the normalized argument. The Newton-Raphson iteration to refine the reciprocal square root r of a number a with quadratic convergence is r_{n+1} = r_n + r_n * (1 - a * r_n^2) / 2. This can be re-arranged into algebraically equivalent forms as is convenient. In the exemplary C99 code below, an 8-bit approximation r0 is read from a 96-entry lookup table. This approximation is accurate to about 7 bits. The first Newton-Raphson iteration computes r1 = (3 * r0 - a * r0^3) / 2 to potentially take advantage of small-operand multiplication instructions. The second Newton-Raphson iteration computes r2 = (r1 * (3 - r1 * (r1 * a))) / 2.
The normalized square root is then computed by a back multiplication s2 = a * r2, and the final approximation is obtained by denormalizing based on the count of leading zeros of the original argument a. It is important that the desired result ⌊√a⌋ is approximated with underestimation. This simplifies checking whether the desired result has been achieved, by guaranteeing that the remainder a - s2 * s2 is positive. If the final approximation is found to be too small, the result is increased by one. Correct operation of this algorithm can easily be demonstrated by an exhaustive test of all 2^32 possible inputs against a "golden" reference, which takes only a few minutes.
One can speed up this implementation, at the expense of additional storage for the lookup table, by pre-computing 3 * r0 and r0^3 to simplify the first Newton-Raphson iteration. The former requires 10 bits of storage and the latter 24 bits. In order to combine each pair into a 32-bit data item, the cube is rounded to 22 bits, which introduces negligible error into the computation. This results in a lookup table of 96 * 4 = 384 bytes.
An alternative approach uses the observation that all starting approximations have the same most significant bit set, which therefore can be assumed implicitly and does not have to be stored. This allows squeezing a 9-bit approximation r0 into an 8-bit data item, with the leading bit restored after table lookup. With a lookup table of 384 entries, all underestimates, one can achieve an accuracy of about 7.5 bits. Combining the back multiply with the Newton-Raphson iteration for the reciprocal square root, one computes s0 = a * r0, s1 = s0 + r0 * (a - s0 * s0) / 2. Because the accuracy of the starting approximation is not high enough for a very accurate final square root approximation, it can be off by up to three, and an appropriate adjustment must be made based on the magnitude of the remainder a - s1 * s1.
One advantage of the alternative approach is that it halves the number of multiplies required, and in particular requires only a single wide multiplication UMULL. Especially on processors where wide multiplies are fairly slow, this is an alternative worth trying.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <math.h>
#if defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
#endif // defined(_MSC_VER) && defined(_WIN64)
#define CLZ_BUILTIN (1) // use compiler's built-in count-leading-zeros
#define CLZ_FPU (2) // emulate count-leading-zeros via FPU
#define CLZ_CPU (3) // emulate count-leading-zeros via CPU
#define ALT_IMPL (0) // alternative implementation with fewer multiplies
#define LARGE_TABLE (0) // ALT_IMPL=0: incorporate 1st NR-iter into table
#define CLZ_IMPL (CLZ_CPU)// choose count-leading-zeros implementation
#define GEN_TAB (0) // generate tables
uint32_t umul32_hi (uint32_t a, uint32_t b);
uint32_t float_as_uint32 (float a);
int clz32 (uint32_t a);
#if ALT_IMPL
uint8_t rsqrt_tab [384] =
{
0xfe, 0xfc, 0xfa, 0xf8, 0xf6, 0xf4, 0xf2, 0xf0, 0xee, 0xed, 0xeb, 0xe9,
0xe7, 0xe6, 0xe4, 0xe2, 0xe1, 0xdf, 0xdd, 0xdc, 0xda, 0xd8, 0xd7, 0xd5,
0xd4, 0xd2, 0xd1, 0xcf, 0xce, 0xcc, 0xcb, 0xc9, 0xc8, 0xc7, 0xc5, 0xc4,
0xc2, 0xc1, 0xc0, 0xbe, 0xbd, 0xbc, 0xba, 0xb9, 0xb8, 0xb7, 0xb5, 0xb4,
0xb3, 0xb2, 0xb0, 0xaf, 0xae, 0xad, 0xac, 0xab, 0xa9, 0xa8, 0xa7, 0xa6,
0xa5, 0xa4, 0xa3, 0xa2, 0xa0, 0x9f, 0x9e, 0x9d, 0x9c, 0x9b, 0x9a, 0x99,
0x98, 0x97, 0x96, 0x95, 0x94, 0x93, 0x92, 0x91, 0x90, 0x8f, 0x8e, 0x8d,
0x8c, 0x8b, 0x8b, 0x8a, 0x89, 0x88, 0x87, 0x86, 0x85, 0x84, 0x83, 0x83,
0x82, 0x81, 0x80, 0x7f, 0x7e, 0x7d, 0x7d, 0x7c, 0x7b, 0x7a, 0x79, 0x79,
0x78, 0x77, 0x76, 0x75, 0x75, 0x74, 0x73, 0x72, 0x72, 0x71, 0x70, 0x6f,
0x6f, 0x6e, 0x6d, 0x6c, 0x6c, 0x6b, 0x6a, 0x6a, 0x69, 0x68, 0x67, 0x67,
0x66, 0x65, 0x65, 0x64, 0x63, 0x63, 0x62, 0x61, 0x61, 0x60, 0x5f, 0x5f,
0x5e, 0x5d, 0x5d, 0x5c, 0x5c, 0x5b, 0x5a, 0x5a, 0x59, 0x58, 0x58, 0x57,
0x57, 0x56, 0x55, 0x55, 0x54, 0x54, 0x53, 0x52, 0x52, 0x51, 0x51, 0x50,
0x50, 0x4f, 0x4e, 0x4e, 0x4d, 0x4d, 0x4c, 0x4c, 0x4b, 0x4b, 0x4a, 0x4a,
0x49, 0x48, 0x48, 0x47, 0x47, 0x46, 0x46, 0x45, 0x45, 0x44, 0x44, 0x43,
0x43, 0x42, 0x42, 0x41, 0x41, 0x40, 0x40, 0x3f, 0x3f, 0x3e, 0x3e, 0x3d,
0x3d, 0x3c, 0x3c, 0x3c, 0x3b, 0x3b, 0x3a, 0x3a, 0x39, 0x39, 0x38, 0x38,
0x37, 0x37, 0x36, 0x36, 0x36, 0x35, 0x35, 0x34, 0x34, 0x33, 0x33, 0x33,
0x32, 0x32, 0x31, 0x31, 0x30, 0x30, 0x30, 0x2f, 0x2f, 0x2e, 0x2e, 0x2d,
0x2d, 0x2d, 0x2c, 0x2c, 0x2b, 0x2b, 0x2b, 0x2a, 0x2a, 0x29, 0x29, 0x29,
0x28, 0x28, 0x27, 0x27, 0x27, 0x26, 0x26, 0x26, 0x25, 0x25, 0x24, 0x24,
0x24, 0x23, 0x23, 0x23, 0x22, 0x22, 0x21, 0x21, 0x21, 0x20, 0x20, 0x20,
0x1f, 0x1f, 0x1f, 0x1e, 0x1e, 0x1e, 0x1d, 0x1d, 0x1d, 0x1c, 0x1c, 0x1c,
0x1b, 0x1b, 0x1a, 0x1a, 0x1a, 0x19, 0x19, 0x19, 0x18, 0x18, 0x18, 0x17,
0x17, 0x17, 0x17, 0x16, 0x16, 0x16, 0x15, 0x15, 0x15, 0x14, 0x14, 0x14,
0x13, 0x13, 0x13, 0x12, 0x12, 0x12, 0x11, 0x11, 0x11, 0x11, 0x10, 0x10,
0x10, 0x0f, 0x0f, 0x0f, 0x0e, 0x0e, 0x0e, 0x0e, 0x0d, 0x0d, 0x0d, 0x0c,
0x0c, 0x0c, 0x0c, 0x0b, 0x0b, 0x0b, 0x0a, 0x0a, 0x0a, 0x0a, 0x09, 0x09,
0x09, 0x08, 0x08, 0x08, 0x08, 0x07, 0x07, 0x07, 0x07, 0x06, 0x06, 0x06,
0x05, 0x05, 0x05, 0x05, 0x04, 0x04, 0x04, 0x04, 0x03, 0x03, 0x03, 0x03,
0x02, 0x02, 0x02, 0x02, 0x01, 0x01, 0x01, 0x01, 0x00, 0x00, 0x00, 0x00,
};
/* compute floor (sqrt (a)) */
uint32_t my_isqrt32 (uint32_t a)
{
uint32_t b, r, s, scal, rem;
if (a == 0) return a;
/* Normalize argument */
scal = clz32 (a) & ~1;
b = a << scal;
/* Compute initial approximation to 1/sqrt(a) */
r = rsqrt_tab [(b >> 23) - 128] | 0x100;
/* Compute initial approximation to sqrt(a) */
s = umul32_hi (b, r << 8);
/* Refine sqrt approximation */
b = b - s * s;
s = s + ((r * (b >> 2)) >> 23);
/* Denormalize result*/
s = s >> (scal >> 1);
/* Ensure we got the floor correct */
rem = a - s * s;
if (rem < (2 * s + 1)) s = s + 0;
else if (rem < (4 * s + 4)) s = s + 1;
else if (rem < (6 * s + 9)) s = s + 2;
else s = s + 3;
return s;
}
#else // ALT_IMPL
#if LARGE_TABLE
uint32_t rsqrt_tab [96] =
{
0xfa0bfafa, 0xee6b2aee, 0xe5f02ae5, 0xdaf26ed9, 0xd2f002d0, 0xc890c2c4,
0xc1037abb, 0xb9a75ab2, 0xb4da42ac, 0xadcea2a3, 0xa6f27a9a, 0xa279c294,
0x9beb4a8b, 0x97a5ca85, 0x9163427c, 0x8d4fca76, 0x89500270, 0x8563ba6a,
0x818ac264, 0x7dc4ea5e, 0x7a120258, 0x7671da52, 0x72e4424c, 0x6f690a46,
0x6db24243, 0x6a52423d, 0x67042637, 0x6563c234, 0x62302a2e, 0x609cea2b,
0x5d836a25, 0x5bfd1a22, 0x58fd421c, 0x5783ae19, 0x560e4a16, 0x53300210,
0x51c7120d, 0x50623a0a, 0x4da4c204, 0x4c4c1601, 0x4af769fe, 0x49a6b9fb,
0x485a01f8, 0x471139f5, 0x45cc59f2, 0x448b5def, 0x4214fde9, 0x40df89e6,
0x3fade1e3, 0x3e8001e0, 0x3d55e1dd, 0x3c2f79da, 0x3c2f79da, 0x3b0cc5d7,
0x39edc1d4, 0x38d265d1, 0x37baa9ce, 0x36a689cb, 0x359601c8, 0x348909c5,
0x348909c5, 0x337f99c2, 0x3279adbf, 0x317741bc, 0x30784db9, 0x30784db9,
0x2f7cc9b6, 0x2e84b1b3, 0x2d9001b0, 0x2d9001b0, 0x2c9eb1ad, 0x2bb0b9aa,
0x2bb0b9aa, 0x2ac615a7, 0x29dec1a4, 0x29dec1a4, 0x28fab5a1, 0x2819e99e,
0x2819e99e, 0x273c599b, 0x273c599b, 0x26620198, 0x258ad995, 0x258ad995,
0x24b6d992, 0x24b6d992, 0x23e5fd8f, 0x2318418c, 0x2318418c, 0x224d9d89,
0x224d9d89, 0x21860986, 0x21860986, 0x20c18183, 0x20c18183, 0x20000180,
};
#else // LARGE_TABLE
uint8_t rsqrt_tab [96] =
{
0xfe, 0xfa, 0xf7, 0xf3, 0xf0, 0xec, 0xe9, 0xe6, 0xe4, 0xe1, 0xde, 0xdc,
0xd9, 0xd7, 0xd4, 0xd2, 0xd0, 0xce, 0xcc, 0xca, 0xc8, 0xc6, 0xc4, 0xc2,
0xc1, 0xbf, 0xbd, 0xbc, 0xba, 0xb9, 0xb7, 0xb6, 0xb4, 0xb3, 0xb2, 0xb0,
0xaf, 0xae, 0xac, 0xab, 0xaa, 0xa9, 0xa8, 0xa7, 0xa6, 0xa5, 0xa3, 0xa2,
0xa1, 0xa0, 0x9f, 0x9e, 0x9e, 0x9d, 0x9c, 0x9b, 0x9a, 0x99, 0x98, 0x97,
0x97, 0x96, 0x95, 0x94, 0x93, 0x93, 0x92, 0x91, 0x90, 0x90, 0x8f, 0x8e,
0x8e, 0x8d, 0x8c, 0x8c, 0x8b, 0x8a, 0x8a, 0x89, 0x89, 0x88, 0x87, 0x87,
0x86, 0x86, 0x85, 0x84, 0x84, 0x83, 0x83, 0x82, 0x82, 0x81, 0x81, 0x80,
};
#endif //LARGE_TABLE
/* compute floor (sqrt (a)) */
uint32_t my_isqrt32 (uint32_t a)
{
uint32_t b, r, s, t, scal, rem;
if (a == 0) return a;
/* Normalize argument */
scal = clz32 (a) & ~1;
b = a << scal;
/* Initial approximation to 1/sqrt(a)*/
t = rsqrt_tab [(b >> 25) - 32];
/* First NR iteration */
#if LARGE_TABLE
r = (t << 22) - umul32_hi (b, t);
#else // LARGE_TABLE
r = ((3 * t) << 22) - umul32_hi (b, (t * t * t) << 8);
#endif // LARGE_TABLE
/* Second NR iteration */
s = umul32_hi (r, b);
s = 0x30000000 - umul32_hi (r, s);
r = umul32_hi (r, s);
/* Compute sqrt(a) = a * 1/sqrt(a). Adjust to ensure it's an underestimate*/
r = umul32_hi (r, b) - 1;
/* Denormalize result */
r = r >> ((scal >> 1) + 11);
/* Make sure we got the floor correct */
rem = a - r * r;
if (rem >= (2 * r + 1)) r++;
return r;
}
#endif // ALT_IMPL
uint32_t umul32_hi (uint32_t a, uint32_t b)
{
return (uint32_t)(((uint64_t)a * b) >> 32);
}
uint32_t float_as_uint32 (float a)
{
uint32_t r;
memcpy (&r, &a, sizeof r);
return r;
}
int clz32 (uint32_t a)
{
#if (CLZ_IMPL == CLZ_FPU)
// Henry S. Warren, Jr, " Hacker's Delight 2nd ed", p. 105
int n = 158 - (float_as_uint32 ((float)(int32_t)(a & ~(a >> 1))+.5f) >> 23);
return (n < 0) ? 0 : n;
#elif (CLZ_IMPL == CLZ_CPU)
static const uint8_t clz_tab[32] = {
31, 22, 30, 21, 18, 10, 29, 2, 20, 17, 15, 13, 9, 6, 28, 1,
23, 19, 11, 3, 16, 14, 7, 24, 12, 4, 8, 25, 5, 26, 27, 0
};
a |= a >> 16;
a |= a >> 8;
a |= a >> 4;
a |= a >> 2;
a |= a >> 1;
return clz_tab [0x07c4acddu * a >> 27] + (!a);
#else // CLZ_IMPL == CLZ_BUILTIN
#if defined(_MSC_VER) && defined(_WIN64)
return __lzcnt (a);
#else // defined(_MSC_VER) && defined(_WIN64)
return __builtin_clz (a);
#endif // defined(_MSC_VER) && defined(_WIN64)
#endif // CLZ_IMPL
}
/* Henry S. Warren, Jr., "Hacker's Delight, 2nd e.d", p. 286 */
uint32_t ref_isqrt32 (uint32_t x)
{
uint32_t m, y, b;
m = 0x40000000U;
y = 0U;
while (m != 0) {
b = y | m;
y = y >> 1;
if (x >= b) {
x = x - b;
y = y | m;
}
m = m >> 2;
}
return y;
}
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
LARGE_INTEGER t;
static double oofreq;
static int checkedForHighResTimer;
static BOOL hasHighResTimer;
if (!checkedForHighResTimer) {
hasHighResTimer = QueryPerformanceFrequency (&t);
oofreq = 1.0 / (double)t.QuadPart;
checkedForHighResTimer = 1;
}
if (hasHighResTimer) {
QueryPerformanceCounter (&t);
return (double)t.QuadPart * oofreq;
} else {
return (double)GetTickCount() * 1.0e-3;
}
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
struct timeval tv;
gettimeofday(&tv, NULL);
return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif
int main (void)
{
#if ALT_IMPL
printf ("Alternate integer square root implementation\n");
#else // ALT_IMPL
#if LARGE_TABLE
printf ("Integer square root implementation w/ large table\n");
#else // LARGE_TAB
printf ("Integer square root implementation w/ small table\n");
#endif
#endif // ALT_IMPL
#if GEN_TAB
printf ("Generating lookup table ...\n");
#if ALT_IMPL
for (int i = 0; i < 384; i++) {
double x = 1.0 + (i + 1) * 1.0 / 128;
double y = 1.0 / sqrt (x);
uint8_t val = (uint8_t)((y * 512) - 256);
rsqrt_tab[i] = val;
printf ("0x%02x, ", rsqrt_tab[i]);
if (i % 12 == 11) printf("\n");
}
#else // ALT_IMPL
for (int i = 0; i < 96; i++ ) {
double x1 = 1.0 + i * 1.0 / 32;
double x2 = 1.0 + (i + 1) * 1.0 / 32;
double y = (1.0 / sqrt(x1) + 1.0 / sqrt(x2)) * 0.5;
uint32_t val = (uint32_t)(y * 256 + 0.5);
#if LARGE_TABLE
uint32_t cube = val * val * val;
rsqrt_tab[i] = (((cube + 1) / 4) << 10) + (3 * val);
printf ("0x%08x, ", rsqrt_tab[i]);
if (i % 6 == 5) printf ("\n");
#else // LARGE_TABLE
rsqrt_tab[i] = val;
printf ("0x%02x, ", rsqrt_tab[i]);
if (i % 12 == 11) printf ("\n");
#endif // LARGE_TABLE
}
#endif // ALT_IMPL
#endif // GEN_TAB
printf ("Running exhaustive test ... ");
uint32_t i = 0;
do {
uint32_t ref = ref_isqrt32 (i);
uint32_t res = my_isqrt32 (i);
if (res != ref) {
printf ("error: arg=%08x res=%08x ref=%08x\n", i, res, ref);
return EXIT_FAILURE;
}
i++;
} while (i);
printf ("PASSED\n");
printf ("Running benchmark ...\n");
i = 0;
uint32_t sum[8] = {0, 0, 0, 0, 0, 0, 0, 0};
double start = second();
do {
sum [0] += my_isqrt32 (i + 0);
sum [1] += my_isqrt32 (i + 1);
sum [2] += my_isqrt32 (i + 2);
sum [3] += my_isqrt32 (i + 3);
sum [4] += my_isqrt32 (i + 4);
sum [5] += my_isqrt32 (i + 5);
sum [6] += my_isqrt32 (i + 6);
sum [7] += my_isqrt32 (i + 7);
i += 8;
} while (i);
double stop = second();
printf ("%08x \relapsed=%.5f sec\n",
sum[0]+sum[1]+sum[2]+sum[3]+sum[4]+sum[5]+sum[6]+sum[7],
stop - start);
return EXIT_SUCCESS;
}
Here is a solution in Java that combines an integer log2 with Newton's method to create a loop-free algorithm. As a downside, it needs division. The commented lines are what is required to up-convert to a 64-bit algorithm.
private static final int debruijn= 0x07C4ACDD;
//private static final long debruijn= ( ~0x0218A392CD3D5DBFL)>>>6;
private static final int[] DeBruijnArray= new int[32]; // use 64 entries for the 64-bit variant
private static final int[] SQRT= new int[32];          // use 64 entries for the 64-bit variant
static
{
for(int x= 0; x<32; ++x)
{
final long v= ~( -2L<<(x));
// cast to int before shifting so the product wraps mod 2^32, matching the lookup in sqrt()
DeBruijnArray[((int)(v*debruijn))>>>27]= x; //>>>58
}
for(int x= 0; x<32; ++x)
SQRT[x]= (int) (Math.sqrt((1L<<DeBruijnArray[x])*Math.sqrt(2)));
}
public static int sqrt(final int num)
{
int y;
if(num==0)
return num;
{
int v= num;
v|= v>>>1; // first round up to one less than a power of 2
v|= v>>>2;
v|= v>>>4;
v|= v>>>8;
v|= v>>>16;
//v|= v>>>32;
y= SQRT[(v*debruijn)>>>27]; //>>>58
}
//y= (y+num/y)>>>1;
y= (y+num/y)>>>1;
y= (y+num/y)>>>1;
y= (y+num/y)>>>1;
return y*y>num?y-1:y;
}
How this works: the first part produces a square root accurate to about three bits. Each line of the form y= (y+num/y)>>>1; roughly doubles the number of accurate bits. The last line rounds down, eliminating results that land one above the true root.
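To see why so few steps suffice, here is a minimal C sketch of the same structure (an illustration only, not the original Java): the seed is taken as a plain power of two derived from a CLZ rather than from the De Bruijn table, which is cruder, so one extra Heron step is spent to compensate.
#include <stdint.h>
/* A seed within a factor of two of sqrt(num), followed by four Heron
   steps, lands on floor(sqrt(num)) or floor(sqrt(num)) + 1 for any
   32-bit input; the final test forces the floor. Uses GCC/Clang
   __builtin_clz. */
static uint32_t heron_isqrt32 (uint32_t num)
{
    if (num == 0) return 0;
    int bits = 32 - __builtin_clz (num);
    uint32_t y = 1u << ((bits + 1) / 2);  /* 2^ceil(bits/2) >= sqrt(num) */
    y = (y + num / y) >> 1;               /* each step roughly doubles   */
    y = (y + num / y) >> 1;               /* the number of correct bits  */
    y = (y + num / y) >> 1;
    y = (y + num / y) >> 1;
    return ((uint64_t)y * y > num) ? y - 1 : y;
}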
If you need this only for ARM Thumb-2 processors, the CMSIS DSP library by ARM is your best shot. It is made by the people who designed the Thumb-2 processors; who else could beat it?
In fact you may not even need an algorithm: where the hardware offers square-root instructions such as VSQRT, you can use those directly. ARM maintains math and DSP implementations highly optimized for Thumb-2 capable processors that use such hardware where available. You can get the source code:
arm_sqrt_f32()
arm_sqrt_q15.c / arm_sqrt_q31.c (q15 and q31 are fixed-point data types specialized for the ARM DSP extensions, which often come with Thumb-2 compatible processors.)
Note that ARM also maintains compiled binaries of CMSIS DSP that guarantee the best possible performance for the ARM Thumb architecture-specific instructions, so consider statically linking them when you use the library. You can get the binaries here.
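For reference, a minimal usage sketch, assuming the signatures currently documented for CMSIS-DSP (an input value, an output pointer, and an arm_status return; check them against the version you link):
#include "arm_math.h"
/* Assumed CMSIS-DSP signature: arm_status arm_sqrt_q31(q31_t in, q31_t *pOut);
   the call reports an error status for negative fixed-point inputs. */
q31_t sqrt_q31_or_zero (q31_t x)
{
    q31_t root = 0;
    if (arm_sqrt_q31 (x, &root) != ARM_MATH_SUCCESS) {
        root = 0;  /* no real root for negative input */
    }
    return root;
}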
This method is similar to long division: you construct a guess for the next digit of the root, do a subtraction, and enter the digit if the difference meets certain criteria. With the binary version, your only choice for the next digit is 0 or 1, so you always guess 1, do the subtraction, and enter a 1 unless the difference is negative.
http://www.realitypixels.com/turk/opensource/index.html#FractSqrt
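A commented sketch of that long-division analogy (the ref_isqrt32 routine earlier in this thread is an equivalent formulation): the input is consumed two bits at a time, the next root digit is always guessed as 1, and the guess is kept only when the subtraction does not go negative.
#include <stdint.h>
static uint16_t digitwise_isqrt32 (uint32_t x)
{
    uint32_t rem = 0, root = 0;
    for (int i = 0; i < 16; i++) {            /* 16 result bits for 32 input bits   */
        rem = (rem << 2) | (x >> 30);         /* bring down the next two input bits */
        x <<= 2;
        root <<= 1;                           /* shift the partial root             */
        uint32_t trial = (root << 1) | 1;     /* "guess 1": candidate 2*root + 1    */
        if (rem >= trial) {                   /* does the guess fit?                */
            rem -= trial;                     /* yes: subtract ...                  */
            root |= 1;                        /* ... and enter a 1 digit            */
        }
    }
    return (uint16_t)root;
}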
I implemented Warren's suggestion and Newton's method in C# for 64-bit integers. Isqrt uses Newton's method, while Isqrt2 uses Warren's bit-shifting method. Here is the source code:
using System;
namespace Cluster
{
public static class IntegerMath
{
/// <summary>
/// Compute the integer square root, the largest whole number less than or equal to the true square root of N.
///
/// This uses the integer version of Newton's method.
/// </summary>
public static long Isqrt(this long n)
{
if (n < 0) throw new ArgumentOutOfRangeException("n", "Square root of negative numbers is not defined.");
if (n <= 1) return n;
var xPrev2 = -2L;
var xPrev1 = -1L;
var x = 2L;
// From Wikipedia: if N + 1 is a perfect square, then the algorithm enters a two-value cycle, so we have to compare
// to two previous values to test for convergence.
while (x != xPrev2 && x != xPrev1)
{
xPrev2 = xPrev1;
xPrev1 = x;
x = (x + n/x)/2;
}
// The two values x and xPrev1 will be above and below the true square root. Choose the lower one.
return x < xPrev1 ? x : xPrev1;
}
#region Sqrt using Bit-shifting and magic numbers.
// From http://stackoverflow.com/questions/1100090/looking-for-an-efficient-integer-square-root-algorithm-for-arm-thumb2
// Converted to C#.
private static readonly ulong debruijn= ( ~0x0218A392CD3D5DBFUL)>>6;
private static readonly ulong[] SQRT = new ulong[64];
private static readonly int[] DeBruijnArray = new int[64];
static IntegerMath()
{
for(int x= 0; x<64; ++x)
{
ulong v= (ulong) ~( -2L<<(x));
DeBruijnArray[(v*debruijn)>>58]= x;
}
for(int x= 0; x<64; ++x)
SQRT[x]= (ulong) (Math.Sqrt((1L<<DeBruijnArray[x])*Math.Sqrt(2)));
}
public static long Isqrt2(this long n)
{
ulong num = (ulong) n;
ulong y;
if(num==0)
return (long)num;
{
ulong v= num;
v|= v>>1; // first round up to one less than a power of 2
v|= v>>2;
v|= v>>4;
v|= v>>8;
v|= v>>16;
v|= v>>32;
y= SQRT[(v*debruijn)>>58];
}
y= (y+num/y)>>1;
y= (y+num/y)>>1;
y= (y+num/y)>>1;
y= (y+num/y)>>1;
// Make sure that our answer is rounded down, not up.
return (long) (y*y>num?y-1:y);
}
#endregion
}
}
I used the following to benchmark the code:
using System;
using System.Diagnostics;
using Cluster;
using Microsoft.VisualStudio.TestTools.UnitTesting;
namespace ClusterTests
{
[TestClass]
public class IntegerMathTests
{
[TestMethod]
public void Isqrt_Accuracy()
{
for (var n = 0L; n <= 100000L; n++)
{
var expectedRoot = (long) Math.Sqrt(n);
var actualRoot = n.Isqrt();
Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n));
}
}
[TestMethod]
public void Isqrt2_Accuracy()
{
for (var n = 0L; n <= 100000L; n++)
{
var expectedRoot = (long)Math.Sqrt(n);
var actualRoot = n.Isqrt2();
Assert.AreEqual(expectedRoot, actualRoot, String.Format("Square root is wrong for N = {0}.", n));
}
}
[TestMethod]
public void Isqrt_Speed()
{
var integerTimer = new Stopwatch();
var libraryTimer = new Stopwatch();
integerTimer.Start();
var total = 0L;
for (var n = 0L; n <= 300000L; n++)
{
var root = n.Isqrt();
total += root;
}
integerTimer.Stop();
libraryTimer.Start();
total = 0L;
for (var n = 0L; n <= 300000L; n++)
{
var root = (long)Math.Sqrt(n);
total += root;
}
libraryTimer.Stop();
var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
var msg = String.Format("Isqrt: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds);
Debug.WriteLine(msg);
Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);
}
[TestMethod]
public void Isqrt2_Speed()
{
var integerTimer = new Stopwatch();
var libraryTimer = new Stopwatch();
var warmup = (10L).Isqrt2();
integerTimer.Start();
var total = 0L;
for (var n = 0L; n <= 300000L; n++)
{
var root = n.Isqrt2();
total += root;
}
integerTimer.Stop();
libraryTimer.Start();
total = 0L;
for (var n = 0L; n <= 300000L; n++)
{
var root = (long)Math.Sqrt(n);
total += root;
}
libraryTimer.Stop();
var isqrtMilliseconds = integerTimer.ElapsedMilliseconds;
var libraryMilliseconds = libraryTimer.ElapsedMilliseconds;
var msg = String.Format("isqrt2: {0} ms versus library: {1} ms", isqrtMilliseconds, libraryMilliseconds);
Debug.WriteLine(msg);
Assert.IsTrue(libraryMilliseconds > isqrtMilliseconds, "Isqrt2 should be faster than Math.Sqrt! " + msg);
}
}
}
My results on a Dell Latitude E6540 (Release mode, Visual Studio 2012) were that the library call Math.Sqrt is the fastest:
59 ms - Newton (Isqrt)
12 ms - Bit shifting (Isqrt2)
5 ms - Math.Sqrt
I am not clever with compiler directives, so it may be possible to tune the compiler to get the integer math faster. Still, the bit-shifting approach comes much closer to the library than Newton's method does, and on a system with no math coprocessor it would be very fast.
I've designed a 16-bit sqrt for RGB gamma compression. It dispatches into 3 different tables based on the high 8 bits. Disadvantages: it uses about a kilobyte for the tables, it rounds unpredictably when the exact square root is not an integer, and in the worst case it needs a single multiplication (but only for a very few input values).
static uint8_t sqrt_50_256[] = {
114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,
133,134,135,136,137,138,139,140,141,142,143,143,144,145,146,147,148,149,150,
150,151,152,153,154,155,155,156,157,158,159,159,160,161,162,163,163,164,165,
166,167,167,168,169,170,170,171,172,173,173,174,175,175,176,177,178,178,179,
180,181,181,182,183,183,184,185,185,186,187,187,188,189,189,190,191,191,192,
193,193,194,195,195,196,197,197,198,199,199,200,201,201,202,203,203,204,204,
205,206,206,207,207,208,209,209,210,211,211,212,212,213,214,214,215,215,216,
217,217,218,218,219,219,220,221,221,222,222,223,223,224,225,225,226,226,227,
227,228,229,229,230,230,231,231,232,232,233,234,234,235,235,236,236,237,237,
238,238,239,239,240,241,241,242,242,243,243,244,244,245,245,246,246,247,247,
248,248,249,249,250,250,251,251,252,252,253,253,254,254,255,255
};
static uint8_t sqrt_0_10[] = {
1,2,3,3,4,4,5,5,5,6,6,6,7,7,7,7,8,8,8,8,9,9,9,9,9,10,10,10,10,10,11,11,11,
11,11,11,12,12,12,12,12,12,13,13,13,13,13,13,13,14,14,14,14,14,14,14,15,15,
15,15,15,15,15,15,16,16,16,16,16,16,16,16,17,17,17,17,17,17,17,17,17,18,18,
18,18,18,18,18,18,18,19,19,19,19,19,19,19,19,19,19,20,20,20,20,20,20,20,20,
20,20,21,21,21,21,21,21,21,21,21,21,21,22,22,22,22,22,22,22,22,22,22,22,23,
23,23,23,23,23,23,23,23,23,23,23,24,24,24,24,24,24,24,24,24,24,24,24,25,25,
25,25,25,25,25,25,25,25,25,25,25,26,26,26,26,26,26,26,26,26,26,26,26,26,27,
27,27,27,27,27,27,27,27,27,27,27,27,27,28,28,28,28,28,28,28,28,28,28,28,28,
28,28,29,29,29,29,29,29,29,29,29,29,29,29,29,29,29,30,30,30,30,30,30,30,30,
30,30,30,30,30,30,30,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,31,32,32,
32,32,32,32,32,32,32,32,32,32,32,32,32,32,33,33,33,33,33,33,33,33,33,33,33,
33,33,33,33,33,33,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,34,35,35,
35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,35,36,36,36,36,36,36,36,36,36,
36,36,36,36,36,36,36,36,36,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,
37,37,37,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,38,39,39,39,
39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,39,40,40,40,40,40,40,40,40,
40,40,40,40,40,40,40,40,40,40,40,40,41,41,41,41,41,41,41,41,41,41,41,41,41,
41,41,41,41,41,41,41,41,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,42,
42,42,42,42,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,43,
43,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,44,45,45,
45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,45,46,46,46,46,
46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,46,47,47,47,47,47,47,
47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,48,48,48,48,48,48,48,
48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48,49,49,49,49,49,49,49,49,
49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,49,50,50,50,50,50,50,50,50,
50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,50,51,51,51,51,51,51,51,51,
51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,51,52,52,52,52,52,52,52,
52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,52,53,53
};
static uint8_t sqrt_11_49[] = {
54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,0,76,77,78,
0,79,80,81,82,83,0,84,85,86,0,87,88,89,0,90,0,91,92,93,0,94,0,95,96,97,0,98,0,
99,100,101,0,102,0,103,0,104,105,106,0,107,0,108,0,109,0,110,0,111,112,113
};
uint16_t isqrt16(uint16_t v) {
uint16_t a, b;
uint16_t h = v>>8;
if (h <= 10) return v ? sqrt_0_10[v>>2] : 0;
if (h >= 50) return sqrt_50_256[h-50];
h = (h-11)<<1;
a = sqrt_11_49[h];
b = sqrt_11_49[h+1];
if (!a) return b;
return b*b > v ? a : b;
}
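Since the rounding is intentionally not always floor, a quick exhaustive check (a sketch, not part of the original post) can verify that every result is at least the floor or the ceiling of the true root:
#include <math.h>
#include <stdint.h>
#include <stdio.h>
int main (void)
{
    int outliers = 0;
    for (uint32_t v = 0; v <= 0xffffu; v++) {
        uint16_t r  = isqrt16 ((uint16_t)v);
        uint16_t lo = (uint16_t)floor (sqrt ((double)v));
        uint16_t hi = (uint16_t)ceil (sqrt ((double)v));
        if (r != lo && r != hi) {
            printf ("off by more than rounding: v=%u got %u, expected %u or %u\n",
                    v, r, lo, hi);
            outliers++;
        }
    }
    printf ("%d results outside floor/ceil of the true root\n", outliers);
    return 0;
}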
I've compared it against the log2-based sqrt using clang's __builtin_clz (which should expand to a single assembly opcode), and against the library's sqrtf, called as (int)sqrtf((float)i), and got rather strange results:
$ gcc -O3 test.c -o test && ./test
isqrt16: 6.498955
sqrtf: 6.981861
log2_sqrt: 61.755873
Clang compiled the call to sqrtf into a sqrtss instruction, which is nearly as fast as the table-based sqrt. Lesson learned: on x86 the compiler can provide a sqrt that is fast enough, less than 10% slower than what you can come up with yourself after wasting a lot of time, and roughly 10 times faster than ugly bitwise hacks like the log2-based version. Still, sqrtss is slightly slower than the custom function, so if you really need those few percent you can get them; and ARM, for example, has no sqrtss, so the log2-based sqrt should not lag as badly there.
On x86, where an FPU is available, the old Quake hack appears to be the fastest way to calculate an integer sqrt: it is about twice as fast as either this table or the FPU's sqrtss.
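For completeness, a hedged sketch of what such a Quake-style integer sqrt can look like (the magic-constant reciprocal square root followed by a multiply; this is an illustration, not necessarily the exact routine that was benchmarked):
#include <stdint.h>
#include <string.h>
static uint32_t quake_isqrt32 (uint32_t v)
{
    if (v == 0) return 0;
    float x = (float)v;
    float y;
    uint32_t i;
    memcpy (&i, &x, sizeof i);           /* type-pun via memcpy, not a cast      */
    i = 0x5f3759df - (i >> 1);           /* initial guess for 1/sqrt(x)          */
    memcpy (&y, &i, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);   /* first Newton-Raphson step            */
    y = y * (1.5f - 0.5f * x * y * y);   /* second step for ~20+ good bits       */
    uint32_t r = (uint32_t)(x * y);      /* sqrt(x) = x * (1/sqrt(x))            */
    while ((uint64_t)(r + 1) * (r + 1) <= v) r++;  /* nudge up if we undershot   */
    while ((uint64_t)r * r > v) r--;               /* and back down to the floor */
    return r;
}
The float conversion and the approximation leave the raw result within a couple of units of the true root, so the two fix-up loops run for at most a few iterations.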