Fast FFT Bit Reversal, Can I Count Down Backwards Bit Reversed?

I'm using FFTs for audio processing, and I've come up with some potentially very fast ways of doing the bit reversal they need, which might be of use to others. But because of the size of my FFTs (8192), I'm trying to reduce memory usage and cache flushing due to the size of lookup tables or code, and to increase performance. I've seen lots of clever bit reversal routines; they all let you feed them any arbitrary value and get a bit-reversed output, but FFTs don't need that flexibility, since they go through the indices in a predictable sequence. First let me state what I have tried and/or figured out, since it may be the fastest approach to date and it shows the problem; then I'll ask the question.
1) I've written a program to generate straight-through, unlooped x86 source code that can be pasted into my FFT code. It reads an audio sample, multiplies it by a window value (that's a lookup table itself), and then places the resulting value directly at its proper bit-reversed position via an absolute displacement in the x86 addressing mode, like: movlps [edi+1876],xmm0. This is the absolute fastest way to do it for smaller FFT sizes. The problem is that when I write straight-through code to handle 8192 values, the code grows beyond the L1 instruction cache size and performance drops way down. Of course, in contrast, a 32K bit-reversal lookup table mixed with a 32K window table, plus other stuff, is also too big to fit the L1 data cache, and performance drops way down, but that's the way I'm currently doing it.
2) I've found patterns in the bit-reversal sequence that can be exploited to reduce lookup table size. Using 4-bit numbers (0..15) as an example, the bit-reversal sequence looks like: 0,8,4,12,2,10,6,14 | 1,9,5,13,3,11,7,15. The first thing that can be seen is that the last 8 numbers are the same as the first 8, plus 1, so I can chop my LUT in half. If I look at the differences between the numbers there is more redundancy, so if I start with a zero in a register and want to add values to it to get the next bit-reversed number, they would be: +0,+8,-4,+8,-10,+8,-4,+8, and the same for the second half. As can be seen, I could have a lookup table of just the +0 and -10 deltas, because the +8's and -4's always show up in a predictable way. The code would be unrolled to handle 4 values per loop iteration: one would be a lookup table read, and the other 3 would be straight code for +8, -4, +8, before looping around again. Then a second loop could handle the 1,9,5,13,3,11,7,15 sequence. This is great, because I can now chop down my lookup table by another factor of 4. This scales up the same way for an 8192-size FFT: I can now get by with a 4K LUT instead of 32K. I can exploit the same pattern again to double the size of my code and chop the LUT down by another half, however far I want to go, but to eliminate the LUT altogether I'm back to the prohibitive code size.
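To make one level of that scheme concrete, here is a minimal C sketch (the names and the quarter-size table revLUT4[k] == bitrev(4*k) are mine, assumed precomputed): one LUT read per group of four outputs, with the other three positions derived from fixed offsets.
enum { N = 8192 };                           /* FFT size, a power of two */
extern const unsigned short revLUT4[N / 4];  /* revLUT4[k] == bitrev(4*k) */

void bitReverseCopy(const float *src, float *dst)
{
    for (unsigned k = 0; k < N / 4; ++k) {
        unsigned s = revLUT4[k];         /* one table read per group...  */
        dst[s]         = src[4*k];       /* ...then three fixed offsets: */
        dst[s + N/2]   = src[4*k + 1];   /* +N/2 (the "+8" for N = 16)   */
        dst[s + N/4]   = src[4*k + 2];   /* net -N/4 (the "-4")          */
        dst[s + 3*N/4] = src[4*k + 3];   /* +N/2 again (the "+8")        */
    }
}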
For large FFT sizes, I believe this #2 solution is the fastest I've seen to date, since only a relatively small percentage of lookup table reads needs to be done, and every algorithm I can currently find on the web requires too many serial/dependent calculations, which can't be vectorized.
The question is: is there an algorithm that can increment numbers so the MSB acts like the LSB, and so on? In other words (in binary): 0000, 1000, 0100, 1100, 0010, etc… I've tried to think up some way, and so far, short of a bunch of nested loops, I can't find a fast and simple algorithm that is a mirror image of simply adding 1 to the LSB of a number. Yet it seems like there should be a way.

One other approach to consider: take a well-known bit reversal algorithm - typically a few masks, shifts, and ORs - then implement it with SSE, so you get e.g. 8 x 16-bit bit reversals for the price of one. For 16 bits you need 5*log2(16) = 20 instructions, so the aggregate throughput would be 2.5 instructions per bit reversal.
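A minimal sketch of that idea with SSE2 intrinsics (my own code, not from the answer above): the classic swap-bits / swap-pairs / swap-nibbles / swap-bytes ladder, applied to eight 16-bit lanes at once.
#include <emmintrin.h>  /* SSE2 */

/* Reverse the bits within each of eight 16-bit lanes. */
static inline __m128i bitrev16x8(__m128i v)
{
    __m128i m;
    m = _mm_set1_epi16(0x5555);  /* swap adjacent bits */
    v = _mm_or_si128(_mm_slli_epi16(_mm_and_si128(v, m), 1),
                     _mm_and_si128(_mm_srli_epi16(v, 1), m));
    m = _mm_set1_epi16(0x3333);  /* swap bit pairs */
    v = _mm_or_si128(_mm_slli_epi16(_mm_and_si128(v, m), 2),
                     _mm_and_si128(_mm_srli_epi16(v, 2), m));
    m = _mm_set1_epi16(0x0F0F);  /* swap nibbles */
    v = _mm_or_si128(_mm_slli_epi16(_mm_and_si128(v, m), 4),
                     _mm_and_si128(_mm_srli_epi16(v, 4), m));
    /* swap the two bytes of each lane */
    return _mm_or_si128(_mm_slli_epi16(v, 8), _mm_srli_epi16(v, 8));
}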

This is the most trivial and straightforward solution (in C):
void BitReversedIncrement(unsigned *var, int bit)
{
    unsigned c, one = 1u << bit;
    do {
        c = *var & one;
        (*var) ^= one;
        one >>= 1;
    } while (one && c);
}
The main problem with it is the conditional branches, which are often costly on modern CPUs: you get one conditional branch per bit.
You can do reversed increments by working on several bits at a time, e.g. 3 at a time if ints are 32-bit:
void BitReversedIncrement2(unsigned *var, int bit)
{
    unsigned r = *var, t = 0;
    while (bit >= 2 && !t)
    {
        unsigned tt = (r >> (bit - 2)) & 7;
        t = (01327654 >> (tt * 3)) & 7;   /* 3-bit reversed-increment table */
        r ^= ((tt ^ t) << (bit - 2));
        bit -= 3;
    }
    if (bit >= 0 && !t)
    {
        t = r & ((1 << (bit + 1)) - 1);
        r ^= t;
        t <<= 2 - bit;
        t = (01327654 >> (t * 3)) & 7;
        t >>= 2 - bit;
        r |= t;
    }
    *var = r;
}
This is better: you only have one conditional branch per 3 bits.
If your CPU supports 64-bit ints, you can work on 4 bits at a time:
void BitReversedIncrement3(unsigned *var, int bit)
{
    unsigned r = *var, t = 0;
    while (bit >= 3 && !t)
    {
        unsigned tt = (r >> (bit - 3)) & 0xF;
        t = (0x1327654FEDCBA98ULL >> (tt * 4)) & 0xF;   /* 4-bit reversed-increment table */
        r ^= ((tt ^ t) << (bit - 3));
        bit -= 4;
    }
    if (bit >= 0 && !t)
    {
        t = r & ((1 << (bit + 1)) - 1);
        r ^= t;
        t <<= 3 - bit;
        t = (0x1327654FEDCBA98ULL >> (t * 4)) & 0xF;
        t >>= 3 - bit;
        r |= t;
    }
    *var = r;
}
This is even better. And the only look-up table (01327654 or 0x1327654FEDCBA98) is tiny and likely encoded as an immediate instruction operand.
You can further improve the code if the bit position of the reversed "1" is a known constant: just unroll the while loop into nested ifs and substitute the known constant.
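In use for an 8192-point FFT the reversed "1" sits at bit 12, so a driver loop might look like this (a minimal sketch; the array names are mine):
unsigned idx = 0;
for (int i = 0; i < 8192; ++i) {
    out[idx] = in[i];   /* place each sample at its bit-reversed slot */
    BitReversedIncrement3(&idx, 12);
}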

For larger FFTs, paying attention to cache blocking (minimizing total uncovered cache miss cycles) can have a far larger effect on performance than optimizing the cycle count taken by bit-reversed indexing. Make sure you don't de-optimize the bigger effect (cache misses) while optimizing the smaller one (cycle count). For small FFTs, where everything fits in cache, LUTs can be a good solution, as long as you pay attention to any load-use hazards by making sure things are or can be pipelined appropriately.

What is Pseudo-polynomial complexity?

Yes, I've seen this answer - What is pseudopolynomial time? How does it differ from polynomial time? - but I still don't understand.
Why does the representation in bits make a difference only sometimes?
For this program, for example:
function isPrime(n):
    for i from 2 to n - 1:
        if (n mod i) = 0, return false
    return true
it says the complexity is not polynomial, because n requires log n bits to write out, so the complexity is O(2^(4*log n)). But if I apply that reasoning to every other problem, then it could also be pseudopolynomial, right? (Unless I'm getting it all wrong here.) What makes this program so special that it is measured in the number of bits required to write out n?
You have linked to other questions where this is explained fairly well for someone who understands the concept, so here comes a very brief version.
for i from 2 to n - 1:
can be rewritten as
i = 2
while (i < n - 1):
    if (n mod i) == 0:
        return false
    i = i + 1
Very often, we assume that the operations i < n - 1, i = i + 1 and n mod i are O(1). But this is not necessarily true. It is usually true for small values, and on a 32-bit machine, a "small value" is on the order of a billion.
A number that requires more than 32 bits to represent will take more time to perform operations on than a number that fits in 32 bits, and it will take even longer if it requires more than 64 bits.
In practice, this rarely matters.
A very simple way to visualize this is to imagine that you get the task of implementing the common mathematical operations where the operands are represented as strings. Here is a simple Python function that takes two strings representing binary numbers and returns the sum as a string. It was quickly hacked together and assumes both strings have the same length. It can most likely be refined, but it demonstrates the point: this function adds two numbers, but it takes longer for longer numbers.
def binadd(a, b):
    carry = '0'
    result = list('0' * (len(a) + 1))
    for i in range(len(a) - 1, -1, -1):
        xor = '1' if (a[i] == '1') != (b[i] == '1') else '0'
        val = '1' if (xor == '1') != (carry == '1') else '0'
        carry = '1' if (carry == '1' and xor == '1') or (a[i] == '1' and b[i] == '1') else '0'
        result[i + 1] = val
    result[0] = carry
    return ''.join(result)
What makes this program so special that it is measured in the number of bits required to write out n?
There's nothing special about this particular program, at least not theoretically. In practice it is special in the sense that determining whether a VERY big number is prime is a common problem. Or, to be more accurate, it would have been a much more common problem if there existed a very fast algorithm to do it. If there were, it would basically break encryption as we know it today.

How does this color blending trick that works on color components in parallel work?

I saw this Java code that does a perfect 50% blend between two RGB888 colors extremely efficiently:
public static int blendRGB(int a, int b) {
    return (a + b - ((a ^ b) & 0x00010101)) >> 1;
}
That's apparently equivalent to extracting and averaging the channels individually. Something like this:
public static int blendRGB_(int a, int b) {
    int aR = a >> 16;
    int bR = b >> 16;
    int aG = (a >> 8) & 0xFF;
    int bG = (b >> 8) & 0xFF;
    int aB = a & 0xFF;
    int bB = b & 0xFF;
    int cR = (aR + bR) >> 1;
    int cG = (aG + bG) >> 1;
    int cB = (aB + bB) >> 1;
    return (cR << 16) | (cG << 8) | cB;
}
But the first way is much more efficient. My questions are: How does this magic work? What else can I do with it? And are there more tricks similar to this?
(a ^ b) & 0x00010101 is what the least significant bits of the channels would have been in a + b if no carry had come from the right.
Subtracting it from the sum guarantees that the bit that is shifted into the most significant bit of the next channel is just the carry from that channel, untainted by this channel. Of course that also means that this channel is no longer affected by the carry from the next channel.
Another way to look at it (not the way it actually works, but a way that may help you understand it) is that effectively the inputs are changed so that their sum is even for all channels. The carries then go nicely into the least significant bits (which are zero, because even), without disturbing anything. Of course what it actually does is sort of the other way around: first it just sums them, and only then does it ensure that the sums are even for all channels. But the order doesn't matter.
More concretely, there are 4 cases (before the carry from the next channel is applied):
1. the lsb of a channel is 0 and there is no carry from the next channel.
2. the lsb of a channel is 0 and there is a carry from the next channel.
3. the lsb of a channel is 1 and there is no carry from the next channel.
4. the lsb of a channel is 1 and there is a carry from the next channel.
The first two cases are trivial. The shift puts the carried bit back in the channel it belongs to; it doesn't even matter whether it was 0 or 1.
Case 3 is more interesting. If the lsb is 1, the shift would move that bit into the most significant bit of the next channel. That's bad. That bit has to be unset somehow - but you can't just mask it away, because maybe you're in case 4.
Case 4 is the most interesting. If the lsb is 1 and there is a carry into that bit, it rolls over to a 0 and the carry is propagated. That can't be undone by masking, but it can be undone by reversing the process, i.e. subtracting 1 from the lsb (which puts it back to 1 and undoes any damage done by the propagated carry).
As you can see, in both case 3 and case 4 the cure is subtracting 1 from the lsb, and those are also the cases in which the lsb really wanted to be 1 (though maybe it isn't any more, due to a carry from the next channel); in both case 1 and case 2, you don't have to do anything (in other words, subtract 0). That corresponds exactly to subtracting "what the lsb would have been in a + b if no carry had come from the right".
Also, the blue channel can only fall into cases 1 or 3 (there is no next channel which could carry), and the shift would just discard that bit instead of putting it in the next channel (because there is none). So alternatively, you may write (note the mask has lost the least significant 1)
public static int blendRGB(int a, int b) {
    return (a + b - ((a ^ b) & 0x00010100)) >> 1;
}
Doesn't really make any difference, though.
To make it work for ARGB8888, you can switch to the good old "SWAR average":
// channel-by-channel average, no alpha blending
public static int blendARGB(int a, int b) {
    return (a & b) + (((a ^ b) & 0xFEFEFEFE) >>> 1);
}
Which is a variation on a recursive way to define addition: x + y = (x ^ y) + ((x & y) << 1) which computes the sum without carries, then adds the carries in separately. The base case is when one of the operands is zero.
Both halves are effectively shifted right by 1, in such a way that the carry out of the most significant bit is never lost. The mask ensures that bits don't move to a channel to the right, and simultaneously ensures that a carry won't propagate out of its channel.
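As a sanity check (my own snippet, in C to keep it self-contained): the identity gives the floor average as (x & y) + ((x ^ y) >> 1), and brute-forcing one byte pair, which is what each channel sees, confirms it.
#include <assert.h>
#include <stdint.h>

int main(void)
{
    for (uint32_t x = 0; x < 256; ++x)
        for (uint32_t y = 0; y < 256; ++y)
            /* per-channel floor average, as in blendARGB above */
            assert(((x & y) + (((x ^ y) & 0xFEu) >> 1)) == (x + y) / 2);
    return 0;
}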

Fast way to sum Fourier series?

I have generated the coefficients using FFTW, now I want to reconstruct the original data, but using only the first numCoefs coefficients rather than all of them. At the moment I'm using the below code, which is very slow:
for (unsigned int i = 0; i < length; ++i)
{
    double sum = 0;
    for (unsigned int j = 0; j < numCoefs; ++j)
    {
        sum += (coefs[j][0] * cos(j * omega * i)) + (coefs[j][1] * sin(j * omega * i));
    }
    data[i] = sum;
}
Is there a faster way?
A much simpler solution would be to zero the unwanted coefficients and then do an IFFT with FFTW. This will be a lot more efficient than doing an IDFT as above.
Note that you may get some artefacts in the time domain when you do this kind of thing - you're effectively multiplying by a step function in the frequency domain, which is equivalent to convolution with a sinc function in the time domain. To reduce the resulting "ringing" in the time domain you should use a window function to smooth out the transition between non-zero and zero coeffs.
If your numCoefs is anywhere near or greater than log(length), then an IFFT, which is O(n*log(n)) in computational complexity, will most likely be faster, as well as pre-optimized for you. Just zero all the bins except for the coefficients you want to keep, and make sure to also keep their negative frequency complex conjugates as well if you want a real result.
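For instance (a minimal sketch of the zero-and-invert approach, under the assumption that the coefficients came from FFTW's real-to-complex transform; the names are mine):
#include <fftw3.h>
#include <string.h>

void reconstruct(double *data, const fftw_complex *coefs,
                 unsigned length, unsigned numCoefs)
{
    unsigned nBins = length / 2 + 1;             /* bins of an r2c transform */
    fftw_complex *tmp = fftw_alloc_complex(nBins);
    memcpy(tmp, coefs, nBins * sizeof(fftw_complex));
    for (unsigned j = numCoefs; j < nBins; ++j)  /* zero the unwanted bins */
        tmp[j][0] = tmp[j][1] = 0.0;
    fftw_plan p = fftw_plan_dft_c2r_1d((int)length, tmp, data, FFTW_ESTIMATE);
    fftw_execute(p);   /* note: FFTW's c2r is unnormalized; divide data[] by length */
    fftw_destroy_plan(p);
    fftw_free(tmp);
}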
If your numCoefs is small relative to log(length), then other optimizations you could try include using sinf() and cosf() if you don't really need more than 6 digits of precision, and pre-calculating omega*i outside the inner loop (although your compiler should be doing that for you unless you have the optimization setting low or off).

Trailing/leading zero count for a byte

I'm using Java and I'm coding a chess engine.
I'm trying to find the index of the first 1 bit and the index of the last 1 bit in a byte.
I'm currently using Long.numberOfTrailingZeros() (or something like that) in Java, and would like to emulate that functionality, except with bytes.
Would it be something like:
byte b = 0b01100101;
int firstOneBit = bitCount((b & -b) - 1);
If so, how would I implement bitCount relatively efficiently? I don't mind good explanations; please don't just give me code.
Use a lookup table with 256 entries.
To create it:
unsigned int bitcount(unsigned int i) {
    unsigned int r = 0;
    while (i) { r += i & 1; i >>= 1; } /* bit shift is >>> in java afair */
    return r;
}
This of course does not need to be fast, as you do it at most 256 times to init your table.
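Putting the two together (a sketch of my own; popcount8 and firstOneBit are hypothetical names): fill the 256-entry table once with bitcount(), then combine it with the (b & -b) - 1 trick from the question.
static unsigned char popcount8[256];

static void initPopcount8(void)
{
    for (unsigned v = 0; v < 256; ++v)
        popcount8[v] = (unsigned char)bitcount(v);
}

/* index of the lowest set bit of a byte value b (b != 0) */
static int firstOneBit(unsigned b)
{
    return popcount8[((b & -b) - 1) & 0xFF];
}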
/* Count Leading Zeroes */
static uint8_t clzlut[256] = {
8,7,6,6,5,5,5,5,
4,4,4,4,4,4,4,4,
3,3,3,3,3,3,3,3,
3,3,3,3,3,3,3,3,
2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0
};
uint32_t clz(uint32_t val)
{
    uint32_t accum = 0;
    accum += clzlut[val >> 24];
    accum += (accum == 8)  ? clzlut[(val >> 16) & 0xFF] : 0;
    accum += (accum == 16) ? clzlut[(val >> 8)  & 0xFF] : 0;
    accum += (accum == 24) ? clzlut[ val        & 0xFF] : 0;
    return accum;
}
Explanation:
This works by storing the number of leading zeroes for each permutation of a byte as a lookup table. You use the byte value to look up the count of leading zeroes for that value. Since the example does this for an unsigned int, you shift and mask the four individual bytes and accumulate the lookups accordingly. The ternary statement is used to stop the accumulation as soon as we find a bit which is set: an accumulated value of 8, 16 or 24 implies that no set bit has been found so far.
Also, some architectures have hardware support for this (as an instruction). The assembly mnemonic is often called 'CLZ' or 'BSR'. They are abbreviations for "Count leading Zeroes" and "Bit Scan Reverse" respectively.
The correct answer is that almost all processors have some special instructions to do this sort of thing (leading zeros, trailing zeros, number of one bits, etc.). x86 has bsf/bsr, PowerPC has clz, and so on. Hopefully Integer.numberOfTrailingZeros is smart enough to use these, but that's probably the only way that has a chance of using this sort of platform-specific function in Java (if it even uses them).
The Aggregate Magic Algorithms is another place with some approaches to this sort of problem, ranging from the obvious (lookup tables) to some rather clever SWAR approaches. But I suspect they all lose to Integer.numberOfTrailingZeros(x) if the Java runtime is smart about it; the JIT ought to be able to use a platform-specific instruction for numberOfTrailingZeros, and if it does, that'll win.
Just for completeness, the other classic archive of brilliant bit-whacking is the old MIT HAKMEM collection (there's also a semi-modernized C version if your PDP-6/10 assembler skills have gotten rusty).
If you assume that Long.numberOfTrailingZeros is fast (i.e. JIT compiled/optimized to use a single ASM instruction when available), then why can't you simply do something like this:
min(8, Long.numberOfTrailingZeros(val))
where val is your byte value converted to a long. This is also assuming that min() is available and again optimizes to use an asm select or min instruction.
Theoretically, on a machine that supports it, these operations could be JIT compiled to two assembler instructions.

What's the fastest way to divide an integer by 3?

int x = n / 3; // <-- make this faster
// for instance
int a = n * 3; // <-- normal integer multiplication
int b = (n << 1) + n; // <-- potentially faster multiplication
The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest: what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64-bit result of that. You can demonstrate this to yourself with, e.g.:
$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$
The wikipedia page on Montgomery division is hard to read but fortunately the compiler guys have done it so you don't have to.
This is the fastest, as the compiler will optimize it if it can, depending on the target processor:
int a;
int b;
a = some value;
b = a / 3;
There is a faster way to do it if you know the ranges of the values: for example, if you are dividing a signed integer by 3 and you know the range of the value to be divided is 0 to 768, then you can multiply it by a scaled reciprocal of 3 and shift right by the matching power of 2.
e.g.
Range 0 -> 768
You could use a shift of 11 bits, i.e. a scale of 2048; you want to divide by 3, so your multiplier should be 2048 / 3 ≈ 683,
so you can now use (x * 683) >> 11
(Make sure the shift is a signed shift if using signed integers, and also make sure the shift is an actual shift and not a bit roll.)
This will effectively divide the value by 3, and will run at about 1.6 times the speed of a native divide by 3 on a standard x86 / x64 CPU.
Of course the only reason you can make this optimization when the compiler can't is that the compiler does not know the maximum range of x and therefore cannot make this determination, but you as the programmer can.
Sometimes it may even be more beneficial to move the value into a larger type and then do the same thing, i.e. if you have an int of full range you could make it a 64-bit value and then do the multiply and shift instead of dividing by 3.
I had to do this recently to speed up image processing: I needed to find the average of 3 color channels, each color channel with a byte range (0 - 255): red, green and blue.
At first I simply used:
avg = (r + g + b) / 3;
(So r + g + b has a maximum of 768 and a minimum of 0, because each channel is a byte, 0 - 255.)
After millions of iterations the entire operation took 36 milliseconds.
I changed the line to:
avg = (r + g + b) * 683 >> 11;
And that took it down to 22 milliseconds. It's amazing what can be done with a little ingenuity.
This speed-up occurred in C#, even though I had optimisations turned on and was running the program natively without debugging info and not through the IDE.
See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.
Also relevant:
Optimizing integer divisions with Multiply Shift in C#
Depending on your platform and depending on your C compiler, a native solution like just using
y = x / 3
Can be fast or it can be awfully slow (even if division is done entirely in hardware, if it is done using a DIV instruction, this instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.
For optimization it is important to have integer numbers of a known size. In C, int has no known size (it can vary by platform and compiler!), so you are better off using the C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that your C compiler knows about 64-bit integer numbers (NOTE: even on a 32-bit CPU architecture most C compilers can handle 64-bit integers just fine):
static inline uint32_t divby3(uint32_t divideMe)
{
    return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}
As crazy as this might sound, the method above indeed does divide by 3. All it needs is a single 64-bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64-bit application this code will be a lot faster than in a 32-bit application (in a 32-bit application, multiplying two 64-bit numbers takes 3 multiplications and 3 additions on 32-bit values); however, it might still be faster than a division on a 32-bit machine.
On the other hand, if your compiler is a very good one and knows the trick how to optimize integer division by a constant (latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers... you cannot rely or expect that it will use tricks like that, even though this method is very well documented and mentioned everywhere on the Internet.
The problem is that it only works for constant divisors, not variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases), and both are different depending on the number you want to divide by, and both take too much CPU time to calculate on the fly (that would be slower than hardware division). However, it's easy for a compiler to calculate these during compile time (where one second more or less of compile time hardly plays a role).
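If you want to convince yourself, a brute-force check over all 32-bit values is quick enough to run once (my own test harness, assuming the divby3 above is in scope):
#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint32_t i = 0;
    do {
        assert(divby3(i) == i / 3);  /* compare against hardware division */
    } while (i++ != 0xFFFFFFFFu);
    return 0;
}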
For 64-bit numbers:
uint64_t divBy3(uint64_t x)
{
    return x * 12297829382473034411ULL;
}
However this isn't the truncating integer division you might expect.
It works correctly if the number is already divisible by 3, but it returns a huge number if it isn't.
For example, if you run it on 11, it returns 6148914691236517209. This looks like garbage, but it's in fact the correct answer: multiply it by 3 and you get back 11!
If you are looking for the truncating division, then just use the / operator. I highly doubt you can get much faster than that.
Theory:
64-bit unsigned arithmetic is arithmetic modulo 2^64.
This means that for each integer which is coprime to the modulus 2^64 (essentially, all odd numbers) there exists a multiplicative inverse which you can multiply by instead of dividing. This magic number can be obtained by solving the equation 3*x + 2^64*y = 1 using the Extended Euclidean Algorithm.
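Besides the extended Euclidean algorithm, a compact way to get that inverse (a sketch of mine, not from the answer above) is Newton-Raphson iteration modulo 2^64: each step doubles the number of correct low bits, so five steps suffice for 64 bits.
#include <stdint.h>

uint64_t modinv64(uint64_t a)    /* a must be odd */
{
    uint64_t x = a;              /* a is its own inverse mod 8 */
    for (int i = 0; i < 5; ++i)
        x *= 2 - a * x;          /* x <- x * (2 - a*x), mod 2^64 */
    return x;
}
For example, modinv64(3) yields 12297829382473034411ULL, the constant used above.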
What if you really don't want to multiply or divide? Here is an approximation I just invented. It works because x/3 = x/4 + x/12. But since x/12 = (x/4) / 3, we just have to repeat the process until it's good enough.
#include <stdio.h>
int main(void)
{
    int n = 1000;
    int a, b;
    a = n >> 2;
    b = (a >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    printf("a=%d\n", a);
    return 0;
}
The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.
If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.
This technique is most useful in Indiana.
I don't know if it's faster, but if you want to use a bitwise operator to perform binary division, you can use the shift-and-subtract method described at this page (a C sketch follows the steps below):
Set quotient to 0
Align leftmost digits in dividend and divisor
Repeat:
    If that portion of the dividend above the divisor is greater than or equal to the divisor:
        Then subtract divisor from that portion of the dividend and
        Concatenate 1 to the right hand end of the quotient
    Else concatenate 0 to the right hand end of the quotient
    Shift the divisor one place right
Until dividend is less than the divisor:
quotient is correct, dividend is remainder
STOP
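Here is a C sketch of those steps for 32-bit unsigned operands (restoring division; the code and names are mine):
#include <stdint.h>

uint32_t divideBits(uint32_t dividend, uint32_t divisor)  /* assumes divisor != 0 */
{
    uint32_t quotient = 0;
    int shift = 0;
    /* align the leftmost digits of divisor and dividend */
    while ((divisor << shift) < dividend && !((divisor << shift) & 0x80000000u))
        ++shift;
    for (; shift >= 0; --shift) {
        quotient <<= 1;
        if ((divisor << shift) <= dividend) {  /* subtract and concatenate a 1 */
            dividend -= divisor << shift;
            quotient |= 1;
        }
    }
    return quotient;  /* dividend now holds the remainder */
}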
For really large integer division (e.g. numbers bigger than 64 bits) you can represent your number as an int[] of decimal digits and perform the division quite fast by taking two digits at a time (the carried remainder and the next digit) and dividing them by 3. The remainder becomes part of the next two digits, and so forth.
e.g. for 11004 / 3 you say:
11 / 3 = 3, remainder = 2 (from 11 - 3*3)
20 / 3 = 6, remainder = 2 (from 20 - 6*3)
20 / 3 = 6, remainder = 2 (from 20 - 6*3)
24 / 3 = 8, remainder = 0
hence the result 3668
internal static List<int> Div3(int[] a)
{
    int remainder = 0;
    var res = new List<int>();
    for (int i = 0; i < a.Length; i++)
    {
        var val = remainder + a[i];
        var div = val / 3;
        remainder = 10 * (val % 3);
        if (div > 9)
        {
            res.Add(div / 10);
            res.Add(div % 10);
        }
        else
            res.Add(div);
    }
    if (res[0] == 0) res.RemoveAt(0);
    return res;
}
If you really want to, see this article on integer division, but it only has academic merit ... it would be interesting to see an application that actually needed division and benefited from that kind of trick.
Easy computation using only shifts and adds, plus a small correction at the end (this is essentially the divu3 technique from Hacker's Delight):
uint8_t divideby3(uint8_t x)
{
    unsigned q = (x >> 2) + (x >> 4);     /* approximate x * 0.0101... (binary) */
    q += q >> 4;                          /* extend the repeating bit pattern   */
    unsigned r = x - q * 3;               /* small remaining truncation error   */
    return (uint8_t)(q + (11 * r >> 5));  /* fold floor(r/3) back in            */
}
A lookup table approach would also be faster on some architectures:
uint8_t DivBy3LU(uint8_t u8Operand)
{
    static const uint8_t ai8Div3[256] = { 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, /* .... */ };
    return ai8Div3[u8Operand];
}