Trailing/leading zero count for a byte

Trailing/leading zero count for a byte - chess

I'm using Java and I'm coding a chess engine.
I'm trying to find the index of the first 1 bit and the index of the last 1 bit in a byte.
I'm currently using Long.numberOfTrailingZeros() (or something like that) in Java, and would like to emulate that functionality, except with bytes.
Would it be something like:
byte b = 0b011000101;
int firstOneBit = bitCount ((b & -b) - 1);
If so, how would I implement bitCount relatively efficiently. I don't mind good explainations, please don't just give me code.

use a lookup tabel with 256 entries.
to create it:
unsigned int bitcount ( unsigned int i ) {
unsigned int r = 0;
while ( i ) { r+=i&1; i>>=1; } /* bit shift is >>> in java afair */
return r;
}
this of course does not need to be fast as you do it at most 256 times to init your tabel.

/* Count Leading Zeroes */
static uint8_t clzlut[256] = {
8,7,6,6,5,5,5,5,
4,4,4,4,4,4,4,4,
3,3,3,3,3,3,3,3,
3,3,3,3,3,3,3,3,
2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,
2,2,2,2,2,2,2,2,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
1,1,1,1,1,1,1,1,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0
};
uint32_t clz(uint32_t val)
{
uint32_t accum = 0;
accum += clzlut[val >> 24];
accum += (accum == 8 ) ? clzlut[(val >> 16) & 0xFF] : 0;
accum += (accum == 16) ? clzlut[(val >> 8) & 0xFF] : 0;
accum += (accum == 24) ? clzlut[ val & 0xFF] : 0;
return accum;
}
Explanation:
This works by storing the number of leading zeroes for each permutation of a byte as a lookup table. You use the byte value to look up the count of leading zeroes for that value. Since the example does this for an unsigned int, you shift and mask the four individual bytes, and accumulate the lookups accordingly. The ternary statement is used to stop the accumulation as soon as we find a bit which is set. That the accumulated value is 8, 16 or 24 implies that no set bit is found so far.
Also, some architectures have hardware support for this (as an instruction). The assembly mnemonic is often called 'CLZ' or 'BSR'. They are abbreviations for "Count leading Zeroes" and "Bit Scan Reverse" respectively.

The correct answer is that most all processors have some special instructions to do this sort of thing (leading zeros, trailing zeros, number of ones, etc). x86 has bsf/bsr, powerpc has clz, and so on. Hopefully Integer.numberOfTrailingZeros is smart enough to use these, but that's probably the only way that has a chance of using this sort of platform-specific function in Java (if it even uses them).
The Aggregate Magic Algorithms is another place with some approaches to this sort of problem, ranging from the obvious (lookup tables), to some rather clever SWAR approaches. But I suspect they all lose to Integer(x).numberOfTrailingZeros() if the java runtime is smart about the latter; it ought to be possible to optimize out the boxing and use a platform-specific technique for numberOfTrailingZeros, and if it does both that'll win.
Just for completeness, the other classic archive of brilliant bit-whacking is the old MIT HAKMEM collection (there's also a semi-modernized C version if your PDP-6/10 assembler skills have gotten rusty).

If you assume that Long.numberOfTrailingZeros is fast (i.e. JIT compiled/optimized to use a single ASM instructions when available), then why can't you simply do something like this:
max(8,Long.numberOfTrailingZeros(val))
where val is your byte value converted to a Long. This is also assuming that max() is available and again optimizes to use asm select or max instructions.
Theoretically, on a machine that supports it, these operations could be JIT compiled to two assembler instructions.

Related

BigQuery UDF using BYTES datatype

I am currently trying to calculate the Hamming distance between two binary strings in BigQuery using User defined functions in Javascript, my schema is quite simple:
row_id STRING
descriptors BYTES REPEATED
phash BYTES
What I am finding a bit confusing is the fact that you apparently deal with BYTES in BigQuery as a Base64 string, I imported both functions atob() and btoa() so I would be able to work with the binary form of the byte strings instead of the Base64 representation:
My Query currently looks like this:
CREATE TEMP FUNCTION f_PHASH_distance(ph1 BYTES, ph2 BYTES)
RETURNS INT64
LANGUAGE js AS
"""
return HammingDistance(ph1, ph2);
"""
OPTIONS (
library=["gs://test.appspot.com/HammingDistance.js",
"gs://test.appspot.com/btoa_atob.js"]
);
SELECT f_PHASH_distance(phash, CAST("9Slp3g9OgVI=" AS BYTES))
FROM ims.images WHERE row_id = "2333USX"
And the row with id = "2333USX" phash is equal to "9Slp3g9OgVI=" in base64, which means that the Hamming distance is 0. But instead of 0 I am currently getting is 35 on BigQuery.
HammingDistance.js has the following content:
function HammingDistance(a, b){
var count = 0;
for(var i = 0; i < a.length; i++){
// calculate XOR between the two chars
var xor = a.charCodeAt(i) ^ b.charCodeAt(i);
// count number of 1's on the result
for(var j = 0; j < 16; j++){
//add if LSB is 1
count += xor % 2;
//right shift the variable
xor = xor >> 1;
}
}
return count;
}
/**
* Calculates the distance between two Perceptual hashes of two images encoded
* in base 64
*/
function PHASHDistance(a, b){
return HammingDistance(atob(a), atob(b));
}
And testing it in the JS console of my browser I do get the expected result. So I assume that I am doing something wrong with the casts but the documentation is very scarce on UDFs with BYTE parameters.
Any help would be much appreciated.

It looks like the problem is that you are casting "9Slp3g9OgVI=" to bytes rather than converting it to bytes from base64. I think you want this instead:
SELECT f_PHASH_distance(phash, FROM_BASE64("9Slp3g9OgVI="))
FROM ims.images WHERE row_id = "2333USX"
You might be better off using SQL functions rather than JavaScript functions, though, since JavaScript normally isn't as fast. Here's a Hamming distance implementation in SQL, assuming that the bytes have equal lengths:
#standardSQL
CREATE TEMP FUNCTION HammingDistance(b1 BYTES, b2 BYTES) AS (
BIT_COUNT(b1 ^ b2)
);
WITH Input AS (
SELECT b'defdef' AS bytes UNION ALL
SELECT b'123de4' UNION ALL
SELECT b'abc123'
)
SELECT HammingDistance(b'abcdef', bytes)
FROM Input;
It takes the bitwise XOR of the two byte values, then checks how many bits are not the same.

In case someone is looking for a solution in the case of comparing regular strings (not binary ones as this question), look at my answer here

'while' Loop in Objective-C

The following program calculates and removes the remainder of a number, adds the total of the remainders calculated and displays them.
#import <Foundation/Foundation.h>
int main (int argc, char * argv[]) {
#autoreleasepool {
int number, remainder, total;
NSLog(#"Enter your number");
scanf("%i", &number);
while (number != 0)
{
remainder = number % 10;
total += remainder;
number /= 10;
}
NSLog(#"%i", total);
}
return 0;
}
My questions are:
Why is the program set to continue as long as the number is not equal to 0? Shouldn't it continue as the long as the remainder is not equal to 0?
At what point is the remainder discarded from the value of number? Why is there no number -= remainder statement before n /=10?
[Bonus question: Does Objective-C get any easier to understand?]

The reason we continue until number != 0 instead of using remainder is that if our input is divisible by 10 exactly, then we don't get the proper output (the sum of the base 10 digits).
The remainder is dropped off because of integer division. Remember, an integer cannot hold a decimal place, so when we divide 16 by 10, we don't get 1.6, we just get 1.
And yes, Objective-C does get easier over time (but, as a side-note, this uses absolutely 0 features of Objective-C, so it's basically C with a NSLog call).
Note that the output isn't quite what you would expect at all times, however, as in C / ObjC, a (unlike languages like D or JS) a variable is not always initialized to a set value (in this case, you assume 0). This could cause UB down the road.

It checks to see if number is not equal to zero because remainder very well may never become zero. If we were to input 5 as our input value, the first time through the loop remainder would be set to 5 (because 5 % 10 = 5), and number would go to zero because
5 / 10 = 0.5, and ints do not store floating point values, so the .5 will get truncated and the value of number will equal zero.
The remainder does not get removed from the value of number in this code. I think that you may be confused about what the modulo operator does (see this explanation).
Bonus answer: learning a programming language is difficult at first, but very rewarding in the long run (if you stick with it). Each new language that you learn after your first will most likely be easier to learn too, because you will understand general programming constructs and practices. The best of luck on your endeavor!

Fast FFT Bit Reversal, Can I Count Down Backwards Bit Reversed?

I'm using FFT's for audio processing, and I've come up with some potentially very fast ways of doing the bit reversal needed which might be of use to others, but because of the size of my FFT's (8192), I'm trying to reduce memory usage / cache flushing do to size of lookup tables or code, and increase performance. I've seen lots of clever bit reversal routines; they all allow you can feed them with any arbitrary value and get a bit reversed output, but FFT's don't need that flexibility since they go in a predictable sequence. First let me state what I have tried and/or figured out since it may be the fastest to date and you can see the problem, then I'll ask the question.
1) I've written a program to generate straight through, unlooped x86 source code that can be pasted into my FFT code, which reads an audio sample, multiplies it by a window value (that's a lookup table itself) and then just places the resulting value in it's proper bit reversed sorted position by absolute values within the x86 addressing modes like: movlps [edi+1876],xmm0. This is the absolute fastest way to do this for smaller FFT sizes. The problem is when I write straight through code to handle 8192 values, the code grows beyond the L1 instruction cache size and performance drops way down. Of course in contrast, a 32K bit reversal lookup table mixed with a 32K window table, plus other stuff, is also too big to fit the L1 data cache, and performance drops way down, but that's the way I'm currently doing it.
2) I've found patterns in the bit reversal sequence that can be exploited to reduce lookup table size, for example using 4 bit numbers (0..15) as an example, the bit reversal sequence looks like: 0,8,4,12,2,10,6,14|1,5,9,13,3,11,7,15. First thing that can be seen is that the last 8 numbers are the same as the first 8 +1, so I can chop my LUT half. If I look at the difference between the numbers there is more redundancy, so if I start with a zero in a register and want to add values to it to get the next bit reversed number they would be: +0,+8,-4,+8,-10,+8,-4,+8 and the same for the second half. As can be seen, I could have a lookup table of just 0 and -10 because the +8's and -4's always show up in a predictable way. The code would be unrolled to handle 4 values per loop: one would be a lookup table read, and the other 3 would be straight code for +8, -4, +8, before looping around again. Then a second loop could handle the 1,5,9,13,3,11,7,15 sequence. This is great, because I can now chop down my lookup table by another factor of 4. This scales up the same way for an 8192 size FFT. I can now get by with a 4K size LUT instead of 32K. I can exploit the same pattern and double the size of my code and chop down the LUT by another half yet again, however far I want to go. But in order to eliminate the LUT altogether, I'm back to the prohibitive code size.
For large FFT sizes, I believe that this #2 solution is the absolute fastest to date, since a relatively small percentage of lookup table reads need to be done, and every algorithm I currently find on the web requires too many serial/dependency calculations which can't be vectorized.
The question is, is there an algorithm that can increment numbers so the MSB acts like the LSB, and so on? In other words (in binary): 0000, 1000, 0100, 1100, 0010, etc… I've tried to think up some way, and so far, short of a bunch of nested loops, I can't seem to find a way for a fast and simple algorithm that is a mirror image of simply adding 1 to the LSB of a number. Yet it seems like there should be a way.

One other approach to consider: take a well known bit reversal algorithm - typically a few masks, shifts, and ORs - then implement this with SSE, so you get e.g. 8 x 16 bit bit reversals for the price of one. For 16 bits you need 5*log2(N) = 20 instructions, so the aggregate throughput would be 2.5 instructions per bit reversal.

This is the most trivial and straightforward solution (in C):
void BitReversedIncrement(unsigned *var, int bit)
{
unsigned c, one = 1u << bit;
do {
c = *var & one;
(*var) ^= one;
one >>= 1;
} while (one && c);
}
The main problem with is the conditional branches, which are often costly on modern CPUs. You have one conditional branch per bit.
You can do reversed increments by working on several bits at a time, e.g. 3 if ints are 32-bit:
void BitReversedIncrement2(unsigned *var, int bit)
{
unsigned r = *var, t = 0;
while (bit >= 2 && !t)
{
unsigned tt = (r >> (bit - 2)) & 7;
t = (07351624 >> (tt * 3)) & 7;
r ^= ((tt ^ t) << (bit - 2));
bit -= 3;
}
if (bit >= 0 && !t)
{
t = r & ((1 << (bit + 1)) - 1);
r ^= t;
t <<= 2 - bit;
t = (07351624 >> (t * 3)) & 7;
t >>= 2 - bit;
r |= t;
}
*var = r;
}
This is better, you only have 1 conditional branch per 3 bits.
If your CPU supports 64-bit ints, you can work on 4 bits at a time:
void BitReversedIncrement3(unsigned *var, int bit)
{
unsigned r = *var, t = 0;
while (bit >= 3 && !t)
{
unsigned tt = (r >> (bit - 3)) & 0xF;
t = (0xF7B3D591E6A2C48ULL >> (tt * 4)) & 0xF;
r ^= ((tt ^ t) << (bit - 3));
bit -= 4;
}
if (bit >= 0 && !t)
{
t = r & ((1 << (bit + 1)) - 1);
r ^= t;
t <<= 3 - bit;
t = (0xF7B3D591E6A2C48ULL >> (t * 4)) & 0xF;
t >>= 3 - bit;
r |= t;
}
*var = r;
}
Which is even better. And the only look-up table (07351624 or 0xF7B3D591E6A2C48) is tiny and likely encoded as an immediate instruction operand.
You can further improve the code if the bit position for the reversed "1" is a known constant. Just unroll the while loop into nested ifs, substitute the reversed one bit position constant.

For larger FFTs, paying attention to cache blocking (minimizing total uncovered cache miss cycles) can have a far larger effect on performance than optimization of the cycle count taken by indexing bit reversal. Make sure not to de-optimize a bigger effect by a larger cycle count while optimizing the smaller effect. For small FFTs, where everything fits in cache, LUTs can be a good solution as long as you pay attention to any load-use hazards by making sure things are or can be pipelined appropriately.

Getting the Leftmost Bit

I have a 5 bit integer that I'm working with. Is there a native function in Objective-C that will let me know which bit is the leftmost?
i.e. I have 01001, it would return 8 or the position.
Thanks

You can build a lookup table, with 32 elements: 0, 1, 2, 2, 3, etc.

This is effectively the same operation as counting he number of leading 0s. Some CPUs have an instruction for this, otherwise you can use tricks such as those found in Hacker's Delight.
It's also equivalent to rounding down to the nearest power of 2, and again you can find efficient methods for doing this in Hacker's Delight, e.g.
uint8_t flp2(uint8_t x)
{
x = x | (x >> 1);
x = x | (x >> 2);
x = x | (x >> 4);
return x - (x >> 1);
}
See also: Previous power of 2

NSInteger value = 9;
NSInteger shift = 1;
for(NSInteger bit = value; bit > 1; bit = value >> ++shift);
NSInteger leftmostbit = 1 << shift;
Works for every number of bits.

If you don't want to use a table lookup, I would use 31 - __builtin_clz(yourNumber).
__builtin_clz( ) is a compiler intrinsic supported by gcc, llvm-gcc, and clang (and possibly other compilers as well). It returns the number of leading zero bits in an integer argument. Subtracting that from 31 gives you the position of the highest-order set bit. It should generate reasonably fast code on any target architecture.

Stanford Bit Twiddling Hacks have lots of examples of how to accomplish this.

If you mean the value of whatever bit is in position five from the right (the "leftmost" of a five-bit value), then:
int value = 17;
int bit = (value >> 4) & 1; // bit is 1
If you mean the position of the leftmost bit that is 1:
int value = 2;
int position;
for (position = 0; position < 5; position++) {
int bit = (value >> position) & 1;
if (bit == 1)
break;
}
// position is 1
Position will be 0 for the bit furthest to the right, 4 for the leftmost bit of your five-bit value, or 5 if all bits where zero.
Note: this is not the most effective solution, in clock cycles. It is hopefully a reasonably clear and educational one. :)

To clear all bits below the most significant bit:
while ( x & (x-1) ) x &= x - 1;
// 01001 => 01000
To clear all bits above the least significant bit:
x &= -x;
// 01001 => 00001
To get the position of the only set bit in a byte:
position = ((0x56374210>>(((((x)&-(x))*0x17)>>3)&0x1C))&0x07);
// 01000 => 3
In libkern.h there is a clz function defined to count leading zeros in a 32 bit int. That is the closest thing to a native Objective-C function. To get the position of the most significant bit in an int:
position = 31 - clz( x );
// 01001 => 3

I don't know objective C but this is how I would do it in C.
pow(2, int(log2(Number))
This should give you the left most 1 bit value.
PLEASE SEE STEPHEN CANON'S COMMENT BELOW BEFORE USING THIS SOLUTION.

With VC++ have a look at
_BitScanReverse/(64) in

What's the fastest way to divide an integer by 3?

int x = n / 3; // <-- make this faster
// for instance
int a = n * 3; // <-- normal integer multiplication
int b = (n << 1) + n; // <-- potentially faster multiplication

The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest, what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64 bit result of that. You can demonstrate this to yourself with eg:
$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$
The wikipedia page on Montgomery division is hard to read but fortunately the compiler guys have done it so you don't have to.

This is the fastest as the compiler will optimize it if it can depending on the output processor.
int a;
int b;
a = some value;
b = a / 3;

There is a faster way to do it if you know the ranges of the values, for example, if you are dividing a signed integer by 3 and you know the range of the value to be divided is 0 to 768, then you can multiply it by a factor and shift it to the left by a power of 2 to that factor divided by 3.
eg.
Range 0 -> 768
you could use shifting of 10 bits, which multiplying by 1024, you want to divide by 3 so your multiplier should be 1024 / 3 = 341,
so you can now use (x * 341) >> 10
(Make sure the shift is a signed shift if using signed integers), also make sure the shift is an actually shift and not a bit ROLL
This will effectively divide the value 3, and will run at about 1.6 times the speed as a natural divide by 3 on a standard x86 / x64 CPU.
Of course the only reason you can make this optimization when the compiler cant is because the compiler does not know the maximum range of X and therefore cannot make this determination, but you as the programmer can.
Sometime it may even be more beneficial to move the value into a larger value and then do the same thing, ie. if you have an int of full range you could make it an 64-bit value and then do the multiply and shift instead of dividing by 3.
I had to do this recently to speed up image processing, i needed to find the average of 3 color channels, each color channel with a byte range (0 - 255). red green and blue.
At first i just simply used:
avg = (r + g + b) / 3;
(So r + g + b has a maximum of 768 and a minimum of 0, because each channel is a byte 0 - 255)
After millions of iterations the entire operation took 36 milliseconds.
I changed the line to:
avg = (r + g + b) * 341 >> 10;
And that took it down to 22 milliseconds, its amazing what can be done with a little ingenuity.
This speed up occurred in C# even though I had optimisations turned on and was running the program natively without debugging info and not through the IDE.

See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.
Also relevant:
Optimizing integer divisions with Multiply Shift in C#

Depending on your platform and depending on your C compiler, a native solution like just using
y = x / 3
Can be fast or it can be awfully slow (even if division is done entirely in hardware, if it is done using a DIV instruction, this instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.
For optimization it is important to have integer numbers of a known size. In C int has no known size (it can vary by platform and compiler!), so you are better using C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that you C compiler knows about 64 bit integer numbers (NOTE: Even on a 32 bit CPU architecture most C compilers can handle 64 bit integers just fine):
static inline uint32_t divby3 (
uint32_t divideMe
) {
return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}
As crazy as this might sound, but the method above indeed does divide by 3. All it needs for doing so is a single 64 bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64 bit application this code will be a lot faster than in a 32 bit application (in a 32 bit application multiplying two 64 bit numbers take 3 multiplications and 3 additions on 32 bit values) - however, it might be still faster than a division on a 32 bit machine.
On the other hand, if your compiler is a very good one and knows the trick how to optimize integer division by a constant (latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers... you cannot rely or expect that it will use tricks like that, even though this method is very well documented and mentioned everywhere on the Internet.
Problem is that it only works for constant numbers, not for variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases) and both is different depending on the number you want to divide by and both take too much CPU time to calculate them on the fly (that would be slower than hardware division). However, it's easy for a compiler to calculate these during compile time (where one second more or less compile time plays hardly a role).

For 64 bit numbers:
uint64_t divBy3(uint64_t x)
{
return x*12297829382473034411ULL;
}
However this isn't the truncating integer division you might expect.
It works correctly if the number is already divisible by 3, but it returns a huge number if it isn't.
For example if you run it on for example 11, it returns 6148914691236517209. This looks like a garbage but it's in fact the correct answer: multiply it by 3 and you get back the 11!
If you are looking for the truncating division, then just use the / operator. I highly doubt you can get much faster than that.
Theory:
64 bit unsigned arithmetic is a modulo 2^64 arithmetic.
This means for each integer which is coprime with the 2^64 modulus (essentially all odd numbers) there exists a multiplicative inverse which you can use to multiply with instead of division. This magic number can be obtained by solving the 3*x + 2^64*y = 1 equation using the Extended Euclidean Algorithm.

What if you really don't want to multiply or divide? Here is is an approximation I just invented. It works because (x/3) = (x/4) + (x/12). But since (x/12) = (x/4) / 3 we just have to repeat the process until its good enough.
#include <stdio.h>
void main()
{
int n = 1000;
int a,b;
a = n >> 2;
b = (a >> 2);
a += b;
b = (b >> 2);
a += b;
b = (b >> 2);
a += b;
b = (b >> 2);
a += b;
printf("a=%d\n", a);
}
The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.
If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.
This technique is most useful in Indiana.

I don't know if it's faster but if you want to use a bitwise operator to perform binary division you can use the shift and subtract method described at this page:
Set quotient to 0
Align leftmost digits in dividend and divisor
Repeat:
If that portion of the dividend above the divisor is greater than or equal to the divisor:
Then subtract divisor from that portion of the dividend and
Concatentate 1 to the right hand end of the quotient
Else concatentate 0 to the right hand end of the quotient
Shift the divisor one place right
Until dividend is less than the divisor:
quotient is correct, dividend is remainder
STOP

For really large integer division (e.g. numbers bigger than 64bit) you can represent your number as an int[] and perform division quite fast by taking two digits at a time and divide them by 3. The remainder will be part of the next two digits and so forth.
eg. 11004 / 3 you say
11/3 = 3, remaineder = 2 (from 11-3*3)
20/3 = 6, remainder = 2 (from 20-6*3)
20/3 = 6, remainder = 2 (from 20-6*3)
24/3 = 8, remainder = 0
hence the result 3668
internal static List<int> Div3(int[] a)
{
int remainder = 0;
var res = new List<int>();
for (int i = 0; i < a.Length; i++)
{
var val = remainder + a[i];
var div = val/3;
remainder = 10*(val%3);
if (div > 9)
{
res.Add(div/10);
res.Add(div%10);
}
else
res.Add(div);
}
if (res[0] == 0) res.RemoveAt(0);
return res;
}

If you really want to see this article on integer division, but it only has academic merit ... it would be an interesting application that actually needed to perform that benefited from that kind of trick.

Easy computation ... at most n iterations where n is your number of bits:
uint8_t divideby3(uint8_t x)
{
uint8_t answer =0;
do
{
x>>=1;
answer+=x;
x=-x;
}while(x);
return answer;
}

A lookup table approach would also be faster in some architectures.
uint8_t DivBy3LU(uint8_t u8Operand)
{
uint8_t ai8Div3 = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, ....];
return ai8Div3[u8Operand];
}

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas