Lightweight bandpass filter for embedded device (no FFT) - embedded

I would like to remove certain frequencies from a discrete dataset (a signal sampled by an ADC). Sounds simple enough. However, there are certain constraints that make things harder:
The chip is a 32-bit NXP JN5168, which has hardware multiplication but no hardware division, no floating-point support, nor any other tools that make DSP easy. Therefore, the FFT-based filters that are so easy to implement on ARM Cortex-M chips are a no-go.
Sampling is pretty much set at 1 kHz.
Real-time performance and RAM usage are a concern, because the chip is taking care of other tasks at the same time.
I tried using well-known high/low pass filters of the form:
Filtered Value (L) = Previous Value - k*(Previous Value - Original Value) // LPF
Filtered Value (H) = Original Value - Filtered Value (L) // HPF
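In C this first-order pair looks roughly like the following sketch (names are illustrative; with k = 1/2 the scaling is just a halving, so no divide instruction is needed):

/* Sketch of the simple first-order filter pair above (illustrative names). */
static int lpf_prev;                            /* previous LPF output */

int hpf_sample(int x)                           /* x = raw ADC sample  */
{
    lpf_prev = lpf_prev - (lpf_prev - x) / 2;   /* LPF, k = 1/2        */
    return x - lpf_prev;                        /* HPF = original - LPF */
}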
Unfortunately, even with multiple passes this type of filter does not work as I would like it to. The rise of the HPF response in the frequency domain always starts at 0, and the only control over the slope there is adjusting k. Since the sampling rate can't be changed, that is the only control.
If I want to filter out anything below 120 Hz and leave 200 Hz untouched, that is not possible with the above filter, which is very good at giving a sharp cutoff below 80 Hz (at my sampling rate). If I reduce the cutoff aggressiveness and make the filter work for 120 Hz, then 200 Hz is also affected (less, but significantly).
Below you can see a frequency response of the 3 passes of the above-described HPF with k=1/2.
This does not work for me. I am looking for something different: I am looking for a lightweight filter algorithm suitable for embedded applications that can ideally provide a full or significant cut at an arbitrary frequency with a steep cutoff line, so that neighboring frequencies are unaffected, or affected insignificantly.
Thanks!
Edit: I do not want to transform my signal into the frequency domain, but rather continue working with a cleaned-up signal in the time domain.
Edit 2: Changing the hardware is not an option. There is an already existing product that needs a new feature. That happens. That's life. My job is to find the solution, which I am sure exists.

If your computational budget can afford 5 multiplications (MACs) per sample (per millisecond at a 1 kHz sample rate), and you can save the past couple of samples, you can use a biquad IIR filter. There is a cookbook for biquad coefficients.
If you can afford a small multiple of 5 MACs per sample, then you can cascade biquads and get a higher-order IIR filter with an even sharper roll-off. But you may need a filter design package (MATLAB et al.) to optimize the pole-zero locations of a higher-order IIR filter for your specific requirements.
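For illustration, a minimal direct-form I biquad sketch in C (names are mine; the coefficients would come from the cookbook mentioned above, normalized so that a0 = 1):

typedef struct {
    float b0, b1, b2, a1, a2;   /* cookbook coefficients, normalized so a0 == 1 */
    float x1, x2, y1, y2;       /* the "past couple samples" of input and output */
} Biquad;

static float biquad_process(Biquad *f, float x)
{
    /* 5 multiplies and 4 adds per sample */
    float y = f->b0 * x + f->b1 * f->x1 + f->b2 * f->x2
                        - f->a1 * f->y1 - f->a2 * f->y2;
    f->x2 = f->x1;  f->x1 = x;   /* shift the delay line */
    f->y2 = f->y1;  f->y1 = y;
    return y;
}

Cascading is then just feeding the output of one biquad_process() call into the next, with one Biquad state per section.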

Adapted from The Scientist and Engineer's Guide to
Digital Signal Processing - Chapter 19: Recursive Filters:
#include <math.h>  // for cos()

static const float pi  = 3.141592f;
static const float pi2 = 2.0f * pi;
static const float s   = 48000;  // Sample rate

void bandpassFilter( float f_hz,       // Filter centre frequency
                     float bw_hz,      // Filter bandwidth
                     const float *x,   // Pointer to input sample block
                     float *y,         // Pointer to output buffer
                     int n )           // Number of samples in sample block
{
    static float x_2 = 0.0f;           // delayed x, y samples
    static float x_1 = 0.0f;
    static float y_1 = 0.0f;
    static float y_2 = 0.0f;

    // Coefficients, calculated once on the first call (C++ semantics)
    static const float f     = f_hz / s;
    static const float bw    = bw_hz / s;
    static const float R     = 1 - (3 * bw);
    static const float Rsq   = R * R;
    static const float cosf2 = 2 * cos(pi2 * f);
    static const float K     = (1 - R * cosf2 + Rsq) / (2 - cosf2);
    static const float a0    = 1.0f - K;
    static const float a1    = (K - R) * cosf2;
    static const float a2    = Rsq - K;
    static const float b1    = R * cosf2;
    static const float b2    = -Rsq;

    for( int i = 0; i < n; ++i )
    {
        // IIR difference equation
        y[i] = a0 * x[i] + a1 * x_1 + a2 * x_2
                         + b1 * y_1 + b2 * y_2;
        // shift delayed x, y samples
        x_2 = x_1;
        x_1 = x[i];
        y_2 = y_1;
        y_1 = y[i];
    }
}
Since at the end of the loop, the filter state is retained in the static variables x_1, y_1, x_2 and y_2, the filter may be called repeatedly with any number of samples - one sample at a time or in blocks (more efficient).
The static calculation of the coefficients and the use of single-precision floating point make it reasonably fast even with software floating point, since the loop itself requires only multiply-adds. Software floating point may increase code size somewhat, most significantly in the use of cos(), but if your frequency and bandwidth are not variable, the coefficients may be pre-calculated and hard-coded. I have included them in the code for illustration purposes, and because it was real code I had available rather than something developed specifically for the question.
If floating point remains too resource- or time-hungry, then a fixed-point implementation is possible. I have used the same code adapted for fixed point using Anthony Williams' fixed-point math library, which uses C++ and extensive operator overloading, so that in most cases a fixed-point implementation is obtained simply by replacing float with fixed.
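If C++ (and hence that library) is not an option, a rough plain-C fixed-point version of the same difference equation might look like the sketch below: coefficients pre-scaled to Q14 (so magnitudes up to about 2 fit), samples in Q15, and rounding, saturation, and headroom checks omitted for brevity. The struct and function names are mine.

#include <stdint.h>

typedef struct {
    int16_t a0, a1, a2, b1, b2;   /* Q14 coefficients, same naming as above */
    int16_t x1, x2, y1, y2;       /* Q15 delayed input/output samples       */
} FixedBandpass;

static int16_t bandpass_fixed(FixedBandpass *f, int16_t x)
{
    /* Q15 * Q14 products accumulate as Q29; assumes the design leaves headroom */
    int32_t acc = (int32_t)f->a0 * x
                + (int32_t)f->a1 * f->x1 + (int32_t)f->a2 * f->x2
                + (int32_t)f->b1 * f->y1 + (int32_t)f->b2 * f->y2;
    int16_t y = (int16_t)(acc >> 14);   /* back to Q15 */
    f->x2 = f->x1;  f->x1 = x;
    f->y2 = f->y1;  f->y1 = y;
    return y;
}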


Quaternion from two vector pairs

I have two vector pairs (before and after rotation).
before rotation:
[x1,y1,z1]
[x2,y2,z2]
after rotation:
[x1',y1',z1']
[x2',y2',z2']
How to create a quaternion representing this rotation?
In most cases there is no rotation which transforms 2 vectors into 2 other vectors. Here is a simple way to visualize why: a rotation does not change the angle between vectors. If the angle between the 2 vectors before the rotation is different from the angle between the 2 vectors after the rotation, then there is no rotation which meets your criteria.
That said, there may be an optimal quaternion, with an acceptable error, which "almost" rotates your 2 vector pairs. There are a number of algorithms, varying in speed and precision, to find such a quaternion. I wrote a fast C++ algorithm for an Arduino application where speed is critical but precision is less important.
http://robokitchen.tumblr.com/post/67060392720/finding-a-rotation-quaternion-from-two-pairs-of-vectors
Before rotation: u0, v0. After rotation: u2, v2.
Quaternion q2 = Quaternion::fromTwoVectors(u0, u2);            // rotates u0 onto u2
Vector v1 = v2.rotate(q2.conjugate());                         // bring v2 back into the pre-q2 frame
Vector v0_proj = v0.projectPlane(u0);                          // project onto the plane normal to u0
Vector v1_proj = v1.projectPlane(u0);
Quaternion q1 = Quaternion::fromTwoVectors(v0_proj, v1_proj);  // remaining twist about u0
return (q2 * q1).normalized();
If this does not meet the requirements of your own application, try to google Wahba's problem.
Well, first you can find the rotation axis using vector multiplication (the cross product):
axis = v1 x v2;
Then you can compute the rotation angle:
sinA = |axis| / (|v1| * |v2|)
cosA = (v1 . v2) / (|v1| * |v2|)
A = atan2(sinA, cosA)
Here | | is the vector length operation, and . is the dot product.
And finally, your quaternion is built from the half angle and the normalized axis n = axis / |axis|:
Q(w,x,y,z) = (cos(A/2), n.x * sin(A/2), n.y * sin(A/2), n.z * sin(A/2))
I translated marcv81's very helpful blog post into Three.js:
const rotateVectorsSimultaneously = (u0, v0, u2, v2) => {
  const q2 = new THREE.Quaternion().setFromUnitVectors(u0, u2);
  const v1 = v2.clone().applyQuaternion(q2.clone().conjugate());
  const v0_proj = v0.projectOnPlane(u0);
  const v1_proj = v1.projectOnPlane(u0);
  let angleInPlane = v0_proj.angleTo(v1_proj);
  if (v1_proj.dot(new THREE.Vector3().crossVectors(u0, v0)) < 0) {
    angleInPlane *= -1;
  }
  const q1 = new THREE.Quaternion().setFromAxisAngle(u0, angleInPlane);
  const q = new THREE.Quaternion().multiplyQuaternions(q2, q1);
  return q;
};
Because angleTo always returns a positive value, I manually flip the sign of the angle depending on which side of the u0-v0 plane v1 is on.
A mature solution to this problem is called Triad. Triad is one of the earliest and simplest solutions to the spacecraft attitude determination problem and is extremely efficient computationally.
With Triad, the idea is to replace your paired set of two vectors, with a paired set of three vectors, where the extra vector is generated with a cross-product. By normalizing the vectors, you can solve for a rotation matrix without a matrix inverse or an SVD (as is needed in more general instances of the problem -- see Wahba's Problem)
For full algorithm, see: https://en.wikipedia.org/wiki/Triad_method
You can then convert the solved rotation matrix from Triad to a rotation quaternion:
qw = √(1 + m00 + m11 + m22) / 2
qx = (m21 - m12) / (4 * qw)
qy = (m02 - m20) / (4 * qw)
qz = (m10 - m01) / (4 * qw)
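As a sketch, here are those four formulas transcribed directly into C; it assumes an orthonormal rotation matrix with 1 + m00 + m11 + m22 > 0, which is exactly the case the trace-based approach mentioned next handles more robustly:

#include <math.h>

void mat3_to_quat(const double m[3][3],
                  double *qw, double *qx, double *qy, double *qz)
{
    *qw = sqrt(1.0 + m[0][0] + m[1][1] + m[2][2]) / 2.0;
    *qx = (m[2][1] - m[1][2]) / (4.0 * (*qw));
    *qy = (m[0][2] - m[2][0]) / (4.0 * (*qw));
    *qz = (m[1][0] - m[0][1]) / (4.0 * (*qw));
}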
In general to make the conversion to quaternion robust, you should consider looking at the matrix trace as discussed here: http://www.euclideanspace.com/maths/geometry/rotations/conversions/matrixToQuaternion/
Finally, consider an alternative to Triad that directly computes the optimal quaternion called QUEST.
It is fine to find the quaternion of the rotation from v1 to v2.
The final q = (cos(A/2), sin(A/2) * axis), where A is the angle between v1 and v2, and axis is the normalized rotation axis.
Multiply both sides by 2 * cos(A/2).
Then we have
2 * cos(A/2) * q = (1 + cos A, sin A * axis)
(where cos A = dot(v1, v2) / (|v1| * |v2|) and
axis = cross(v1, v2).normalize() = cross(v1, v2) / (|v1| * |v2| * sin A)).
Then 2 * cos(A/2) * q = (1 + dot(v1, v2) / (|v1| * |v2|), cross(v1, v2) / (|v1| * |v2|))
Finally q = (1 + dot(v1, v2) / (|v1| * |v2|), cross(v1, v2) / (|v1| * |v2|)).normalize()
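A small C sketch of that final formula (names are mine; the degenerate case of v1 and v2 pointing in exactly opposite directions is not handled):

#include <math.h>

typedef struct { double w, x, y, z; } Quat;

Quat quat_from_two_vectors(const double v1[3], const double v2[3])
{
    double norms = sqrt((v1[0]*v1[0] + v1[1]*v1[1] + v1[2]*v1[2]) *
                        (v2[0]*v2[0] + v2[1]*v2[1] + v2[2]*v2[2]));
    Quat q;
    q.w = norms + v1[0]*v2[0] + v1[1]*v2[1] + v1[2]*v2[2];  /* |v1||v2| + dot(v1, v2) */
    q.x = v1[1]*v2[2] - v1[2]*v2[1];                        /* cross(v1, v2)          */
    q.y = v1[2]*v2[0] - v1[0]*v2[2];
    q.z = v1[0]*v2[1] - v1[1]*v2[0];

    double len = sqrt(q.w*q.w + q.x*q.x + q.y*q.y + q.z*q.z);
    q.w /= len;  q.x /= len;  q.y /= len;  q.z /= len;      /* normalize */
    return q;
}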

Fast way to sum Fourier series?

I have generated the coefficients using FFTW, now I want to reconstruct the original data, but using only the first numCoefs coefficients rather than all of them. At the moment I'm using the below code, which is very slow:
for ( unsigned int i = 0; i < length; ++i )
{
    double sum = 0;
    for ( unsigned int j = 0; j < numCoefs; ++j )
    {
        sum += ( coefs[j][0] * cos( j * omega * i ) ) + ( coefs[j][1] * sin( j * omega * i ) );
    }
    data[i] = sum;
}
Is there a faster way?
A much simpler solution would be to zero the unwanted coefficients and then do an IFFT with FFTW. This will be a lot more efficient than doing an IDFT as above.
Note that you may get some artefacts in the time domain when you do this kind of thing - you're effectively multiplying by a step function in the frequency domain, which is equivalent to convolution with a sinc function in the time domain. To reduce the resulting "ringing" in the time domain you should use a window function to smooth out the transition between non-zero and zero coeffs.
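As a sketch of that zero-and-invert approach, assuming the coefficients are the length/2+1 half-complex bins produced by fftw_plan_dft_r2c_1d and that the abrupt (un-windowed) cut is acceptable:

#include <string.h>
#include <fftw3.h>

/* Reconstruct 'length' real samples from only the first numCoefs bins.
   c2r implies the conjugate-symmetric negative frequencies automatically. */
void reconstruct_lowpass(const fftw_complex *coefs, double *data,
                         unsigned length, unsigned numCoefs)
{
    unsigned nbins = length / 2 + 1;
    fftw_complex *in = fftw_malloc(sizeof(fftw_complex) * nbins);
    memcpy(in, coefs, sizeof(fftw_complex) * nbins);
    for (unsigned j = numCoefs; j < nbins; ++j) {  /* zero the unwanted bins */
        in[j][0] = 0.0;
        in[j][1] = 0.0;
    }
    fftw_plan p = fftw_plan_dft_c2r_1d((int)length, in, data, FFTW_ESTIMATE);
    fftw_execute(p);
    for (unsigned i = 0; i < length; ++i)          /* FFTW's c2r is unnormalized */
        data[i] /= length;
    fftw_destroy_plan(p);
    fftw_free(in);
}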
If your numCoefs is anywhere near or greater than log(length), then an IFFT, which is O(n*log(n)) in computational complexity, will most likely be faster, as well as pre-optimized for you. Just zero all the bins except for the coefficients you want to keep, and make sure to also keep their negative frequency complex conjugates as well if you want a real result.
If your numCoefs is small relative to log(length), then other optimizations you could try include using sinf() and cosf() if you don't really need more than 6 digits of precision, and pre-calculating omega * i outside the inner loop (although your compiler should be doing that for you unless you have optimization turned down or off).

CUDAFunctionLoad in Mathematica - Indexing problem

I am trying to debug an index problem I am having on my CUDA machine
Cuda Machine Info:
{1->{Name->Tesla C2050,Clock Rate->1147000,Compute Capabilities->2.,GPU Overlap->1,Maximum Block Dimensions->{1024,1024,64},Maximum Grid Dimensions->{65535,65535,65535},Maximum Threads Per Block->1024,Maximum Shared Memory Per Block->49152,Total Constant Memory->65536,Warp Size->32,Maximum Pitch->2147483647,Maximum Registers Per Block->32768,Texture Alignment->512,Multiprocessor Count->14,Core Count->448,Execution Timeout->0,Integrated->False,Can Map Host Memory->True,Compute Mode->Default,Texture1D Width->65536,Texture2D Width->65536,Texture2D Height->65535,Texture3D Width->2048,Texture3D Height->2048,Texture3D Depth->2048,Texture2D Array Width->16384,Texture2D Array Height->16384,Texture2D Array Slices->2048,Surface Alignment->512,Concurrent Kernels->True,ECC Enabled->True,Total Memory->2817982462},
All this code does is set the values of a 3D array equal to the index that CUDA is using:
__global__ void cudaMatExp(
    float *matrix1, float *matrixStore, int lengthx, int lengthy, int lengthz){

    long UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
    long index = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x +
                 threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x +
                 threadIdx.x;
    if (index < lengthx * lengthy * lengthz) {
        matrixStore[index] = index;
    }
}
For some reason, once the dimension of my 3D array becomes too large, the indexing stops.
I have tried different block dimensions (blockDim.x by blockDim.y by blockDim.z):
8x8x8 only gives correct indexing up to array dimension 12x12x12
9x9x9 only gives correct indexing up to array dimension 14x14x14
10x10x10 only gives correct indexing up to array dimension 15x15x15
For dimensions larger than these, the maximum index found with each of the different block sizes eventually starts to increase again, but it never reaches dim^3 - 1 (which is the maximum index that the CUDA threads should reach).
Here are some plots that illustrate this behavior.
For example, this one plots the dimension of the 3D array (dim x dim x dim) on the x axis and the maximum index number processed during the CUDA execution on the y axis. This particular plot is for block dimensions of 10x10x10.
Here is the (Mathematica) code to generate that plot, but when I ran this one, I used block dimensions of 1024x1x1:
CUDAExp = CUDAFunctionLoad[codeexp, "cudaMatExp",
{{"Float", _,"Input"}, {"Float", _,"Output"},
_Integer, _Integer, _Integer},
{1024, 1, 1}]; (*These last three numbers are the block dimensions*)
max = 100; (* the maximum dimension of the 3D array *)
hold = Table[1, {i, 1, max}];
compare = Table[i^3, {i, 1, max}];
Do[
dim = ii;
AA = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
"TargetPrecision" -> "Single"];
BB = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
"TargetPrecision" -> "Single"];
hold[[ii]] = Max[Flatten[
CUDAMemoryGet[CUDAExp[AA, BB, dim, dim, dim][[1]]]]];
, {ii, 1, max}]
ListLinePlot[{compare, Flatten[hold]}, PlotRange -> All]
This is the same plot, but now plotting x^3 to compare to where it should be. Notice that it diverges after the dimension of the array is >32
I test the dimensions of the 3D array and look at how far the indexing goes and compare it with dim^3-1. E.g. for dim=32, the cuda max index is 32767 (which is 32^3 -1), but for dim=33 the cuda output is 33791 when it should be 35936 (33^3 -1). Notice that 33791-32767 = 1024 = blockDim.x
Question:
Is there a way to correctly index an array with dimensions larger than the block dimensions in Mathematica?
Now, I know that some people use __mul24(threadIdx.y, blockDim.x) in their index equation to avoid problems with the index multiplication, but it doesn't seem to help in my case.
Also, I have seen someone mention that you should compile your code with -arch=sm_11 because by default it's compiled for compute capability 1.0. I don't know if this is the case in Mathematica though. I would assume that CUDAFunctionLoad[] knows to compile with 2.0 capability. Any one know?
Any suggestions would be extremely helpful!
So, Mathematica has a somewhat hidden way of dealing with grid dimensions. To fix your grid dimension to something that will work, you have to add another number to the end of the function call.
The argument denotes the number of threads to launch (or grid dimension times block dimension).
For example, in my code above:
CUDAExp =
  CUDAFunctionLoad[codeexp, "cudaMatExp",
    {{"Float", _, "Input"}, {"Float", _, "Output"},
     _Integer, _Integer, _Integer},
    {8, 8, 8}, "ShellOutputFunction" -> Print];
{8, 8, 8} denotes the dimensions of the block.
When you call CUDAExp[] in Mathematica, you can add an argument that denotes the number of threads to launch:
In this example I finally got it to work with the following:
(* AA and BB are 3D arrays of 0 with dimensions dim^3 *)
dim = 64;
CUDAExp[AA, BB, dim, dim, dim, 4096];
Note that when you compile with CUDAFunctionLoad[], it expects only 5 inputs: the first is the input array (of dimensions dim x dim x dim), the second is where its memory is stored, and the third, fourth, and fifth are the dimensions.
When you pass it a 6th, Mathematica translates that as gridDim.x * blockDim.x, so, since I know I need gridDim.x = 512 in order for every element in the array to be dealt with, I set this number equal to 512 * 8 = 4096.
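Spelled out in plain C (not Mathematica code), the arithmetic behind that 6th argument looks like this:

/* Illustration only: enough blocks to cover every element, times blockDim.x. */
long launch_argument(long dim, long bx, long by, long bz)
{
    long elements        = dim * dim * dim;                                    /* 64^3 = 262144  */
    long threadsPerBlock = bx * by * bz;                                       /* {8,8,8} -> 512 */
    long blocks          = (elements + threadsPerBlock - 1) / threadsPerBlock; /* 512            */
    return blocks * bx;                         /* gridDim.x * blockDim.x = 512 * 8 = 4096 */
}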
I hope this is clear and useful to someone in the future that comes across this issue.

Fast formula for a "high contrast" curve

My inner loop contains a calculation that profiling shows to be problematic.
The idea is to take a greyscale pixel x (0 <= x <= 1), and "increase its contrast". My requirements are fairly loose, just the following:
for x < .5, 0 <= f(x) < x
for x > .5, x < f(x) <= 1
f(0) = 0
f(x) = 1 - f(1 - x), i.e. it should be "symmetric"
Preferably, the function should be smooth.
So the graph must look something like this:
[figure: an S-shaped curve through (0, 0), (0.5, 0.5), and (1, 1)]
I have two implementations (their results differ but both are conformant):
float cosContrastize(float x) {
    return 0.5f - cosf(x * 3.14159265f) / 2.0f;
}

float mulContrastize(float i) {
    if (i < 0.5f) return i * i * 2.0f;
    i = 1.0f - i;
    return 1.0f - i * i * 2.0f;
}
So I request either a microoptimization for one of these implementations, or an original, faster formula of your own.
Maybe one of you can even twiddle the bits ;)
Consider the following sigmoid-shaped functions (properly translated to the desired range):
error function
normal CDF
tanh
logit
I generated the above figure using MATLAB. If interested here's the code:
x = -3:.01:3;
plot( x, 2*(x>=0)-1, ...
x, erf(x), ...
x, tanh(x), ...
x, 2*normcdf(x)-1, ...
x, 2*(1 ./ (1 + exp(-x)))-1, ...
x, 2*((x-min(x))./range(x))-1 )
legend({'hard' 'erf' 'tanh' 'normcdf' 'logit' 'linear'})
Trivially you could simply threshold, but I imagine this is too dumb:
return i < 0.5 ? 0.0 : 1.0;
Since you mention 'increasing contrast' I assume the input values are luminance values. If so, and they are discrete (perhaps it's an 8-bit value), you could use a lookup table to do this quite quickly.
Your 'mulContrastize' looks reasonably quick. One optimization would be to use integer math. Let's say, again, your input values could actually be passed as an 8-bit unsigned value in [0..255]. (Again, possibly a fine assumption?) You could do something roughly like...
int mulContrastize(int i) {
    if (i < 128) return (i * i) >> 7;   // the shift is really: * 2 / 256
    i = 255 - i;
    return 255 - ((i * i) >> 7);
}
A piecewise interpolation can be fast and flexible. It requires only a few decisions followed by a multiplication and addition, and can approximate any curve. It also avoids the coarseness that can be introduced by lookup tables (or the additional cost of two lookups followed by an interpolation to smooth this out), though a LUT might work perfectly fine for your case.
With just a few segments, you can get a pretty good match. Here there will be coarseness in the color gradients, which will be much harder to detect than coarseness in the absolute colors.
As Eamon Nerbonne points out in the comments, segmentation can be optimized by "choos[ing] your segmentation points based on something like the second derivative to maximize detail", that is, where the slope is changing the most. Clearly, in my posted example, having three segments in the middle of the five segment case doesn't add much more detail.
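A sketch of what such a piecewise-linear curve could look like in C, with made-up breakpoints that satisfy the question's requirements and the slope of each segment precomputed, so evaluation is only a few compares plus one multiply-add (assumes 0 <= x <= 1):

float piecewiseContrastize(float x)
{
    static const float upper[]  = { 0.25f, 0.40f, 0.60f, 0.75f, 1.01f };  /* segment upper bounds */
    static const float startx[] = { 0.00f, 0.25f, 0.40f, 0.60f, 0.75f };  /* segment start x      */
    static const float starty[] = { 0.00f, 0.10f, 0.30f, 0.70f, 0.90f };  /* f(start x)           */
    static const float slope[]  = { 0.40f, 1.3333333f, 2.0f, 1.3333333f, 0.40f };

    int i = 0;
    while (x > upper[i])                     /* pick the segment */
        ++i;
    return starty[i] + slope[i] * (x - startx[i]);
}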

What's the fastest way to divide an integer by 3?

int x = n / 3; // <-- make this faster
// for instance
int a = n * 3; // <-- normal integer multiplication
int b = (n << 1) + n; // <-- potentially faster multiplication
The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest, what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64 bit result of that. You can demonstrate this to yourself with eg:
$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$
The wikipedia page on Montgomery division is hard to read but fortunately the compiler guys have done it so you don't have to.
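For reference, here is roughly what that compiler trick looks like written out in C for non-negative inputs (for negative n the compiler additionally subtracts (n >> 31) as a sign correction, which is omitted here):

#include <stdint.h>

/* Sketch of the reciprocal-multiply trick, valid for 0 <= n < 2^31. */
int32_t div3_nonneg(int32_t n)
{
    return (int32_t)(((int64_t)n * 0x55555556LL) >> 32);   /* top 32 bits of the 64-bit product */
}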
This is the fastest approach, as the compiler will optimize it if it can, depending on the target processor.
int a;
int b;
a = some value;
b = a / 3;
There is a faster way to do it if you know the range of the values. For example, if you are dividing a signed integer by 3 and you know the range of the value to be divided is 0 to 768, then you can multiply it by a factor and shift the result right, where the factor is a power of 2 divided by 3.
eg.
Range 0 -> 768
you could use a shift of 10 bits, which corresponds to dividing by 1024; you want to divide by 3, so your multiplier should be 1024 / 3 = 341,
so you can now use (x * 341) >> 10
(Make sure the shift is an arithmetic shift if you are using signed integers, and also make sure it is actually a shift and not a bit rotate.)
This will effectively divide the value by 3, and will run at about 1.6 times the speed of a natural divide by 3 on a standard x86 / x64 CPU.
Of course, the only reason you can make this optimization when the compiler can't is that the compiler does not know the maximum range of x and therefore cannot make this determination, but you as the programmer can.
Sometimes it may even be more beneficial to promote the value to a larger type and then do the same thing, i.e. if you have an int of full range you could make it a 64-bit value and then do the multiply and shift instead of dividing by 3.
I had to do this recently to speed up image processing; I needed to find the average of 3 color channels, each color channel with a byte range (0 - 255): red, green, and blue.
At first I simply used:
avg = (r + g + b) / 3;
(So r + g + b has a maximum of 768 and a minimum of 0, because each channel is a byte 0 - 255)
After millions of iterations the entire operation took 36 milliseconds.
I changed the line to:
avg = (r + g + b) * 341 >> 10;
And that took it down to 22 milliseconds; it's amazing what can be done with a little ingenuity.
This speed up occurred in C# even though I had optimisations turned on and was running the program natively without debugging info and not through the IDE.
See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.
Also relevant:
Optimizing integer divisions with Multiply Shift in C#
Depending on your platform and depending on your C compiler, a native solution like just using
y = x / 3
Can be fast or it can be awfully slow (even if division is done entirely in hardware, if it is done using a DIV instruction, this instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.
For optimization it is important to have integer types of a known size. In C, int has no guaranteed size (it can vary by platform and compiler!), so you are better off using the C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that your C compiler knows about 64-bit integers (NOTE: even on a 32-bit CPU architecture most C compilers can handle 64-bit integers just fine):
#include <stdint.h>

static inline uint32_t divby3 (
    uint32_t divideMe
) {
    return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}
As crazy as this might sound, the method above indeed does divide by 3. All it needs is a single 64-bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64-bit application this code will be a lot faster than in a 32-bit application (in a 32-bit application, multiplying two 64-bit numbers takes 3 multiplications and 3 additions on 32-bit values) - however, it might still be faster than a division on a 32-bit machine.
On the other hand, if your compiler is a very good one and knows the trick of optimizing integer division by a constant (the latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers... you cannot rely on or expect that they will use tricks like that, even though this method is very well documented and mentioned everywhere on the Internet.
The problem is that this only works for constant divisors, not for variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases), and both are different depending on the number you want to divide by, and both take too much CPU time to calculate on the fly (that would be slower than hardware division). However, it's easy for a compiler to calculate these during compile time (where an extra second of compile time hardly plays a role).
For 64 bit numbers:
uint64_t divBy3(uint64_t x)
{
    return x * 12297829382473034411ULL;
}
However this isn't the truncating integer division you might expect.
It works correctly if the number is already divisible by 3, but it returns a huge number if it isn't.
For example, if you run it on 11, it returns 6148914691236517209. This looks like garbage but it's in fact the correct answer: multiply it by 3 and you get back 11!
If you are looking for the truncating division, then just use the / operator. I highly doubt you can get much faster than that.
Theory:
64-bit unsigned arithmetic is arithmetic modulo 2^64.
This means for each integer which is coprime with the 2^64 modulus (essentially all odd numbers) there exists a multiplicative inverse which you can use to multiply with instead of division. This magic number can be obtained by solving the 3*x + 2^64*y = 1 equation using the Extended Euclidean Algorithm.
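As an aside, a small sketch of one way to compute such a magic number: Newton-Raphson iteration doubles the number of correct low bits each pass, which is an alternative to the extended Euclidean algorithm mentioned above.

#include <stdint.h>

/* Multiplicative inverse of an odd number modulo 2^64. */
uint64_t inverse_mod_2_64(uint64_t a)   /* a must be odd */
{
    uint64_t x = a;                     /* a*a == 1 (mod 8), so x starts 3 bits correct */
    for (int i = 0; i < 5; ++i)
        x *= 2 - a * x;                 /* x_{n+1} = x_n * (2 - a*x_n), all mod 2^64 */
    return x;
}

/* inverse_mod_2_64(3) gives 12297829382473034411ULL, the constant used above. */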
What if you really don't want to multiply or divide? Here is an approximation I just invented. It works because x/3 = x/4 + x/12. But since x/12 = (x/4) / 3, we just have to repeat the process until it's good enough.
#include <stdio.h>

int main(void)
{
    int n = 1000;
    int a, b;
    a = n >> 2;
    b = (a >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    printf("a=%d\n", a);
    return 0;
}
The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.
If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.
This technique is most useful in Indiana.
I don't know if it's faster but if you want to use a bitwise operator to perform binary division you can use the shift and subtract method described at this page:
Set quotient to 0
Align leftmost digits in dividend and divisor
Repeat:
If that portion of the dividend above the divisor is greater than or equal to the divisor:
Then subtract divisor from that portion of the dividend and
Concatenate 1 to the right-hand end of the quotient
Else concatenate 0 to the right-hand end of the quotient
Shift the divisor one place right
Until dividend is less than the divisor:
quotient is correct, dividend is remainder
STOP
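One way that shift-and-subtract loop might be coded in C (a sketch; it walks the dividend's bits from the top down, which amounts to the same thing as shifting the divisor right each step):

#include <stdint.h>

/* divisor must be non-zero; returns the quotient and stores the remainder */
uint32_t shift_subtract_div(uint32_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t quotient = 0;
    for (int bit = 31; bit >= 0; --bit) {
        quotient <<= 1;
        /* the portion of the dividend above the shifted divisor */
        if ((dividend >> bit) >= divisor) {
            dividend -= divisor << bit;   /* subtract the divisor from that portion */
            quotient |= 1;                /* concatenate 1 to the quotient */
        }
    }
    if (remainder) *remainder = dividend; /* what is left over is the remainder */
    return quotient;
}

/* e.g. shift_subtract_div(11, 3, &r) returns 3 with r == 2 */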
For really large integer division (e.g. numbers bigger than 64bit) you can represent your number as an int[] and perform division quite fast by taking two digits at a time and divide them by 3. The remainder will be part of the next two digits and so forth.
e.g. for 11004 / 3 you say
11/3 = 3, remainder = 2 (from 11 - 3*3)
20/3 = 6, remainder = 2 (from 20-6*3)
20/3 = 6, remainder = 2 (from 20-6*3)
24/3 = 8, remainder = 0
hence the result 3668
internal static List<int> Div3(int[] a)
{
    int remainder = 0;
    var res = new List<int>();
    for (int i = 0; i < a.Length; i++)
    {
        var val = remainder + a[i];
        var div = val / 3;
        remainder = 10 * (val % 3);
        if (div > 9)
        {
            res.Add(div / 10);
            res.Add(div % 10);
        }
        else
            res.Add(div);
    }
    if (res[0] == 0) res.RemoveAt(0);
    return res;
}
If you really want to dig in, see this article on integer division, but it only has academic merit... it would be interesting to see an application that actually needed to perform division this way and benefited from that kind of trick.
Easy computation ... at most n iterations where n is your number of bits:
#include <stdint.h>

uint8_t divideby3(uint8_t x)
{
    /* x/3 ~ x/2 - x/4 + x/8 - x/16 + ...; note the truncation at each shift
       makes this an approximation, not an exact integer division */
    int term = x;
    int sign = 1;
    int answer = 0;
    while (term)
    {
        term >>= 1;
        answer += sign * term;
        sign = -sign;
    }
    return (uint8_t)answer;
}
A lookup table approach would also be faster in some architectures.
uint8_t DivBy3LU(uint8_t u8Operand)
{
    static const uint8_t ai8Div3[] = { 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, .... };
    return ai8Div3[u8Operand];
}