MathProg (AMPL) - Variable Array Sized by Another Variable

I am writing my first GNU MathProg (AMPL) program to find the minimum switch (vertex) count instances of a HyperX topology (graph) for a given radix, number of hosts, and bisection bandwidth. This is a simple first program because all of the equations have been described in the following paper: http://cal.snu.ac.kr/files/2009.sc.hyperx.pdf
I have read the specification and example programs, but I am stuck on a very simple syntax error. I need to have the following two variables: L, the number of dimensions in the network, and an array S of length L, where each element of S is the number of switches in each dimension. In my MathProg program, I express this as:
var L >= 1, integer;
var S{1 .. L} >= 2, integer;
However, when I run $ glpsol --check --math hyperx.mod, I get the following error:
hyperx.mod:28: operand following .. has invalid type
Context: ...isec ; param radix ; var L >= 1 , integer ; var S { 1 .. L }
If anybody can help explain how I should properly express this relationship, I will be grateful. Also, I am including the entire program I have written for reference and extra help. I expect there to be many syntax errors in my program, but until I fix the first one, I have no way of finding the rest.
/*
* A MathProg linear program to find an optimal HyperX topology of a
* given network size, switch radix, and bisection bandwidth. Optimal
* is simplistically defined as minimum switch count network.
*
* A HyperX topology is a multi-dimensional network (graph) where, in
* each dimension, the switches are fully connected. Every switch
 * (vertex) is a point in an L-dimensional integer lattice. Each switch
* is identified by a multi-index I = (I_1, ..., I_L) where 0 <= I_k <
* S_k for each k = 1..L, where S_k is the number of switches in each
* dimension. A switch connects to all others whose multi-index is the
* same in all but one coordinate.
*/
/* Network size in number of hosts. */
param hosts;
/* Desired bisection bandwidth. */
param bisec;
/* Maximum switch radix. */
param radix;
/* The number of dimensions in the HyperX. */
var L >= 1, integer;
/* The number of switches in each dimension. */
var S{1 .. L} >= 2, integer;
/*
* Relative bandwidth of the dimension, i.e., the number of links in a
* given dimension.
*/
var K{1 .. L} >= 1, integer;
/* The number of terminals (hosts) per switch. */
var T >= 1, integer;
/* Minimize the total number of switches. */
minimize cost: prod{i in 1..L} S[i];
/* The total number of links must be less than the switch radix. */
s.t. Radix: T + sum{i in 1..L} K[i] * (S[i] - 1) <= radix;
/* There must be enough hosts in the network. */
s.t. Hosts: T * prod{i in 1..L} S[i] >= hosts;
/* There must be enough bandwidth. */
s.t. Bandwidth: min{K[i]*S[i]} / (2 * T) >= bisec;
/* The order of the dimensions doesn't matter, so constrain them to be non-decreasing. */
s.t. SwitchDimen: forall{i in 1..(L-1)} S[i] <= S[i+1];
/*
* Bisection bandwidth depends on the smallest S_i * K_i, so we know
* that the smallest switch count dimension needs the most links.
*/
s.t. LinkDimen: forall{i in 1..(L-1)} K[i] >= K[i+1];
# TODO: I would like to constrain the search such that the number of
# terminals, T, is bounded to T >= (hosts / O), where O is the switch
# count of the smallest switch count topology discovered so far, but I
# don't know how to do this.
/* Data section */
data;
param hosts := 32;
param bisec := 0.5;
param radix := 64;
end;

A fixed number of variables is a common assumption in solvers and in algebraic modelling languages, including AMPL/MathProg. Therefore you can use only constant expressions, in particular parameters, not variables, in indexing expressions. One possible solution is to make L a parameter, re-solve your problem for different values of L, and select the solution that gives the best objective value. This can be done with a simple AMPL script.
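For illustration, here is a minimal sketch of that change (my sketch, untested against the rest of the model): L becomes a parameter supplied in the data section, so the indexing expression 1..L is constant when the model is translated, and you re-run glpsol once for each candidate L.

/* L is now a parameter, not a variable. */
param L >= 1, integer;
var S{1 .. L} >= 2, integer;
var K{1 .. L} >= 1, integer;

data;
param L := 2;   /* re-solve with L := 1, 2, 3, ... and keep the best objective */
end;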

Related

Lightweight bandpass filter for embedded device (no FFT)

I would like to remove certain frequencies from the discrete dataset (signal sampled by an ADC). Sounds simple enough. However, there are certain constraints that make things harder:
The chip is a 32-bit NXP JN5168, which has hardware multiplication, but no hardware division, no float support, nor any tools whatsoever that make DSP easy. Therefore, the FFT-based filters that are so easy to implement on ARM Cortex-M chips are a no-go
Sampling is pretty much set at 1 kHz
Real-time performance and RAM usage are a concern, because the chip is taking care of other tasks at the same time.
I tried using well-known high/low pass filters of the form:
Filtered Value (L) = Previous Value - k*(Previous Value - Original Value) // LPF
Filtered Value (H) = Original Value - Filtered Value (L) // HPF
Unfortunately, even with multiple passes, this type of filter does not work as I would like it to. The rise of the HPF response in the frequency domain always starts at 0, and one can only control the slope there by adjusting k. Since the sampling rate can't be changed, that's the only control.
If I want to filter out anything below 120 Hz and leave 200 Hz untouched, that's not possible with the above filter, which is very good at sharp cutoffs below 80 Hz (at my sampling rate). If I reduce the cutoff aggressiveness and make the filter work for 120 Hz, then 200 Hz is also affected (less, but significantly).
[Plot omitted: the frequency response of 3 passes of the above-described HPF with k = 1/2.]
This does not work for me. I am looking for something different: a lightweight filter algorithm suitable for embedded applications that can ideally provide a full or significant cut at an arbitrary frequency with a steep cutoff, so that neighboring frequencies are unaffected, or affected only insignificantly.
Thanks!
Edit: I do not want to transform my signal into the frequency domain, but rather continue working with a cleaned-up signal in the time domain.
Edit 2: Changing the hardware is not an option. There is an already existing product that needs a new feature. That happens. That's life. My job is to find the solution, which I am sure exists.
If your computational budget can afford 5 multiply-accumulates (MACs) per sample (i.e. per millisecond at a 1 kHz sample rate), and you can save the past couple of samples, you can use a biquad IIR filter. There is a cookbook for biquad coefficients.
If you can afford a small multiple of 5 MACs per sample, then you can cascade biquads and get a higher-order IIR filter with an even sharper roll-off. But you may need to use a filter design software package (MATLAB et al.) to optimize the pole-zero locations of a higher-order IIR filter for your specific requirements.
Adapted from The Scientist and Engineer's Guide to Digital Signal Processing, Chapter 19: Recursive Filters:
#include <math.h>

static const float pi  = 3.141592f;
static const float pi2 = 2.0f * pi;
static const float s   = 48000;      // Sample rate

void bandpassFilter( float f_hz,     // Filter centre frequency
                     float bw_hz,    // Filter bandwidth
                     const float *x, // Pointer to input sample block
                     float *y,       // Pointer to output buffer
                     int n )         // Number of samples in sample block
{
    static float x_2 = 0.0f;         // delayed x, y samples
    static float x_1 = 0.0f;
    static float y_1 = 0.0f;
    static float y_2 = 0.0f;

    // Band-pass coefficients from the book's Chapter 19 equations.
    // Note: initialising statics from function arguments needs C++
    // (run-once) semantics; in plain C, compute them explicitly or
    // pre-calculate and hard-code them.
    static const float f     = f_hz / s;
    static const float bw    = bw_hz / s;
    static const float R     = 1 - (3 * bw);
    static const float Rsq   = R * R;
    static const float cosf2 = 2 * cos(pi2 * f);
    static const float K     = (1 - R * cosf2 + Rsq) / (2 - cosf2);
    static const float a0    = 1.0f - K;
    static const float a1    = (K - R) * cosf2;   // = 2(K - R)cos(2*pi*f)
    static const float a2    = Rsq - K;
    static const float b1    = R * cosf2;         // = 2R cos(2*pi*f)
    static const float b2    = -Rsq;

    for( int i = 0; i < n; ++i )
    {
        // IIR difference equation
        y[i] = a0 * x[i] + a1 * x_1 + a2 * x_2
             + b1 * y_1  + b2 * y_2;

        // shift delayed x, y samples
        x_2 = x_1;
        x_1 = x[i];
        y_2 = y_1;
        y_1 = y[i];
    }
}
Since the filter state is retained across calls in the static variables x_1, x_2, y_1 and y_2, the filter may be called repeatedly with any number of samples: one sample at a time, or in blocks (which is more efficient).
The one-time calculation of the coefficients and the use of single-precision floating point make it reasonably fast even with software floating point, since the loop requires only multiply-adds. Software floating point may increase code size somewhat, most significantly through the use of cos(); but if your frequency and bandwidth are not variable, the coefficients may be pre-calculated and hard-coded. I have included the calculation in the code for illustration purposes, and because it was real code I had available rather than something developed specifically for the question.
If floating point remains too resource- or time-hungry, then a fixed-point implementation is possible. I have used the same code adapted for fixed point using Anthony Williams' fixed-point math library, which uses C++ and extensive operator overloading to allow, in most cases, a fixed-point implementation simply by replacing float with fixed.
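To give a flavour of a hand-rolled alternative (a minimal sketch of my own, not Anthony Williams' library; the struct, the Q2.14 scaling and all the names are illustrative): the coefficients are computed once in floating point (or offline) and stored as scaled integers, so the per-sample work is the same 5 MACs, done entirely in integer arithmetic.

#include <stdint.h>
#include <math.h>

#define QBITS 14   /* Q2.14: the coefficient magnitudes here stay below 2 */

typedef struct {
    int32_t a0, a1, a2, b1, b2;   /* Q2.14 coefficients */
    int32_t x1, x2, y1, y2;       /* filter state */
} BiquadQ14;

/* One-time setup; may use floating point, or be done offline and hard-coded. */
static void biquadInit(BiquadQ14 *flt, float f_hz, float bw_hz, float fs)
{
    const float f  = f_hz / fs;
    const float bw = bw_hz / fs;
    const float R  = 1.0f - (3.0f * bw);
    const float c  = 2.0f * cosf(2.0f * 3.141592f * f);
    const float K  = (1.0f - R * c + R * R) / (2.0f - c);
    flt->a0 = (int32_t)lroundf((1.0f - K)  * (1 << QBITS));
    flt->a1 = (int32_t)lroundf((K - R) * c * (1 << QBITS));
    flt->a2 = (int32_t)lroundf((R * R - K) * (1 << QBITS));
    flt->b1 = (int32_t)lroundf(R * c       * (1 << QBITS));
    flt->b2 = (int32_t)lroundf(-(R * R)    * (1 << QBITS));
    flt->x1 = flt->x2 = flt->y1 = flt->y2 = 0;
}

/* 5 integer MACs per sample; the shift rescales the Q2.14 products. */
static int32_t biquadStep(BiquadQ14 *flt, int32_t x)
{
    int64_t acc = (int64_t)flt->a0 * x
                + (int64_t)flt->a1 * flt->x1 + (int64_t)flt->a2 * flt->x2
                + (int64_t)flt->b1 * flt->y1 + (int64_t)flt->b2 * flt->y2;
    int32_t y = (int32_t)(acc >> QBITS);
    flt->x2 = flt->x1; flt->x1 = x;
    flt->y2 = flt->y1; flt->y1 = y;
    return y;
}

Cascading sections for a steeper response is then just feeding one BiquadQ14's output into the next. On a part like the JN5168, each 64-bit product compiles to a few 32-bit multiplies and adds, so budget your MACs accordingly.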

Get the most occurring number amongst several integers without using arrays

DISCLAIMER: Rather a theoretical question here, not looking for a correct answer, just asking for some inspiration!
Consider this:
A function is called repetitively and returns integers based on seeds (the same seed returns the same integer). Your task is to find out which integer is returned most often. Easy enough, right?
But: You are not allowed to use arrays or fields to store return values of said function!
Example:
int mostFrequentNumber = 0;
int occurencesOfMostFrequentNumber = 0;
int iterations = 10000000;

for (int i = 0; i < iterations; i++)
{
    int result = getNumberFromSeed(i);
    int occurencesOfResult = magic();
    if (occurencesOfResult > occurencesOfMostFrequentNumber)
    {
        mostFrequentNumber = result;
        occurencesOfMostFrequentNumber = occurencesOfResult;
    }
}
If getNumberFromSeed() returns 2,1,5,18,5,6 and 5 then mostFrequentNumber should be 5 and occurencesOfMostFrequentNumber should be 3 because 5 is returned 3 times.
I know this could easily be solved using a two-dimensional list to store results and occurrences. But imagine for a minute that you cannot use any kind of arrays, lists, dictionaries etc. (maybe because the system running the code has so little memory that you cannot store enough integers at once, or because your prehistoric programming language has no concept of collections).
How would you find mostFrequentNumber and occurencesOfMostFrequentNumber? What does magic() do? (Of course you do not have to stick to the example code. Any ideas are welcome!)
EDIT: I should add that the integers returned by getNumber() are calculated from a seed, so the same seed returns the same integer (i.e. int result = getNumber(5); would always assign the same value to result).
Make a hypothesis: assume that the distribution of the integers is, e.g., Normal.
Start simple. Keep two variables:
- N, the number of elements read so far;
- M1, the average of said elements.
Initialize both variables to 0.
Every time you read a new value x, update N to be N + 1 and M1 to be M1 + (x - M1)/N.
At the end, M1 will equal the average of all values. If the distribution is Normal, this value will have a high frequency.
Now improve the above. Add a third variable:
- M2, which accumulates (x - M1)^2 over all values of x read so far.
Initialize M2 to 0. For every new value x that you read, update N and M1 as above and M2 as:
M2 := M2 + (x - M1)^2 * (N - 1) / N
At every step, M2/N is the variance of the values read so far and sqrt(M2/N) their standard deviation.
As you proceed, remember the frequencies of only those values read so far whose distance to M1 is less than the standard deviation. This requires some additional storage, a small memory of, say, 10 elements or so; however, that array will be very short compared to the high number of iterations you will run. This modification will let you make a better guess at the most frequent value, instead of simply answering with the mean (or average) as above.
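As a concrete sketch of those updates (my illustration; getNumberFromSeed is the question's hypothetical generator), the loop below maintains N, M1 and M2 exactly as described; since M2 accumulates the squared deviations, the variance is M2/N:

#include <math.h>

extern int getNumberFromSeed(int seed);   /* the question's generator */

void runningStats(int iterations, double *mean, double *stddev)
{
    double M1 = 0.0, M2 = 0.0;
    double N = 0.0;
    for (int i = 0; i < iterations; i++)
    {
        double x = (double)getNumberFromSeed(i);
        N += 1.0;
        double delta = x - M1;               /* uses M1 from before the update */
        M1 += delta / N;                     /* M1 := M1 + (x - M1)/N          */
        M2 += delta * delta * (N - 1.0) / N; /* M2 := M2 + (x - M1)^2 (N-1)/N  */
    }
    *mean = M1;
    *stddev = sqrt(M2 / N);                  /* M2/N is the variance */
}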
UPDATE
Given that this is about insights for inspiration, there is plenty of room for considering and adapting the approach I've proposed to any particular situation. Here are some thoughts.
When I say "assume that the distribution is Normal", you should think of it as: given that the problem has no exact solution, is there some qualitative information I can use to decide what kind of distribution the data would have? Given that the algorithm is intended to find the most frequent number, it should be fine to assume that the distribution is not uniform. Let's try Normal, LogNormal, etc. and see what can be found out (more on this below).
If the game completely disallows the use of any array, then fine: keep track of only, say, 10 numbers. This allows you to count the occurrences of the 10 best candidates, which will give more confidence to your answer. In doing this, choose your candidates around the theoretically most likely value according to the distribution of your hypothesis.
You cannot use arrays, but perhaps you can read the sequence of numbers two or three times, not just once. In that case you can read it once to check whether your hypothesis about its distribution is good or bad. For instance, if you compute not just the variance but also the skewness and the kurtosis, you will have more elements with which to check your hypothesis. If the first reading indicates that there is some bias, you could use a LogNormal distribution instead, etc.
Finally, in addition to providing the approximate answer, you would be able to use the information collected during the reading to estimate a confidence interval around your answer.
Alright, I found a decent solution myself:
int mostFrequentNumber = 0;
int occurencesOfMostFrequentNumber = 0;
int iterations = 10000000;
int maxNumber = -2147483647;
int minNumber = 2147483647;

// Step 1: find the largest and smallest number that _can_ occur
for (int i = 0; i < iterations; i++)
{
    int result = getNumberFromSeed(i);
    if (result > maxNumber)
    {
        maxNumber = result;
    }
    if (result < minNumber)
    {
        minNumber = result;
    }
}

// Step 2: for each possible number between minNumber and maxNumber, count occurrences
for (int thisNumber = minNumber; thisNumber <= maxNumber; thisNumber++)
{
    int occurenceOfThisNumber = 0;
    for (int i = 0; i < iterations; i++)
    {
        int result = getNumberFromSeed(i);
        if (result == thisNumber)
        {
            occurenceOfThisNumber++;
        }
    }
    if (occurenceOfThisNumber > occurencesOfMostFrequentNumber)
    {
        occurencesOfMostFrequentNumber = occurenceOfThisNumber;
        mostFrequentNumber = thisNumber;
    }
}
I must admit, this may take a long time, depending on the smallest and largest possible values. But it will work without using arrays.

Fast way to sum Fourier series?

I have generated the coefficients using FFTW; now I want to reconstruct the original data, but using only the first numCoefs coefficients rather than all of them. At the moment I'm using the code below, which is very slow:
for ( unsigned int i = 0; i < length; ++i )
{
    double sum = 0;
    for ( unsigned int j = 0; j < numCoefs; ++j )
    {
        sum += ( coefs[j][0] * cos( j * omega * i ) ) + ( coefs[j][1] * sin( j * omega * i ) );
    }
    data[i] = sum;
}
Is there a faster way?
A much simpler solution would be to zero the unwanted coefficients and then do an IFFT with FFTW. This will be a lot more efficient than doing an IDFT as above.
Note that you may get some artefacts in the time domain when you do this kind of thing - you're effectively multiplying by a step function in the frequency domain, which is equivalent to convolution with a sinc function in the time domain. To reduce the resulting "ringing" in the time domain you should use a window function to smooth out the transition between non-zero and zero coeffs.
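A hedged sketch of that approach, assuming FFTW 3's real-to-complex interface and the question's length and numCoefs (the function name and the copy-then-zero strategy are mine):

#include <fftw3.h>
#include <string.h>

/* Rebuild `data` (length samples) from the first numCoefs bins only. */
void reconstruct(const fftw_complex *coefs,  /* length/2 + 1 bins from r2c */
                 double *data, unsigned length, unsigned numCoefs)
{
    unsigned nBins = length / 2 + 1;
    fftw_complex *bins = (fftw_complex *)fftw_malloc(nBins * sizeof(fftw_complex));
    memcpy(bins, coefs, nBins * sizeof(fftw_complex));

    /* Zero every bin past the ones we want to keep. */
    for (unsigned j = numCoefs; j < nBins; ++j)
        bins[j][0] = bins[j][1] = 0.0;

    /* The c2r transform implies the conjugate-symmetric negative bins. */
    fftw_plan p = fftw_plan_dft_c2r_1d((int)length, bins, data, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);
    fftw_free(bins);

    /* FFTW's transforms are unnormalized: scale by 1/length. */
    for (unsigned i = 0; i < length; ++i)
        data[i] /= length;
}

Replacing the hard zeroing with a gentle taper over the last few kept bins is where the window function mentioned above would go.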
If your numCoefs is anywhere near or greater than log(length), then an IFFT, which is O(n*log(n)) in computational complexity, will most likely be faster, as well as pre-optimized for you. Just zero all the bins except for the coefficients you want to keep, and make sure to also keep their negative-frequency complex conjugates if you want a real result.
If your numCoefs is small relative to log(length), then other optimizations you could try include using sinf() and cosf() if you don't really need more than 6 digits of precision, and pre-calculating omega*i outside the inner loop (although your compiler should be doing that for you unless you have the optimization setting low or off).
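For instance, here is a sketch of the question's loop with both of those tweaks applied (illustrative only; the names match the question, the function wrapper is mine):

#include <math.h>

/* Same summation as in the question, with sinf/cosf and omega*i hoisted. */
void reconstructDirect(double (*coefs)[2], double *data,
                       unsigned length, unsigned numCoefs, double omega)
{
    for (unsigned i = 0; i < length; ++i)
    {
        float sum = 0.0f;
        const float wi = (float)(omega * i);  /* hoisted out of the inner loop */
        for (unsigned j = 0; j < numCoefs; ++j)
            sum += (float)coefs[j][0] * cosf(j * wi)
                 + (float)coefs[j][1] * sinf(j * wi);
        data[i] = sum;
    }
}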

CUDAFunctionLoad in Mathematica - Indexing problem

I am trying to debug an indexing problem I am having on my CUDA machine.
Cuda Machine Info:
{1->{Name->Tesla C2050,Clock Rate->1147000,Compute Capabilities->2.,GPU Overlap->1,Maximum Block Dimensions->{1024,1024,64},Maximum Grid Dimensions->{65535,65535,65535},Maximum Threads Per Block->1024,Maximum Shared Memory Per Block->49152,Total Constant Memory->65536,Warp Size->32,Maximum Pitch->2147483647,Maximum Registers Per Block->32768,Texture Alignment->512,Multiprocessor Count->14,Core Count->448,Execution Timeout->0,Integrated->False,Can Map Host Memory->True,Compute Mode->Default,Texture1D Width->65536,Texture2D Width->65536,Texture2D Height->65535,Texture3D Width->2048,Texture3D Height->2048,Texture3D Depth->2048,Texture2D Array Width->16384,Texture2D Array Height->16384,Texture2D Array Slices->2048,Surface Alignment->512,Concurrent Kernels->True,ECC Enabled->True,Total Memory->2817982462},
All this code does is set the values of a 3D array equal to the index that CUDA is using:
__global__ void cudaMatExp(
    float *matrix1, float *matrixStore, int lengthx, int lengthy, int lengthz)
{
    long UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
    long index = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x +
                 threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x +
                 threadIdx.x;
    if (index < lengthx * lengthy * lengthz) {
        matrixStore[index] = index;
    }
}
For some reason, once the dimension of my 3D array becomes too large, the indexing stops.
I have tried different block dimensions (blockDim.x by blockDim.y by blockDim.z):
8x8x8 only gives correct indexing up to array dimension 12x12x12
9x9x9 only gives correct indexing up to array dimension 14x14x14
10x10x10 only gives correct indexing up to array dimension 15x15x15
For dimensions larger than these, the maximum index reached under all of the different block sizes does eventually start to increase again, but it never reaches dim^3 - 1 (which is the maximum index that the CUDA threads should reach).
Here are some plots that illustrate this behavior (images omitted). For example, one plots on the x axis the dimension of the 3D array (which is dim x dim x dim), and on the y axis the maximum index number that is processed during the CUDA execution; that particular plot is for block dimensions of 10x10x10.
Here is the (Mathematica) code to generate that plot, but when I ran this one, I used block dimensions of 1024x1x1:
CUDAExp = CUDAFunctionLoad[codeexp, "cudaMatExp",
{{"Float", _,"Input"}, {"Float", _,"Output"},
_Integer, _Integer, _Integer},
{1024, 1, 1}]; (*These last three numbers are the block dimensions*)
max = 100; (* the maximum dimension of the 3D array *)
hold = Table[1, {i, 1, max}];
compare = Table[i^3, {i, 1, max}];
Do[
dim = ii;
AA = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
"TargetPrecision" -> "Single"];
BB = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
"TargetPrecision" -> "Single"];
hold[[ii]] = Max[Flatten[
CUDAMemoryGet[CUDAExp[AA, BB, dim, dim, dim][[1]]]]];
, {ii, 1, max}]
ListLinePlot[{compare, Flatten[hold]}, PlotRange -> All]
A second plot is the same, but adds x^3 to compare against where the maximum index should be. Notice that it diverges once the dimension of the array exceeds 32.
I test the dimensions of the 3D array, look at how far the indexing goes, and compare it with dim^3 - 1. E.g. for dim = 32, the CUDA max index is 32767 (which is 32^3 - 1), but for dim = 33 the CUDA output is 33791 when it should be 35936 (33^3 - 1). Notice that 33791 - 32767 = 1024 = blockDim.x.
Question:
Is there a way to correctly index an array with dimensions larger than the block dimensions in Mathematica?
Now, I know that some people use __mul24(threadIdx.y,blockDim.x) in their index equation to prevent errors in bit multiplication, but it doesn't seem to help in my case.
Also, I have seen someone mention that you should compile your code with -arch=sm_11 because by default it's compiled for compute capability 1.0. I don't know if this is the case in Mathematica, though. I would assume that CUDAFunctionLoad[] knows to compile with 2.0 capability. Anyone know?
Any suggestions would be extremely helpful!
So, Mathematica kind of has a hidden way of dealing with grid dimensions. To fix your grid dimension to something that will work, you have to add another number to the end of the function you are calling.
That argument denotes the number of threads to launch (or grid dimension times block dimension).
For example, in my code above:
CUDAExp =
CUDAFunctionLoad[codeexp,
"cudaMatExp", {
{"Float", _, "Input"}, {"Float", _,"Output"},
_Integer, _Integer, _Integer},
{8, 8, 8}, "ShellOutputFunction" -> Print];
{8, 8, 8} denotes the dimensions of the block.
When you call CUDAExp[] in Mathematica, you can add an argument that denotes the number of threads to launch.
In this example I finally got it to work with the following:
(* AA and BB are 3D arrays of 0 with dimensions dim x dim x dim *)
dim = 64;
CUDAExp[AA, BB, dim, dim, dim, 4096];
Note that when you compile with CUDAFunctionLoad[], it expects only 5 arguments: the first is the input array (of dimensions dim x dim x dim), the second is where its memory is stored, and the third, fourth, and fifth are the dimensions.
When you pass it a 6th, Mathematica translates that as gridDim.x * blockDim.x; so, since I know I need gridDim.x = 512 in order for every element in the array to be dealt with, I set this number equal to 512 * 8 = 4096.
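As a sanity check of that arithmetic (my addition, using the numbers above):

(* dim = 64 gives 64^3 = 262144 elements; an 8x8x8 block has 512 threads *)
64^3/(8*8*8)   (* 512 blocks needed, so gridDim.x = 512 *)
512*8          (* 4096 = gridDim.x * blockDim.x, the 6th argument *)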
I hope this is clear and useful to someone in the future that comes across this issue.

What's the fastest way to divide an integer by 3?

int x = n / 3; // <-- make this faster
// for instance
int a = n * 3; // <-- normal integer multiplication
int b = (n << 1) + n; // <-- potentially faster multiplication
The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest, what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64-bit result of that. You can demonstrate this to yourself with, e.g.:
$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$
The Wikipedia page on Montgomery division is hard to read, but fortunately the compiler guys have done it so you don't have to.
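The same demonstration in C (a sketch; valid for non-negative n, since for negative n the compiler emits an extra sign-correction step):

#include <stdint.h>

/* What gcc's output boils down to for a / 3 when a >= 0: multiply by the
   reciprocal 0x55555556 (about 2^32 / 3) and keep the top 32 bits. */
static int div3_nonneg(int n)
{
    return (int)(((int64_t)n * 0x55555556LL) >> 32);
}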
This is the fastest approach, as the compiler will optimize it if it can, depending on the target processor.
int a;
int b;
a = some value;
b = a / 3;
There is a faster way to do it if you know the ranges of the values: for example, if you are dividing a signed integer by 3 and you know the range of the value to be divided is 0 to 768, then you can multiply it by a precomputed factor and shift right by a power of 2, where the factor is that power of 2 divided by 3.
e.g.
Range 0 -> 768
You could use a shift of 10 bits, i.e. a scale of 1024; you want to divide by 3, so your multiplier should be 1024 / 3 = 341,
so you can now use (x * 341) >> 10.
(Make sure the shift is an arithmetic shift if you are using signed integers, and also make sure it is an actual shift and not a bit rotate.)
Note that 341/1024 slightly underestimates 1/3, so the result can come out one less than the true quotient (e.g. (3 * 341) >> 10 gives 0); if you need exact results, (x * 683) >> 11 is exact over 0..2047 at the cost of one more bit of headroom.
This will effectively divide the value by 3, and it will run at about 1.6 times the speed of a native divide by 3 on a standard x86 / x64 CPU.
Of course, the only reason you can make this optimization when the compiler can't is that the compiler does not know the maximum range of x and therefore cannot make this determination, but you as the programmer can.
Sometimes it may even be more beneficial to widen the value to a larger type and then do the same thing, i.e. if you have an int of full range you could make it a 64-bit value and then do the multiply and shift instead of dividing by 3.
I had to do this recently to speed up image processing: I needed to find the average of 3 color channels (red, green and blue), each color channel with a byte range (0 - 255).
At first I simply used:
avg = (r + g + b) / 3;
(So r + g + b has a maximum of 765 and a minimum of 0, because each channel is a byte, 0 - 255.)
After millions of iterations, the entire operation took 36 milliseconds.
I changed the line to:
avg = (r + g + b) * 341 >> 10;
And that took it down to 22 milliseconds. It's amazing what can be done with a little ingenuity.
This speed-up occurred in C# even though I had optimisations turned on and was running the program natively, without debugging info and not through the IDE.
See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.
Also relevant:
Optimizing integer divisions with Multiply Shift in C#
Depending on your platform and on your C compiler, a native solution like just using
y = x / 3
can be fast or it can be awfully slow (even if division is done entirely in hardware; if it is done using a DIV instruction, this instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.
For optimization it is important to have integer numbers of a known size. In C, int has no known size (it can vary by platform and compiler!), so you are better off using C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that your C compiler knows about 64-bit integer numbers (NOTE: even on a 32-bit CPU architecture, most C compilers can handle 64-bit integers just fine):
#include <stdint.h>

static inline uint32_t divby3(uint32_t divideMe)
{
    return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}
As crazy as this might sound, the method above really does divide by 3. All it needs to do so is a single 64-bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64-bit application this code will be a lot faster than in a 32-bit application (in a 32-bit application, multiplying two 64-bit numbers takes 3 multiplications and 3 additions on 32-bit values); however, it might still be faster than a division on a 32-bit machine.
On the other hand, if your compiler is a very good one and knows the trick of optimizing integer division by a constant (the latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers... you cannot rely on or expect them to use tricks like that, even though this method is very well documented and mentioned everywhere on the Internet.
The problem is that it only works for constant divisors, not for variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases), and both are different depending on the number you want to divide by, and both take too much CPU time to calculate on the fly (that would be slower than hardware division). However, it's easy for a compiler to calculate these during compile time (where a second more or less of compile time hardly plays a role).
For 64 bit numbers:
uint64_t divBy3(uint64_t x)
{
return x*12297829382473034411ULL;
}
However, this isn't the truncating integer division you might expect.
It works correctly if the number is already divisible by 3, but it returns a huge number if it isn't.
For example, if you run it on 11, it returns 6148914691236517209. This looks like garbage, but it's in fact the correct answer: multiply it by 3 (letting it wrap around modulo 2^64) and you get back 11!
If you are looking for the truncating division, then just use the / operator. I highly doubt you can get much faster than that.
Theory:
64-bit unsigned arithmetic is arithmetic modulo 2^64.
This means that for each integer which is coprime with the modulus 2^64 (essentially, all odd numbers) there exists a multiplicative inverse which you can multiply by instead of dividing. This magic number can be obtained by solving the equation 3*x + 2^64*y = 1 using the extended Euclidean algorithm.
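A quick sanity check of that theory (my illustration): multiplying the magic constant by 3 wraps around to exactly 1 modulo 2^64, which is why exact multiples divide correctly.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t inv3 = 12297829382473034411ULL;       /* 0xAAAAAAAAAAAAAAAB */
    printf("%llu\n", (unsigned long long)(3 * inv3));    /* 1 */
    printf("%llu\n", (unsigned long long)(27 * inv3));   /* 9 */
    printf("%llu\n", (unsigned long long)(11 * inv3));   /* 6148914691236517209 */
    return 0;
}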
What if you really don't want to multiply or divide? Here is an approximation I just invented. It works because (x/3) = (x/4) + (x/12). But since (x/12) = (x/4) / 3, we just have to repeat the process until it's good enough.
#include <stdio.h>

int main(void)
{
    int n = 1000;
    int a, b;
    a = n >> 2;
    b = (a >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    b = (b >> 2);
    a += b;
    printf("a=%d\n", a);
    return 0;
}
The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.
If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.
This technique is most useful in Indiana.
I don't know if it's faster, but if you want to use a bitwise operator to perform binary division you can use the shift-and-subtract method described at this page (a C sketch follows the steps below):
Set quotient to 0
Align leftmost digits in dividend and divisor
Repeat:
    If that portion of the dividend above the divisor is greater than or equal to the divisor:
        Then subtract divisor from that portion of the dividend and
        Concatenate 1 to the right hand end of the quotient
    Else concatenate 0 to the right hand end of the quotient
    Shift the divisor one place right
Until dividend is less than the divisor:
quotient is correct, dividend is remainder
STOP
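Here is a sketch of those steps in C (my restoring-division illustration, specialized to unsigned 32-bit values; the divisor must be non-zero):

#include <stdint.h>

uint32_t divShiftSub(uint32_t dividend, uint32_t divisor, uint32_t *remainder)
{
    uint32_t quotient = 0, rem = 0;

    /* Walk the dividend from its leftmost bit, shifting each bit into the
       running remainder; subtracting the divisor whenever it fits appends
       a 1 to the quotient, otherwise a 0. */
    for (int i = 31; i >= 0; --i)
    {
        rem = (rem << 1) | ((dividend >> i) & 1u);
        quotient <<= 1;
        if (rem >= divisor)
        {
            rem -= divisor;
            quotient |= 1u;
        }
    }
    if (remainder) *remainder = rem;
    return quotient;   /* quotient is correct, rem is the remainder */
}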
For really large integer division (e.g. numbers bigger than 64 bits) you can represent your number as an int[] and perform the division quite fast by taking two digits at a time and dividing them by 3. The remainder becomes part of the next two digits, and so forth.
e.g. for 11004 / 3 you say
11/3 = 3, remainder = 2 (from 11 - 3*3)
20/3 = 6, remainder = 2 (from 20 - 6*3)
20/3 = 6, remainder = 2 (from 20 - 6*3)
24/3 = 8, remainder = 0
hence the result 3668
internal static List<int> Div3(int[] a)
{
    int remainder = 0;
    var res = new List<int>();
    for (int i = 0; i < a.Length; i++)
    {
        var val = remainder + a[i];
        var div = val / 3;
        remainder = 10 * (val % 3);
        if (div > 9)
        {
            res.Add(div / 10);
            res.Add(div % 10);
        }
        else
            res.Add(div);
    }
    if (res[0] == 0) res.RemoveAt(0);
    return res;
}
If you are really interested, see this article on integer division; it only has academic merit, though... it would be interesting to find an application that actually needed to perform division and benefited from that kind of trick.
Easy computation... at most n iterations, where n is your number of bits:
int8_t divideby3(int8_t x)
{
    /* Sums the alternating series x/2 - x/4 + x/8 - ... = x/3.
       Signed arithmetic is required: with unsigned values the negation
       wraps around and the loop may never terminate. Truncation in the
       shifts means the result can be off by one. */
    int8_t answer = 0;
    do
    {
        x >>= 1;        /* assumes arithmetic shift for negative x */
        answer += x;
        x = -x;
    } while (x);
    return answer;
}
A lookup table approach would also be faster on some architectures.
#include <stdint.h>

uint8_t DivBy3LU(uint8_t u8Operand)
{
    static const uint8_t ai8Div3[256] = {
        0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, /* ... continue the pattern up to 255/3 = 85 */
    };
    return ai8Div3[u8Operand];
}