Related
In Verilog code
case ({Q[0], Q_1})
2'b0_1 :begin
A<=sum[7]; Q<=sum; Q_1<=Q;
end
2'b1_0 : begin
A<=difference[7]; Q<=difference; Q_1<=Q;
end
default: begin
A<=A[7]; Q<=A; Q_1<=Q;
end
endcase
is above code is same as below code
case ({Q[0], Q_1})
2'b0_1 : {A, Q, Q_1} <= {sum[7], sum, Q};
2'b1_0 : {A, Q, Q_1} <= {difference[7], difference, Q};
default: {A, Q, Q_1} <= {A[7], A, Q};
endcase
If yes then why i am getting different result?
Edit:-A, Q, sum and difference are all 8-bit values and Q_1 is a 1-bit value.
No, these are not the same. The concatenation operator ({ ... }) allows you to create vectors from several different signals, allowing you to both use these vectors and assign to these vectors, resulting in the assignment of the component signals will the appropriate bits from the result. From your previous question (Please Explain these verilog code?), I see that A, Q, sum and difference are all 8-bit values and Q_1 is a 1-bit value. Lets examine the first assignment (noting that the other three work the same way):
{A, Q, Q_1} <= {sum[7], sum, Q};
If we look at the right-hand side, we can see that the result of the concatenation is a 17-bit vector, as sum[7] is 1 bit (the MSb of sum), sum is 8 bits, and Q is 8 bits (1 + 8 + 8 = 17). Lets say sum = 8'b10100101 and Q = 8'b00110110, what would {sum[7], sum, Q} look like? Well, its the concatenation of the values from sum and Q so it would be 17'b1_10100101_00110110, the first bit coming from sum[7], the next 8 bits from sum and the final 8 bits from Q.
Now we have to assign this 17-bit value to the left hand side. On the left, we have {A, Q, Q_1}, which is also 17 bits (A is 8 bits, Q is 8 bits and Q_1 is 1 bit). However, we have to assign the bits from our 17-bit value we got above to the proper signals that make up this new 17-bit vector, that means the 8 most significant bits go into A, the next 8 bits go into Q and the least significant bit go into Q_1. So, if we take our value from above (17'b1_10100101_00110110), and split it up this way (17'b11010010_10011011_0), we see A = 8'b11010010, Q = 8'b10011011 and Q_1 = 1'b0. Thus, this is not the same as assigning A = sum[7], Q = sum and Q_1 = Q (this would result in A = 8'b00000001, Q = 8'b10100101, Q_1 = 1'b0, with many bits of Q being lost and A having 7 extra bits).
However, this doesnt mean we cant split up the left-hand side concatenation, it would just look like this:
A <= {sum[7], sum[7:1]};
Q <= {sum[0], Q[7:1]};
Q_1 <= Q[0];
Yes, they are same. For example try this small code and check, the output is same :
module test;
wire A,B,C;
reg p,q,r;
initial
begin
p=1; q=1; r=0;
end
assign {A,B,C} = {p,q,r};
initial #1 $display("%b %b %b",A,B,C);
endmodule
In general if you want to understand concatenation operator, you can refer here
Edit : I have assumed A and p , B and q, C and r of same length.
When we say 4K in hardware it is equal to the value 4096 which is 11 bits. What would be the value for 2G and how many bits represent this value?
Thanks
Often in CS we deal with number that are necessarily power of two (all addressable quantities for example).
In this context is it more useful to have prefixes that instead of being multiple of ten, like the decimal K = 10^3, M = 10^6, G = 10^9, are multiple of two.
Since the power of two closest to 1000, which is decimal K, is 1024 = 2^10, we can make the analogy that in CS K 1024 instead of 1000.
This is rather confusing as some quantities (like disk sizes or transmission channel parameters) are not bound to be power of two and can be given with either the decimal K or the CS K.
To avoid further confusion the CS now use appropriate binary prefixes, for example the CS K now is the Ki.
So as in decimal G is 10^9 = (10^3)^3 which you can think of as K^3 then G in binary (better called Gi) is Ki^3 = (2^10)^3 = 2^30.
To represent 4Ki quantities you need 12 bits as log2(4Ki) = log2(2^2 * 2^10) = 12.
To represent 2Gi quantities you need log2(2Gi) = log2(2 * 2^30) = 31 bits.
Note I used the phrase "To represent 4Ki quantities" rather then "To represent the 4Ki quantity", the latter is different and need one more bit. This is analogous to saying that to represent 1000 quantities we need 3 decimal digits (from 000 to 999) but to represent the number 1000 itself we need 4 digits (1, 0, 0 and 0).
I am currently trying to figure out how to multiply two numbers in fixed point representation.
Say my number representation is as follows:
[SIGN][2^0].[2^-1][2^-2]..[2^-14]
In my case, the number 10.01000000000000 = -0.25.
How would I for example do 0.25x0.25 or -0.25x0.25 etc?
Hope you can help!
You should use 2's complement representation instead of a seperate sign bit. It's much easier to do maths on that, no special handling is required. The range is also improved because there's no wasted bit pattern for negative 0. To multiply, just do as normal fixed-point multiplication. The normal Q2.14 format will store value x/214 for the bit pattern of x, therefore if we have A and B then
So you just need to multiply A and B directly then divide the product by 214 to get the result back into the form x/214 like this
AxB = ((int32_t)A*B) >> 14;
A rounding step is needed to get the nearest value. You can find the way to do it in Q number format#Math operations. The simplest way to round to nearest is just add back the bit that was last shifted out (i.e. the first fractional bit) like this
AxB = (int32_t)A*B;
AxB = (AxB >> 14) + ((AxB >> 13) & 1);
You might also want to read these
Fixed-point arithmetic.
Emulated Fixed Point Division/Multiplication
Fixed point math in c#?
With 2 bits you can represent the integer range of [-2, 1]. So using Q2.14 format, -0.25 would be stored as 11.11000000000000. Using 1 sign bit you can only represent -1, 0, 1, and it makes calculations more complex because you need to split the sign bit then combine it back at the end.
Multiply into a larger sized variable, and then right shift by the number of bits of fixed point precision.
Here's a simple example in C:
int a = 0.25 * (1 << 16);
int b = -0.25 * (1 << 16);
int c = (a * b) >> 16;
printf("%.2f * %.2f = %.2f\n", a / 65536.0, b / 65536.0 , c / 65536.0);
You basically multiply everything by a constant to bring the fractional parts up into the integer range, then multiply the two factors, then (optionally) divide by one of the constants to return the product to the standard range for use in future calculations. It's like multiplying prices expressed in fractional dollars by 100 and then working in cents (i.e. $1.95 * 100 cents/dollar = 195 cents).
Be careful not to overflow the range of the variable you are multiplying into. Your constant might need to be smaller to avoid overflow, like using 1 << 8 instead of 1 << 16 in the example above.
I am trying to debug an index problem I am having on my CUDA machine
Cuda Machine Info:
{1->{Name->Tesla C2050,Clock Rate->1147000,Compute Capabilities->2.,GPU Overlap->1,Maximum Block Dimensions->{1024,1024,64},Maximum Grid Dimensions->{65535,65535,65535},Maximum Threads Per Block->1024,Maximum Shared Memory Per Block->49152,Total Constant Memory->65536,Warp Size->32,Maximum Pitch->2147483647,Maximum Registers Per Block->32768,Texture Alignment->512,Multiprocessor Count->14,Core Count->448,Execution Timeout->0,Integrated->False,Can Map Host Memory->True,Compute Mode->Default,Texture1D Width->65536,Texture2D Width->65536,Texture2D Height->65535,Texture3D Width->2048,Texture3D Height->2048,Texture3D Depth->2048,Texture2D Array Width->16384,Texture2D Array Height->16384,Texture2D Array Slices->2048,Surface Alignment->512,Concurrent Kernels->True,ECC Enabled->True,Total Memory->2817982462},
All this code does is set the values of a 3D array equal to the index that CUDA is using:
__global __ void cudaMatExp(
float *matrix1, float *matrixStore, int lengthx, int lengthy, int lengthz){
long UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
long index = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x +
threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x +
threadIdx.x;
if (index < lengthx*lengthy*lengthz) {
matrixStore[index] = index;
}
}
For some reason, once the dimension of my 3D array becomes too large, the indexing stops.
I have tried different block dimensions (blockDim.x by blockDim.y by blockDim.z):
8x8x8 only gives correct indexing up to array dimension 12x12x12
9x9x9 only gives correct indexing up to array dimension 14x14x14
10x10x10 only gives correct indexing up to array dimension 15x15x15
For dimensions larger than these all of the different block sizes eventually start to increase again, but they never reach a value of dim^3-1 (which is the maximum index that the cuda thread should reach)
Here are some plots that illustrate this behavior:
For example: This is plotting on the x axis the dimension of the 3D array (which is xxx), and on the y axis the maximum index number that is processed during the cuda execution. This particular plot is for block dimensions of 10x10x10.
Here is the (Mathematica) code to generate that plot, but when I ran this one, I used block dimensions of 1024x1x1:
CUDAExp = CUDAFunctionLoad[codeexp, "cudaMatExp",
{{"Float", _,"Input"}, {"Float", _,"Output"},
_Integer, _Integer, _Integer},
{1024, 1, 1}]; (*These last three numbers are the block dimensions*)
max = 100; (* the maximum dimension of the 3D array *)
hold = Table[1, {i, 1, max}];
compare = Table[i^3, {i, 1, max}];
Do[
dim = ii;
AA = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
"TargetPrecision" -> "Single"];
BB = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
"TargetPrecision" -> "Single"];
hold[[ii]] = Max[Flatten[
CUDAMemoryGet[CUDAExp[AA, BB, dim, dim, dim][[1]]]]];
, {ii, 1, max}]
ListLinePlot[{compare, Flatten[hold]}, PlotRange -> All]
This is the same plot, but now plotting x^3 to compare to where it should be. Notice that it diverges after the dimension of the array is >32
I test the dimensions of the 3D array and look at how far the indexing goes and compare it with dim^3-1. E.g. for dim=32, the cuda max index is 32767 (which is 32^3 -1), but for dim=33 the cuda output is 33791 when it should be 35936 (33^3 -1). Notice that 33791-32767 = 1024 = blockDim.x
Question:
Is there a way to correctly index an array with dimensions larger than the block dimensions in Mathematica?
Now, I know that some people use __mul24(threadIdx.y,blockDim.x) in their index equation to prevent errors in bit multiplication, but it doesn't seem to help in my case.
Also, I have seen someone mention that you should compile your code with -arch=sm_11 because by default it's compiled for compute capability 1.0. I don't know if this is the case in Mathematica though. I would assume that CUDAFunctionLoad[] knows to compile with 2.0 capability. Any one know?
Any suggestions would be extremely helpful!
So, Mathematica kind of has a hidden way of dealing with grid dimensions, to fix your grid dimension to something that will work, you have to add another number to the end of the function you are calling.
The argument denotes the number of threads to launch (or grid dimension times block dimension).
For example, in my code above:
CUDAExp =
CUDAFunctionLoad[codeexp,
"cudaMatExp", {
{"Float", _, "Input"}, {"Float", _,"Output"},
_Integer, _Integer, _Integer},
{8, 8, 8}, "ShellOutputFunction" -> Print];
(8,8,8) denotes the dimension of the block.
When you call CUDAExp[] in mathematica, you can add an argument that denotes the number of threads to launch:
In this example I finally got it to work with the following:
// AA and BB are 3D arrays of 0 with dimensions dim^3
dim = 64;
CUDAExp[AA, BB, dim, dim, dim, 4089];
Note that when you compile with CUDAFunctionLoad[], it only expects 5 inputs, the first is the array you pass it (of dimensions dim x dim x dim) and the second is where the memory of it is stored. The third, fourth, and fifth are the dimensions.
When you pass it a 6th, mathematica translates that as gridDim.x * blockDim.x, so, since I know I need gridDim.x = 512 in order for every element in the array to be dealt with, I set this number equal to 512 * 8 = 4089.
I hope this is clear and useful to someone in the future that comes across this issue.
int x = n / 3; // <-- make this faster
// for instance
int a = n * 3; // <-- normal integer multiplication
int b = (n << 1) + n; // <-- potentially faster multiplication
The guy who said "leave it to the compiler" was right, but I don't have the "reputation" to mod him up or comment. I asked gcc to compile int test(int a) { return a / 3; } for an ix86 and then disassembled the output. Just for academic interest, what it's doing is roughly multiplying by 0x55555556 and then taking the top 32 bits of the 64 bit result of that. You can demonstrate this to yourself with eg:
$ ruby -e 'puts(60000 * 0x55555556 >> 32)'
20000
$ ruby -e 'puts(72 * 0x55555556 >> 32)'
24
$
The wikipedia page on Montgomery division is hard to read but fortunately the compiler guys have done it so you don't have to.
This is the fastest as the compiler will optimize it if it can depending on the output processor.
int a;
int b;
a = some value;
b = a / 3;
There is a faster way to do it if you know the ranges of the values, for example, if you are dividing a signed integer by 3 and you know the range of the value to be divided is 0 to 768, then you can multiply it by a factor and shift it to the left by a power of 2 to that factor divided by 3.
eg.
Range 0 -> 768
you could use shifting of 10 bits, which multiplying by 1024, you want to divide by 3 so your multiplier should be 1024 / 3 = 341,
so you can now use (x * 341) >> 10
(Make sure the shift is a signed shift if using signed integers), also make sure the shift is an actually shift and not a bit ROLL
This will effectively divide the value 3, and will run at about 1.6 times the speed as a natural divide by 3 on a standard x86 / x64 CPU.
Of course the only reason you can make this optimization when the compiler cant is because the compiler does not know the maximum range of X and therefore cannot make this determination, but you as the programmer can.
Sometime it may even be more beneficial to move the value into a larger value and then do the same thing, ie. if you have an int of full range you could make it an 64-bit value and then do the multiply and shift instead of dividing by 3.
I had to do this recently to speed up image processing, i needed to find the average of 3 color channels, each color channel with a byte range (0 - 255). red green and blue.
At first i just simply used:
avg = (r + g + b) / 3;
(So r + g + b has a maximum of 768 and a minimum of 0, because each channel is a byte 0 - 255)
After millions of iterations the entire operation took 36 milliseconds.
I changed the line to:
avg = (r + g + b) * 341 >> 10;
And that took it down to 22 milliseconds, its amazing what can be done with a little ingenuity.
This speed up occurred in C# even though I had optimisations turned on and was running the program natively without debugging info and not through the IDE.
See How To Divide By 3 for an extended discussion of more efficiently dividing by 3, focused on doing FPGA arithmetic operations.
Also relevant:
Optimizing integer divisions with Multiply Shift in C#
Depending on your platform and depending on your C compiler, a native solution like just using
y = x / 3
Can be fast or it can be awfully slow (even if division is done entirely in hardware, if it is done using a DIV instruction, this instruction is about 3 to 4 times slower than a multiplication on modern CPUs). Very good C compilers with optimization flags turned on may optimize this operation, but if you want to be sure, you are better off optimizing it yourself.
For optimization it is important to have integer numbers of a known size. In C int has no known size (it can vary by platform and compiler!), so you are better using C99 fixed-size integers. The code below assumes that you want to divide an unsigned 32-bit integer by three and that you C compiler knows about 64 bit integer numbers (NOTE: Even on a 32 bit CPU architecture most C compilers can handle 64 bit integers just fine):
static inline uint32_t divby3 (
uint32_t divideMe
) {
return (uint32_t)(((uint64_t)0xAAAAAAABULL * divideMe) >> 33);
}
As crazy as this might sound, but the method above indeed does divide by 3. All it needs for doing so is a single 64 bit multiplication and a shift (like I said, multiplications might be 3 to 4 times faster than divisions on your CPU). In a 64 bit application this code will be a lot faster than in a 32 bit application (in a 32 bit application multiplying two 64 bit numbers take 3 multiplications and 3 additions on 32 bit values) - however, it might be still faster than a division on a 32 bit machine.
On the other hand, if your compiler is a very good one and knows the trick how to optimize integer division by a constant (latest GCC does, I just checked), it will generate the code above anyway (GCC will create exactly this code for "/3" if you enable at least optimization level 1). For other compilers... you cannot rely or expect that it will use tricks like that, even though this method is very well documented and mentioned everywhere on the Internet.
Problem is that it only works for constant numbers, not for variable ones. You always need to know the magic number (here 0xAAAAAAAB) and the correct operations after the multiplication (shifts and/or additions in most cases) and both is different depending on the number you want to divide by and both take too much CPU time to calculate them on the fly (that would be slower than hardware division). However, it's easy for a compiler to calculate these during compile time (where one second more or less compile time plays hardly a role).
For 64 bit numbers:
uint64_t divBy3(uint64_t x)
{
return x*12297829382473034411ULL;
}
However this isn't the truncating integer division you might expect.
It works correctly if the number is already divisible by 3, but it returns a huge number if it isn't.
For example if you run it on for example 11, it returns 6148914691236517209. This looks like a garbage but it's in fact the correct answer: multiply it by 3 and you get back the 11!
If you are looking for the truncating division, then just use the / operator. I highly doubt you can get much faster than that.
Theory:
64 bit unsigned arithmetic is a modulo 2^64 arithmetic.
This means for each integer which is coprime with the 2^64 modulus (essentially all odd numbers) there exists a multiplicative inverse which you can use to multiply with instead of division. This magic number can be obtained by solving the 3*x + 2^64*y = 1 equation using the Extended Euclidean Algorithm.
What if you really don't want to multiply or divide? Here is is an approximation I just invented. It works because (x/3) = (x/4) + (x/12). But since (x/12) = (x/4) / 3 we just have to repeat the process until its good enough.
#include <stdio.h>
void main()
{
int n = 1000;
int a,b;
a = n >> 2;
b = (a >> 2);
a += b;
b = (b >> 2);
a += b;
b = (b >> 2);
a += b;
b = (b >> 2);
a += b;
printf("a=%d\n", a);
}
The result is 330. It could be made more accurate using b = ((b+2)>>2); to account for rounding.
If you are allowed to multiply, just pick a suitable approximation for (1/3), with a power-of-2 divisor. For example, n * (1/3) ~= n * 43 / 128 = (n * 43) >> 7.
This technique is most useful in Indiana.
I don't know if it's faster but if you want to use a bitwise operator to perform binary division you can use the shift and subtract method described at this page:
Set quotient to 0
Align leftmost digits in dividend and divisor
Repeat:
If that portion of the dividend above the divisor is greater than or equal to the divisor:
Then subtract divisor from that portion of the dividend and
Concatentate 1 to the right hand end of the quotient
Else concatentate 0 to the right hand end of the quotient
Shift the divisor one place right
Until dividend is less than the divisor:
quotient is correct, dividend is remainder
STOP
For really large integer division (e.g. numbers bigger than 64bit) you can represent your number as an int[] and perform division quite fast by taking two digits at a time and divide them by 3. The remainder will be part of the next two digits and so forth.
eg. 11004 / 3 you say
11/3 = 3, remaineder = 2 (from 11-3*3)
20/3 = 6, remainder = 2 (from 20-6*3)
20/3 = 6, remainder = 2 (from 20-6*3)
24/3 = 8, remainder = 0
hence the result 3668
internal static List<int> Div3(int[] a)
{
int remainder = 0;
var res = new List<int>();
for (int i = 0; i < a.Length; i++)
{
var val = remainder + a[i];
var div = val/3;
remainder = 10*(val%3);
if (div > 9)
{
res.Add(div/10);
res.Add(div%10);
}
else
res.Add(div);
}
if (res[0] == 0) res.RemoveAt(0);
return res;
}
If you really want to see this article on integer division, but it only has academic merit ... it would be an interesting application that actually needed to perform that benefited from that kind of trick.
Easy computation ... at most n iterations where n is your number of bits:
uint8_t divideby3(uint8_t x)
{
uint8_t answer =0;
do
{
x>>=1;
answer+=x;
x=-x;
}while(x);
return answer;
}
A lookup table approach would also be faster in some architectures.
uint8_t DivBy3LU(uint8_t u8Operand)
{
uint8_t ai8Div3 = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, ....];
return ai8Div3[u8Operand];
}