CUDAFunctionLoad in Mathematica - Indexing problem - indexing

I am trying to debug an index problem I am having on my CUDA machine
Cuda Machine Info:
{1->{Name->Tesla C2050,Clock Rate->1147000,Compute Capabilities->2.,GPU Overlap->1,Maximum Block Dimensions->{1024,1024,64},Maximum Grid Dimensions->{65535,65535,65535},Maximum Threads Per Block->1024,Maximum Shared Memory Per Block->49152,Total Constant Memory->65536,Warp Size->32,Maximum Pitch->2147483647,Maximum Registers Per Block->32768,Texture Alignment->512,Multiprocessor Count->14,Core Count->448,Execution Timeout->0,Integrated->False,Can Map Host Memory->True,Compute Mode->Default,Texture1D Width->65536,Texture2D Width->65536,Texture2D Height->65535,Texture3D Width->2048,Texture3D Height->2048,Texture3D Depth->2048,Texture2D Array Width->16384,Texture2D Array Height->16384,Texture2D Array Slices->2048,Surface Alignment->512,Concurrent Kernels->True,ECC Enabled->True,Total Memory->2817982462},
All this code does is set the values of a 3D array equal to the index that CUDA is using:
__global __ void cudaMatExp(
float *matrix1, float *matrixStore, int lengthx, int lengthy, int lengthz){
long UniqueBlockIndex = blockIdx.y * gridDim.x + blockIdx.x;
long index = UniqueBlockIndex * blockDim.z * blockDim.y * blockDim.x +
threadIdx.z * blockDim.y * blockDim.x + threadIdx.y * blockDim.x +
threadIdx.x;
if (index < lengthx*lengthy*lengthz) {
matrixStore[index] = index;
}
}
For some reason, once the dimension of my 3D array becomes too large, the indexing stops.
I have tried different block dimensions (blockDim.x by blockDim.y by blockDim.z):
8x8x8 only gives correct indexing up to array dimension 12x12x12
9x9x9 only gives correct indexing up to array dimension 14x14x14
10x10x10 only gives correct indexing up to array dimension 15x15x15
For dimensions larger than these all of the different block sizes eventually start to increase again, but they never reach a value of dim^3-1 (which is the maximum index that the cuda thread should reach)
Here are some plots that illustrate this behavior:
For example: This is plotting on the x axis the dimension of the 3D array (which is xxx), and on the y axis the maximum index number that is processed during the cuda execution. This particular plot is for block dimensions of 10x10x10.
Here is the (Mathematica) code to generate that plot, but when I ran this one, I used block dimensions of 1024x1x1:
CUDAExp = CUDAFunctionLoad[codeexp, "cudaMatExp",
{{"Float", _,"Input"}, {"Float", _,"Output"},
_Integer, _Integer, _Integer},
{1024, 1, 1}]; (*These last three numbers are the block dimensions*)
max = 100; (* the maximum dimension of the 3D array *)
hold = Table[1, {i, 1, max}];
compare = Table[i^3, {i, 1, max}];
Do[
dim = ii;
AA = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
"TargetPrecision" -> "Single"];
BB = CUDAMemoryLoad[ConstantArray[1.0, {dim, dim, dim}], Real,
"TargetPrecision" -> "Single"];
hold[[ii]] = Max[Flatten[
CUDAMemoryGet[CUDAExp[AA, BB, dim, dim, dim][[1]]]]];
, {ii, 1, max}]
ListLinePlot[{compare, Flatten[hold]}, PlotRange -> All]
This is the same plot, but now plotting x^3 to compare to where it should be. Notice that it diverges after the dimension of the array is >32
I test the dimensions of the 3D array and look at how far the indexing goes and compare it with dim^3-1. E.g. for dim=32, the cuda max index is 32767 (which is 32^3 -1), but for dim=33 the cuda output is 33791 when it should be 35936 (33^3 -1). Notice that 33791-32767 = 1024 = blockDim.x
Question:
Is there a way to correctly index an array with dimensions larger than the block dimensions in Mathematica?
Now, I know that some people use __mul24(threadIdx.y,blockDim.x) in their index equation to prevent errors in bit multiplication, but it doesn't seem to help in my case.
Also, I have seen someone mention that you should compile your code with -arch=sm_11 because by default it's compiled for compute capability 1.0. I don't know if this is the case in Mathematica though. I would assume that CUDAFunctionLoad[] knows to compile with 2.0 capability. Any one know?
Any suggestions would be extremely helpful!

So, Mathematica kind of has a hidden way of dealing with grid dimensions, to fix your grid dimension to something that will work, you have to add another number to the end of the function you are calling.
The argument denotes the number of threads to launch (or grid dimension times block dimension).
For example, in my code above:
CUDAExp =
CUDAFunctionLoad[codeexp,
"cudaMatExp", {
{"Float", _, "Input"}, {"Float", _,"Output"},
_Integer, _Integer, _Integer},
{8, 8, 8}, "ShellOutputFunction" -> Print];
(8,8,8) denotes the dimension of the block.
When you call CUDAExp[] in mathematica, you can add an argument that denotes the number of threads to launch:
In this example I finally got it to work with the following:
// AA and BB are 3D arrays of 0 with dimensions dim^3
dim = 64;
CUDAExp[AA, BB, dim, dim, dim, 4089];
Note that when you compile with CUDAFunctionLoad[], it only expects 5 inputs, the first is the array you pass it (of dimensions dim x dim x dim) and the second is where the memory of it is stored. The third, fourth, and fifth are the dimensions.
When you pass it a 6th, mathematica translates that as gridDim.x * blockDim.x, so, since I know I need gridDim.x = 512 in order for every element in the array to be dealt with, I set this number equal to 512 * 8 = 4089.
I hope this is clear and useful to someone in the future that comes across this issue.

Related

How can I use a CVX variable in a Numpy product that is to be Minimized?

I'm trying to optimize a configuration X (boolean), such that the total price : base_price + discount, on a configuration is minimized, but the problem formulation gives a Matmul error since x is a cvxpy Variable and thus doesn't conform to the Numpy shape even though it was defined with the correct length.
n = len(Configuration)
x = cp.Variable(n, boolean=True)
problem = cp.Problem(cp.Minimize(base_price + price#(price_rules_A#x <= price_rules_B)), [
config_rules_A#x <= config_rules_B,
config_rules_2A#x == config_rules_2B
])
# where price#(price_rules_A#x <= price_rules_B) is the total discount
# and price, price_rules_A and price_rules_B are numpy arrays
The error i get is
ValueError: matmul: Input operand 1 does not have enough dimensions (has 0, gufunc core with signature (n?,k),(k,m?)->(n?,m?) requires 1)
I expect it to find an optimal config for x ( 0010110...) such that the discount is minimized but it doesn't. Any idea what might be causing this?
Assuming the evaluation of the inequality in the objective function is suppose to work as index to price, you can rewrite the function as
cp.Minimize(base_price + price#(1-(price_rules_B - price_rules_A#x))
Then the elements in price where the inequality is true will be summed.

Lightweight bandpass filter for embedded device (no FFT)

I would like to remove certain frequencies from the discrete dataset (signal sampled by ADC). Sounds simple enough. However, there are certain constrains that make things harder:
The chip is a 32 bit NXP JN5168, that has hardware multiplication, but no hardware division, no floats support, or any tools whatsoever that make DSP easy. Therefore, FFT-based filters those are so easy to implement on ARM Cortex M chips are a no-go
Sampling is pretty much set at 1 kHz
real time and RAM usage is a concern, because there are other tasks that chip is taking care of at the same time.
I tried using well-known high/low pass filters of the form:
Filtered Value (L) = Previous Value - k*(Previous Value - Original Value) // LPF
Filtered Value (H) = Original Value - Filtered Value (L) // HPF
Unfortunately, even with multiple passes this type of filter does not work as I would like it to work. The raise of the HPF response in the frequency domain always starts at 0, and one only can control the slope there by adjustng k. Since the sampling rate can't be changed, that's the only control.
If I want to filter out anything below 120 Hz, and leave 200 Hz untouched, that's not possible with the above filter, which is very good at having sharp cutoffs below 80 Hz (at my sampling rate). If I reduce the cutoff aggressiveness, and make filter work for 120 Hz then 200 Hz are also affected (less, but significantly).
Below you can see a frequency response of the 3 passes of the above-described HPF with k=1/2.
This does not work for me. I am looking for something different: I am looking for a lightweight filter algorithm suitable for embedded applications that can ideally provide a full or significant cut at an arbitrary frequency with a steep cutoff line, so that neighboring frequencies are unaffected, or affected insignificantly.
Thanks!
Edit: I do not want to transform my signal into the frequency domain, but rather continue working with a cleaned-up signal in the time domain.
Edit 2: Changing the hardware is not an option. There is an already existing product that needs a new feature. That happens. That's life. My job is to find the solution, which I am sure exists.
If your computational budget can afford 5 multiplications (MACs) per sample (per millisecond at a 1 kHz sample rate), and you can save the past couple samples, you can use a biquad IIR filter. There is a cookbook for biquads coefficients.
If you can afford a small multiple of 5 MACs per sample, then you can cascade biquads and get a higher-order IIR filter with an even sharper roll-off or cutoff. But you may need to use a filter design software package (MatLab, et.al.) to optimize the pole-zero locations of a higher order IIR filter for your specific requirements.
Adapted from The Scientist and Engineer's Guide to
Digital Signal Processing - Chapter 19: Recursive Filters:
static const float pi = 3.141592f ;
static const float pi2 = 2.0f * pi ;
static const float s = 48000 ; // Sample rate
void bandpassFilter( float f_hz, // Filter centre frequency
float bw_hz, // Filter bandwidth
const float *x, // Pointer to input sample block
float *y, // Pointer to output buffer
int n // Number of samples in sample block
)
{
static float x_2 = 0.0f; // delayed x, y samples
static float x_1 = 0.0f;
static float y_1 = 0.0f;
static float y_2 = 0.0f;
static const float f = f_hz / s ;
static const float bw = bw_hz / s ;
static const float R = 1 - (3 * bw) ;
static const float Rsq = R * R ;
static const float cosf2 = 2 * cos(pi2 * f) ;
static const float K = (1 - R * cosf2 + Rsq ) / (2 - cosf2) ;
static const float a0 = 1.0 - K ;
static const float a1 = 2 * (K - R) * cosf2 ;
static const float a2 = Rsq - K ;
static const float b1 = 2 * R * cosf2 ;
static const float b2 = -Rsq ;
for( int i = 0; i < n; ++i)
{
// IIR difference equation
y[i] = a0 * x[i] + a1 * x_1 + a2 * x_2
+ b1 * y_1 + b2 * y_2;
// shift delayed x, y samples
x_2 = x_1;
x_1 = x[i];
y_2 = y_1 ;
y_1 = y[i];
}
}
Since at the end of the loop, the filter state is retained in the static variables x_1, y_1, x_2 and y_2, the filter may be called repeatedly with any number of samples - one sample at a time or in blocks (more efficient).
The static calculation of the coefficients and use of single precision floating point however makes it reasonably fast even for software floating point, requiring only multiply-add. The use of software floating point may increase code size somewhat - most significantly in the use of cos(), but if your frequency, and bandwidth are not variable, the coefficients may be pre-calculated and hard coded - I have included them in the code for illustration purposes, and because it was real code I had available rather then developed specifically for the question.
If the floating-point remains too resource or time hungry, then a fixed-point implementation is possible. I have used the same code adapted for fixed-point using Anthony Williams' fixed point math library which uses C++ and extensive operator overloading to allow in most cases a fixed-point implementation simply by replacing float with fixed.

Fast way to set diagonals of an (M x N x N) matrix? Einsum / n-dimensional fill_diagonal?

I'm trying to write fast, optimized code based on matrices, and have recently discovered einsum as a tool for achieving significant speed-up.
Is it possible to use this to set the diagonals of a multidimensional array efficiently, or can it only return data?
In my problem, I'm trying to set the diagonals for an array of square matrices (shape: M x N x N) by summing the columns in each square (N x N) matrix.
My current (slow, loop-based) solution is:
# Build dummy array
dimx = 2 # Dimension x (likely to be < 100)
dimy = 3 # Dimension y (likely to be between 2 and 10)
M = np.random.randint(low=1, high=9, size=[dimx, dimy, dimy])
# Blank the diagonals so we can see the intended effect
np.fill_diagonal(M[0], 0)
np.fill_diagonal(M[1], 0)
# Compute diagonals based on summing columns
diags = np.einsum('ijk->ik', M)
# Set the diagonal for each matrix
# THIS IS LOW. CAN IT BE IMPROVED?
for i in range(len(M)):
np.fill_diagonal(M[i], diags[i])
# Print result
M
Can this be improved at all please? It seems np.fill_diagonal doesn't accepted non-square matrices (hence forcing my loop based solution). Perhaps einsum can help here too?
One approach would be to reshape to 2D, set the columns at steps of ncols+1 with the diagonal values. Reshaping creates a view and as such allows us to directly access those diagonal positions. Thus, the implementation would be -
s0,s1,s2 = M.shape
M.reshape(s0,-1)[:,::s2+1] = diags
If you do np.source(np.fill_diagonal) you'll see that in the 2d case it uses a 'strided' approach
if a.ndim == 2:
step = a.shape[1] + 1
end = a.shape[1] * a.shape[1]
a.flat[:end:step] = val
#Divakar's solution applies this to your 3d case by 'flattening' on 2 dimensions.
You could sum the columns with M.sum(axis=1). Though I vaguely recall some timings that found that einsum was actually a bit faster. sum is a little more conventional.
Someone has has asked for an ability to expand dimensions in einsum, but I don't think that will happen.

Is it possible to optimize this Matlab code for doing vector quantization with centroids from k-means?

I've created a codebook using k-means of size 4000x300 (4000 centroids, each with 300 features). Using the codebook, I then want to label an input vector (for purposes of binning later on). The input vector is of size Nx300, where N is the total number of input instances I receive.
To compute the labels, I calculate the closest centroid for each of the input vectors. To do so, I compare each input vector against all centroids and pick the centroid with the minimum distance. The label is then just the index of that centroid.
My current Matlab code looks like:
function labels = assign_labels(centroids, X)
labels = zeros(size(X, 1), 1);
% for each X, calculate the distance from each centroid
for i = 1:size(X, 1)
% distance of X_i from all j centroids is: sum((X_i - centroid_j)^2)
% note: we leave off the sqrt as an optimization
distances = sum(bsxfun(#minus, centroids, X(i, :)) .^ 2, 2);
[value, label] = min(distances);
labels(i) = label;
end
However, this code is still fairly slow (for my purposes), and I was hoping there might be a way to optimize the code further.
One obvious issue is that there is a for-loop, which is the bane of good performance on Matlab. I've been trying to come up with a way to get rid of it, but with no luck (I looked into using arrayfun in conjunction with bsxfun, but haven't gotten that to work). Alternatively, if someone know of any other way to speed this up, I would be greatly appreciate it.
Update
After doing some searching, I couldn't find a great solution using Matlab, so I decided to look at what is used in Python's scikits.learn package for 'euclidean_distance' (shortened):
XX = sum(X * X, axis=1)[:, newaxis]
YY = Y.copy()
YY **= 2
YY = sum(YY, axis=1)[newaxis, :]
distances = XX + YY
distances -= 2 * dot(X, Y.T)
distances = maximum(distances, 0)
which uses the binomial form of the euclidean distance ((x-y)^2 -> x^2 + y^2 - 2xy), which from what I've read usually runs faster. My completely untested Matlab translation is:
XX = sum(data .* data, 2);
YY = sum(center .^ 2, 2);
[val, ~] = max(XX + YY - 2*data*center');
Use the following function to calculate your distances. You should see an order of magnitude speed up
The two matrices A and B have the columns as the dimenions and the rows as each point.
A is your matrix of centroids. B is your matrix of datapoints.
function D=getSim(A,B)
Qa=repmat(dot(A,A,2),1,size(B,1));
Qb=repmat(dot(B,B,2),1,size(A,1));
D=Qa+Qb'-2*A*B';
You can vectorize it by converting to cells and using cellfun:
[nRows,nCols]=size(X);
XCell=num2cell(X,2);
dist=reshape(cell2mat(cellfun(#(x)(sum(bsxfun(#minus,centroids,x).^2,2)),XCell,'UniformOutput',false)),nRows,nRows);
[~,labels]=min(dist);
Explanation:
We assign each row of X to its own cell in the second line
This piece #(x)(sum(bsxfun(#minus,centroids,x).^2,2)) is an anonymous function which is the same as your distances=... line, and using cell2mat, we apply it to each row of X.
The labels are then the indices of the minimum row along each column.
For a true matrix implementation, you may consider trying something along the lines of:
P2 = kron(centroids, ones(size(X,1),1));
Q2 = kron(ones(size(centroids,1),1), X);
distances = reshape(sum((Q2-P2).^2,2), size(X,1), size(centroids,1));
Note
This assumes the data is organized as [x1 y1 ...; x2 y2 ...;...]
You can use a more efficient algorithm for nearest neighbor search than brute force.
The most popular approach are Kd-Tree. O(log(n)) average query time instead of the O(n) brute force complexity.
Regarding a Maltab implementation of Kd-Trees, you can have a look here

Fast formula for a "high contrast" curve

My inner loop contains a calculation that profiling shows to be problematic.
The idea is to take a greyscale pixel x (0 <= x <= 1), and "increase its contrast". My requirements are fairly loose, just the following:
for x < .5, 0 <= f(x) < x
for x > .5, x < f(x) <= 1
f(0) = 0
f(x) = 1 - f(1 - x), i.e. it should be "symmetric"
Preferably, the function should be smooth.
So the graph must look something like this:
.
I have two implementations (their results differ but both are conformant):
float cosContrastize(float i) {
return .5 - cos(x * pi) / 2;
}
float mulContrastize(float i) {
if (i < .5) return i * i * 2;
i = 1 - i;
return 1 - i * i * 2;
}
So I request either a microoptimization for one of these implementations, or an original, faster formula of your own.
Maybe one of you can even twiddle the bits ;)
Consider the following sigmoid-shaped functions (properly translated to the desired range):
error function
normal CDF
tanh
logit
I generated the above figure using MATLAB. If interested here's the code:
x = -3:.01:3;
plot( x, 2*(x>=0)-1, ...
x, erf(x), ...
x, tanh(x), ...
x, 2*normcdf(x)-1, ...
x, 2*(1 ./ (1 + exp(-x)))-1, ...
x, 2*((x-min(x))./range(x))-1 )
legend({'hard' 'erf' 'tanh' 'normcdf' 'logit' 'linear'})
Trivially you could simply threshold, but I imagine this is too dumb:
return i < 0.5 ? 0.0 : 1.0;
Since you mention 'increasing contrast' I assume the input values are luminance values. If so, and they are discrete (perhaps it's an 8-bit value), you could use a lookup table to do this quite quickly.
Your 'mulContrastize' looks reasonably quick. One optimization would be to use integer math. Let's say, again, your input values could actually be passed as an 8-bit unsigned value in [0..255]. (Again, possibly a fine assumption?) You could do something roughly like...
int mulContrastize(int i) {
if (i < 128) return (i * i) >> 7;
// The shift is really: * 2 / 256
i = 255 - i;
return 255 - ((i * i) >> 7);
A piecewise interpolation can be fast and flexible. It requires only a few decisions followed by a multiplication and addition, and can approximate any curve. It also avoids the courseness that can be introduced by lookup tables (or the additional cost in two lookups followed by an interpolation to smooth this out), though the lut might work perfectly fine for your case.
With just a few segments, you can get a pretty good match. Here there will be courseness in the color gradients, which will be much harder to detect than courseness in the absolute colors.
As Eamon Nerbonne points out in the comments, segmentation can be optimized by "choos[ing] your segmentation points based on something like the second derivative to maximize detail", that is, where the slope is changing the most. Clearly, in my posted example, having three segments in the middle of the five segment case doesn't add much more detail.