How does magma_dgetri use multiple GPUs

How does magma_dgetri use multiple GPUs - gpu

I just installed magma and I noticed that some routines have _mgpu version while some not. For example for LU inverse, there are 4 functions:
magma_cgetri_gpu (magma_int_t n, magmaFloatComplex_ptr dA, magma_int_t ldda, magma_int_t *ipiv, magmaFloatComplex_ptr dwork, magma_int_t lwork, magma_int_t *info)
magma_dgetri_gpu (magma_int_t n, magmaDouble_ptr dA, magma_int_t ldda, magma_int_t *ipiv, magmaDouble_ptr dwork, magma_int_t lwork, magma_int_t *info)
magma_sgetri_gpu (magma_int_t n, magmaFloat_ptr dA, magma_int_t ldda, magma_int_t *ipiv, magmaFloat_ptr dwork, magma_int_t lwork, magma_int_t *info)
magma_zgetri_gpu (magma_int_t n, magmaDoubleComplex_ptr dA, magma_int_t ldda, magma_int_t *ipiv, magmaDoubleComplex_ptr dwork, magma_int_t lwork, magma_int_t *info)
There is no _mgpu function for LU inverse, and there is also no relevant input parameter in these gpu functions that indicating the number of GPUs to use. Dose this mean there functions without '_mpu' suffix cannot use multiple GPUs? If the answer is no, how to do it?
Here is the link to the documentation:
http://icl.cs.utk.edu/projectsfiles/magma/doxygen/group__magma__getri.html
Thank you very much!

Related

Eigenvalues do not match between Eigen, Numpy, LAPACKE (Ubuntu), Intel MKL

I have the following matrix M of doubles:
55.774375 61.0225 62.805625
-122.045 -125.61125 -122.045
62.805625 61.0225 55.774375
(the matrix is part of an algorithm to estimate the parameters of an ellipse from a 2D point cloud, see https://autotrace.sourceforge.net/WSCG98.pdf for reference).
And now comes the interesting part. Determining the eigenvalues and (right) eigenvectors of the matrix with different packages leads to different results for the (right) eigenvectors. Eigenvalues are for every package the same:
[-5.09420041e-13 -7.03125000e+00 -7.03125000e+00]
For numpy with python:
M = np.array([[55.774375, 61.0225, 62.805625],
[-122.045, -125.61125, -122.045, ],
[62.805625, 61.0225, 55.774375]])
eval, evec = np.linalg.eig(M)
I get:
[[ 0.41608575 0.37443021 -0.80942954]
[-0.80854518 -0.82367703 0.34119147]
[ 0.41608575 0.42586167 0.47792489]]
With Eigen C++, the code looks as follows
Eigen::Matrix3d M;
M << 55.774375, 61.0225, 62.805625,
-122.045, -125.61125, -122.045,
62.805625, 61.0225, 55.774375;
Eigen::EigenSolver<Eigen::MatrixXd> solver;
solver.compute(M);
I get for the eigenvectors
0.416086 0.376456 -0.462421
-0.808545 -0.823758 0.820878
0.416086 0.423914 -0.335151
With LAPACKE (apt install liblapack-dev lapacke lapacke-dev)
double Marr[]{55.774375, 61.0225, 62.805625,
-122.045, -125.61125, -122.045,
62.805625, 61.0225, 55.774375};
char jobvl = 'N';
char jobvr = 'V';
int n=3;
int lda = n;
int ldvl = n;
int ldvr = n;
int lwork = -1;
int info;
double wr[n], wi[n], vl[ldvl*n], vr[ldvr*n];
LAPACKE_dgeev( LAPACK_ROW_MAJOR, 'V', 'V', n, Marr, lda, wr, wi,
vl, ldvl, vr, ldvr );
if( info > 0 ) {
printf( "The algorithm failed to compute eigenvalues.\n" );
exit( 1 );
}
I get for the eigenvectors
0.416086 0.376456 -0.788993
-0.808545 -0.823758 0.565975
0.416086 0.423914 0.239087
Similar are the results for Intel MKL.
I checked the determinant of M and it is close to zero (-4.0031989907207254e-05).
What I would like to understand is
Why do the eigenvectors for same eigenvalues differ so much between the libraries? Is this because of the different numerical methods used to approx them?
I understand that an Eigenvalue d has many associated eigenvectors, i.e. if v is the eigenvector for d than q * v (q being a scalar) is also an eigenvector of d. Since the second and the third eigenvalue are the same, I would assume that there is some scalar that transforms one into the other, but this doesn't seem to be the case.
My algorithm fails in C++ (in python it is working) due to the different eigenvectors. Is there a way out?

LAPACKE or MAGMA GPU - inversion of matrix with Cholesky factorization - functions magma_dpotrf_gpu and magma_dpotri_gpu

I have a first version of a function that inverses a matrix of size m and using
magma_dgetrf_gpu and magma_dgetri_gpulike this :
// Inversion
magma_dgetrf_gpu( m, m, d_a, m, piv, &info);
magma_dgetri_gpu( m, d_a, m, piv, dwork, ldwork, &info);
Now, I would like to also inverse but using the Cholesky decomposition. The function looks like the first version one,except the functions used which are :
// Inversion
magma_dpotrf_gpu( MagmaLower, m, d_a, m, &info);
magma_dpotri_gpu( MagmaLower, m, d_a, m, &info);
Here is the entire function that inverses :
// ROUTINE MAGMA TO INVERSE WITH CHOLESKY
void matrix_inverse_magma(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Start magma part
magma_int_t m = F_matrix.size();
if (m) {
magma_init (); // initialize Magma
magma_queue_t queue=NULL;
magma_int_t dev=0;
magma_queue_create(dev ,&queue );
double gpu_time , *dwork; // dwork - workspace
magma_int_t ldwork; // size of dwork
magma_int_t *piv, info; // piv - array of indices of inter -
// changed rows; a - mxm matrix
magma_int_t mm=m*m; // size of a, r, c
double *a; // a- mxm matrix on the host
double *d_a; // d_a - mxm matrix a on the device
magma_int_t err;
ldwork = m * magma_get_dgetri_nb( m ); // optimal block size
// allocate matrices
err = magma_dmalloc_cpu( &a , mm ); // host memory for a
// Convert matrix to *a double pointer
for (i = 0; i<m; i++){
for (j = 0; j<m; j++){
idx = i*m + j;
a[idx] = F_matrix[i][j];
}
}
err = magma_dmalloc( &d_a , mm ); // device memory for a
err = magma_dmalloc( &dwork , ldwork );// dev. mem. for ldwork
piv=( magma_int_t *) malloc(m*sizeof(magma_int_t ));// host mem.
magma_dsetmatrix( m, m, a, m, d_a, m, queue); // copy a -> d_a
// Inversion
magma_dpotrf_gpu( MagmaLower, m, d_a, m, &info);
magma_dpotri_gpu( MagmaLower, m, d_a, m, &info);
magma_dgetmatrix( m, m, d_a , m, a, m, queue); // copy d_a ->a
// Save Final matrix
for (i = 0; i<m; i++){
for (j = 0; j<m; j++){
idx = i*m + j;
F_output[i][j] = a[idx];
}
}
free(a); // free host memory
free(piv); // free host memory
magma_free(d_a); // free device memory
magma_queue_destroy(queue); // destroy queue
magma_finalize ();
// End magma part
}
}
Unfortunately, after checking the output data, I have a wrong inversion with my implementation.
I have doubts about the using at this line :
ldwork = m * magma_get_dgetri_nb( m ); // optimal block size
Could anyone see at first sight where the error comes from in my using of dpotrf and dpotri functions (actually magma_dpotrf_gpu and magma_dpotri_gpu) ?
EDIT 1:
following the advice of Damir Tenishev, I put an example of a function that inverses a matrix using LAPACKE :
// LAPACK version
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Size of F_matrix
int N = F_matrix.size();
int *IPIV = new int[N];
// Statement of main array to inverse
double *arr = new double[N*N];
// Output Diagonal block
double *diag = new double[N];
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
arr[idx] = F_matrix[i][j];
}
}
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
F_output[i][j] = arr[idx];
}
}
delete[] IPIV;
delete[] arr;
}
As you can see, this is a classical version of matrix inversion, which uses LAPACKE_dgetrf and LAPACKE_dgetri
EDIT 2: The MAGMA version is :
// MAGMA version
void matrix_inverse_magma(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Start magma part
magma_int_t m = F_matrix.size();
if (m) {
magma_init (); // initialize Magma
magma_queue_t queue=NULL;
magma_int_t dev=0;
magma_queue_create(dev ,&queue );
double gpu_time , *dwork; // dwork - workspace
magma_int_t ldwork; // size of dwork
magma_int_t *piv, info; // piv - array of indices of inter -
// changed rows; a - mxm matrix
magma_int_t mm=m*m; // size of a, r, c
double *a; // a- mxm matrix on the host
double *d_a; // d_a - mxm matrix a on the device
magma_int_t ione = 1;
magma_int_t ISEED [4] = { 0,0,0,1 }; // seed
magma_int_t err;
const double alpha = 1.0; // alpha =1
const double beta = 0.0; // beta=0
ldwork = m * magma_get_dgetri_nb( m ); // optimal block size
// allocate matrices
err = magma_dmalloc_cpu( &a , mm ); // host memory for a
for (i = 0; i<m; i++){
for (j = 0; j<m; j++){
idx = i*m + j;
a[idx] = F_matrix[i][j]
}
}
err = magma_dmalloc( &d_a , mm ); // device memory for a
err = magma_dmalloc( &dwork , ldwork );// dev. mem. for ldwork
piv=( magma_int_t *) malloc(m*sizeof(magma_int_t ));// host mem.
magma_dsetmatrix( m, m, a, m, d_a, m, queue); // copy a -> d_a
// find the inverse matrix: d_a*X=I using the LU factorization
// with partial pivoting and row interchanges computed by
// magma_dgetrf_gpu; row i is interchanged with row piv(i);
// d_a -mxm matrix; d_a is overwritten by the inverse
magma_dgetrf_gpu( m, m, d_a, m, piv, &info);
magma_dgetri_gpu(m, d_a, m, piv, dwork, ldwork, &info);
magma_dgetmatrix( m, m, d_a , m, a, m, queue); // copy d_a ->a
for (i = 0; i<m; i++){
for (j = 0; j<m; j++){
idx = i*m + j;
F_output[i][j] = a[idx];
}
}
free(a); // free host memory
free(piv); // free host memory
magma_free(d_a); // free device memory
magma_queue_destroy(queue); // destroy queue
magma_finalize ();
// End magma part
}
}
As you can see, I have used magma_dgetrf_gpu and magma_dgetri_gpu functions.
Now, I would like to do the same, either with LAPACKE or MAGMA+LAPACK, using dpotrf and dpotri functions. I recall that the matrixes that I inverse are symmetric.
EDIT 3: my attempts come from this documentation link
Especially, see section 4.4.21 magma dpotri - invert a positive definite matrix in double precision, CPU interface on page 325.

I don't know how can I write the below conditions in CPLEX?

∀ d1, d2 ∈ D: ∃ ( p[i] ∈ P[d1]) ∩(p[j] ∈ P[d2] ) ∩ (l ∈ p[i] ∩ l ∈ p[j])
I want to write this conditions but I don't know how!
I write as you can see, is it correct?
{int} path[Demands][K_sp]=...; ( we have k_sp shortest pathes for each demand, that each shortest path includes index of links)
{int} Path[Demands]=...; (i insert all of index links for k_sp shortest paths a demand in this array)
forall (i in Demands, j in Demands : i!=j && card(Path[i] inter Path[j])!=0)
D[i][j]+D[j][i]==1;

You could use logical constraints:
{int} D=asSet(1..10);
{int} allP=asSet(1..20);
dvar boolean x[d in D][p in allP]; // for each D , is p in P[d] ?
dvar boolean DD[d1 in D][d2 in D];
subject to
{
forall(ordered d1,d2 in D) DD[d1][d2]== (1<=sum(p in allP) ((x[d1][p]==1) && (x[d2][p]==1)));
forall(ordered d2,d1 in D) DD[d1][d2]== 1-(1<=sum(p in allP) ((x[d1][p]==1) && (x[d2][p]==1)));
}

Could the following help you?
{int} D=asSet(1..10);
{int} allP=asSet(1..20);
dvar boolean x[d in D][p in allP]; // for each D , is p in P[d] ?
subject to
{
forall(d1,d2 in D) 1<=sum(p in allP) ((x[d1][p]==1) && (x[d2][p]==1));
}
assert forall(d1,d2 in D) or(p in allP) ((x[d1][p]==1) && (x[d2][p]==1));

Cost functional calculation for global optimization in CUDA

I am trying to optimize a function (say find the minimum) with n parameters (Xn). All Xi's are bound to a certain range (for example -200 to 200) and if any parameter leaves this range, the function goes to infinity very fast. However, n can be large (from 20 to about 60-70) and computing it's value takes long time.
I don't think the details about the function are of big relevance, but here are some: it consists of a weighted sum of 20-30 smaller functions (all different), which on their part consist of sums of dot products under the sign of an inverse sinusoidal function (arcsin, arccos, arctan, etc). Something like arcsin(X1 . X2) + arcsin(X4 . X7) + ....
The function has many local minima in general, so approaches such as (naive) conjugated gradients or quasi-Newton are useless. Searching the entire domain brute force is too slow.
My initial idea was to use some sort of massive parallelization in combination with a genetic algorithm, which performs many searches on different spots in the domain of the function, and at regular intervals checks whether some of the searches reached local minima. If yes, it compares them and discards all results but the smallest one, and continues the search until a reasonably small value is found.
My two questions are:
1) Is it possible to implement this problem in CUDA or a similar technology? Can CUDA compute the value of a function like this fast enough?
2) Would be better/faster to implement the problem on a multicore PC (with 12+ cores)?

It is certainly possible to implement global optimization algorithms on GPUs, but you may have to modify the algorithms to get the best performance. I have personally implemented about a half dozen population-based metaheuristics on GPUs, and it is possible to get excellent speedups relative to a multi-threaded CPU.
With population-based algorithms iterated over generations, you may find that the population size becomes the limiting factor in your performance. If your objective function is not easily parallelizable, then the maximum number of parallel threads is generally the number of candidate solutions in the population. GPUs work best with, at minimum, tens of thousands of simultaneous threads, which is not always practical for population-based algorithms due to population inertia and other effects.
You can get around the population limitations to some extent by running multiple instances of the optimization in parallel, each starting with a different random seed. Further, you can arrange for the parallel instances to communicate over some network topology (this would be an island model algorithm), so that the parallel instances can collaboratively evolve solutions to complex problems. I have actually implemented this in OpenCL for my MScE thesis.
So overall, here are succinct answers to your questions:
1) Yes, you can implement this in CUDA or a similar technology. Your speedup will depend on how parallelizable your objective function is and how many parallel instances you want to run at the same time.
2) It would almost certainly be faster to implement on a CPU, due to a wider range of existing libraries and the fact that CPU programming models are conceptually simpler than GPU models. Whether or not it is "better" depends on what you value.
This is still an area of active research, and it will probably take you longer to build a working GPU implementation than it would take for a multicore CPU implementation. If implementation time is your primary concern, I would like to recommend that you take a look at the PaGMO project (and its Python bindings PyGMO), which is an excellent implementation of an island model optimizer with a wide range of local and global optimization functions included. The islands can be assigned any arbitrary algorithm to run independently, and you can specify exactly how they communicate to collaboratively optimize your objective function.
http://sourceforge.net/apps/mediawiki/pagmo/index.php?title=Main_Page
Your question is well within my research area, and I would be happy to help you further if you need it.

Can CUDA compute the value of a function like this fast enough?
Many cost functionals are expressed in the form of summations of a certain number of terms. Examples are the:
Sphere function
Rosenbrock function
Styblinski-Tang function
In all those cases, the evaluation of the cost function can be performed by a reduction, or better, a transformation followed by a reduction.
CUDA Thrust has thrust::transform_reduce which can surely serve the scope, but of course you can set up your own transformation + reduction routines.
Below, I'm providing an example on how you can compute the Rosenbrock functional using either CUDA Thrust or a customized version of the reduction routine offered by the CUDA examples. In the latter case, a pointer to a __device__ transformation function is passed to the customized transform_reduce function, if the EXTERNAL keyword is defined, or the the transformation function is defined and compiled in the compilation unit of the customized transform_reduce routine.
Some performance results on a Kepler K20c card for the non-EXTERNAL case:
N = 90000 Thrust = 0.055ms Customized = 0.059ms
N = 900000 Thrust = 0.67ms Customized = 0.14ms
N = 9000000 Thrust = 0.85ms Customized = 0.87ms
Here is the code. For the timing functions, please see OrangeOwlSolutions/Timing and OrangeOwlSolutions/CUDA_Utilities github projects.
Please, note that separate compilation is required.
kernel.cu
// --- Requires separate compilation
#include <stdio.h>
#include <thrust/device_vector.h>
#include "transform_reduce.cuh"
#include "Utilities.cuh"
#include "TimingCPU.h"
#include "TimingGPU.cuh"
/***************************************/
/* COST FUNCTION - GPU/CUSTOMIZED CASE */
/***************************************/
// --- Transformation function
__device__ __forceinline__ float transformation(const float * __restrict__ x, const int i) { return (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f)) ; }
// --- Device-side function pointer
__device__ pointFunction_t dev_pfunc = transformation;
/***********************************/
/* COST FUNCTION - GPU/THRUST CASE */
/***********************************/
struct CostFunctionStructGPU{
template <typename Tuple>
__host__ __device__ float operator()(Tuple a) {
float temp1 = (thrust::get<1>(a) - thrust::get<0>(a) * thrust::get<0>(a));
float temp2 = (thrust::get<0>(a) - 1.f);
return 100.f * temp1 * temp1 + temp2 * temp2;
}
};
/********/
/* MAIN */
/********/
int main()
{
const int N = 90000000;
float *x = (float *)malloc(N * sizeof(float));
for (int i=0; i<N; i++) x[i] = 3.f;
float *d_x; gpuErrchk(cudaMalloc((void**)&d_x, N * sizeof(float)));
gpuErrchk(cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice));
/************************************************/
/* "CUSTOMIZED" DEVICE-SIDE TRANSFORM REDUCTION */
/************************************************/
float customizedDeviceResult = transform_reduce(d_x, N - 1, &dev_pfunc);
TimingGPU timerGPU;
timerGPU.StartCounter();
customizedDeviceResult = transform_reduce(d_x, N - 1, &dev_pfunc);
printf("Timing for 'customized', device-side transform reduction = %f\n", timerGPU.GetCounter());
printf("Result for 'customized', device-side transform reduction = %f\n", customizedDeviceResult);
printf("\n\n");
/************************************************/
/* THRUST-BASED DEVICE-SIDE TRANSFORM REDUCTION */
/************************************************/
thrust::device_vector<float> d_vec(N,3.f);
timerGPU.StartCounter();
float ThrustResult = thrust::transform_reduce(thrust::make_zip_iterator(thrust::make_tuple(d_vec.begin(), d_vec.begin() + 1)), thrust::make_zip_iterator(thrust::make_tuple(d_vec.begin() + N - 1, d_vec.begin() + N)), CostFunctionStructGPU(), 0.f, thrust::plus<float>());
printf("Timing for Thrust-based, device-side transform reduction = %f\n", timerGPU.GetCounter());
printf("Result for Thrust-based, device-side transform reduction = %f\n", ThrustResult);
printf("\n\n");
/*********************************/
/* HOST-SIDE TRANSFORM REDUCTION */
/*********************************/
// thrust::host_vector<float> h_vec(d_vec);
//sum_host = sum_host + transformation(thrust::raw_pointer_cast(h_vec.data()), i);
TimingCPU timerCPU;
timerCPU.StartCounter();
float sum_host = 0.f;
for (int i=0; i<N-1; i++) {
float temp = (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f));
sum_host = sum_host + temp;
//printf("%i %f %f\n", i, temp, sum_host);
}
printf("Timing for host-side transform reduction = %f\n", timerCPU.GetCounter());
printf("Result for host-side transform reduction = %f\n", sum_host);
printf("\n\n");
sum_host = 0.f;
float c = 0.f;
for (int i=0; i<N-1; i++) {
float temp = (100.f * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - 1.f) * (x[i] - 1.f)) - c;
float t = sum_host + temp;
c = (t - sum_host) - temp;
sum_host = t;
}
printf("Result for host-side transform reduction = %f\n", sum_host);
// cudaDeviceReset();
}
transform_reduce.cuh
#ifndef TRANSFORM_REDUCE_CUH
#define TRANSFORM_REDUCE_CUH
// --- Function pointer type
// --- Complete with your own favourite instantiations
typedef float(*pointFunction_t)(const float * __restrict__, const int);
template <class T> T transform_reduce(T *, unsigned int, pointFunction_t *);
#endif
transform_reduce.cu
#include <stdio.h>
#include "Utilities.cuh"
#include "transform_reduce.cuh"
#define BLOCKSIZE 512
#define warpSize 32
// --- Host-side function pointer
pointFunction_t h_pfunc;
// --- Uncomment if you want to apply CUDA error checking to the kernel launches
//#define DEBUG
//#define EXTERNAL
/*******************************************************/
/* CALCULATING THE NEXT POWER OF 2 OF A CERTAIN NUMBER */
/*******************************************************/
unsigned int nextPow2(unsigned int x)
{
--x;
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
return ++x;
}
/*************************************/
/* CHECK IF A NUMBER IS A POWER OF 2 */
/*************************************/
// --- Note: although x = 1 is a power of 2 (1 = 2^0), this routine returns 0 for x == 1.
bool isPow2(unsigned int x) {
if (x == 1) return 0;
else return ((x&(x-1))==0);
}
/***************************/
/* TRANSFORMATION FUNCTION */
/***************************/
template <class T>
__host__ __device__ __forceinline__ T transformation(const T * __restrict__ x, const int i) { return ((T)100 * (x[i+1] - x[i] * x[i]) * (x[i+1] - x[i] * x[i]) + (x[i] - (T)1) * (x[i] - (T)1)) ; }
/********************/
/* REDUCTION KERNEL */
/********************/
/*
This version adds multiple elements per thread sequentially. This reduces the overall
cost of the algorithm while keeping the work complexity O(n) and the step complexity O(log n).
(Brent's Theorem optimization)
Note, this kernel needs a minimum of 64*sizeof(T) bytes of shared memory.
In other words if blockSize <= 32, allocate 64*sizeof(T) bytes.
If blockSize > 32, allocate blockSize*sizeof(T) bytes.
*/
template <class T, unsigned int blockSize, bool nIsPow2>
__global__ void reductionKernel(T *g_idata, T *g_odata, unsigned int N, pointFunction_t pPointTransformation)
{
extern __shared__ T sdata[];
unsigned int tid = threadIdx.x; // Local thread index
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x; // Global thread index - Fictitiously double the block dimension
unsigned int gridSize = blockSize*2*gridDim.x;
// --- Performs the first level of reduction in registers when reading from global memory on multiple elements per thread.
// More blocks will result in a larger gridSize and therefore fewer elements per thread
T mySum = 0;
while (i < N) {
#ifdef EXTERNAL
mySum += (*pPointTransformation)(g_idata, i);
#else
mySum += transformation(g_idata, i);
#endif
// --- Ensure we don't read out of bounds -- this is optimized away for powerOf2 sized arrays
if (nIsPow2 || i + blockSize < N)
#ifdef EXTERNAL
mySum += (*pPointTransformation)(g_idata, i+blockSize);
#else
mySum += transformation(g_idata, i+blockSize);
#endif
i += gridSize; }
// --- Each thread puts its local sum into shared memory
sdata[tid] = mySum;
__syncthreads();
// --- Reduction in shared memory. Fully unrolled loop.
if ((blockSize >= 512) && (tid < 256)) sdata[tid] = mySum = mySum + sdata[tid + 256];
__syncthreads();
if ((blockSize >= 256) && (tid < 128)) sdata[tid] = mySum = mySum + sdata[tid + 128];
__syncthreads();
if ((blockSize >= 128) && (tid < 64)) sdata[tid] = mySum = mySum + sdata[tid + 64];
__syncthreads();
#if (__CUDA_ARCH__ >= 300 )
// --- Single warp reduction by shuffle operations
if ( tid < 32 )
{
// --- Last iteration removed from the for loop, but needed for shuffle reduction
mySum += sdata[tid + 32];
// --- Reduce final warp using shuffle
for (int offset = warpSize/2; offset > 0; offset /= 2) mySum += __shfl_down(mySum, offset);
//for (int offset=1; offset < warpSize; offset *= 2) mySum += __shfl_xor(mySum, i);
}
#else
// --- Reduction within a single warp. Fully unrolled loop.
if ((blockSize >= 64) && (tid < 32)) sdata[tid] = mySum = mySum + sdata[tid + 32];
__syncthreads();
if ((blockSize >= 32) && (tid < 16)) sdata[tid] = mySum = mySum + sdata[tid + 16];
__syncthreads();
if ((blockSize >= 16) && (tid < 8)) sdata[tid] = mySum = mySum + sdata[tid + 8];
__syncthreads();
if ((blockSize >= 8) && (tid < 4)) sdata[tid] = mySum = mySum + sdata[tid + 4];
__syncthreads();
if ((blockSize >= 4) && (tid < 2)) sdata[tid] = mySum = mySum + sdata[tid + 2];
__syncthreads();
if ((blockSize >= 2) && ( tid < 1)) sdata[tid] = mySum = mySum + sdata[tid + 1];
__syncthreads();
#endif
// --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
// individual blocks
if (tid == 0) g_odata[blockIdx.x] = mySum;
}
/******************************/
/* REDUCTION WRAPPER FUNCTION */
/******************************/
template <class T>
T transform_reduce_inner(T *g_idata, unsigned int N, pointFunction_t h_pfunc) {
// --- Reduction parameters
const int NumThreads = (N < BLOCKSIZE) ? nextPow2(N) : BLOCKSIZE;
const int NumBlocks = (N + NumThreads - 1) / NumThreads;
const int smemSize = (NumThreads <= 32) ? 2 * NumThreads * sizeof(T) : NumThreads * sizeof(T);
// --- Device memory space where storing the partial reduction results
T *g_odata; gpuErrchk(cudaMalloc((void**)&g_odata, NumBlocks * sizeof(T)));
if (isPow2(N)) {
switch (NumThreads) {
case 512: reductionKernel<T, 512, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 256: reductionKernel<T, 256, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 128: reductionKernel<T, 128, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 64: reductionKernel<T, 64, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 32: reductionKernel<T, 32, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 16: reductionKernel<T, 16, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 8: reductionKernel<T, 8, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 4: reductionKernel<T, 4, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 2: reductionKernel<T, 2, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 1: reductionKernel<T, 1, true><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
}
#ifdef DEBUG
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
#endif
}
else {
switch (NumThreads) {
case 512: reductionKernel<T, 512, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 256: reductionKernel<T, 256, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 128: reductionKernel<T, 128, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 64: reductionKernel<T, 64, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 32: reductionKernel<T, 32, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 16: reductionKernel<T, 16, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 8: reductionKernel<T, 8, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 4: reductionKernel<T, 4, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 2: reductionKernel<T, 2, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
case 1: reductionKernel<T, 1, false><<< NumBlocks, NumThreads, smemSize>>>(g_idata, g_odata, N, h_pfunc); break;
}
#ifdef DEBUG
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
#endif
}
// --- The last part of the reduction, which would be expensive to perform on the device, is executed on the host
T *host_vector = (T *)malloc(NumBlocks * sizeof(T));
gpuErrchk(cudaMemcpy(host_vector, g_odata, NumBlocks * sizeof(T), cudaMemcpyDeviceToHost));
T sum_transformReduce = (T)0;
for (int i=0; i<NumBlocks; i++) sum_transformReduce = sum_transformReduce + host_vector[i];
return sum_transformReduce;
}
template <class T>
T transform_reduce(T *g_idata, unsigned int N, pointFunction_t *dev_pfunc) {
#ifdef EXTERNAL
gpuErrchk(cudaMemcpyFromSymbol(&h_pfunc, *dev_pfunc, sizeof(pointFunction_t)));
#endif
T customizedDeviceResult = transform_reduce_inner(g_idata, N, h_pfunc);
return customizedDeviceResult;
}
// --- Complete with your own favourite instantiations
template float transform_reduce(float *, unsigned int, pointFunction_t *);

My answer above is mostly suited for problems having a very arge number of unknowns, which are typically dealt with local optimization algorithms. I will leave it here for possible reference to other users.
As you mentioned, you are dealing with 60-70 unknowns, a scenario which can be more easily managed by global optimization algorithms.
As underlined above, cost functionals often consist of summations, so that their computation amounts to subsequent transforms and reductions operations. With such a number of unknowns, reduction is shared memory could be an interesting option. Fortunately, CUB offers primitives for reduction in shared memory.
Here is a worked example on how using CUB for the calculation of a large number of cost functional values for problems having a moderate number of unknowns. The cost functional in this case is chosen to be the Rastrigin function, but the example can be adapted to other cost functionals by just changing the corresponding __device__ function.
#include <cub/cub.cuh>
#include <cuda.h>
#include "Utilities.cuh"
#include <iostream>
#define BLOCKSIZE 256
const int N = 4096;
/************************/
/* RASTRIGIN FUNCTIONAL */
/************************/
__device__ float rastrigin(float x) {
return x * x - 10.0f * cosf(2.0f * x) + 10.0f;
}
/******************************/
/* TRANSFORM REDUCTION KERNEL */
/******************************/
__global__ void CostFunctionalCalculation(const float * __restrict__ indata, float * __restrict__ outdata) {
unsigned int tid = threadIdx.x + blockIdx.x * gridDim.x;
// --- Specialize BlockReduce for type float.
typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT;
__shared__ typename BlockReduceT::TempStorage temp_storage;
float result;
if(tid < N) result = BlockReduceT(temp_storage).Sum(rastrigin(indata[tid]));
if(threadIdx.x == 0) outdata[blockIdx.x] = result;
return;
}
/********/
/* MAIN */
/********/
int main() {
// --- Allocate host side space for
float *h_data = (float *)malloc(N * sizeof(float));
float *h_result = (float *)malloc((N / BLOCKSIZE) * sizeof(float));
float *d_data; gpuErrchk(cudaMalloc(&d_data, N * sizeof(float)));
float *d_result; gpuErrchk(cudaMalloc(&d_result, (N / BLOCKSIZE) * sizeof(float)));
for (int i = 0; i < N; i++) {
h_data[i] = 1.f;
}
gpuErrchk(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));
CostFunctionalCalculation<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_data, d_result);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaMemcpy(h_result, d_result, (N / BLOCKSIZE) * sizeof(float), cudaMemcpyDeviceToHost));
std::cout << "output: \n";
for (int k = 0; k < N / BLOCKSIZE; k++) std::cout << k << " " << h_result[k] << "\n";
std::cout << std::endl;
gpuErrchk(cudaFree(d_data));
gpuErrchk(cudaFree(d_result));
return 0;
}

What is clojure.lang.Var.getRawRoot and why is it called?

I was writing a function that checks if two points can see each other on a 2D grid for a pathfinding algorithm. After profiling the code, I found that it spent 60% of its time in clojure.lang.Var.getRawRoot(). Why is this function consuming so much time and can I optimize it away?
(defn line-of-sight-helper [^Maze maze [x0 y0] [x1 y1]]
"Determines if there is a line of sight from [x0 y0] to [x1 y1] in maze."
(let [dy (int (- y1 y0))
dx (int (- x1 x0))
sy (int (if (neg? dy) -1 1))
sx (int (if (neg? dx) -1 1))
dy (int (* sy dy))
dx (int (* sx dx))
bias-x (int (if (pos? sx) 0 -1))
bias-y (int (if (pos? sy) 0 -1))
x-long (boolean (>= dx dy))
[u0 u1 du su bias-u] (if x-long
[(int x0) (int x1) dx sx bias-x]
[(int y0) (int y1) dy sy bias-y])
[v0 v1 dv sv bias-v] (if x-long
[(int y0) (int y1) dy sy bias-y]
[(int x0) (int x1) dx sx bias-x])
grid (if x-long
#(blocked? maze [%1 %2])
#(blocked? maze [%2 %1]))]
(loop [u0 u0
v0 v0
error (int 0)]
(if (not= u0 u1)
(let [error (+ error dv)
too-much-error? (> error du)
next-blocked? (grid (+ u0 bias-u) (+ v0 bias-v))
branch3 (and too-much-error? (not next-blocked?))
v0 (int (if branch3
(+ v0 sv)
v0))
error (if branch3
(int (- error du))
(int error))]
(if (and too-much-error? next-blocked?)
false
(if (and (not (zero? error)) next-blocked?)
false
(if (and (zero? dv)
(grid (+ u0 bias-u)
v0)
(grid (+ u0 bias-u)
(- v0 1)))
false
(recur (int (+ u0 su))
v0
error)))))
true))))

What's happening with getVarRoot?
I'm really surprised that any program spends much time in getRawRoot(). All this method does is return a single field from the Var, as per the source in clojure.lang.Var:
final public Object getRawRoot(){
return root;
}
In additional, it's a small final method so should be inlined by any modern JIT compiler..... basically any calls to getRawRoot should be insanely fast.
I suspect that something strange is going on with your profiler: perhaps it is adding debug code in getRawRoot() that is taking a lot of time. Hence I'd suggest benchmarking your code without the profiler and with java -server to see how the function really performs.
Other performance hints
Make sure you use Clojure 1.3+, since there are some optimisations for var access that you will almost certainly want to take advantage of in this kind of low-level code.
If I was to take a guess as to what is actually the biggest bottleneck in this code, then I think it would be the fact that the grid function #(blocked? maze [%1 %2]) constructs a new vector every time it is called to check a grid square. It would be much better if you could refactor this so that it didn't need a vector and you could then just use #(blocked? maze %1 %2) directly. Constructing new collections is expensive compared to simple maths operations so you want to do it sparingly in you inner loops.
You also want to make sure you are using primitive operations wherever possible, and with (set! *unchecked-math* true). Make sure you declare your locals as primitives, so you will want e.g. (let [u0 (int (if x-long x0 y0)) .....] .....) etc. The main reason to do this is avoid the overhead of boxed primitives, which again implies memory allocations that you want to avoid in inner loops.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas