Clang, link time optimization fails for AVX horizontal add - optimization

I have a small piece of testing code which calculates the dot products of two vectors with a third vector using AVX instructions (A dot C and B dot C below). It also adds the two products, but that is just to make the function return something for this example.
#include <iostream>
#include <immintrin.h>
double compute(const double *x)
{
    __m256d A = _mm256_loadu_pd(x);
    __m256d B = _mm256_loadu_pd(x + 4);
    __m256d C = _mm256_loadu_pd(x + 8);
    __m256d c1 = _mm256_mul_pd(A, C);
    __m256d c2 = _mm256_mul_pd(B, C);
    __m256d tmp = _mm256_hadd_pd(c1, c2);
    __m128d lo = _mm256_extractf128_pd(tmp, 0);
    __m128d hi = _mm256_extractf128_pd(tmp, 1);
    __m128d dotp = _mm_add_pd(lo, hi);
    double y[2];
    _mm_store_pd(y, dotp);
    return y[0] + y[1];
}
int main(int argc, char *argv[])
{
    const double v[12] = {0.3, 2.9, 1.3, 4.0, -1.0, -2.1, -3.0, -4.0, 0.0, 2.0, 1.3, 1.2};
    double x = 0;
    std::cout << "AVX" << std::endl;
    x = compute(v);
    std::cout << "x = " << x << std::endl;
    return 0;
}
When I compile as
clang++ -O3 -mavx main.cc -o main
everything works fine. If I enable link time optimization:
clang++ -flto -O3 -mavx main.cc -o main
I get the following error: "LLVM ERROR: Do not know how to split the result of this operator!". I have narrowed the culprit down to the _mm256_hadd_pd statement; if it is exchanged with e.g. _mm256_add_pd, link-time optimization works again. I realize that this is a silly example to use link-time optimization on, but the error occurred in a different context where link-time optimization is extremely helpful.
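For reference, a hadd-free formulation of the same reduction also exists (a minimal sketch only; I have not verified whether it avoids the LTO failure, and the helper name hsum256 is made up):
// Horizontal sum of one __m256d without _mm256_hadd_pd.
static inline double hsum256(__m256d v)
{
    __m128d lo = _mm256_castpd256_pd128(v);     // lower two lanes
    __m128d hi = _mm256_extractf128_pd(v, 1);   // upper two lanes
    __m128d sum2 = _mm_add_pd(lo, hi);          // two partial sums
    __m128d swap = _mm_unpackhi_pd(sum2, sum2); // move the high lane down
    return _mm_cvtsd_f64(_mm_add_sd(sum2, swap));
}
// In compute(): return hsum256(c1) + hsum256(c2);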
Can anyone explain what is going on here?

Related

Eigen on STM32 works only until a certain size

I am trying to use the Eigen C++ library on an STM32F4 Discovery embedded board to perform some matrix operations, specifically to do some Kalman filtering on sensor data in the future.
I tried linking against the standard C++ library and even tried to compile the program using the ARM g++ compiler.
typedef Eigen::Matrix<float, 10, 10> Matrix10d;
Matrix10d mat1 = Matrix10d::Constant(10, 10, 1);
Matrix10d mat2 = Matrix10d::Constant(10, 10, 2);
Matrix10d result;
result = mat1 * mat2;
I can compile the same code if the matrix size has been set to 7. If I go beyond that, the code won't compile and Eigen gives me this warning:
warning: argument 1 value '4294967295' exceeds maximum object size 2147483647
These are the partial error messages I am getting:
In function 'throw_std_bad_alloc',
inlined from 'check_size_for_overflow' at bla/bla/Eigen/src/Core/util/Memory.h:289:24
Here is the memory allocation in Linker script I am using
/*
* STM32F407xG memory setup.
* Note: Use of ram1 and ram2 is mutually exclusive with use of ram0.
*/
MEMORY
{
flash0 : org = 0x08000000, len = 1M
flash1 : org = 0x00000000, len = 0
flash2 : org = 0x00000000, len = 0
flash3 : org = 0x00000000, len = 0
flash4 : org = 0x00000000, len = 0
flash5 : org = 0x00000000, len = 0
flash6 : org = 0x00000000, len = 0
flash7 : org = 0x00000000, len = 0
ram0 : org = 0x20000000, len = 128k /* SRAM1 + SRAM2 */
ram1 : org = 0x20000000, len = 112k /* SRAM1 */
ram2 : org = 0x2001C000, len = 16k /* SRAM2 */
ram3 : org = 0x00000000, len = 0
ram4 : org = 0x10000000, len = 64k /* CCM SRAM */
ram5 : org = 0x40024000, len = 4k /* BCKP SRAM */
ram6 : org = 0x00000000, len = 0
ram7 : org = 0x00000000, len = 0
}
I am just running the STM32F4 Discovery board with an unchanged ChibiOS configuration:
# Stack size to be allocated to the Cortex-M process stack. This stack is
# the stack used by the main() thread.
ifeq ($(USE_PROCESS_STACKSIZE),)
USE_PROCESS_STACKSIZE = 0x400
endif
Update
I am not able to reproduce this error anymore. The sad thing is that I didn't do anything to solve the issue.
arm-none-eabi-gcc -c -mcpu=cortex-m4 -O3 -Os -ggdb -fomit-frame-pointer -falign-functions=16 -ffunction-sections -fdata-sections -fno-common -flto -mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant -Wall -Wextra -Wundef -Wstrict-prototypes -Wa,-alms=build/lst/ -DCORTEX_USE_FPU=TRUE -DCHPRINTF_USE_FLOAT=TRUE -DTHUMB_PRESENT -mno-thumb-interwork -DTHUMB_NO_INTERWORKING -MD -MP -MF .dep/build.d -I.
The above are the compiler options that I am using if anyone is interested.
Now I can multiply even 20x20 matrices without any problem.
Matrix20d mat1 = Matrix20d::Constant(20, 20, 2);
// Multiply the matrix with a vector.
Vector20d vec = Vector20d::Constant(20, 1, 2);
Vector20d result;
systime_t startTime = chVTGetSystemTimeX();
result = mat1 * vec;
// Calculate the time difference
systime_t endTime = chVTGetSystemTimeX();
systime_t timeDifference = chTimeDiffX(startTime, endTime);
chprintf(chp,"Time taken for the multiplication in milliseconds : %d\n", (int)timeDifference);
chprintf(chp, "System time : %d \n", startTime);
chprintf(chp, "Systime end : %d \n", endTime);
chprintf(chp, "Values in the vector : \n [");
for(Eigen::Index i=0; i < result.size();i++)
{
chprintf(chp, "%0.3f, ", result(i));
}
chprintf(chp, "] \n");
chThdSleepMilliseconds(1000);
It took about 1 ms to do the above computation.
I thought that there might be some problem with my compiler, so I tried two compiler versions:
Version - 1
arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Version-2
arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Inefficient kernel function

Is there any possibility to accelerate this simple kernel function? I have thought about using shared memory, but N is equal to 507904, so it is much larger than a shared memory array could be.
My program creates blocks of 256 threads each.
__global__ void compute(COMPLEX_TYPE *a, COMPLEX_TYPE *b,
                        FLOAT_TYPE *F, FLOAT_TYPE f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)
    {
        F[i] = ( a[i].x*a[i].x + a[i].y*a[i].y + b[i].x*b[i].x + b[i].y*b[i].y) / (f);
    }
}
The simplest general optimisation would be something like this:
__global__ void compute(const COMPLEX_TYPE * __restrict__ a,
                        const COMPLEX_TYPE * __restrict__ b,
                        FLOAT_TYPE *F, FLOAT_TYPE f, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    #pragma unroll 8
    for(; i < N; i += blockDim.x * gridDim.x)
    {
        COMPLEX_TYPE aval = a[i], bval = b[i];
        FLOAT_TYPE Fval;
        Fval = ( aval.x*aval.x + aval.y*aval.y + bval.x*bval.x + bval.y*bval.y) / (f);
        F[i] = Fval;
    }
}
[disclaimer: written in browser, not tested, use at own risk]
The idea here is to launch only as many threads as will execute concurrently on your target GPU, and then have every thread perform multiple operations rather than one. This helps amortise a lot of the fixed overhead at the block scheduler and setup code level and improves the overall efficiency. On most architectures this will probably be memory bandwidth limited anyway, so memory coalescing and transaction optimisation is about the most important performance optimisation you will be able to make.
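To make the sizing idea concrete, a host-side launch helper could be sketched roughly like this (the helper name launch_compute is made up, the 256-thread blocks match the question, and the occupancy API is just one way to pick the grid size, which remains a tunable):
void launch_compute(COMPLEX_TYPE *a, COMPLEX_TYPE *b,
                    FLOAT_TYPE *F, FLOAT_TYPE f, int N)
{
    const int blockSize = 256;   // matches the question's block size
    int blocksPerSM = 0;
    // Ask the runtime how many such blocks can be resident on one multiprocessor.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, compute, blockSize, 0);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // "Just enough" blocks to fill the device; the grid-stride loop covers all of N.
    int gridSize = blocksPerSM * prop.multiProcessorCount;
    compute<<<gridSize, blockSize>>>(a, b, F, f, N);
}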
EDIT: Since this answer was marked CW, I elected to add my tests here rather than create my own answer. If anyone objects to this, please just roll back the edit to a previous acceptable version. I'm not adding any new ideas, just testing those provided by @talonmies and @JanLucas.
In my test case, the suggestions (excepting the unroll pragma) offered by @talonmies seem to give a ~10% perf improvement. The suggestion by @JanLucas, to replace the floating-point divide with a floating-point multiply, if acceptable, seems to give about a doubling of performance. This will obviously vary depending on GPU and other specifics. Here's my test:
$ cat t891.cu
#include <cuComplex.h>
#include <stdio.h>
#include <stdlib.h>
#define DSIZE 507904
#define nTPB 256
#define nBLK 256
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
typedef cuFloatComplex COMPLEX_TYPE;
typedef float FLOAT_TYPE;
__global__ void compute(COMPLEX_TYPE *a, COMPLEX_TYPE *b,
FLOAT_TYPE *F, FLOAT_TYPE f, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < N)
{
F[i] = ( a[i].x*a[i].x + a[i].y*a[i].y + b[i].x*b[i].x + b[i].y*b[i].y) / (f);
}
}
__global__ void compute_imp(const COMPLEX_TYPE * __restrict__ a,
const COMPLEX_TYPE * __restrict__ b,
FLOAT_TYPE *F, FLOAT_TYPE f, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
// #pragma unroll 8
for(; i < N; i += blockDim.x * gridDim.x)
{
COMPLEX_TYPE aval = a[i];
COMPLEX_TYPE bval = b[i];
FLOAT_TYPE Fval = ( aval.x*aval.x + aval.y*aval.y + bval.x*bval.x + bval.y*bval.y) / (f);
F[i] = Fval;
}
}
__global__ void compute_imp2(const COMPLEX_TYPE * __restrict__ a,
const COMPLEX_TYPE * __restrict__ b,
FLOAT_TYPE *F, FLOAT_TYPE f, int N)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
// #pragma unroll 8
for(; i < N; i += blockDim.x * gridDim.x)
{
COMPLEX_TYPE aval = a[i];
COMPLEX_TYPE bval = b[i];
FLOAT_TYPE Fval = ( aval.x*aval.x + aval.y*aval.y + bval.x*bval.x + bval.y*bval.y) * (f);
F[i] = Fval;
}
}
int main(){
COMPLEX_TYPE *d_A, *d_B;
FLOAT_TYPE *d_F, f = 4.0f;
cudaMalloc(&d_A, DSIZE*sizeof(COMPLEX_TYPE));
cudaMalloc(&d_B, DSIZE*sizeof(COMPLEX_TYPE));
cudaMalloc(&d_F, DSIZE*sizeof(FLOAT_TYPE));
//warm-up
compute<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_F, f, DSIZE);
cudaDeviceSynchronize();
unsigned long long t1 = dtime_usec(0);
compute<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_F, f, DSIZE);
cudaDeviceSynchronize();
t1 = dtime_usec(t1);
//warm-up
compute_imp<<<DSIZE/(8*nTPB),nTPB>>>(d_A, d_B, d_F, f, DSIZE);
cudaDeviceSynchronize();
unsigned long long t2 = dtime_usec(0);
compute_imp<<<nBLK,nTPB>>>(d_A, d_B, d_F, f, DSIZE);
cudaDeviceSynchronize();
t2 = dtime_usec(t2);
//warm-up
compute_imp2<<<(DSIZE+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_F, 1/f, DSIZE);
cudaDeviceSynchronize();
unsigned long long t3 = dtime_usec(0);
compute_imp2<<<nBLK,nTPB>>>(d_A, d_B, d_F, 1/f, DSIZE);
cudaDeviceSynchronize();
t3 = dtime_usec(t3);
cudaCheckErrors("some error");
printf("t1: %fs, t2: %fs, t3: %fs\n", t1/(float)USECPSEC, t2/(float)(USECPSEC), t3/(float)USECPSEC);
}
$ nvcc -O3 -o t891 t891.cu
$ ./t891
t1: 0.000226s, t2: 0.000209s, t3: 0.000110s
$
Notes:
The unroll pragma doesn't seem to help (it made the code run slower in the few test cases I tried). The compiler will already, in some cases, unroll loops without a specific hint, and loop unrolling is generally an optimization that requires tuning, perhaps careful tuning.
The modification to the kernel proposed by @talonmies to create a grid-striding loop is one of the factors that would need to be taken into account to make a specific loop-unroll trip count useful. The overall grid dimension should be reduced by a factor at least equal to the unroll trip count (a concrete sketch follows these notes). However, I wasn't able to find a "sweet spot".
I mostly tested on a Quadro5000 (Fermi cc2.0 GPU), CUDA 7.5RC, Fedora20. Certainly the behavior will be different on different GPUs, especially newer ones.
The nBLK parameter in this code is another "tunable" parameter; however, I saw little variation with it once above about 64 or so. The best case might be to have a grid equal in size to the data.
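As an illustration of the grid-sizing note above, pairing an unroll factor of 8 with a correspondingly smaller grid might look like this (sketch only, reusing the DSIZE/nTPB definitions from the test code; the right factor is device- and kernel-dependent):
const int unroll = 8;   // must match the #pragma unroll trip count in the kernel
int gridSize = (DSIZE + unroll * nTPB - 1) / (unroll * nTPB);   // roughly 1/8 of the one-element-per-thread grid
compute_imp<<<gridSize, nTPB>>>(d_A, d_B, d_F, f, DSIZE);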

CGAL solves a quadratic program

I have a QP problem:
Minimize: -5x0 - x1 - 4x2 - 5x5 + 1000x0x2 + 1000x1x2 + 1000x0x3
+ 1000x1x3 + 1000x0x4 +1000x1x4
Subject to: x0>=0 x1>=0 x2>=0 x3>=0 x4>=0 x5>=0
x0+x1+x5<=5
x2+x3+x4<=5
The answer should be X0=0 X1=0 X2=5 X3=0 X4=0 X5=5 and obj=-45.
But CGAL gives me X0=5 X1=0 X2=0 X3=0 X4=0 X5=0 and obj=-25.
The code is pasted as follows:
Any suggestion would be appreciated.
Kelly
#include <iostream>
#include <climits>
#include <cassert>
#include <CGAL/basic.h>
#include <CGAL/QP_models.h>
#include <CGAL/QP_functions.h>
// choose exact integral type
#ifdef CGAL_USE_GMP
#include <CGAL/Gmpz.h>
typedef CGAL::Gmpz ET;
#else
#include <CGAL/MP_Float.h>
typedef CGAL::MP_Float ET;
#endif
using namespace std;
// program and solution types
typedef CGAL::Quadratic_program<int> Program;
typedef CGAL::Quadratic_program_solution<ET> Solution;
int
main(){
Program qp (CGAL::SMALLER, true, 0.0, false, 0.0);
qp.set_c(0, -5);
qp.set_c(1, -1);
qp.set_c(2, -4);
qp.set_c(5, -5);
int g = 1000;
qp.set_d(2, 0, g);
qp.set_d(2, 1, g);
qp.set_d(3, 0, g);
qp.set_d(3, 1, g);
qp.set_d(4, 0, g);
qp.set_d(4, 1, g);
int nRow = 0;
qp.set_a(0, nRow, 1.0);
qp.set_a(1, nRow, 1.0);
qp.set_a(5, nRow, 1.0);
qp.set_b(nRow, 5);
nRow++;
qp.set_a(2, nRow, 1.0);
qp.set_a(3, nRow, 1.0);
qp.set_a(4, nRow, 1.0);
qp.set_b(nRow, 5);
Solution s = CGAL::solve_quadratic_program(qp, ET());
assert (s.solves_quadratic_program(qp));
CGAL::print_nonnegative_quadratic_program(std::cout, qp, "first_qp");
std::cout << s;
return 0;
}
Since your matrix D (the quadratic part of the objective function) is not positive semi-definite, your result isn't so surprising. CGAL does not guarantee convergence towards a global minimum, only towards a local one. What you obtain is a local minimum respecting the constraints you imposed.
If you set lower bounds of 1 for x2 and x5 by writing qp.set_l(2, true, 1); qp.set_l(5, true, 1);, you will see that the solver converges towards the solution that you computed.
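Concretely, the two extra calls slot in just before the solve in the program above (a minimal sketch):
// Force x2 and x5 to be at least 1, which steers the solver away from
// the local minimum at x0 = 5.
qp.set_l(2, true, 1);
qp.set_l(5, true, 1);
Solution s = CGAL::solve_quadratic_program(qp, ET());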

vecLib cblas_sgemm documentation wrong?

I'm trying to multiply two matrices using vecLib's cblas:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <vecLib/cblas.h>
int main (void) {
float *A = malloc(sizeof(float) * 2 * 3);
float *B = malloc(sizeof(float) * 3 * 1);
float *C = malloc(sizeof(float) * 2 * 1);
cblas_sgemm(CblasRowMajor,
CblasNoTrans,
CblasNoTrans,
2,
1,
3,
1.0,
A, 2,
B, 3,
0.0,
C, 2);
printf ("[ %f, %f]\n", C[0], C[1]);
return 0;
}
According to the docs every argument seems to match, yet I get this error:
lda must be >= MAX(K,1): lda=2 K=3
BLAS error: Parameter number 9 passed to cblas_sgemm had an invalid value
The error you are seeing seems perfectly correct to my eyes.
LDA is always the pitch of the array A in linear memory. If you are using row major storage order, the pitch will be the number of columns, not the number of rows. So LDA should be 3 in this case.
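In other words, since the leading dimension is the row pitch for row-major storage, the call would be sketched along these lines (note that ldb and ldc are row pitches as well, so for these tightly packed matrices they drop to 1):
cblas_sgemm(CblasRowMajor,
            CblasNoTrans,
            CblasNoTrans,
            2,           /* M: rows of A and C */
            1,           /* N: columns of B and C */
            3,           /* K: columns of A, rows of B */
            1.0,
            A, 3,        /* lda = columns of A */
            B, 1,        /* ldb = columns of B */
            0.0,
            C, 1);       /* ldc = columns of C */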

dot product using cblas is slow

I want to calculate the product A^T*A (A is a 2000x1000 matrix). I also only want to compute the upper triangular part of the result. In the inner loop I have to compute the dot product of two vectors.
Now, here is the problem: using cblas_ddot() is not faster than calculating the dot product with a loop. How is this possible? (Using an Intel Core(TM) i7 CPU M620 @ 2.67 GHz, 1.92 GB RAM.)
The problem is caused essentially by the matrix size, not by ddot. Your matrices are so large that they do not fit in cache memory. The solution is to rearrange the three nested loops such that as much as possible can be done with a line in cache, thereby reducing cache refreshes. A model implementation follows for both the ddot and the daxpy approach. On my computer the time consumption was about 15:1.
In other words: never, never, never program a matrix multiplication along the "row times column" scheme that we learned in school.
/*
Matrix product of A^T * A by two methods.
1) "Row times column" as we learned in school.
2) With rearranged loops such that the need for cache refreshes is reduced
(this can be improved even more).
Compile: gcc -o aT_a aT_a.c -lgslcblas -lblas -lm
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>
#define ROWS 2000
#define COLS 1000
static double a[ROWS][COLS];
static double c[COLS][COLS];
static void dot() {
int i, j;
double *ai, *bj;
ai = a[0];
for (i=0; i<COLS; i++) {
bj = a[0];
for (j=0; j<COLS; j++) {
c[i][j] = cblas_ddot(ROWS,ai,COLS,bj,COLS);
bj += 1;
}
ai += 1;
}
}
static void axpy() {
int i, j;
double *ci, *bj, aij;
for (i=0; i<COLS; i++) {
ci = c[i];
for (j=0; j<COLS; j++) ci[j] = 0.;
for (j=0; j<ROWS; j++) {
aij = a[j][i];
bj = a[j];
cblas_daxpy(COLS,aij,bj,1,ci,1);
}
}
}
int main(int argc, char** argv) {
clock_t t0, t1;
int i, j;
for (i=0; i<ROWS; ++i)
for (j=0; j<COLS; ++j)
a[i][j] = i+j;
t0 = clock();
dot();
t1 = clock();
printf("Time for DOT : %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
t0 = clock();
axpy();
t1 = clock();
printf("Time for AXPY: %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
return 0;
}
The CBLAS dot product is effectively just a computation in a slightly unrolled loop. The netlib Fortran is just this:
DO I = MP1,N,5
DTEMP = DTEMP + DX(I)*DY(I) + DX(I+1)*DY(I+1) +
$ DX(I+2)*DY(I+2) + DX(I+3)*DY(I+3) + DX(I+4)*DY(I+4)
END DO
i.e. just a loop unrolled to a stride of 5.
If you must use a ddot-style dot product for your operation, you might get a performance boost by rewriting your loop to use SSE2 intrinsics:
#include <emmintrin.h>
double ddotsse2(const double *x, const double *y, const int n)
{
    double result[2];
    int n2 = 2 * (n/2);      /* largest even element count */
    __m128d dtemp;
    if ( (n % 2) == 0) {
        dtemp = _mm_setzero_pd();
    } else {
        /* fold the odd trailing element into the accumulator */
        dtemp = _mm_set_sd(x[n-1] * y[n-1]);
    }
    for(int i=0; i<n2; i+=2) {
        __m128d x1 = _mm_loadu_pd(x+i);   /* unaligned loads: no 16-byte alignment assumed */
        __m128d y1 = _mm_loadu_pd(y+i);
        __m128d xy = _mm_mul_pd(x1, y1);
        dtemp = _mm_add_pd(dtemp, xy);
    }
    _mm_storeu_pd(&result[0], dtemp);
    return result[0] + result[1];
}
(not tested, never been compiled, buyer beware).
This may or may not be faster than the standard BLAS implementation. You may also want to investigate whether further loop unrolling could improve performance.
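On the further-unrolling point, one possible sketch (equally untested; the name ddotsse2_unroll2 is made up for illustration) uses two independent accumulators and processes four doubles per iteration:
double ddotsse2_unroll2(const double *x, const double *y, const int n)
{
    __m128d acc0 = _mm_setzero_pd();
    __m128d acc1 = _mm_setzero_pd();
    int i = 0;
    for (; i + 4 <= n; i += 4) {   /* two independent chains of multiply-add */
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(x+i),   _mm_loadu_pd(y+i)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(x+i+2), _mm_loadu_pd(y+i+2)));
    }
    double result[2];
    _mm_storeu_pd(result, _mm_add_pd(acc0, acc1));
    double sum = result[0] + result[1];
    for (; i < n; ++i)             /* scalar tail for the last 0-3 elements */
        sum += x[i] * y[i];
    return sum;
}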
If you're not using SSE2 intrinsics, or you are using a data type that doesn't benefit from them, you can try transposing the matrix for an easy performance improvement in larger matrix multiplications with cblas_?dot. Performing the matrix multiplication in blocks also helps.
void matMulDotProduct(int n, float *A, float* B, int a_size, int b_size, int a_row, int a_col, int b_row, int b_col, float *C) {
    int i, j, k;
    MKL_INT incx, incy;
    incx = 1;
    incy = b_size;
    //copy out multiplying matrix from larger matrix
    float *temp = (float*) malloc(n * n * sizeof(float));
    for (i = 0; i < n; ++i) {
        cblas_scopy(n, &B[(b_row * b_size) + b_col + i], incy, &temp[i * n], 1);
    }
    //transpose
    mkl_simatcopy('R', 'T', n, n, 1.0, temp, 1, 1);
    for (i = 0; i < n; i += BLOCK_SIZE) {
        for (j = 0; j < n; j++) {
            for (k = 0; k < BLOCK_SIZE; ++k) {
                C[((i + k) * n) + j] = cblas_sdot(n, &A[(a_row + i + k) * a_size + a_col], incx, &temp[n * j], 1);
            }
        }
    }
    free(temp);
}
On my machine, this code is about 1 order of magnitude faster than the 3-loop code (but also 1 order of magnitude slower than a cblas_?gemm call) for single-precision floats and 2K by 2K matrices. (I'm using Intel MKL.)
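Finally, since the original question is specifically A^T*A with only the upper triangle needed, it is worth noting that this is exactly what a single cblas_?syrk call computes. A rough sketch for a row-major, tightly packed ROWS x COLS double matrix A (2000 x 1000 in the question), with C sized COLS x COLS:
/* Upper triangle of C = A^T * A in one BLAS call.
   lda = COLS because A is stored row-major with COLS columns; ldc = COLS. */
cblas_dsyrk(CblasRowMajor, CblasUpper, CblasTrans,
            COLS,          /* N: order of C */
            ROWS,          /* K: rows of A  */
            1.0, A, COLS,
            0.0, C, COLS);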