Eigen on STM32 works only until a certain size - embedded

I am trying to use the Eigen C++ library on an STM32F4 Discovery board to perform matrix operations, specifically to do some Kalman filtering on sensor data later on.
I tried linking against the standard C++ library, and I even tried compiling the program with the ARM g++ compiler.
typedef Eigen::Matrix<float, 10, 10> Matrix10d;
Matrix10d mat1 = Matrix10d::Constant(10, 10, 1);
Matrix10d mat2 = Matrix10d::Constant(10, 10, 2);
Matrix10d result;
result = mat1 * mat2;
I can compile the same code if the matrix size is set to 7 or below. If I cross that, the code won't compile and Eigen gives me a warning:
warning: argument 1 value '4294967295' exceeds maximum object size 2147483647
These are the partial error messages I am getting:
In function 'throw_std_bad_alloc',
inlined from 'check_size_for_overflow' at bla/bla/Eigen/src/Core/util/Memory.h:289:24
Here is the memory allocation in Linker script I am using
/*
* STM32F407xG memory setup.
* Note: Use of ram1 and ram2 is mutually exclusive with use of ram0.
*/
MEMORY
{
flash0 : org = 0x08000000, len = 1M
flash1 : org = 0x00000000, len = 0
flash2 : org = 0x00000000, len = 0
flash3 : org = 0x00000000, len = 0
flash4 : org = 0x00000000, len = 0
flash5 : org = 0x00000000, len = 0
flash6 : org = 0x00000000, len = 0
flash7 : org = 0x00000000, len = 0
ram0 : org = 0x20000000, len = 128k /* SRAM1 + SRAM2 */
ram1 : org = 0x20000000, len = 112k /* SRAM1 */
ram2 : org = 0x2001C000, len = 16k /* SRAM2 */
ram3 : org = 0x00000000, len = 0
ram4 : org = 0x10000000, len = 64k /* CCM SRAM */
ram5 : org = 0x40024000, len = 4k /* BCKP SRAM */
ram6 : org = 0x00000000, len = 0
ram7 : org = 0x00000000, len = 0
}
I am just running the STM32F4 Discovery board with an unchanged ChibiOS configuration:
# Stack size to be allocated to the Cortex-M process stack. This stack is
# the stack used by the main() thread.
ifeq ($(USE_PROCESS_STACKSIZE),)
USE_PROCESS_STACKSIZE = 0x400
endif
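As an aside, fixed-size Eigen matrices are normally placed on the stack, and a single 20x20 float matrix is already 1600 bytes, so large local matrices can collide with the 0x400-byte process stack configured here at runtime (as opposed to the compile-time warning above). A sketch of the change, assuming the standard ChibiOS makefile variable shown above:
# Hypothetical: enlarge the main() thread's stack so that
# stack-allocated matrices fit comfortably.
USE_PROCESS_STACKSIZE = 0x1000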
Update
I am not able to reproduce this error anymore. The sad thing is that I didn't change anything to solve the issue.
arm-none-eabi-gcc -c -mcpu=cortex-m4 -O3 -Os -ggdb -fomit-frame-pointer -falign-functions=16 -ffunction-sections -fdata-sections -fno-common -flto -mfloat-abi=hard -mfpu=fpv4-sp-d16 -fsingle-precision-constant -Wall -Wextra -Wundef -Wstrict-prototypes -Wa,-alms=build/lst/ -DCORTEX_USE_FPU=TRUE -DCHPRINTF_USE_FLOAT=TRUE -DTHUMB_PRESENT -mno-thumb-interwork -DTHUMB_NO_INTERWORKING -MD -MP -MF .dep/build.d -I.
The above are the compiler options I am using, if anyone is interested.
Now I can multiply even 20x20 matrices without any problem.
// Typedefs analogous to Matrix10d above (assumed; they were elided from the post)
typedef Eigen::Matrix<float, 20, 20> Matrix20d;
typedef Eigen::Matrix<float, 20, 1> Vector20d;
Matrix20d mat1 = Matrix20d::Constant(20, 20, 2);
// Multiply the matrix with a vector.
Vector20d vec = Vector20d::Constant(20, 1, 2);
Vector20d result;
systime_t startTime = chVTGetSystemTimeX();
result = mat1 * vec;
// Calculate the time difference
systime_t endTime = chVTGetSystemTimeX();
systime_t timeDifference = chTimeDiffX(startTime, endTime);
chprintf(chp,"Time taken for the multiplication in milliseconds : %d\n", (int)timeDifference);
chprintf(chp, "System time : %d \n", startTime);
chprintf(chp, "Systime end : %d \n", endTime);
chprintf(chp, "Values in the vector : \n [");
for(Eigen::Index i=0; i < result.size();i++)
{
chprintf(chp, "%0.3f, ", result(i));
}
chprintf(chp, "] \n");
chThdSleepMilliseconds(1000);
It took about 1 ms to do the above computation.
I thought that there might be some problem with my compiler, so I tried two versions of the compiler:
Version 1
arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 7-2017-q4-major) 7.2.1 20170904 (release) [ARM/embedded-7-branch revision 255204]
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Version 2
arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Related

Why didn't Valgrind detect that an uninitialised value was used?

#include <stdlib.h>
void matvec(int A[][3], int* x, int n) {
int* y = (int*)malloc(n * sizeof(int));
for (int i = 0; i < n; i++) {
for (int j = 0; j < n; j++) {
y[i] += A[i][j] * x[j];
}
}
free(y);
}
int main(void) {
int a[3][3] = {
{0, 1, 2},
{2, 3, 4},
{4, 5, 6}
};
int x[3] = {1, 2, 3};
matvec(a, x, 3);
}
I check for memory problems with the command below:
valgrind --tool=memcheck --leak-check=full --track-origins=yes ./a.out
It gives this output:
==37060== Memcheck, a memory error detector
==37060== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==37060== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==37060== Command: ./a.out
==37060==
==37060==
==37060== HEAP SUMMARY:
==37060== in use at exit: 0 bytes in 0 blocks
==37060== total heap usage: 1 allocs, 1 frees, 12 bytes allocated
==37060==
==37060== All heap blocks were freed -- no leaks are possible
==37060==
==37060== For lists of detected and suppressed errors, rerun with: -s
==37060== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
The question is: why didn't Valgrind detect that the y array is used while uninitialised?
From the Valgrind User Manual §4.2.2 Use of uninitialised values:
A complaint is issued only when your program attempts to make use of uninitialised data in a way that might affect your program's externally-visible behaviour.
You are starting with some uninitialized memory, keeping it uninitialized, and then freeing it. You are not using it in an externally-visible way, such as by using it in an if condition or passing it to a system call.
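For contrast, here is a minimal hypothetical variant in which the uninitialised data does affect externally-visible behaviour, so Memcheck reports "Conditional jump or move depends on uninitialised value(s)":
#include <stdio.h>
#include <stdlib.h>
int main(void) {
    int* y = (int*)malloc(3 * sizeof(int));
    /* y[0] is uninitialised; branching on it is externally visible */
    if (y[0] > 0)
        printf("positive\n");
    free(y);
    return 0;
}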

lapack library for scip optimization

I have a quadratic optimization problem with linear constraints that I want to solve using SCIP. The matrix in the quadratic objective that I want to minimize is positive semi-definite (it is the variance of certain variables, to be precise). I have the problem in a file in CPLEX LP format, and when I optimize in SCIP, I get the message
Quadratic constraint handler does not have LAPACK for eigenvalue computation. Will assume
that matrices (with size > 2x2) are indefinite.
So SCIP starts the optimization assuming that the matrix is indefinite, and it takes a large amount of time. I have installed LAPACK and even copied the liblapack.a file into the lib folder where the SCIP sources and binaries are, then reinstalled SCIP. But I keep getting the above message.
Is there a way to make SCIP use the LAPACK library? I believe the optimization will be really fast, if SCIP can figure out that the matrix is positive semi-definite.
If you feel like patching up SCIP a bit to use your Lapack lib without providing a full Ipopt (though it's relatively easy to build on *nix and could help performance, as mattmilten pointed out), here is a patch that you could try out:
diff --git a/src/scip/cons_quadratic.c b/src/scip/cons_quadratic.c
index 93ba359..795bade 100644
--- a/src/scip/cons_quadratic.c
+++ b/src/scip/cons_quadratic.c
@@ -46,7 +46,7 @@
#include "scip/heur_trysol.h"
#include "scip/debug.h"
#include "nlpi/nlpi.h"
-#include "nlpi/nlpi_ipopt.h"
+/*#include "nlpi/nlpi_ipopt.h" */
/* constraint handler properties */
#define CONSHDLR_NAME "quadratic"
@@ -4257,6 +4257,71 @@ void checkCurvatureEasy(
*determined = FALSE;
}
+#define F77_FUNC(a,A) a##_
+
+ /** LAPACK Fortran subroutine DSYEV */
+ void F77_FUNC(dsyev,DSYEV)(
+ char* jobz, /**< 'N' to compute eigenvalues only, 'V' to compute eigenvalues and eigenvectors */
+ char* uplo, /**< 'U' if upper triangle of A is stored, 'L' if lower triangle of A is stored */
+ int* n, /**< dimension */
+ double* A, /**< matrix A on entry; orthonormal eigenvectors on exit, if jobz == 'V' and info == 0; if jobz == 'N', then the matrix data is destroyed */
+ int* ldA, /**< leading dimension, probably equal to n */
+ double* W, /**< buffer for the eigenvalues in ascending order */
+ double* WORK, /**< workspace array */
+ int* LWORK, /**< length of WORK; if LWORK = -1, then the optimal workspace size is calculated and returned in WORK(1) */
+ int* info /**< == 0: successful exit; < 0: illegal argument at given position; > 0: failed to converge */
+ );
+
+/** Calls Lapacks Dsyev routine to compute eigenvalues and eigenvectors of a dense matrix.
+ */
+static
+SCIP_RETCODE LapackDsyev(
+ SCIP_Bool computeeigenvectors,/**< should also eigenvectors should be computed ? */
+ int N, /**< dimension */
+ SCIP_Real* a, /**< matrix data on input (size N*N); eigenvectors on output if computeeigenvectors == TRUE */
+ SCIP_Real* w /**< buffer to store eigenvalues (size N) */
+ )
+{
+ int INFO;
+ char JOBZ = computeeigenvectors ? 'V' : 'N';
+ char UPLO = 'L';
+ int LDA = N;
+ double* WORK = NULL;
+ int LWORK;
+ double WORK_PROBE;
+ int i;
+
+ /* First we find out how large LWORK should be */
+ LWORK = -1;
+ F77_FUNC(dsyev,DSYEV)(&JOBZ, &UPLO, &N, a, &LDA, w, &WORK_PROBE, &LWORK, &INFO);
+ if( INFO != 0 )
+ {
+ SCIPerrorMessage("There was an error when calling DSYEV. INFO = %d\n", INFO);
+ return SCIP_ERROR;
+ }
+
+ LWORK = (int) WORK_PROBE;
+ assert(LWORK > 0);
+
+ SCIP_ALLOC( BMSallocMemoryArray(&WORK, LWORK) );
+
+ for( i = 0; i < LWORK; ++i )
+ WORK[i] = i;
+
+ F77_FUNC(dsyev,DSYEV)(&JOBZ, &UPLO, &N, a, &LDA, w, WORK, &LWORK, &INFO);
+
+ BMSfreeMemoryArray(&WORK);
+
+ if( INFO != 0 )
+ {
+ SCIPerrorMessage("There was an error when calling DSYEV. INFO = %d\n", INFO);
+ return SCIP_ERROR;
+ }
+
+ return SCIP_OKAY;
+}
+
+
/** checks a quadratic constraint for convexity and/or concavity */
static
SCIP_RETCODE checkCurvature(
@@ -4343,7 +4408,7 @@ SCIP_RETCODE checkCurvature(
return SCIP_OKAY;
}
- if( SCIPisIpoptAvailableIpopt() )
+ if( TRUE )
{
for( i = 0; i < consdata->nbilinterms; ++i )
{
@@ -4479,7 +4544,7 @@ SCIP_RETCODE checkFactorable(
return SCIP_OKAY;
/* need routine to compute eigenvalues/eigenvectors */
- if( !SCIPisIpoptAvailableIpopt() )
+ if( !TRUE )
return SCIP_OKAY;
SCIP_CALL( consdataSortQuadVarTerms(scip, consdata) );
@@ -9395,7 +9460,7 @@ SCIP_DECL_CONSINITSOL(consInitsolQuadratic)
SCIP_CALL( SCIPcatchEvent(scip, SCIP_EVENTTYPE_SOLFOUND, eventhdlr, (SCIP_EVENTDATA*)conshdlr, &conshdlrdata->newsoleventfilterpos) );
}
- if( nconss != 0 && !SCIPisIpoptAvailableIpopt() && !SCIPisInRestart(scip) )
+ if( nconss != 0 && !TRUE && !SCIPisInRestart(scip) )
{
SCIPverbMessage(scip, SCIP_VERBLEVEL_HIGH, NULL, "Quadratic constraint handler does not have LAPACK for eigenvalue computation. Will assume that matrices (with size > 2x2) are indefinite.\n");
}
Use USRLDFLAGS="-llapack -lblas" with make.
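For example, the rebuild after applying the patch could look like this (assuming a plain SCIP makefile build; adjust to your setup):
make USRLDFLAGS="-llapack -lblas"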
Currently, SCIP is only able to use LAPACK through Ipopt. Performance on nonlinear problems is usually better when SCIP is compiled with Ipopt support, so that is definitely recommended. Run
make IPOPT=true
and make sure you have Ipopt installed beforehand.

PETSc - MatMultScale? Matrix X vector X scalar

I'm using PETSc and I wanted to do something like the following. I know I can do:
Mat A
Vec x,y
MatMult(A,x,y)
VecScale(y,0.5)
I was just curious whether there is a single function that would do all of this in one shot; it seems like it would save a loop:
MatMultScale(A,x,0.5,y)
Does such a function exist?
This function (or anything close) does not seem to be in the list of functions operating on Mat, so a brief answer to your question would be... no.
If you often compute $y=\frac12 Ax$, a solution would be to scale the matrix once and for all, using MatScale(A,0.5);.
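A minimal sketch of that approach (same PETSc C API as the test program below):
/* Fold the 0.5 factor into A once; from then on a plain MatMult
   already computes y = 0.5*A*x, with no extra pass over y. */
ierr = MatScale(A,0.5);CHKERRQ(ierr);
ierr = MatMult(A,x,y);CHKERRQ(ierr);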
Would such a function be useful? One way to check is to use the -log_summary option of PETSc to get some profiling information. If your matrix is dense, you will see that the time spent in MatMult() is much larger than the time spent in VecScale(). The question is therefore meaningful only for a sparse matrix, with a few nonzero terms per row.
Here is some code to test it, using two times the identity as the matrix:
static char help[] = "Tests MatMult() followed by VecScale().\n\n";
#include <petscksp.h>
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
Vec x, y;
Mat A;
PetscReal alpha=0.5;
PetscErrorCode ierr;
PetscInt n=42;
PetscInitialize(&argc,&args,(char*)0,help);
ierr = PetscOptionsGetInt(NULL,"-n",&n,NULL);CHKERRQ(ierr);
/* Create the vector*/
ierr = VecCreate(PETSC_COMM_WORLD,&x);CHKERRQ(ierr);
ierr = VecSetSizes(x,PETSC_DECIDE,n);CHKERRQ(ierr);
ierr = VecSetFromOptions(x);CHKERRQ(ierr);
ierr = VecDuplicate(x,&y);CHKERRQ(ierr);
/*
Create matrix. When using MatCreate(), the matrix format can
be specified at runtime.
Performance tuning note: For problems of substantial size,
preallocation of matrix memory is crucial for attaining good
performance. See the matrix chapter of the users manual for details.
*/
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,n,n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
ierr = MatSetUp(A);CHKERRQ(ierr);
/*
This matrix is diagonal, two times identity
should have preallocated, shame
*/
PetscInt i,col;
PetscScalar value=2.0;
for (i=0; i<n; i++) {
col=i;
ierr = MatSetValues(A,1,&i,1,&col,&value,INSERT_VALUES);CHKERRQ(ierr);
}
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
/*
let's do this 42 times for nothing:
*/
for(i=0;i<42;i++){
ierr = MatMult(A,x,y);CHKERRQ(ierr);
ierr = VecScale(y,alpha);CHKERRQ(ierr);
}
ierr = VecDestroy(&x);CHKERRQ(ierr);
ierr = VecDestroy(&y);CHKERRQ(ierr);
ierr = MatDestroy(&A);CHKERRQ(ierr);
ierr = PetscFinalize();
return 0;
}
The makefile :
include ${PETSC_DIR}/conf/variables
include ${PETSC_DIR}/conf/rules
include ${PETSC_DIR}/conf/test
CLINKER=g++
all : ex1
ex1 : main.o chkopts
${CLINKER} -w -o main main.o ${PETSC_LIB}
${RM} main.o
run :
mpirun -np 2 main -n 10000000 -log_summary -help -mat_type mpiaij
And here are the two resulting lines of -log_summary that could answer your question:
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecScale 42 1.0 1.0709e+00 1.0 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00 4 50 0 0 0 4 50 0 0 0 392
MatMult 42 1.0 5.7360e+00 1.1 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00 20 50 0 0 0 20 50 0 0 0 73
So the 42 VecScale() operations took about 1.1 seconds while the 42 MatMult() operations took 5.7 seconds. Suppressing the VecScale() operations would therefore speed up this part of the code by roughly 16% in the best case (1.07 s out of 6.8 s spent in these two events). The overhead due to the for loop is even lower than that. I guess that's the reason why this function does not exist.
I apologize for the poor performance of my computer (392 Mflop/s for VecScale()...). I am curious to know what happens on yours!

Clang, link time optimization fails for AVX horizontal add

I have a small piece of testing code which calculates the dot products of two vectors with a third vector using AVX instructions (A dot C and B dot C below). It also adds the two products, but that is just to make the function return something for this example.
#include <iostream>
#include <immintrin.h>
double compute(const double *x)
{
__m256d A = _mm256_loadu_pd(x);
__m256d B = _mm256_loadu_pd(x + 4);
__m256d C = _mm256_loadu_pd(x + 8);
__m256d c1 = _mm256_mul_pd(A, C);
__m256d c2 = _mm256_mul_pd(B, C);
__m256d tmp = _mm256_hadd_pd(c1, c2);
__m128d lo = _mm256_extractf128_pd(tmp, 0);
__m128d hi = _mm256_extractf128_pd(tmp, 1);
__m128d dotp = _mm_add_pd(lo, hi);
double y[2];
_mm_store_pd(y, dotp);
return y[0] + y[1];
}
int main(int argc, char *argv[])
{
const double v[12] = {0.3, 2.9, 1.3, 4.0, -1.0, -2.1, -3.0, -4.0, 0.0, 2.0, 1.3, 1.2};
double x = 0;
std::cout << "AVX" << std::endl;
x = compute(v);
std::cout << "x = " << x << std::endl;
return 0;
}
When I compile as
clang++ -O3 -mavx main.cc -o main
everything works fine. If I enable link-time optimization:
clang++ -flto -O3 -mavx main.cc -o main
I get the following error: "LLVM ERROR: Do not know how to split the result of this operator!". I have narrowed the culprit down to the _mm256_hadd_pd statement. If it is exchanged with e.g. _mm256_add_pd, link-time optimization works again. I realize that this is a silly example to use link-time optimization for, but the error occurred in a different context where link-time optimization is extremely helpful.
Can anyone explain what is going on here?
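In the meantime, a workaround sketch of my own (not a confirmed fix for this toolchain): compute the two horizontal sums with extract/unpack/add intrinsics instead of _mm256_hadd_pd, which side-steps the instruction that triggers the LTO failure while producing the same result:
#include <immintrin.h>
// Horizontal sum of a 256-bit double vector without _mm256_hadd_pd.
static double hsum256(__m256d v)
{
    __m128d lo  = _mm256_castpd256_pd128(v);    // lower two lanes
    __m128d hi  = _mm256_extractf128_pd(v, 1);  // upper two lanes
    __m128d sum = _mm_add_pd(lo, hi);           // pairwise add
    __m128d swp = _mm_unpackhi_pd(sum, sum);    // move high lane down
    return _mm_cvtsd_f64(_mm_add_sd(sum, swp));
}
double compute_nohadd(const double *x)
{
    __m256d A = _mm256_loadu_pd(x);
    __m256d B = _mm256_loadu_pd(x + 4);
    __m256d C = _mm256_loadu_pd(x + 8);
    return hsum256(_mm256_mul_pd(A, C)) + hsum256(_mm256_mul_pd(B, C));
}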

OpenCL AMD GPU compiler crash

I am working on a kernel that finds intersections between a ray and a triangle list, but (there is always a "but") I am having trouble with my OpenCL compiler: it crashes when I try to compile the kernel.
It compiles fine with the CPU compiler, but the GPU compiler crashes...
//-----------------------------------------------------------------------------
//---------------------------------DEFINES-------------------------------------
//-----------------------------------------------------------------------------
#define RAYON_SORTANT -1000
#define RAYON_ENTRANT 1000
#define MIN_LONGUEUR_RT 1.E-6f
//-----------------------------------------------------------------------------
//---------------------------------CONTENT-------------------------------------
//-----------------------------------------------------------------------------
typedef struct s_CDPoint
{
float x;
float y;
float z;
} CDPoint;
typedef struct s_TTriangle
{
CDPoint triangle_[3];
CDPoint normal_;
} TTriangle;
typedef struct s_GridIntersection
{
CDPoint pos_;
float distance_;
int sensNormale_;
unsigned int idTriangle_;
} TGridIntersection;
//-----------------------------------------------------------------------------
//---------------------------------MUTEX---------------------------------------
//-----------------------------------------------------------------------------
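// GetSemaphor: naive spin lock, busy-waiting until atomic_xchg returns 0,
// i.e. until this work-item is the one that flipped the flag from 0 to 1.
// ReleaseSemaphor: releases the lock by resetting the flag to 0.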
void GetSemaphor(__global int * semaphor)
{
int occupied = atomic_xchg(semaphor, 1);
while(occupied > 0)
{
occupied = atomic_xchg(semaphor, 1);
}
}
void ReleaseSemaphor(__global int * semaphor)
{
atomic_xchg(semaphor, 0);
}
//-----------------------------------------------------------------------------
//---------------------------------GEOMETRIE-----------------------------------
//-----------------------------------------------------------------------------
float dotProduct(const CDPoint* pA, const CDPoint* pB)
{
return (pA->x * pB->x + pA->y * pB->y + pA->z * pB->z);
}
CDPoint crossProduct(const CDPoint* pA, const CDPoint* pB)
{
CDPoint res;
res.x = pA->y * pB->z - pB->y * pA->z;
res.y = pA->z * pB->x - pB->z * pA->x;
res.z = pA->x * pB->y - pB->x * pA->y;
return res;
}
CDPoint soustraction(const CDPoint* pA, const CDPoint* pB)
{
CDPoint res;
res.x = pA->x - pB->x;
res.y = pA->y - pB->y;
res.z = pA->z - pB->z;
return res;
}
CDPoint addition(const CDPoint* pA, const CDPoint* pB)
{
CDPoint res;
res.x = pA->x + pB->x;
res.y = pA->y + pB->y;
res.z = pA->z + pB->z;
return res;
}
CDPoint homothetie(const CDPoint* pA, float val)
{
CDPoint pnt;
pnt.x = pA->x * val;
pnt.y = pA->y * val;
pnt.z = pA->z * val;
return pnt;
}
//-----------------------------------------------------------------------------
//---------------------------------KERNEL--------------------------------------
//-----------------------------------------------------------------------------
__kernel void IntersectionTriangle( __global const TTriangle* pTriangleListe,
const unsigned int idxDebutTriangle,
const unsigned int idxFin,
__constant const CDPoint* pPointOrigine,
__constant const CDPoint* pDir,
__global int *nbInter,
__global TGridIntersection* pResults )
{
__private unsigned int index = get_global_id(0) + idxDebutTriangle;
if (index > idxFin) return;
__global const TTriangle *pTriangle = &pTriangleListe[index];
__private float distance = 0.f;
// Triangle edges and the plane normal
__private CDPoint edge1 = soustraction(&pTriangle->triangle_[1], &pTriangle->triangle_[0]);
__private CDPoint edge2 = soustraction(&pTriangle->triangle_[2], &pTriangle->triangle_[0]);
__private CDPoint pvec = crossProduct(pDir, &edge2); // cross product
// Are the ray and the triangle parallel?
__private float det = dotProduct(&edge1, &pvec);
if (det == 0.f)
{
return ;
}
__private float inv_det = 1.f / det;
// Vector from triangle vertex 0 to the ray origin
__private CDPoint tvec = soustraction(pPointOrigine, &pTriangle->triangle_[0]);
//Calculate u parameter and test bound
__private float u = (dotProduct(&tvec, &pvec)) * inv_det;
//The intersection lies outside of the triangle
if (u < -MIN_LONGUEUR_RT
|| u > 1.f + MIN_LONGUEUR_RT)
{
return ;
}
u = max(u, 0.f);
//Prepare to test v parameter
__private CDPoint qvec = crossProduct(&tvec, &edge1);
//Calculate V parameter and test bound
__private float v = dotProduct(pDir, &qvec) * inv_det;
//The intersection lies outside of the triangle
if (v < -MIN_LONGUEUR_RT
|| u + v > 1.f + MIN_LONGUEUR_RT)
{
return ;
}
// Get distance
distance = dotProduct(&edge2, &qvec) * inv_det;
if (distance > -MIN_LONGUEUR_RT)
{
// We are using nbInter as semaphor index
GetSemaphor(nbInter);
__private int idxInter = *nbInter;
pResults[idxInter].distance_ = max(distance, 0.f);
// Intersection point
__private CDPoint vDir = homothetie(pDir, distance);
pResults[idxInter].pos_ = addition(pPointOrigine, &vDir);
// Which way is the ray going relative to the triangle (entering or exiting)?
pResults[idxInter].sensNormale_ = dotProduct(&pTriangle->normal_, pDir) > 0.f ? RAYON_SORTANT : RAYON_ENTRANT;
// Triangle id
pResults[idxInter].idTriangle_ = index - idxDebutTriangle;
// inc nb inter
*nbInter = *nbInter + 1;
ReleaseSemaphor(nbInter);
}
}
I noticed that if I change "__global const TTriangle* pTriangleListe" to "const TTriangle* pTriangleListe" it compiles, but that is not the code I want!
What I want to do, exactly, is to fill "pTriangleListe" with all the triangles and, using a uniform grid, get the indexes of the triangles to check (idxDebutTriangle/idxFin). "pPointOrigine" is the ray origin and "pDir" the direction. "nbInter" and "pResults" are shared and will contain the intersections (they are protected by the semaphore).
Here is my OpenCL configuration:
Platform [0]
id = 5339E7D8
profile = FULL_PROFILE
version = OpenCL 1.2 AMD-APP (1445.5)
name = AMD Accelerated Parallel Processing
vendor = Advanced Micro Devices, Inc.
extensions = cl_khr_icd
cl_khr_d3d10_sharing
cl_khr_d3d11_sharing
cl_khr_dx9_media_sharing
cl_amd_event_callback
cl_amd_offline_devices
cl_amd_hsa
2 Devices detected
Device [0]
id = 010DFA00
type = CL_DEVICE_TYPE_GPU
name = Cedar
vendor = Advanced Micro Devices, Inc.
driver version = 1445.5 (VM)
device version = OpenCL 1.2 AMD-APP (1445.5)
profile = FULL_PROFILE
max compute units = 2
max work items dimensions = 3
max work item sizes = 128 / 128 / 128
max work group size = 128
max clock frequency = 650 MHz
address_bits = 32
max mem alloc size = 512 MB
global mem size = 1024 MB
image support = CL_TRUE
max read image args = 128
max write image args = 8
2D image max size = 16384 x 16384
3D image max size = 2048 x 2048 x 2048
max samplers = 16
max parameter size = 1024
mem base addr align = 2048
min data type align size = 128
single fp config = CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
global mem cache type = CL_NONE
max constant buffer size = 64 KB
max constant args = 8
local mem type = CL_LOCAL
local mem size = 32 KB
error correction support = CL_FALSE
profiling timer resolution = 1 ns
endian little = CL_TRUE
available = CL_TRUE
compiler available = CL_TRUE
execution capabilities = CL_EXEC_KERNEL
queue properties = CL_QUEUE_PROFILING_ENABLE
extensions = cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_3d_image_writes
cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_ext_atomic_counters_32
cl_amd_device_attribute_query
cl_amd_vec3
cl_amd_printf
cl_amd_media_ops
cl_amd_media_ops2
cl_amd_popcnt
cl_khr_d3d10_sharing
cl_khr_d3d11_sharing
cl_khr_dx9_media_sharing
cl_amd_image2d_from_buffer_read_only
cl_khr_spir
cl_khr_gl_event
Device [1]
id = 03501CD0
type = CL_DEVICE_TYPE_CPU
name = Intel(R) Core(TM) i3-2130 CPU @ 3.40GHz
vendor = GenuineIntel
driver version = 1445.5 (sse2,avx)
device version = OpenCL 1.2 AMD-APP (1445.5)
profile = FULL_PROFILE
max compute units = 4
max work items dimensions = 3
max work item sizes = 1024 / 1024 / 1024
max work group size = 1024
max clock frequency = 3392 MHz
address_bits = 32
max mem alloc size = 1024 MB
global mem size = 2048 MB
image support = CL_TRUE
max read image args = 128
max write image args = 8
2D image max size = 8192 x 8192
3D image max size = 2048 x 2048 x 2048
max samplers = 16
max parameter size = 4096
mem base addr align = 1024
min data type align size = 128
single fp config = CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA
global mem cache type = CL_READ_WRITE_CACHE
global mem cacheline size = 64
global mem cache size = 32768
max constant buffer size = 64 KB
max constant args = 8
local mem type = CL_GLOBAL
local mem size = 32 KB
error correction support = CL_FALSE
profiling timer resolution = 301 ns
endian little = CL_TRUE
available = CL_TRUE
compiler available = CL_TRUE
execution capabilities = CL_EXEC_KERNEL CL_EXEC_NATIVE_KERNEL
queue properties = CL_QUEUE_PROFILING_ENABLE
extensions = cl_khr_fp64
cl_amd_fp64
cl_khr_global_int32_base_atomics
cl_khr_global_int32_extended_atomics
cl_khr_local_int32_base_atomics
cl_khr_local_int32_extended_atomics
cl_khr_3d_image_writes
cl_khr_byte_addressable_store
cl_khr_gl_sharing
cl_ext_device_fission
cl_amd_device_attribute_query
cl_amd_vec3
cl_amd_printf
cl_amd_media_ops
cl_amd_media_ops2
cl_amd_popcnt
cl_khr_d3d10_sharing
cl_khr_spir
cl_amd_svm
cl_khr_gl_event
Thank you for reading!