PGI 18.1 vs PGI 18.4 - GPU

Is there any change from PGI version 18.1 to 18.4 regarding
#pragma acc routine seq? The code I have works fine with version 18.1 but gives an error when I use the newer version. I generate kernels that use the math library.
using namespace std;
#pragma acc routine
double myfunc(double x)
{
return(fabs(x));
}
The default parallelism for the routine directive is (or was) sequential,
i.e. #pragma acc routine is equivalent to #pragma acc routine seq.
This works fine in version 18.1.
But I think there might be some change in the new version, since when I compile with version 18.4 it gives an error complaining about the math library function.
Oddly enough, the following also causes an error:
#include <cmath>
#include "openacc.h"
using namespace std;
#pragma acc routine seq
double sine( double x )
{
return ( sin( x ) );
}
This gives a compilation error, but when I change the math header to math.h it compiles perfectly fine. Can anyone explain why this is not working with pgc++?

What's the actual error you get? I get the same error with both PGI 18.1 and 18.4:
% pgc++ -c test1.cpp -ta=tesla -Minfo=accel -w -V18.1
PGCC-S-1000-Call in OpenACC region to procedure 'sin' which has no acc routine information (test1.cpp: 10)
PGCC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages) (test1.cpp: 10)
sine(double):
10, Generating acc routine seq
Generating Tesla code
11, Accelerator restriction: call to 'sin' with no acc routine information
The solution here is to include the PGI header "accelmath.h" to get the device versions of the C99 math intrinsics.
% diff test1.cpp test2.cpp
4a5
> #include "accelmath.h"
% pgc++ -c test2.cpp -ta=tesla -Minfo=accel -w -V18.1
sine(double):
12, Generating acc routine seq
Generating Tesla code
% pgc++ -c test2.cpp -ta=tesla -Minfo=accel -w -V18.4
sine(double):
12, Generating acc routine seq
Generating Tesla code
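For reference, the complete test2.cpp would look something like this (a sketch reconstructed from the question's code plus the diff above; the exact line numbers will differ from the compiler output shown):

#include <cmath>
#include "openacc.h"
#include "accelmath.h"  // PGI header providing device versions of the C99 math intrinsics
using namespace std;

#pragma acc routine seq
double sine( double x )
{
    return ( sin( x ) );
}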

Related

Corrupted stack detected inside Tensorflow Lite Micro interpreter->Invoke() call with Mobilenet_V1_0.25_224_quant model

I am trying to use a quantized model with TensorFlow Lite Micro, and got a segmentation fault inside the interpreter->Invoke() call.
The debugger showed that the segmentation fault occurred on returning from Eval() in conv.cc on Node 28 of CONV_2D, and the stack was corrupted. With the compiler flags "-fstack-protector-all -Wstack-protector", the error message is *** stack smashing detected ***: <unknown> terminated.
My test was simply the person detection example with the model replaced by Mobilenet_V1_0.25_224_quant from the TensorFlow Lite pre-trained models site, with kTensorArenaSize increased sufficiently, the model input/output sizes changed to 224x224x3 and 1x1001, and the additional required operators pulled in.
I also tried a few different models: another quantized model, Mobilenet_V1_0.25_192_quant, shows the same segfault problem,
but the regular floating-point models Mobilenet_V1_0.25_192 and Mobilenet_V1_0.25_224 run OK over many loops.
Has anyone seen a similar problem? Or are there limitations of TensorFlow Lite Micro that I should be aware of?
This problem can be reproduced at this commit of the forked tensorflow repo.
Build command:
$ bazel build //tensorflow/lite/micro/examples/person_detection:person_detection -c dbg --copt=-fstack-protector-all --copt=-Wstack-protector --copt=-fno-omit-frame-pointer
And run:
$ ./bazel-bin/tensorflow/lite/micro/examples/person_detection/person_detection
Files changed:
tensorflow/lite/micro/examples/person_detection/main_functions.cc
tensorflow/lite/micro/examples/person_detection/model_settings.h
tensorflow/lite/micro/examples/person_detection/person_detect_model_data.cc
Changes in main_functions.cc:
constexpr int kTensorArenaSize = 1400 * 1024;
static tflite::MicroOpResolver<5> micro_op_resolver;
micro_op_resolver.AddBuiltin(tflite::BuiltinOperator_RESHAPE,
tflite::ops::micro::Register_RESHAPE());
micro_op_resolver.AddBuiltin(tflite::BuiltinOperator_SOFTMAX,
tflite::ops::micro::Register_SOFTMAX(), 1, 2);
Changes in model_settings.h:
constexpr int kNumCols = 224;
constexpr int kNumRows = 224;
constexpr int kNumChannels = 3;
constexpr int kCategoryCount = 1001;
The last changed file, person_detect_model_data.cc, is pretty big; please see the full file on GitHub.
March 28, 2020:
Also tested on a Raspberry Pi 3; the results are the same as on x86 Ubuntu 18.04.
pi@raspberrypi:~/tests $ ./person_detection
*** stack smashing detected ***: <unknown> terminated
Aborted
Thanks for your help.
Problem root cause found - Updated on April 2, 2020:
I found that the problem is caused by an array overrun in the layer operation data. TensorFlow Lite Micro has a hidden limit (or I missed the documentation; at least the runtime does not check it) of a maximum of 256 output channels, set in the OpData structure of conv.cc:
constexpr int kMaxChannels = 256;
....
struct OpData {
  ...
  // Per channel output multiplier and shift.
  // TODO(b/141139247): Allocate these dynamically when possible.
  int32_t per_channel_output_multiplier[kMaxChannels];
  int32_t per_channel_output_shift[kMaxChannels];
  ...
};
The MobileNet model Mobilenet_V1_0.25_224_quant.tflite has 1000 output classes and a total of 1001 channels internally. This caused stack corruption in tflite::PopulateConvolutionQuantizationParams() at tensorflow/lite/kernels/kernel_util.cc:90 for the last CONV_2D, which has an output size of 1001.
There is no problem for TF and TF Lite, as they apparently do not use this structure definition.
I confirmed this by increasing kMaxChannels to 1024 and running many loops of model evaluation calls.
Most TF Lite Micro use cases likely involve small models and probably won't run into this problem.
Still, perhaps this limit should be better documented and/or checked at run time?
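To make the failure mode concrete, here is a minimal standalone sketch (not the actual TFLite Micro code) of the same pattern: filling per-channel data for 1001 channels into fixed 256-element stack arrays.

#include <cstdint>
#include <cstdio>

constexpr int kMaxChannels = 256;

// Simplified stand-in for the OpData struct in conv.cc.
struct OpDataSketch {
  int32_t per_channel_output_multiplier[kMaxChannels];
  int32_t per_channel_output_shift[kMaxChannels];
};

// Simplified stand-in for PopulateConvolutionQuantizationParams(): it writes
// one entry per output channel without checking against kMaxChannels.
void PopulateSketch(OpDataSketch* data, int num_channels) {
  for (int c = 0; c < num_channels; ++c) {
    data->per_channel_output_multiplier[c] = 0;
    data->per_channel_output_shift[c] = 0;
  }
}

int main() {
  OpDataSketch data;
  // 1001 output channels, as in the last CONV_2D of the quantized MobileNet:
  // entries 256..1000 land past the arrays and corrupt the enclosing stack
  // frame, which is exactly what -fstack-protector-all reports.
  PopulateSketch(&data, 1001);
  std::puts("with the stack protector enabled, execution likely aborts before this line");
}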

Can I use `omp_get_thread_num()` on the GPU?

I have OpenMP code which works on the CPU by having each thread manage memory addressed by the thread's id number, accessible via omp_get_thread_num(). This works well on the CPU, but can it work on the GPU?
A MWE is:
#include <iostream>
#include <omp.h>

int main(){
  const int SIZE = 400000;
  int *m;
  m = new int[SIZE];
  #pragma omp target
  {
    #pragma omp parallel for
    for(int i=0;i<SIZE;i++)
      m[i] = omp_get_thread_num();
  }
  for(int i=0;i<SIZE;i++)
    std::cout<<m[i]<<"\n";
}
It works fine on the GPU for me with GCC. You need to map m though, e.g. like this:
#pragma omp target map(tofrom:m[0:SIZE])
I compiled like this
g++ -O3 -Wall -fopenmp -fno-stack-protector so.cpp
You can see an example for a system without offloading here:
http://coliru.stacked-crooked.com/a/1e756410d6e2db61
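For reference, here is what the MWE looks like with the suggested map clause added (a sketch combining the original code with the mapping above):

#include <iostream>
#include <omp.h>

int main(){
  const int SIZE = 400000;
  int *m = new int[SIZE];
  // Map the heap buffer to the device and copy it back afterwards, as
  // suggested above; without this the device writes never reach the host.
  #pragma omp target map(tofrom: m[0:SIZE])
  {
    #pragma omp parallel for
    for(int i = 0; i < SIZE; i++)
      m[i] = omp_get_thread_num();
  }
  for(int i = 0; i < SIZE; i++)
    std::cout << m[i] << "\n";
  delete[] m;
}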
A method I use to find out the number of teams and threads before doing work is this:
#pragma omp target teams defaultmap(tofrom:scalar)
{
nteams = omp_get_num_teams();
#pragma omp parallel
#pragma omp single
nthreads = omp_get_num_threads();
}
On my system with GCC 7.2, Ubuntu 17.10, and gcc-offload-nvptx with a GTX 1060 I get nteams = 30 and nthreads = 8. See this answer where I do a custom reduction for a target region using threads and teams. With -foffload=disable I get nteams = 1 and nthreads = 8 (a 4-core/8-hardware-thread CPU).
I added -fopt-info to the compile options and I get only the message
note: basic block vectorized
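A self-contained version of that teams/threads probe might look like this (a sketch; it assumes a GCC build with nvptx offloading, e.g. the gcc-offload-nvptx setup mentioned above):

#include <cstdio>
#include <omp.h>

int main() {
  int nteams = 0, nthreads = 0;
  // Same pattern as above: every team's initial thread writes the same values.
  #pragma omp target teams map(tofrom: nteams, nthreads)
  {
    nteams = omp_get_num_teams();
    #pragma omp parallel
    #pragma omp single
    nthreads = omp_get_num_threads();
  }
  std::printf("teams = %d, threads per team = %d\n", nteams, nthreads);
}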
The answer seems to be no.
Compiling with PGI using:
pgc++ -fast -mp -ta=tesla,pinned,cc60 -Minfo=all test2.cpp
gives:
13, Parallel region activated
Parallel loop activated with static block schedule
Loop not vectorized/parallelized: contains call
14, Parallel region terminated
whereas compiling with GCC using
g++ -O3 test2.cpp -fopenmp -fopt-info
gives
test2.cpp:17: note: not vectorized: loop contains function calls or data references that cannot be analyzed
test2.cpp:17: note: bad data references.

MKL Sparse BLAS segfault when transposing CSR with 100M rows

I am trying to use MKL Sparse BLAS for CSR matrices with the number of rows/columns on the order of 100M. My code, which seems to work fine for 10M rows/columns, fails with a segfault when I increase this to 100M.
I isolated the problem to the following code snippet:
#include "mkl_spblas.h"

void TestSegfault1() {
  float values[1] = { 1.0f };
  int col_indx[1] = { 0 };
  int rows_start[1] = { 0 };
  int rows_end[1] = { 1 };

  // Step 1. Create 1 x 100M matrix
  // with single non-zero value at (0,0)
  sparse_matrix_t A;
  mkl_sparse_s_create_csr(
      &A, SPARSE_INDEX_BASE_ZERO, 1, 100000000,
      rows_start, rows_end, col_indx, values);

  // Step 2. Transpose it to get 100M x 1 matrix
  sparse_matrix_t B;
  mkl_sparse_convert_csr(A, SPARSE_OPERATION_TRANSPOSE, &B);
}
This function segfaults in mkl_sparse_convert_csr with backtrace
#0 0x00000000004c0d03 in mkl_sparse_s_convert_csr_i4_avx ()
#1 0x0000000000434061 in TestSegfault1 ()
For slightly different code (but essentially the same) it has a little more detail:
#0 0x00000000008fc09b in mkl_serv_free ()
#1 0x000000000099949e in mkl_sparse_s_export_csr_data_i4_avx ()
#2 0x0000000000999ee4 in mkl_sparse_s_convert_csr_i4_avx ()
Apparently something goes bad in memory allocation. And it sure looks like some kind of integer overflow from the outside. The build of MKL I have uses MKL_INT = int = int32.
Is it indeed the case that the limit on the number of rows I can have in a Sparse BLAS CSR matrix is < 100M (it looks more like ~65M)? Or am I doing it wrong?
EDIT 1: MKL version string is "Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for Intel(R) 64 architecture applications".
EDIT 2: Figured it out. There is indeed a subtle kind of integer overflow when allocating memory for internal per-thread buffers. At some point inside mkl_sparse_s_export_csr_data_i4_avx it attempts to allocate (omp_get_max_threads() + 1) * num_rows * 4 bytes; the number doesn't fit in 32-bit signed integer. Subsequent call to mkl_serv_malloc causes memory corruption and eventually segfault. One possible solution is to alter the number of OpenMP threads via omp_set_num_threads call.
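A quick back-of-the-envelope check of that overflow (assuming, say, 8 OpenMP threads): (8 + 1) * 100,000,000 * 4 = 3,600,000,000 bytes, which does not fit in a signed 32-bit integer. A tiny sketch of the arithmetic and the thread-count workaround:

#include <omp.h>
#include <climits>
#include <cstdint>
#include <cstdio>

int main() {
  const std::int64_t num_rows = 100000000;              // 100M rows
  const std::int64_t threads  = omp_get_max_threads();  // e.g. 8 on a typical desktop
  const std::int64_t bytes    = (threads + 1) * num_rows * 4;
  std::printf("internal buffer size: %lld bytes (INT_MAX = %d)\n",
              static_cast<long long>(bytes), INT_MAX);
  // Workaround mentioned above: reduce the thread count so the 32-bit size
  // computation inside MKL no longer overflows, e.g.:
  // omp_set_num_threads(4);   // (4 + 1) * 100M * 4 = 2.0e9 < INT_MAX
}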
Could you check your example with the latest version of MKL? I ran it on MKL 11.3.2 and it passed correctly for a 100M matrix. However, it could depend on the number of threads on your machine (the matrix size multiplied by the number of threads has to be less than the maximum int). To prevent such issues I strongly recommend using the ILP64 version of the MKL libraries.
Thanks,
Alex
Check how this example works with the latest MKL 2019 u4,
compiling the example in ILP64 mode as follows:
icc -I/opt/intel/compilers_and_libraries_2019/linux/mkl/include test_csr.cpp \
-L/opt/intel/compilers_and_libraries_2019/linux/mkl/lib/intel64 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -liomp5 -lpthread -lm -ldl
./a.out
mkl_sparse_convert_csr passed

SCIP cannot read my input

I'm trying to use the SCIP solver (http://scip.zib.de/). My input (1.lp) is in lpsolve format. It looks like this:
max: +2 x_0_0;
+x_0_0 <= 1;
+x_0_0 <= 1;
-3x_0_0 <= 0;
0 <= x_0_0 <= 1;
int x_0_0;
I run SCIP like this:
"c:\Program Files\SCIP\scip.exe" -f 1.lp -l 1.lp.out
However, SCIP generates this output:
SCIP version 3.0.0 [precision: 8 byte] [memory: block] [mode: optimized] [LP solver: SoPlex 1.7.0] [GitHash: c95600b]
Copyright (c) 2002-2012 Konrad-Zuse-Zentrum fuer Informationstechnik Berlin (ZIB)
External codes:
SoPlex 1.7.0 Linear Programming Solver developed at Zuse Institute Berlin (soplex.zib.de) [GitHash: 657dfe5]
cppad-20120101.3 Algorithmic Differentiation of C++ algorithms developed by B. Bell (www.coin-or.org/CppAD)
Ipopt 3.10.2 Interior Point Optimizer developed by A. Waechter et.al. (www.coin-or.org/Ipopt)
user parameter file <scip.set> not found - using default parameters
read problem <[...]1.lp>
============
input:
^
error reading file <[...]1.lp>
I guess that means it is choking on a whitespace... What am I doing wrong?
EDIT:
See my answer for details. After giving the input in CPLEX format, everything works fine.
The answer is that SCIP apparently cannot read lpsolve input files. The LP format they are referring to in this overview of readable file formats is in fact the CPLEX LP format.
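For reference, the same tiny problem rewritten in CPLEX LP format might look like this (a sketch written from my reading of the format; constraint names c1-c3 are arbitrary):

Maximize
 obj: 2 x_0_0
Subject To
 c1: x_0_0 <= 1
 c2: x_0_0 <= 1
 c3: - 3 x_0_0 <= 0
Bounds
 0 <= x_0_0 <= 1
General
 x_0_0
End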

gcc memory alignment pragma

Does gcc have a memory alignment pragma, akin to #pragma vector aligned in the Intel compiler?
I would like to tell the compiler to optimize a particular loop using aligned load/store instructions. To avoid possible confusion, this is not about struct packing.
For example:
#if defined (__INTEL_COMPILER)
#pragma vector aligned
#endif
for (int a = 0; a < int(N); ++a) {
  q10 += Ix(a,0,0)*Iy(a,1,1)*Iz(a,0,0);
  q11 += Ix(a,0,0)*Iy(a,0,1)*Iz(a,1,0);
  q12 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,0,1);
  q13 += Ix(a,1,0)*Iy(a,0,0)*Iz(a,0,1);
  q14 += Ix(a,0,0)*Iy(a,1,0)*Iz(a,0,1);
  q15 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,1,1);
}
Thanks
You can tell GCC that a pointer points to aligned memory by using a typedef to create an over-aligned type that you can declare pointers to.
This helps gcc but not clang 7.0 or ICC 19; see the x86-64 non-AVX asm they emit on Godbolt. (Only GCC folds a load into a memory operand for mulps, instead of using a separate movups.) You have to use __builtin_assume_aligned if you want to portably convey an alignment promise to GNU C compilers other than GCC itself.
From http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html
typedef double aligned_double __attribute__((aligned (16)));
// Note: sizeof(aligned_double) is 8, not 16
void some_function(aligned_double *x, aligned_double *y, int n)
{
  for (int i = 0; i < n; ++i) {
    // math!
  }
}
This won't make aligned_double 16 bytes wide. This will just make it aligned to a 16-byte boundary, or rather the first one in an array will be. Looking at the disassembly on my computer, as soon as I use the alignment directive, I start to see a LOT of vector ops. I am using a Power architecture computer at the moment so it's altivec code, but I think this does what you want.
(Note: I wasn't using double when I tested this, because AltiVec doesn't support double-precision floats.)
You can see some other examples of autovectorization using the type attributes here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
I tried your solution with g++ version 4.5.2 (both Ubuntu and Windows) and it did not vectorize the loop.
If the alignment attribute is removed then it vectorizes the loop, using unaligned loads.
If the function is inlined so that the array can be accessed directly with the pointer eliminated, then it is vectorized with aligned loads.
In both cases, the alignment attribute prevents vectorization. This is ironic: The "aligned_double *x" was supposed to enable vectorization but it does the opposite.
Which compiler was it that reported vectorized loops for you? I suspect it was not a gcc compiler?
Does gcc have a memory alignment pragma, akin to #pragma vector aligned?
It looks like newer versions of GCC have __builtin_assume_aligned:
Built-in Function: void * __builtin_assume_aligned (const void *exp, size_t align, ...)
This function returns its first argument, and allows the compiler to assume that the returned pointer is at least align bytes aligned.
This built-in can have either two or three arguments; if it has three,
the third argument should have integer type, and if it is nonzero
it means the misalignment offset. For example:
void *x = __builtin_assume_aligned (arg, 16);
means that the compiler can assume x, set to arg, is at least 16-byte aligned, while:
void *x = __builtin_assume_aligned (arg, 32, 8);
means that the compiler can assume for x, set to arg, that (char *) x - 8 is 32-byte aligned.
Based on some other questions and answers on Stack Overflow circa 2010, it appears the built-in was not available in GCC 3 and early GCC 4. But I do not know where the cut-off point is.
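A small sketch of how the built-in could be applied (the function and array here are made up for illustration; the question's Ix/Iy/Iz accessors are not reproduced):

#include <cstdlib>

float sum_aligned(const float* p, int n) {
  // Promise GCC that p is 32-byte aligned; the optimizer may then emit
  // aligned vector loads when it vectorizes the loop.
  const float* ap = static_cast<const float*>(__builtin_assume_aligned(p, 32));
  float s = 0.0f;
  for (int i = 0; i < n; ++i)
    s += ap[i];
  return s;
}

int main() {
  const int n = 1024;
  // aligned_alloc guarantees the requested alignment (C++17).
  float* p = static_cast<float*>(std::aligned_alloc(32, n * sizeof(float)));
  for (int i = 0; i < n; ++i) p[i] = 1.0f;
  float r = sum_aligned(p, n);
  std::free(p);
  return r == n ? 0 : 1;
}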