Can I use `omp_get_thread_num()` on the GPU? - g++

I have OpenMP code which works on the CPU by having each thread manage memory addressed by the thread's id number, accessible via omp_get_thread_num(). This works well on the CPU, but can it work on the GPU?
A MWE is:
#include <iostream>
#include <omp.h>

int main(){
    const int SIZE = 400000;
    int *m;
    m = new int[SIZE];
    #pragma omp target
    {
        #pragma omp parallel for
        for(int i = 0; i < SIZE; i++)
            m[i] = omp_get_thread_num();
    }
    for(int i = 0; i < SIZE; i++)
        std::cout << m[i] << "\n";
}

It works fine on the GPU for me with GCC. You need to map m though, e.g. like this:
#pragma omp target map(tofrom:m[0:SIZE])
I compiled like this
g++ -O3 -Wall -fopenmp -fno-stack-protector so.cpp
You can see an example for a system without offloading here:
http://coliru.stacked-crooked.com/a/1e756410d6e2db61
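For completeness, a minimal sketch of the MWE above with the map clause applied (the only changes are the map clause and freeing the buffer at the end):

#include <iostream>
#include <omp.h>

int main(){
    const int SIZE = 400000;
    int *m = new int[SIZE];
    // map the heap buffer to the device and copy the results back
    #pragma omp target map(tofrom:m[0:SIZE])
    {
        #pragma omp parallel for
        for(int i = 0; i < SIZE; i++)
            m[i] = omp_get_thread_num();
    }
    for(int i = 0; i < SIZE; i++)
        std::cout << m[i] << "\n";
    delete[] m;
}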
A method I use to find out the number of teams and threads before doing work is this:
#pragma omp target teams defaultmap(tofrom:scalar)
{
    nteams = omp_get_num_teams();
    #pragma omp parallel
    #pragma omp single
    nthreads = omp_get_num_threads();
}
On my system with GCC 7.2, Ubuntu 17.10, and gcc-offload-nvptx with a GTX 1060, I get nteams = 30 and nthreads = 8. See this answer, where I do a custom reduction for a target region using threads and teams. With -foffload=disable I get nteams = 1 and nthreads = 8 (a 4-core/8-hardware-thread CPU).
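For reference, a self-contained version of that teams/threads query (a sketch; the variable declarations and the printout are my additions, everything else is the snippet above):

#include <iostream>
#include <omp.h>

int main(){
    int nteams = 0, nthreads = 0;
    // defaultmap(tofrom:scalar) copies the scalars back to the host
    #pragma omp target teams defaultmap(tofrom:scalar)
    {
        nteams = omp_get_num_teams();
        #pragma omp parallel
        #pragma omp single
        nthreads = omp_get_num_threads();
    }
    std::cout << "nteams = " << nteams << ", nthreads = " << nthreads << "\n";
}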
I added -fopt-info to the compile options and I get only the message
note: basic block vectorized

The answer seems to be no.
Compiling with PGI using:
pgc++ -fast -mp -ta=tesla,pinned,cc60 -Minfo=all test2.cpp
gives:
13, Parallel region activated
Parallel loop activated with static block schedule
Loop not vectorized/parallelized: contains call
14, Parallel region terminated
whereas compiling with GCC using
g++ -O3 test2.cpp -fopenmp -fopt-info
gives
test2.cpp:17: note: not vectorized: loop contains function calls or data references that cannot be analyzed
test2.cpp:17: note: bad data references.

Related

Is TensorRT "floating-point 16" precision mode non-deterministic on Jetson TX2?

I'm using TensorRT FP16 precision mode to optimize my deep learning model, and I run this optimized model on a Jetson TX2. While testing the model, I have observed that the TensorRT inference engine is not deterministic. In other words, my optimized model gives different FPS values, between 40 and 120 FPS, for the same input images.
I started to think that the source of the non-determinism is floating-point operations after seeing this comment about CUDA:
"If your code uses floating-point atomics, results may differ from run
to run because floating-point operations are generally not
associative, and the order in which data enters a computation (e.g. a
sum) is non-deterministic when atomics are used."
Does the type of precision (FP16, FP32, INT8) affect the determinism of TensorRT? Or is it something else?
Do you have any thoughts?
Best regards.
I solved the problem by changing the clock() function that I was using to measure latencies. clock() measures CPU time, but what I want is real (wall-clock) time. Now I am using std::chrono to measure the latencies, and the inference results are latency-deterministic.
This was the wrong way (clock()):
int main ()
{
    clock_t t;
    t = clock();
    inferenceEngine(); // Do the inference
    t = clock() - t;
    printf("It took me %ld clicks (%f seconds).\n", (long)t, ((float)t)/CLOCKS_PER_SEC);
    return 0;
}
Use CUDA events like this (cudaEvent):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
inferenceEngine(); // Do the inference
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
Or use std::chrono like this:
#include <iostream>
#include <chrono>
#include <ctime>
int main()
{
    auto start = std::chrono::system_clock::now();
    inferenceEngine(); // Do the inference
    auto end = std::chrono::system_clock::now();

    std::chrono::duration<double> elapsed_seconds = end - start;
    std::time_t end_time = std::chrono::system_clock::to_time_t(end);

    std::cout << "finished computation at " << std::ctime(&end_time)
              << "elapsed time: " << elapsed_seconds.count() << "s\n";
}
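A small aside (not from the original answer): since system_clock can jump if the system time is adjusted, a monotonic clock such as std::chrono::steady_clock is usually the safer choice for latency measurements. A sketch, with inferenceEngine() being the same placeholder as above:

#include <chrono>
#include <iostream>

void inferenceEngine(); // placeholder, defined elsewhere

int main()
{
    auto start = std::chrono::steady_clock::now();
    inferenceEngine(); // Do the inference
    auto end = std::chrono::steady_clock::now();
    // elapsed time in milliseconds as a double
    std::chrono::duration<double, std::milli> elapsed = end - start;
    std::cout << "elapsed time: " << elapsed.count() << " ms\n";
}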

PGI 18.1 vs PGI 18.4

Is there any change from PGI version 18.1 to 18.4 regarding
#pragma acc routine seq? The code I have works fine with version 18.1 but gives an error when I use the newer version. I generate kernels that use the math library.
using namespace std;

#pragma acc routine
double myfunc(double x)
{
    return(fabs(x));
}
The default parallelism for the routine directive is (or was) sequential,
i.e. #pragma acc routine is equivalent to #pragma acc routine seq.
This works fine in version 18.1, but I think there might be some change in the new version, since when I compile with 18.4 it gives an error complaining about the math library function.
Oddly enough, the following also causes an error:
#include <cmath>
#include "openacc.h"
using namespace std;
#pragma acc routine seq
double sine( double x )
{
    return ( sin( x ) );
}
This gives a compilation error, but when I change the math header to math.h it compiles fine. Can anyone explain why it is not working with pgc++?
What's the actual error you get? I get the same error with both PGI 18.1 and 18.4:
% pgc++ -c test1.cpp -ta=tesla -Minfo=accel -w -V18.1
PGCC-S-1000-Call in OpenACC region to procedure 'sin' which has no acc routine information (test1.cpp: 10)
PGCC-S-0155-Compiler failed to translate accelerator region (see -Minfo messages) (test1.cpp: 10)
sine(double):
10, Generating acc routine seq
Generating Tesla code
11, Accelerator restriction: call to 'sin' with no acc routine information
The solution here is to include the PGI header "accelmath.h" to get the device version for the C99 math intrinsics.
% diff test1.cpp test2.cpp
4a5
> #include "accelmath.h"
% pgc++ -c test2.cpp -ta=tesla -Minfo=accel -w -V18.1
sine(double):
12, Generating acc routine seq
Generating Tesla code
% pgc++ -c test2.cpp -ta=tesla -Minfo=accel -w -V18.4
sine(double):
12, Generating acc routine seq
Generating Tesla code
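Putting that together, a sketch of what the fixed test2.cpp looks like (per the diff above, the only change is the extra include; the exact line placement in the real file may differ):

#include <cmath>
#include "openacc.h"
#include "accelmath.h"  // PGI header with device versions of the C99 math intrinsics

using namespace std;

#pragma acc routine seq
double sine( double x )
{
    return ( sin( x ) );
}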

MKL Sparse BLAS segfault when transposing CSR with 100M rows

I am trying to use MKL Sparse BLAS for CSR matrices with number of rows/columns on the order of 100M. My source code that seems to work fine for 10M rows/columns fails with segfault when I increase it to 100M.
I isolated the problem to the following code snippet:
void TestSegfault1() {
    float values[1] = { 1.0f };
    int col_indx[1] = { 0 };
    int rows_start[1] = { 0 };
    int rows_end[1] = { 1 };

    // Step 1. Create 1 x 100M matrix
    // with single non-zero value at (0,0)
    sparse_matrix_t A;
    mkl_sparse_s_create_csr(
        &A, SPARSE_INDEX_BASE_ZERO, 1, 100000000,
        rows_start, rows_end, col_indx, values);

    // Step 2. Transpose it to get 100M x 1 matrix
    sparse_matrix_t B;
    mkl_sparse_convert_csr(A, SPARSE_OPERATION_TRANSPOSE, &B);
}
This function segfaults in mkl_sparse_convert_csr with backtrace
#0 0x00000000004c0d03 in mkl_sparse_s_convert_csr_i4_avx ()
#1 0x0000000000434061 in TestSegfault1 ()
For slightly different code (but essentially the same) it has a little more detail:
#0 0x00000000008fc09b in mkl_serv_free ()
#1 0x000000000099949e in mkl_sparse_s_export_csr_data_i4_avx ()
#2 0x0000000000999ee4 in mkl_sparse_s_convert_csr_i4_avx ()
Apparently something goes bad in memory allocation, and from the outside it sure looks like some kind of integer overflow. The build of MKL I have uses MKL_INT = int = int32.
Is it indeed the case that the number of rows I can have in a Sparse BLAS CSR matrix is limited to less than 100M (it looks more like ~65M)? Or am I doing something wrong?
EDIT 1: MKL version string is "Intel(R) Math Kernel Library Version 11.3.1 Product Build 20151021 for Intel(R) 64 architecture applications".
EDIT 2: Figured it out. There is indeed a subtle kind of integer overflow when allocating memory for internal per-thread buffers. At some point inside mkl_sparse_s_export_csr_data_i4_avx it attempts to allocate (omp_get_max_threads() + 1) * num_rows * 4 bytes; that number doesn't fit in a 32-bit signed integer. The subsequent call to mkl_serv_malloc causes memory corruption and eventually a segfault. One possible workaround is to lower the number of OpenMP threads via an omp_set_num_threads call.
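To make the overflow concrete: with 8 OpenMP threads, (8 + 1) * 100,000,000 * 4 = 3.6e9 bytes, well past INT32_MAX (2,147,483,647). A sketch of the workaround described above, applied to the snippet from the question (it assumes the LP64 build with 32-bit MKL_INT as in the question; the choice of 1 thread is arbitrary, anything that keeps the product below INT32_MAX should do):

#include <omp.h>
#include <mkl_spblas.h>

void TestWorkaround() {
    float values[1] = { 1.0f };
    int col_indx[1] = { 0 };
    int rows_start[1] = { 0 };
    int rows_end[1] = { 1 };

    // Keep (num_threads + 1) * num_rows * 4 within a 32-bit signed int.
    omp_set_num_threads(1);

    sparse_matrix_t A;
    mkl_sparse_s_create_csr(
        &A, SPARSE_INDEX_BASE_ZERO, 1, 100000000,
        rows_start, rows_end, col_indx, values);

    sparse_matrix_t B;
    mkl_sparse_convert_csr(A, SPARSE_OPERATION_TRANSPOSE, &B);

    mkl_sparse_destroy(A);
    mkl_sparse_destroy(B);
}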
Could you check your example with the latest version of MKL? I ran it on MKL 11.3.2 and it passed correctly for the 100M matrix. However, it could depend on the number of threads on your machine (the matrix size multiplied by the number of threads has to be less than max int). To prevent such issues I strongly recommend using the ILP64 version of the MKL libraries.
Thanks,
Alex
Check how this example works with the latest MKL 2019 Update 4,
compiling the example in ILP64 mode as follows:
icc -I/opt/intel/compilers_and_libraries_2019/linux/mkl/include test_csr.cpp \
-L/opt/intel/compilers_and_libraries_2019/linux/mkl/lib/intel64 -lmkl_core -lmkl_intel_ilp64 -lmkl_intel_thread -liomp5 -lpthread -lm -ldl
./a.out
mkl_sparse_convert_csr passed

What is consuming power with PIC18?

I have a very simple program that prints something to the terminal and then goes directly to sleep.
For some reason the device is consuming more current during sleep mode than it should. It is drawing 0.24 mA, but I know it should be less than that. Without sleep it consumes 4.32 mA. I've run the most basic software I can and must be missing something.
What are some of the factors that affect power consumption? I really need to lower the power consumption, but I don't know what's causing it to be that high. Here is the datasheet for your convenience.
/*
    File:     main.c
    Date:     2011-SEP-4
    Target:   PIC18F87J11
    IDE:      MPLAB 8.76
    Compiler: C18 3.40
*/
#include <p18cxxx.h>
#include <usart.h>

#pragma config FOSC = HSPLL, WDTEN = OFF, WDTPS = 4096, XINST = OFF

#define FOSC (4000000UL)
#define FCYC (FOSC/4UL)
#define BAUD 9600UL
#define SPBRG_INIT (FOSC/(16UL*BAUD) - 1)

void main(void)
{
    /* set FOSC clock to 4MHZ */
    OSCCON = 0x70;
    /* turn off 4x PLL */
    OSCTUNE = 0x00;
    /* make all ADC inputs digital I/O */
    ANCON0 = 0xFF;
    ANCON1 = 0xFF;

    /* test the simulator UART interface */
    Open1USART(USART_TX_INT_OFF & USART_RX_INT_OFF & USART_ASYNCH_MODE &
               USART_EIGHT_BIT & USART_CONT_RX & USART_BRGH_HIGH, SPBRG_INIT);
    putrs1USART("PIC MICROCONTROLLERS\r\n");
    Close1USART();

    /* sleep forever */
    Sleep();
}
Thanks in advance!
Update 1: I noticed that adding the following code decreased the current to 0.04 mA:
TRISE = 0;
PORTE = 0x0C;
And if I change PORTE to the following, it increases to 0.16 mA:
PORTE = 0x00;
But I don't really understand what all of that means, or why the power consumption went down. I have to be missing something in the code, but I don't know what it is.
Update 2: This code gives me unstable current consumption, sometimes 2.7 mA and other times 0.01 mA. I suspect the problem is with WDTCONbits.REGSLP = 1;
Download Code
Current consumption nicely went down from 0.24 mA to 0.04 mA when the OP changed the settings on the port outputs.
This is expected: in typical designs the outputs control various circuitry. For example, an output may turn on an LED(1) by driving high, drawing an additional 0.20 mA. In another design, an output may turn on an LED by driving low. In a third design, not driving at all may turn one on.
The OP needs to consult the schematic or the designer to determine which settings result in low power. Further, certain combinations may or may not be allowed during low-power mode.
Lastly, the sequence of lowering power, disabling peripherals, etc. on the various design elements may be important. The sequence used to shut things down is usually reversed when bringing them back on-line.
@Chris Stratton has good ideas in the posted comment.
(1) A low-powered LED.

gcc memory alignment pragma

Does gcc have a memory alignment pragma, akin to #pragma vector aligned in the Intel compiler?
I would like to tell the compiler to optimize a particular loop using aligned load/store instructions. To avoid possible confusion, this is not about struct packing.
e.g:
#if defined (__INTEL_COMPILER)
#pragma vector aligned
#endif
for (int a = 0; a < int(N); ++a) {
    q10 += Ix(a,0,0)*Iy(a,1,1)*Iz(a,0,0);
    q11 += Ix(a,0,0)*Iy(a,0,1)*Iz(a,1,0);
    q12 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,0,1);
    q13 += Ix(a,1,0)*Iy(a,0,0)*Iz(a,0,1);
    q14 += Ix(a,0,0)*Iy(a,1,0)*Iz(a,0,1);
    q15 += Ix(a,0,0)*Iy(a,0,0)*Iz(a,1,1);
}
Thanks
You can tell GCC that a pointer points to aligned memory by using a typedef to create an over-aligned type that you can declare pointers to.
This helps gcc but not clang 7.0 or ICC 19; see the x86-64 non-AVX asm they emit on Godbolt. (Only GCC folds a load into a memory operand for mulps instead of using a separate movups.) You have to use __builtin_assume_aligned if you want to portably convey an alignment promise to GNU C compilers other than GCC itself.
From http://gcc.gnu.org/onlinedocs/gcc/Type-Attributes.html
typedef double aligned_double __attribute__((aligned (16)));
// Note: sizeof(aligned_double) is 8, not 16
void some_function(aligned_double *x, aligned_double *y, int n)
{
    for (int i = 0; i < n; ++i) {
        // math!
    }
}
This won't make aligned_double 16 bytes wide; it will just make it aligned to a 16-byte boundary, or rather the first element of an array will be. Looking at the disassembly on my computer, as soon as I use the alignment directive I start to see a LOT of vector ops. I am using a Power architecture machine at the moment, so it's AltiVec code, but I think this does what you want.
(Note: I wasn't using double when I tested this, because AltiVec doesn't support double floats.)
You can see some other examples of autovectorization using the type attributes here: http://gcc.gnu.org/projects/tree-ssa/vectorization.html
I tried your solution with g++ version 4.5.2 (both Ubuntu and Windows) and it did not vectorize the loop.
If the alignment attribute is removed then it vectorizes the loop, using unaligned loads.
If the function is inlined so that the array can be accessed directly with the pointer eliminated, then it is vectorized with aligned loads.
In both cases, the alignment attribute prevents vectorization. This is ironic: The "aligned_double *x" was supposed to enable vectorization but it does the opposite.
Which compiler was it that reported vectorized loops for you? I suspect it was not a gcc compiler?
Does gcc have a memory alignment pragma, akin to #pragma vector aligned?
It looks like newer versions of GCC have __builtin_assume_aligned:
Built-in Function: void * __builtin_assume_aligned (const void *exp, size_t align, ...)
This function returns its first argument, and allows the compiler to assume that the returned pointer is at least align bytes aligned.
This built-in can have either two or three arguments, if it has three,
the third argument should have integer type, and if it is nonzero
means misalignment offset. For example:
void *x = __builtin_assume_aligned (arg, 16);
means that the compiler can assume x, set to arg, is at least 16-byte aligned, while:
void *x = __builtin_assume_aligned (arg, 32, 8);
means that the compiler can assume for x, set to arg, that (char *) x - 8 is 32-byte aligned.
Based on some other questions and answers on Stack Overflow circa 2010, it appears the built-in was not available in GCC 3 and early GCC 4. But I do not know where the cut-off point is.
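As an example of using the built-in (a sketch; the function name and loop body are placeholders, not from the question):

#include <cstddef>

void scale_add(double *x, double *y, std::size_t n)
{
    // Promise GCC that both pointers are 16-byte aligned, so it may
    // emit aligned vector loads/stores for the loop below.
    double *xa = static_cast<double*>(__builtin_assume_aligned(x, 16));
    double *ya = static_cast<double*>(__builtin_assume_aligned(y, 16));

    for (std::size_t i = 0; i < n; ++i)
        ya[i] += 2.0 * xa[i];
}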