Nested classes with pointers with OpenACC - GPU

I have a rather large C++ code base, and I had to integrate a new class into the base class as shown below.
class A
{
int N;
B b;
double *__restrict__ w;
construct();
}
A::construct()
{
w=new double[N];
#pragma acc data enter create(this)
#pragma acc update device(this)
#pragma acc data enter create(w)
// allocate class A
b.construct()
}
class B
{
double *__restrict__ u;
double *__restrict__ v;
B(){};
construct();
}
B::construct()
{
u=new double[N];
v=new double[N];
#pragma acc data enter create(this)
#pragma acc update device(this)
#pragma acc data enter create(u)
#pragma acc data enter create(v)
}
I think I am running into the deep copy issue: the pointers of class B are invalidated, and hence the behavior of the code on the GPU is undefined.
I would appreciate feedback on how to include one class in another without running into the deep copy issue. I suspect the update device(this) somehow causes this.

Do you have a full example that reproduces the error you're seeing? I wrote a small test example using your code snippet and it worked fine (see below).
If you were updating the "this" pointer after creating the arrays, that would be a problem, since you'd be overwriting the device pointers with the host pointers. But in the order you show above, it shouldn't be an issue.
% cat test.cpp
#include <iostream>
class B
{
public:
int N;
double *__restrict__ u;
double *__restrict__ v;
void construct(int);
};
void B::construct(int _N)
{
N=_N;
u=new double[N];
v=new double[N];
#pragma acc enter data create(this)
#pragma acc update device(this)
#pragma acc enter data create(u[:N])
#pragma acc enter data create(v[:N])
}
class A
{
public:
int N;
B b;
double *__restrict__ w;
void construct(int);
};
void A::construct(int _N)
{
N=_N;
w=new double[N];
#pragma acc enter data create(this)
#pragma acc update device(this)
#pragma acc enter data create(w[:N])
// allocate class A
b.construct(N);
}
int main() {
A myA;
int N=32;
myA.construct(N);
#pragma acc parallel loop present(myA)
for (int i=0; i<N; ++i) {
myA.w[i] = i;
myA.b.u[i] = i;
myA.b.v[i] = i;
}
#pragma acc update host( myA.w[:N], myA.b.u[:N], myA.b.v[:N])
for (int i=0; i<N; ++i) {
std::cout << myA.w[i] << ":" << myA.b.u[i] << ":" << myA.b.v[i] << std::endl;
}
return 0;
}
% pgc++ test.cpp -Minfo=accel -V18.10 -ta=tesla; a.out
main:
49, Generating present(myA)
Accelerator kernel generated
Generating Tesla code
52, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
56, Generating update self(myA.b.u[:N],myA.w[:N],myA.b.v[:N])
B::construct(int):
21, Generating update device(this[:1])
Generating enter data create(this[:1],v[:N],u[:N])
A::construct(int):
41, Generating update device(this[:1])
Generating enter data create(w[:N],this[:1])
0:0:0
1:1:1
2:2:2
3:3:3
4:4:4
5:5:5
6:6:6
7:7:7
8:8:8
9:9:9
10:10:10
11:11:11
12:12:12
13:13:13
14:14:14
15:15:15
16:16:16
17:17:17
18:18:18
19:19:19
20:20:20
21:21:21
22:22:22
23:23:23
24:24:24
25:25:25
26:26:26
27:27:27
28:28:28
29:29:29
30:30:30
31:31:31
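The example above does not show a teardown. A minimal sketch of one, assuming a hypothetical destroy() method that is not in the original code, would release the device data in the reverse order of construct(): the arrays first, then the object itself. Built without OpenACC support, the pragmas are ignored and the code simply runs on the host.

```cpp
#include <cassert>

// Stand-in for class B from the example above, with an added destroy()
// method (hypothetical, not in the original) that undoes construct().
class B {
public:
    int N = 0;
    double *u = nullptr;
    double *v = nullptr;

    void construct(int n) {
        assert(n > 0);
        N = n;
        u = new double[N];
        v = new double[N];
        // Same order as the example above: the object first, then its
        // arrays, so the update of "this" cannot overwrite device pointers.
        #pragma acc enter data create(this)
        #pragma acc update device(this)
        #pragma acc enter data create(u[:N], v[:N])
    }

    void destroy() {
        // Reverse order: device arrays first, the device object last.
        #pragma acc exit data delete(u[:N], v[:N])
        #pragma acc exit data delete(this)
        delete[] u;
        delete[] v;
        u = nullptr;
        v = nullptr;
        N = 0;
    }
};
```

Class A would follow the same pattern: call b.destroy() and delete w[:N] before deleting its own "this".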


How to measure GPU execution time using profiling + OpenCL + SYCL + DPC++

I read this link
https://github.com/intel/pti-gpu
and tried to use Device Activity Tracing for OpenCL(TM), but I am confused: I do not know how to measure time on the accelerators using the Device Activity Tracing documentation.
To measure CPU performance I used chrono, but I am interested in using profiling to measure the performance of both CPU and GPU on different devices.
My program:
#include <CL/sycl.hpp>
#include <iostream>
#include <tbb/tbb.h>
#include <tbb/parallel_for.h>
#include <vector>
#include <string>
#include <queue>
#include<tbb/blocked_range.h>
#include <tbb/global_control.h>
#include <chrono>
using namespace tbb;
template<class Tin, class Tout, class Function>
class Map {
private:
Function fun;
public:
Map() {}
Map(Function f):fun(f) {}
std::vector<Tout> operator()(bool use_tbb, std::vector<Tin>& v) {
std::vector<Tout> r(v.size());
if(use_tbb){
// Start measuring time
auto begin = std::chrono::high_resolution_clock::now();
tbb::parallel_for(tbb::blocked_range<size_t>(0, v.size()),
[&](const tbb::blocked_range<size_t>& t) {
for (size_t index = t.begin(); index < t.end(); ++index){
r[index] = fun(v[index]);
}
});
// Stop measuring time and calculate the elapsed time
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin);
printf("Time measured: %.3f seconds.\n", elapsed.count() * 1e-9);
return r;
} else {
sycl::queue gpuQueue{sycl::gpu_selector()};
sycl::range<1> n_item{v.size()};
sycl::buffer<Tin, 1> in_buffer(&v[0], n_item);
sycl::buffer<Tout, 1> out_buffer(&r[0], n_item);
gpuQueue.submit([&](sycl::handler& h){
//local copy of fun
auto f = fun;
sycl::accessor in_accessor(in_buffer, h, sycl::read_only);
sycl::accessor out_accessor(out_buffer, h, sycl::write_only);
h.parallel_for(n_item, [=](sycl::id<1> index) {
out_accessor[index] = f(in_accessor[index]);
});
}).wait();
}
return r;
}
};
template<class Tin, class Tout, class Function>
Map<Tin, Tout, Function> make_map(Function f) { return Map<Tin, Tout, Function>(f);}
typedef int(*func)(int x);
//define different functions
auto function = [](int x){ return x; };
auto functionTimesTwo = [](int x){ return (x*2); };
auto functionDivideByTwo = [](int x){ return (x/2); };
auto lambdaFunction = [](int x){return (++x);};
int main(int argc, char *argv[]) {
std::vector<int> v = {1,2,3,4,5,6,7,8,9};
//auto f = [](int x){return (++x);};
//Array of functions
func functions[] =
{
function,
functionTimesTwo,
functionDivideByTwo,
lambdaFunction
};
for(size_t i = 0; i < sizeof(functions)/sizeof(functions[0]); i++){ // element count, not byte count
auto m1 = make_map<int, int>(functions[i]);
//auto m1 = make_map<int, int>(f);
std::vector<int> r = m1(true, v);
//print the result
for(auto &e:r) {
std::cout << e << " ";
}
}
return 0;
}
First of all, SYCL kernels do not support function pointers, so you will need to change the code accordingly.
One way to profile on the GPU is to follow the steps below:
1. Enable profiling mode for the command queue of the target device.
2. Introduce an event for the target device activity.
3. Set a callback to be notified when the activity is completed.
4. Read the profiling data inside the callback.
Basically, you need to read CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END (the times at which the command identified by the event starts and ends execution on the device) inside the callback.
You can find the detailed steps here
https://github.com/intel/pti-gpu/blob/master/chapters/device_activity_tracing/OpenCL.md
I would also advise you to check the pti-gpu samples that use Device Activity Tracing:
https://github.com/intel/pti-gpu/tree/master/samples

I am creating a dynamic array class in C++ using malloc and realloc and getting an assertion failure (heap corruption detected)

I am creating a dynamic array class in C++ using malloc and realloc, but I am getting an assertion failure (heap corruption detected).
The error occurs when the destructor is called, after calling setIndexElement.
I am using free, but I do not understand where the code went wrong.
class DynamicArray1
{
public:
int *ptr;
int len;
DynamicArray1()
{
ptr = (int*)malloc(2 * sizeof(int));
}
void setIndexElement(int index,int val)
{
int newsize = size();
if (index > newsize)
{
ptr = (int *)realloc(ptr, sizeof(int)*index);
}
ptr[index] = val;
len = index;
}
~DynamicArray1()
{
free(ptr);
}
};
int main()
{
DynamicArray1 d;
d.setIndexElement(0, 10);
d.setIndexElement(1, 11);
d.setIndexElement(2, 4);
int x = d.newSize();
for(int i=0;i<=x;i++)
{
cout << d.getIndexval(i) << endl;
}
}
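One likely cause, judging only from the code shown: realloc(ptr, sizeof(int)*index) leaves room for index elements, but ptr[index] then writes element index + 1, which is one past the end of the block. The call d.setIndexElement(2, 4) writes outside the buffer whether or not the realloc branch runs, and free later detects the damaged heap. In addition, len is never initialized. Below is a minimal sketch of a corrected class. The names DynamicArray2, size(), get() and cap are assumptions, since the original does not show its size(), newSize() or getIndexval() helpers.

```cpp
#include <cassert>
#include <cstdlib>
#include <new>

// Hypothetical corrected variant of DynamicArray1.
class DynamicArray2 {
public:
    DynamicArray2() : ptr(nullptr), len(0), cap(2) {
        ptr = static_cast<int*>(std::malloc(cap * sizeof(int)));
        if (!ptr) throw std::bad_alloc();
    }
    // Copying would make two objects free the same block, so forbid it.
    DynamicArray2(const DynamicArray2&) = delete;
    DynamicArray2& operator=(const DynamicArray2&) = delete;
    ~DynamicArray2() { std::free(ptr); }

    int size() const { return len; }

    void setIndexElement(int index, int val) {
        assert(index >= 0);
        if (index >= cap) {
            // Writing ptr[index] needs room for at least index + 1 elements.
            int newCap = cap;
            while (newCap <= index) newCap *= 2;
            int* p = static_cast<int*>(std::realloc(ptr, newCap * sizeof(int)));
            if (!p) throw std::bad_alloc();  // the old block is still owned by ptr
            ptr = p;
            cap = newCap;
        }
        // Zero-fill any gap between the old end and the new element.
        for (int i = len; i < index; ++i) ptr[i] = 0;
        ptr[index] = val;
        if (index >= len) len = index + 1;
    }

    int get(int index) const {
        assert(index >= 0 && index < len);
        return ptr[index];
    }

private:
    int* ptr;
    int len;  // number of valid elements
    int cap;  // number of allocated elements
};
```

With this version, a loop over the elements would run from 0 to size() - 1 rather than up to and including the size.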

Is MOCK_METHOD* not enough?

Inspired by that Google doc "ForDummies" :), I am trying a simple Google Mock with Google Test example as follows:
#include <gtest/gtest.h>
#include <gmock/gmock.h>
using ::testing::_;
class A
{
public:
A()
{
std::cout << "A()" << std::endl;
}
virtual ~A()
{
std::cout << "~A()" << std::endl;
}
virtual int incVirtual(int i)
{
return i + 1;
}
};
class MockA: public A
{
public:
MOCK_METHOD1(incVirtual, int(int));
};
TEST(Test, IncTest) {
MockA a;
EXPECT_CALL(a, incVirtual(_));
printf("n == %d\n", a.incVirtual(0));
}
int main(int argc, char **argv)
{
::testing::InitGoogleMock(&argc, argv);
return RUN_ALL_TESTS();
}
When I run it, I get n == 0:
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from Test
[ RUN ] Test.IncTest
A()
n == 0
~A()
whereas I expect it to be n == 1. So I wonder if just defining MOCK_METHODx in the mock class is not enough for mocking the base class method and something additional needs to be done to make MockA::incVirtual call A::incVirtual?
This is behaving the way it should. By mocking your function, the MockA class has essentially overridden the A class implementation with a "do nothing" implementation.
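If you want MockA::incVirtual() to run the real A::incVirtual(), delegate with a qualified call. Note that Invoke(&a, &A::incVirtual) would not do this: a pointer to a virtual member function still dispatches virtually, so it would re-enter the mock. Something like the following should work (it also needs using ::testing::Invoke;):
EXPECT_CALL(a, incVirtual(_)).WillOnce(Invoke([&a](int i) { return a.A::incVirtual(i); }));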
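The "do nothing" behaviour can be mimicked without gMock. The sketch below uses hand-written stand-ins (HandMockA, DelegatingMockA and demo() are hypothetical names, not gMock API) to show why the question sees n == 0 and what delegating to the base class changes.

```cpp
#include <cassert>

// Same shape as class A in the question (constructor output omitted).
class A {
public:
    virtual ~A() = default;
    virtual int incVirtual(int i) { return i + 1; }
};

// Stand-in for a mocked method with no action set: the override replaces
// A's body and returns a default-constructed int, which is 0.
class HandMockA : public A {
public:
    int incVirtual(int) override { return int(); }
};

// Stand-in for delegating to the real implementation: the qualified call
// A::incVirtual(i) bypasses virtual dispatch and runs the base-class body.
class DelegatingMockA : public A {
public:
    int incVirtual(int i) override { return A::incVirtual(i); }
};

// Exercises both stand-ins through a base-class reference.
inline void demo() {
    HandMockA mock;
    DelegatingMockA delegating;
    A& a1 = mock;
    A& a2 = delegating;
    assert(a1.incVirtual(0) == 0);  // matches the n == 0 seen in the question
    assert(a2.incVirtual(0) == 1);  // the result the asker expected
}
```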

OpenKinect acquire raw depth image

I am trying to use the example code from here.
I have made some changes in order to save the images to the computer. When I read the data in MATLAB it seems like values that should be 0 are set to 2047, and overall it does not seem to be correct when I reconstruct the 3D points using the default intrinsic camera parameters.
What I want to achieve is to save the images so that I can use
img = single(imread('depth.png')) / 1000
and have the depth values in meters, and pixels with no measurements should be zero.
It is the Kinect V1 by the way.
Here is the code, with comments where I have tried to change things.
#include "libfreenect.hpp"
#include <iostream>
#include <vector>
#include <cmath>
#include <pthread.h>
#include <cv.h>
#include <cxcore.h>
#include <highgui.h>
using namespace cv;
using namespace std;
class myMutex {
public:
myMutex() {
pthread_mutex_init( &m_mutex, NULL );
}
void lock() {
pthread_mutex_lock( &m_mutex );
}
void unlock() {
pthread_mutex_unlock( &m_mutex );
}
private:
pthread_mutex_t m_mutex;
};
// Should one use FREENECT_DEPTH_REGISTERED instead of FREENECT_DEPTH_11BIT?
class MyFreenectDevice : public Freenect::FreenectDevice {
public:
MyFreenectDevice(freenect_context *_ctx, int _index)
: Freenect::FreenectDevice(_ctx, _index), m_buffer_depth(FREENECT_DEPTH_11BIT),
m_buffer_rgb(FREENECT_VIDEO_RGB), m_gamma(2048), m_new_rgb_frame(false),
m_new_depth_frame(false), depthMat(Size(640,480),CV_16UC1),
rgbMat(Size(640,480), CV_8UC3, Scalar(0)),
ownMat(Size(640,480),CV_8UC3,Scalar(0)) {
for( unsigned int i = 0 ; i < 2048 ; i++) {
float v = i/2048.0;
v = std::pow(v, 3)* 6;
m_gamma[i] = v*6*256;
}
}
// Do not call directly even in child
void VideoCallback(void* _rgb, uint32_t timestamp) {
std::cout << "RGB callback" << std::endl;
m_rgb_mutex.lock();
uint8_t* rgb = static_cast<uint8_t*>(_rgb);
rgbMat.data = rgb;
m_new_rgb_frame = true;
m_rgb_mutex.unlock();
};
// Do not call directly even in child
void DepthCallback(void* _depth, uint32_t timestamp) {
std::cout << "Depth callback" << std::endl;
m_depth_mutex.lock();
uint16_t* depth = static_cast<uint16_t*>(_depth);
// Here I use memcpy instead so I can use uint16
// memcpy(depthMat.data,depth,depthMat.rows*depthMat.cols*sizeof(uint16_t));
depthMat.data = (uchar*) depth;
m_new_depth_frame = true;
m_depth_mutex.unlock();
}
bool getVideo(Mat& output) {
m_rgb_mutex.lock();
if(m_new_rgb_frame) {
cv::cvtColor(rgbMat, output, CV_RGB2BGR);
m_new_rgb_frame = false;
m_rgb_mutex.unlock();
return true;
} else {
m_rgb_mutex.unlock();
return false;
}
}
bool getDepth(Mat& output) {
m_depth_mutex.lock();
if(m_new_depth_frame) {
depthMat.copyTo(output);
m_new_depth_frame = false;
m_depth_mutex.unlock();
return true;
} else {
m_depth_mutex.unlock();
return false;
}
}
private:
// Should it be uint16_t instead or even higher?
std::vector<uint8_t> m_buffer_depth;
std::vector<uint8_t> m_buffer_rgb;
std::vector<uint16_t> m_gamma;
Mat depthMat;
Mat rgbMat;
Mat ownMat;
myMutex m_rgb_mutex;
myMutex m_depth_mutex;
bool m_new_rgb_frame;
bool m_new_depth_frame;
};
int main(int argc, char **argv) {
bool die(false);
string filename("snapshot");
string suffix(".png");
int i_snap(0),iter(0);
Mat depthMat(Size(640,480),CV_16UC1);
Mat depthf (Size(640,480),CV_8UC1);
Mat rgbMat(Size(640,480),CV_8UC3,Scalar(0));
Mat ownMat(Size(640,480),CV_8UC3,Scalar(0));
// The next two lines must be changed as Freenect::Freenect
// isn't a template but the method createDevice:
// Freenect::Freenect<MyFreenectDevice> freenect;
// MyFreenectDevice& device = freenect.createDevice(0);
// by these two lines:
Freenect::Freenect freenect;
MyFreenectDevice& device = freenect.createDevice<MyFreenectDevice>(0);
namedWindow("rgb",CV_WINDOW_AUTOSIZE);
namedWindow("depth",CV_WINDOW_AUTOSIZE);
device.startVideo();
device.startDepth();
while (!die) {
device.getVideo(rgbMat);
device.getDepth(depthMat);
// Here I save the depth images
std::ostringstream file;
file << filename << i_snap << suffix;
cv::imwrite(file.str(),depthMat);
cv::imshow("rgb", rgbMat);
depthMat.convertTo(depthf, CV_8UC1, 255.0/2048.0);
cv::imshow("depth",depthf);
if(iter >= 1000) break;
iter++;
}
device.stopVideo();
device.stopDepth();
return 0;
}
Thanks in advance!
Erik
I don't have any experience with OpenKinect in particular, but should your depth buffer be uint16?
std::vector<uint8_t> m_buffer_depth;
Also, for MATLAB, do check whether the image you are reading is uint16 or uint8. If it's the latter, convert it to uint16:
uint16(imread('depth.png'));
Sorry couldn't help more. Hope this helps.
The values you have are the raw depth values. You need to remap them to millimeters for the numbers to make sense. Kinect 1 can see up to 10 meters, so I would go with raw_values/2047*10000.
If the values are saturated at 2047, you are probably using the FREENECT_DEPTH_11BIT_PACKED depth format.
For work in Matlab, it is always easier to use FREENECT_DEPTH_MM or FREENECT_DEPTH_REGISTERED.
Enjoy.
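As a rough illustration of the remapping, here is a minimal sketch that takes the linear scaling suggested above at face value, assuming 2047 is the 11-bit maximum and also the "no measurement" marker. The helper name rawToMillimetres and the constant kNoReading are hypothetical. The linear mapping is only the approximation proposed in the answer, not a calibrated depth model, so FREENECT_DEPTH_MM or FREENECT_DEPTH_REGISTERED remains the simpler route.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// 2047 is the saturated 11-bit value the question reports for pixels that
// should be empty; it is treated here as "no measurement" (assumption).
constexpr std::uint16_t kNoReading = 2047;

// Hypothetical helper: maps raw 11-bit values to millimetres with the
// linear 0..2047 -> 0..10000 mm scaling, and zeroes unmeasured pixels.
std::vector<std::uint16_t> rawToMillimetres(const std::vector<std::uint16_t>& raw) {
    std::vector<std::uint16_t> mm(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        assert(raw[i] <= kNoReading);  // input must be 11-bit raw data
        if (raw[i] == kNoReading) {
            mm[i] = 0;  // no measurement becomes 0, as the question asks
        } else {
            // Widen to 32 bits so raw * 10000 cannot overflow.
            mm[i] = static_cast<std::uint16_t>(
                static_cast<std::uint32_t>(raw[i]) * 10000u / 2047u);
        }
    }
    return mm;
}
```

Saved as a 16-bit PNG, the result could then be read in MATLAB and divided by 1000 to get meters, as in the question.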

How to print using g++ with printf correctly?

I am compiling this code with g++:
#include <pthread.h>
#include <iostream>
#include <cstdlib>
#include <string>
#include <stdio.h>
using namespace std;
#define num_threads 3
#define car_limit 4
pthread_mutex_t mutex; // mutex lock
pthread_t cid; // thread id
pthread_attr_t attr; // thread attributes
void *OneCar(void *dir);
void ArriveBridge(int *direction);
void CrossBridge();
void ExitBridge(int *direction);
int main()
{
int dir[3] = {0,1,1};
pthread_mutex_init(&mutex, NULL);
pthread_attr_init(&attr);
//cout<< "Pthread Create" << endl;
printf("Pthread Create\n");
for(int i = 0; i < num_threads; i++)
{
pthread_create(&cid, &attr, OneCar, (void *)&dir[i]);
}
return 0;
}
void ArriveBridge(int *direction)
{
//cout<<"Arrive"<<*direction << endl;
int dr;
if(*direction == 0)
dr=0;
else
dr=1;
printf("Arrive%d", dr);
}
void CrossBridge(int *dir)
{
char d;
if(*dir == 0)
d = 'N';
else
d = 'S';
//cout<<"Crossing Bridge going:"<<d<<endl;
printf("Crossing Bridge going %c", d);
}
void ExitBridge(int *direction)
{
//cout<<"Exit" <<*direction<<endl;
int dr;
if(*direction == 0)
dr=0;
else
dr=1;
printf("Exit%d\n", dr);
}
void *OneCar(void *dir)
{
int *cardir;
cardir = (int *) dir;
//cout<<*cardir;
ArriveBridge(cardir);
CrossBridge(cardir);
ExitBridge(cardir);
return 0;
}
and I am expecting this result printed to the screen:
> Pthread Create
> Arrive0Crossing Bridge going NExit0
> Arrive1Crossing Bridge going SExit1
> Arrive1Crossing Bridge going NExit1
But I get this instead:
Pthread Create
Arrive0Crossing Bridge going NExit0
Why doesn't it print the rest?
You need to use "pthread_join" in main to wait for all threads to exit before your program terminates. You should also use an array to hold the id of each thread that you create:
pthread_t cid[num_threads]; // thread ids
You'll then want to call join on every thread you create:
for(int i = 0; i < num_threads; i++)
{
pthread_create(&cid[i], &attr, OneCar, (void *)&dir[i]);
}
for(int i = 0; i < num_threads; ++i)
{
pthread_join(cid[i], NULL);
}
Running the modified code now gives:
Pthread Create
Arrive0Crossing Bridge going NExit0
Arrive1Crossing Bridge going SExit1
Arrive1Crossing Bridge going SExit1
Have you tried joining your threads at the end of main? It could be that the program is terminating before the other threads are completely finished.
You missed the newlines ("\n"):
printf("Arrive%d\n", dr);
printf("Crossing Bridge going %c\n", d);
Because of that, the streams are probably not flushed. Additionally, if you don't wait for your threads (pthread_join) your program will exit before the threads could do their work.
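Putting both suggestions together, a minimal sketch might look like the following. The CarArg struct, its done flag and the runCars() helper are hypothetical additions so that the caller can confirm each thread actually ran. Each car prints one newline-terminated line, and every thread is joined before the function returns.

```cpp
#include <cassert>
#include <cstdio>
#include <pthread.h>

constexpr int kNumThreads = 3;

// Per-thread argument; "done" is a hypothetical field added so the caller
// can check that each thread really ran.
struct CarArg {
    int direction;
    bool done;
};

void* oneCar(void* p) {
    CarArg* arg = static_cast<CarArg*>(p);
    char d = (arg->direction == 0) ? 'N' : 'S';
    // A single printf call per car, ending in a newline.
    std::printf("Arrive%d Crossing Bridge going %c Exit%d\n",
                arg->direction, d, arg->direction);
    arg->done = true;
    return nullptr;
}

// Starts every car thread, then joins each one; returns how many finished.
int runCars(CarArg (&args)[kNumThreads]) {
    pthread_t tid[kNumThreads];  // one id per thread, as suggested above
    for (int i = 0; i < kNumThreads; ++i) {
        int rc = pthread_create(&tid[i], nullptr, oneCar, &args[i]);
        assert(rc == 0);
        (void)rc;  // silences the unused warning when NDEBUG is defined
    }
    int finished = 0;
    for (int i = 0; i < kNumThreads; ++i) {
        pthread_join(tid[i], nullptr);  // blocks until thread i has exited
        if (args[i].done) ++finished;
    }
    return finished;
}
```

The order of the three lines is not guaranteed, since the threads run concurrently. Only their completion before runCars() returns is.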