How to measure the execution time of GPU with using Profiling+openCL+Sycl+DPCPP - gpu

I read this link
https://github.com/intel/pti-gpu
and I tried to use Device Activity Tracing for OpenCL(TM), but I am confused and I do not know how should I measure the time on the accelerators with using Device Activity documentation.
for measuring the performance of CPU I used chrono, but I am interested in to using profiling for measuring the performance of CPU and GPU in different devices.
my program:
#include <CL/sycl.hpp>
#include <iostream>
#include <tbb/tbb.h>
#include <tbb/parallel_for.h>
#include <vector>
#include <string>
#include <queue>
#include<tbb/blocked_range.h>
#include <tbb/global_control.h>
#include <chrono>
using namespace tbb;
template<class Tin, class Tout, class Function>
class Map {
private:
Function fun;
public:
Map() {}
Map(Function f):fun(f) {}
std::vector<Tout> operator()(bool use_tbb, std::vector<Tin>& v) {
std::vector<Tout> r(v.size());
if(use_tbb){
// Start measuring time
auto begin = std::chrono::high_resolution_clock::now();
tbb::parallel_for(tbb::blocked_range<Tin>(0, v.size()),
[&](tbb::blocked_range<Tin> t) {
for (int index = t.begin(); index < t.end(); ++index){
r[index] = fun(v[index]);
}
});
// Stop measuring time and calculate the elapsed time
auto end = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin);
printf("Time measured: %.3f seconds.\n", elapsed.count() * 1e-9);
return r;
} else {
sycl::queue gpuQueue{sycl::gpu_selector()};
sycl::range<1> n_item{v.size()};
sycl::buffer<Tin, 1> in_buffer(&v[0], n_item);
sycl::buffer<Tout, 1> out_buffer(&r[0], n_item);
gpuQueue.submit([&](sycl::handler& h){
//local copy of fun
auto f = fun;
sycl::accessor in_accessor(in_buffer, h, sycl::read_only);
sycl::accessor out_accessor(out_buffer, h, sycl::write_only);
h.parallel_for(n_item, [=](sycl::id<1> index) {
out_accessor[index] = f(in_accessor[index]);
});
}).wait();
}
return r;
}
};
template<class Tin, class Tout, class Function>
Map<Tin, Tout, Function> make_map(Function f) { return Map<Tin, Tout, Function>(f);}
typedef int(*func)(int x);
//define different functions
auto function = [](int x){ return x; };
auto functionTimesTwo = [](int x){ return (x*2); };
auto functionDivideByTwo = [](int x){ return (x/2); };
auto lambdaFunction = [](int x){return (++x);};
int main(int argc, char *argv[]) {
std::vector<int> v = {1,2,3,4,5,6,7,8,9};
//auto f = [](int x){return (++x);};
//Array of functions
func functions[] =
{
function,
functionTimesTwo,
functionDivideByTwo,
lambdaFunction
};
for(int i = 0; i< sizeof(functions); i++){
auto m1 = make_map<int, int>(functions[i]);
//auto m1 = make_map<int, int>(f);
std::vector<int> r = m1(true, v);
//print the result
for(auto &e:r) {
std::cout << e << " ";
}
}
return 0;
}

First of all, SYCL Kernel wont support function pointers. So, you can change the code accordingly.
One way to achieve profiling in GPU is by following the steps below:
1.Enable profiling mode for the command queue of the target device
2.Introduce the event for the target device activity
3.Set the callback to be notified when the activity is completed
4.Read the profiling data inside the callback
Basically, you need to use CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END (command identified by event start and end execution on the device) inside the call back.
You can find the detailed steps here
https://github.com/intel/pti-gpu/blob/master/chapters/device_activity_tracing/OpenCL.md
I would also advice you to check the samples for pti-gpu using Device Activity Tracing. Check the URL for the same
https://github.com/intel/pti-gpu/tree/master/samples

Related

Simple parsing to a tuple fails

Why does this not compile? The marked line gives me: "static assertion failed: The parser expects tuple-like attribute type". I would think that an std::tuple were the essence of "tuple-like"?
#include <string>
#include <tuple>
#include <boost/spirit/home/x3.hpp>
void parseInteger(std::string input) {
namespace x3 = boost::spirit::x3;
auto iter = input.begin();
auto end_iter = input.end();
int result;
x3::parse(iter, end_iter, x3::int_, result);
}
void parseIntegerAndDouble(std::string input) {
namespace x3 = boost::spirit::x3;
auto iter = input.begin();
auto end_iter = input.end();
std::tuple<int, double> result;
x3::parse(iter, end_iter, x3::int_ >> ' ' >> x3::double_, result); //Compile error!
}
int main(int, char **)
{
parseInteger("567");
parseIntegerAndDouble("321 3.1412");
return 0;
}
The trick is to include two additional header files:
#include <boost/fusion/adapted/std_tuple.hpp>
#include <boost/fusion/include/std_tuple.hpp>
This is a bit more magical than I like, but I've got a feeling I really don't want to know how this works!

OpenKinect acquire raw depth image

I am trying to use the example code from here.
I have made some changes in order to save the images to the computer. When I read the data in MATLAB it seems like values that should be 0 are set to 2047, and overall it does not seem to be correct when I reconstruct the 3D points using the default intrinsic camera parameters.
What I want to achieve is to save the images so that I can use
img = single(imread(depth.png'))/ 1000
and have the depth values in meters, and pixels with no measurements should be zero.
It is the Kinect V1 by the way.
Here is the code with comments where I have tried to change.
#include "libfreenect.hpp"
#include <iostream>
#include <vector>
#include <cmath>
#include <pthread.h>
#include <cv.h>
#include <cxcore.h>
#include <highgui.h>
using namespace cv;
using namespace std;
class myMutex {
public:
myMutex() {
pthread_mutex_init( &m_mutex, NULL );
}
void lock() {
pthread_mutex_lock( &m_mutex );
}
void unlock() {
pthread_mutex_unlock( &m_mutex );
}
private:
pthread_mutex_t m_mutex;
};
// Should one use FREENECT_DEPTH_REGISTERED instead of FREENECT_DEPTH_11BIT?
class MyFreenectDevice : public Freenect::FreenectDevice {
public:
MyFreenectDevice(freenect_context *_ctx, int _index)
: Freenect::FreenectDevice(_ctx, _index), m_buffer_depth(FREENECT_DEPTH_11BIT),
m_buffer_rgb(FREENECT_VIDEO_RGB), m_gamma(2048), m_new_rgb_frame(false),
m_new_depth_frame(false), depthMat(Size(640,480),CV_16UC1),
rgbMat(Size(640,480), CV_8UC3, Scalar(0)),
ownMat(Size(640,480),CV_8UC3,Scalar(0)) {
for( unsigned int i = 0 ; i < 2048 ; i++) {
float v = i/2048.0;
v = std::pow(v, 3)* 6;
m_gamma[i] = v*6*256;
}
}
// Do not call directly even in child
void VideoCallback(void* _rgb, uint32_t timestamp) {
std::cout << "RGB callback" << std::endl;
m_rgb_mutex.lock();
uint8_t* rgb = static_cast<uint8_t*>(_rgb);
rgbMat.data = rgb;
m_new_rgb_frame = true;
m_rgb_mutex.unlock();
};
// Do not call directly even in child
void DepthCallback(void* _depth, uint32_t timestamp) {
std::cout << "Depth callback" << std::endl;
m_depth_mutex.lock();
uint16_t* depth = static_cast<uint16_t*>(_depth);
// Here I use memcpy instead so I can use uint16
// memcpy(depthMat.data,depth,depthMat.rows*depthMat.cols*sizeof(uint16_t));
depthMat.data = (uchar*) depth;
m_new_depth_frame = true;
m_depth_mutex.unlock();
}
bool getVideo(Mat& output) {
m_rgb_mutex.lock();
if(m_new_rgb_frame) {
cv::cvtColor(rgbMat, output, CV_RGB2BGR);
m_new_rgb_frame = false;
m_rgb_mutex.unlock();
return true;
} else {
m_rgb_mutex.unlock();
return false;
}
}
bool getDepth(Mat& output) {
m_depth_mutex.lock();
if(m_new_depth_frame) {
depthMat.copyTo(output);
m_new_depth_frame = false;
m_depth_mutex.unlock();
return true;
} else {
m_depth_mutex.unlock();
return false;
}
}
private:
// Should it be uint16_t instead or even higher?
std::vector<uint8_t> m_buffer_depth;
std::vector<uint8_t> m_buffer_rgb;
std::vector<uint16_t> m_gamma;
Mat depthMat;
Mat rgbMat;
Mat ownMat;
myMutex m_rgb_mutex;
myMutex m_depth_mutex;
bool m_new_rgb_frame;
bool m_new_depth_frame;
};
int main(int argc, char **argv) {
bool die(false);
string filename("snapshot");
string suffix(".png");
int i_snap(0),iter(0);
Mat depthMat(Size(640,480),CV_16UC1);
Mat depthf (Size(640,480),CV_8UC1);
Mat rgbMat(Size(640,480),CV_8UC3,Scalar(0));
Mat ownMat(Size(640,480),CV_8UC3,Scalar(0));
// The next two lines must be changed as Freenect::Freenect
// isn't a template but the method createDevice:
// Freenect::Freenect<MyFreenectDevice> freenect;
// MyFreenectDevice& device = freenect.createDevice(0);
// by these two lines:
Freenect::Freenect freenect;
MyFreenectDevice& device = freenect.createDevice<MyFreenectDevice>(0);
namedWindow("rgb",CV_WINDOW_AUTOSIZE);
namedWindow("depth",CV_WINDOW_AUTOSIZE);
device.startVideo();
device.startDepth();
while (!die) {
device.getVideo(rgbMat);
device.getDepth(depthMat);
// Here I save the depth images
std::ostringstream file;
file << filename << i_snap << suffix;
cv::imwrite(file.str(),depthMat);
cv::imshow("rgb", rgbMat);
depthMat.convertTo(depthf, CV_8UC1, 255.0/2048.0);
cv::imshow("depth",depthf);
if(iter >= 1000) break;
iter++;
}
device.stopVideo();
device.stopDepth();
return 0;
}
Thanks in advance!
Erik
I dont have any experience with OpenKinect in particular; but should your depth buffer be uint16?
std::vector<uint8_t> m_buffer_depth;
Also; for Matlab, do check if the image that you are reading is a uint16 or uint8. If its the latter then convert it to uint16
uint16(imread('depth.png'));
Sorry couldn't help more. Hope this helps.
The values you have are the raw depth values. You need to remap those into MM for the numbers to make sense. Kinect 1 can see up to 10 meters. So I would go with raw_values/2407*10000.
If the values are saturated at 2047, you are probably using the FREENECT_DEPTH_11BIT_PACKED depth format.
For work in Matlab, it is always easier to use FREENECT_DEPTH_MM or FREENECT_DEPTH_REGISTERED.
Enjoy.

Tagging (or coloring) CGAL objects

I am trying to use CGAL to perform some simple 2D CSG operations. Here is an example of an intersection of two polygons.
The actual problem is tracking down the origin (marked with color) of each segment in resulting polygon.
I would like to know if that is possible, maybe with some hacking on the CGAL itself. Any suggestion will be highly appreciated.
Unfortunately, there is no out-of-the-box way doing it. However, it doesn't require too much (famous last words...). You need to do two things described below. The first is supported by the API. The second is not, so you will need to patch a source file. A simple example is provided further bellow. Notice that the data you need, that is, the specification of the origin of each edge, ends up in a 2D arrangement data structure. If you want to obtain the polygons with this data, you need to extract them from the arrangement. You can obtain the header pgn_print.h, used in the example, from the 2D-Arrangement book.
Use an instance of CGAL::Polygon_set_2<Kernel, Container, Dcel>, where the Dcel is substituted with an extended Dcel, the halfedge of which is extended with a label that indicates the origin of the halfedge (i.e., first polygon, second polygon, or both in case of an overlap).
Patch the header file Boolean_set_operations_2/Gps_base_functor.h. In particular, add to the body of the three functions called create_edge() statements that set the label of the resulting halfedges according to their origin:
void create_edge(Halfedge_const_handle h1, Halfedge_const_handle h2,
Halfedge_handle h)
{
h->set_label(3);
h->twin()->set_label(3);
}
void create_edge(Halfedge_const_handle h1, Face_const_handle f2,
Halfedge_handle h)
{
h->set_label(1);
h->twin()->set_label(1);
}
void create_edge(Face_const_handle f1, Halfedge_const_handle h2,
Halfedge_handle h)
{
h->set_label(2);
h->twin()->set_label(2);
}
#include <list>
#include <vector>
#include <CGAL/Exact_predicates_exact_constructions_kernel.h>
#include <CGAL/Boolean_set_operations_2.h>
#include <CGAL/Polygon_set_2.h>
#include "pgn_print.h"
/*! Extend the arrangement halfedge */
template <typename X_monotone_curve_2>
class Arr_labeled_halfedge :
public CGAL::Arr_halfedge_base<X_monotone_curve_2>
{
private:
unsigned m_label;
public:
Arr_labeled_halfedge() : m_label(0) {}
unsigned label() const { return m_label; }
void set_label(unsigned label) { m_label = label; }
virtual void assign(const Arr_labeled_halfedge& he)
{
CGAL::Arr_halfedge_base<X_monotone_curve_2>::assign(he);
m_label = he.m_label;
}
};
template <typename Traits>
class Arr_labeled_dcel :
public CGAL::Arr_dcel_base<CGAL::Arr_vertex_base<typename Traits::Point_2>,
Arr_labeled_halfedge<typename Traits::
X_monotone_curve_2>,
CGAL::Gps_face_base>
{
public:
Arr_labeled_dcel() {}
};
typedef CGAL::Exact_predicates_exact_constructions_kernel Kernel;
typedef Kernel::Point_2 Point_2;
typedef CGAL::Polygon_2<Kernel> Polygon_2;
typedef CGAL::Polygon_with_holes_2<Kernel> Polygon_with_holes_2;
typedef std::vector<Point_2> Container;
typedef CGAL::Gps_segment_traits_2<Kernel, Container> Traits_2;
typedef Arr_labeled_dcel<Traits_2> Dcel;
typedef CGAL::Polygon_set_2<Kernel, Container, Dcel> Polygon_set_2;
typedef std::list<Polygon_with_holes_2> Pwh_list_2;
typedef Polygon_set_2::Arrangement_2 Arrangement_2;
typedef Arrangement_2::Edge_const_iterator Edge_const_iterator;
void print_result(const Polygon_set_2& S)
{
std::cout << "The result contains " << S.number_of_polygons_with_holes()
<< " components:" << std::endl;
Pwh_list_2 res;
S.polygons_with_holes(std::back_inserter(res));
for (Pwh_list_2::const_iterator hit = res.begin(); hit != res.end(); ++hit) {
std::cout << "--> ";
print_polygon_with_holes(*hit);
}
const Arrangement_2& arr = S.arrangement();
for (Edge_const_iterator it = arr.edges_begin(); it != arr.edges_end(); ++it) {
std::cout << it->curve()
<< ", " << it->label()
<< std::endl;
}
}
int main()
{
// Construct the two input polygons.
Polygon_2 P;
P.push_back(Point_2(0, 0));
P.push_back(Point_2(5, 0));
P.push_back(Point_2(3.5, 1.5));
P.push_back(Point_2(2.5, 0.5));
P.push_back(Point_2(1.5, 1.5));
std::cout << "P = "; print_polygon(P);
Polygon_2 Q;
Q.push_back(Point_2(0, 2));
Q.push_back(Point_2(1.5, 0.5));
Q.push_back(Point_2(2.5, 1.5));
Q.push_back(Point_2(3.5, 0.5));
Q.push_back(Point_2(5, 2));
std::cout << "Q = "; print_polygon(Q);
// Compute the union of P and Q.
Polygon_set_2 intersection_set;
intersection_set.insert(P);
intersection_set.intersection(Q);
print_result(intersection_set);
// Compute the intersection of P and Q.
Polygon_set_2 union_set;
union_set.insert(P);
union_set.join(Q);
print_result(union_set);
return 0;
}

How do you measure actual on-CPU time for an iOS thread?

I am looking for an iOS analog for Android's SystemClock.currentThreadTimeMillis() or Microsoft's GetThreadTimes() or Posix clock_gettime(CLOCK_THREAD_CPUTIME_ID, ) and pthread_getcpuclockid() functions to measure the actual "clean" time used by a function in a multithreaded application. That is, I don't want to measure the actual wall clock time spent in a function, but the on-CPU time.
I found interesting discussions about this here on stackoverflow and elsewhere. Unfortunately, neither applies to iOS.
Is there a comparable function for this on iOS?
In case anyone is looking for a good answer:
A while ago I found some great code in this answer (for finding CPU time/memory usage in OSX), and adapted it slightly. I used this for benchmarking some NEON optimizations on the ARM. You would probably only need the section which gets time for the current thread.
#include <sys/types.h>
#include <sys/sysctl.h>
#include <mach/mach_init.h>
#include <mach/mach_host.h>
#include <mach/mach_port.h>
#include <mach/mach_traps.h>
#include <mach/task_info.h>
#include <mach/thread_info.h>
#include <mach/thread_act.h>
#include <mach/vm_region.h>
#include <mach/vm_map.h>
#include <mach/task.h>
typedef struct {
double utime, stime;
} CPUTime;
int get_cpu_time(CPUTime *rpd, bool_t thread_only)
{
task_t task;
kern_return_t error;
mach_msg_type_number_t count;
thread_array_t thread_table;
thread_basic_info_t thi;
thread_basic_info_data_t thi_data;
unsigned table_size;
struct task_basic_info ti;
if (thread_only) {
// just get time of this thread
count = THREAD_BASIC_INFO_COUNT;
thi = &thi_data;
error = thread_info(mach_thread_self(), THREAD_BASIC_INFO, (thread_info_t)thi, &count);
rpd->utime = thi->user_time.seconds + thi->user_time.microseconds * 1e-6;
rpd->stime = thi->system_time.seconds + thi->system_time.microseconds * 1e-6;
return 0;
}
// get total time of the current process
task = mach_task_self();
count = TASK_BASIC_INFO_COUNT;
error = task_info(task, TASK_BASIC_INFO, (task_info_t)&ti, &count);
assert(error == KERN_SUCCESS);
{ /* calculate CPU times, adapted from top/libtop.c */
unsigned i;
// the following times are for threads which have already terminated and gone away
rpd->utime = ti.user_time.seconds + ti.user_time.microseconds * 1e-6;
rpd->stime = ti.system_time.seconds + ti.system_time.microseconds * 1e-6;
error = task_threads(task, &thread_table, &table_size);
assert(error == KERN_SUCCESS);
thi = &thi_data;
// for each active thread, add up thread time
for (i = 0; i != table_size; ++i) {
count = THREAD_BASIC_INFO_COUNT;
error = thread_info(thread_table[i], THREAD_BASIC_INFO, (thread_info_t)thi, &count);
assert(error == KERN_SUCCESS);
if ((thi->flags & TH_FLAGS_IDLE) == 0) {
rpd->utime += thi->user_time.seconds + thi->user_time.microseconds * 1e-6;
rpd->stime += thi->system_time.seconds + thi->system_time.microseconds * 1e-6;
}
error = mach_port_deallocate(mach_task_self(), thread_table[i]);
assert(error == KERN_SUCCESS);
}
error = vm_deallocate(mach_task_self(), (vm_offset_t)thread_table, table_size * sizeof(thread_array_t));
assert(error == KERN_SUCCESS);
}
if (task != mach_task_self()) {
mach_port_deallocate(mach_task_self(), task);
assert(error == KERN_SUCCESS);
}
return 0;
}
I use this code to measure thread CPU time. It's based on tcovo's answer, but is more concise and focused.
thread_time.h
#ifndef thread_time_h
#define thread_time_h
#include <stdint.h>
typedef struct thread_time {
int64_t user_time_us;
int64_t system_time_us;
} thread_time_t;
thread_time_t thread_time();
thread_time_t thread_time_sub(thread_time_t const a, thread_time_t const b);
#endif /* thread_time_h */
thread_time.c
#include "thread_time.h"
#include <mach/mach_init.h>
#include <mach/thread_act.h>
#include <mach/thread_info.h>
int64_t time_value_to_us(time_value_t const t) {
return (int64_t)t.seconds * 1000000 + t.microseconds;
}
thread_time_t thread_time() {
thread_basic_info_data_t basic_info;
mach_msg_type_number_t count = THREAD_BASIC_INFO_COUNT;
kern_return_t const result = thread_info(mach_thread_self(), THREAD_BASIC_INFO, (thread_info_t)&basic_info, &count);
if (result == KERN_SUCCESS) {
return (thread_time_t){
.user_time_us = time_value_to_us(basic_info.user_time),
.system_time_us = time_value_to_us(basic_info.system_time)
};
} else {
return (thread_time_t){-1, -1};
}
}
thread_time_t thread_time_sub(thread_time_t const a, thread_time_t const b) {
return (thread_time_t){
.user_time_us = a.user_time_us - b.user_time_us,
.system_time_us = a.system_time_us - b.system_time_us
};
}
From its description it seems that the clock() function does what you need?
http://developer.apple.com/library/ios/#documentation/System/Conceptual/ManPages_iPhoneOS/man3/clock.3.html

How to print using g++ with printf correctly?

I am compiling this code with g++:
#include <pthread.h>
#include <iostream>
#include <cstdlib>
#include <string>
#include <stdio.h>
using namespace std;
#define num_threads 3
#define car_limit 4
pthread_mutex_t mutex; // mutex lock
pthread_t cid; // thread id
pthread_attr_t attr; // thread attrubutes
void *OneCar(void *dir);
void ArriveBridge(int *direction);
void CrossBridge();
void ExitBridge(int *direction);
int main()
{
int dir[3] = {0,1,1};
pthread_mutex_init(&mutex, NULL);
pthread_attr_init(&attr);
//cout<< "Pthread Create" << endl;
printf("Pthread Create\n");
for(int i = 0; i < num_threads; i++)
{
pthread_create(&cid, &attr, OneCar, (void *)&dir[i]);
}
return 0;
}
void ArriveBridge(int *direction)
{
//cout<<"Arrive"<<*direction << endl;
int dr;
if(*direction == 0)
dr=0;
else
dr=1;
printf("Arrive%d", dr);
}
void CrossBridge(int *dir)
{
char d;
if(*dir == 0)
d = 'N';
else
d = 'S';
//cout<<"Crossing Bridge going:"<<d<<endl;
printf("Crossing Bridge going %c", d);
}
void ExitBridge(int *direction)
{
//cout<<"Exit" <<*direction<<endl;
int dr;
if(*direction == 0)
dr=0;
else
dr=1;
printf("Exit%d\n", dr);
}
void *OneCar(void *dir)
{
int *cardir;
cardir = (int *) dir;
//cout<<*cardir;
ArriveBridge(cardir);
CrossBridge(cardir);
ExitBridge(cardir);
return 0;
}
and I am expecting this result printed to the screen:
> Pthread Create
> Arrive0Crossing Bridge going NExit0
> Arrive1Crossing Bridge going SExit1
> Arrive1Crossing Bridge going NExit1
But i get this instead:
Pthread Create
Arrive0Crossing Bridge going NExit0
Why doesnt it print the rest out?
You need to use "pthread_join" in main to wait for all threads to exit before your program terminates. You should also use an array to hold the id of each thread that you create:
pthread_t cid[num_threads]; // thread id`
You'll then want to call join on every thread you create:
for(int i = 0; i < num_threads; i++)
{
pthread_create(&cid[i], &attr, OneCar, (void *)&dir[i]);
}
for(int i = 0; i < num_threads; ++i)
{
pthread_join(cid[i], NULL);
};
Running the modified code now gives:
Pthread Create
Arrive0Crossing Bridge going NExit0
Arrive1Crossing Bridge going SExit1
Arrive1Crossing Bridge going SExit1
Have you tried joining your threads at the end of main? It could be that the program is terminating before the other threads are completely finished.
You missed the newlines ("\n"):
printf("Arrive%d\n", dr);
printf("Crossing Bridge going %c\n", d);
Because of that, the streams are probably not flushed. Additionally, if you don't wait for your threads (pthread_join) your program will exit before the threads could do their work.