Using OpenCL to get the energy consumption of my OpenCL Kernel - gpu

I am trying to estimate the power consumption of my OpenCL Kernel running on AMD Radeon RX Vega GPU. is there a way to access the power consumption through OpenCL directly?
I tried using profilers but could not find one that supports amd GPUs or opencl. so I want to do it through programming if that's possible

I now managed to access the average power consumption of the GPU (provided in mW) through the rocm_smi library.
this is a little tricky and more of an estimate because it is hard to call while the kernel is launched. However, if the kernel's runtime is long enough, I can run rocm_smi from command line and get the average power consumption during the kernel's launch.

Related

"portable" way to determine GPU core occupancy with Vulkan

For writing GPU computation kernels (aka "compute shaders" in GL/Vulkan), it is useful to query various shader parameters such as register usage and shared memory usage that determine how much individual threads may get scheduled to a single streaming multiprocessor (SM with nVidia, CU with AMD, etc.).
For AMD GPUs, we have an appropriate extension that contains vkGetShaderInfoAMD, using which one can reach some information about occupied VGPR/SGPRs and shared memory (aka LDS) and thus calculate a good estimate of the core occupancy.
Is there any such possibility/extension for nVidia and Intel (and possibly other) GPUs, or a workaround to measure the properties of a GLSL shader on a particular hardware in some other way? At least for nVidia cards, the functionality is implemented in CUDA, but that doesn't help much with debugging GLSL shaders stuff.

Atomic operation between integrated GPU and CPU

Hi I'm working on developing an application, which involves working on shared data between GPU and CPU.
I know I can do atomic operation GPU and CPU separately. And also I don't want to use event synchronized between CPU and GPU.
Is there any way/command so that I can do atomic operation on shared data between CPU and integrated GPU in OpenCL?
It's possible but there are preconditions. You'll need a device supporting OpenCL 2.0 or higher (Intel, AMD and ARM all have such devices, i dunno about Nvidia).
To get started, look here,here and here.

Can GPU be used to run programs that run on CPU?

Can Gpu be used to run programs that run on Cpu like getting input from keyboard and mouse or playing music or reading the contents of a text file using Direct3D and OpenGL Api?
The GPU has no direct access on any memory that is mapped by the OS to be accessed within client code (i.e. code, which is executed in user-mode while the instructions are executed on the CPU).
In addition the GPU is not supposed to perform stuff like this, it aims to perform floating point arithmetic at a high speed. And finally you would never use Direct3D or OpenGL to perform anything that is not related to graphics, except you are only going to use the compute shader.
General purpose computations are performed with OpenCL or CUDA on the GPU, such as image manipulation or physics simulations.
You can, however, gather any data on the CPU, send it to the GPU for further processing and finally write it back again into memory accessible from the CPU.

What PC hardware is needed to try out GPU programming?

I am interested to try out GPU programming. One thing not clear to me is, what hardware do I need? Is it right any PC with graphics card is good? I have very little knowledge of GPU programming, so the starting learning curve is best not steep. If I have to make a lot of hacks just in order to run some tutorial because my hardware is not good enough, I'd rather to buy a new hardware.
I have a retired PC (~10 year old) installed with Ubuntu Linux, I am not sure what graphics card it has, must be some old one.
I am also planning to buy a new sub-$500 desktop which to my casual research normally has AMD Radeon 7x or Nvidia GT 6x graphics card. I assume any new PC is good enough for the programming learning.
Anyway any suggestion is appreciated.
If you want to use CUDA, you'll need a GPU from NVidia, and their site explains the compute capabilities of their different products.
If you want to learn OpenCL, you can start right now with an OpenCL implementation that has a CPU back-end. The basics of writing OpenCL code targeting CPUs or GPUs is the same, and they differ mainly in performance tuning.
For GPU programming, any AMD or NVidia GPU made in the past several years will have some degree of OpenCL support, though there have been some new features introduced with newer generations that can't be easily emulated for older generations.
Intel's integrated GPUs in Ivy Bridge and later support OpenCL, but Intel only provides a GPU-capable OpenCL implementation for Windows, not Linux.
Also be aware that there is a huge difference between a mid-range and high-end GPU in terms of compute capabilities, especially where double-precision arithmetic is supported. Some low-end GPUs don't support double-precision at all, and mid-range GPUs often perform double-precision arithmetic 24 times slower than single-precision. When you want to do a lot of double-precision calculations, it's absolutely worth it to get a compute-oriented GPU (Radeon 7900 series or GeForce Titan and up).
If you want a low-end system with a non-trivial GPU power, you best bet at the moment is probably to get a system built around an AMD APU.

non-graphics benchmarks for gpu

Most of the benchmarks for gpu performance and load testing are graphics related. Is there any benchmark that is computationally intensive but not graphics related ? I am using
DELL XPS 15 laptop,
nvidia GT 525M graphics card,
Ubuntu 11.04 with bumblebee installed.
I want to load test my system to come up with a max load the graphics cards can handle. Are there any non-graphics benchmarks for gpu ?
What exactly do you want to measure?
To measure GFLOPS on the card just write a simple Kernel in Cuda (or OpenCL).
If you have never written anything in CUDA let me know and i can post something for you.
If your application is not computing intensive (take a look at a roofline paper) then I/O will be the bottleneck. Getting data from global (card) memory to the processor takes 100's of cycles.
On the other hand if your application IS compute intensive then just time it and calculate how many bytes you process per second. In order to hit the maximum GFLOPS (your card can do 230) you need many FLOPs per memory access, so that the processors are busy and not stalling for memory and switching threads.