Atomic operation between integrated GPU and CPU - gpu

Hi, I'm developing an application that works on data shared between the GPU and the CPU.
I know I can do atomic operations on the GPU and on the CPU separately. I also don't want to use event-based synchronization between the CPU and the GPU.
Is there any way/command to do atomic operations on data shared between the CPU and an integrated GPU in OpenCL?

It's possible, but there are preconditions. You'll need a device supporting OpenCL 2.0 or higher (Intel, AMD and ARM all have such devices; I don't know about Nvidia).
To get started, look here, here and here.
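As a rough sketch of what the "preconditions" amount to in practice: in OpenCL 2.0, atomics that are visible to both the host and the device go through fine-grained shared virtual memory (SVM), and a device advertises this in the bitfield returned by clGetDeviceInfo for CL_DEVICE_SVM_CAPABILITIES. The snippet below only decodes such a bitfield in plain Python. The bit values are the ones from the OpenCL headers, but the helper name supports_cpu_gpu_atomics is made up for illustration, and obtaining the actual value from your device (via the C API or a binding) is left out.

```python
# Bit values of the cl_device_svm_capabilities bitfield (OpenCL 2.0 headers).
CL_DEVICE_SVM_COARSE_GRAIN_BUFFER = 1 << 0
CL_DEVICE_SVM_FINE_GRAIN_BUFFER = 1 << 1
CL_DEVICE_SVM_FINE_GRAIN_SYSTEM = 1 << 2
CL_DEVICE_SVM_ATOMICS = 1 << 3


def supports_cpu_gpu_atomics(svm_caps: int) -> bool:
    """Hypothetical helper: True if the bitfield advertises fine-grained SVM
    together with SVM atomics, which is what host/device atomics rely on."""
    fine_grained = svm_caps & (
        CL_DEVICE_SVM_FINE_GRAIN_BUFFER | CL_DEVICE_SVM_FINE_GRAIN_SYSTEM
    )
    return bool(fine_grained) and bool(svm_caps & CL_DEVICE_SVM_ATOMICS)


if __name__ == "__main__":
    # Example value only: coarse-grain + fine-grain buffer + atomics.
    caps = (
        CL_DEVICE_SVM_COARSE_GRAIN_BUFFER
        | CL_DEVICE_SVM_FINE_GRAIN_BUFFER
        | CL_DEVICE_SVM_ATOMICS
    )
    print(supports_cpu_gpu_atomics(caps))
```

If the check fails, the device only offers coarse-grained SVM (or none), and the CPU and GPU cannot update the same location atomically while a kernel is running.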

Related

Using OpenCL to get the energy consumption of my OpenCL Kernel

I am trying to estimate the power consumption of my OpenCL kernel running on an AMD Radeon RX Vega GPU. Is there a way to access the power consumption through OpenCL directly?
I tried using profilers but could not find one that supports AMD GPUs or OpenCL, so I want to do it programmatically if that's possible.
I have now managed to read the average power consumption of the GPU (reported in mW) through the rocm_smi library.
This is a little tricky and more of an estimate, because it is hard to call while the kernel is running. However, if the kernel's runtime is long enough, I can run rocm_smi from the command line and get the average power consumption while the kernel executes.
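To turn such power readings into an energy figure, one option is to sample the power several times while the kernel runs and integrate over time. The sketch below assumes you already have a list of (timestamp in seconds, power in mW) pairs collected by some external means, such as polling rocm_smi. The function names and the sample values are made up for illustration, and the result is only as good as the sampling rate.

```python
def energy_joules(samples):
    """Integrate (time_s, power_mW) samples with the trapezoidal rule.

    `samples` must be sorted by time and contain at least two entries.
    Returns the energy in joules.
    """
    total_mj = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # mW * s = mJ; use the mean power of each interval.
        total_mj += 0.5 * (p0 + p1) * (t1 - t0)
    return total_mj / 1000.0


def average_power_mw(samples):
    """Time-weighted average power in mW over the sampled interval."""
    duration_s = samples[-1][0] - samples[0][0]
    return energy_joules(samples) * 1000.0 / duration_s


if __name__ == "__main__":
    # Made-up readings taken while a kernel was running.
    readings = [(0.0, 100000.0), (1.0, 100000.0), (2.0, 200000.0)]
    print(energy_joules(readings), "J")
    print(average_power_mw(readings), "mW")
```

Subtracting the idle power measured the same way before the launch gives a rough estimate of what the kernel itself adds.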

"portable" way to determine GPU core occupancy with Vulkan

For writing GPU computation kernels (aka "compute shaders" in GL/Vulkan), it is useful to query various shader parameters such as register usage and shared memory usage, which determine how many threads can be scheduled on a single streaming multiprocessor (SM with nVidia, CU with AMD, etc.).
For AMD GPUs, there is an extension that provides vkGetShaderInfoAMD, through which one can obtain information about the VGPRs/SGPRs and shared memory (aka LDS) a shader occupies and thus calculate a good estimate of the core occupancy.
Is there any such possibility/extension for nVidia and Intel (and possibly other) GPUs, or a workaround to measure the properties of a GLSL shader on particular hardware in some other way? At least for nVidia cards, the functionality is implemented in CUDA, but that doesn't help much with debugging GLSL shaders.
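For reference, the kind of occupancy estimate meant here for AMD can be sketched as follows. This is a simplified model, not the vendor's formula. The default hardware limits (256 VGPRs per SIMD lane, 64 KB LDS per CU, 4 SIMDs per CU, at most 10 waves per SIMD) are assumptions modelled on GCN-class hardware and must be replaced with the values for the actual GPU. The function name is hypothetical, and SGPR limits and register allocation granularity are ignored.

```python
def estimate_waves_per_simd(
    vgprs_used,
    lds_bytes_per_group,
    waves_per_group,
    vgprs_per_simd=256,           # assumed GCN-like limit
    lds_bytes_per_cu=64 * 1024,   # assumed GCN-like limit
    simds_per_cu=4,               # assumed GCN-like layout
    max_waves_per_simd=10,        # assumed GCN-like limit
):
    """Rough estimate of resident waves per SIMD from VGPR and LDS usage."""
    # Register limit: every resident wave needs its own slice of the VGPR file.
    by_vgpr = vgprs_per_simd // max(vgprs_used, 1)

    # LDS limit: only whole workgroups can be resident on a compute unit.
    if lds_bytes_per_group > 0:
        groups_per_cu = lds_bytes_per_cu // lds_bytes_per_group
        by_lds = (groups_per_cu * waves_per_group) // simds_per_cu
    else:
        by_lds = max_waves_per_simd

    return min(by_vgpr, by_lds, max_waves_per_simd)


if __name__ == "__main__":
    # Example: 64 VGPRs per thread, no LDS, one wave per workgroup.
    print(estimate_waves_per_simd(64, 0, 1))
```

The inputs correspond to the numbers vkGetShaderInfoAMD reports for a compiled shader; on other vendors the same arithmetic applies only if the register and shared memory usage can be obtained some other way.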

How does TensorFlow use both shared and dedicated GPU memory on the GPU on Windows 10?

When running a TensorFlow job I sometimes get a non-fatal error that says GPU memory exceeded, and then I see the "Shared memory GPU usage" go up on the Performance Monitor on Windows 10.
How does TensorFlow achieve this? I have looked at CUDA documentation and not found a reference to the Dedicated and Shared concepts used in the Performance Monitor. There is a Shared Memory concept in CUDA but I think it is something on the device, not the RAM I see in the Performance Monitor, which is allocated by the BIOS from CPU RAM.
Note: A similar question was asked but not answered by another poster.
Shared memory in Windows 10 does not refer to the same concept as CUDA shared memory (or local memory in OpenCL); it refers to host memory that is allocated for, and accessible by, the GPU. For integrated graphics, host and device memory are usually the same thing, because the CPU and GPU sit on the same die and access the same RAM. For dedicated graphics cards with their own memory, it is separate memory allocated on the host side for use by the GPU.
Shared memory in compute APIs, such as GLSL compute shaders or Nvidia CUDA kernels, refers to a programmer-managed cache layer (sometimes referred to as "scratch pad memory"), which on Nvidia devices exists per SM, can only be accessed by that SM, and is usually between 32 kB and 96 kB per SM. Its purpose is to speed up access to data that is used often.
If you see an increase in shared memory usage in TensorFlow, you have a dedicated graphics card, and you are getting "GPU memory exceeded", it most likely means you are using too much memory on the GPU itself, so it is trying to allocate memory from elsewhere (i.e. from system RAM). This can make your program much slower, as bandwidth and latency are much worse for non-device memory on a dedicated graphics card.
I think I figured this out by accident. The "Shared GPU memory" reported by the Windows 10 Task Manager Performance tab does get used if there are multiple processes hitting the GPU simultaneously. I discovered this by writing a Python program that used multiprocessing to queue up multiple GPU tasks, and I saw the "Shared GPU memory" start filling up. This is the only way I've seen it happen.
So it is only for queueing tasks. Each individual task is still limited to the onboard DRAM minus whatever is permanently allocated to actual graphics processing, which seems to be around 1GB.

Can GPU be used to run programs that run on CPU?

Can a GPU be used to run programs that normally run on a CPU, like getting input from the keyboard and mouse, playing music, or reading the contents of a text file, using the Direct3D and OpenGL APIs?
The GPU has no direct access to any memory that the OS maps for client code (i.e. code that runs in user mode, with its instructions executed on the CPU).
In addition, the GPU is not meant for tasks like these; it is designed to perform floating-point arithmetic at high speed. And finally, you would never use Direct3D or OpenGL for anything that is not related to graphics, unless you are only going to use compute shaders.
General purpose computations are performed with OpenCL or CUDA on the GPU, such as image manipulation or physics simulations.
You can, however, gather any data on the CPU, send it to the GPU for further processing and finally write it back again into memory accessible from the CPU.

Getting the most of the GPU in an Embedded Platform

My platform is Ubuntu running on an Exynos 4412 CPU, which has a Mali-400 GPU. I would like to do some computer vision using OpenCV and OpenGL, and I'm also going to write some fragment shaders. My question is: what is the fastest way to copy contents from the GPU to the CPU? Using glReadPixels is really slow on my platform. Is it beneficial to run glReadPixels in its own thread or to use OpenMP? Suggestions are welcome please :).
The Exynos 4412 doesn't have separate CPU and GPU memory at the hardware level; it's all the same RAM and physically accessible by both. Thus, there is likely to be some way to access the GPU's portion of the memory directly from the CPU.