Getting the most of the GPU in an Embedded Platform

Getting the most of the GPU in an Embedded Platform - optimization

My platform is Ubuntu running ob Exynos4412CPU which has the Mali400GPU. I would like to do some computer vision using OpenCV and OpenGL, I'm also going to do some fragment shaders. My question is what is the fastest way to copy the contents from the GPU to the CPU, which is really slow on my platform using glreadpixels. Is it beneficial to utilize glreadpixels in its own thread or use OpenMP ? Suggestions are welcome please :).

The Exynos 4412 doesn't have separate CPU and GPU memory at the hardware level; it's all the same RAM and physically accessible by both. Thus, there is likely to be some way to access the GPU's portion of the memory directly from the CPU.

Related

Is it possible to use system memory instead of GPU memory for processing Dask tasks

We have been running DASK clusters on Kubernetes for some time. Up to now, we have been using CPUs for processing and, of course, system memory for storing our Dataframe of around 1,5 TB (per DASK cluster, split onto 960 workers). Now we want to update our algorithm to take advantage of GPUs. But it seems like the available memory on GPUs is not going to be enough for our needs, it will be a limiting factor(with our current setup, we are using more than 1GB of memory per virtual core).
I was wondering if it is possible to use GPUs (thinking about NVDIA, AMD cards with PCIe connections and their own VRAMS, not integrated GPUs that use system memory) for processing and system memory (not GPU memory/VRAM) for storing DASK Dataframes. I mean, is it technically possible? Have you ever tried something like this? Can I schedule a kubernetes pod such that it uses GPU cores and system memory together?
Another thing is, even if it was possible to allocate the system RAM as VRAM of GPU, is there a limitation to the size of this allocatable system RAM?
Note 1. I know that using system RAM with GPU (if it was possible) will create an unnecessary traffic through PCIe bus, and will result in a degraded performance, but I would still need to test this configuration with real data.
Note 2. GPUs are fast because they have many simple cores to perform simple tasks at the same time/in parallel. If an individual GPU core is not superior to an individual CPU core then may be I am chasing the wrong dream? I am already running dask workers on kubernetes which already have access to hundreds of CPU cores. In the end, having a huge number of workers with a part of my data won't mean better performance (increased shuffling). No use infinitely increasing the number of cores.
Note 3. We are mostly manipulating python objects and doing math calculations using calls to .so libraries implemented in C++.
Edit1: DASK-CUDA library seems to support spilling from GPU memory to host memory but spilling is not what I am after.
Edit2: It is very frustrating that most of the components needed to utilize GPUs on Kubernetes are still experimental/beta.
Dask-CUDA: This library is experimental...
NVIDIA device plugin: The NVIDIA device plugin is still considered beta and...
Kubernetes: Kubernetes includes experimental support for managing AMD and NVIDIA GPUs...

I don't think this is possible directly as of today, but it's useful to mention why and reply to some of the points you've raised:
Yes, dask-cuda is what comes to mind first when I think of your use-case. The docs do say it's experimental, but from what I gather, the team has plans to continue to support and improve it. :)
Next, dask-cuda's spilling mechanism was designed that way for a reason -- while doing GPU compute, your biggest bottleneck is data-transfer (as you have also noted), so we want to keep as much data on GPU-memory as possible by design.
I'd encourage you to open a topic on Dask's Discourse forum, where we can reach out to some NVIDIA developers who can help confirm. :)
A sidenote, there are some ongoing discussion around improving how Dask manages GPU resources. That's in its early stages, but we may see cool new features in the coming months!

Does High-Ram in Google Colab refer to GPU RAM or CPU RAM?

I see that you get better GPUs with the Pro account. But does switching to High-Ram affect the GPU's memory or the CPU's memory? I don't see it spelled out exactly anywhere, and maybe this is just common knowledge among ML experts.

How does TensorFlow use both shared and dedicated GPU memory on the GPU on Windows 10?

When running a TensorFlow job I sometimes get a non-fatal error that says GPU memory exceeded, and then I see the "Shared memory GPU usage" go up on the Performance Monitor on Windows 10.
How does TensorFlow achieve this? I have looked at CUDA documentation and not found a reference to the Dedicated and Shared concepts used in the Performance Monitor. There is a Shared Memory concept in CUDA but I think it is something on the device, not the RAM I see in the Performance Monitor, which is allocated by the BIOS from CPU RAM.
Note: A similar question was asked but not answered by another poster.

Shared memory in windows 10 does not refer to the same concept as cuda shared memory (or local memory in opencl), it refers to host accessible/allocated memory from the GPU. For integrated graphics processing host and device memory is usually the same as shared thanks to both the cpu and gpu being located on the same die and being able to access the same ram. For dedicated graphics with their own memory, this is separate memory allocated on the host side for use by the GPU.
Shared memory for compute APIs such as through GLSL compute shaders, or Nvidia CUDA kernels refer to a programmer managed cache layer (some times refereed to as "scratch pad memory") which on Nvidia devices, exists per SM, and can only be accessed by a single SM and is usually between 32kB to 96kB per SM. Its purpose is to speed up memory access to data which is used often.
If you see and increase shared memory used in Tensorflow, you have a dedicated graphics card, and you are experiencing "GPU memory exceeded" it most likely means you are using too much memory on the GPU itself, so it is trying to allocate memory from elsewhere (IE from system RAM). This potentially can make your program much slower as the bandwidth and latency will be much worse on non device memory for a dedicated graphics card.

I think I figured this out by accident. The "Shared GPU Memory" reported by Windows 10 Task Manager Performance tab does get used, if there are multiple processes hitting the GPU simultaneously. I discovered this by writing a Python programming that used multiprocessing to queue up multiple GPU tasks, and I saw the "Shared GPU memory" start filling up. This is the only way I've seen it happen.
So it is only for queueing tasks. Each individual task is still limited to the onboard DRAM minus whatever is permanently allocated to actual graphics processing, which seems to be around 1GB.

Can GPU be used to run programs that run on CPU?

Can Gpu be used to run programs that run on Cpu like getting input from keyboard and mouse or playing music or reading the contents of a text file using Direct3D and OpenGL Api?

The GPU has no direct access on any memory that is mapped by the OS to be accessed within client code (i.e. code, which is executed in user-mode while the instructions are executed on the CPU).
In addition the GPU is not supposed to perform stuff like this, it aims to perform floating point arithmetic at a high speed. And finally you would never use Direct3D or OpenGL to perform anything that is not related to graphics, except you are only going to use the compute shader.
General purpose computations are performed with OpenCL or CUDA on the GPU, such as image manipulation or physics simulations.
You can, however, gather any data on the CPU, send it to the GPU for further processing and finally write it back again into memory accessible from the CPU.

AMD Fusion how can CPU and GPU share the same memory?

AMD announced it's Fusion platform some time ago. Having read a bit about it I'm both excited and sceptic. For example it should make it possible that GPUs and CPUs share the same memory. (and the GPU and CPU are both in the same package) Now since GPUs have a much higher memory bandwidth (around 10x the bandwidth a CPU has) and that the way CPUs and GPUs use cache is fundamentally different, the question arises how the heck can they do this? I wonder if any details are known.

I have also searched for some detailed info on how this APU technically works, but haven't found anything better than AMD's whitepaper on the subject, which, in a slightly marketing-wise tone, does present a lot of good info.

By using high-bandwidth dual ported RAM. AMD is happy to explain.

Here is a good paper (from AMD) explaining the different memory access patterns in an APU
http://amddevcentral.com/afds/assets/presentations/1004_final.pdf

CPUs and GPUs have been sharing memory for years. Fusion is no different, in fact it's far from the first instance of a GPU being integrated into a general-purpose CPU core.
Like all other such solutions, it'll be fine for casual use, but it'll be far from cutting-edge 3D acceleration.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas