Large cupy array running out of GPU RAM

This is a total newbie question but I've been searching for a couple days and cannot find the answer.
I am using cupy to allocate a large array of doubles (circa 655k rows x 4k columns), which is about 16 GB in RAM. I'm running on a p2.8xlarge (the AWS instance that claims to have 96 GB of GPU RAM and 8 GPUs), but when I allocate the array it gives me an out-of-memory error.
Is this happening because the 96 GB of RAM is split into 8 x 12 GB lots that are only accessible to their own GPU? Is there no concept of pooling the GPU RAM across the GPUs (like regular RAM in a multi-CPU situation)?

From playing around with it a fair bit, I think the answer is no, you cannot pool memory across GPUs. You can move data back and forth between the GPUs and the CPU, but there is no concept of unified GPU RAM accessible to all GPUs.
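A minimal sketch of spreading the array across the 8 GPUs by hand with cupy (the shape comes from the question; the even split of rows per device is my assumption):

```python
import cupy as cp

n_rows, n_cols, n_gpus = 655_000, 4_000, 8
rows_per_gpu = (n_rows + n_gpus - 1) // n_gpus

chunks = []
for dev in range(n_gpus):
    with cp.cuda.Device(dev):                 # allocations inside this block land on GPU `dev`
        rows = min(rows_per_gpu, n_rows - dev * rows_per_gpu)
        chunks.append(cp.zeros((rows, n_cols), dtype=cp.float64))
# Each GPU now holds only its own slice, instead of one array that
# exceeds the 12 GB available on any single card.
```

Any computation then has to be done chunk by chunk (and device by device), since cupy will not transparently operate across GPUs.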

Related

Memory allocation strategies CPU vs GPU on deeplearning (cuda, tensorflow, pytorch,…)

I'm trying to start multiple training processes (10, for example) with TensorFlow 2. I'm still using Session and other tf.compat.v1 APIs throughout my codebase.
When I run on CPU, each process takes around 500 MB of CPU memory. htop output:
When I run on GPU, each process takes much more CPU memory (around 3 GB each) and almost as much (more, in reality) GPU memory. nvtop output (GPU memory on the left, CPU (host) memory on the right):
I can reduce each process's GPU memory footprint by setting the environment variable TF_CUDNN_USE_AUTOTUNE=0 (1.5 GB GPU, not more than 3 GB CPU), but that is still far more memory than running the process on CPU only. I tried a lot of things, like TF_GPU_ALLOCATOR=cuda_malloc_async with a TF nightly release, but it's still the same. This causes OOM errors if I want to keep 10 processes on the GPU like I do on the CPU.
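For reference, a minimal sketch of where such settings go, assuming a TF 2.x process like the ones above (tf.config.experimental.set_memory_growth is an extra option, not something tried here):

```python
import os
# Environment variables like these must be set before TensorFlow initializes CUDA.
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"

import tensorflow as tf

# Ask TF to grow GPU allocations on demand instead of grabbing a large pool up front.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
```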
By profiling a single process, I found that memory fragmentation may be a hint. You can find screenshots here.
TL;DR
When running a TF process on CPU only, it uses some memory (comparable to the data size). When running the same TF process on GPU, it uses much more memory (~16x, without any TensorFlow optimization).
I would like to know what can cause such a huge difference in memory usage, and how to prevent or fix it.
FYI, current setup: TF 2.6, CUDA 11.4 (or 11.2, 11.1, 11.0), Ubuntu 20.04, NVIDIA driver 370
EDIT: I tried converting my TensorFlow/TFLearn code to PyTorch. I see the same behaviour (low memory on CPU, and everything explodes when running on GPU).
EDIT2: Some of the memory allocated on the GPU is probably for the CUDA runtime. With PyTorch, I have 300 MB of memory allocated on a CPU run, but 2 GB of GPU memory and almost 5 GB of CPU memory used when running on GPU. Maybe the main problem is the CPU/system memory allocated for this process when running on GPU, since it seems the CUDA runtime alone can take almost 2 GB of GPU memory (which is huge...). It looks related to CUDA initialization.
EDIT3: This is definitely an issue with CUDA. Even if I just create a (1, 1) tensor with PyTorch, it takes 2 GB of GPU and almost 5 GB of CPU memory. This can be explained by PyTorch loading a huge number of kernels into memory, even if the main program isn't using them.
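A minimal way to reproduce and measure that split between what the framework allocates and what the CUDA context itself costs, assuming PyTorch as in the edits above (the exact numbers depend on driver and GPU):

```python
import torch

x = torch.zeros((1, 1), device="cuda")   # tiny tensor, but it forces full CUDA context creation

print(torch.cuda.memory_allocated())     # bytes PyTorch allocated for tensors (tiny)
print(torch.cuda.memory_reserved())      # bytes held by PyTorch's caching allocator
# Anything nvidia-smi reports beyond these figures is the CUDA context plus the
# kernels PyTorch loads up front, which matches the ~2 GB observed above.
```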

Google Colab Pro not allocating more than 1 GB of GPU memory

I recently upgraded to Colab Pro. I am trying to use GPU resources from Colab Pro to train my Mask RCNN model. I was allocated around 15 GB of memory when I tried to run the model right after I signed up for Pro. However, for some reason, I was allocated just 1 GB of memory from the next morning, and I haven't been allocated more than 1 GB since then. I was wondering if I am missing something or if I perturbed the VM's built-in packages. I understand that the allocation varies from day to day, but it's been like this for almost 3 days now. The following attempts have already been made, but none seems to work.
I have made sure that the GPU and "High-RAM" options are selected.
I have tried restarting runtimes several times.
I have tried running other scripts (just to make sure the problem was not with the Mask RCNN script).
I would appreciate any suggestions on this issue.
GPU info
The "High-RAM" setting in that screen controls the system RAM rather than GPU memory.
The command !nvidia-smi will show GPU memory. For example:
The highlighted output shows the GPU memory utilization: 0 of 16 GB.
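The same information can be queried from inside the notebook, for example with a small sketch like this (the query flags are standard nvidia-smi options, not something shown in the answer above):

```python
import subprocess

# Print the GPU name plus total and used memory as seen by the Colab runtime.
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,memory.used", "--format=csv"],
    capture_output=True, text=True,
)
print(out.stdout)
```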

CPU and GPU memory sharing

If the (discrete) GPU has its own video RAM, I have to copy my data from RAM to VRAM to be able to use it. But if the GPU is integrated with the CPU (e.g. AMD Ryzen) and shares the memory, do I still have to make copies, or can they both access the same memory block?
It is possible to avoid copying in the case of integrated graphics, but this feature is platform specific and may work differently for different vendors.
The article How to Increase Performance by Minimizing Buffer Copies on Intel® Processor Graphics describes how to achieve this for Intel hardware:
To create zero copy buffers, do one of the following:
Use CL_MEM_ALLOC_HOST_PTR and let the runtime handle creating a zero copy allocation buffer for you
If you already have the data and want to load the data into an OpenCL buffer object, then use CL_MEM_USE_HOST_PTR with a buffer allocated at a 4096 byte boundary (aligned to a page and cache line boundary) and a total size that is a multiple of 64 bytes (cache line size).
When reading or writing data to these buffers from the host, use clEnqueueMapBuffer(), operate on the buffer, then call clEnqueueUnmapMemObject().
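A rough sketch of those steps from Python with pyopencl (my assumption; the article itself targets the C API), using CL_MEM_ALLOC_HOST_PTR and a map/unmap pair around host access:

```python
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

n = 1_000_000
# Let the OpenCL runtime allocate host-visible memory so an integrated GPU can use it in place.
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.ALLOC_HOST_PTR,
                size=n * np.dtype(np.float32).itemsize)

# Map the buffer into host address space, fill it, then unmap before kernels use it.
host_view, _ = cl.enqueue_map_buffer(queue, buf, cl.map_flags.WRITE, 0, (n,), np.float32)
host_view[:] = np.arange(n, dtype=np.float32)
host_view.base.release(queue)   # unmap (clEnqueueUnmapMemObject under the hood)
```

Whether this actually avoids a copy is still up to the driver; on a discrete card the same calls work, but the data lives in pinned host memory instead.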
GPU and CPU memory sharing?
A GPU has many cores without their own control unit; the CPU controls the GPU through its control unit. A dedicated GPU has its own DRAM (VRAM/GRAM), which is faster than system RAM. An integrated GPU is placed on the same chip as the CPU, and the CPU and GPU then use the same RAM (shared memory).
References to other similar Q&As:
GPU - System memory mapping
Data sharing between CPU and GPU on modern x86 hardware with OpenCL or other GPGPU framework

Does Swap memory help if my RAM is insufficient?

New to StackOverflow and don't have enough credits to post a comment. So opening a new question.
I am running into the same issue as this:
why tensorflow just outputs killed
In this scenario, does SWAP memory help?
Little more info on the platform:
Raspberry Pi 3 on Ubuntu MATE 16.04
RAM - 1 GB
Storage - 32 GB SD card
Framework: Tensorflow
Network architecture - similar in complexity to AlexNet.
Appreciate any help!
Thanks
SK
While swap may stave off the hard failure seen in the linked question, swapping will generally doom your inference: the difference in throughput and latency between RAM and any other form of storage is simply far too large.
As a rough estimate, RAM throughput is about 100x that of even high quality SD cards. The factor for I/O operations per second is even larger, at somewhere between 100,000 and 5,000,000.

How does TensorFlow use both shared and dedicated GPU memory on the GPU on Windows 10?

When running a TensorFlow job I sometimes get a non-fatal error that says GPU memory exceeded, and then I see the "Shared memory GPU usage" go up on the Performance Monitor on Windows 10.
How does TensorFlow achieve this? I have looked at CUDA documentation and not found a reference to the Dedicated and Shared concepts used in the Performance Monitor. There is a Shared Memory concept in CUDA but I think it is something on the device, not the RAM I see in the Performance Monitor, which is allocated by the BIOS from CPU RAM.
Note: A similar question was asked but not answered by another poster.
Shared memory in Windows 10 does not refer to the same concept as CUDA shared memory (or local memory in OpenCL); it refers to host-accessible memory allocated for use by the GPU. For integrated graphics, host and device memory are usually one and the same ("shared"), since the CPU and GPU sit on the same die and can access the same RAM. For dedicated graphics cards with their own memory, this is separate memory allocated on the host side for use by the GPU.
Shared memory for compute APIs, such as GLSL compute shaders or NVIDIA CUDA kernels, refers to a programmer-managed cache layer (sometimes referred to as "scratchpad memory") which, on NVIDIA devices, exists per SM, can only be accessed by that SM, and is usually between 32 kB and 96 kB per SM. Its purpose is to speed up access to data that is used often.
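A minimal illustration of that per-block scratchpad in Python, using Numba's CUDA bindings (my choice of toolchain, not the answer's): each block stages data into shared memory, synchronizes, and reads it back reversed.

```python
from numba import cuda, float32
import numpy as np

@cuda.jit
def reverse_within_block(src, dst):
    tile = cuda.shared.array(shape=128, dtype=float32)   # lives in the per-SM shared memory
    i = cuda.threadIdx.x
    g = cuda.blockIdx.x * cuda.blockDim.x + i
    tile[i] = src[g]                          # stage global-memory data into the fast scratchpad
    cuda.syncthreads()                        # wait until every thread in the block has written
    dst[g] = tile[cuda.blockDim.x - 1 - i]    # fast reads from shared memory

x = np.arange(256, dtype=np.float32)
y = np.empty_like(x)
reverse_within_block[2, 128](x, y)            # 2 blocks of 128 threads each
print(y[:4])                                  # first block's elements, reversed: 127, 126, 125, 124
```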
If you see an increase in shared memory used by TensorFlow, you have a dedicated graphics card, and you are experiencing "GPU memory exceeded" errors, it most likely means you are using too much memory on the GPU itself, so TensorFlow is trying to allocate memory from elsewhere (i.e. from system RAM). This can make your program much slower, as bandwidth and latency are much worse for non-device memory on a dedicated graphics card.
I think I figured this out by accident. The "Shared GPU Memory" reported on the Windows 10 Task Manager Performance tab does get used if there are multiple processes hitting the GPU simultaneously. I discovered this by writing a Python program that used multiprocessing to queue up multiple GPU tasks, and I saw the "Shared GPU memory" start filling up. This is the only way I've seen it happen.
So it is only for queueing tasks. Each individual task is still limited to the onboard DRAM minus whatever is permanently allocated to actual graphics processing, which seems to be around 1 GB.
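A hedged sketch of that experiment, using cupy for the GPU work (an assumption; the original program isn't shown) and a process pool so several processes hit the GPU at once:

```python
import multiprocessing as mp

def gpu_task(seed):
    import cupy as cp                      # import inside the worker so each process gets its own CUDA context
    cp.random.seed(seed)
    x = cp.random.rand(4096, 4096)         # roughly 128 MB of GPU memory per worker
    return float((x @ x).sum())            # do some work so the allocation is actually used

if __name__ == "__main__":
    # With several processes on the GPU simultaneously, Task Manager's
    # "Shared GPU memory" figure starts to climb once dedicated memory runs short.
    with mp.Pool(processes=4) as pool:
        print(pool.map(gpu_task, range(4)))
```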