How to use GPU in Paperspace when running Transformers Pipeline? - gpu

We are trying to run a HuggingFace Transformers pipeline model in Paperspace (using its GPU).
The problem is that when we set 'device=0' we get this error:
RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 15.90 GiB total capacity; 476.40 MiB already allocated; 7.44 MiB free; 492.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
And if we don't set 'device=0', the GPU isn't used (which is expected, because the default is not to use it).
How can we make the GPU work for these models in Paperspace?
The code is pretty simple so far:
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="Sahajtomar/German_Zeroshot",
    # device=0,  # uncomment to run on the GPU
)
On Google Colab, with the same model and the same code, we didn't have this problem: we just set 'device=0' and the GPU ran perfectly.
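For what it's worth, the numbers in that traceback (15.90 GiB total, only 7.44 MiB free, yet under 500 MiB reserved by PyTorch) suggest the GPU is almost entirely occupied by another process before the pipeline is even built. A minimal diagnostic sketch, assuming a reasonably recent PyTorch build that provides torch.cuda.mem_get_info:
import torch

# Check what this process can actually see on GPU 0.
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))
free_bytes, total_bytes = torch.cuda.mem_get_info(0)
print(f"free: {free_bytes / 1024**2:.0f} MiB / total: {total_bytes / 1024**2:.0f} MiB")
# If almost nothing is free before the pipeline is loaded, another kernel or
# notebook is holding the GPU; restarting the Paperspace kernel (or checking
# nvidia-smi for stray processes) before retrying with device=0 is the usual fix.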

Related

No additional memory from Colab Pro+

I've been running this notebook with the runtime type set to "GPU" with "High-RAM". I was getting the following error:
CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 15.90 GiB total capacity; 14.81 GiB already allocated; 31.75 MiB free; 14.94 GiB reserved in total by PyTorch)
So I upgraded from Pro to Pro+, because that's supposed to give me more memory, but I'm still getting the same error.
I don't think a better GPU was promised with Colab Pro+.
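One thing worth checking (a hedged sketch, assuming PyTorch is available in the Colab runtime) is which GPU the Pro+ runtime actually assigned, since the error above shows the same ~16 GiB card as before:
import torch

# Print the assigned GPU model and its total memory.
# (Per the answer above, a larger-memory GPU is not promised with Pro+.)
props = torch.cuda.get_device_properties(0)
print(props.name, f"{props.total_memory / 1024**3:.1f} GiB")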

Dask-Rapids data movement and out-of-memory issue

I am using dask (2021.3.0) and rapids (0.18) in my project. In it, I perform the preprocessing task on the CPU, and the preprocessed data is later transferred to the GPU for K-means clustering. But in this process, I am getting the following problem:
1 of 1 worker jobs failed: std::bad_alloc: CUDA error: ~/envs/include/rmm/mr/device/cuda_memory_resource.hpp:69: cudaErrorMemoryAllocation out of memory
(the error occurs before GPU memory is fully used, i.e. it is not using the GPU memory completely)
I have a single GPU with 40 GB of memory.
RAM size is 512 GB.
I am using the following snippet of code:
from dask.distributed import Client, LocalCluster
import cupy as cp
from cuml.dask.cluster import KMeans  # assuming RAPIDS' Dask KMeans, since fit_predict() is followed by .compute()

cluster = LocalCluster(n_workers=1, threads_per_worker=1)
cluster.scale(100)
client = Client(cluster)  # cuml.dask estimators need an active distributed client

# perform my preprocessing on data and get the output in variable A
# convert the A variable to cupy-backed blocks
x = A.map_blocks(cp.asarray)
km = KMeans(n_clusters=4)
predict = km.fit_predict(x).compute()
I am also looking for a solution so that data larger than GPU memory can be preprocessed, and whenever GPU memory spills, the spilled data is transferred to a temp directory or to the CPU (as we do with dask, where we define a temp directory for spills from RAM).
Any help will be appreciated.
There are several ways to work with datasets larger than GPU memory.
Check out Nick Becker's blog, which documents a few methods well.
Check out BlazingSQL, which is built on top of RAPIDS and can perform out-of-core processing. You can try it at beta.blazingsql.com.
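For the spilling part of the question specifically, here is a hedged sketch assuming the dask_cuda package that ships with RAPIDS; the sizes and limits below are illustrative, not tuned:
import dask.array as da
import cupy as cp
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# A GPU-aware cluster that spills device memory to host memory once a worker
# crosses device_memory_limit, instead of failing with std::bad_alloc.
cluster = LocalCUDACluster(n_workers=1, device_memory_limit="30GB")
client = Client(cluster)

# Stand-in for the preprocessed dask array A from the question.
A = da.random.random((1_000_000, 64), chunks=(100_000, 64))
x = A.map_blocks(cp.asarray)  # move blocks to the GPU lazily
# ...the KMeans call from the question can then run on x as before.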

AWS GPU OOM issue (ONNX, CUDA)

Doing predictions on an AWS GPU instance, g4dn.4xlarge (16 GB GPU memory, 64 GB CPU memory), deployed with k8s and Docker.
Tested with (CUDA 10.1 + onnxruntime-gpu==1.4.0) and (CUDA 10.2 + onnxruntime-gpu==1.6.0); same error. The models are customised for our purpose, so I can't point to the weights.
The problem is:
We're getting a CUDA OOM (out of memory) error:
Error: onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Conv node. Name:'Conv_16' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:298 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 33554432
On some backtracking:
Using nvidia-smi and GPU memory profiling, I found that for the first prediction and for all subsequent predictions, a constant chunk of GPU memory stays blocked: a minimum of ~1.8 GB for some models, ~3 GB for others (I think it's blocked for multiprocessing). Releasing the memory doesn't make sense, because the same amount will be blocked again for the next prediction.
My understanding:
At peak we scale up to 22 pods, and in every pod the model load is initialized, so every pod blocks 1.8-3 GB of memory while pointing to one GPU instance with 16 GB of GPU memory. So, with 22 pods, OOM is expected.
What is confusing:
The CUDA message above reports OOM, but GPU profiling shows that memory utilisation never exceeds 50%, although SM (streaming multiprocessor) utilisation is 100% at peak (when pods scale to 22). Image attached for reference.
From research I understood that SM has nothing to do with OOM and that CUDA handles SM scheduling efficiently. Then why do we get a CUDA OOM error if only 50% of the memory is utilised?
Ruled out:
I ruled out a memory leak from the model, as it runs without OOM errors when the load is low.
Why GPU and not CPU for prediction:
We want faster predictions. It ran on CPU without any error, even under high load.
What I am looking for:
A way to scale AWS GPU instances on the basis of GPU memory. If OOM is the reason, scaling on GPU memory should solve the problem, but I can't find such an option.
An explanation of the CUDA message: when memory is available, why OOM?
Very hypothetically: is there a way, by design or using k8s, to create a singleton object for a particular model load so that scaled-up pods can use that model object for prediction rather than creating a new server? But that would defeat the point of using k8s for availability and scalability.
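One knob that may be relevant here is the CUDA execution provider's memory arena, which is what keeps the ~1.8-3 GB per session. A hedged sketch, assuming a newer onnxruntime-gpu release than the 1.4/1.6 builds mentioned above (per-provider options landed later), with "model.onnx" and the 2 GB cap as placeholders:
import onnxruntime as ort

# Cap the CUDA memory arena for this session and stop it from growing in
# large chunks; values are illustrative.
cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 2 * 1024 * 1024 * 1024,      # ~2 GB arena cap per session
    "arena_extend_strategy": "kSameAsRequested",  # grow only by what is requested
}
sess = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
With the arena capped per pod, 22 pods sharing one 16 GB card is still tight, but the per-session footprint at least becomes predictable enough to scale on.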

How to free allocated GiB memory in EC2 CUDA GPU instance

I have a p3.2xlarge instance. I ran a couple of experiments on it, and now that I want to run a new (deep learning) experiment, I get the following error:
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 15.78 GiB total capacity; 14.70 GiB already allocated; 34.44 MiB free; 14.76 GiB reserved in total by PyTorch)
I wonder if there's any way I can free the allocated memory so that I can run my experiment? Is freeing the memory even the solution for such an error?
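With 14.70 GiB already allocated, the memory is most likely still held by a Python process left over from the earlier experiments, so "freeing" it usually means cleaning up or ending that process. A minimal sketch of the in-process cleanup pattern, assuming PyTorch (the tensor below is just a stand-in for a leftover object):
import gc
import torch

x = torch.empty(4096, 4096, device="cuda")  # stand-in for a large leftover CUDA tensor
print(torch.cuda.memory_allocated())

del x                      # drop the last reference
gc.collect()               # let Python actually release it
torch.cuda.empty_cache()   # return cached blocks from PyTorch's allocator to the driver
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())
# If the old experiment's process has already been abandoned but not killed,
# nvidia-smi shows which PID still holds the memory; terminating it (or
# rebooting the instance) frees the GPU.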

Allocating GPU memory for cupy arrays

I have a TensorFlow session running in parallel to this CuPy code. I have allocated 8 GB out of my total 16 GB of GPU memory to the TensorFlow session. What I want now is to allocate 2 GB from the remaining 7 GB for executing this CuPy code. The actual code is more involved than the example code I provided. In my actual code, cp_arr is the result of a series of array operations, but I want cp_arr to be allocated in the specified 2 GB region of my GPU memory. Remember, freeing GPU resources by closing the TensorFlow session is not an option.
This is the code I am using.
import cupy as cp

memory = cp.cuda.Memory(2048000000)  # raw ~2 GB device allocation
ptr = cp.cuda.MemoryPointer(memory, 0)
cp_arr = cp.ndarray(shape=(30, 1080, 1920, 3), memptr=ptr)
cp_arr = ** Array operations **
In this case, an additional 1.7 GB of memory was allocated while 'cp_arr = ** Array operations **' was being executed. What I want is to make use of the allocated 2 GB region to hold my CuPy array, cp_arr. Thanks in advance.
CuPy's memory allocation behavior is similar to NumPy's.
As in NumPy, several functions support an out argument, which can be used to store the computation result into a specified array, e.g. https://docs-cupy.chainer.org/en/stable/reference/generated/cupy.dot.html
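As a hedged illustration of that out= pattern (the shapes below are arbitrary): preallocate the destination once, then let CuPy write results into it instead of allocating a fresh array for every operation.
import cupy as cp

a = cp.ones((1024, 1024), dtype=cp.float32)
b = cp.ones((1024, 1024), dtype=cp.float32)
result = cp.empty((1024, 1024), dtype=cp.float32)  # preallocated destination

cp.dot(a, b, out=result)              # writes into result, no new allocation
cp.multiply(result, 2.0, out=result)  # elementwise ops also accept out=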