I am trying to do a matrix decomposition (or tucker decomposition on a tensor) in Tensorflow with GPU. I have tensorflow-gpu, my NVidia GPU has 4GB RAM. My problem is that my input matrix is huge, millions of rows and millions of columns and the size of the matrix is more than 5GB in memory. So each time Tensorflow gives me an out of memory (OOM) error. (If I turn off GPU, the whole process can run successfully in CPU using system RAM. Of course, the speed is slow.)
I did some research on Tensorflow and on NVidia CUDA lib. CUDA seems has a "unified memory" mechanism so the system RAM and GPU RAM share one address book. Yet no further details found.
I wonder if Tensorflow supports some memory sharing mechanism such that I can generate input in system RAM? (Since I want to use GPU to accelerate the calculations) And GPU can do the calculation piece by piece.
Related
Assuming TensorFlow GPU library being used in computation, which operations are offloaded to GPU (and how often)? What is the performance impact of:
CPU Core count (because it is now not actively involved in computation)
RAM size.
GPU VRAM (What benefit of owning a higher memory GPU)
Say I'd like to decide upon particular(s) of these hardware choices. Can someone explain with an example, which aspect of a Machine Learning model will impact the particular hardware constraint?
(I need a little elaboration on what exact ops are offloaded to GPU and CPU, based on TensorFlow GPU lib for example.)
One way of using tensorflow to efficiently spread work between CPUs and GPUs is to use estimators.
For example :
model = tf.estimator.Estimator(model_fn=model_fn,
params=params,
model_dir="./models/model-v0-0")
model.train(lambda:input_fn(train_data_path), steps=1000)
In the function 'input_fn' the data batch loading and batch preparation will be offloaded to the CPU while the GPU is working on the model as declared in the function 'model_fn'.
If you are concerned about RAM constraints then you should look at using the tfrecord format as this avoids loading up the whole dataset in RAM
see tensorflow.org/tutorials/load_data/tf_records
I have an GTX 1050 ti (4GB) and i5 CPU, 8GB memory.
I successfully installed tensorflow-gpu with cuda driver on win10 and the test shows that tensorflow is actually using the gpu (snapshot):
However, when carrying out the training with CNN, while the GPU memory is always 100%, the GPU load is qualsi 0 with some spikes # 30%~70%:
Is it normal ?
EDIT: While the GPU occupation is qualsi 0 with spikes, the CPU load is fixed at 100% during the training.
EDIT2: I did read somewhere that the CPU could be high while GPU be low if there are a lot of operations of data copy between CPU and GPU. But I am using the official tensorflow object detection api for the training so I am totally unaware of the possible place in code.
What you see is normal behavior in most cases.
TensorFlow books the entire GPU memory initially.
The load on the GPU is dependent upon the data it is getting for processing.
If the data loading operation is slow, then most of the time GPU is waiting for data to get copied from disk to the GPU, and during that time it is not performing any work. That is what you see in your screen.
I have a very large deep neural network. when I try to run it on GPU I get "OOM when allocating". but when I mask the GPU and run on CPU it works (about 100x slower when comparing small model).
My question is if there is any mechanism in tenosrflow that would enable me to run the model on GPU. I assume the CPU uses virtual memory so it can allocates as much as he likes and move between cache/RAM/Disk (thrashing).
is there something similiar on Tensorflow with GPU? that would help me even if it will be 10x slower than regular GPU run
Thanks
GPU memory is currently not extensible (Till something like PASCAL is available)
Simply in tensorflow we can run our project on CPU or GPU. I have a GTX-1080 with 2500 cores. when we use a GPU in tensorflow all of the cores and memory of the graphic card is involved.
How can I use a part or certain number of my graphic cores in tensorflow ?
I am using Nvidia Digits Box with GPU (Nvidia GeForce GTX Titan X) and Tensorflow 0.6 to train the Neural Network, and everything works. However, when I check the Volatile GPU Util using nvidia-smi -l 1, I notice that it's only 6%, and I think most of the computation is on CPU, since I notice that the process which runs Tensorflow has about 90% CPU usage. The result is the training process is very slow. I wonder if there are ways to make full usage of GPU instead of CPU to speed up the training process. Thanks!
I suspect you have a bottleneck somewhere (like in this github issue) -- you have some operation which doesn't have GPU implementation, so it's placed on CPU, and the GPU is idling because of data transfers. For instance, until recently reduce_mean was not implemented on GPU, and before that Rank was not implemented on GPU, and it was implicitly being used by many ops.
At one point, I saw a network from fully_connected_preloaded.py being slow because there was a Rank op that got placed on CPU, and hence triggering the transfer of entire dataset from GPU to CPU at each step.
To solve this I would first recommend upgrading to 0.8 since it had a few more ops implemented for GPU (reduce_prod for integer inputs, reduce_mean and others).
Then you can create your session with log_device_placement=True and see if there are any ops placed on CPU or GPU that would cause excessive transfers per step.
There are often ops in the input pipeline (such as parse_example) which don't have GPU implementations, I find it helpful sometimes to pin the whole input pipeline to CPU using with tf.device("/cpu:0"): block