How to solve not enough GPU memory when training a big model in TensorFlow? - tensorflow

I am running the LSTM demo in TensorFlow.
Cell output size: 461*461*4*120 ≈ 100MB (120 hidden nodes)
Softmax output size: 461*461*4*256 ≈ 200MB
But running this demo on an Nvidia 960 (4GB memory) exhausts all the GPU memory. Why?
With 1000 hidden nodes, it is hard to make this work even with two GPUs (Nvidia 1080). How can I solve this?
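The two size estimates above can be reproduced with a short sketch. It assumes the factor 4 in the expressions is the byte size of a float32 element and that MB means 10**6 bytes; tensor_megabytes is a hypothetical helper, not a TensorFlow function.

```python
def tensor_megabytes(*dims, bytes_per_element=4):
    """Size in MB (10**6 bytes) of one dense tensor.

    bytes_per_element=4 assumes float32 (an assumption, not stated in the post).
    """
    count = 1
    for d in dims:
        count *= d
    return count * bytes_per_element / 1e6


# Shapes taken from the question: 461 x 461 x hidden (or output) size.
cell_mb = tensor_megabytes(461, 461, 120)
softmax_mb = tensor_megabytes(461, 461, 256)
print(round(cell_mb), round(softmax_mb))
```

These figures count a single copy of a single tensor. Training usually also keeps gradients and the activations of every time step for backpropagation, so the real footprint can be many times larger; how much larger depends on the sequence length and the graph, which the post does not give.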


TF 2.11 CNN training with 20k images on an NVIDIA GeForce RTX 4090 GPU runs too slowly

I have a Linux x86_64 operating system and am running TF 2.11 in a conda environment. I just got a workstation with an NVIDIA GeForce RTX 4090 24GB GPU. I'd like to perform CNN image classification; my dataset contains 20k images, 14k of which are for training, 3k for validation, and 3k for testing. The code also does hyperparameter tuning using the TensorBoard API, so I expect to run around 10k experiments. The number of epochs is 300. The batch size is one of 16, 32, or 64.
Before, I was running a CNN on 2k images using the same logic and number of experiments, and honestly it took about 2 weeks to finish everything. I expected it to run much faster now that I have upgraded from a GeForce 2060 to a 4090, but that is not the case.
As you can see in the following pictures, it does run on the GPU; the problem is that it runs very slowly. Finishing the first epoch (1/300), which includes 450 substeps, takes 2 to 2.5 hours before it moves on to 2/300. This is incredible. It means the whole process could take months.
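To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes a batch size of 32 (one of the listed values, and the one that roughly matches the 450 substeps) and the lower end of the reported epoch time; the variable names are only illustrative.

```python
import math

train_images = 14000     # training images, from the post
batch_size = 32          # assumed: one of the batch sizes listed in the post
epochs = 300             # from the post
hours_per_epoch = 2.0    # lower end of the 2 to 2.5 hours reported

steps_per_epoch = math.ceil(train_images / batch_size)
seconds_per_step = hours_per_epoch * 3600 / steps_per_epoch
days_per_run = hours_per_epoch * epochs / 24

print(steps_per_epoch, round(seconds_per_step, 1), days_per_run)
```

Under these assumptions a single step takes on the order of 16 seconds and one 300-epoch run takes about 25 days.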
I am confused about the GPU utilization, but I assume that using 0.9 percent makes sense. I checked all the updates and the CUDA setup, and they seem correct.
What do you think the issue could be? 20k images is not a huge dataset for this GPU. I tried running it from the terminal and from a Jupyter notebook, and the result is the same. Could this tf.session command be causing issues? Do sessions need to be opened and closed in a specific way?
These are my parameters that need to be optimized:
EDIT: if I run it on the RTX 2060, it is definitely much faster than on the Linux RTX 4090 workstation, and I have not figured out what the problem is. Finishing the first run (1/300) takes just 1.5 minutes there, versus about 1.5 hours on the Linux 4090 workstation!
How I generate the data:
train_data = train_datagen.flow_from_directory(
    directory=train_path, target_size=(1000, 9), color_mode="grayscale",
    class_mode="categorical", shuffle=True)
valid_data = valid_datagen.flow_from_directory(
    directory=valid_path, target_size=(1000, 9), color_mode="grayscale",
    class_mode="categorical", shuffle=False)
test_data = test_datagen.flow_from_directory(
    directory=test_path, target_size=(1000, 9), color_mode="grayscale",
    class_mode="categorical", shuffle=False)

Google TensorFlow based seq2seq model crashes while training

I have been trying to use Google's RNN-based seq2seq model.
I am training a model for text summarization and feeding in approximately 1GB of text data. The model quickly fills up my entire RAM (8GB), then starts filling up the swap memory (a further 8GB), and crashes, after which I have to do a hard shutdown.
The configuration of my LSTM network is as follows:
model: AttentionSeq2Seq
attention.class: seq2seq.decoders.attention.AttentionLayerDot
num_units: 128
bridge.class: seq2seq.models.bridges.ZeroBridge
embedding.dim: 128
encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
cell_class: GRUCell
num_units: 128
dropout_input_keep_prob: 0.8
dropout_output_keep_prob: 1.0
num_layers: 1
decoder.class: seq2seq.decoders.AttentionDecoder
cell_class: GRUCell
num_units: 128
dropout_input_keep_prob: 0.8
dropout_output_keep_prob: 1.0
num_layers: 1
optimizer.name: Adam
epsilon: 0.0000008
optimizer.learning_rate: 0.0001
source.max_seq_len: 50
source.reverse: false
target.max_seq_len: 50
I tried decreasing the batch size from 32 to 16, but it did not help. What specific changes should I make to prevent my model from taking up all of the RAM and crashing (for example, decreasing the data size, decreasing the number of stacked LSTM cells, or further decreasing the batch size)?
My system runs Python 2.7.x, TensorFlow 1.1.0, and CUDA 8.0. It has an Nvidia GeForce GTX 1050 Ti (768 CUDA cores) with 4GB of memory, 8GB of RAM, and a further 8GB of swap memory.
Your model looks pretty small. The only thing that is fairly big is the training data. Please check that your get_batch() function has no bugs. If there is a bug there, it is possible that you are actually loading the whole data set for every batch.
To check this quickly, cut your training data down to something very small (such as 1/10 of the current size) and see if that helps. Note that it should not help, because you are using mini-batches. But if it does resolve the problem, fix your get_batch() function.
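For comparison, a batching function that never holds more than one batch in memory could look like the sketch below. This is not the asker's code or the seq2seq library's input pipeline; the signature of get_batch and the stand-in corpus are assumptions made for illustration.

```python
from itertools import islice


def get_batch(lines, batch_size):
    """Yield lists of at most batch_size items, reading lazily from `lines`.

    `lines` can be any iterable, for example an open text file, so only one
    batch is materialized at a time.
    """
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for a large text file; a real file object would be iterated
# line by line in the same way.
fake_corpus = ("sentence %d" % i for i in range(10))
batches = list(get_batch(fake_corpus, batch_size=4))
print([len(b) for b in batches])
```

If memory use still grows with the size of the data set when batches are produced this way, the cause is more likely somewhere else in the pipeline.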

Optimizer is too large for GPU

I'm building a neural network consisting of 2 hidden FC layers, where the weights are tensors of shape [19000,19000] (I cannot make them smaller).
When I run the whole network on the CPU and only the optimizer (Adagrad) on the GPU, I'm fine (the allocated memory is 8990B). Since I also need the weights to be clipped, I apply tf.clip_by_value() to the weights (not the biases) in both layers. With that change, the allocated memory exceeds the 12GB of dedicated GPU memory.
Is there a way to split the optimizer over multiple GPUs, or over the CPU and a GPU?
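A rough estimate shows how quickly matrices of this size add up. The sketch assumes float32 weights and GB meaning 10**9 bytes, and it assumes that four same-shaped tensors per layer are alive on the GPU at once (the weights, their gradient, the Adagrad accumulator, and the clipped output of tf.clip_by_value). That count is an assumption for illustration, not a measurement of this graph.

```python
def gigabytes(num_elements, bytes_per_element=4):
    """Size in GB (10**9 bytes); 4 bytes per element assumes float32."""
    return num_elements * bytes_per_element / 1e9


elements_per_matrix = 19000 * 19000
one_matrix_gb = gigabytes(elements_per_matrix)

layers = 2
# Assumed live copies per layer: weights, gradient, Adagrad accumulator,
# and the clipped tensor produced by tf.clip_by_value.
copies_per_layer = 4
total_gb = layers * copies_per_layer * one_matrix_gb

print(round(one_matrix_gb, 3), round(total_gb, 3))
```

Under these assumptions a single matrix takes about 1.4 GB and the eight tensors together take about 11.6 GB, which leaves very little room on a 12GB card.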

Tensorflow batching is very slow

I tried to setup a very simple Mnist example with an Estimator.
First I used the estimator's deprecated fit() parameters x, y and batch_size. This executed very fast and utilized about 100% of my GPU while not affecting the CPU much (about 10% utilization). So it worked as expected.
Because the x, y and batch_size parameters are deprecated, I wanted to use the input_fn parameter of the fit() function instead. To build the input_fn, I used tf.slice_input_producer and batched its output with tf.train.batch. This is my code. You should be able to run it as-is with TensorFlow 1.0.
My problem is that the training now runs very slowly and only utilizes about 30% of my GPU (as shown in nvidia-smi).
I also tried increasing the queue capacity of the slice_input_producer and the number of threads used for batching. However, this only got GPU utilization up to about 45% and resulted in 100% CPU utilization.
What am I doing wrong? Is there a better way for feeding the inputs and batching them? I do not want to create the batches manually (creating subarrays of the numpy input array) because I want to use this example for a more complex input queue where I'll be reading and preprocessing the images in the graph.
I don't think my hardware should be the problem:
Windows 10
NVidia GTX 960M

GPU + CPU Tensorflow Training

I have a network in which one of the parameters is a large embedding matrix (3 million x 300), say embed_mat.
During training, for each mini-batch, I only update a small subset of the vectors in embed_mat (at most 15000 vectors), which are chosen using the embedding_lookup op. I am using the Adam optimizer to train my model.
As I cannot store embed_mat on the GPU due to its size, I define it under a CPU device (say /cpu:0), but the rest of the model's parameters, the optimizer, etc. are defined under a GPU device (say /gpu:0).
I see that my GPU usage is very minimal (200 MB), which suggests all my training is happening on the CPU. What I expected was that the result of the embedding_lookup would be copied to the GPU and all my training would happen there. Am I doing something wrong?
The training time is strongly affected by the size (num_vectors) of the embedding matrix, which doesn't seem correct to me. In any mini-batch, I only update my network parameters and the vectors I looked up (~15000), so the training time should, if anything, grow sub-linearly with the size of the embedding matrix.
Is there a way to automatically and seamlessly split embed_mat across multiple GPUs for faster training?
I suspect the Adam optimizer is responsible. It looks like all training is happening on the CPU because embed_mat is on the CPU. Is this correct?
Try visualizing in TensorBoard where each of your ops is placed. In the "graph" tab you can color by "device". Ideally the embedding variable, the embedding lookup, and the embedding gradient update should be on the CPU, while most other things should be on the GPU.
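The expectation in the question, that the cost of a step should depend on the number of looked-up rows rather than on the size of the table, can be illustrated with a plain-Python sketch. It assumes a plain SGD update on a list-of-lists table; it is not TensorFlow's implementation and it is not Adam, and sparse_sgd_step is a hypothetical name.

```python
def sparse_sgd_step(table, row_ids, row_grads, lr):
    """Update only the looked-up rows of `table` in place.

    Returns the number of rows touched, which depends on len(row_ids)
    and not on len(table).
    """
    touched = 0
    for row_id, grad in zip(row_ids, row_grads):
        row = table[row_id]
        for j, g in enumerate(grad):
            row[j] -= lr * g
        touched += 1
    return touched


num_vectors, dim = 1000, 3  # tiny stand-in for 3 million x 300
table = [[0.0] * dim for _ in range(num_vectors)]
touched = sparse_sgd_step(
    table, [2, 7], [[1.0, 1.0, 1.0], [2.0, 0.0, 0.0]], lr=0.5)
print(touched, table[2], table[7])
```

An optimizer that keeps extra state per parameter and refreshes that state for every row on every step would instead do work proportional to num_vectors. Whether that is what happens with Adam in this setup is an assumption worth checking, for example with the device and op view suggested above.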