TF 2.11 CNN training with 20k images on an NVIDIA GeForce RTX 4090 GPU is running too slow - tensorflow

I am running TF 2.11 in a conda environment on a Linux x86_64 system. I just got a workstation with an NVIDIA GeForce RTX 4090 24GB GPU. I'd like to perform CNN image classification. My dataset contains 20k images: 14k for training, 3k for validation, and 3k for testing. The code also does hyperparameter tuning using the TensorBoard API, so I expect to run around 10k experiments. Each experiment trains for 300 epochs, and the batch size is one of 16, 32, or 64.
Previously, I ran a CNN on 2k images using the same logic and number of experiments, and it took about 2 weeks to finish everything. I expected it to run much faster now that I have upgraded from a GeForce RTX 2060 to a 4090, but that is not the case.
As you can see in the following pictures, the training does run on the GPU; the problem is that it runs very slowly. The first epoch (1/300), which consists of 450 steps, takes 2 to 2.5 hours before it moves on to 2/300. At that rate, the whole process could take months.
I am confused by the GPU utilization, but I am assuming that using 0.9 percent makes sense. I checked all updates and the CUDA setup, and they seem correct.
What do you think the issue could be? 20k images is not a huge dataset for this GPU. I tried running it from the terminal and from a Jupyter notebook, and the speed is the same. Could the tf.session command be causing issues? Do sessions need to be opened and closed in a specific way?
These are the parameters that need to be optimized:
EDIT: If I run it on the RTX 2060, it is much faster than on the Linux RTX 4090 workstation, and I have not figured out what the problem is. Finishing the first epoch (1/300) takes just 1.5 minutes there, versus about 1.5 hours on the Linux 4090 workstation!
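To put the two reported epoch times on a per-step basis, here is a small back-of-the-envelope calculation. It assumes both machines run the same 450 steps per epoch, which the question does not state explicitly for the RTX 2060 run.

```python
STEPS_PER_EPOCH = 450  # from the question; assumed identical on both machines


def seconds_per_step(epoch_seconds, steps=STEPS_PER_EPOCH):
    """Average wall-clock time of one training step."""
    return epoch_seconds / steps


rtx2060 = seconds_per_step(1.5 * 60)    # 1.5 minutes per epoch
rtx4090 = seconds_per_step(1.5 * 3600)  # 1.5 hours per epoch

print(f"RTX 2060: {rtx2060:.1f} s/step")
print(f"RTX 4090: {rtx4090:.1f} s/step")
print(f"slowdown: {rtx4090 / rtx2060:.0f}x")
```

Under that assumption, a step takes roughly 12 seconds on the new machine versus 0.2 seconds on the old one, which suggests looking outside the model arithmetic itself (data loading, device placement, drivers).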
GPU utilization before training:
[screenshot not included]
GPU utilization after starting training:
[screenshot not included]
How I generate the data:
train_data = train_datagen.flow_from_directory(directory=train_path,
    target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=True)
valid_data = valid_datagen.flow_from_directory(directory=valid_path,
    target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=False)
test_data = test_datagen.flow_from_directory(directory=test_path,
    target_size=(1000, 9), color_mode="grayscale", class_mode="categorical", shuffle=False)
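One way to narrow this down is to time the data generator on its own, with no model involved. The helper below is a minimal sketch and not part of the original code: `time_batches` is a hypothetical name, and it works on any Python iterable, so it could be pointed at `train_data` from the snippet above. If fetching a batch alone takes seconds, the input pipeline rather than the GPU is the likely bottleneck.

```python
import itertools
import time


def time_batches(batches, n_batches=20):
    """Pull up to n_batches items from an iterable; return mean seconds per item."""
    start = time.perf_counter()
    count = 0
    for _ in itertools.islice(batches, n_batches):
        count += 1
    elapsed = time.perf_counter() - start
    return elapsed / count if count else float("nan")


if __name__ == "__main__":
    # Stand-in generator so the sketch runs without TensorFlow;
    # in the real script you would pass train_data instead.
    fake_batches = (list(range(1000)) for _ in range(50))
    print(f"{time_batches(fake_batches):.6f} s per batch")
```

Comparing this figure with the per-step time seen during training gives a rough idea of how much of each step is spent waiting for data.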

Related

xgboost treemethod gpu-hist outperformed by hist using rtx3060ti and amd ryzen 9 5950x

I'm doing some hyper-parameter tuning, so speed is key. I've got a nice workstation with both an AMD Ryzen 9 5950x and an NVIDIA RTX3060ti 8GB.
Setup:
xgboost 1.5.1 installed from PyPI in an Anaconda environment.
NVIDIA graphics driver 471.68
CUDA 11.0
When training an XGBoost model using the scikit-learn API, I pass the tree_method = gpu_hist parameter, and I notice that it is consistently outperformed by tree_method = hist.
Somewhat surprisingly, this holds even when I open multiple consoles (I work in Spyder) and start an Optuna study in each of them, each using a different scikit-learn model, until my CPU usage is at 100%. When I then compare tree_method = gpu_hist with tree_method = hist, hist is still faster!
How is this possible? Are my drivers configured incorrectly? Is my dataset too small to benefit from tree_method = gpu_hist (7000 samples, 50 features, on a 3-class classification problem)? Or is the RTX 3060 Ti simply outclassed by the AMD Ryzen 9 5950X? Or none of the above?
Any help is highly appreciated :)
Edit #Ferdy:
I carried out this little experiment:
import time
import numpy as np
from xgboost import XGBClassifier

def fit_10_times(tree_method, X_train, y_train):
    # Fit a fresh model 10 times and record the wall-clock time of each fit.
    times = []
    for i in range(10):
        model = XGBClassifier(tree_method=tree_method)
        start = time.time()
        model.fit(X_train, y_train)
        times.append(time.time() - start)
    return times

cpu_times = fit_10_times('hist', X_train, y_train)
gpu_times = fit_10_times('gpu_hist', X_train, y_train)
print(X_train.describe())
print('mean cpu training times: ', np.mean(cpu_times), 'standard deviation :', np.std(cpu_times))
print('all training times :', cpu_times)
print('----------------------------------')
print('mean gpu training times: ', np.mean(gpu_times), 'standard deviation :', np.std(gpu_times))
print('all training times :', gpu_times)
Which yielded this output:
mean cpu training times: 0.5646213531494141 standard deviation : 0.010005875058323703
all training times : [0.5690040588378906, 0.5500047206878662, 0.5700047016143799, 0.563004732131958, 0.5570034980773926, 0.5486617088317871, 0.5630037784576416, 0.5680046081542969, 0.57651686668396, 0.5810048580169678]
----------------------------------
mean gpu training times: 2.0273998022079467 standard deviation : 0.05105794761358874
all training times : [2.0265607833862305, 2.0070691108703613, 1.9900789260864258, 1.9856727123260498, 1.9925382137298584, 2.0021069049835205, 2.1197071075439453, 2.1220884323120117, 2.0516715049743652, 1.9765043258666992]
The peak in CPU usage refers to the CPU training runs, and the peak in GPU usage the GPU training runs.
7000 samples is too few to fill the GPU pipeline, so your GPU is likely starved for work. We usually work with millions of samples when using GPU acceleration.
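To check this on your own hardware, you could repeat the timing at several dataset sizes and look for the point where the GPU starts to win. The sketch below is illustrative only: `time_fit` and `crossover` are hypothetical helpers, and `fit_fn` is any callable you supply that trains a model on `n` samples (for example, one that fits `XGBClassifier(tree_method='gpu_hist')` on the first `n` rows).

```python
import time


def time_fit(fit_fn, sizes, repeats=3):
    """Return {n_samples: best wall-clock seconds} for calls to fit_fn(n_samples)."""
    results = {}
    for n in sizes:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            fit_fn(n)
            timings.append(time.perf_counter() - start)
        results[n] = min(timings)
    return results


def crossover(cpu_times, gpu_times):
    """Smallest size at which the GPU timing beats the CPU timing, or None."""
    for n in sorted(cpu_times):
        if gpu_times[n] < cpu_times[n]:
            return n
    return None


if __name__ == "__main__":
    # Stand-in workload so the sketch runs anywhere; replace with real fits.
    sizes = [1_000, 10_000, 100_000]
    timings = time_fit(lambda n: sum(range(n)), sizes)
    for n, seconds in timings.items():
        print(n, f"{seconds:.6f} s")
```

Taking the minimum over a few repeats reduces the influence of one-off warm-up costs; whether a crossover exists at all for a given dataset shape is something only the measurement can tell.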

How to merge two or more trained weights?

I implemented 5x5 Gomoku using CNN + DQN.
Here is the GitHub link:
https://github.com/bokidigital/CNN_DQN_5x5_Gomoku
My problem is that this code has no parallelization design.
This means that when running it on an Intel Skylake server (2 CPUs, 80 cores), the CPU usage is only about 90%.
I think the ideal CPU usage would be 8000% (80 cores).
Because the game has some customized rules (beyond the neural network part, which uses about 75% of the GPU), that logic consumes CPU and is not parallelized.
My environment is:
Skylake CPU X 2
NVIDIA P100 X 2 ( only use 1 )
40GB RAM
Tensorflow 1.14.0
Keras
Python 3.7
Ubuntu 16.04
My idea is to run the program separately (many copies of the process in different folders, each generating its own weights) so that CPU usage could ideally reach 8000%, as long as enough processes run at the same time.
Since it is the training process, it doesn't matter how each process trained its weights.
Q1. The problem is how to merge their results (the trained weights). (A+B)/2?
Q2. It seems 1 GPU can only be used by 1 process. When I tried to run 3 processes at the same time, the GPU seemed to hang.
Q3. If I disable the GPU, will the 80-core Skylake be faster than the NVIDIA P100?
I expect that using more CPU will speed up this training process.
The 5x5 agent took 5 days to train. I tested the same code with the grid size changed to 9x9 and estimate that training would take 3 months.
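For reference, the "(A+B)/2" idea in Q1 amounts to an element-wise average of corresponding weights. The sketch below shows only that mechanical step; `average_weights` is a hypothetical helper, and nothing here implies that averaging independently trained networks yields a good agent. It assumes each weight set is a list of per-layer values in the same order, such as the lists Keras returns from `model.get_weights()`.

```python
def average_weights(weight_sets):
    """Element-wise mean of several weight sets.

    weight_sets: list of weight lists, one per trained copy. All copies
    must share the same architecture so that layers line up one to one.
    Entries may be floats or NumPy arrays (anything supporting + and /).
    """
    if not weight_sets:
        raise ValueError("need at least one weight set")
    n_layers = len(weight_sets[0])
    if any(len(ws) != n_layers for ws in weight_sets):
        raise ValueError("weight sets have different numbers of layers")
    return [sum(layer) / len(weight_sets) for layer in zip(*weight_sets)]


if __name__ == "__main__":
    a = [1.0, 2.0]  # toy "weights" of copy A
    b = [3.0, 4.0]  # toy "weights" of copy B
    print(average_weights([a, b]))  # [2.0, 3.0]
```

With Keras, the averaged list could then be loaded back into a model of the same architecture via `model.set_weights(...)`.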

tensorflow wide linear model inference on gpu slow

I am training a sparse logistic regression model in TensorFlow. This question is specifically about the inference part. I am trying to benchmark inference on CPU and GPU. I am using the Nvidia P100 GPU (4 dies) on my current GCE box. I am new to GPUs, so sorry for the naive questions.
The model is pretty big, ~54k operations (is that considered big compared to DNN or ImageNet models?). When I log device placement, I only see gpu:0 being used; the rest are unused. I don't do any device placement during training, but during inference I want ops to be placed optimally and to use the GPUs.
A few things I observed: my input node placeholder (feed_dict) is placed on the CPU, so I assume my data is being copied from CPU to GPU? How exactly does feed_dict work behind the scenes?
1) How can I place the data I want to run prediction on directly on the GPU? Note: my training runs on distributed CPUs with multiple terabytes of data, so I cannot have a constant or variable directly in my graph during training, but for inference I can definitely have small batches of data that I would like to place directly on the GPU. Are there ways I can achieve this?
2) Since I am using a P100 GPU, which I think has unified memory with the host, is it possible to have zero-copy and load my data directly into the GPU? How can I do this from Python, Java, and C++ code? Currently I use feed_dict, which, from various sources found via Google, I think is not at all optimal.
3) Is there some tool or profiler I can use to see, when I profile code like this:
for epoch_step in epochs:
    start_time = time.time()
    for i in range(epoch_step):
        result = session.run(output, feed_dict={input_example: records_batch})
    end_time = time.time()
    print("Batch {} epochs {} :time {}".format(batch_size, epoch_step, str(end_time - start_time)))
how much time is being spent on 1) CPU-to-GPU data transfer, 2) session run overhead, 3) GPU utilization (currently I use nvidia-smi periodically to monitor it), and
4) kernel call overhead on CPU vs. GPU? (I assume each invocation of sess.run invokes 1 kernel call, right?)
My current benchmarking results:
CPU :
Batch size : 10
NumberEpochs TimeGPU TimeCPU
10 5.473 0.484
20 11.673 0.963
40 22.716 1.922
100 56.998 4.822
200 113.483 9.773
Batch size : 100
NumberEpochs TimeGPU TimeCPU
10 5.904 0.507
20 11.708 1.004
40 23.046 1.952
100 58.493 4.989
200 118.272 9.912
Batch size : 1000
NumberEpochs TimeGPU TimeCPU
10 5.986 0.653
20 12.020 1.261
40 23.887 2.530
100 59.598 6.312
200 118.561 12.518
Batch size : 10k
NumberEpochs TimeGPU TimeCPU
10 7.542 0.969
20 14.764 1.923
40 29.308 3.838
100 72.588 9.822
200 146.156 19.542
Batch size : 100k
NumberEpochs TimeGPU TimeCPU
10 11.285 9.613
20 22.680 18.652
40 44.065 35.727
100 112.604 86.960
200 225.377 174.652
Batch size : 200k
NumberEpochs TimeGPU TimeCPU
10 19.306 21.587
20 38.918 41.346
40 78.730 81.456
100 191.367 202.523
200 387.704 419.223
Some notable observations:
As the batch size increases, I see my GPU utilization increase (it reaches 100% for the only GPU it uses; is there a way I can tell TF to use the other GPUs too?).
Batch size 200k is the only case where my naive benchmark shows the GPU having a minor gain over the CPU.
Increasing the batch size for a given number of epochs has minimal effect on both CPU and GPU time as long as the batch size is <= 10k. For example, at 10 epochs with batch sizes 10, 100, 1k, and 10k, the times remain fairly stable: ~5-7 sec for GPU and 0.48-0.96 sec for CPU (meaning that sess.run has much higher overhead than computing the graph itself?). Beyond that, going from 10k -> 100k -> 200k, the compute time increases much faster: at 10 epochs, 100k -> 200k raises GPU time from 11 -> 19 sec, and CPU time also doubles. Why? It seems that for larger batch sizes, even though I call sess.run only once, internally it splits the batch into smaller batches and calls sess.run twice, because 20 epochs at batch size 100k closely matches 10 epochs at batch size 200k.
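The near-constant cost per sess.run call can be read directly off the batch-size-10 table by dividing each total time by the number of calls. This is only arithmetic on the figures already reported; it assumes the times are in seconds and that NumberEpochs is the number of sess.run calls, as in the profiling loop.

```python
# Batch size 10 results from the table above: (runs, gpu_total, cpu_total)
rows = [
    (10, 5.473, 0.484),
    (20, 11.673, 0.963),
    (40, 22.716, 1.922),
    (100, 56.998, 4.822),
    (200, 113.483, 9.773),
]

# Average time of a single sess.run call for each row.
per_call = [(runs, gpu / runs, cpu / runs) for runs, gpu, cpu in rows]

for runs, gpu, cpu in per_call:
    print(f"{runs:>4} runs: GPU {gpu:.3f} s/call, CPU {cpu:.3f} s/call")
```

Each call costs roughly 0.55 to 0.58 s on the GPU and about 0.05 s on the CPU regardless of how many calls are made, which is consistent with a fixed per-call cost dominating at small batch sizes.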
How can I improve my inference further? I believe I am not using all GPUs optimally.
Are there any ideas on how I can benchmark better, to get a better breakdown of the time for CPU -> GPU transfer and the actual speedup in graph computation from moving from CPU to GPU?
Can data be loaded directly into the GPU, zero-copy if possible?
Can I place some nodes on the GPU only during inference to get better performance?
Any ideas around quantization or optimizing the inference graph?
Any more ideas to improve GPU-based inference? Maybe XLA-based optimization or TensorRT? I want high-performance inference code that runs these computations on the GPU while the application server crunches on the CPU.
One source of information is the TensorFlow docs on performance, including Optimizing for GPU and High Performance Models.
That said, those guides tend to target training more than batch inference, though certainly some of the principles still apply.
I will note that, unless you are using DistributionStrategy, TensorFlow will not automatically put ops on more than a single GPU (source).
In your particular case, I don't believe GPUs are yet well-tuned for the type of sparse operation your model requires, so I don't actually expect it to do that well on a GPU (if you log the device placement, there's a chance the lookup is done on the CPU). A logistic regression model has only a (sparse) input layer and an output layer, so generally there are very few math ops. GPUs excel most when they are doing lots of matrix multiplies, convolutions, etc.
Finally, I would encourage you to use TensorRT to optimize your graph, though for your particular model there's no guarantee it does much better.

Google TensorFlow based seq2seq model crashes while training

I have been trying to use Google's RNN based seq2seq model.
I have been training a model for text summarization and am feeding in approximately 1GB of text data. The model quickly fills up my entire RAM (8GB), then starts filling the swap space (a further 8GB) and crashes, after which I have to do a hard shutdown.
The configuration of my LSTM network is as follows:
model: AttentionSeq2Seq
model_params:
  attention.class: seq2seq.decoders.attention.AttentionLayerDot
  attention.params:
    num_units: 128
  bridge.class: seq2seq.models.bridges.ZeroBridge
  embedding.dim: 128
  encoder.class: seq2seq.encoders.BidirectionalRNNEncoder
  encoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  decoder.class: seq2seq.decoders.AttentionDecoder
  decoder.params:
    rnn_cell:
      cell_class: GRUCell
      cell_params:
        num_units: 128
      dropout_input_keep_prob: 0.8
      dropout_output_keep_prob: 1.0
      num_layers: 1
  optimizer.name: Adam
  optimizer.params:
    epsilon: 0.0000008
  optimizer.learning_rate: 0.0001
  source.max_seq_len: 50
  source.reverse: false
  target.max_seq_len: 50
I tried decreasing the batch size from 32 to 16, but it still did not help. What specific changes should I make in order to prevent my model from taking up the entirety of RAM and crashing? (Like decreasing data size, decreasing number of stacked LSTM cells, further decreasing batch size etc)
My system runs Python 2.7x, TensorFlow version 1.1.0, and CUDA 8.0. The system has an Nvidia Geforce GTX-1050Ti(768 CUDA cores) with 4GB of memory, and the system has 8GB of RAM and a further 8GB of swap memory.
Your model looks pretty small. The only thing that is kind of big is the training data. Please check that your get_batch() function has no bugs. If there is a bug there, you may actually be loading the whole data set for every batch.
To quickly check this, cut your training data down to something very small (such as 1/10 of the current size) and see if that helps. Note that it should not help, because you are using mini-batches. But if that resolves the problem, fix your get_batch() function.
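For comparison, a batching function that does not hold the whole dataset in memory can be sketched as a generator. This is not the `get_batch()` from the seq2seq code base; `iter_batches` is a hypothetical stand-in that reads a line-oriented text source lazily, so only one batch is resident at a time.

```python
import io
import itertools


def iter_batches(lines, batch_size):
    """Yield lists of at most batch_size stripped lines, reading lazily."""
    it = iter(lines)
    while True:
        batch = [line.rstrip("\n") for line in itertools.islice(it, batch_size)]
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # io.StringIO stands in for an open text file such as open(path).
    source = io.StringIO("a\nb\nc\nd\ne\n")
    for batch in iter_batches(source, batch_size=2):
        print(batch)
```

If memory use stays flat with a loader shaped like this but grows with the real one, that points at the batching code rather than the model.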

Tensorflow batching is very slow

I tried to set up a very simple MNIST example with an Estimator.
First I used the estimator's deprecated fit() parameters x, y and batch_size. This executed very fast and utilized about 100% of my GPU while not affecting the CPU much (about 10% utilization). So it worked as expected.
Because the x, y and batch_size parameters are deprecated, I wanted to use the input_fn parameter for the fit() function. To build the input_fn, I used a tf.slice_input_producer and batched it with tf.train.batch. This is my code: https://gist.github.com/andreas-eberle/11f650fca0dce4c9d3d6c0955145e80d. You should be able to just run it with TensorFlow 1.0.
My problem is that the training now runs very slowly and only utilizes about 30% of my GPU (as shown in nvidia-smi).
I also tried increasing the queue capacity of the slice_input_producer and the number of threads used for batching. However, this only got me to about 45% GPU utilization and resulted in 100% CPU utilization.
What am I doing wrong? Is there a better way for feeding the inputs and batching them? I do not want to create the batches manually (creating subarrays of the numpy input array) because I want to use this example for a more complex input queue where I'll be reading and preprocessing the images in the graph.
I don't think my hardware should be the problem:
Windows 10
NVidia GTX 960M
i7-6700HQ
32 GB RAM