Optimising GPU use for Keras model training - tensorflow

I'm training a Keras model. During training, I'm only utilising between 5 and 20% of my CUDA cores and an equally small proportion of my NVIDIA RTX 2070's memory. Model training is currently pretty slow, and I would really like to take advantage of as many of my available CUDA cores as possible to speed it up!
nvidia-smi dmon # (during model training)
# gpu pwr gtemp mtemp sm mem enc dec mclk pclk
# Idx W C C % % % % MHz MHz
0 45 49 - 9 6 0 0 6801 1605
What parameters should I look to tune in order to increase CUDA core utilisation with the aim of training the same model faster?
Here's a simplified example of my current image generation and training steps (I can elaborate / edit, if required, but I currently believe these are the key steps for the purpose of the question):
train_datagen = ImageDataGenerator(rescale=1./255)
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    r'./input_training_examples',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)
validation_generator = test_datagen.flow_from_directory(
    r'./input_validation_examples',
    target_size=(150, 150),
    batch_size=32,
    class_mode='binary'
)

history = model.fit(
    train_generator,
    steps_per_epoch=128, epochs=30,
    validation_data=validation_generator, validation_steps=50,
)
Hardware: NVIDIA 2070 GPU
Platform: Linux 5.4.0-29-generic #33-Ubuntu x86_64, NVIDIA driver 440.64, CUDA 10.2, Tensorflow 2.2.0-rc3

GPU utilization is a tricky business; too many factors are involved.
The first thing to try is obvious: increase the batch size.
That alone doesn't guarantee maximum utilization, though. Your I/O may be slow, making the data generator a bottleneck.
You can try loading the full dataset into memory as a NumPy array if you have enough RAM.
You can try increasing the number of workers in the multiprocessing scheme:
model.fit(..., use_multiprocessing=True, workers=8)
Finally, it depends on your model: if the model is too light and not deep enough, utilization will stay low and there is no standard way to improve it further.
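To make the I/O point concrete, here is a minimal sketch, assuming TF >= 2.3 for image_dataset_from_directory (with TF 2.2 you would build the dataset by hand from file paths), the same directory layout, and a model that is already defined and compiled as in the question. It replaces the ImageDataGenerator with a tf.data pipeline that prefetches batches, together with a larger batch size:

import tensorflow as tf

BATCH_SIZE = 128          # raise this until you approach the GPU memory limit
IMG_SIZE = (150, 150)

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
    './input_training_examples',
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    label_mode='binary',
)
val_ds = tf.keras.preprocessing.image_dataset_from_directory(
    './input_validation_examples',
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE,
    label_mode='binary',
)

# Rescale inside the tf.data pipeline and keep the input queue full
rescale = lambda x, y: (x / 255.0, y)
train_ds = (train_ds.map(rescale, num_parallel_calls=tf.data.experimental.AUTOTUNE)
                    .prefetch(tf.data.experimental.AUTOTUNE))
val_ds = (val_ds.map(rescale, num_parallel_calls=tf.data.experimental.AUTOTUNE)
                .prefetch(tf.data.experimental.AUTOTUNE))

history = model.fit(train_ds, epochs=30, validation_data=val_ds)

If the sm column in nvidia-smi dmon climbs after this change, the input pipeline was the bottleneck; if it stays low, the model itself is probably too small to saturate the GPU.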

Related

xgboost tree_method gpu_hist outperformed by hist using RTX 3060 Ti and AMD Ryzen 9 5950X

I'm doing some hyper-parameter tuning, so speed is key. I've got a nice workstation with both an AMD Ryzen 9 5950x and an NVIDIA RTX3060ti 8GB.
Setup:
xgboost 1.5.1 using PyPi in an anaconda environment.
NVIDIA graphics driver 471.68
CUDA 11.0
When training an xgboost model using the scikit-learn API, I pass the tree_method = gpu_hist parameter, and I notice that it is consistently outperformed by the default tree_method = hist.
Somewhat surprisingly, this holds even when I open multiple consoles (I work in Spyder) and start an Optuna study in each of them, each using a different scikit-learn model, until my CPU usage is at 100%. Even then, tree_method = hist is still faster than tree_method = gpu_hist!
How is this possible? Do I have my drivers configured incorrectly? Is my dataset too small to benefit from tree_method = gpu_hist (7,000 samples, 50 features, a 3-class classification problem)? Or is the RTX 3060 Ti simply outclassed by the AMD Ryzen 9 5950X? Or none of the above?
Any help is highly appreciated :)
Edit @Ferdy:
I carried out this little experiment:
import time
import numpy as np
from xgboost import XGBClassifier

def fit_10_times(tree_method, X_train, y_train):
    times = []
    for i in range(10):
        model = XGBClassifier(tree_method=tree_method)
        start = time.time()
        model.fit(X_train, y_train)
        times.append(time.time() - start)
    return times

cpu_times = fit_10_times('hist', X_train, y_train)
gpu_times = fit_10_times('gpu_hist', X_train, y_train)

print(X_train.describe())
print('mean cpu training times: ', np.mean(cpu_times), 'standard deviation :', np.std(cpu_times))
print('all training times :', cpu_times)
print('----------------------------------')
print('mean gpu training times: ', np.mean(gpu_times), 'standard deviation :', np.std(gpu_times))
print('all training times :', gpu_times)
Which yielded this output:
mean cpu training times: 0.5646213531494141 standard deviation : 0.010005875058323703
all training times : [0.5690040588378906, 0.5500047206878662, 0.5700047016143799, 0.563004732131958, 0.5570034980773926, 0.5486617088317871, 0.5630037784576416, 0.5680046081542969, 0.57651686668396, 0.5810048580169678]
----------------------------------
mean gpu training times: 2.0273998022079467 standard deviation : 0.05105794761358874
all training times : [2.0265607833862305, 2.0070691108703613, 1.9900789260864258, 1.9856727123260498, 1.9925382137298584, 2.0021069049835205, 2.1197071075439453, 2.1220884323120117, 2.0516715049743652, 1.9765043258666992]
The peak in CPU usage corresponds to the CPU training runs, and the peak in GPU usage to the GPU training runs.
7,000 samples is too small to fill the GPU pipeline; your GPU is likely starving. We usually work with millions of samples when using GPU acceleration.
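If you want to verify this on your own machine, here is a rough sketch (the dataset sizes are hypothetical, and sklearn's make_classification is assumed to be available) that repeats the timing on a much larger synthetic dataset to see whether gpu_hist eventually overtakes hist:

import time
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Compare a small dataset (like the original 7,000 x 50) against a much larger one
for n_samples in (7_000, 1_000_000):
    X, y = make_classification(n_samples=n_samples, n_features=50,
                               n_informative=20, n_classes=3, random_state=0)
    for tree_method in ('hist', 'gpu_hist'):
        model = XGBClassifier(tree_method=tree_method,
                              use_label_encoder=False, eval_metric='mlogloss')
        start = time.time()
        model.fit(X, y)
        print(f'{n_samples} samples, {tree_method}: {time.time() - start:.2f}s')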

What may be the cause of pretty slow training speed for CNN (transfer learning)?

I use a GPU to train a model with transfer learning from Inception v3.
Weights='imagenet'. The convolutional base is frozen, and dense layers on top are used for 10-class classification (MNIST digit recognition).
The code is the following:
import tensorflow as tf
from keras.preprocessing.image import ImageDataGenerator
from keras.applications import InceptionV3
from keras.models import Sequential
from keras.layers import Flatten, Dense, BatchNormalization, Dropout
from keras.optimizers import RMSprop

datagen = ImageDataGenerator(
    #rescale=1./255,
    preprocessing_function=tf.keras.applications.inception_v3.preprocess_input,
    featurewise_center=False,             # set input mean to 0 over the dataset
    samplewise_center=False,              # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,   # divide each input by its std
    zca_whitening=False,                  # apply ZCA whitening
    rotation_range=10,                    # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.1,                       # randomly zoom image
    width_shift_range=0.1,                # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,               # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,                # randomly flip images
    vertical_flip=False)

train_generator = datagen.flow_from_directory(
    train_path,
    target_size=(224, 224),
    color_mode="rgb",
    class_mode="categorical",
    batch_size=86,
    interpolation="bilinear",
)
test_generator = datagen.flow_from_directory(
    test_path,
    target_size=(224, 224),
    color_mode="rgb",
    class_mode="categorical",
    batch_size=86,
    interpolation="bilinear",
)

# Instantiate the pre-trained InceptionV3 convolutional base
conv_base = InceptionV3(weights='imagenet',
                        include_top=False,
                        input_shape=(224, 224, 3))  # 3 = number of channels in RGB pictures

# Forbid training of the conv part
conv_base.trainable = False

# Build the model
model = Sequential()
model.add(conv_base)
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# Define the optimizer
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

# Compile the model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=['accuracy'])

history = model.fit_generator(train_generator,
                              epochs=1, validation_data=test_generator,
                              verbose=2, steps_per_epoch=60000 // 86)
                              #, callbacks=[learning_rate_reduction])
The obtained training rate was 1 epoch/hour (even after reducing lr to 0.001) when I used rescale=1./255 in the data generator.
After searching for answers, I found that the cause may be an inappropriate input format.
When I tried to use preprocessing_function=tf.keras.applications.inception_v3.preprocess_input instead,
I received this message after 30 min of training:
Epoch 1/1
/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py:616: UserWarning: The input 1449 could not be retrieved. It could be because a worker has died.
UserWarning)
/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py:616: UserWarning: The input 614 could not be retrieved. It could be because a worker has died.
UserWarning)
What is wrong with the model?
Thanks in advance.
The learning rate doesn't affect the training rate.
How fast you can train the model depends on your GPU, your CPU, and the IO speed of your drive, ceteris paribus.
First, check if your gpu is used for the training.
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
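That helper belongs to standalone Keras with the TF 1.x backend; if you are on TF 2.x / tf.keras, an equivalent check would be:

import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))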
Next, is 32 the maximum batch_size your GPU can handle? Try increasing the batch_size until you get an OOM error.
Or you could monitor your GPU and CPU usage: if neither is maxed out, training is probably limited by your drive's IO speed.
Nothing is wrong with the model.
To increase the speed of the epochs, try the following:
Switch on XLA.
import tensorflow as tf
tf.config.optimizer.set_jit(True)
Use mixed precision
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
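One caveat worth noting, per the TensorFlow mixed-precision guide: under the mixed_float16 policy the final softmax/output layer should produce float32 activations for numerical stability. For the model above that would mean, for example:

# Keep the classification output in float32 while the rest of the model runs in float16
model.add(Dense(10, activation='softmax', dtype='float32'))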

Training Word2Vec is extremely slow

I'm currently trying to train a Word2Vec skip-gram model with negative sampling using keras from tensorflow 2.0 based on this tutorial (without the auxiliary output). My model looks like this:
import numpy as np
from tensorflow import keras

# vocab_size = about 1 million
# embedding_dim = 300
# Target word and context word as integers
input_target = keras.Input((1,))
input_context = keras.Input((1,))
# Retrieve word embeddings of target and context word
embeddings = keras.layers.Embedding(vocab_size + 1, embedding_dim, input_length=1, name="embeddings")
embedding_target = embeddings(input_target)
embedding_context = embeddings(input_context)
# Reshape embeddings
reshape = keras.layers.Reshape((embedding_dim, 1))
embedding_target = reshape(embedding_target)
embedding_context = reshape(embedding_context)
# Compute dot product of embedding vectors
dot = keras.layers.Dot(1)([embedding_target, embedding_context])
dot = keras.layers.Reshape((1,))(dot)
# Compute output
output = keras.layers.Dense(1, activation="sigmoid")(dot)
# Create the model
model = keras.Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss="binary_crossentropy", optimizer="rmsprop")
The data set I'm using is the English Wikipedia data set provided by TensorFlow. However, I do not train on the full data set (5,824,596 articles), just on 3,000,000 articles. As you can imagine, the number of unique words (~1 million) and the total number of words (~1.2 billion) are still huge. However, the authors of the original Word2Vec paper state in the abstract that learning high-quality word vectors from a corpus of 1.6 billion words takes less than a day.
My implementation needs 3 seconds per article to train on average. This looks really slow to me as it would take more than 100 days to train on all 3 million articles once.
I have an Nvidia GTX 1080 TI, and nvidia-smi shows that the GPU utilization is at 0% most of the time with short "peaks" to about 5%.
My first thought was that the preprocessing steps are too slow. So I tested the training with dummy-data generated with numpy:
target = np.random.randint(0, vocab_size, 2048 * 1000, np.int32)
context = np.random.randint(0, vocab_size, 2048 * 1000, np.int32)
label = np.random.randint(0, 2, 2048 * 1000, np.int32)
model.fit([target, context], label, 4096)
I tested different values for batch_size (32, 1024, 2048, 4096), but the GPU utilization was still at 0% most of the time. The whole target, context and label arrays fit into memory.
When I changed the fit method to train_on_batch (with no batching; I passed the whole arrays into the function), the GPU utilization at least varied between 0% and almost 40%.
Besides, tensorflow-gpu is installed and tensorflow does recognize the GPU: 2020-03-13 15:44:52.424589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10349 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:b1:00.0, compute capability: 6.1)
Am I missing something obvious? Could be possible as I'm new to keras/tensorflow and to machine learning in general.

How to train a model on multiple GPUs with TensorFlow 2 and Keras?

I have an LSTM model that I want to train on multiple GPUs. I transformed the code to do this, and in nvidia-smi I could see that it is using all the memory of all the GPUs and each of the GPUs is utilized at around 40%, BUT the estimated training time per batch was almost the same as with 1 GPU.
Can someone please guide me and tell me how I can train properly on multiple GPUs?
My code:
import os
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint_path = "./model/"
checkpoint_dir = os.path.dirname(checkpoint_path)
cp_callback = ModelCheckpoint(filepath=checkpoint_path, save_freq='epoch', verbose=1)

# NNET - LSTM
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    regressor = Sequential()
    regressor.add(LSTM(units=180, return_sequences=True, input_shape=(X_train.shape[1], 3)))
    regressor.add(Dropout(0.2))
    regressor.add(LSTM(units=180, return_sequences=True))
    regressor.add(Dropout(0.2))
    regressor.add(LSTM(units=180))
    regressor.add(Dropout(0.2))
    regressor.add(Dense(units=4))
    regressor.compile(optimizer='adam', loss='mean_squared_error')

regressor.fit(X_train, y_train, epochs=10, batch_size=32, callbacks=[cp_callback])
Assume that your batch_size for a single GPU is N and the time taken per batch is X seconds.
You can measure training speed by timing how long the model takes to converge, but you have to make sure that you feed in the right batch_size with 2 GPUs: since 2 GPUs have twice the memory of a single GPU, you should linearly scale your batch_size to 2N. It might be deceiving to see that the model still takes X seconds per batch, but you should know that the model is now seeing 2N samples per batch, which leads to quicker convergence because you can also train with a higher learning rate.
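For concreteness, here is a one-line sketch of that scaling, using the num_replicas_in_sync attribute that tf.distribute strategies expose:

# Each replica still sees 32 samples; the global batch grows with the number of GPUs
global_batch_size = 32 * strategy.num_replicas_in_sync  # 64 with two GPUs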
If both of your GPUs have their memory utilized but are sitting at 40% utilization, there might be multiple reasons:
The model is too simple and you don't need all that compute.
Your batch_size is small and your GPUs can handle a bigger batch_size.
Your CPU is the bottleneck, making the GPUs wait for the data to be ready; this can be the case when you see spikes in GPU utilization.
You need to write a better, more performant data pipeline (see the sketch below). You can find more about efficient data input pipelines here: https://www.tensorflow.org/guide/data_performance
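As a sketch of such a pipeline for this case (assuming X_train and y_train are in-memory NumPy arrays, and reusing the global_batch_size from above):

import tensorflow as tf

train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .shuffle(buffer_size=len(X_train))
            .batch(global_batch_size)
            .prefetch(tf.data.experimental.AUTOTUNE))

regressor.fit(train_ds, epochs=10, callbacks=[cp_callback])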
You can also try using CuDNNLSTM; it's way faster than the usual LSTM layer.
https://www.tensorflow.org/api_docs/python/tf/compat/v1/keras/layers/CuDNNLSTM
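Note that in TF 2.x the standard tf.keras.layers.LSTM already dispatches to the fused cuDNN implementation when its cuDNN-compatible defaults are kept (tanh activation, sigmoid recurrent activation, recurrent_dropout=0, unroll=False, use_bias=True), so the layers in the question should already benefit from it on GPU; for example:

# Uses the cuDNN kernel on GPU because all cuDNN-compatible defaults are unchanged
layer = tf.keras.layers.LSTM(units=180, return_sequences=True)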

Keras fit_generator not using the GPU with the TensorFlow backend

I have code implemented with Keras, and I am using fit_generator because the size of the data is pretty big.
Here is my code:
model = ad.kerasCNN()
model.fit_generator(
    ad.data_generator(X_TRAIN_metadata, Y_TRAIN_metadata, batch_size),
    steps_per_epoch,
    epochs=maxEpoch,
    verbose=1,
    callbacks=[history, early_stopping],
    validation_data=ad.data_generator(X_DEV_metadata, Y_DEV_metadata, batch_size),
    validation_steps=steps_per_epoch_dev,
    class_weight=None,
    max_queue_size=10,
    workers=1,
    use_multiprocessing=False,
    initial_epoch=0
)
When I run this code with a GPU, I don't face any issue during the training phase; the GPU is used at between 70 and 80%.
But when the code goes into validation, the GPU is not used.
As a result, validation takes 23 times longer than the training phase.
Is this normal behavior for Keras / TensorFlow?
What can I do to get the GPU used at validation time?