How to use multiple GPUs to train a model in TensorFlow

I've read the official Keras documentation, and it says:
To do single-host, multi-device synchronous training with a Keras model, you would use the tf.distribute.MirroredStrategy API. Here's how it works:
Instantiate a MirroredStrategy, optionally configuring which specific devices you want to use (by default the strategy will use all GPUs available).
Use the strategy object to open a scope, and within this scope, create all the Keras objects you need that contain variables. Typically, that means creating & compiling the model inside the distribution scope.
Train the model via fit() as usual.
Here is what I did. Basically, I have 8 GPUs, but only 3 are available for the task (5, 6 and 7). I create a strategy with these 3 GPUs and compile the model inside the scope. However, each epoch of training takes as long as with a single GPU, and nvidia-smi shows that only GPU 7 is in use. Maybe the warning message below points to the problem, but I am not an expert. If that is the issue, could someone translate it into plain English or suggest a solution? Thanks a lot!
strategy = tf.distribute.MirroredStrategy(["GPU:5", "GPU:6", "GPU:7"])
print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
WARNING:tensorflow:Some requested devices in `tf.distribute.Strategy` are not visible to TensorFlow: /job:localhost/replica:0/task:0/device:GPU:6,/job:localhost/replica:0/task:0/device:GPU:5,/job:localhost/replica:0/task:0/device:GPU:7
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:5', '/job:localhost/replica:0/task:0/device:GPU:6', '/job:localhost/replica:0/task:0/device:GPU:7')
Number of devices: 3
with strategy.scope():
    model_test = model_unet
    model_test.compile(loss=loss,
                       optimizer=adam_opt,
                       metrics=['accuracy', segmentation_models.metrics.IOUScore()],
                       )

model_test.fit(x_train, y_train,
               validation_data=(x_val, y_val),
               batch_size=16,
               epochs=6, verbose=1, callbacks=callbacks
               )
# an example of the first epoch
Train on 14400 samples, validate on 3600 samples
Epoch 1/6
6384/14400 [============>.................] - ETA: 5:35 - loss: 0.0045 - accuracy: 0.9833 - iou_score: 0.8918
Only GPU 7 is in use
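Not part of the original post, but a diagnostic sketch for the warning above (the device indices are an assumption): the message means the requested names GPU:5-GPU:7 do not match any device TensorFlow can actually see, so the strategy falls back to whatever is visible. You could check, and if necessary remap, the visible devices before building the strategy:
import tensorflow as tf

# Diagnostic sketch (not from the original post): list what TensorFlow can see.
print(tf.config.list_physical_devices('GPU'))   # physical GPUs detected by TF
print(tf.config.list_logical_devices('GPU'))    # names that MirroredStrategy can use

# One possible fix (assumes all 8 physical GPUs are detected and not yet initialised):
# make only cards 5-7 visible, then let the strategy pick up every visible GPU.
gpus = tf.config.list_physical_devices('GPU')
if len(gpus) >= 8:
    tf.config.set_visible_devices(gpus[5:8], 'GPU')
strategy = tf.distribute.MirroredStrategy()     # now replicates across the 3 visible GPUs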

Related

Tensorflow model.fit crashed in while loop

I am trying to optimise the learning_rate parameter of my ML model with a while loop.
The first model completes all of its learning steps; however, in the second iteration of the while loop, and thus on the second call of model.fit(), training fails already in the first epoch. No output is generated.
Edit:
I have traced the problem to the TensorBoard callback. Without that callback the loop successfully trains all 4 models, while with the callback the loop fails at the beginning of the second iteration/model fit. What am I doing wrong here?
for lr in [0.005, 0.001, 0.0005, ...]:
    tf.keras.backend.clear_session()
    ...
    # create the model and pass the learning rate through the optimizer
    model = createModel(...)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), ...)
    tensorboard_callback = tf.keras.callbacks.TensorBoard(...)
    model.fit(..., callbacks=[tensorboard_callback])
Does anybody know what I am doing wrong or why this does not work?
Thank you very much!
Environment:
I am working on a Debian server with two Nvidia Tesla V100S (32GB) cards (only one is used for training the model), 128 CPU cores and 2TB main memory
Python: 3.7.9
Tensorflow: 2.4.1
The implementation is inside a Jupyter notebook
The problem has been solved: an incorrect version of cuDNN was installed. Thanks!
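Not part of the original thread, but a quick sanity check for this kind of issue (a sketch assuming TF 2.3+, as used here): compare the CUDA/cuDNN versions the installed TensorFlow build was compiled against with what is on the machine.
import tensorflow as tf

# Sketch of a version sanity check (not from the original thread).
build = tf.sysconfig.get_build_info()            # available from TF 2.3 onwards
print(build.get('cuda_version'), build.get('cudnn_version'))
print(tf.config.list_physical_devices('GPU'))    # empty list => TF cannot use the GPUs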

Unknown number of steps - Training convolution neural network at Google Colab Pro

I am trying to train my CNN on Google Colab Pro. When I run my code everything works, but it does not know the number of steps, so an infinite loop is created.
Mounted at /content/drive
2.2.0-rc3
Found 10018 images belonging to 2 classes.
Found 1336 images belonging to 2 classes.
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
Epoch 1/300
8/Unknown - 364s 45s/step - loss: 54.9278 - accuracy: 0.5410
I am using ImageDataGenerator() for loading the images. How can I fix it?
An iterator does not store anything; it generates the data dynamically, so its length is unknown until you iterate through it. When you are using a dataset or dataset iterator, you must provide steps_per_epoch. You could explicitly pass len(datafiles) into the .fit function. So you need to provide steps_per_epoch as shown below.
model.fit_generator(
    train_data_gen,
    steps_per_epoch=total_train // batch_size,
    epochs=epochs,
    validation_data=val_data_gen,
    validation_steps=total_val // batch_size
)
More details are mentioned here
steps_per_epoch: Integer or None. Total number of steps (batches of samples) before declaring one epoch finished and starting the next epoch. When training with input tensors such as TensorFlow data tensors, the default None is equal to the number of samples in your dataset divided by the batch size, or 1 if that cannot be determined. If x is a tf.data dataset, and 'steps_per_epoch' is None, the epoch will run until the input dataset is exhausted. This argument is not supported with array inputs.
I notice you are using binary classification. One more thing to remember when you use ImageDataGenerator is to provide class_mode as shown below; otherwise you may hit a bug (in keras) or get stuck at 50% accuracy (in tf.keras).
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                            directory=train_dir,
                                                            shuffle=True,
                                                            target_size=(IMG_HEIGHT, IMG_WIDTH),
                                                            class_mode='binary')
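For completeness, a small sketch building on the snippets above (it assumes a val_data_gen created the same way as train_data_gen): the DirectoryIterator returned by flow_from_directory knows how many images it found, so the step counts can be derived instead of hard-coded.
# Sketch (variable names assumed from the snippets above): derive the step counts
# from the generators themselves rather than hard-coding the totals.
steps_per_epoch = train_data_gen.samples // batch_size    # e.g. 10018 training images
validation_steps = val_data_gen.samples // batch_size     # e.g. 1336 validation images

model.fit_generator(train_data_gen,
                    steps_per_epoch=steps_per_epoch,
                    epochs=epochs,
                    validation_data=val_data_gen,
                    validation_steps=validation_steps)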

Difference in Performance between Cloud Compute VM and AI Platform

I have a GCP cloud compute VM, which is an n1-standard-16, with 4 P100 GPUs attached, and a solid state drive for storing data. I'll refer to this as "the VM".
I've previously used the VM to train a tensorflow based CNN. I want to move away from this to using AI Platform so I can run multiple jobs simultaneously. However I've run into some problems.
Problems
When the training is run on the VM I can set a batch size of 400, and the standard time for an epoch to complete is around 25 minutes.
When the training is running on a complex_model_m_p100 AI platform machine, which I believe to be equivalent to the VM, I can set a maximum batch size of 128, and the standard time for an epoch to complete is 1 hour 40 minutes.
Differences: the VM vs AI Platform
The VM uses TF1.12 and AI Platform uses TF1.15. Consequently there is a difference in GPU drivers (CUDA 9 vs CUDA 10).
The VM is equipped with a solid state drive, which I don't think is the case for AI platform machines.
I want to understand the cause of the reduced batch size, and to get the epoch times on AI Platform down to a level comparable with the VM. Has anyone else run into this issue? Am I running on the correct kind of AI Platform machine? Any advice would be welcome!
It could be a bunch of things. There are two ways to go about it: either make the VM look more like AI Platform:
export IMAGE_FAMILY="tf-latest-gpu"   # 1.15 instead of 1.12
export ZONE=...
export INSTANCE_NAME=...

gcloud compute instances create $INSTANCE_NAME \
    --zone=$ZONE \
    --image-family=$IMAGE_FAMILY \
    --image-project=deeplearning-platform-release \
    --maintenance-policy=TERMINATE \
    --metadata="install-nvidia-driver=True"
and then attach 4 GPUs after that.
...or make AI Platform look more like the VM:
https://cloud.google.com/ai-platform/training/docs/machine-types#gpus-and-tpus,
because you are using a legacy machine type right now.
After following the advice of @Frederik Bode and creating a replica VM with TF 1.15 and the associated drivers installed, I've managed to solve my problem.
Rather than using the multi_gpu_model function call within tf.keras, it's actually best to use a distributed strategy and run the model within that scope.
There is a guide describing how to do it here.
Essentially now my code looks like this:
mirrored_strategy = tf.distribute.MirroredStrategy()

with mirrored_strategy.scope():
    training_dataset, validation_dataset = get_datasets()
    model = setup_model()

    # Don't do this, it's not necessary!
    #### NOT NEEDED model = tf.keras.utils.multi_gpu_model(model, 4)

    opt = tf.keras.optimizers.Adam(learning_rate=args.learning_rate)
    model.compile(loss='sparse_categorical_crossentropy',
                  optimizer=opt,
                  metrics=['accuracy'])

steps_per_epoch = args.steps_per_epoch
validation_steps = args.validation_steps

model.fit(training_dataset, steps_per_epoch=steps_per_epoch, epochs=args.num_epochs,
          validation_data=validation_dataset, validation_steps=validation_steps)
I set up a small dataset so I could rapidly prototype this.
With a single P100 GPU, the epoch time averaged 66 seconds.
With 4 GPUs, using the code above, the average epoch time was 19 seconds.

Log device info in DNNClassifier estimator in Tensorflow

I am using DNNClassifier Estimator to train a binary classifier. I want to log device info to verify whether my model is running on GPU or CPU.
Since we don't deal with the session directly when using an Estimator, how can I log device info?
Major problem: my 3-layer neural net with hidden units [100, 75, 50] is running faster on CPU than on GPU. I tried increasing the batch size up to 256, but it is still the same. Hence, I want to confirm whether it is actually using the GPU.
Use the config argument of tf.estimator.Estimator.__init__:
classifier = DNNClassifier(
    feature_columns=feature_columns,
    hidden_units=[100, 75, 50],
    config=tf.estimator.RunConfig(
        session_config=tf.ConfigProto(log_device_placement=True)))
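Not part of the original answer, but as an extra sanity check (a sketch assuming a TF 1.x build, consistent with the ConfigProto usage above), you could also confirm that TensorFlow sees a GPU at all:
from tensorflow.python.client import device_lib
import tensorflow as tf

# Sketch (not from the original answer): if no GPU appears here, the estimator
# cannot be placing any ops on one, regardless of device-placement logging.
print(tf.test.is_gpu_available())                          # True if a usable CUDA GPU exists
print([d.name for d in device_lib.list_local_devices()])   # e.g. ['/device:CPU:0', '/device:GPU:0']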

TensorBoard Callback in Keras does not respect initial_epoch of fit?

I'm trying to train multiple models in parallel on a single graphics card. To achieve that I need to resume training of models from saved weights, which is not a problem. The model.fit() method even has a parameter initial_epoch that lets me tell the model which epoch the loaded model is on. However, when I pass a TensorBoard callback to the fit() method in order to monitor the training of the models, all data is shown at x=0 on TensorBoard.
Is there a way to overcome this and adjust the epoch on TensorBoard?
By the way: I'm running Keras 2.0.6 and TensorFlow 1.3.0.
self.callbacks = [TensorBoardCallback(log_dir='./../logs/' + self.model_name, histogram_freq=0,
                                      write_graph=True, write_images=False, start_epoch=self.step_num)]
self.model.fit(x=self.data['X_train'], y=self.data['y_train'],
               batch_size=self.input_params[-1]['batch_size'], epochs=1,
               validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose,
               callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'],
               initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5' % (self.model_name))

self.model.load_weights('./weights/%s.hdf5' % (self.model_name))
self.model.fit(x=self.data['X_train'], y=self.data['y_train'],
               batch_size=self.input_params[-1]['batch_size'], epochs=1,
               validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose,
               callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'],
               initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5' % (self.model_name))
The resulting graph on TensorBoard looks like this, which is not what I was hoping for:
Update:
When passing epochs=10 to the first model.fit(), the results for all 10 epochs are displayed in TensorBoard (see picture).
However, when reloading the model and running it (with the same callback attached), the on_epoch_end method of the callback never gets called.
Turns out that the number of epochs I pass to model.fit() to tell it how long to train has to be counted FROM the initial_epoch specified. So if initial_epoch=self.step_num, then epochs=self.step_num + 10 if I want to train for 10 more epochs.
Say we have just started fitting our model and our first epoch count is 30
(please ignore the other parameters, just look at epochs and initial_epoch):
model.fit(train_dataloader, validation_data=test_dataloader, epochs=30, steps_per_epoch=len(train_dataloader), callbacks=callback_list)
Now say, after 30 epochs, we want to start again from the 31st epoch (you can see this in TensorBoard) after changing our Adam optimizer's (or any optimizer's) learning rate.
So what we can do is:
model.optimizer.learning_rate = 0.0005
model.fit(train_dataloader, validation_data=test_dataloader, initial_epoch=30, epochs=55, steps_per_epoch=len(train_dataloader), callbacks=callback_list)
=> So here initial_epoch is where we left off training last time, and epochs is initial_epoch plus the number of epochs we want to run in this second fit.