How to train a model on multiple GPUs with TensorFlow 2 and Keras? - tensorflow

I have an LSTM model that I want to train on multiple GPUs. I adapted the code to do this, and in nvidia-smi I can see that all of the memory on every GPU is in use and each GPU sits at around 40% utilization, BUT the estimated training time per batch is almost the same as with a single GPU.
Can someone please guide me and tell me how I can train properly on multiple GPUs?
My code:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
import os
from tensorflow.keras.callbacks import ModelCheckpoint
checkpoint_path = "./model/"
checkpoint_dir = os.path.dirname(checkpoint_path)
cp_callback = ModelCheckpoint(filepath=checkpoint_path, save_freq='epoch', verbose=1)
# NNET - LSTM
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    regressor = Sequential()
    regressor.add(LSTM(units=180, return_sequences=True, input_shape=(X_train.shape[1], 3)))
    regressor.add(Dropout(0.2))
    regressor.add(LSTM(units=180, return_sequences=True))
    regressor.add(Dropout(0.2))
    regressor.add(LSTM(units=180))
    regressor.add(Dropout(0.2))
    regressor.add(Dense(units=4))
    regressor.compile(optimizer='adam', loss='mean_squared_error')

regressor.fit(X_train, y_train, epochs=10, batch_size=32, callbacks=[cp_callback])

Assume that your batch_size for a single GPU is N and the time taken per batch is X secs.
You can measure training speed by how long the model takes to converge, but you have to make sure you feed in the right batch_size with 2 GPUs: since 2 GPUs have twice the memory of a single GPU, you should linearly scale your batch_size to 2N. It might be deceiving to see that the model still takes X secs per batch, but your model is now seeing 2N samples per batch, which leads to quicker convergence because you can now also train with a higher learning rate.
If both of your GPUs have their memory allocated but are sitting at only about 40% utilization, there can be several reasons:
The model is too simple and doesn't need all that compute.
Your batch_size is small and your GPUs could handle a bigger one.
Your CPU is the bottleneck, making the GPUs wait for the data to be ready; this is typically the case when you see spikes in GPU utilization.
You need to write a better, more performant data pipeline (see the sketch below). You can find more about efficient input pipelines here - https://www.tensorflow.org/guide/data_performance
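As a rough illustration of the last two points, here is a minimal sketch of how the model above could be fed from a tf.data pipeline under MirroredStrategy (assuming X_train/y_train are the NumPy arrays from the question; build_regressor() is a hypothetical helper that builds the same Sequential LSTM, and tf.data.AUTOTUNE may be tf.data.experimental.AUTOTUNE on older TF 2 releases):
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Scale the global batch size with the number of GPUs so each replica still
# sees a per-GPU batch of 32.
per_replica_batch_size = 32
global_batch_size = per_replica_batch_size * strategy.num_replicas_in_sync

# Overlap host-side batch preparation with training so the GPUs are not starved.
train_ds = (tf.data.Dataset.from_tensor_slices((X_train, y_train))
            .shuffle(10000)
            .batch(global_batch_size)
            .prefetch(tf.data.AUTOTUNE))

with strategy.scope():
    regressor = build_regressor()  # hypothetical helper returning the Sequential LSTM above
    regressor.compile(optimizer='adam', loss='mean_squared_error')

regressor.fit(train_ds, epochs=10, callbacks=[cp_callback])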

You can try using CuDNNLSTM. It's way faster than the usual LSTM layer.
https://www.tensorflow.org/api_docs/python/tf/compat/v1/keras/layers/CuDNNLSTM
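For reference, a sketch of what the swap could look like in the model above (same X_train as in the question). Note that in TF 2.x the standard tf.keras.layers.LSTM already dispatches to the fused cuDNN kernel when running on GPU with its default arguments (tanh activation, sigmoid recurrent activation, recurrent_dropout=0, unroll=False, use_bias=True), so the compat layer mainly matters for TF 1.x-style code:
from tensorflow.compat.v1.keras.layers import CuDNNLSTM

regressor = Sequential()
# CuDNNLSTM exposes no activation/recurrent_dropout arguments; it always runs the fused cuDNN kernel.
regressor.add(CuDNNLSTM(units=180, return_sequences=True, input_shape=(X_train.shape[1], 3)))
regressor.add(Dropout(0.2))
regressor.add(CuDNNLSTM(units=180, return_sequences=True))
regressor.add(Dropout(0.2))
regressor.add(CuDNNLSTM(units=180))
regressor.add(Dropout(0.2))
regressor.add(Dense(units=4))
regressor.compile(optimizer='adam', loss='mean_squared_error')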

Related

Data parallelism on multiple GPUs

I am trying to train a model using data parallelism on multiple GPUs on a single machine. As I understand it, in data parallelism we divide the data into batches, and the batches are then processed in parallel. Afterward, the average gradient is calculated from the per-batch errors (for example, with 2 GPUs, the errors of 2 batches), and the weights are updated based on that average gradient.
Now, when I implemented Horovod, I observed something different. I observed that the number of epochs trained is divided by the number of GPUs. For example, if I train the model for 300 epochs, then on 1 GPU the number of epochs is 300, but on 2 GPUs it is divided into 150 epochs (GPU 1 processes 150 epochs and the 2nd GPU processes the remaining 150), and similarly on 3 GPUs it is 100 epochs. Is this correct? If it is correct, then how does it achieve data parallelism?
Here is my code:
import math
import sys
import time
import scipy.io
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
from tensorflow.compat.v1.keras import backend as K
import horovod.tensorflow.keras as hvd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, BatchNormalization, Dropout, Flatten, Dense
# Horovod: initialize Horovod.
hvd.init()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
physical_gpus = tf.config.list_physical_devices('GPU')
tf.config.set_visible_devices([physical_gpus[hvd.local_rank()]], "GPU")
def main():
    input_shape = (seg_train_x.shape[1], seg_train_x.shape[2], seg_train_x.shape[3])
    print(f'input shape {input_shape}')
    epochs = int(math.ceil(300.0 / hvd.size()))
    batch_size = 100

    model = Sequential()
    model.add(Conv2D(16, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape))
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01),
                     bias_regularizer=tf.keras.regularizers.l1(0.01)))
    model.add(BatchNormalization())
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))

    # Horovod: adjust learning rate based on number of GPUs.
    scaled_lr = 0.00001 * hvd.size()
    opt = tf.keras.optimizers.Adam(scaled_lr)

    # Horovod: add Horovod Distributed Optimizer.
    opt = hvd.DistributedOptimizer(opt, backward_passes_per_step=1)

    model.compile(loss=tf.keras.losses.categorical_crossentropy,
                  optimizer=opt,
                  metrics=['accuracy'])

    callbacks = [
        # Horovod: broadcast initial variable states from rank 0 to all other processes.
        # This is necessary to ensure consistent initialization of all workers when
        # training is started with random weights or restored from a checkpoint.
        hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    ]

    print(f'input shape {seg_train_x.shape}')

    # Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
    if hvd.rank() == 0:
        callbacks.append(tf.keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

    csv_logger = tf.keras.callbacks.CSVLogger('training.log')

    start = time.time()
    model.fit(
        seg_train_x,
        seg_train_y,
        batch_size=batch_size,
        callbacks=callbacks + [csv_logger],
        epochs=epochs,
        validation_data=(seg_val_x, seg_val_y),
        verbose=1 if hvd.rank() == 0 else 0,
    )
    end = time.time()

    if hvd.rank() == 0:
        print('Total Training Time:', round((end - start), 2), '(s)')

    score = model.evaluate(seg_test_x, seg_test_y, verbose=0)
    y_pred_test = model.predict(seg_test_x)
    # Take the class with the highest probability from the test predictions
    max_y_pred_test = np.argmax(y_pred_test, axis=1)
    max_y_test = np.argmax(seg_test_y, axis=1)  # actual test labels
    fScore = metrics.f1_score(max_y_test, max_y_pred_test, average='macro')
    print('Test loss:', score[0])
    print('Test accuracy:', score[1])
    print('F1-Score:', fScore)

if __name__ == '__main__':
    main()
Environment:
Framework: (TensorFlow)
Framework version: 2.2.0
Horovod version: v0.21.3
MPI version: (Open MPI) 2.1.1
CUDA version: 10.1, V10.1.243
NCCL version: 2.11.4
Python version: 3.6.9
CMake version: 3.10.2
Why do you reduce the number of epochs? In data parallelism the number of epochs stays the same, but the number of iterations per epoch is reduced: you want your model to 'observe' the same amount of data the same number of times, but because you use more computational devices the effective batch size is multiplied, so the number of iterations 'per epoch' goes down.
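A sketch of what that could look like in the snippet above, keeping hvd, batch_size and the seg_* arrays from the question (the rank-based sharding is an illustration, not part of the original code):
# Keep the number of epochs fixed; give each worker its own 1/hvd.size()
# shard of the data, so the iterations per epoch shrink instead.
epochs = 300
shard_x = seg_train_x[hvd.rank()::hvd.size()]
shard_y = seg_train_y[hvd.rank()::hvd.size()]

model.fit(
    shard_x,
    shard_y,
    batch_size=batch_size,
    callbacks=callbacks + [csv_logger],
    epochs=epochs,
    validation_data=(seg_val_x, seg_val_y),
    verbose=1 if hvd.rank() == 0 else 0,
)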

Failed copying input tensor from CPU to GPU in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]

from random import sample
index=sample(range(0, len(result)), len(result)//5*4)
description_train=[child[0] for i, child in enumerate(result) if i in index]
ipc_train=[child[1] for i, child in enumerate(result) if i in index]
description_test=[child[0] for i, child in enumerate(result) if i not in index]
ipc_test=[child[1] for i, child in enumerate(result) if i not in index]
import numpy as np
def to_onehot(li):
    # Multi-hot encode which of the 8 classes 'A'..'H' appear in li.
    result = np.zeros(8)
    for i, section in enumerate('ABCDEFGH'):
        if section in li:
            result[i] = 1
    return result
from tensorflow.python.keras.preprocessing.text import Tokenizer
max_words=100000
num_classes=8
t=Tokenizer(num_words=max_words)
t.fit_on_texts(description_train)
X_train=t.texts_to_matrix(description_train, mode='binary')
X_test=t.texts_to_matrix(description_test, mode='binary')
Y_train=np.array([to_onehot(child) for child in ipc_train], dtype=np.int32)
Y_test=np.array([to_onehot(child) for child in ipc_test], dtype=np.int32)
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(1024, input_shape=(max_words,), activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dense(num_classes, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, Y_train, batch_size=128, epochs=5, validation_split=0.1)
The last line (model.fit) results in the following error.
InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]
How can I fix it?
Thank you in advance.
I had this error very often, even with high-RAM EC2 instances. The only solution for me was to use generators:
from tensorflow.keras.utils import Sequence
import numpy as np
class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

train_gen = DataGenerator(X_train, y_train, 32)
test_gen = DataGenerator(X_test, y_test, 32)

history = model.fit(train_gen,
                    epochs=6,
                    validation_data=test_gen)
In the above example, we assume that X and y are numpy arrays.
My guess at what's happening: even though I'm using a high-RAM instance, the real limitation is GPU memory, and even though I'm training in batches, when generators are not used TensorFlow seems to try to load the full array into GPU memory.
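An equivalent fix with tf.data, under the same assumption that the inputs are NumPy arrays: from_tensor_slices keeps the full arrays in host memory, and only the current batch should be copied to the GPU.
import tensorflow as tf

batch_size = 32
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices((X_test, y_test)).batch(batch_size)

history = model.fit(train_ds, epochs=6, validation_data=test_ds)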
It may be caused by a memory shortage, and you can do one of the following to solve the problem:
Decreasing the batch_size helps most of the time, and it is the best option when training speed does not matter to you (as you know, with a smaller batch_size the model takes longer to train).
You can run your code on a system with a large amount of RAM, or on a VM or Google Colab for free (Google Colab gives you 16 GB of RAM for free with Tesla K80 GPUs and TPUs).
Reduce the number of samples, or reduce the data dimension with methods such as PCA or feature selection (see the sketch after this list).
If your model's hidden layers are very large, you can also decrease their size; a smaller hidden layer means fewer parameters and less model complexity, so it occupies less memory.
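In the question's code, a concrete form of that dimension reduction is simply lowering max_words in the Tokenizer, which shrinks the width of X_train and of the first Dense layer by the same factor (a sketch, reusing the variable names from the question; 20000 is an arbitrary choice):
# Keep only the 20,000 most frequent tokens instead of 100,000.
max_words = 20000
t = Tokenizer(num_words=max_words)
t.fit_on_texts(description_train)
X_train = t.texts_to_matrix(description_train, mode='binary')
X_test = t.texts_to_matrix(description_test, mode='binary')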
One solution is to reduce the size of input images to fit the capacity of GPU. For me, I reduced from (224,224,10) to (128,128,10).
I found a solution. I reduced the number of samples:
model.fit(X_train[0:3000], Y_train[0:3000], batch_size=128, epochs=5, validation_split=0.1)
Then the error disappeared.
Good luck to everyone.
Reducing the sample size is not always an option, because why would you have that many samples in the first place? Therefore, I would recommend a few options:
Use a cloud VM (AWS, Azure or GCP) with higher specs, pay hourly, and be done and dusted with this one.
If you don't want to pay and are OK with writing extra code, then basically you have to create your own generator (for example via flow_from_directory) to load the dataset in batches; a sketch follows below. Refer to these:
https://www.askpython.com/python/examples/handling-large-datasets-machine-learning
https://www.analyticsvidhya.com/blog/2020/08/image-augmentation-on-the-fly-using-keras-imagedatagenerator/
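For image data laid out in one subfolder per class, that generator-based loading could look roughly like this (a sketch; 'train_dir', the target size and the batch size are hypothetical placeholders):
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical directory layout: train_dir/<class_name>/*.jpg
datagen = ImageDataGenerator(rescale=1. / 255)
train_gen = datagen.flow_from_directory(
    'train_dir',
    target_size=(128, 128),
    batch_size=32,
    class_mode='categorical',
)

model.fit(train_gen, epochs=5)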

What may be the cause of pretty slow training speed for CNN (transfer learning)?

I use GPU for training a model with transfer learning from Inception v3.
Weights='imagenet'. The convolutional base is frozen, and dense layers on top are used for 10-class classification for MNIST digit recognition.
The code is the following:
from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    #rescale=1./255,
    preprocessing_function=tf.keras.applications.inception_v3.preprocess_input,
    featurewise_center=False,             # set input mean to 0 over the dataset
    samplewise_center=False,              # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,   # divide each input by its std
    zca_whitening=False,                  # apply ZCA whitening
    rotation_range=10,                    # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.1,                       # randomly zoom image
    width_shift_range=0.1,                # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,               # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,                # randomly flip images
    vertical_flip=False)

train_generator = datagen.flow_from_directory(
    train_path,
    target_size=(224, 224),
    color_mode="rgb",
    class_mode="categorical",
    batch_size=86,
    interpolation="bilinear",
)

test_generator = datagen.flow_from_directory(
    test_path,
    target_size=(224, 224),
    color_mode="rgb",
    class_mode="categorical",
    batch_size=86,
    interpolation="bilinear",
)
#Import pre-trained model InceptionV3
from keras.applications import InceptionV3
from keras.models import Sequential
from keras.layers import Flatten, Dense, Dropout, BatchNormalization
from keras.optimizers import RMSprop

#Instantiate convolutional base
conv_base = InceptionV3(weights='imagenet',
                        include_top=False,
                        input_shape=(224, 224, 3))  # 3 = number of channels in RGB pictures

#Forbid training of conv part
conv_base.trainable = False

#Build model
model = Sequential()
model.add(conv_base)
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.5))
model.add(Dense(10, activation='softmax'))

# Define the optimizer
optimizer = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08, decay=0.0)

# Compile the model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=['accuracy'])

history = model.fit_generator(train_generator,
                              epochs=1, validation_data=test_generator,
                              verbose=2, steps_per_epoch=60000 // 86)
#, callbacks=[learning_rate_reduction])
The training rate I obtained was 1 epoch/hour (even after reducing the learning rate to 0.001) when I used rescale=1./255 in the data generator.
After searching for answers, I found that the cause may be an inappropriate input format.
When I tried to use preprocessing_function=tf.keras.applications.inception_v3.preprocess_input,
I received the following message after 30 min of training:
Epoch 1/1
/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py:616: UserWarning: The input 1449 could not be retrieved. It could be because a worker has died.
UserWarning)
/usr/local/lib/python3.6/dist-packages/keras/utils/data_utils.py:616: UserWarning: The input 614 could not be retrieved. It could be because a worker has died.
UserWarning)
What is wrong with the model?
Thanks in advance.
The learning rate doesn't affect the training speed.
How fast you train the model depends on your GPU, your CPU, and the IO of your drive, ceteris paribus.
First, check whether your GPU is being used for the training:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
Next, is your current batch_size the maximum your GPU can handle? Try increasing the batch_size until you get an OOM error.
Or you could monitor your GPU and CPU usage.
If GPU and CPU usage are not maxed out, training may be limited by your drive's IO speed.
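On TF 2.x, where the Keras backend no longer exposes tensorflow_backend, the equivalent check could look like this (a sketch):
import tensorflow as tf

# Lists the GPUs TensorFlow can see; an empty list means training runs on the CPU.
print(tf.config.list_physical_devices('GPU'))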
Nothing is wrong in the model.
For increasing the speed of epochs, try the following:
Switch on XLA.
import tensorflow as tf
tf.config.optimizer.set_jit(True)
Use mixed precision
from tensorflow.keras.mixed_precision import experimental as mixed_precision
policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_policy(policy)
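On TF 2.4 and newer, the mixed-precision API has moved out of experimental; an equivalent setup would be roughly:
import tensorflow as tf

tf.config.optimizer.set_jit(True)  # XLA, as above
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# With mixed_float16, it is usually advisable to keep the final softmax layer
# in float32, e.g. Dense(10, activation='softmax', dtype='float32').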

Resnet-50 adversarial training with cleverhans FGSM accuracy stuck at 5%

I am facing a strange problem when adversarially training a ResNet-50, and I am not sure whether it's a logical error or a bug somewhere in the code/libraries.
I am adversarially training a ResNet-50 that's loaded from Keras, using the FastGradientMethod from cleverhans, and expecting the adversarial accuracy to rise at least above 90% (probably 99.x%). The training algorithm and the training and attack params should be visible in the code.
The problem, as already stated in the title, is that the accuracy is stuck at 5% after training ~3000 of 39002 training inputs in the first epoch (GermanTrafficSignRecognitionBenchmark, GTSRB).
When training without an adversarial loss function, the accuracy does not get stuck after 3000 samples but continues to rise above 0.95 in the first epoch.
When substituting the network with a LeNet-5, AlexNet or VGG19, the code works as expected, and an accuracy absolutely comparable to the non-adversarial, categorical_crossentropy loss function is achieved. I've also tried running the procedure using solely tf-cpu and different versions of TensorFlow; the result is always the same.
Code for obtaining ResNet-50:
def build_resnet50(num_classes, img_size):
    from tensorflow.keras.applications import ResNet50
    from tensorflow.keras import Model
    from tensorflow.keras.layers import Dense, Flatten

    resnet = ResNet50(weights='imagenet', include_top=False, input_shape=img_size)
    x = Flatten(input_shape=resnet.output.shape)(resnet.output)
    x = Dense(1024, activation='sigmoid')(x)
    predictions = Dense(num_classes, activation='softmax', name='pred')(x)
    model = Model(inputs=[resnet.input], outputs=[predictions])
    return model
Training:
def lr_schedule(epoch):
    # decreasing learning rate depending on epoch
    return 0.001 * (0.1 ** int(epoch / 10))

def train_model(model, xtrain, ytrain, xtest, ytest, lr=0.001, batch_size=32,
                epochs=10, result_folder=""):
    from cleverhans.attacks import FastGradientMethod
    from cleverhans.utils_keras import KerasModelWrapper
    import tensorflow as tf
    from tensorflow.keras.optimizers import SGD
    from tensorflow.keras.callbacks import LearningRateScheduler, ModelCheckpoint

    sgd = SGD(lr=lr, decay=1e-6, momentum=0.9, nesterov=True)
    model(model.input)
    wrap = KerasModelWrapper(model)
    sess = tf.compat.v1.keras.backend.get_session()
    fgsm = FastGradientMethod(wrap, sess=sess)
    fgsm_params = {'eps': 0.01,
                   'clip_min': 0.,
                   'clip_max': 1.}
    loss = get_adversarial_loss(model, fgsm, fgsm_params)
    model.compile(loss=loss, optimizer=sgd, metrics=['accuracy'])
    model.fit(xtrain, ytrain,
              batch_size=batch_size,
              validation_data=(xtest, ytest),
              epochs=epochs,
              callbacks=[LearningRateScheduler(lr_schedule)])
Loss-function:
def get_adversarial_loss(model, fgsm, fgsm_params):
    def adv_loss(y, preds):
        import tensorflow as tf
        tf.keras.backend.set_learning_phase(False)  # turn off dropout during input gradient calculation, to avoid unconnected gradients
        # Cross-entropy on the legitimate examples
        cross_ent = tf.keras.losses.categorical_crossentropy(y, preds)
        # Generate adversarial examples
        x_adv = fgsm.generate(model.input, **fgsm_params)
        # Consider the attack to be constant
        x_adv = tf.stop_gradient(x_adv)
        # Cross-entropy on the adversarial examples
        preds_adv = model(x_adv)
        cross_ent_adv = tf.keras.losses.categorical_crossentropy(y, preds_adv)
        tf.keras.backend.set_learning_phase(True)  # turn back on
        return 0.5 * cross_ent + 0.5 * cross_ent_adv
    return adv_loss
Versions used:
tf+tf-gpu: 1.14.0
keras: 2.3.1
cleverhans: > 3.0.1 - latest version pulled from github
It is a side-effect of the way we estimate the moving averages on BatchNormalization.
The mean and variance of the training data that you used are different from the ones of the dataset used to train the ResNet50. Because the momentum on the BatchNormalization has a default value of 0.99, with only 10 iterations it does not converge quickly enough to the correct values for the moving mean and variance. This is not obvious during training when the learning_phase is 1 because BN uses the mean/variance of the batch. Nevertheless when we set learning_phase to 0, the incorrect mean/variance values which are learned during training significantly affect the accuracy.
You can fix this problem with the approaches below:
More iterations
Reduce the batch size from 32 to 16 (to perform more updates per epoch) and increase the number of epochs from 10 to 250. This way the moving mean and variance will converge to the correct values.
Change the momentum of BatchNormalization
Keep the number of iterations fixed but change the momentum of the BatchNormalization layer to update the rolling mean and variance more aggressively (not recommended for production models).
On the original snippet, add the following code between reading the base_model and defining the new layers:
# ....
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=input_shape)

# PATCH MOMENTUM - START
import json
conf = json.loads(base_model.to_json())
for l in conf['config']['layers']:
    if l['class_name'] == 'BatchNormalization':
        l['config']['momentum'] = 0.5

m = Model.from_config(conf['config'])
for l in base_model.layers:
    m.get_layer(l.name).set_weights(l.get_weights())

base_model = m
# PATCH MOMENTUM - END

x = base_model.output
# ....
I would also recommend you try another hack provided by us here.

Google Colab: Why is CPU faster than TPU?

I'm using Google colab TPU to train a simple Keras model. Removing the distributed strategy and running the same program on the CPU is much faster than TPU. How is that possible?
import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# Load Iris dataset
x = load_iris().data
y = load_iris().target
# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)
# Convert train data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')
# Specify a distributed strategy to use TPU
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)
# Use the strategy to create and compile a Keras model
with strategy.scope():
    model = Sequential()
    model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
    model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
    model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')
start = timeit.default_timer()
# Fit the Keras model on the dataset
model.fit(x_train, y_train, batch_size=20, epochs=20, validation_data=[x_val, y_val], verbose=0, steps_per_epoch=2)
print('\nTime: ', timeit.default_timer() - start)
Thank you for your question.
I think what's happening here is a matter of overhead -- since the TPU runs on a separate VM (accessible at grpc://$COLAB_TPU_ADDR), each call to run a model on the TPU incurs some amount of overhead as the client (the Colab notebook in this case) sends a graph to the TPU, which is then compiled and run. This overhead is small compared to the time it takes to run e.g. ResNet50 for one epoch, but large compared to the time it takes to run a simple model like the one in your example.
For best results on TPU we recommend using tf.data.Dataset. I updated your example for TensorFlow 2.2:
%tensorflow_version 2.x
import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# Load Iris dataset
x = load_iris().data
y = load_iris().target
# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)
# Convert train data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)
# Use the strategy to create and compile a Keras model
with strategy.scope():
    model = Sequential()
    model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
    model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
    model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')
start = timeit.default_timer()
# Fit the Keras model on the dataset
model.fit(train_dataset, epochs=20, validation_data=val_dataset)
print('\nTime: ', timeit.default_timer() - start)
This takes about 30 seconds to run, compared to ~1.3 seconds to run on CPU. We can substantially reduce the overhead here by repeating the dataset and running one long epoch rather than several small ones. I replaced the dataset setup with this:
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).repeat(20).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)
And replaced the fit call with this:
model.fit(train_dataset, validation_data=val_dataset)
This brings the runtime down to about 6 seconds for me. This is still slower than CPU, but that's not surprising for such a small model that can easily be run locally. In general, you'll see more benefit from using TPUs with larger models. I recommend looking through TensorFlow's official TPU guide, which presents a larger image classification model for the MNIST dataset.
This is probably due to the batch size you are using. In comparison to CPU and GPU, the training speed of a TPU is highly dependent on the batch size. Check the following site for more information:
https://cloud.google.com/tpu/docs/performance-guide
The Cloud TPU hardware is different from CPUs and GPUs. At a high level, CPUs can be characterized as having a low number of high performing threads. GPUs can be characterized as having a very high number of low performing threads. A Cloud TPU, with its 128 x 128 matrix unit, can be thought of as either a single, very powerful thread, which can perform 16K ops per cycle, or 128 x 128 tiny, simple threads that are connected in pipeline fashion. Correspondingly, when addressing memory, multiples of 8 (floats) are desirable, as well as multiples of 128 for operations targeting the matrix unit.
This means that the batch size should be a multiple of 128 times the number of TPU cores. Google Colab provides 8 TPU cores, so in the best case you should select a batch size of 128 * 8 = 1024.
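Applied to the example above, that advice would look roughly like this sketch (for a dataset as tiny as Iris the repeat count and steps_per_epoch are only for illustration; drop_remainder=True keeps the batch shape static, which the TPU compiler expects):
train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .repeat()
                 .batch(1024, drop_remainder=True))

model.fit(train_dataset, epochs=20, steps_per_epoch=10)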