RNN on Colab TPU runs at the same speed as local CPU version - google-colaboratory

I implemented both a local version of an RNN and a Colab TPU version (code below). When I execute the Colab TPU version, training is just as slow as the local version running on my laptop's CPU.
Does the Colab TPU support RNNs?
Am I missing something here?
import tensorflow as tf
import os
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, SimpleRNN

# Connect to the Colab TPU runtime
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))

strategy = tf.distribute.TPUStrategy(resolver)

# step, X and y are defined earlier in the notebook
with strategy.scope():
    model = Sequential()
    model.add(SimpleRNN(units=32, input_shape=(1, step), activation="relu"))
    model.add(Dense(16, activation="relu"))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='rmsprop')
    model.fit(X, y, epochs=50, batch_size=16, verbose=0)

Ctrl-F on this page for "RNN". It seems like it should work if you can make the RNN static enough.
In general, dynamic operations don't work well with TPUs, since the TPU needs to recompile the model graph for each new shape it encounters.
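For illustration, a minimal sketch (not from the original answer) of making the question's model fully static, reusing the same step, X and y from the question: unroll=True replaces the dynamic recurrent loop with a fixed number of steps, and drop_remainder=True keeps the batch dimension constant, so XLA should only need to compile the graph once.
with strategy.scope():
    model = Sequential()
    # unroll=True: a static, unrolled loop instead of a dynamic one
    model.add(SimpleRNN(units=32, input_shape=(1, step), activation="relu", unroll=True))
    model.add(Dense(16, activation="relu"))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='rmsprop')

# drop_remainder=True keeps every batch exactly the same shape
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(16, drop_remainder=True)
model.fit(dataset, epochs=50, verbose=0)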

Related

Google Colab: Why is CPU faster than TPU?

I'm using a Google Colab TPU to train a simple Keras model. Removing the distributed strategy and running the same program on the CPU is much faster than on the TPU. How is that possible?
import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# Load Iris dataset
x = load_iris().data
y = load_iris().target
# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)
# Convert train data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')
# Specify a distributed strategy to use TPU
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)
# Use the strategy to create and compile a Keras model
with strategy.scope():
    model = Sequential()
    model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
    model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
    model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')
start = timeit.default_timer()
# Fit the Keras model on the dataset
model.fit(x_train, y_train, batch_size=20, epochs=20, validation_data=[x_val, y_val], verbose=0, steps_per_epoch=2)
print('\nTime: ', timeit.default_timer() - start)
Thank you for your question.
I think what's happening here is a matter of overhead -- since the TPU runs on a separate VM (accessible at grpc://$COLAB_TPU_ADDR), each call to run a model on the TPU incurs some amount of overhead as the client (the Colab notebook in this case) sends a graph to the TPU, which is then compiled and run. This overhead is small compared to the time it takes to run e.g. ResNet50 for one epoch, but large compared to the time it takes to run a simple model like the one in your example.
For best results on TPU we recommend using tf.data.Dataset. I updated your example for TensorFlow 2.2:
%tensorflow_version 2.x
import timeit
import os
import tensorflow as tf
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam
# Load Iris dataset
x = load_iris().data
y = load_iris().target
# Split data to train and validation set
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.30, shuffle=False)
# Convert train data type to use TPU
x_train = x_train.astype('float32')
x_val = x_val.astype('float32')
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)
# Use the strategy to create and compile a Keras model
with strategy.scope():
    model = Sequential()
    model.add(Dense(32, input_shape=(4,), activation=tf.nn.relu, name="relu"))
    model.add(Dense(3, activation=tf.nn.softmax, name="softmax"))
    model.compile(optimizer=Adam(learning_rate=0.1), loss='logcosh')
start = timeit.default_timer()
# Fit the Keras model on the dataset
model.fit(train_dataset, epochs=20, validation_data=val_dataset)
print('\nTime: ', timeit.default_timer() - start)
This takes about 30 seconds to run, compared to ~1.3 seconds to run on CPU. We can substantially reduce the overhead here by repeating the dataset and running one long epoch rather than several small ones. I replaced the dataset setup with this:
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).repeat(20).batch(20)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(20)
And replaced the fit call with this:
model.fit(train_dataset, validation_data=val_dataset)
This brings the runtime down to about 6 seconds for me. This is still slower than CPU, but that's not surprising for such a small model that can easily be run locally. In general, you'll see more benefit from using TPUs with larger models. I recommend looking through TensorFlow's official TPU guide, which presents a larger image classification model for the MNIST dataset.
This is probably due to the batch size you are using. Compared to CPUs and GPUs, the training speed of a TPU depends heavily on the batch size. Check the following site for more information:
https://cloud.google.com/tpu/docs/performance-guide
The Cloud TPU hardware is different from CPUs and GPUs. At a high level, CPUs can be characterized as having a low number of high performing threads. GPUs can be characterized as having a very high number of low performing threads. A Cloud TPU, with its 128 x 128 matrix unit, can be thought of as either a single, very powerful thread, which can perform 16K ops per cycle, or 128 x 128 tiny, simple threads that are connected in pipeline fashion. Correspondingly, when addressing memory, multiples of 8 (floats) are desirable, as well as multiples of 128 for operations targeting the matrix unit.
This means the batch size should be a multiple of 128 per TPU core. A Colab TPU has 8 cores, so in the best case you would select a global batch size of 128 * 8 = 1024.
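As a hedged illustration against the Iris example above (Iris is tiny, so the numbers only demonstrate the shapes, not a realistic speedup), the dataset setup would become:
# Illustrative only: a global batch of 1024 gives each of the 8 cores
# a 128-element shard; .repeat() is needed because Iris has only ~105
# training rows, and drop_remainder keeps the batch shape static.
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).repeat().batch(1024, drop_remainder=True)
model.fit(train_dataset, steps_per_epoch=10, epochs=1)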

Is it possible to train a model on GPU, then predict on CPU

I want to train my custom model on GPU devices.
I am wondering: will clients be able to use it on a CPU?
Yes: you do the heavy job of training on a GPU, save the weights, and then your CPU will only do the matrix multiplications for predictions.
In Tensorflow and Keras you can train your model and save Neural Network weights:
TensorFlow:
# ON GPU
with tf.Session() as sess:
    # init and saver are assumed to be defined, e.g.
    # init = tf.global_variables_initializer(); saver = tf.train.Saver()
    sess.run(init)
    save_path = saver.save(sess, "/tmp/saved_model.ckpt")
# ON CPU
with tf.Session() as sess:
    saver.restore(sess, "/tmp/saved_model.ckpt")
Keras:
model.save_weights('your_model_weights.h5')
model.load_weights('your_model_weights.h5')
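If you want to restore the architecture and optimizer state too, not just the weights, Keras can also save the full model; a small sketch (standard Keras API, not part of the original answer):
model.save('your_model.h5')  # architecture + weights + optimizer state
from keras.models import load_model
model = load_model('your_model.h5')  # usable on a CPU-only machine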
With sklearn-compatible estimators such as XGBClassifier, you can persist the fitted model this way:
model = XGBClassifier(max_depth=100, learning_rate=0.7, n_estimators=10, objective='binary:logistic', booster='gbtree', n_jobs=16, eval_metric="error")
# eval_set belongs to fit(), not the constructor
clf = model.fit(x_train, y_train, eval_set=eval_set, verbose=True)
import joblib  # in older scikit-learn: from sklearn.externals import joblib
joblib.dump(clf, '/path/your_model.joblib')
model = joblib.load('/path/your_model.joblib')
model.predict(x_train)

How to transform keras model to tpu model

I am trying to transform my Keras model in the Google Cloud console into a TPU model. Unfortunately, I am getting the error shown below. My minimal example is the following:
import keras
from keras.models import Sequential
from keras.layers import Dense, Activation
import tensorflow as tf
import os
model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Dense(32))
model.add(Activation('relu'))
model.compile(optimizer='rmsprop', loss='mse')
tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
My output is:
Using TensorFlow backend.
Traceback (most recent call last):
File "cloud_python4.py", line 11, in <module>
tpu_model = tf.contrib.tpu.keras_to_tpu_model(AttributeError: module 'tensorflow.contrib.tpu' has no attribute 'keras_to_tpu_model'
The keras_to_tpu_model method seems to be experimental, as indicated on the TensorFlow website. Has it recently been removed? If so, how can I proceed to use TPUs to estimate my Keras model? And if the keras_to_tpu_model method is still available, why can I not invoke it?
I am assuming you defined your TPU_WORKER as below:
import os
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
Instead of converting your model to TPU, build a distribution strategy. This is the method by which the batch will be distributed to the eight TPU cores and by which the loss from each will be calculated.
resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)
tf.contrib.distribute.initialize_tpu_system(resolver)
strategy = tf.contrib.distribute.TPUStrategy(resolver)
With the strategy build and compile your model. This should work quite nicely for regression.
with strategy.scope():
    model = Sequential()
    model.add(Dense(32, input_dim=784))
    model.add(Dense(32))
    model.add(Activation('relu'))
    model.compile(optimizer='rmsprop', loss='mse')
Import Keras from TensorFlow.
This is because tf.contrib.tpu.keras_to_tpu_model() requires a tensorflow.keras Model, not the standalone Keras version.
For example, use from tensorflow.keras.layers import Dense, Activation instead, and so on.
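As a sketch, the question's minimal example with those imports would read as follows (assuming TPU_WORKER is defined as in the previous answer):
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation

# Same model as in the question, built from tf.keras
model = Sequential()
model.add(Dense(32, input_dim=784))
model.add(Dense(32))
model.add(Activation('relu'))
model.compile(optimizer='rmsprop', loss='mse')

tpu_model = tf.contrib.tpu.keras_to_tpu_model(
    model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))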

Keras model to tensorflow

Is it possible to convert a Keras model (an h5 file of network architecture and weights) into a TensorFlow model? Or is there an equivalent of Keras's model.save in TensorFlow?
Yes, it is possible: since Keras uses TensorFlow as its backend, it also builds a computational graph. You just need to get this graph from your Keras model.
"Keras only uses one graph and one session. You can access the session
via: K.get_session(). The graph associated with it would then be:
K.get_session().graph."
(from fchollet: https://github.com/keras-team/keras/issues/3223#issuecomment-232745857)
Or you can save this graph in checkpoint format (https://www.tensorflow.org/api_docs/python/tf/train/Saver):
import tensorflow as tf
from keras import backend as K

saver = tf.train.Saver()
sess = K.get_session()
# ckpt_model_name is a checkpoint path of your choosing, e.g. "/tmp/model.ckpt"
retval = saver.save(sess, ckpt_model_name)
By the way, since TensorFlow 1.3 you can use Keras right from it:
from tensorflow.python.keras import models, layers

Getting Cuda code from Tensorflow or Keras

I have code in Keras (or its TF version). I want to get CUDA code that is equivalent to it. Is there a way to do that?
I know that from Keras I can look at the basic graph topology using the following code:
# LSTM for sequence classification in the IMDB dataset
import numpy
from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers.embeddings import Embedding
from keras import backend as K
from keras.preprocessing import sequence
# fix random seed for reproducibility
numpy.random.seed(7)
# load the dataset but only keep the top n words, zero the rest
top_words = 5000
max_review_length = 500
# create the model
embedding_vector_length = 32
model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
g = K.get_session().graph
# GIVES THE GRAPH TOPOLOGY!:
graph_def = g.as_graph_def()
Is there a way to have the .cc file that represent this code?
Thanks!
There is no functionality in TensorFlow to generate C++ CUDA source code from a graph, but the XLA framework supports ahead-of-time compilation, which generates efficient machine code from your TensorFlow graph; that code can then be executed on your CUDA-capable GPU.
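You won't get a .cc file out of this, but for completeness, a minimal sketch of asking XLA to compile a computation in TensorFlow 2 (this uses jit_compile, available in TF 2.4+, which is just-in-time compilation rather than the ahead-of-time path mentioned above):
import tensorflow as tf

# jit_compile=True asks XLA to compile this function into fused kernels;
# on a CUDA-capable GPU those kernels run on the GPU.
@tf.function(jit_compile=True)
def dense_step(x, w, b):
    return tf.nn.sigmoid(tf.matmul(x, w) + b)

x = tf.random.normal([8, 100])
w = tf.random.normal([100, 1])
b = tf.zeros([1])
print(dense_step(x, w, b).shape)  # (8, 1)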