Cannot save model using model.save following multi_gpu_model in Keras - tensorflow

Following the upgrade to Keras 2.0.9, I have been using the multi_gpu_model utility, but I can't save my models or best weights using model.save('path').
The error I get is
TypeError: can’t pickle module objects
I suspect there is some problem gaining access to the model object. Is there a workaround for this issue?

To be honest, the easiest approach is to examine the multi-GPU parallel model using parallel_model.summary().
(The parallel model is simply the model after applying the multi_gpu_model function.) The summary clearly shows the actual model (I think as the penultimate layer; I am not at my computer right now). Then you can use the name of this layer to retrieve the model:
model = parallel_model.get_layer('sequential_1')
Often it's called sequential_1, but if you are using a published architecture it may be 'googlenet' or 'alexnet'. You will see the name of the layer in the summary.
Then it's simple to just save that model.
Maxim's approach works, but it's overkill, I think.
Remember: you will need to compile both the model and the parallel model.

Here's a patched version that doesn't fail while saving:
from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf

def multi_gpu_model(model, gpus):
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(num_gpus)

    def get_slice(data, i, parts):
        # Slice the batch dimension into `parts` pieces and return piece `i`.
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        if i == num_gpus - 1:
            # The last replica takes whatever is left of the batch.
            size = batch_size - step * i
        else:
            size = step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        start = stride * i
        return tf.slice(data, start, size)

    all_outputs = []
    for i in range(len(model.outputs)):
        all_outputs.append([])

    # Place a copy of the model on each GPU,
    # each getting a slice of the inputs.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                # Retrieve a slice of the input.
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i,
                                                'parts': num_gpus})(x)
                    inputs.append(slice_i)

                # Apply model on slice
                # (creating a model replica on the target device).
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save the outputs for merging back together later.
                for o in range(len(outputs)):
                    all_outputs[o].append(outputs[o])

    # Merge outputs on CPU.
    with tf.device('/cpu:0'):
        merged = []
        for name, outputs in zip(model.output_names, all_outputs):
            merged.append(concatenate(outputs,
                                      axis=0, name=name))
        return Model(model.inputs, merged)
You can use this multi_gpu_model function, until the bug is fixed in keras. Also, when loading the model, it's important to provide the tensorflow module object:
model = load_model('multi_gpu_model.h5', {'tf': tf})
How it works
The problem is with import tensorflow line in the middle of multi_gpu_model:
def multi_gpu_model(model, gpus):
    import tensorflow as tf
This creates a closure for the get_slice lambda function, which includes the number of gpus (that's ok) and tensorflow module (not ok). Model save tries to serialize all layers, including the ones that call get_slice and fails exactly because tf is in the closure.
The solution is to move import out of multi_gpu_model, so that tf becomes a global object, though still needed for get_slice to work. This fixes the problem of saving, but in loading one has to provide tf explicitly.
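The closure behaviour described above can be reproduced without Keras. The sketch below is only an illustration in plain Python: it uses the standard-library math module as a stand-in for tensorflow, and the helper names (make_local_import, make_global_import, closure_contents, is_picklable) are made up for this example. It shows that a module imported inside the enclosing function ends up in the inner function's closure, where it cannot be pickled, while a module imported at top level does not appear in the closure at all.

```python
import math as global_math  # stand-in for a top-level "import tensorflow as tf"
import pickle


def make_local_import():
    import math as local_math  # imported inside the function, like the unpatched multi_gpu_model

    def get_slice(x):
        # local_math is a local name of the enclosing function, so it is captured in the closure
        return local_math.sqrt(x)

    return get_slice


def make_global_import():
    def get_slice(x):
        # global_math is a module-level name, so nothing is captured in the closure
        return global_math.sqrt(x)

    return get_slice


def closure_contents(func):
    """Return the objects captured in func's closure cells (empty list if none)."""
    cells = func.__closure__ or ()
    return [cell.cell_contents for cell in cells]


def is_picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


local_cells = closure_contents(make_local_import())
global_cells = closure_contents(make_global_import())

print(local_cells)                 # contains the math module object
print(is_picklable(local_cells))   # False: module objects cannot be pickled
print(global_cells)                # []
print(is_picklable(global_cells))  # True
```

This is the same reason the patched version still needs {'tf': tf} when loading: get_slice looks tf up by name at call time instead of carrying it along.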

This needs a small workaround: load the weights of the multi_gpu_model into the regular model, then save the regular model.
#1, instantiate your base model on a cpu
with tf.device("/cpu:0"):
    model = create_model()
#2, put your model to multiple gpus, say 2
multi_model = multi_gpu_model(model, 2)
#3, compile both models
model.compile(loss=your_loss, optimizer=your_optimizer(lr))
multi_model.compile(loss=your_loss, optimizer=your_optimizer(lr))
#4, train the multi gpu model
multi_model.fit(...)
# or multi_model.fit_generator()
#5, save weights
model.set_weights(multi_model.get_weights())
model.save(...)


Training seq2seq model on Google Colab TPU with big dataset - Keras

I'm trying to train a sequence to sequence model for machine translation using Keras on Google Colab TPU.
I have a dataset which I can load in memory, but I have to preprocess it before feeding it to the model. In particular, I need to convert the target words to one-hot vectors, and with many examples I can't hold the entire conversion in memory, so I need to make batches of data.
I'm using this function as a batch generator:
def generate_batch_bert(X_ids, X_masks, y, batch_size = 1024):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X_ids), batch_size):
            # batch of encoder and decoder data
            encoder_input_data_ids = X_ids[j:j+batch_size]
            encoder_input_data_masks = X_masks[j:j+batch_size]
            y_decoder = y[j:j+batch_size]
            # decoder target and input for teacher forcing
            decoder_input_data = y_decoder[:,:-1]
            decoder_target_seq = y_decoder[:,1:]
            # batch of decoder target data
            decoder_target_data = to_categorical(decoder_target_seq, vocab_size_fr)
            # keep only with the right amount of instances for training on TPU
            if encoder_input_data_ids.shape[0] == batch_size:
                yield([encoder_input_data_ids, encoder_input_data_masks, decoder_input_data], decoder_target_data)
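To make the shapes concrete, here is a tiny pure-Python sketch of the same teacher-forcing shift and one-hot step on toy data. It is not the generator above: the names (one_hot, teacher_forcing_batches, toy_targets) and the vocabulary size of 5 are made up for illustration, it works on plain lists instead of NumPy arrays, and it makes a single pass instead of looping forever.

```python
def one_hot(index, vocab_size):
    """Return a one-hot list of length vocab_size with a 1.0 at position index."""
    row = [0.0] * vocab_size
    row[index] = 1.0
    return row


def teacher_forcing_batches(sequences, vocab_size, batch_size):
    """Yield (decoder_input, decoder_target) pairs, dropping an incomplete last batch."""
    for start in range(0, len(sequences), batch_size):
        batch = sequences[start:start + batch_size]
        if len(batch) != batch_size:
            continue  # keep only full batches, as the generator above does for the TPU
        # decoder input: every token except the last one
        decoder_input = [seq[:-1] for seq in batch]
        # decoder target: every token except the first one, one-hot encoded
        decoder_target = [[one_hot(tok, vocab_size) for tok in seq[1:]] for seq in batch]
        yield decoder_input, decoder_target


# three toy target sequences of token ids (hypothetical data)
toy_targets = [[1, 2, 3], [1, 4, 0], [1, 2, 0]]
batches = list(teacher_forcing_batches(toy_targets, vocab_size=5, batch_size=2))

print(len(batches))      # number of full batches
print(batches[0][0])     # decoder inputs of the first batch
print(batches[0][1][0])  # one-hot targets for the first sequence
```

Only one batch comes out because the third sequence does not fill a batch of 2, which is also why steps_per_epoch is computed with integer division.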
The problem is that whenever I try to run the fit function as follows:
model.fit(generate_batch_bert(X_train_ids, X_train_masks, y_train, batch_size = batch_size),
          steps_per_epoch = train_samples//batch_size,
          callbacks = callbacks,
          validation_data = generate_batch_bert(X_val_ids, X_val_masks, y_val, batch_size = batch_size),
          validation_steps = val_samples//batch_size)
I get the following error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
Not sure what's wrong and how I can solve this problem.
I tried loading less data into memory, so that the conversion of the target words to one-hot encoding doesn't crash the kernel, and it actually works. So there is obviously something wrong with how I generate batches.
It's hard to tell what's wrong since you don't provide your model
definition nor any sample data. However, I'm fairly certain that you're
running into the same
TensorFlow bug
that I recently got bitten by.
The workaround is to use the tf.data API, which works much
better with TPUs. Like this:
from tensorflow.data import Dataset
import tensorflow as tf
def map_fn(X_id, X_mask, y):
    decoder_target_data = tf.one_hot(y[1:], vocab_size_fr)
    return (X_id, X_mask, y[:-1]), decoder_target_data
X_ids = Dataset.from_tensor_slices(X_ids)
X_masks = Dataset.from_tensor_slices(X_masks)
y = Dataset.from_tensor_slices(y)
ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024)
model.fit(x = ds, ...)

How do you set a Tensorflow 2 Keras optimizer to a state, before you've applied grads with it?

I'm working at a slightly lower-level of Keras than the Model fit API. I would like to be able to set the state of a newly constructed optimizer to the state of it from previous training.
The get_weights and set_weights methods seem promising; they just return and receive numpy arrays or standard scalar data for the state of the optimizer. However, the problem is you cannot set_weights if the weights have not yet been created, and as far as I can tell, the only public way they get created is on the first call to apply_gradients.
For example, the following fails because opt2 will not have its weights created.
import tensorflow as tf
import numpy as np
opt1 = tf.keras.optimizers.Adam()
opt2 = tf.keras.optimizers.Adam()
layer = tf.keras.layers.Dense(1)
# dummy data
x = np.array([[-1, 1], [1, 1]])
y = np.array([[-1], [1]])
# do one optimization step
with tf.GradientTape() as tape:
    loss = (layer(x) - y)**2
grads = tape.gradient(loss, layer.trainable_weights)
opt1.apply_gradients(zip(grads, layer.trainable_weights))
# copy state to optimizer 2
opt2.set_weights(opt1.get_weights()) # this fails!
Let's assume I do have on hand the relevant model weights on which the optimizer operates. What is the right way to restore state? Based on the implementation of the apply_gradients method, it seems like this is the path:
_ = opt2.iterations # must be called to make this weight appear
# now we can safely set weights
But that feels really hacky to me and prone to fail if implementation details change at a future point. Are there better approaches that I'm missing?

What is the structure of a Keras model if input_shape is omitted and why does it perform better?

I omitted the input_shape in the first layer of my Keras model by mistake. Eventually I noticed this and fixed it – and my model's performance dropped dramatically.
Looking at the structure of the model with and without input_shape, I discovered that the better-performing model has the output shape of multiple. Moreover, plotting it with plot_model shows no connections between the layers:
When it comes to performance, the model I understand (with input_shape) achieves a validation loss of 4.0513 (MSE) after 10 epochs with my test code (below), while the "weird" model manages 1.3218 – and the difference only increases with more epochs.
Model definition:
model = keras.Sequential()
model.add(keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(1001,)))
# add or remove this ^^^^^^^^^^^^^^^^^^^
(never mind the details, this is just a model that demonstrates the difference in performance with and without input_shape)
So what is happening in the better-performing model? What is multiple? How are the layers really connected? How could I build this same model while also specifying input_shape?
Complete script:
import tensorflow as tf
from tensorflow import keras
import numpy as np
from collections import deque
import math, random

def func(x):
    return math.sin(x)*5 + math.sin(x*1.8)*4 + math.sin(x/4)*5

def get_data():
    x = 0
    dx = 0.1
    q = deque()
    r = 0
    data = np.zeros((100000, 1002), np.float32)
    while True:
        x = x + dx
        sig = func(x)
        q.append(sig)
        if len(q) < 1000:
            continue  # wait until the window holds 1000 samples
        arr = np.array(q, np.float32)
        for k in range(10):
            xx = random.uniform(0.1, 9.9)
            data[r, :1000] = arr[:1000]
            data[r, 1000] = 5*xx #scale for easier fitting
            data[r, 1001] = func(x + xx)
            r = r + 1
            if r >= data.shape[0]:
                break
        if r >= data.shape[0]:
            break
        q.popleft()  # slide the window forward by one sample
    inputs = data[:, :1001]
    outputs = data[:, 1001]
    return (inputs, outputs)

model = keras.Sequential()
model.add(keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(1001,)))
# add or remove this ^^^^^^^^^^^^^^^^^^^
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.compile(
    loss = 'mse',
    optimizer = tf.train.RMSPropOptimizer(0.0005),
    metrics = ['mae', 'mse'])
inputs, outputs = get_data()
hist = model.fit(inputs, outputs, epochs=10, validation_split=0.1)
print("Final val_loss is", hist.history['val_loss'][-1])
The results differ because the two models have different initial weights. The fact that one performs (significantly) better than the other is purely by chance, and as @today mentioned, the results they obtain are approximately similar.
As the documentation for tf.set_random_seed explains, random operations use two seeds, the graph-level seed and the operation specific seed; tf.set_random_seed sets the graph-level seed:
Operations that rely on a random seed actually derive it from two seeds: the graph-level and operation-level seeds. This sets the graph-level seed.
Taking a look at the definition for Dense we see that the default kernel initializer is 'glorot_uniform' (let's only consider the kernel initializer here but the same holds for the bias initializer). Walking farther through the source code we'll eventually find out that this fetches the GlorotUniform with default arguments. Specifically the random number generator seed for that specific operation (namely weight initialization) is set to None. Now if we check where this seed is used, we find it is passed to random_ops.truncated_normal for example. This in turn (as do all random operations) fetches now the two seeds, one being the graph-level seed and the other the operation specific seed: seed1, seed2 = random_seed.get_seed(seed). We can check the definition of the get_seed function and we find that if the operation specific seed is not given (which is our case) then it is derived from properties of the current graph: op_seed = ops.get_default_graph()._last_id. The corresponding part of the tf.set_random_seed docs read:
If the graph-level seed is set, but the operation seed is not: The system deterministically picks an operation seed in conjunction with the graph-level seed so that it gets a unique random sequence.
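The consequence of that rule can be shown with a toy model in plain Python. This is not TensorFlow's implementation: ToyGraph, its last_id counter and the way the two seeds are combined into one integer are all assumptions made for this sketch. The only point it illustrates is that, with the same graph-level seed, an op without its own seed gets different random numbers depending on how many ops were created before it.

```python
import random


class ToyGraph:
    """Toy stand-in for a graph that hands out op ids (hypothetical, not TF code)."""

    def __init__(self, graph_seed):
        self.graph_seed = graph_seed
        self.last_id = 0  # plays the role of graph._last_id

    def add_op(self):
        self.last_id += 1

    def random_op(self, op_seed=None):
        # If no operation seed is given, derive it from the current graph state.
        if op_seed is None:
            op_seed = self.last_id
        self.add_op()
        # Combine both seeds into one integer (arbitrary choice for this sketch).
        rng = random.Random(self.graph_seed * 1000003 + op_seed)
        return rng.random()


# Same graph-level seed, but the second graph already contains two other ops.
g_a = ToyGraph(graph_seed=42)
g_b = ToyGraph(graph_seed=42)
g_b.add_op()
g_b.add_op()

derived_a = g_a.random_op()
derived_b = g_b.random_op()
print(derived_a == derived_b)  # False: the derived op seeds differ

# With an explicit operation seed the graph structure no longer matters.
explicit_a = ToyGraph(graph_seed=42).random_op(op_seed=1)
g_c = ToyGraph(graph_seed=42)
g_c.add_op()
explicit_c = g_c.random_op(op_seed=1)
print(explicit_a == explicit_c)  # True
```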
Now coming back to original problem, it makes a difference for the graph structure if input_shape is defined or not. Again looking at a bit of source code we find that Sequential.add builds the inputs and outputs of the network incrementally only if input_shape was specified; otherwise it just stores a list of layers (model._layers); compare model.inputs, model.outputs for the two definitions. The output is incrementally built by calling the layers directly which dispatches to Layer.__call__. This wrapper builds the layer, sets the layer's inputs and outputs and adds some metadata to the outputs; also it uses an ops.name_scope to group operations. We can see this from the visualization provided by Tensorboard (example for the simplified model architecture of Input -> Dense -> Dropout -> Dense):
Now in the case we didn't specify input_shape, all the model has is a list of layers. Even after having called compile, the model is actually not compiled (just attributes such as the optimizer are set). Instead it is compiled "on the fly" when data is passed in to the model for the first time. This happens in model._standardize_weights: the model output is obtained via self.call(..., training=training). Checking this method we find that it builds the layers (note that the model is not yet built) and then computes the output incrementally by using layer.call (not __call__). This leaves out all the metadata and also the grouping of operations and hence results in a different structure of the graph (though its computational operations are all the same). Again checking Tensorboard we find:
Expanding both graphs we would find that they contain the same operations, grouped differently together. However this has the effect that the keras.backend.get_session().graph._last_id is different for both definitions and hence results in a different seed for the random operations:
# With `input_shape`:
>>> keras.backend.get_session().graph._last_id
# Without `input_shape`:
>>> keras.backend.get_session().graph._last_id
Performance results
I used the OP's code with some modifications in order to have similar random operations:
Added the steps described here to ensure reproducibility in terms of randomization,
Set random seeds for Dense and Dropout variable initialization,
Removed validation_split since the splitting happens before "on the fly" compilation of the model without input_shape and hence might interfere with the seed,
Set shuffle = False since this might use a separate operation specific seed.
This is the complete code (in addition I performed export PYTHONHASHSEED=0 before running the script):
from collections import deque
from functools import partial
import math
import random
import sys
import numpy as np
import tensorflow as tf
from tensorflow import keras

seed = int(sys.argv[1])
session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)

def func(x):
    return math.sin(x)*5 + math.sin(x*1.8)*4 + math.sin(x/4)*5

def get_data():
    x = 0
    dx = 0.1
    q = deque()
    r = 0
    data = np.zeros((100000, 1002), np.float32)
    while True:
        x = x + dx
        sig = func(x)
        q.append(sig)
        if len(q) < 1000:
            continue  # wait until the window holds 1000 samples
        arr = np.array(q, np.float32)
        for k in range(10):
            xx = random.uniform(0.1, 9.9)
            data[r, :1000] = arr[:1000]
            data[r, 1000] = 5*xx #scale for easier fitting
            data[r, 1001] = func(x + xx)
            r = r + 1
            if r >= data.shape[0]:
                break
        if r >= data.shape[0]:
            break
        q.popleft()  # slide the window forward by one sample
    inputs = data[:, :1001]
    outputs = data[:, 1001]
    return (inputs, outputs)

Dense = partial(keras.layers.Dense, kernel_initializer=keras.initializers.glorot_uniform(seed=1))
Dropout = partial(keras.layers.Dropout, seed=1)
model = keras.Sequential()
model.add(Dense(64, activation=tf.nn.relu,
                # input_shape=(1001,)
                ))
model.add(Dense(64, activation=tf.nn.relu))
model.add(Dense(64, activation=tf.nn.relu))
model.add(Dense(64, activation=tf.nn.relu))
model.compile(
    loss = 'mse',
    optimizer = tf.train.RMSPropOptimizer(0.0005)
)
inputs, outputs = get_data()
shuffled = np.arange(len(inputs))
np.random.shuffle(shuffled)
inputs = inputs[shuffled]
outputs = outputs[shuffled]
hist = model.fit(inputs, outputs[:, None], epochs=10, shuffle=False)
np.save('without.{:d}.loss.npy'.format(seed), hist.history['loss'])
With this code I'd actually expect to obtain similar results for both approaches however it turns out that they are not equal:
for i in $(seq 1 10)
python $i
Plot the mean loss +/- 1 std. dev.:
Initial weights and initial prediction
I verified that the initial weights and an initial prediction (before fitting) are the same for the two versions:
inputs, outputs = get_data()
mode = 'without'
pred = model.predict(inputs)
np.save(f'{mode}.prediction.npy', pred)
for i, layer in enumerate(model.layers):
    if isinstance(layer, keras.layers.Dense):
        w, b = layer.get_weights()
        np.save(f'{mode}.{i:d}.kernel.npy', w)
        np.save(f'{mode}.{i:d}.bias.npy', b)
for i in 0 2 4 8
do
    for data in bias kernel
    do
        diff -q "with.$i.$data.npy" "without.$i.$data.npy"
    done
done
Influence of Dropout
[ ! ] I checked the performance after removing all Dropout layers and in that case the performance is actually equal. So the crux seems to lie with the Dropout layers. Actually the performance of the models without Dropout layers is the same as for the model with Dropout layers but without specifying input_shape. So it seems that without input_shape the Dropout layers are not effective.
Basically the difference between the two versions is that one uses __call__ and the other uses call to compute the outputs (as explained above). Since the performance is similar to that without Dropout layers, a possible explanation could be that the Dropout layers don't drop when input_shape is not specified. This could be caused by training=False, i.e. the layers don't recognize they are in training mode. However, I don't see a reason why this would happen. Also we can consider again the Tensorboard graphs.
Specifying input_shape:
Not specifying input_shape:
where the switch also depends on the learning phase (as before):
To verify the training kwarg let's subclass Dropout:
class Dropout(keras.layers.Dropout):
    def __init__(self, rate, noise_shape=None, seed=None, **kwargs):
        super().__init__(rate, noise_shape=noise_shape, seed=1, **kwargs)

    def __call__(self, inputs, *args, **kwargs):
        training = kwargs.get('training')
        if training is None:
            training = keras.backend.learning_phase()
        print('[__call__] training: {}'.format(training))
        return super().__call__(inputs, *args, **kwargs)

    def call(self, inputs, training=None):
        if training is None:
            training = keras.backend.learning_phase()
        print('[call] training: {}'.format(training))
        return super().call(inputs, training)
I obtain similar outputs for both versions, however the calls to __call__ are missing when input_shape is not specified:
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
So I suspect that the problem lies somewhere within __call__ but right now I can't figure out what it is.
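For readers unfamiliar with the __call__ / call split, here is a minimal pure-Python sketch of the pattern. ToyLayer and its attributes are invented for this illustration and are far simpler than the real Keras Layer: the wrapper does the bookkeeping (building, recording the call) and then dispatches to call, so invoking call directly yields the same numbers but skips the bookkeeping.

```python
class ToyLayer:
    """Toy layer: __call__ wraps call with bookkeeping (hypothetical, not Keras code)."""

    def __init__(self, name):
        self.name = name
        self.built = False
        self.inbound_calls = 0  # stand-in for the metadata the real wrapper records

    def build(self):
        self.built = True

    def call(self, inputs):
        # The actual computation: double every value.
        return [2 * x for x in inputs]

    def __call__(self, inputs):
        if not self.built:
            self.build()
        self.inbound_calls += 1
        return self.call(inputs)


via_wrapper = ToyLayer("dense_a")
direct = ToyLayer("dense_b")

out_wrapper = via_wrapper([1, 2, 3])  # goes through __call__
out_direct = direct.call([1, 2, 3])   # bypasses __call__

print(out_wrapper == out_direct)                        # True: same computation
print(via_wrapper.inbound_calls, direct.inbound_calls)  # 1 0
print(via_wrapper.built, direct.built)                  # True False
```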
I'm using Ubuntu 16.04, Python 3.6.7 and Tensorflow 1.12.0 via conda (no GPU support):
$ uname -a
Linux MyPC 4.4.0-141-generic #167-Ubuntu SMP Wed Dec 5 10:40:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ python --version
Python 3.6.7 :: Anaconda, Inc.
$ conda list | grep tensorflow
tensorflow 1.12.0 mkl_py36h69b6ba0_0
tensorflow-base 1.12.0 mkl_py36h3c3e929_0
I also had keras and keras-base installed (keras-applications and keras-preprocessing are required by tensorflow):
$ conda list | grep keras
keras 2.2.4 0
keras-applications 1.0.6 py36_0
keras-base 2.2.4 py36_0
keras-preprocessing 1.0.5 py36_0
After removing all, keras* and tensorflow*, then reinstalling tensorflow, the discrepancy vanished. Even after reinstalling keras the results remain similar. I also checked with a different virtualenv where tensorflow is installed via pip; also no discrepancy here. Right now I can't reproduce this discrepancy anymore. It must've been a broken installation of tensorflow.

Passing bool to feed dict

So here is an example of using batch normalization over a 1-D input vector. Batch normalization is performed over 100 training examples xTr. I then want to test on say just 1 example later on xTe.
import tensorflow as tf
import numpy as np
from tensorflow.contrib.layers import layers

if __name__ == "__main__":
    bn = layers.batch_norm
    nFeats = 3
    nObs = 100
    xTr = np.random.rand(nObs,nFeats) # Train
    xTe = np.random.rand(1,nFeats) # Test
    bnTrain = tf.placeholder(tf.bool)
    X = tf.placeholder(tf.float32,[None,nFeats])
    Y = bn(X,nFeats,is_training=bnTrain) # want to be able to change is_training via a feed_dict.
    init_op = tf.initialize_all_variables()
    with tf.Session() as sess:
        sess.run(init_op)
        yTr_ = Y.eval(feed_dict={X:xTr,bnTrain:True})
        yTe_ = Y.eval(feed_dict={X:xTe,bnTrain:False})
But I can't pass a tf.Tensor to a function expecting a normal python bool. What is the best way of going about this so I can change a bool during a session?
The current implementation of the tf.contrib.layers.batch_norm() function is designed to accept a tf.Tensor as the is_training argument (although this fact doesn't appear to be documented), and looking at the revision history, it was added in the TensorFlow 0.10 release. If you are using an older version, please try upgrading to the latest release (currently 0.12), and your existing code should work. Among other improvements, it contains a fused implementation of batch normalization that should make a significant performance improvement.
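As background on why the flag has to be switchable at run time, here is a rough pure-Python sketch of what a batch normalization layer does in each mode. It is not the tf.contrib implementation: ToyBatchNorm, its decay and eps defaults, and the scalar-feature simplification are assumptions for this example. In training mode it normalizes with the statistics of the current batch and updates moving averages; in inference mode it uses the moving averages, which is what makes a single test example meaningful.

```python
import math


class ToyBatchNorm:
    """Scalar-feature batch norm without scale/shift (illustrative only)."""

    def __init__(self, decay=0.9, eps=1e-3):
        self.decay = decay
        self.eps = eps
        self.moving_mean = 0.0
        self.moving_var = 1.0

    def __call__(self, xs, is_training):
        if is_training:
            # Use the statistics of the current batch and update the moving averages.
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_var = self.decay * self.moving_var + (1 - self.decay) * var
        else:
            # Use the statistics accumulated during training.
            mean, var = self.moving_mean, self.moving_var
        return [(x - mean) / math.sqrt(var + self.eps) for x in xs]


bn = ToyBatchNorm()
train_out = bn([1.0, 2.0, 3.0, 4.0], is_training=True)
test_out = bn([5.0], is_training=False)

# Normalizing a single example with its own batch statistics always gives 0.
degenerate = ToyBatchNorm()([5.0], is_training=True)

print(train_out)
print(test_out)
print(degenerate)  # [0.0]
```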

TensorFlow: How to save the trained model parameters to the file that can be imported to other frameworks?

I'd like to pass the parameters of the trained model (weights and bias for convolution and fully connected layers) to other frameworks or languages including iOS and Torch by parsing the saved file.
I tried tf.train.write_graph(session.graph_def, '', 'graph.pb'), but it seems it only includes the graph architecture without weights and biases. If so, is creating a checkpoint file (saver.save(session, "model.ckpt")) the best way? Is it easy to parse the ckpt file type in Swift or other languages?
Please let me know if you have any suggestions.
Instead of parsing a .ckpt file, you can just try evaluating the tensor (in your case the weights of a convolutional layer) and getting the values as a numpy array. Here is a quick toy example (tested on r0.10 - there might be some small API changes in newer versions):
import tensorflow as tf
import numpy as np
x = tf.placeholder(np.float32, [2,1])
w = tf.Variable(tf.truncated_normal([2,2], stddev=0.1))
b = tf.Variable(tf.constant(1.0, shape=[2,1]))
z = tf.matmul(w, x) + b
with tf.Session() as sess:
    sess.run(tf.initialize_all_variables())
    w_val, z_val = sess.run([w, z], feed_dict={x: np.arange(2).reshape(2,1)})
    print(w_val)
    print(z_val)
[[-0.02913031 0.13549708]
[ 0.13807134 0.03763327]]
[[ 1.13549709]
[ 1.0376333 ]]
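Once the values are in hand, they still have to be written in a format the other side can read. The sketch below is one possible way, not something tested against Swift or Torch: it uses only the Python standard library, takes the w_val numbers printed above as a plain nested list (with a NumPy array you would call .tolist() first), and the helper names and the choice of JSON or little-endian float32 are assumptions for this example.

```python
import json
import struct

# Values copied from the printed w_val above, as a plain nested list.
w_val = [[-0.02913031, 0.13549708],
         [0.13807134, 0.03763327]]


def to_json(name, matrix):
    """Serialize a 2-D list as JSON text, including its shape."""
    return json.dumps({"name": name,
                       "shape": [len(matrix), len(matrix[0])],
                       "values": matrix})


def to_float32_bytes(matrix):
    """Pack a 2-D list row by row as little-endian float32."""
    flat = [v for row in matrix for v in row]
    return struct.pack("<%df" % len(flat), *flat)


def from_float32_bytes(blob, rows, cols):
    """Inverse of to_float32_bytes, used here only to check the round trip."""
    flat = struct.unpack("<%df" % (rows * cols), blob)
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


json_text = to_json("w", w_val)
blob = to_float32_bytes(w_val)
restored = from_float32_bytes(blob, 2, 2)

print(json_text)
print(len(blob))  # 16 bytes: 4 values * 4 bytes each
print(restored)   # same numbers, up to float32 rounding
```

JSON is easy to parse almost anywhere but bulky; raw float32 is compact but the reader has to know the shape and byte order in advance.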
If you have trouble getting a reference to your tensor (say it is nested inside a higher-level "layer" operation), try finding it by name. More info here: Tensorflow: How to get a tensor by name?
If you want to see how the weights change during training, you can also try to save all the values you are interested in into tf.Summary objects and parse them later: Parsing `summary_str` byte string evaluated on tensorflow summary object