Keras: Load checkpoint weights HDF5 generated by multiple GPUs - tensorflow

Checkpoint snippet:
checkpointer = ModelCheckpoint(filepath=os.path.join(savedir, "mid/weights.{epoch:02d}.hd5"),
                               monitor='val_loss', verbose=1, save_best_only=False, save_weights_only=False)
hist = model.fit_generator(
    gen.generate(batch_size=batch_size, nb_classes=nb_classes),
    samples_per_epoch=593920, nb_epoch=nb_epoch, verbose=1, callbacks=[checkpointer],
    validation_data=gen.vld_generate(VLD_PATH, batch_size=64, nb_classes=nb_classes),
    nb_val_samples=10000
)
I trained my model on a multi-GPU host which dumps intermediate checkpoint files in HDF5 format. When I loaded one of them on a single-GPU machine with keras.load_weights('mid'), an error was raised:
Using TensorFlow backend.
Traceback (most recent call last):
File "server.py", line 171, in <module>
model = load_model_and_weights('zhch.yml', '7_weights.52.hd5')
File "server.py", line 16, in load_model_and_weights
model.load_weights(os.path.join('model', weights_name))
File "/home/lz/code/ProjectGo/meta/project/libpolicy-server/.virtualenv/lib/python3.5/site-packages/keras/engine/topology.py", line 2701, in load_weights
self.load_weights_from_hdf5_group(f)
File "/home/lz/code/ProjectGo/meta/project/libpolicy-server/.virtualenv/lib/python3.5/site-packages/keras/engine/topology.py", line 2753, in load_weights_from_hdf5_group
str(len(flattened_layers)) + ' layers.')
ValueError: You are trying to load a weight file containing 1 layers into a model with 21 layers.
Is there any way to load checkpoint weights generated with multiple GPUs on a single-GPU machine? No Keras issue seems to discuss this problem, so any help would be appreciated.

You can load your model on a single GPU like this:
from keras.models import load_model
multi_gpus_model = load_model('mid')
origin_model = multi_gpus_model.layers[-2]  # use multi_gpus_model.summary() to find the layer that wraps the original model
origin_model.save_weights('single_gpu_model.hdf5')
'single_gpu_model.hdf5' is the file that you can load into the model on the single-GPU machine.
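If it helps, a minimal sketch of the load step on the single-GPU side, where build_model() is a hypothetical stand-in for however you construct the original architecture (e.g. from your 'zhch.yml' config):
# Hypothetical sketch: build_model() is a placeholder for however you
# construct the original single-GPU architecture.
single_gpu_model = build_model()
single_gpu_model.load_weights('single_gpu_model.hdf5')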

Try this function:
def keras_model_reassign_weights(model_cpu,model_gpu):
weights_temp ={}
print('_'*5,'Collecting weights from GPU model','_'*5)
for layer in model_gpu.layers:
try:
for layer_unw in layer.layers:
#print('Weights extracted for: ',layer_unw.name)
weights_temp[layer_unw.name] = layer_unw.get_weights()
break
except:
print('Skipped: ',layer.name)
print('_'*5,'Writing weights to CPU model','_'*5)
for layer in model_cpu.layers:
try:
layer.set_weights(weights_temp[layer.name])
#print(layer.name,'Done!')
except:
print(layer.name,'weights does not set for this layer!')
return model_cpu
But you need to load the weights into your GPU model first:
#load or initialize your keras multi-gpu model
model_gpu = None
#load or initialize your keras model with the same structure, without using keras.multi_gpu function
model_cpu = None
#load weights into multigpu model
model_gpu.load_weights(r'gpu_model_best_checkpoint.hdf5')
#execute function
model_cpu = keras_model_reassign_weights(model_cpu,model_gpu)
#save obtained weights for cpu model
model_cpu.save_weights(r'CPU_model.hdf5')
After transferring you can use weights with a single GPU or CPU model.
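Note that the function above matches weights by layer.name, so it only works if the layers nested inside the multi-GPU wrapper carry the same names as the layers of model_cpu. A quick sanity check (hypothetical snippet, not from the original answer):
# Hypothetical check: list CPU layers whose names have no counterpart
# inside the multi-GPU wrapper (these would be skipped by the transfer).
gpu_inner_names = set()
for wrapper in model_gpu.layers:
    if hasattr(wrapper, 'layers'):
        gpu_inner_names.update(l.name for l in wrapper.layers)
print([l.name for l in model_cpu.layers if l.name not in gpu_inner_names])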

Related

LSTM RNN 'Sequential' object has no attribute 'predict_classes'

I'm running predictions on an RNN model and it fails with:
AttributeError Traceback (most recent call last)
<ipython-input-18-cd3fe876f68d> in <module>
1 # Make sentiment predictions
----> 2 predictions = model.predict_classes(X_test)
AttributeError: 'Sequential' object has no attribute 'predict_classes'
I'm building a sentiment analysis model that feeds encoded/tokenized comments into a deep NN using LSTM layers. The goal is to feed the RNN all the comments in a .csv file of 25,000 comments and have it predict the positivity score of each one. The model outputs 0 for a negative comment or 1 for a positive comment.
So I'm at the final stages. I pre-processed and encoded/tokenized all my data, built the NN model, and now I'm training it. I decided to train for only a few epochs because it's eating my laptop's memory.
# Training the model
batch_size = 1000
epochs = 10
model.fit(
    X_train_rnn,
    y_train_rnn,
    validation_data=(X_val_rnn, y_val_rnn),
    epochs=epochs,
    batch_size=batch_size,
    verbose=1
)
After my computer trains this model for minutes on end, I try to make predictions and this is where I get the error:
y_rnn_pred = model.predict_classes(X_test_rnn, batch_size=1000)
If you are using tf.keras version 2.5+, replace
y_rnn_pred = model.predict_classes(X_test_rnn, batch_size=1000)
with
y_rnn_pred = (model.predict(X_test_rnn, batch_size=1000) > 0.5).astype("int32")
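As a side note (not part of the original answer): the > 0.5 threshold applies to a single sigmoid output like the one here. If the final layer were instead a multi-class softmax, the usual equivalent of predict_classes is an argmax over the predicted probabilities:
import numpy as np
# For a softmax output layer, take the index of the highest probability per sample.
y_rnn_pred = np.argmax(model.predict(X_test_rnn, batch_size=1000), axis=-1)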

Training seq2seq model on Google Colab TPU with big dataset - Keras

I'm trying to train a sequence to sequence model for machine translation using Keras on Google Colab TPU.
I have a dataset which I can load into memory, but I have to preprocess it to feed it to the model. In particular, I need to convert the target words to one-hot vectors, and with many examples I can't hold the entire conversion in memory, so I need to generate batches of data.
I'm using this function as a batch generator:
def generate_batch_bert(X_ids, X_masks, y, batch_size=1024):
    ''' Generate a batch of data '''
    while True:
        for j in range(0, len(X_ids), batch_size):
            # batch of encoder and decoder data
            encoder_input_data_ids = X_ids[j:j+batch_size]
            encoder_input_data_masks = X_masks[j:j+batch_size]
            y_decoder = y[j:j+batch_size]
            # decoder target and input for teacher forcing
            decoder_input_data = y_decoder[:, :-1]
            decoder_target_seq = y_decoder[:, 1:]
            # batch of decoder target data
            decoder_target_data = to_categorical(decoder_target_seq, vocab_size_fr)
            # keep only batches with the right number of instances for training on TPU
            if encoder_input_data_ids.shape[0] == batch_size:
                yield ([encoder_input_data_ids, encoder_input_data_masks, decoder_input_data], decoder_target_data)
The problem is that whenever I try to run the fit function as follows:
model.fit(x=generate_batch_bert(X_train_ids, X_train_masks, y_train, batch_size=batch_size),
          steps_per_epoch=train_samples//batch_size,
          epochs=epochs,
          callbacks=callbacks,
          validation_data=generate_batch_bert(X_val_ids, X_val_masks, y_val, batch_size=batch_size),
          validation_steps=val_samples//batch_size)
I get the following error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
Not sure what's wrong and how I can solve this problem.
EDIT
I tried loading a smaller amount of data into memory so that the conversion of the target words to one-hot encoding doesn't crash the kernel, and it actually works. So there is obviously something wrong with how I generate batches.
It's hard to tell what's wrong since you don't provide your model definition nor any sample data. However, I'm fairly certain that you're running into the same TensorFlow bug that I recently got bitten by.
The workaround is to use the tensorflow.data API, which works much better with TPUs. Like this:
from tensorflow.data import Dataset
import tensorflow as tf

def map_fn(X_id, X_mask, y):
    decoder_target_data = tf.one_hot(y[1:], vocab_size_fr)
    return (X_id, X_mask, y[:-1]), decoder_target_data

...

X_ids = Dataset.from_tensor_slices(X_ids)
X_masks = Dataset.from_tensor_slices(X_masks)
y = Dataset.from_tensor_slices(y)
ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024)

model.fit(x = ds, ...)
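One small aside, not from the original answer: the Python generator above dropped incomplete batches to keep shapes fixed for the TPU; the tf.data equivalent is the drop_remainder flag on batch():
# drop_remainder ensures every batch has exactly 1024 examples, as the TPU expects.
ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024, drop_remainder=True)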

What parameters can be used in filepath for ModelCheckpoints (tf.Keras)?

I'm trying to train a Keras model and save the model weights at every epoch and batch.
I define a checkpoint as follows:
checkpoint_5000_path = 'model_checkpoints_5000/checkpoints_{epoch:02d}_{batch:04d}'
checkpoint = ModelCheckpoint(filepath=checkpoint_5000_path, frequency=5000)
and train the model:
model.fit(x=x_train, y=y_train, epochs=3, validation_data=(x_test, y_test),
          batch_size=10, callbacks=[checkpoint])
But right after the first iteration this error occurs:
KeyError: 'Failed to format this callback filepath: "model_checkpoints_5000/checkpoints_{epoch:02d}_{batch:04d}". Reason: \'batch\'
How can I have Python add the batch number to the file name?
How do I find the list of other parameters that are available for the output filename?
My setup: Windows 10, jupyter notebook in chrome, Python 3.5.4, Tensorflow 2.3.0, Keras is imported from Tensorflow.
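As a hedged aside (based on the tf.keras ModelCheckpoint documentation rather than on this thread): the filepath template can use {epoch} plus any metric that appears in the epoch-end logs, such as {loss} or {val_loss}; {batch} is not one of the available keys. A minimal sketch that is known to format correctly with per-epoch checkpointing:
from tensorflow.keras.callbacks import ModelCheckpoint
# Sketch: format the checkpoint name with the epoch number and a logged metric.
checkpoint_path = 'model_checkpoints_5000/checkpoints_{epoch:02d}_{val_loss:.4f}'
checkpoint = ModelCheckpoint(filepath=checkpoint_path, save_freq='epoch')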

Can't import frozen graph with BatchNorm layer

I have trained a Keras model based on this repo.
After the training I save the model as checkpoint files like this:
sess=tf.keras.backend.get_session()
saver = tf.train.Saver()
saver.save(sess, current_run_path + '/checkpoint_files/model_{}.ckpt'.format(date))
Then I restore the graph from the checkpoint files and freeze it using the standard tf freeze_graph script. When I want to restore the frozen graph I get the following error:
Input 0 of node Conv_BN_1/cond/ReadVariableOp/Switch was passed float from Conv_BN_1/gamma:0 incompatible with expected resource
How can I fix this issue?
Edit: My problem is related to this question. Unfortunately, I can't use the workaround.
Edit 2:
I have opened an issue on github and created a gist to reproduce the error.
https://github.com/keras-team/keras/issues/11032
Just resolved the same issue. I combined these few answers: 1, 2, 3 and realized that the issue originates from the batchnorm layer's working state: training vs. inference. So, in order to resolve the issue you just need to place one line before loading your model:
keras.backend.set_learning_phase(0)
Complete example to export a model:
import tensorflow as tf
from tensorflow.python.framework import graph_io
from tensorflow.keras.applications.inception_v3 import InceptionV3

def freeze_graph(graph, session, output):
    with graph.as_default():
        graphdef_inf = tf.graph_util.remove_training_nodes(graph.as_graph_def())
        graphdef_frozen = tf.graph_util.convert_variables_to_constants(session, graphdef_inf, output)
        graph_io.write_graph(graphdef_frozen, ".", "frozen_model.pb", as_text=False)

tf.keras.backend.set_learning_phase(0)  # this line most important

base_model = InceptionV3()

session = tf.keras.backend.get_session()

INPUT_NODE = base_model.inputs[0].op.name
OUTPUT_NODE = base_model.outputs[0].op.name
freeze_graph(session.graph, session, [out.op.name for out in base_model.outputs])
To load the *.pb model:
from PIL import Image
import numpy as np
import tensorflow as tf

# https://i.imgur.com/tvOB18o.jpg
im = Image.open("/home/chichivica/Pictures/eagle.jpg").resize((299, 299), Image.BICUBIC)
im = np.array(im) / 255.0
im = im[None, ...]

graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    net_inp, net_out = tf.import_graph_def(
        graph_def, return_elements=["input_1", "predictions/Softmax"]
    )
    with tf.Session(graph=graph) as sess:
        out = sess.run(net_out.outputs[0], feed_dict={net_inp.outputs[0]: im})
        print(np.argmax(out))
This is a bug with TensorFlow 1.1x and, as another answer stated, it is because of the internal batch norm learning vs. inference state. In TF 1.14.0 you actually get a cryptic error when trying to freeze a batch norm layer.
Using set_learning_phase(0) will put the batch norm layer (and probably others like dropout) into inference mode and thus the batch norm layer will not work during training, leading to reduced accuracy.
My solution is this:
Create the model using a function (do not use K.set_learning_phase(0)):
def create_model():
    inputs = Input(...)
    ...
    return model

model = create_model()
Train model
Save weights:
model.save_weights("weights.h5")
Clear session (important so layer names are the same) and set learning phase to 0:
K.clear_session()
K.set_learning_phase(0)
Recreate model and load weights:
model = create_model()
model.load_weights("weights.h5")
Freeze as before
Thanks for pointing out the main issue! I found that keras.backend.set_learning_phase(0) does not always work, at least in my case.
Another approach might be: for l in keras_model.layers: l.trainable = False
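If it helps, a minimal, untested sketch of how that suggestion would slot into the export flow above (keras_model stands in for your already-built model, and freeze_graph is the helper defined earlier in this thread):
# Sketch of the alternative: mark every layer non-trainable before freezing.
for l in keras_model.layers:
    l.trainable = False
session = tf.keras.backend.get_session()
freeze_graph(session.graph, session, [out.op.name for out in keras_model.outputs])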

Error when implementing tensorflow high level api

I am trying to implement TensorFlow's provided high-level APIs, specifically the BaselineClassifier. However, when trying to train the model, I get the following
Error:
NotFoundError (see above for traceback): Key baseline/bias not found in checkpoint
[[Node: save/RestoreV2 = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Code:
import tensorflow as tf
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

def digit_cross():
    # Number of classes, one class for each of 10 digits.
    num_classes = 10
    digit = datasets.load_digits()
    x = digit.data
    y = digit.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.3, random_state=42)
    y_train_index = np.arange(y_train.size)

    train_input_fn = tf.estimator.inputs.numpy_input_fn(
        x={"x": np.array(x_train)},
        y=np.array(y_train),
        num_epochs=None,
        shuffle=False)

    # Build BaselineClassifier
    classifier = tf.estimator.BaselineClassifier(n_classes=num_classes,
                                                 model_dir="./checkpoints_tutorial17-1/")

    # Fit model.
    classifier.train(train_input_fn)

digit_cross()
It seems that you have a checkpoint in model_dir="./checkpoints_tutorial17-1/", which is from another model and is not from a BaselineClassifier. To be specific, you have a checkpoint file and model.ckpt-* files in that folder.
As tensorflow documented:
model_dir: Directory to save model parameters, graph and etc. This can also be used to load checkpoints from the directory into a estimator to continue training a previously saved model. If PathLike object, the path will be resolved. If None, the model_dir in config will be used if set. If both are set, they must be same. If both are None, a temporary directory will be used.
Here, BaselineClassifier will first build a graph which uses baseline/bias. Then it finds that there is a previous checkpoint in model_dir. It will try to load this checkpoint, and you should see an info message (if you've done tf.logging.set_verbosity(tf.logging.INFO)) saying something like
"INFO:tensorflow:Restoring parameters from .../checkpoints_tutorial17-1\model.ckpt-..."
Because this checkpoint in model_dir is not from a BaselineClassifier, it won't have baseline/bias. BaselineClassifier cannot find it and will thus throw an error.
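A minimal sketch of the usual fix implied above: point the BaselineClassifier at a fresh directory (or delete the stale checkpoint files) so it does not try to restore variables from an incompatible model. The directory name here is hypothetical:
# Sketch: use a model_dir that contains no checkpoints from other models.
classifier = tf.estimator.BaselineClassifier(n_classes=num_classes,
                                             model_dir="./checkpoints_baseline/")
classifier.train(train_input_fn)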