Loaded keras model fails to continue training, dimensions mismatch - tensorflow

I'm using TensorFlow with Keras to train a char-RNN on Google Colab. I train my model for 10 epochs and save it using 'model.save()', as shown in the documentation for saving models. Immediately afterwards, I load it again as a check, call model.fit() on the loaded model, and get a "Dimensions must be equal" error using the exact same training set. The training data is in a TensorFlow dataset organised in batches, as shown in the documentation for tf datasets. Here is a minimal working example:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
X = np.random.randint(0,50,(10000))
seq_len = 150
batch_size = 20
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset = dataset.batch(seq_len+1,drop_remainder=True)
dataset = dataset.map(lambda x: (x[:-1],x[1:]))
dataset = dataset.shuffle(20).batch(batch_size,drop_remainder=True)
def make_model(vocabulary_size,embedding_dimension,rnn_units,batch_size,stateful):
    model = Sequential()
    model.add(Embedding(vocabulary_size,embedding_dimension,
                        batch_input_shape=[batch_size,None]))
    model.add(LSTM(rnn_units,return_sequences=True,stateful=stateful))
    model.add(Dense(vocabulary_size))
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer='adam',metrics=['accuracy'])
    model.summary()
    return model
vocab_size = 51
emb_dim = 20
rnn_units = 10
model = make_model(vocab_size,emb_dim,rnn_units,batch_size,False)
model.fit(dataset,epochs=10)
model.save('/content/test_model')
model2 = tf.keras.models.load_model('/content/test_model')
model2.fit(dataset,epochs=10)
The first training line, "model.fit()", runs fine but the last line returns the error:
ValueError: Dimensions must be equal, but are 20 and 150 for '{{node
Equal}} = Equal[T=DT_INT64, incompatible_shape_error=true](ArgMax,
ArgMax_1)' with input shapes: [20], [20,150].
I want to be able to resume training later, as my real dataset is much larger. Therefore, saving only the weights is not an ideal option.
Any advice?
Thanks!

If you have saved checkpoints, then you can resume from those checkpoints with a reduced dataset. Your neural network, its layers, and their dimensions should stay the same.

The problem is the 'accuracy' metric. For some reason, there is some mishandling of dimensions on the predictions when the model is loaded with this metric, as I found in this thread (see last comment). Running model.compile() on the loaded model with the same metric allows training to continue. However, it shouldn't be necessary to compile the model again. Moreover, this means that the optimiser state is lost, as explained in this answer, thus, this is not very useful for resuming training.
On the other hand, using 'sparse_categorical_accuracy' from the start works just fine. I am able to load the model and continue training without having to recompile. In hindsight, this choice is more appropriate given that the outputs of my last layer are logits over the distribution of characters. Thus, this is not a binary but a multiclass classification problem. Nonetheless, I verified that both 'accuracy' and 'sparse_categorical_accuracy' returned the same values in my specific example. Thus, I believe that keras is internally converting accuracy to categorical accuracy, but something goes wrong when doing this on a model that has been just loaded which forces the need to recompile.
I also verified that if the saved model was compiled with 'accuracy', loading the model and recompiling with 'sparse_categorical_accuracy' will allow resuming training. However, as mentioned before, this would discard the state of the optimiser and I suspect that it would be no better than just making a new model and loading only the weights from the saved one.
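The shapes in the error message are consistent with this explanation. The NumPy sketch below is only an illustration, not Keras internals, and its function names are made up. It assumes the loaded model ends up comparing labels and predictions the way a one-hot categorical accuracy would: taking the argmax of the integer labels collapses them from (20, 150) to (20,), which cannot be compared element-wise with the (20, 150) argmax of the logits.

```python
import numpy as np

batch_size, seq_len, vocab_size = 20, 150, 51
rng = np.random.default_rng(0)

# Integer character labels and per-character logits, shaped as in the question.
y_true = rng.integers(0, vocab_size, size=(batch_size, seq_len))
y_pred = rng.normal(size=(batch_size, seq_len, vocab_size))


def sparse_categorical_accuracy(y_true, y_pred):
    """Labels are class indices: compare them with the argmax of the logits."""
    return np.mean(y_true == np.argmax(y_pred, axis=-1))


def categorical_accuracy_shapes(y_true, y_pred):
    """Labels are assumed one-hot, so argmax is applied to both sides."""
    return np.argmax(y_true, axis=-1).shape, np.argmax(y_pred, axis=-1).shape


acc = sparse_categorical_accuracy(y_true, y_pred)
true_shape, pred_shape = categorical_accuracy_shapes(y_true, y_pred)
print(true_shape, pred_shape)  # (20,) (20, 150)
```

The two shapes printed at the end match the `[20]` and `[20,150]` inputs of the `Equal` node in the traceback.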

Related

Training with Dataset API and numpy array yields completely different results

I have a CNN regression model and feature comes in (2000, 3000, 1) shape, where 2000 is total number of samples with each being a (3000, 1) 1D array. Batch size is 8, 20% of the full dataset is used for validation.
However, zipping the features and labels into a tf.data.Dataset gives completely different scores from feeding the numpy arrays in directly.
The tf.data.Dataset code looks like:
# Load features and labels
features = np.array(features) # shape is (2000, 3000, 1)
labels = np.array(labels) # shape is (2000,)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
# Training model
model.fit(train_dataset, validation_data=val_dataset,
batch_size=8, epochs=1000)
The numpy code looks like:
# Load features and labels
features = np.array(features) # exactly the same as previous
labels = np.array(labels) # exactly the same as previous
# Training model
model.fit(x=features, y=labels, shuffle=True, validation_split=0.2,
batch_size=8, epochs=1000)
Except for this, other code is exactly the same, for example
# Set global random seed
tf.random.set_seed(0)
np.random.seed(0)
# No preprocessing of feature at all
# Load model (exactly the same)
model = load_model()
# Compile model
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss=tf.keras.losses.MeanSquaredError(),
metrics=[tf.keras.metrics.mean_absolute_error, ],
)
The former method, via the tf.data.Dataset API, yields a mean absolute error (MAE) of around 10^-3 on both the training and validation sets, which looks quite suspicious as the model doesn't have any dropout or regularization to prevent overfitting. On the other hand, feeding the numpy arrays in directly gives a training MAE of around 0.1 and a validation MAE of around 1.
The low MAE of the tf.data.Dataset method looks very suspicious, but I couldn't find anything wrong with the code. I also confirmed that the number of training batches is 200 and the number of validation batches is 50, meaning I didn't use the training set for validation.
I tried to vary the global random seed or use some different shuffle seeds, which didn't change the results much. Training was done on NVIDIA V100 GPUs, and I tried tensorflow version 2.9, 2.10, 2.11 which didn't make much difference.
The problem lies in the default behaviour of the "shuffle" method of tf.data.Dataset, more specifically the reshuffle_each_iteration argument, which is True by default. This means that if I run the following code:
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
model.fit(train_dataset, validation_data=val_dataset, batch_size=8, epochs=1000)
The dataset is actually reshuffled at the start of every epoch, although this is not apparent from the code. As a result, the validation data leaks into the training set (in fact, there is no distinction between the two sets, as the order is reshuffled every epoch).
So make sure to set reshuffle_each_iteration to False if you would like to shuffle the dataset and then do a train-val split.
UPDATE: TensorFlow has confirmed this issue, and a warning is to be added to future docs.
PS: It's a hard lesson for me, as I have been using the model for analysing the results for several months (as a graduating MPhil student).
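The leak can be illustrated without TensorFlow. The sketch below is a plain-Python simulation under simplifying assumptions (sample indices instead of tensors, a fixed 1600/400 split standing in for take(200)/skip(200) on batches of 8). The helper names are hypothetical. It measures how many of the first epoch's validation samples later end up in a training split.

```python
import random

NUM_SAMPLES = 2000
TRAIN_SIZE = 1600  # 200 batches of 8 in the question


def split(order):
    """Mimic take/skip: the first TRAIN_SIZE items train, the rest validate."""
    return set(order[:TRAIN_SIZE]), set(order[TRAIN_SIZE:])


def leaked_fraction(reshuffle_each_epoch, epochs=5, seed=0):
    """Fraction of first-epoch validation samples later used for training."""
    rng = random.Random(seed)
    order = list(range(NUM_SAMPLES))
    rng.shuffle(order)
    _, first_val = split(order)
    seen_in_training = set()
    for _ in range(epochs):
        train, _ = split(order)
        seen_in_training |= train
        if reshuffle_each_epoch:
            rng.shuffle(order)  # new order, so take/skip select different items
    return len(first_val & seen_in_training) / len(first_val)


fixed = leaked_fraction(reshuffle_each_epoch=False)
reshuffled = leaked_fraction(reshuffle_each_epoch=True)
print(fixed)  # 0.0
print(reshuffled)
```

With a fixed order the split stays disjoint, while reshuffling exposes nearly all of the original validation samples to training within a few epochs.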

OpenVino converted model not returning same score values as original model (Sigmoid)

I've converted a Keras model for use with OpenVino. The original Keras model used sigmoid to return scores ranging from 0 to 1 for binary classification. After converting the model for use with OpenVino, the scores are all near 0.99 for both classes but seem slightly lower for one of the classes.
For example, test1.jpg and test2.jpg (from opposite classes) yield scores of 0.00320357 and 0.9999, respectively.
With OpenVino, the same images yield scores of 0.9998982 and 0.9962392, respectively.
Edit* One suspicion is that the input array is still accepted by the OpenVino model but is somehow changed in shape or "scrambled" and therefore is never a match for class one? In other words, if you fed it random noise, the score would also always be 0.9999. Maybe I'd have to somehow get the OpenVino model to accept the original shape (1,180,180,3) instead of (1,3,180,180) so I don't have to force the input into a different shape than the one the original model accepted? That's weird though because I specified the shape when making the xml and bin for openvino:
python3 /opt/intel/openvino_2021/deployment_tools/model_optimizer/mo_tf.py --saved_model_dir /Users/.../Desktop/.../model13 --output_dir /Users/.../Desktop/... --input_shape=\[1,180,180,3]
However, I know from error messages that the inference engine is expecting (1,3,180,180) for some unknown reason. Could that be the problem? The other suspicion is something wrong with how the original model was frozen. I'm exploring different ways to freeze the original model (keras model converted to pb) in case the problem is related to that.
I checked to make sure the Sigmoid activation function is being used in the OpenVino implementation (same activation as the Keras model) and it looks like it is. Why, then, are the values not the same? Any help would be much appreciated.
The code for the OpenVino inference is:
import openvino
from openvino.inference_engine import IECore, IENetwork
from skimage import io
import sys
import numpy as np
import os
def loadNetwork(model_xml, model_bin):
    ie = IECore()
    network = ie.read_network(model=model_xml, weights=model_bin)
    input_placeholder_key = list(network.input_info)[0]
    input_placeholder = network.input_info[input_placeholder_key]
    output_placeholder_key = list(network.outputs)[0]
    output_placeholder = network.outputs[output_placeholder_key]
    return network, input_placeholder_key, output_placeholder_key
batch_size = 1
channels = 3
IMG_HEIGHT = 180
IMG_WIDTH = 180
#loadNetwork('saved_model.xml','saved_model.bin')
image_path = 'test.jpg'
def load_source(path_to_image):
    image = io.imread(path_to_image)
    img = np.resize(image,(180,180))
    return img
img_new = load_source('test2.jpg')
#Batch?
def classify(image):
    device = 'CPU'
    network, input_placeholder_key, output_placeholder_key = loadNetwork('saved_model.xml','saved_model.bin')
    ie = IECore()
    exec_net = ie.load_network(network=network, device_name=device)
    res = exec_net.infer(inputs={input_placeholder_key: image})
    print(res)
    res = res[output_placeholder_key]
    return res
result = classify(img_new)
print(result)
result = result[0]
top_result = np.argmax(result)
print(top_result)
print(result[top_result])
And the result:
{'StatefulPartitionedCall/model/dense/Sigmoid': array([[0.9962392]], dtype=float32)}
[[0.9962392]]
0
0.9962392
Generally, TensorFlow is the only framework that uses the NHWC layout, while most others use NCHW. The OpenVINO Inference Engine therefore follows the majority and uses the NCHW layout, so a model must be converted to NCHW in order to work with the Inference Engine.
When the native model format is converted into IR, the Model Optimizer performs the transformation needed to bring the shape into the layout required by the Inference Engine (N,C,H,W). Passing the model's correct input shape via the --input_shape parameter should suffice.
Besides, most TensorFlow models are trained with images in RGB order. In this case, inference results using the Inference Engine samples may be incorrect. By default, Inference Engine samples and demos expect input with BGR channels order. If you trained your model to work with RGB order, you need to manually rearrange the default channels order in the sample or demo application or reconvert your model using the Model Optimizer tool with --reverse_input_channels argument.
I suggest you validate this by inferring your model with the Hello Classification Python Sample instead since this is one of the official samples provided to test the model's functionality.
You may refer to this "Intel Math Kernel Library for Deep Neural Network" for deeper explanation regarding the input shape.
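On the application side, the image still has to be handed over in the layout the network expects. The NumPy sketch below shows one way to turn a single HWC image into a (1, 3, 180, 180) NCHW batch, with an optional RGB/BGR swap. The `to_nchw` helper and the synthetic image are assumptions for illustration, not part of the OpenVINO API. It also shows that `np.resize`, as used in the question's `load_source`, does not resample an image but simply reuses the flattened values, which may be worth checking as a contributing factor.

```python
import numpy as np

# Stand-in for a decoded 180x180 RGB image in HWC layout (height, width, channels).
image_hwc = np.arange(180 * 180 * 3, dtype=np.float32).reshape(180, 180, 3)


def to_nchw(image_hwc, reverse_channels=False):
    """Convert one HWC image to a batch of one in NCHW layout."""
    if reverse_channels:
        image_hwc = image_hwc[..., ::-1]  # RGB <-> BGR
    chw = np.transpose(image_hwc, (2, 0, 1))  # channels first
    return np.ascontiguousarray(chw[np.newaxis, ...])  # add batch axis


blob = to_nchw(image_hwc)
print(blob.shape)  # (1, 3, 180, 180)

# np.resize does not resample: it reuses the flattened values to fill the new shape.
not_an_image = np.resize(image_hwc, (180, 180))
print(not_an_image.shape)  # (180, 180)
```

If the source image is not already 180x180, it would need a real resampling step (for example from an image library) before this conversion.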

Delayed echo of sin - cannot reproduce Tensorflow result in Keras

I am experimenting with LSTMs in Keras with little to no luck. At some point I decided to scale back to the most basic problems in order to finally achieve some positive result.
However, even with simplest problems I find that Keras is unable to converge while the implementation of the same problem in Tensorflow gives stable result.
I am unwilling to just switch to Tensorflow without understanding why Keras keeps diverging on any problem I attempt.
My problem is a many-to-many sequence prediction of delayed sin echo, example below:
Blue line is a network input sequence, red dotted line is an expected output.
The experiment was inspired by this repo, and a working Tensorflow solution was also created from it.
The relevant excerpts from my code are below, and the full version of my minimal reproducible example is available here.
Keras model:
model = Sequential()
model.add(LSTM(n_hidden,
input_shape=(n_steps, n_input),
return_sequences=True))
model.add(TimeDistributed(Dense(n_input, activation='linear')))
model.compile(loss=custom_loss,
optimizer=keras.optimizers.Adam(lr=learning_rate),
metrics=[])
Tensorflow model:
x = tf.placeholder(tf.float32, [None, n_steps, n_input])
y = tf.placeholder(tf.float32, [None, n_steps])
weights = {
'out': tf.Variable(tf.random_normal([n_hidden, n_steps], seed = SEED))
}
biases = {
'out': tf.Variable(tf.random_normal([n_steps], seed = SEED))
}
lstm = rnn.LSTMCell(n_hidden, forget_bias=1.0)
outputs, states = tf.nn.dynamic_rnn(lstm, inputs=x,
dtype=tf.float32,
time_major=False)
h = tf.transpose(outputs, [1, 0, 2])
pred = tf.nn.bias_add(tf.matmul(h[-1], weights['out']), biases['out'])
individual_losses = tf.reduce_sum(tf.squared_difference(pred, y),
reduction_indices=1)
loss = tf.reduce_mean(individual_losses)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate) \
.minimize(loss)
I claim that other parts of code (data_generation, training) are completely identical. But learning progress with Keras stalls early and yields unsatisfactory predictions. Graphs of logloss for both libraries and example predictions are attached below:
Logloss for Tensorflow-trained model:
Logloss for Keras-trained model:
It's not easy to read from the graphs, but Tensorflow reaches target_loss=0.15 and stops early after about 10k batches, whereas Keras uses up all 13k batches and only reaches a loss of about 1.5. In a separate experiment where Keras ran for 100k batches, it got no further, stalling around 1.0.
Figures below contain: black line - model input signal, green dotted line - ground truth output, red line - acquired model output.
Predictions of Tensorflow-trained model:
Predictions of Keras-trained model:
Thank you for suggestions and insights, dear colleagues!
OK, I have managed to solve this. The Keras implementation now converges steadily to a sensible solution too:
The models were in fact not identical. If you inspect the Tensorflow model from the question carefully, you can verify for yourself that its actual Keras equivalent is the one listed below, not the one stated in the question:
model = Sequential()
model.add(LSTM(n_hidden,
input_shape=(n_steps, n_input),
return_sequences=False))
model.add(Dense(n_steps, input_shape=(n_hidden,), activation='linear'))
model.compile(loss=custom_loss,
optimizer=keras.optimizers.Adam(lr=learning_rate),
metrics=[])
I will elaborate. The working solution here takes the last output vector of size n_hidden produced by the LSTM as an intermediate activation, which is then fed to the Dense layer.
So, in a way, the actual prediction here is made by a regular perceptron.
One extra takeaway: the source of the mistake in the original Keras solution is already evident from the inference examples attached to the question. We see there that earlier timesteps fail utterly, while later timesteps are near perfect. These earlier timesteps correspond to the states of the LSTM when it has just been initialized on a new window and has no context yet.
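The equivalence can be checked on shapes alone. The NumPy sketch below uses random arrays as a stand-in for the LSTM outputs (the sizes are arbitrary assumptions) and shows that the Tensorflow graph's `h[-1]` after the transpose is just the last timestep of each sequence, which is what `return_sequences=False` followed by `Dense(n_steps)` consumes.

```python
import numpy as np

batch, n_steps, n_hidden = 4, 10, 8
rng = np.random.default_rng(0)

# Stand-in for the LSTM output sequence, shape (batch, n_steps, n_hidden).
outputs = rng.normal(size=(batch, n_steps, n_hidden))
w_out = rng.normal(size=(n_hidden, n_steps))
b_out = rng.normal(size=(n_steps,))

# What the Tensorflow graph computes: transpose to time-major, keep the last step.
h = np.transpose(outputs, (1, 0, 2))
pred_tf = h[-1] @ w_out + b_out

# Equivalent of LSTM(return_sequences=False) followed by Dense(n_steps).
last_step = outputs[:, -1, :]
pred_keras = last_step @ w_out + b_out

print(pred_tf.shape)  # (4, 10)
print(np.allclose(pred_tf, pred_keras))  # True
```

By contrast, `return_sequences=True` with `TimeDistributed(Dense(n_input))` predicts each output step from the hidden state at that same step only, so early steps have seen very little of the window.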

Tensorflow load pre-trained model use different optimizer

I want to load a pre-trained model (optimized by AdadeltaOptimizer) and continue training with SGD (GradientDescentOptimizer). The models are saved and loaded with tensorlayer API:
save model:
import tensorlayer as tl
tl.files.save_npz(network.all_params,
name=model_dir + "model-%d.npz" % global_step)
load model:
load_params = tl.files.load_npz(path=resume_dir + '/', name=model_name)
tl.files.assign_params(sess, load_params, network)
If I continue training with adadelta, the training loss (cross entropy) looks normal (start at a close value as the loaded model). However, if I change the optimizer to SGD, the training loss would be as large as a newly initialized model.
I took a look at the model-xxx.npz file from tl.files.save_npz. It only saves all model parameters as ndarray. I'm not sure how the optimizer or learning rate is involved here.
You would probably have to import the tensor for the loss function (the cross-entropy) that previously fed into your Adadelta optimizer into a variable. Then just feed it through your SGD optimizer instead.
saver = tf.train.import_meta_graph('filename.meta')
saver.restore(sess,tf.train.latest_checkpoint('./'))
graph = tf.get_default_graph()
cross_entropy = graph.get_tensor_by_name("entropy:0") #Tensor to import
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cross_entropy)
In this case, I tagged the cross-entropy Tensor with the name entropy before training my pre-trained model, like this:
tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv), name = 'entropy')
If you are unable to make changes to your pre-trained model, you can obtain the list of Tensors in your model (after you have imported it) from the graph and deduce which Tensor you require. I have no experience with Tensorlayer, so this guide is meant to provide more of an understanding. You can take a look at Tensorlayer-Layers, which should explain how to obtain your Tensor. As Tensorlayer is built on top of Tensorflow, most of the functions should still be available.
You can specify the parameters you want to save in your checkpoint file.
save_npz([save_list, name, sess])
In the save_list you're specifying only the network parameters that don't contain the optimizer parameters, thus no learning rate or any other optimizer parameters.
If you want to save the current learning rate (in order to use the same exact learning rate when you restore the model) you have to add it to the save_list, like that:
save_npz(network.all_params + [learning_rate])
(I assume that all_params is a list. Note that list.extend returns None, so the lists are concatenated here instead.)
Since you want to change the optimizer, I suggest you save only the learning_rate as an optimizer parameter, and none of the other variables that the optimizer creates.
That way, you'll be able to change the optimizer and still restore the model. Otherwise (if you put any other optimizer variable in your checkpoint), the graph you try to restore won't find the variables in which to place the saved values, and you won't be able to change the optimizer.
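As a rough illustration of a checkpoint that holds only what is put into the save list, here is a plain NumPy sketch. It does not use the TensorLayer API; the parameter arrays and the learning rate are hypothetical stand-ins, and the file is an in-memory buffer.

```python
import io

import numpy as np

# Hypothetical model parameters and learning rate.
params = [np.ones((3, 2), dtype=np.float32), np.zeros(2, dtype=np.float32)]
learning_rate = np.float32(0.01)

# list.extend mutates in place and returns None, so concatenate to get a new list.
save_list = params + [learning_rate]

# Store every entry under an automatic name (arr_0, arr_1, ...).
buffer = io.BytesIO()
np.savez(buffer, *save_list)
buffer.seek(0)

# Only the arrays that were listed come back; no optimizer slots are included.
with np.load(buffer) as checkpoint:
    restored = [checkpoint[f"arr_{i}"] for i in range(len(checkpoint.files))]

print(len(restored))  # 3
print(restored[0].shape)  # (3, 2)
```

Because nothing optimizer-specific is stored apart from the learning rate, the restored values can be assigned to a graph built with a different optimizer.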
For the 2.0 version, use this:
https://tensorlayer.readthedocs.io/en/latest/user/get_start_advance.html#pre-trained-cnn
import numpy as np
import tensorlayer as tl
vgg = tl.models.vgg16(pretrained=True)
img = tl.vis.read_image('data/tiger.jpeg')
img = tl.prepro.imresize(img, (224, 224)).astype(np.float32) / 255
output = vgg(img, is_train=False)

DeepLearning Anomaly Detection for images

I am still relatively new to the world of Deep Learning. I wanted to create a Deep Learning model (preferably using Tensorflow/Keras) for image anomaly detection. By anomaly detection I mean, essentially a OneClassSVM.
I have already tried sklearn's OneClassSVM using HOG features from the image. I was wondering if there is some example of how I can do this in deep learning. I looked up but couldn't find one single code piece that handles this case.
One way of doing this in Keras is with the KerasRegressor wrapper, which implements scikit-learn's regressor interface. Useful information can also be found in the source code of that module. Basically, you first have to define your network model, for example:
from keras.layers import Input, Dense
from keras.models import Model
def simple_model():
    #Input layer
    data_in = Input(shape=(13,))
    #First layer, fully connected, ReLU activation
    layer_1 = Dense(13,activation='relu',kernel_initializer='normal')(data_in)
    #second layer...etc
    layer_2 = Dense(6,activation='relu',kernel_initializer='normal')(layer_1)
    #Output, single node without activation
    data_out = Dense(1, kernel_initializer='normal')(layer_2)
    #Save and Compile model
    model = Model(inputs=data_in, outputs=data_out)
    #you may choose any loss or optimizer function, be careful which you chose
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
Then, pass it to the KerasRegressor builder and fit with your data:
from keras.wrappers.scikit_learn import KerasRegressor
#chose your epochs and batches
regressor = KerasRegressor(build_fn=simple_model, epochs=100, batch_size=64)
#fit with your data
regressor.fit(data, labels, epochs=100)
For which you can now do predictions or obtain its score:
p = regressor.predict(data_test) #obtain predicted value
score = regressor.score(data_test, labels_test) #obtain test score
In your case, as you need to tell anomalous images apart from the ones that are OK, one approach you can take is to train your regressor by passing anomalous images labeled 1 and OK images labeled 0.
This will make your model return a value closer to 1 when the input is an anomalous image, enabling you to threshold the output to get the desired result. You can think of this output as its R^2 coefficient to the "Anomalous Model" you trained as 1 (perfect match).
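Thresholding the regressor output could look like the following minimal sketch. The scores and the 0.5 cut-off are made-up values for illustration; in practice the scores would come from `regressor.predict` and the threshold would be tuned on held-out data.

```python
import numpy as np

# Hypothetical regressor outputs for five test images (1 = anomalous, 0 = ok).
scores = np.array([0.03, 0.91, 0.48, 0.77, 0.12])
threshold = 0.5  # assumed cut-off; tune it on held-out data

is_anomalous = scores > threshold
print(is_anomalous.tolist())  # [False, True, False, True, False]
```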
Also, as you mentioned, Autoencoders are another way to do anomaly detection. For this I suggest you take a look at the Keras Blog post Building Autoencoders in Keras, where they explain in detail about the implementation of them with the Keras library.
It is worth noticing that Single-class classification is another way of saying Regression.
Classification tries to find a probability distribution among the N possible classes, and you usually pick the most probable class as the output (that is why most Classification Networks use Sigmoid activation on their output labels, as it has range [0, 1]). Its output is discrete/categorical.
Similarly, Regression tries to find the best model that represents your data, by minimizing the error or some other metric (like the well-known R^2 metric, or Coefficient of Determination). Its output is a real number/continuous (and the reason why most Regression Networks don't use activations on their outputs). I hope this helps, good luck with your coding.