How to get reproducible results in Amazon SageMaker with TensorFlow Estimator? - tensorflow

I am currently using the AWS SageMaker Python SDK to train an EfficientNet model (https://github.com/qubvel/efficientnet) on my data. Specifically, I use the TensorFlow estimator as below. This code runs in a SageMaker notebook instance.
import sagemaker
from sagemaker.tensorflow.estimator import TensorFlow
### sagemaker version = 1.50.17, python version = 3.6
estimator = TensorFlow("train.py", py_version = "py3", framework_version = "2.1.0",
                       role = sagemaker.get_execution_role(),
                       train_instance_type = "ml.m5.xlarge",
                       train_instance_count = 1,
                       image_name = 'xxx.dkr.ecr.xxx.amazonaws.com/xxx',
                       hyperparameters = {...},  # list of hyperparameters here: epochs, batch size
                       subnets = [xxx],
                       security_group_ids = [xxx])

estimator.fit({
    'class_1': 's3_path_class_1',
    'class_2': 's3_path_class_2'
})
The code for train.py contains the usual training procedure: getting the images and labels from S3, transforming them into the right array shape for the EfficientNet input, and splitting them into train, validation, and test sets. In order to get reproducible results, I use the following reset_random_seeds function (from If Keras results are not reproducible, what's the best practice for comparing models and choosing hyper parameters?) before calling the EfficientNet model itself.
### code of train.py
import os
os.environ['PYTHONHASHSEED'] = str(1)
import numpy as np
import tensorflow as tf
import efficientnet.tfkeras as efn
import random

### tensorflow version = 2.1.0
### tf.keras version = 2.2.4-tf
### efficientnet version = 1.1.0

def reset_random_seeds():
    os.environ['PYTHONHASHSEED'] = str(1)
    tf.random.set_seed(1)
    np.random.seed(1)
    random.seed(1)

if __name__ == "__main__":
    ### code for getting training data
    ### ... (I have made sure that the training input is the same every time I re-run the code)
    ### end of code

    reset_random_seeds()
    model = efn.EfficientNetB5(include_top = False,
                               weights = 'imagenet',
                               input_shape = (80, 80, 3),
                               pooling = 'avg',
                               classes = 3)
    model.compile(optimizer = 'Adam', loss = 'categorical_crossentropy')
    model.fit(X_train, Y_train, batch_size = 64, epochs = 30, shuffle = True, verbose = 2)

    ### Prediction section here
However, each time I run the notebook instance, I get a different result from the previous run. When I switch train_instance_type to "local", I always get the same result on each run. Therefore, is the non-reproducible result caused by the training instance type that I have chosen? This instance (ml.m5.xlarge) has 4 vCPUs, 16 GiB of memory, and no GPU. If so, how can I obtain reproducible results on this training instance?

Is it possible that your inconsistent results come from tf.random.set_seed()? I came across a related post here: Tensorflow: Different results with the same random seed
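Beyond seeding, multi-threaded CPU execution is a common source of run-to-run drift. As a hedged sketch (the TF_DETERMINISTIC_OPS flag and the single-thread settings are my assumptions for TF 2.1, and whether they fully fix an ml.m5.xlarge run is untested here), train.py could additionally pin TensorFlow's threading:
import os
os.environ['PYTHONHASHSEED'] = str(1)
os.environ['TF_DETERMINISTIC_OPS'] = '1'  # assumption: TF >= 2.1 honors this flag
import random
import numpy as np
import tensorflow as tf

def reset_random_seeds():
    tf.random.set_seed(1)
    np.random.seed(1)
    random.seed(1)
    # Single-threaded execution avoids nondeterministic floating-point
    # reduction orderings on a multi-core CPU (slower, but reproducible).
    tf.config.threading.set_inter_op_parallelism_threads(1)
    tf.config.threading.set_intra_op_parallelism_threads(1)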

Related

Exception encountered when calling layer "feature_extraction_layer" (type KerasLayer): downloading a model from tensorflow hub

I want to create a model from TensorFlow Hub, but when I try to load the model I am getting an error (shown in a screenshot in the original post).
# lets create a model function
import tensorflow_hub as hub             # assumed import: hub is used below
from tensorflow.keras import Sequential
from tensorflow.keras import layers      # assumed import: layers.Dense is used below

IMAGE_SHAPE = (224, 224)

def create_model(model_url, num_classes=10):
    """
    Create a Keras Sequential model.
    Args:
        model_url (str): a TensorFlow Hub feature extraction URL
        num_classes (int): number of output neurons in the output layer
    """
    # download a pretrained model.
    feature_extraction_layer = hub.KerasLayer(model_url, trainable=False,
                                              name='feature_extraction_layer',
                                              input_shape=IMAGE_SHAPE + (3,))
    # create the model
    model = Sequential([feature_extraction_layer,
                        layers.Dense(num_classes, activation='softmax', name='output_layer')])
    return model
I tried running the same code on Google Colab, but it didn't work; there it also gave me the same error.

Tensorflow: Classifying images in batches

I have followed this TensorFlow tutorial to classify images using the transfer learning approach. Using almost 16,000 manually classified images (with about a 40/60 split of 1/0) added on top of the pre-trained MobileNet V2 model, my model achieved 96% accuracy on the hold-out test set. I then saved the resulting model.
Next, I would like to use this trained model to classify new images. To do so, I have adapted one portion of the tutorial's code (at the end, where it says # Retrieve a batch of images from the test set) in the way described below. The code works; however, it only processes one batch of 32 images and that's it (there are hundreds of images in the source folder). What am I missing here? Please advise.
# Import libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import preprocessing
from tensorflow.keras.preprocessing import image_dataset_from_directory
import matplotlib.pyplot as plt
import numpy as np
import os
# Load saved model
model = tf.keras.models.load_model('/model')
# Re-compile model
base_learning_rate = 0.0001
model.compile(optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate),
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])
# Define paths
PATH = 'Data/'
new_dir = os.path.join(PATH, 'New_images') # New_images must contain at least one class (sub-folder)
IMG_SIZE = (640, 640)
BATCH_SIZE = 32
new_dataset = image_dataset_from_directory(new_dir, shuffle=True, batch_size=BATCH_SIZE, image_size=IMG_SIZE)
# Retrieve a batch of images from the test set
image_batch, label_batch = new_dataset.as_numpy_iterator().next()
predictions = model.predict_on_batch(image_batch).flatten()
# Apply a sigmoid since our model returns logits
predictions = tf.nn.sigmoid(predictions)
predictions = tf.where(predictions < 0.5, 0, 1)
print('Predictions:\n', predictions.numpy())
len(new_dataset) # equals 25, i.e., there are 25 batches
Replace this code:
# Retrieve a batch of images from the test set
image_batch, label_batch = new_dataset.as_numpy_iterator().next()
predictions = model.predict_on_batch(image_batch).flatten()
with this one:
predictions = model.predict(new_dataset,batch_size=BATCH_SIZE).flatten()
tf.data.Dataset objects can be passed directly to the predict() method. Reference
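For completeness, a hedged sketch of the question's post-processing on top of predict() (shuffle=False is my addition so predictions stay aligned with the file order; model, new_dir, BATCH_SIZE, and IMG_SIZE are as defined in the question):
# shuffle=False keeps predictions aligned with the directory's file order
new_dataset = image_dataset_from_directory(new_dir, shuffle=False, batch_size=BATCH_SIZE, image_size=IMG_SIZE)

# predict() consumes all 25 batches, not just the first
predictions = model.predict(new_dataset, batch_size=BATCH_SIZE).flatten()

# Apply a sigmoid since the model returns logits, then threshold at 0.5
predictions = tf.where(tf.nn.sigmoid(predictions) < 0.5, 0, 1)
print('Predictions:\n', predictions.numpy())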

Is there a way to find the batch size for a tf.data.Dataset

I understand you can assign a batch size to a Dataset and return a new dataset object. Is there an API to interrogate the batch size given a dataset object?
I am trying to find the calls at:
https://www.tensorflow.org/api_docs/python/tf/data/Dataset
When you call the .batch(32) method, it returns a tensorflow.python.data.ops.dataset_ops.BatchDataset object. As documented in the Tensorflow Documentation, this kind of object has a private attribute called ._batch_size, which contains a tensor holding the batch size.
In TensorFlow 2.x you just call the .numpy() method of this tensor to convert it to numpy.int64.
In TensorFlow 1.x you need to call its .eval() method.
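A two-line sketch of that in TensorFlow 2.x (keep in mind ._batch_size is a private attribute, so it may change between releases):
import tensorflow as tf
ds = tf.data.Dataset.range(100).batch(32)
print(ds._batch_size.numpy())  # 32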
I do not know if you can just get it as an attribute, but you could just iterate through the dataset once and print the shape:
# create a simple tf.data.Dataset with batchsize 3
import tensorflow as tf
f = tf.data.Dataset.range(10).batch(3)  # Dataset with batch_size 3

# iterating once
for one_batch in f:
    print('batch size:', one_batch.shape[0])
    break
If you know your dataset has targets/labels as well, you have to iterate as follows:
# iterating once
for one_batch_x, one_batch_y in f:
    print('batch size:', one_batch_x.shape[0])
    break
In both cases, it will print:
batch size: 3
In Tensorflow 1.* access batch_size via dataset._dataset._batch_size:
import tensorflow as tf
import numpy as np
print(tf.__version__) # 1.14.0
dataset = tf.data.Dataset.from_tensor_slices(np.random.randint(0, 2, 100)).batch(10)
with tf.compat.v1.Session() as sess:
    batch_size = sess.run(dataset._dataset._batch_size)
    print(batch_size)  # 10
In Tensorflow 2 you can access via dataset._batch_size:
import tensorflow as tf
import numpy as np
print(tf.__version__) # 2.0.1
dataset = tf.data.Dataset.from_tensor_slices(np.random.randint(0, 2, 100)).batch(10)
batch_size = dataset._batch_size.numpy()
print(batch_size) # 10

Can't import frozen graph with BatchNorm layer

I have trained a Keras model based on this repo.
After the training I save the model as checkpoint files like this:
sess = tf.keras.backend.get_session()
saver = tf.train.Saver()
saver.save(sess, current_run_path + '/checkpoint_files/model_{}.ckpt'.format(date))
Then I restore the graph from the checkpoint files and freeze it using the standard tf freeze_graph script. When I want to restore the frozen graph I get the following error:
Input 0 of node Conv_BN_1/cond/ReadVariableOp/Switch was passed float from Conv_BN_1/gamma:0 incompatible with expected resource
How can I fix this issue?
Edit: My problem is related to this question. Unfortunately, I can't use the workaround.
Edit 2:
I have opened an issue on github and created a gist to reproduce the error.
https://github.com/keras-team/keras/issues/11032
I just resolved the same issue. I connected these few answers (1, 2, 3) and realized that the issue originated from the batch norm layer's working state: training or inference. So, in order to resolve the issue, you just need to place one line before loading your model:
keras.backend.set_learning_phase(0)
Complete example, to export the model:
import tensorflow as tf
from tensorflow.python.framework import graph_io
from tensorflow.keras.applications.inception_v3 import InceptionV3

def freeze_graph(graph, session, output):
    with graph.as_default():
        graphdef_inf = tf.graph_util.remove_training_nodes(graph.as_graph_def())
        graphdef_frozen = tf.graph_util.convert_variables_to_constants(session, graphdef_inf, output)
        graph_io.write_graph(graphdef_frozen, ".", "frozen_model.pb", as_text=False)

tf.keras.backend.set_learning_phase(0)  # this line most important
base_model = InceptionV3()
session = tf.keras.backend.get_session()

INPUT_NODE = base_model.inputs[0].op.name
OUTPUT_NODE = base_model.outputs[0].op.name
freeze_graph(session.graph, session, [out.op.name for out in base_model.outputs])
To load the *.pb model:
from PIL import Image
import numpy as np
import tensorflow as tf

# https://i.imgur.com/tvOB18o.jpg
im = Image.open("/home/chichivica/Pictures/eagle.jpg").resize((299, 299), Image.BICUBIC)
im = np.array(im) / 255.0
im = im[None, ...]

graph_def = tf.GraphDef()
with tf.gfile.GFile("frozen_model.pb", "rb") as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    net_inp, net_out = tf.import_graph_def(
        graph_def, return_elements=["input_1", "predictions/Softmax"]
    )
with tf.Session(graph=graph) as sess:
    out = sess.run(net_out.outputs[0], feed_dict={net_inp.outputs[0]: im})
    print(np.argmax(out))
This is a bug in TensorFlow 1.1x and, as another answer stated, it is because of the internal batch norm learning vs. inference state. In TF 1.14.0 you actually get a cryptic error when trying to freeze a batch norm layer.
Using set_learning_phase(0) will put the batch norm layer (and probably others, like dropout) into inference mode, and thus the batch norm layer will not work during training, leading to reduced accuracy.
My solution is this:
Create the model using a function (do not use K.set_learning_phase(0)):
def create_model():
    inputs = Input(...)
    ...
    return model

model = create_model()
Train model
Save weights:
model.save_weights("weights.h5")
Clear session (important so layer names are the same) and set learning phase to 0:
K.clear_session()
K.set_learning_phase(0)
Recreate model and load weights:
model = create_model()
model.load_weights("weights.h5")
Freeze as before
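Pulling those steps together, a minimal sketch (create_model is the function from the first step; x_train, y_train, and the training settings are placeholders):
from tensorflow.keras import backend as K

# build and train normally, then save only the weights
model = create_model()
model.fit(x_train, y_train, epochs=10)  # placeholder training call
model.save_weights("weights.h5")

# reset the graph so layer names match, then force inference mode
K.clear_session()
K.set_learning_phase(0)

# rebuild the same architecture, load the trained weights, and freeze as above
model = create_model()
model.load_weights("weights.h5")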
Thanks for pointing out the main issue! I found that keras.backend.set_learning_phase(0) does not work sometimes, at least in my case.
Another approach might be to mark every layer as non-trainable instead:
for l in keras_model.layers:
    l.trainable = False

Can not save model using model.save following multi_gpu_model in Keras

Following the upgrade to Keras 2.0.9, I have been using the multi_gpu_model utility but I can't save my models or best weights using
model.save('path')
The error I get is
TypeError: can’t pickle module objects
I suspect there is some problem gaining access to the model object. Is there a workaround for this issue?
To be honest, the easiest approach to this is to examine the multi-GPU parallel model using
parallel_model.summary()
(the parallel model is simply the model after applying the multi_gpu function). This clearly highlights the actual model (in, I think, the penultimate layer; I am not at my computer right now). Then you can use the name of this layer to get the original model:
model = parallel_model.get_layer('sequential_1')
Often it's called sequential_1, but if you are using a published architecture, it may be 'googlenet' or 'alexnet'. You will see the name of the layer in the summary.
Then it's simple to just save:
model.save('path')
Maxim's approach works, but I think it's overkill.
Rem: you will need to compile both the model and the parallel model.
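A consolidated sketch of this answer (the layer name 'sequential_1', the compile settings, and x_train/y_train are assumptions; read the real layer name off your own parallel_model.summary()):
parallel_model.summary()  # locate the wrapped single-GPU model layer
inner_model = parallel_model.get_layer('sequential_1')  # name varies by architecture

# compile both the inner model and the parallel model
inner_model.compile(loss='categorical_crossentropy', optimizer='adam')
parallel_model.compile(loss='categorical_crossentropy', optimizer='adam')

parallel_model.fit(x_train, y_train, epochs=10, batch_size=256)  # placeholder training call
inner_model.save('model.h5')  # the inner model saves without the pickle error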
Workaround
Here's a patched version that doesn't fail while saving:
from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf
def multi_gpu_model(model, gpus):
    if isinstance(gpus, (list, tuple)):
        num_gpus = len(gpus)
        target_gpu_ids = gpus
    else:
        num_gpus = gpus
        target_gpu_ids = range(num_gpus)

    def get_slice(data, i, parts):
        shape = tf.shape(data)
        batch_size = shape[:1]
        input_shape = shape[1:]
        step = batch_size // parts
        if i == num_gpus - 1:
            size = batch_size - step * i
        else:
            size = step
        size = tf.concat([size, input_shape], axis=0)
        stride = tf.concat([step, input_shape * 0], axis=0)
        start = stride * i
        return tf.slice(data, start, size)

    all_outputs = []
    for i in range(len(model.outputs)):
        all_outputs.append([])

    # Place a copy of the model on each GPU,
    # each getting a slice of the inputs.
    for i, gpu_id in enumerate(target_gpu_ids):
        with tf.device('/gpu:%d' % gpu_id):
            with tf.name_scope('replica_%d' % gpu_id):
                inputs = []
                # Retrieve a slice of the input.
                for x in model.inputs:
                    input_shape = tuple(x.get_shape().as_list())[1:]
                    slice_i = Lambda(get_slice,
                                     output_shape=input_shape,
                                     arguments={'i': i,
                                                'parts': num_gpus})(x)
                    inputs.append(slice_i)

                # Apply model on slice
                # (creating a model replica on the target device).
                outputs = model(inputs)
                if not isinstance(outputs, list):
                    outputs = [outputs]

                # Save the outputs for merging back together later.
                for o in range(len(outputs)):
                    all_outputs[o].append(outputs[o])

    # Merge outputs on CPU.
    with tf.device('/cpu:0'):
        merged = []
        for name, outputs in zip(model.output_names, all_outputs):
            merged.append(concatenate(outputs,
                                      axis=0, name=name))
        return Model(model.inputs, merged)
You can use this multi_gpu_model function until the bug is fixed in keras. Also, when loading the model, it's important to provide the tensorflow module object:
model = load_model('multi_gpu_model.h5', {'tf': tf})
How it works
The problem is with the import tensorflow line in the middle of multi_gpu_model:
def multi_gpu_model(model, gpus):
    ...
    import tensorflow as tf
    ...
This creates a closure for the get_slice lambda function, which includes the number of GPUs (that's ok) and the tensorflow module (not ok). Saving the model tries to serialize all layers, including the ones that call get_slice, and fails exactly because tf is in the closure.
The solution is to move the import out of multi_gpu_model, so that tf becomes a global object, though one still visible inside get_slice. This fixes the saving problem, but when loading, one has to provide tf explicitly.
It's something that needs a little workaround: load the multi_gpu_model weights into the regular model weights, e.g.:
#1, instantiate your base model on a cpu
with tf.device("/cpu:0"):
    model = create_model()

#2, put your model to multiple gpus, say 2
multi_model = multi_gpu_model(model, 2)

#3, compile both models
model.compile(loss=your_loss, optimizer=your_optimizer(lr))
multi_model.compile(loss=your_loss, optimizer=your_optimizer(lr))

#4, train the multi gpu model
# multi_model.fit() or multi_model.fit_generator()

#5, save weights
model.set_weights(multi_model.get_weights())
model.save(filepath=filepath)
Reference: https://github.com/fchollet/keras/issues/8123