I am currently working on tensorflow model and got a issue about its reproducibility.
I built a simple Dense model initialized with a constant value and trained with dummy data.
import tensorflow as tf
weight_init = tf.keras.initializers.Constant(value=0.001)
inputs = tf.keras.Input(shape=(5,))
layer1 = tf.keras.layers.Dense(5,
activation=tf.nn.leaky_relu,
kernel_initializer=weight_init,
bias_initializer=weight_init)
outputs = layer1(inputs)
model = tf.keras.Model(inputs=inputs, outputs=outputs, name="test")
model.compile(loss='mse',
optimizer = 'Adam'
)
model.fit([[111,1.02,-1.98231,1,1],
[112,1.02,-1.98231,1,1],
[113,1.02,-1.98231,1,1],
[114,1.02,-1.98231,1,1],
[115,1.02,-1.98231,1,1],
[116,1.02,-1.98231,1,1],
[117,1.02,-1.98231,1,1]],
[1,1,1,1,1,1,2], epochs = 3, batch_size=1)
Even though I set initial value of model as 0.001, loss of training changes with every attempt....
What am I missing here? Are there any additional values for me to fix to constant?
What is more surprising is that if I change batch_size to 16, loss doesn't change with attempt
Please.. teach me guys...
ćšććš
Since keras.model.fit() has default kwarg shuffle=True, data will be shuffled cross batch. If you change batch_size to any integer that larger than data length, any shuffle will be invalid, because there left only one batch.
So, add shuffle=False in model.fit() will achieve reproducibility here.
Additionally, if your model grows bigger, real reproducibility problem will arise, i.e, there will be slight error in the results of two successive calculations, even though you do no random or shuffle, but just click run, then
click run. We draw this as determinism for reproducibility.
Determinism is a good question that usually easily ignored by many users.
Let's start with the conclusion, i.e., reproducibility is influence by:
random seed (operation_seed+hidden_global_seed)
operation determinism
How to do? Tensorflow determinism has declared precisicely, i.e, add the following codes before building or restoring the model.
tf.keras.utils.set_random_seed(some_seed)
tf.config.experimental.enable_op_determinism()
But it can be used only if you rely heavily on reproducibility, since tf.config.experimental.enable_op_determinism() will reduce the speed significantly. The deeper reason is that, hardware reduces some accuracy in order to speed up the calculation, which usually does not affect our algorithm. Except in the deep learning, the model is very large, leading errors occured easily, and the training cycle is very long, leading accumulated errors. If in a regression model, any extra error is unacceptable, so we need deterministic algorithm in this occasion.
Related
I wish to implement early stopping with Keras and sklean's GridSearchCV.
The working code example below is modified from How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras. The data set may be downloaded from here.
The modification adds the Keras EarlyStopping callback class to prevent over-fitting. For this to be effective it requires the monitor='val_acc' argument for monitoring validation accuracy. For val_acc to be available KerasClassifier requires the validation_split=0.1 to generate validation accuracy, else EarlyStopping raises RuntimeWarning: Early stopping requires val_acc available!. Note the FIXME: code comment!
Note we could replace val_acc by val_loss!
Question: How can I use the cross-validation data set generated by the GridSearchCV k-fold algorithm instead of wasting 10% of the training data for an early stopping validation set?
# Use scikit-learn to grid search the learning rate and momentum
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD
# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0):
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
optimizer = SGD(lr=learn_rate, momentum=momentum)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
return model
# Early stopping
from keras.callbacks import EarlyStopping
stopper = EarlyStopping(monitor='val_acc', patience=3, verbose=1)
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = KerasClassifier(
build_fn=create_model,
epochs=100, batch_size=10,
validation_split=0.1, # FIXME: Instead use GridSearchCV k-fold validation data.
verbose=2)
# define the grid search parameters
learn_rate = [0.01, 0.1]
momentum = [0.2, 0.4]
param_grid = dict(learn_rate=learn_rate, momentum=momentum)
grid = GridSearchCV(estimator=model, param_grid=param_grid, verbose=2, n_jobs=1)
# Fitting parameters
fit_params = dict(callbacks=[stopper])
# Grid search.
grid_result = grid.fit(X, Y, **fit_params)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
print("%f (%f) with: %r" % (mean, stdev, param))
[Answer after the question was edited & clarified:]
Before rushing into implementation issues, it is always a good practice to take some time to think about the methodology and the task itself; arguably, intermingling early stopping with the cross validation procedure is not a good idea.
Let's make up an example to highlight the argument.
Suppose that you indeed use early stopping with 100 epochs, and 5-fold cross validation (CV) for hyperparameter selection. Suppose also that you end up with a hyperparameter set X giving best performance, say 89.3% binary classification accuracy.
Now suppose that your second-best hyperparameter set, Y, gives 89.2% accuracy. Examining closely the individual CV folds, you see that, for your best case X, 3 out of the 5 CV folds exhausted the max 100 epochs, while in the other 2 early stopping kicked in, say in 95 and 93 epochs respectively.
Now imagine that, examining your second-best set Y, you see that again 3 out of the 5 CV folds exhausted the 100 epochs, while the other 2 both stopped early enough at ~ 80 epochs.
What would be your conclusion from such an experiment?
Arguably, you would have found yourself in an inconclusive situation; further experiments might reveal which is actually the best hyperparameter set, provided of course that you would have thought to look into these details of the results in the first place. And needless to say, if all this was automated through a callback, you might have missed your best model despite the fact that you would have actually tried it.
The whole CV idea is implicitly based on the "all other being equal" argument (which of course is never true in practice, only approximated in the best possible way). If you feel that the number of epochs should be a hyperparameter, just include it explicitly in your CV as such, rather than inserting it through the back door of early stopping, thus possibly compromising the whole process (not to mention that early stopping has itself a hyperparameter, patience).
Not intermingling these two techniques doesn't mean of course that you cannot use them sequentially: once you have obtained your best hyperparameters through CV, you can always employ early stopping when fitting the model in your whole training set (provided of course that you do have a separate validation set).
The field of deep neural nets is still (very) young, and it is true that it has yet to establish its "best practice" guidelines; add the fact that, thanks to an amazing community, there are all sort of tools available in open source implementations, and you can easily find yourself into the (admittedly tempting) position of mixing things up just because they happen to be available. I am not necessarily saying that this is what you are attempting to do here - I am just urging for more caution when combining ideas that may have not been designed to work along together...
[Old answer, before the question was edited & clarified - see updated & accepted answer above]
I am not sure I have understood your exact issue (your question is quite unclear, and you include many unrelated details, which is never good when asking a SO question - see here).
You don't have to (and actually should not) include any arguments about validation data in your model = KerasClassifier() function call (it is interesting why you don't feel the same need for training data here, too). Your grid.fit() will take care of both the training and validation folds. So provided that you want to keep the hyperparameter values as included in your example, this function call should be simply
model = KerasClassifier(build_fn=create_model,
epochs=100, batch_size=32,
shuffle=True,
verbose=1)
You can see some clear and well-explained examples regarding the use of GridSearchCV with Keras here.
Here is how to do it with only a single split.
fit_params['cl__validation_data'] = (X_val, y_val)
X_final = np.concatenate((X_train, X_val))
y_final = np.concatenate((y_train, y_val))
splits = [(range(len(X_train)), range(len(X_train), len(X_final)))]
GridSearchCV(estimator=model, param_grid=param_grid, cv=splits)I
If you want more splits, you can use 'cl__validation_split' with a fixed ratio and construct splits that meet that criteria.
It might be too paranoid, but I don't use the early stopping data set as a validation data set since it was indirectly used to create the model.
I also think if you are using early stopping with your final model, then it should also be done when you are doing hyper-parameter search.
I am fairly new to ML and am currently implementing a simple 3D CNN in python using tensorflow and keras. I want to optimize based on the AUC and would also like to use early stopping/save the best network in terms of AUC score. I have been using tensorflow's AUC function for this as shown below, and it works well for the training. However, the hdf5 file is not saved (despite the checkpoint save_best_only=True) and hence I cannot get the best weights for the evaluation.
Here are the relevant lines of code:
model.compile(loss='binary_crossentropy',
optimizer=keras.optimizers.Adam(lr=lr),
metrics=[tf.keras.metrics.AUC()])
model.load_weights(path_weights)
filepath = mypath
check = tf.keras.callbacks.ModelCheckpoint(filepath, monitor=tf.keras.metrics.AUC(), save_best_only=True,
mode='auto')
earlyStopping = tf.keras.callbacks.EarlyStopping(monitor=tf.keras.metrics.AUC(), patience=hyperparams['pat'],mode='auto')
history = model.fit(X_trn, y_trn,
batch_size=bs,
epochs=n_epochs,
verbose=1,
callbacks=[check, earlyStopping],
validation_data=(X_val, y_val),
shuffle=True)
Interestingly, if I only change monitor='val_loss' in the early stopping and checkpoint (not the 'metrics' in model.compile), the hdf5 file is saved but obviously gives the best result in terms of validation loss. I have also tried using mode='max' but the problem is the same.
I would very much appreciate your advise, or any other constructive ideas how to work around this problem.
Turns out that even if you add a non-keyword metric, you still need to use its handle to refer to in when you want to monitor it. In your case you can do this:
auc = tf.keras.metrics.AUC() # instantiate it here to have a shorter handle
model.compile(loss='binary_crossentropy',
optimizer=keras.optimizers.Adam(lr=lr),
metrics=[auc])
...
check = tf.keras.callbacks.ModelCheckpoint(filepath,
monitor='auc', # even use the generated handle for monitoring the training AUC
save_best_only=True,
mode='max') # determine better models according to "max" AUC.
if you want to monitor the validation AUC (which makes more sense), simply add val_ in the beginning of the handle:
check = tf.keras.callbacks.ModelCheckpoint(filepath,
monitor='val_auc', # validation AUC
save_best_only=True,
mode='max')
Another problem is that you ModelCheckpoint is saving the weights based on the minimum AUC instead of the max, which you want.
This can be changed by setting mode='max'.
What does mode='auto' do?
This setting essentially checks if the argument of monitor contains 'acc' and sets it to max. In any other case it sets uses mode='min', which is what is happening in your case.
You can confirm this here
The answer posted by Djib2011 should solve your problem. I just wanted to address the use of early stopping. Typically this is used to stop training when over fitting starts to cause the loss to increase. I think it is more effective to address the over fitting issue directly which should enable you to achieve a lower loss. You did not list your model so it is not clear how to address over fitting but some simple guidelines are as follows. If you havee several dense hidden layers at the top of the model delete most of them and just keep the final top dense layer. The more complex the model the more it is prone to over fitting. If that leads to lower training accuracy then keep the layers but add dropout layers. You might also try using regularization in the hidden dense layers. I also find it is beneficial to use the callback ReduceLROnPlateau. Set it up to monitor AUC and reduce the learning rate if it fails to improve.
I've been running into an issue lately trying to train a simple MLP.
I'm basically trying to get a network to map the XYZ position and RPY orientation of the end-effector of a robot arm (6-dimensional input) to the angle of every joint of the robot arm to reach that position (6-dimensional output), so this is a regression problem.
I've generated a dataset using the angles to compute the current position, and generated datasets with 5k, 500k and 500M sets of values.
My issue is the MLP I'm using doesn't learn anything at all. Using Tensorboard (I'm using Keras), I've realized that the output of my very first layer is always zero (see image 1), no matter what I try.
Basically, my input is a shape (6,) vector and the output is also a shape (6,) vector.
Here is what I've tried so far, without success:
I've tried MLPs with 2 layers of size 12, 24; 2 layers of size 48, 48; 4 layers of size 12, 24, 24, 48.
Adam, SGD, RMSprop optimizers
Learning rates ranging from 0.15 to 0.001, with and without decay
Both Mean Squared Error (MSE) and Mean Absolute Error (MAE) as the loss function
Normalizing the input data, and not normalizing it (the first 3 values are between -3 and +3, the last 3 are between -pi and pi)
Batch sizes of 1, 10, 32
Tested the MLP of all 3 datasets of 5k values, 500k values and 5M values.
Tested with number of epoches ranging from 10 to 1000
Tested multiple initializers for the bias and kernel.
Tested both the Sequential model and the Keras functional API (to make sure the issue wasn't how I called the model)
All 3 of sigmoid, relu and tanh activation functions for the hidden layers (the last layer is a linear activation because its a regression)
Additionally, I've tried the very same MLP architecture on the basic Boston housing price regression dataset by Keras, and the net was definitely learning something, which leads me to believe that there may be some kind of issue with my data. However, I'm at a complete loss as to what it may be as the system in its current state does not learn anything at all, the loss function just stalls starting on the 1st epoch.
Any help or lead would be appreciated, and I will gladly provide code or data if needed!
Thank you
EDIT:
Here's a link to 5k samples of the data I'm using. Columns B-G are the output (angles used to generate the position/orientation) and columns H-M are the input (XYZ position and RPY orientation). https://drive.google.com/file/d/18tQJBQg95ISpxF9T3v156JAWRBJYzeiG/view
Also, here's a snippet of the code I'm using:
df = pd.read_csv('kinova_jaco_data_5k.csv', names = ['state0',
'state1',
'state2',
'state3',
'state4',
'state5',
'pose0',
'pose1',
'pose2',
'pose3',
'pose4',
'pose5'])
states = np.asarray(
[df.state0.to_numpy(), df.state1.to_numpy(), df.state2.to_numpy(), df.state3.to_numpy(), df.state4.to_numpy(),
df.state5.to_numpy()]).transpose()
poses = np.asarray(
[df.pose0.to_numpy(), df.pose1.to_numpy(), df.pose2.to_numpy(), df.pose3.to_numpy(), df.pose4.to_numpy(),
df.pose5.to_numpy()]).transpose()
x_train_temp, x_test, y_train_temp, y_test = train_test_split(poses, states, test_size=0.2)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2)
mean = x_train.mean(axis=0)
x_train -= mean
std = x_train.std(axis=0)
x_train /= std
x_test -= mean
x_test /= std
x_val -= mean
x_val /= std
n_epochs = 100
n_hidden_layers=2
n_units=[48, 48]
inputs = Input(shape=(6,), dtype= 'float32', name = 'input')
x = Dense(units=n_units[0], activation=relu, name='dense1')(inputs)
for i in range(1, n_hidden_layers):
x = Dense(units=n_units[i], activation=activation, name='dense'+str(i+1))(x)
out = Dense(units=6, activation='linear', name='output_layer')(x)
model = Model(inputs=inputs, outputs=out)
optimizer = SGD(lr=0.1, momentum=0.4)
model.compile(optimizer=optimizer, loss='mse', metrics=['mse', 'mae'])
history = model.fit(x_train,
y_train,
epochs=n_epochs,
verbose=1,
validation_data=(x_test, y_test),
batch_size=32)
Edit 2
I've tested the architecture with a random dataset where the input was a (6,) vector where input[i] is a random number and the output was a (6,) vector with output[i] = input[i]ÂČ and the network didn't learn anything. I've also tested a random dataset where the input was a random number and the output was a linear function of the input, and the loss converged to 0 pretty quickly. In short, it seems the simple architecture is unable to map a non-linear function.
the output of my very first layer is always zero.
This typically means that the network does not "see" any pattern in the input at all, which causes it to always predict the mean of the target over the entire training set, regardless of input. Your output is in the range of -đ to đ probably with an expected value of 0, so it checks out.
My guess is that the model is too small to represent the data efficiently. I would suggest that you increase the number of parameters in the model by a factor of 10 or 100 and see if it starts seeing something. Limiting the number of parameters has a regularizing effect on the network, and strong regularization usually leads the the aforementioned derping to the mean.
I'm by no means a robotics expert, but I guess that there are a lot of situations where a small nudge in the output parameters causes a large change of the input. Let's say I'm trying to scratch my back with my left hand - the farther my hand goes to the left, the harder the task becomes, so at some point I might want to switch hands, which is a discontinuous configuration change. A bad analogy, sure, but I hope it demonstrates my hunch that there are certain places in the configuration space where small target changes cause large configuration changes.
Such large changes will cause a very large, very noisy gradient around those points. I'm not sure how well the network will work around these noisy gradients, but I would suggest as an experiment that you try to limit the training dataset to a set of outputs that are connected smoothly to one another in the configuration space of the arm, if that makes sense. Going further, you should remove any points from the dataset that are close to such configuration boundaries. To make up for that at inference time, you might instead want to sample several close-by points and choose the most common prediction as the final result. Hopefully some of those points will land in a smooth configuration area.
Also, adding batch normalization before each dense layer will help smooth the gradient and provide for more reliable training.
As for the rest of your hyperparameters:
A batch size of 32 is good, a very small batch size will make the gradient too noisy
The loss function is not critical, both MSE and MAE should work
The activation functions aren't critical, ReLU is a good default choice.
The default initializers a good enough.
Normalizing is important for Dense layers, so keep it
Train for as many epochs as you need as long as both the training and validation loss are dropping. If the validation loss hasn't dropped for 5-10 epochs you might as well stop early.
Adam is a good default choice. Start with a small learning rate and increase the learning rate at the beginning of training only if the training loss is dropping consistently over several epochs.
Further reading: 37 Reasons why your Neural Network is not working
I ended up replacing the first dense layer with a Conv1D layer and the network now seems to be learning decently. It's overfitting to my data, but that's territory I'm okay with.
I'm closing the thread for now, I'll spend some time playing with the architecture.
I want to replicate a network build with the lasagne-library in tensor flow. I'm having some trouble with the batch normalization.
This is the lasagne documentation about the used batch normalization:
http://lasagne.readthedocs.io/en/latest/modules/layers/normalization.html?highlight=batchNorm
In tensorflow I found two functions to normalize:
https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
The first one is simpler but does not let me choose the alpha parameter from lasagne (Coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
I am not clear about the difference between momentum and renorm_momentum. If I have a alpha of 0.9 in the lasagne network, can I just set both tensorflow momentums to 0.9 and expect the same behaviour?
The tf documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?
There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on your questions:
renorm_momentum only has an effect is you use batch renormalization by setting the renorm argument to True. You can ignore this if using default batch normalization.
Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": Training and inference. During training, mean and variance of the current minibatch is used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages over minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, Tensorflow only executes what it needs to. Those moving averages are not needed for training, so they normally would never be executed/updated. The tf.control_dependencies context manager forces Tensorflow to do the updates every time it computes whatever is in the code block (in this case the cost). Since the cost certainly needs to be computed exactly one per training step, this is a good way of making sure the moving averages are updated.
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
....
becomes
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
....
Finally, keep in mind to use the correct training argument for batch normalization so that either minibatch statistics or moving averages are used as intended.
So I was reading the tensorflow getstarted tutorial and I found it very hard to follow. There were a lot of explanations missing about each function and why they are necesary (or not).
In the tf.estimator section, what's the meaning or what are they supposed to be the "x_eval" and "y_eval" arrays? The x_train and y_train arrays give the desired output (which is the corresponding y coordinate) for a given x coordinate. But the x_eval and y_eval values are incorrect: for x=5, y should be -4, not -4.1. Where do those values come from? What do x_eval and y_eval mean? Are they necesary? How did they choose those values?
The difference between "input_fn" (what does "fn" even mean?) and "train_input_fn". I see that the only difference is one has
num_epochs=None, shuffle=True
num_epochs=1000, shuffle=False
but I don't understand what "input_fn" or "train_input_fn" are/do, or what's the difference between the two, or if both are necesary.
3.In the
estimator.train(input_fn=input_fn, steps=1000)
piece of code, I don't understand the difference between "steps" and "num_epochs". What's the meaning of each one? Can you have num_epochs=1000 and steps=1000 too?
The final question is, how do i get the W and the b? In the previous way of doing it (not using tf.estimator) they explicitelly found that W=-1 and b=1. If I was doing a more complex neural network, involving biases and weights, I think I would want to recover the actual values of the weights and biases. That's the whole point of why I'm using tensorflow, to find the weights! So how do I recover them in the tf.estimator example?
These are just some of the questions that bugged me while reading the "getStarted" tutorial. I personally think it leaves a lot to desire, since it's very unclear what each thing does and you can at best guess.
I agree with you that the tf.estimator is not very well introduced in this "getting started" tutorial. I also think that some machine learning background would help with understanding what happens in the tutorial.
As for the answers to your questions:
In machine learning, we usually minimizer the loss of the model on the training set, and then we evaluate the performance of the model on the evaluation set. This is because it is easy to overfit the training set and get 100% accuracy on it, so using a separate validation set makes it impossible to cheat in this way.
Here (x_train, y_train) corresponds to the training set, where the global minimum is obtained for W=-1, b=1.
The validation set (x_eval, y_eval) doesn't have to perfectly follow the distribution of the training set. Although we can get a loss of 0 on the training set, we obtain a small loss on the validation set because we don't have exactly y_eval = - x_eval + 1
input_fn means "input function". This is to indicate that the object input_fn is a function.
In tf.estimator, you need to provide an input function if you want to train the estimator (estimator.train()) or evaluate it (estimator.evaluate()).
Usually you want different transformations for training or evaluation, so you have two functions train_input_fn and eval_input_fn (the input_fn in the tutorial is almost equivalent to train_input_fn and is just confusing).
For instance, during training we want to train for multiple epochs (i.e. multiple times on the dataset). For evaluation, we only need one pass over the validation data to compute the metrics we need
The number of epochs is the number of times we repeat the entire dataset. For instance if we train for 10 epochs, the model will see each input 10 times.
When we train a machine learning model, we usually use mini-batches of data. For instance if we have 1,000 images, we can train on batches of 100 images. Therefore, training for 10 epochs means training on 100 batches of data.
Once the estimator is trained, you can access the list of variables through estimator.get_variable_names() and the value of a variable through estimator.get_variable_value().
Usually we never need to do that, as we can for instance use the trained estimator to predict on new examples, using estimator.predict().
If you feel that the getting started is confusing, you can always submit a GitHub issue to tell the TensorFlow team and explain your point.