Early stopping based on AUC - tensorflow

I am fairly new to ML and am currently implementing a simple 3D CNN in python using tensorflow and keras. I want to optimize based on the AUC and would also like to use early stopping/save the best network in terms of AUC score. I have been using tensorflow's AUC function for this as shown below, and it works well for the training. However, the hdf5 file is not saved (despite the checkpoint save_best_only=True) and hence I cannot get the best weights for the evaluation.
Here are the relevant lines of code:
model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(lr=lr),
              metrics=[tf.keras.metrics.AUC()])

model.load_weights(path_weights)

filepath = mypath
check = tf.keras.callbacks.ModelCheckpoint(filepath, monitor=tf.keras.metrics.AUC(),
                                           save_best_only=True, mode='auto')
earlyStopping = tf.keras.callbacks.EarlyStopping(monitor=tf.keras.metrics.AUC(),
                                                 patience=hyperparams['pat'], mode='auto')

history = model.fit(X_trn, y_trn,
                    batch_size=bs,
                    epochs=n_epochs,
                    verbose=1,
                    callbacks=[check, earlyStopping],
                    validation_data=(X_val, y_val),
                    shuffle=True)
Interestingly, if I only change monitor='val_loss' in the early stopping and checkpoint (not the 'metrics' in model.compile), the hdf5 file is saved but obviously gives the best result in terms of validation loss. I have also tried using mode='max' but the problem is the same.
I would very much appreciate your advice, or any other constructive ideas on how to work around this problem.

Turns out that even if you add a non-keyword metric, you still need to use its string handle to refer to it when you want to monitor it. In your case you can do this:
auc = tf.keras.metrics.AUC()  # instantiate it here to have a shorter handle

model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(lr=lr),
              metrics=[auc])
...
check = tf.keras.callbacks.ModelCheckpoint(filepath,
                                           monitor='auc',  # use the generated handle for monitoring the training AUC
                                           save_best_only=True,
                                           mode='max')  # determine better models according to "max" AUC
If you want to monitor the validation AUC (which makes more sense), simply add val_ at the beginning of the handle:
check = tf.keras.callbacks.ModelCheckpoint(filepath,
                                           monitor='val_auc',  # validation AUC
                                           save_best_only=True,
                                           mode='max')
Another problem is that your ModelCheckpoint is saving the weights based on the minimum AUC instead of the maximum, which is what you want.
This can be changed by setting mode='max'.
What does mode='auto' do?
This setting essentially checks whether the argument of monitor contains 'acc' and, if so, sets the mode to 'max'. In any other case it uses mode='min', which is what is happening in your case.
You can confirm this here
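A related caveat: if tf.keras.metrics.AUC() is instantiated more than once in the same session, Keras auto-generates names like 'auc_1' or 'auc_2', so the 'val_auc' handle may not exist in the logs. A small sketch that pins the name explicitly, reusing the names from the question:
auc = tf.keras.metrics.AUC(name='auc')  # fixed name, so the handles are always 'auc' / 'val_auc'
model.compile(loss='binary_crossentropy',
              optimizer=keras.optimizers.Adam(lr=lr),
              metrics=[auc])
check = tf.keras.callbacks.ModelCheckpoint(filepath,
                                           monitor='val_auc',
                                           save_best_only=True,
                                           mode='max')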

The answer posted by Djib2011 should solve your problem. I just wanted to address the use of early stopping. Typically it is used to stop training when overfitting starts to cause the loss to increase. I think it is more effective to address the overfitting issue directly, which should enable you to achieve a lower loss. You did not list your model, so it is not clear how to address overfitting, but some simple guidelines are as follows. If you have several dense hidden layers at the top of the model, delete most of them and keep just the final top dense layer; the more complex the model, the more prone it is to overfitting. If that leads to lower training accuracy, keep the layers but add dropout layers. You might also try using regularization in the hidden dense layers. I also find it beneficial to use the ReduceLROnPlateau callback: set it up to monitor the AUC and reduce the learning rate if it fails to improve.
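A minimal sketch of that ReduceLROnPlateau suggestion, assuming the validation AUC handle is 'val_auc' as in the answer above (the factor and patience values are just illustrative):
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_auc',
                                                 mode='max',   # AUC should be maximized
                                                 factor=0.5,   # halve the learning rate
                                                 patience=2)   # after 2 epochs without improvement
# then pass it along with the other callbacks:
# model.fit(..., callbacks=[check, earlyStopping, reduce_lr], ...)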

Related

Question about Tensorflow Model Reproducibility

I am currently working on a tensorflow model and ran into an issue with its reproducibility.
I built a simple Dense model initialized with a constant value and trained with dummy data.
import tensorflow as tf

weight_init = tf.keras.initializers.Constant(value=0.001)

inputs = tf.keras.Input(shape=(5,))
layer1 = tf.keras.layers.Dense(5,
                               activation=tf.nn.leaky_relu,
                               kernel_initializer=weight_init,
                               bias_initializer=weight_init)
outputs = layer1(inputs)

model = tf.keras.Model(inputs=inputs, outputs=outputs, name="test")
model.compile(loss='mse', optimizer='Adam')

model.fit([[111, 1.02, -1.98231, 1, 1],
           [112, 1.02, -1.98231, 1, 1],
           [113, 1.02, -1.98231, 1, 1],
           [114, 1.02, -1.98231, 1, 1],
           [115, 1.02, -1.98231, 1, 1],
           [116, 1.02, -1.98231, 1, 1],
           [117, 1.02, -1.98231, 1, 1]],
          [1, 1, 1, 1, 1, 1, 2], epochs=3, batch_size=1)
Even though I initialize the model weights with a constant value of 0.001, the training loss changes with every run...
What am I missing here? Are there any additional values I need to fix to a constant?
What is more surprising is that if I change batch_size to 16, the loss doesn't change between runs.
Please teach me, guys...
Since keras Model.fit() has the default kwarg shuffle=True, the data will be shuffled across batches. If you change batch_size to any integer larger than the data length, shuffling becomes a no-op because only a single batch is left.
So, adding shuffle=False to model.fit() will achieve reproducibility here.
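For instance, a minimal sketch of that fix applied to the fit call from the question (data shortened, everything else unchanged):
model.fit([[111, 1.02, -1.98231, 1, 1],
           [112, 1.02, -1.98231, 1, 1],
           [113, 1.02, -1.98231, 1, 1]],
          [1, 1, 1],
          epochs=3, batch_size=1,
          shuffle=False)  # keeps the sample order identical across runs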
Additionally, if your model grows bigger, a real reproducibility problem will arise: there will be slight differences between the results of two successive runs, even when you use no randomness or shuffling and simply run the same script twice. We refer to this aspect of reproducibility as determinism.
Determinism is a good topic that is easily ignored by many users.
Let's start with the conclusion: reproducibility is influenced by:
- the random seed (operation_seed + hidden_global_seed)
- operation determinism
How to do it? TensorFlow states this precisely in its determinism documentation: add the following code before building or restoring the model.
tf.keras.utils.set_random_seed(some_seed)
tf.config.experimental.enable_op_determinism()
But use this only if you rely heavily on reproducibility, since tf.config.experimental.enable_op_determinism() will reduce speed significantly. The deeper reason is that hardware gives up some numerical accuracy in order to speed up the calculation, which usually does not affect our algorithms. In deep learning, however, models are very large, so small errors arise easily, and training cycles are long, so those errors accumulate. In a regression model where any extra error is unacceptable, a deterministic algorithm is needed.

How to use Early Stopping with KerasRegressor in gridsearch (withing sklearn pipeline)? [duplicate]

I wish to implement early stopping with Keras and sklean's GridSearchCV.
The working code example below is modified from How to Grid Search Hyperparameters for Deep Learning Models in Python With Keras. The data set may be downloaded from here.
The modification adds the Keras EarlyStopping callback class to prevent over-fitting. For this to be effective it requires the monitor='val_acc' argument for monitoring validation accuracy. For val_acc to be available KerasClassifier requires the validation_split=0.1 to generate validation accuracy, else EarlyStopping raises RuntimeWarning: Early stopping requires val_acc available!. Note the FIXME: code comment!
Note we could replace val_acc by val_loss!
Question: How can I use the cross-validation data set generated by the GridSearchCV k-fold algorithm instead of wasting 10% of the training data for an early stopping validation set?
# Use scikit-learn to grid search the learning rate and momentum
import numpy
from sklearn.model_selection import GridSearchCV
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import SGD

# Function to create model, required for KerasClassifier
def create_model(learn_rate=0.01, momentum=0):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    optimizer = SGD(lr=learn_rate, momentum=momentum)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Early stopping
from keras.callbacks import EarlyStopping
stopper = EarlyStopping(monitor='val_acc', patience=3, verbose=1)

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:, 0:8]
Y = dataset[:, 8]

# create model
model = KerasClassifier(
    build_fn=create_model,
    epochs=100, batch_size=10,
    validation_split=0.1,  # FIXME: Instead use GridSearchCV k-fold validation data.
    verbose=2)

# define the grid search parameters
learn_rate = [0.01, 0.1]
momentum = [0.2, 0.4]
param_grid = dict(learn_rate=learn_rate, momentum=momentum)
grid = GridSearchCV(estimator=model, param_grid=param_grid, verbose=2, n_jobs=1)

# Fitting parameters
fit_params = dict(callbacks=[stopper])

# Grid search.
grid_result = grid.fit(X, Y, **fit_params)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
[Answer after the question was edited & clarified:]
Before rushing into implementation issues, it is always a good practice to take some time to think about the methodology and the task itself; arguably, intermingling early stopping with the cross validation procedure is not a good idea.
Let's make up an example to highlight the argument.
Suppose that you indeed use early stopping with 100 epochs, and 5-fold cross validation (CV) for hyperparameter selection. Suppose also that you end up with a hyperparameter set X giving best performance, say 89.3% binary classification accuracy.
Now suppose that your second-best hyperparameter set, Y, gives 89.2% accuracy. Examining closely the individual CV folds, you see that, for your best case X, 3 out of the 5 CV folds exhausted the max 100 epochs, while in the other 2 early stopping kicked in, say in 95 and 93 epochs respectively.
Now imagine that, examining your second-best set Y, you see that again 3 out of the 5 CV folds exhausted the 100 epochs, while the other 2 both stopped early enough at ~ 80 epochs.
What would be your conclusion from such an experiment?
Arguably, you would have found yourself in an inconclusive situation; further experiments might reveal which is actually the best hyperparameter set, provided of course that you would have thought to look into these details of the results in the first place. And needless to say, if all this was automated through a callback, you might have missed your best model despite the fact that you would have actually tried it.
The whole CV idea is implicitly based on the "all other things being equal" argument (which of course is never true in practice, only approximated in the best possible way). If you feel that the number of epochs should be a hyperparameter, just include it explicitly in your CV as such, rather than inserting it through the back door of early stopping, thus possibly compromising the whole process (not to mention that early stopping has itself a hyperparameter, patience).
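As an illustration of that point, a hedged sketch of how the number of epochs could be searched explicitly in the question's grid (the candidate values are arbitrary):
# Sketch only: epochs becomes an ordinary hyperparameter searched by CV,
# instead of being decided per-fold by early stopping.
param_grid = dict(learn_rate=[0.01, 0.1],
                  momentum=[0.2, 0.4],
                  epochs=[50, 100, 150])
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X, Y)  # no EarlyStopping callback passed here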
Not intermingling these two techniques doesn't mean of course that you cannot use them sequentially: once you have obtained your best hyperparameters through CV, you can always employ early stopping when fitting the model in your whole training set (provided of course that you do have a separate validation set).
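To make that sequential use concrete, here is a rough, non-authoritative sketch; the 10% split and the reuse of create_model and grid_result from the question are just illustrative choices:
# Sketch: CV first (without early stopping), then early stopping on the final fit.
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping

X_tr, X_val, y_tr, y_val = train_test_split(X, Y, test_size=0.1, random_state=seed)

best = grid_result.best_params_  # e.g. {'learn_rate': 0.1, 'momentum': 0.4}
final_model = create_model(learn_rate=best['learn_rate'], momentum=best['momentum'])
final_model.fit(X_tr, y_tr,
                validation_data=(X_val, y_val),
                epochs=100, batch_size=10,
                callbacks=[EarlyStopping(monitor='val_loss', patience=3)])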
The field of deep neural nets is still (very) young, and it is true that it has yet to establish its "best practice" guidelines; add the fact that, thanks to an amazing community, there are all sorts of tools available in open source implementations, and you can easily find yourself in the (admittedly tempting) position of mixing things up just because they happen to be available. I am not necessarily saying that this is what you are attempting to do here - I am just urging for more caution when combining ideas that may not have been designed to work together...
[Old answer, before the question was edited & clarified - see updated & accepted answer above]
I am not sure I have understood your exact issue (your question is quite unclear, and you include many unrelated details, which is never good when asking a SO question - see here).
You don't have to (and actually should not) include any arguments about validation data in your model = KerasClassifier() function call (it is interesting why you don't feel the same need for training data here, too). Your grid.fit() will take care of both the training and validation folds. So provided that you want to keep the hyperparameter values as included in your example, this function call should be simply
model = KerasClassifier(build_fn=create_model,
                        epochs=100, batch_size=32,
                        shuffle=True,
                        verbose=1)
You can see some clear and well-explained examples regarding the use of GridSearchCV with Keras here.
Here is how to do it with only a single split.
fit_params['cl__validation_data'] = (X_val, y_val)
X_final = np.concatenate((X_train, X_val))
y_final = np.concatenate((y_train, y_val))
splits = [(range(len(X_train)), range(len(X_train), len(X_final)))]

GridSearchCV(estimator=model, param_grid=param_grid, cv=splits)
If you want more splits, you can use 'cl__validation_split' with a fixed ratio and construct splits that meet that criterion.
It might be too paranoid, but I don't use the early stopping data set as a validation data set since it was indirectly used to create the model.
I also think if you are using early stopping with your final model, then it should also be done when you are doing hyper-parameter search.

Is my training data set too complex for my neural network?

I am new to machine learning and Stack Overflow; I am trying to interpret two graphs from my regression model.
[Plot: training error and validation error from my machine learning model]
My case is similar to this question: Very large loss values when training multiple regression model in Keras, but my MSE and RMSE are very high.
Is my model underfitting? If so, what can I do to solve this problem?
Here is the neural network I used for solving the regression problem:
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
And my data set: I have 500 samples, 10 features and 1 target.
Quite the opposite: it looks like your model is over-fitting. When you have low error rates on your training set, it means that your model has learned from the data well and can infer the results accurately. If your validation error is high afterwards, however, that means the information learned from your training data is not successfully being applied to new data. This is because your model has 'fit' onto your training data too much, and has only learned how to predict well on that data.
To solve this, we can introduce common solutions to reduce over-fitting. A very common technique is to use Dropout layers. This will randomly remove some of the nodes so that the model cannot correlate with them too heavily - therefore reducing dependency on those nodes and 'learning' more using the other nodes too. I've included an example that you can test below; try playing with the value and other techniques to see what works best. And as a side note: are you sure that you need that many nodes within your dense layer? Seems like quite a bit for your data set, and that may be contributing to the over-fitting as a result too.
def build_model():
    model = keras.Sequential([
        layers.Dense(128, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dropout(0.2),
        layers.Dense(64, activation=tf.nn.relu),
        layers.Dense(1)
    ])
    optimizer = tf.keras.optimizers.RMSprop(0.001)
    model.compile(loss='mean_squared_error',
                  optimizer=optimizer,
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model
Well, I think your model is overfitting.
There are several ways that can help you:
1- Reduce the network's capacity, which you can do by removing layers or reducing the number of elements in the hidden layers
2- Dropout layers, which will randomly remove certain features by setting them to zero
3- Regularization
If I want to give a brief explanation of these:
- Reduce the network's capacity:
Some models have a large number of trainable parameters. The higher this number, the easier the model can memorize the target class for each training sample. Obviously, this is not ideal for generalizing on new data. By lowering the capacity of the network, it is going to learn the patterns that matter or that minimize the loss. But remember, reducing the network's capacity too much will lead to underfitting.
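For instance, a minimal, untuned sketch of a lower-capacity version of the model from the question (the layer size 32 is just an illustrative choice):
def build_smaller_model():
    model = keras.Sequential([
        layers.Dense(32, activation=tf.nn.relu, input_shape=[len(train_dataset.keys())]),
        layers.Dense(1)
    ])
    model.compile(loss='mean_squared_error',
                  optimizer=tf.keras.optimizers.RMSprop(0.001),
                  metrics=['mean_absolute_error', 'mean_squared_error'])
    return model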
- Regularization:
This page can help you a lot:
https://towardsdatascience.com/handling-overfitting-in-deep-learning-models-c760ee047c6e
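As a quick, hedged illustration of weight regularization in Keras (the L2 factor 1e-4 is arbitrary):
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# A hidden dense layer with L2 (weight decay) regularization on its kernel.
layers.Dense(64, activation=tf.nn.relu,
             kernel_regularizer=regularizers.l2(1e-4))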
- Dropout layer:
You can use a layer like this:
model.add(layers.Dropout(0.5))
This is a dropout layer with a 50% chance of setting inputs to zero.
For more details you can see this page:
https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/
As mentioned in the existing answer by @omoshiroiii, your model does in fact seem to be overfitting, and that's why the RMSE and MSE are so high.
Your model learned the detail and noise in the training data to the extent that it now negatively impacts the performance of the model on new data.
The solution is therefore to randomly remove some of the nodes (e.g. with dropout) so that the model cannot correlate with them too heavily.

Implementing stochastic forward passes in part of a neural network in Keras?

my problem is the following:
I am working on an object detection problem and would like to use dropout during test time to obtain a distribution of outputs. The object detection network consists of a training model and a prediction model, which wraps around the training model. I would like to perform several stochastic forward passes using the training model and combine these e.g. by averaging the predictions in the prediction wrapper. Is there a way of doing this in a keras model instead of requiring an intermediate processing step using numpy?
Note that this question is not about how to enable dropout during test time
def prediction_wrapper(model):
    # Example code.
    # Arguments
    #     model: the training model
    regression = model.outputs[0]
    classification = model.outputs[1]

    predictions = # TODO: perform several stochastic forward passes (dropout during train and test time) here
    avg_predictions = # TODO: combine predictions here, e.g. by computing the mean
    outputs = # TODO: do some processing on avg_predictions

    return keras.models.Model(inputs=model.inputs, outputs=outputs, name=name)
I use keras with a tensorflow backend.
I appreciate any help!
The way I understand it, you're trying to average the weight updates for a single sample while Dropout is enabled. Since dropout is random, you would get different weight updates for the same sample.
If this understanding is correct, then you could create a batch by duplicating the same sample. Here I am assuming that the Dropout mask is different for each sample in a batch. Since backpropagation averages the weight updates anyway, you would get your desired behavior.
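A rough sketch of that duplication idea at inference time, assuming tf.keras and a model that actually contains Dropout layers (sample, model and n_passes are hypothetical names):
import numpy as np
import tensorflow as tf

n_passes = 20
# Duplicate the single sample into a batch; each copy gets its own dropout mask.
batch = np.repeat(sample[np.newaxis, ...], n_passes, axis=0)
preds = model(batch, training=True)        # training=True keeps dropout active
avg_pred = tf.reduce_mean(preds, axis=0)   # Monte-Carlo average over the passes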
If that does not work, then you could write a custom loss function and train with a batch-size of one. You could update a global counter inside your custom loss function and return non-zero loss only when you've averaged them the way you want it. I don't know if this would work, it's just an idea.

DeepLearning Anomaly Detection for images

I am still relatively new to the world of Deep Learning. I wanted to create a Deep Learning model (preferably using Tensorflow/Keras) for image anomaly detection. By anomaly detection I mean, essentially, something like a OneClassSVM.
I have already tried sklearn's OneClassSVM using HOG features from the image. I was wondering if there is some example of how I can do this with deep learning. I looked it up but couldn't find a single code example that handles this case.
The way to do this in Keras is with the KerasRegressor wrapper module (which wraps scikit-learn's regressor interface). Useful information can also be found in the source code of that module. Basically you first have to define your network model, for example:
from keras.layers import Input, Dense
from keras.models import Model

def simple_model():
    # Input layer
    data_in = Input(shape=(13,))
    # First layer, fully connected, ReLU activation
    layer_1 = Dense(13, activation='relu', kernel_initializer='normal')(data_in)
    # Second layer... etc.
    layer_2 = Dense(6, activation='relu', kernel_initializer='normal')(layer_1)
    # Output, single node without activation
    data_out = Dense(1, kernel_initializer='normal')(layer_2)
    # Build and compile model
    model = Model(inputs=data_in, outputs=data_out)
    # You may choose any loss or optimizer function, be careful which you choose
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
Then, pass it to the KerasRegressor builder and fit with your data:
from keras.wrappers.scikit_learn import KerasRegressor

# choose your epochs and batches
regressor = KerasRegressor(build_fn=simple_model, nb_epoch=100, batch_size=64)
# fit with your data
regressor.fit(data, labels, epochs=100)
For which you can now do predictions or obtain its score:
p = regressor.predict(data_test)                 # obtain predicted value
score = regressor.score(data_test, labels_test)  # obtain test score
In your case, as you need to detect anomalous images from the ones that are ok, one approach you can take is to train your regressor by passing anomalous images labeled 1 and images that are ok labeled 0.
This will make your model return a value closer to 1 when the input is an anomalous image, enabling you to threshold the desired results. You can think of this output as its R^2 coefficient to the "Anomalous Model" you trained as 1 (perfect match).
Also, as you mentioned, Autoencoders are another way to do anomaly detection. For this I suggest you take a look at the Keras Blog post Building Autoencoders in Keras, where they explain in detail about the implementation of them with the Keras library.
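If you go the autoencoder route, a minimal, non-authoritative sketch of the idea; the layer sizes, the flattened 784-dimensional input and the threshold are all illustrative:
import numpy as np
from keras.layers import Input, Dense
from keras.models import Model

# Train a small autoencoder on "normal" images only, then flag inputs whose
# reconstruction error is unusually high as anomalies.
inp = Input(shape=(784,))                        # flattened image
encoded = Dense(32, activation='relu')(inp)
decoded = Dense(784, activation='sigmoid')(encoded)
autoencoder = Model(inp, decoded)
autoencoder.compile(optimizer='adam', loss='mse')

# autoencoder.fit(x_normal, x_normal, epochs=50, batch_size=64)
# errors = np.mean((autoencoder.predict(x_test) - x_test) ** 2, axis=1)
# anomalies = errors > threshold   # threshold picked from the error distribution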
It is worth noting that single-class classification is another way of saying regression.
Classification tries to find a probability distribution among the N possible classes, and you usually pick the most probable class as the output (that is why most Classification Networks use Sigmoid activation on their output labels, as it has range [0, 1]). Its output is discrete/categorical.
Similarly, Regression tries to find the best model that represents your data, by minimizing the error or some other metric (like the well-known R^2 metric, or Coefficient of Determination). Its output is a real number/continuous (and the reason why most Regression Networks don't use activations on their outputs). I hope this helps, good luck with your coding.
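A tiny sketch of that output-layer difference (purely illustrative; the 13-feature input is arbitrary):
from keras.layers import Input, Dense
from keras.models import Model

features = Input(shape=(13,))
clf_out = Dense(1, activation='sigmoid')(features)  # classification-style output, squashed to [0, 1]
reg_out = Dense(1)(features)                        # regression-style output, unbounded real value

clf_model = Model(features, clf_out)
reg_model = Model(features, reg_out)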