Stateful Recurrent Neural Networks with fit_generator() - tensorflow

Context
I read some blogs about the implementation of stateful recurrent neural networks in Keras (for example here and here).
There are also several questions regarding stateful RNNs on stackoverflow, whereby this question comes close to mine.
The linked tutorials use the fit()-method instead of fit_generator() and pass states by manually iterating over the epochs with epochs=1 in fit() like in this example taken from here:
# fit an LSTM network to training data
def fit_lstm(train, batch_size, nb_epoch, neurons):
X, y = train[:, 0:-1], train[:, -1]
X = X.reshape(X.shape[0], 1, X.shape[1])
model = Sequential()
model.add(LSTM(neurons, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
for i in range(nb_epoch):
model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
model.reset_states()
return model
My question
I'd like to use fit_generator() instead of fit(), but also use stateless LSTM/GRU-layers.
What I was missing in other stackoverflow-questions like the one linked above is:
Can I proceed in the same way as with fit(), meaning setting epochs=1, and iterate over it for x times while setting
model.reset_states() in each iteration like in the example?
Or does fit_generator() already reset states only after finishing batch_size when stateful=True is used (what would be great)?
Or does fit_generator() reset states after each single batch (what would be a problem)?
The latter question deals in particular with this statement form here:
Stateless: In the stateless LSTM configuration, internal state is
reset after each training batch or each batch when making predictions.
Stateful: In the stateful LSTM configuration, internal state is only
reset when the reset_state() function is called.

Related

Approximating a smooth multidimensional function using Keras to an error of 1e-4

I am trying to approximate a function that smoothly maps five inputs to a single probability using Keras, but seem to have hit a limit. A similar problem was posed here (Keras Regression to approximate function (goal: loss < 1e-7)) for a ten-dimensional function and I have found that the architecture proposed there, namely:
model = Sequential()
model.add(Dense(128,input_shape=(5,), activation='tanh'))
model.add(Dense(64,activation='tanh'))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='adam', loss='mae')
gives me my best results, converging to a best loss of around 7e-4 on my validation data when the batch size is 1000. Adding or removing more neurons or layers seems to reduce the accuracy. Dropout regularisation also reduces accuracy. I am currently using 1e7 training samples, which took two days to generate (hence the desire to approximate this function). I would like to reduce the mae by another order of magnitude, does anyone have any suggestions how to do this?
I recommend use utilize the keras callbacks ReduceLROnPlateau, documentation is [here][1] and ModelCheckpoint, documentation is [here.][2]. For the first, set it to monitory validation loss and it will reduce the learning rate by a factor(factor) if the loss fails to reduce after a fixed number (patience) of consecutive epochs. For the second also monitor validation loss and save the weights for the model with the lowest validation loss to a directory. After training load the weights and use them to evaluate or predict on the test set. My code implementation is shown below.
checkpoint=tf.keras.callbacks.ModelCheckpoint(filepath=save_loc, monitor='val_loss', verbose=1, save_best_only=True,
save_weights_only=True, mode='auto', save_freq='epoch', options=None)
lr_adjust=tf.keras.callbacks.ReduceLROnPlateau( monitor="val_loss", factor=0.5, patience=1, verbose=1, mode="auto",
min_delta=0.00001, cooldown=0, min_lr=0)
callbacks=[checkpoint, lr_adjust]
history = model.fit_generator( train_generator, epochs=EPOCHS,
steps_per_epoch=STEPS_PER_EPOCH,validation_data=validation_generator,
validation_steps=VALIDATION_STEPS, callbacks=callbacks)
model.load_weights(save_loc) # load the saved weights
# after this use the model to evaluate or predict on the test set.
# if you are satisfied with the results you can then save the entire model with
model.save(save_loc)
[1]: https://keras.io/api/callbacks/reduce_lr_on_plateau/
[2]: https://keras.io/api/callbacks/model_checkpoint/

Tune a pre-existing model with Keras Tuner

I am looking at Keras Tuner as a way of doing hyperparameter optimization, but all of the examples I have seen show an entirely fresh model being defined. For example, from the Keras Tuner Hello World:
def build_model(hp):
model = keras.Sequential()
model.add(layers.Flatten(input_shape=(28, 28)))
for i in range(hp.Int('num_layers', 2, 20)):
model.add(layers.Dense(units=hp.Int('units_' + str(i), 32, 512, 32),
activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
model.compile(
optimizer=keras.optimizers.Adam(
hp.Choice('learning_rate', [1e-2, 1e-3, 1e-4])),
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
return model
I already have a model that I would like to tune, but does that mean I have to rewrite it with the hyperparameters spliced in to the body, as above, or can I simply pass the hyperameters in to the model at the top? For example like this:
def build_model(hp):
model = MyExistingModel(
batch_size=hp['batch_size'],
seq_len=hp['seq_len'],
rnn_hidden_units=hp['hidden_units'],
rnn_type='gru',
num_rnn_layers=hp['num_rnn_layers']
)
optimizer = optimizer_factory['adam'](
learning_rate=hp['learning_rate'],
momentum=0.9,
)
model.compile(
optimizer=optimizer,
loss='sparse_categorical_crossentropy',
metrics=['sparse_categorical_accuracy'],
)
return model
The above seems to work, as far as I can see. The model initialization args are all passed to the internal TF layers, through a HyperParameters instance, and accessed from there... although I'm not really sure how to pass it in... I think it can be done by predefining a HyperParameters object and passing it in to the tuner, so it then gets passed in to build_model:
hp = HyperParameters()
hp.Choice('learning_rate', [1e-1, 1e-3])
tuner = RandomSearch(
build_model,
max_trials=5,
hyperparameters=hp,
tune_new_entries=False,
objective='val_accuracy')
Internally my model has two RNNs (LSTM or GRU) and an MLP. But I have yet to come across a Keras Tuner build_model that takes an existing model like this a simply passes in the hyperparameters. The model is quite complex, and I would like to avoid having to redefine it (as well as avoiding code duplication).
Indeed this is possible, as this GitHub issue makes clear...
However rather than passing the hp object through the hyperparameters arg to the Tuner, instead I override the Tuner run_trial method in the manner suggested here.

Different results when using Manual KFold-Cross validation vs. KerasClassifier-KFold Cross Validation

I've been struggling to understand why two similar Kfold-cross validations result in two different averages.
When I use a manual KFold approach (with Tensorflow and Keras)
cvscores = []
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=3)
for train, test in kfold.split(X, y):
model = create_baseline()
model.fit(X[train], y[train], epochs=50, batch_size=32, verbose=0)
scores = model.evaluate(X[test], y[test], verbose=0)
#print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
I get
65.89% (+/- 3.77%)
When I use the KerasClassifier wrapper from scikit
estimator = KerasClassifier(build_fn=create_baseline, epochs=50, batch_size=32, verbose=0)
kfold = StratifiedKFold(n_splits=10,shuffle=True, random_state=3)
results = cross_val_score(estimator, X, y, cv=kfold, scoring='accuracy')
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
I get
63.82% (5.37%)
Additionally, when using KerasClassifier the following warning appears
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/wrappers/scikit_learn.py:241: Sequential.predict_classes (from tensorflow.python.keras.engine.sequential) is deprecated and will be removed after 2021-01-01.
Instructions for updating:
Please use instead:* `np.argmax(model.predict(x), axis=-1)`, if your model does multi-class classification (e.g. if it uses a `softmax` last-layer activation).* `(model.predict(x) > 0.5).astype("int32")`, if your model does binary classification (e.g. if it uses a `sigmoid` last-layer activation).
Do the results differ because KerasClassifier uses predict_classes() while the manual Tensorflow/Keras approach uses just predict()? If so, which approach is more reasonable?
My model looks like this
def create_baseline():
model = tf.keras.models.Sequential()
model.add(Dense(8, activation='relu', input_shape=(12,)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
The two CV-results do not look too different, they are both within each others standard deviation.
You fixed the seed for the StratifiedKFold class, that's good. However there is additional randomness you should take control of and that comes from the weight initialization. Make sure you initialize your model for each CV-run with different weights, but use the same 10 initializations for both cross-validations, manual and automatic. You can pass an initializer to each layer, they have a seed argument as well. In general you should fix all possible seeds (np.random.seed(3), tf.set_random_seed(3)).
What happens if you run cross_val_score() or your manual version twice? Do you get the same results / numbers?

Best practices in Tensorflow 2.0(Training step)

In tensorflow 2.0 you don't have to worry about training phase(batch size, number of epochs etc), because everything can be defined in compile method: model.fit(X_train,Y_train,batch_size = 64,epochs = 100).
But I have seen the following code style:
optimizer = tf.keras.optimizers.Adam(0.001)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
#tf.function
def train_step(inputs, labels):
with tf.GradientTape() as tape:
predictions = model(inputs, training=True)
regularization_loss = tf.math.add_n(model.losses)
pred_loss = loss_fn(labels, predictions)
total_loss = pred_loss + regularization_loss
gradients = tape.gradient(total_loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
for epoch in range(NUM_EPOCHS):
for inputs, labels in train_data:
train_step(inputs, labels)
print("Finished epoch", epoch)
So here you can observe "more detailed" code, where you manually define by for loops you training procedure.
I have following question: what is the best practice in Tensorflow 2.0? I haven't found a any complete tutorial.
Use what is best for your needs.
Both methods are documented in Tensorflow tutorials.
If you don't need anything special, no extra losses, strange metrics or intricate gradient computation, just use a model.fit() or a model.fit_generator(). This is totally ok and makes your life easier.
A custom training loop might come in handy when you have complicated models with non-trivial loss/gradients calculation.
Up to now, two applications I tried were easier with this:
Training a GAN's generator and discriminator simultaneously without having to do the generation step twice. (It's complicated because you have a loss function that applies to different y_true values, and each case should update only a part of the model) - The other option would require to have a few separate models, each model with its own trainable=True/False configuration, and train then in separate phases.
Training inputs (good for style transfer models) -- Alternatively, create a custom layer that takes dummy inputs and that outputs its own trainable weights. But it gets complicated to compile several loss functions for each of the outputs of the base and style networks.

Cross validation in deep neural networks

How do you perform cross-validation in a deep neural network? I know that to perform cross validation to will train it on all folds except one and test it on the excluded fold. Then do this for k fold times and average the accuries for each fold. How do you do this for each iteration. Do you update the parameters at each fold? Or you perform k-fold cross validation for each iteration? Or is each training on all folds but one fold considered as one iteration?
Stratified cross validation
There are a couple of solution available to run deep neural network in fold-cross validation.
def create_baseline():
# create model
model = Sequential()
model.add(Dense(60, input_dim=11, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal', activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
return model
# evaluate model with standardized dataset
estimator = KerasClassifier(build_fn=create_baseline, epochs=100, batch_size=5, verbose=0)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
Cross-validation is a general technique in ML to prevent overfitting. There is no difference between doing it on a deep-learning model and doing it on a linear regression. The idea is the same for all ML models. The basic idea behind CV, you described in your question is correct.
But the question how do you do this for each iteration does not make sense. There is nothing in CV algorithm that relates to iterations while training. You trained your model and only then you evaluate it.
Do you update the parameters at each fold?. You train the same model k-times and most probably each time you will have different parameters.
The answer that CV is not needed in DL is wrong. The basic idea of CV is to have a better estimate of how your model is performing on a limited dataset. So if your dataset is small the ability to train k models will give you a better estimate (the downsize is that you spend ~k times more time). If you have 100mln examples, most probably having 5% testing/validation set will already give you a good estimate.