Can't see GAN's losses in Tensorboard - tensorflow

I am developing a GAN and want to see the networks' losses in TensorBoard. I have added the TensorBoard callback to the fit() call and I launch TensorBoard from the main directory, but nothing appears in it. What else do I have to add?
The compile and fit part of the GAN:
epochs = 15
gan = GAN(discriminator=discriminator, generator=generator, latent_dim=latent_dim)
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/", histogram_freq=1)
gan.compile(
    d_optimizer=keras.optimizers.Adam(learning_rate=0.00005),
    g_optimizer=keras.optimizers.Adam(learning_rate=0.0001),
    loss_fn=keras.losses.BinaryCrossentropy(),
)
gan.fit(
    dataset,
    epochs=epochs,
    callbacks=[GANMonitor(num_img=10, latent_dim=latent_dim), tb_callback],
)
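When the dashboard stays empty, it is worth ruling out a directory mismatch first. log_dir="logs/" is resolved relative to the working directory of the training process, and the TF2 Keras TensorBoard callback writes its event files into a train/ subdirectory (and validation/, when validation data is passed) below it. The stdlib sketch below lists where event files actually landed; find_event_files is a hypothetical helper name, and it relies only on the standard events.out.tfevents* file-name prefix.

```python
import os
from collections import defaultdict


def find_event_files(log_root):
    """Return {subdirectory: [event file names]} for every TensorBoard event file under log_root."""
    found = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(log_root):
        for name in filenames:
            # TensorBoard event files all share this prefix.
            if name.startswith("events.out.tfevents"):
                found[os.path.relpath(dirpath, log_root)].append(name)
    return dict(found)


if __name__ == "__main__":
    # "logs" matches log_dir="logs/" in the question; run this from the
    # same working directory the training script was started from.
    for directory, files in sorted(find_event_files("logs").items()):
        print(directory, len(files), "event file(s)")
```

If it reports event files under train/, pointing tensorboard --logdir at that same logs directory should show them, since TensorBoard searches the log directory recursively. If it reports nothing, the callback never wrote to the directory you are inspecting.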

Related

Validation loss reported seems to be wrong, can preprocessing be the reason?

I'm fine-tuning a ResNet model in Keras on my own images. During training, TensorBoard consistently reports a validation loss that seems unrelated to the training loss (it is much higher). Furthermore, when training has finished (for example, with final losses reported by TensorBoard of 0.06 for training and 0.57 for validation), I evaluate the model "manually" and the validation loss is in the same range as the training loss (e.g. 0.07).
I suspect that preprocessing could be the reason for this strange result. Essentially, the model's inputs and outputs are created like this:
inp = tf.keras.Input(input_shape)
resnet = tf.keras.applications.ResNet50V2(include_top=False, input_shape=input_shape, input_tensor=inp,pooling="avg")
# Add ResNet50V2 specific preprocessing method into the model.
preprocessed = tf.keras.layers.Lambda(lambda x: tf.keras.applications.resnet_v2.preprocess_input(x))(inp)
out = resnet(preprocessed)
out = tf.keras.layers.Dense(num_outputs, activation=None)(out)
And the training:
model.compile(
    optimizer=tf.keras.optimizers.Adam(lrate),
    loss='mse',
    metrics=[tf.keras.metrics.MeanSquaredError()],
)
model.fit(
    train_dataset,
    epochs=epochs,
    validation_data=val_dataset,
    callbacks=callbacks
)
It looks as if preprocessing does not happen when the validation loss is computed, but I don't know why.
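Before attributing the gap to preprocessing, it helps to make sure the "manual" number is aggregated the same way as a dataset-level loss: one mean over every sample, rather than an average of per-batch means (which over-weights a smaller final batch). A minimal pure-Python sketch, assuming each batch is given as lists of label rows and prediction rows; dataset_mse is a hypothetical name. Aggregation alone would not normally explain 0.07 versus 0.57; this only ensures the two numbers measure the same thing.

```python
def dataset_mse(batches):
    """Mean squared error over every element of every (y_true, y_pred) batch.

    Squared errors are summed across all batches and divided once by the
    total element count, so every sample carries the same weight.
    """
    total_sq_err = 0.0
    count = 0
    for y_true, y_pred in batches:
        for row_true, row_pred in zip(y_true, y_pred):
            for t, p in zip(row_true, row_pred):
                total_sq_err += (t - p) ** 2
                count += 1
    if count == 0:
        raise ValueError("no samples to evaluate")
    return total_sq_err / count
```

In practice each batch would be the labels from val_dataset next to model.predict() on the same inputs.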

How to fine-tune a trained and saved model with TensorFlow?

I have a big dataset and want to train EfficientNetB0 on it, but Google Colab hits a runtime timeout. So I want to train only the fully connected layers of the model, save it, then load it again some hours later and fine-tune the base model (EfficientNetB0). How can I do this?
You can do it the following way; adapt it to your requirements. This is a starting point:
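import os
import tensorflow as tf
from tensorflow.keras.models import load_model
ckpt_path = 'tf_keras_cifar10.h5'  # same file for the existence check, loading and saving
if not os.path.exists(ckpt_path):
    model = get_model()  # this method constructs the model and compiles it
    initial_epoch = 0
    epochs = 10
else:
    model = load_model(ckpt_path)  # load the model (weights and optimizer state) from file
    # TF2 replacement for K.get_session().run(model.optimizer.lr); assumes a constant learning rate
    print('lr is', model.optimizer.learning_rate.numpy())
    initial_epoch = 10  # resume counting where the first session stopped
    epochs = 13         # train until epoch 13, i.e. 3 more epochs
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                    validation_data=(x_test, y_test), initial_epoch=initial_epoch)
model.save(ckpt_path)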
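Hard-coding initial_epoch works for one resume, but across several Colab sessions it is easier to derive it from the newest checkpoint on disk. A stdlib sketch, assuming a hypothetical naming scheme of one file per finished epoch such as ckpt-e10.h5; latest_checkpoint and CKPT_PATTERN are illustrative names, not part of Keras.

```python
import os
import re

# Hypothetical naming scheme: "ckpt-e<epochs finished>.h5", e.g. "ckpt-e03.h5".
CKPT_PATTERN = re.compile(r"^ckpt-e(\d+)\.h5$")


def latest_checkpoint(ckpt_dir):
    """Return (path, initial_epoch) for the newest checkpoint, or (None, 0) if there is none."""
    best_epoch, best_path = 0, None
    if os.path.isdir(ckpt_dir):
        for name in os.listdir(ckpt_dir):
            match = CKPT_PATTERN.match(name)
            if match and int(match.group(1)) > best_epoch:
                best_epoch = int(match.group(1))
                best_path = os.path.join(ckpt_dir, name)
    return best_path, best_epoch
```

For example, saving with tf.keras.callbacks.ModelCheckpoint(filepath='checkpoints/ckpt-e{epoch:02d}.h5') produces names that match this pattern, since the epoch it substitutes is 1-based. You would then build the model with get_model() when the path is None, otherwise call load_model(path), and pass initial_epoch to model.fit. For the fine-tuning phase, the usual Keras pattern is to set the base model to trainable and recompile with a lower learning rate before fitting again.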

Issue with TensorBoard: nothing is logged

I was following Google's TensorBoard HParams tutorial. However, when I try to implement it in my own model, nothing shows up in the logs. The main difference is that I used an ImageDataGenerator, but I do not see how that would affect the hyperparameters. I have included all the code used to set up the hyperparameters, but omitted the model definition and the basic imports for brevity.
# Load the TensorBoard notebook
%load_ext tensorboard
# Clear all logs
!rm -rf ./logs/
Here is what I have set up for the hyperparameters: just learning rate and weight decay. It is slightly adapted from the tutorial, but largely the same style.
HP_lr = hp.HParam('learning_rate', hp.Discrete([3, 4, 5]))
HP_weight_decay = hp.HParam('l2_weight_decay', hp.Discrete([4, 5, 6]))
METRIC_ACCURACY = 'accuracy'
This differs slightly from the tutorial to account for the values above, but those are just variable names.
# file writer
with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
    hp.hparams_config(
        hparams=[HP_lr, HP_weight_decay],
        metrics=[hp.Metric(METRIC_ACCURACY, display_name='Accuracy')],
    )
I have a function that builds the model from an hparams argument. Apart from passing datagen.flow() to model.fit, nothing changes from the tutorial.
def train_test_model(hparams):
    model = build_model(hparams)
    model.fit(datagen.flow(x_train, y_train, batch_size=64),
              epochs=1, verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, batch_size=64, verbose=1)
    return accuracy

# For each run log the metrics and hyperparameters used
def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record the values used in this trial
        accuracy = train_test_model(hparams)
        tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
This sets up the dictionary used by hp:
session_num = 0
for learn_rate in HP_lr.domain.values:
    for wd in HP_weight_decay.domain.values:
        hparams = {
            HP_lr: 1*10**(-learn_rate),  # transform to something like 1e-3
            HP_weight_decay: 1*10**(-wd)
        }
        run_name = "run-%d" % session_num
        print('--- Starting trial: %s' % run_name)
        print({h.name: hparams[h] for h in hparams})
        run('logs/hparam_tuning/' + run_name, hparams)
        session_num += 1
%tensorboard --logdir logs/hparam_tuning
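One detail worth ruling out (the question does not confirm it is the cause): hp.Discrete([3, 4, 5]) declares the exponents as the domain, while hp.hparams() then records converted values such as 1e-3, which are not in that declared domain. The HParams dashboard builds its filters from the declared domain, so out-of-domain values may be filtered out of view. Declaring the real values keeps the config and the logged values consistent. The stdlib sketch below keeps each value list in one place so the same list can be passed to hp.Discrete(...) and drive the trials; LEARNING_RATES, WEIGHT_DECAYS and trial_grid are hypothetical names.

```python
import itertools

# The question's values, written out instead of as exponents, so the same
# lists can be used both for hp.Discrete(...) and for the trials.
LEARNING_RATES = [1e-3, 1e-4, 1e-5]
WEIGHT_DECAYS = [1e-4, 1e-5, 1e-6]


def trial_grid(learning_rates, weight_decays):
    """Return (run_name, params) pairs for every combination, learning rate as the outer loop."""
    trials = []
    combos = itertools.product(learning_rates, weight_decays)
    for session_num, (lr, wd) in enumerate(combos):
        trials.append((f"run-{session_num}", {"learning_rate": lr, "l2_weight_decay": wd}))
    return trials
```

With HP_lr = hp.HParam('learning_rate', hp.Discrete(LEARNING_RATES)) and the matching HP_weight_decay, each trial's dictionary becomes {HP_lr: params['learning_rate'], HP_weight_decay: params['l2_weight_decay']}, and no conversion step is needed.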

How to view train_on_batch tensorboard log files generated by Google Colab?

I know how to view TensorBoard plots on my local machine while my neural networks train in a local Jupyter Notebook, using the code below. What do I need to do differently when I train the network on Google Colab instead? I can't find any tutorials or examples online that use train_on_batch.
After defining my model (convnet)...
convnet.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(0.001),
                metrics=['accuracy'])
# create tensorboard graph data for the model
tb = tf.keras.callbacks.TensorBoard(log_dir='Logs/Exp_15',
                                    histogram_freq=0,
                                    batch_size=batch_size,
                                    write_graph=True,
                                    write_grads=False)
tb.set_model(convnet)
num_epochs = 3
batches_processed_counter = 0
for epoch in range(num_epochs):
    for batch in range(int(train_img.samples/batch_size)):
        batches_processed_counter = batches_processed_counter + 1
        # get next batch of images & labels
        X_imgs, X_labels = next(train_img)
        # train model, get cross entropy & accuracy for batch
        train_CE, train_acc = convnet.train_on_batch(X_imgs, X_labels)
        # validation images - just predict
        X_imgs_val, X_labels_val = next(val_img)
        val_CE, val_acc = convnet.test_on_batch(X_imgs_val, X_labels_val)
        # create tensorboard graph info for the cross entropy loss and training accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'train_loss': train_CE, 'train_acc': train_acc})
        # create tensorboard graph info for the cross entropy loss and VALIDATION accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'val_loss': val_CE, 'val_acc': val_acc})
        print('epoch', epoch, 'batch', batch, 'train_CE:', train_CE, 'train_acc:', train_acc)
        print('epoch', epoch, 'batch', batch, 'val_CE:', val_CE, 'val_acc:', val_acc)
tb.on_train_end(None)
I can see that the log file has been generated successfully within the Google Colab runtime. How do I view it in TensorBoard? I've seen solutions that describe downloading the log file to a local machine and viewing it in TensorBoard locally, but this doesn't display anything. Is something missing in my code that prevents this from working in a local TensorBoard? Or is there an alternative way to view the log data in TensorBoard within Google Colab?
In case it's important for the solution, I'm on a Mac. Also, the tutorials I've seen online show how to use TensorBoard with Google Colab when training with fit, but I can't see how to adapt them to my code, which uses train_on_batch instead of fit.
Thanks to Dr Ryan Cunningham from Manchester Metropolitan University for the solution to this problem, which was the following:
%load_ext tensorboard
%tensorboard --logdir './Logs'
...which lets me view the TensorBoard plots in the Google Colab notebook itself and watch them update while the NN is training.
So the full code to view the TensorBoard plots while the network is training (after defining the neural network, which I've called convnet) is:
# compile the neural net after defining the loss, optimisation and
# performance metric
convnet.compile(loss='categorical_crossentropy',  # cross entropy is suited to
                                                  # multi-class classification
                optimizer=tf.keras.optimizers.Adam(0.001),
                metrics=['accuracy'])
# create tensorboard graph data for the model
tb = tf.keras.callbacks.TensorBoard(log_dir='Logs/Exp_15',
                                    histogram_freq=0,
                                    batch_size=batch_size,
                                    write_graph=True,
                                    write_grads=False)
tb.set_model(convnet)
%load_ext tensorboard
%tensorboard --logdir './Logs'
# iterate through the training set for x epochs,
# each time iterating through the batches,
# for each batch, train, calculate loss & optimise weights.
# (mini-batch approach)
num_epochs = 1
batches_processed_counter = 0
for epoch in range(num_epochs):
    for batch in range(int(train_img.samples/batch_size)):
        batches_processed_counter = batches_processed_counter + 1
        # get next batch of images & labels
        X_imgs, X_labels = next(train_img)
        # train model, get cross entropy & accuracy for batch
        train_CE, train_acc = convnet.train_on_batch(X_imgs, X_labels)
        # validation images - just predict
        X_imgs_val, X_labels_val = next(val_img)
        val_CE, val_acc = convnet.test_on_batch(X_imgs_val, X_labels_val)
        # create tensorboard graph info for the cross entropy loss and training accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'train_loss': train_CE, 'train_acc': train_acc})
        # create tensorboard graph info for the cross entropy loss and VALIDATION accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'val_loss': val_CE, 'val_acc': val_acc})
        print('epoch', epoch, 'batch', batch, 'train_CE:', train_CE, 'train_acc:', train_acc)
        print('epoch', epoch, 'batch', batch, 'val_CE:', val_CE, 'val_acc:', val_acc)
tb.on_train_end(None)
Note: it can take a few seconds after the cell has finished running before the cell output refreshes and shows the Tensorboard plots.
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('tensorboard --logdir /content/trainingdata/objectdetection/ckpt_output/trainingImatges/ --host 0.0.0.0 --port 6006 &')
get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
This serves TensorBoard from the log files that were created: it starts TensorBoard on Colab, opens an ngrok tunnel to it and makes it reachable through a public URL provided by ngrok, which the final command prints. It works with TF 1.13, and presumably the same approach works for TF2 as well.
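The final curl | python3 -c step fails (with an IndexError or a JSON decoding error) if it runs before ngrok has registered the tunnel. A slightly more defensive stdlib version of the same lookup, as a sketch: it queries the same http://localhost:4040/api/tunnels endpoint and reads the same tunnels[0].public_url field as the one-liner; first_public_url and fetch_public_url are hypothetical names.

```python
import json
import urllib.request

# ngrok's local inspection API, the same endpoint the curl command queries.
NGROK_API = "http://localhost:4040/api/tunnels"


def first_public_url(tunnels_json):
    """Extract the first tunnel's public URL from the JSON returned by NGROK_API."""
    tunnels = json.loads(tunnels_json).get("tunnels", [])
    if not tunnels:
        raise RuntimeError("ngrok reports no tunnels yet; wait a moment and retry")
    return tunnels[0]["public_url"]


def fetch_public_url(api_url=NGROK_API, timeout=5):
    """Query the running ngrok agent and return its first public URL."""
    with urllib.request.urlopen(api_url, timeout=timeout) as resp:
        return first_public_url(resp.read().decode("utf-8"))
```

In a Colab cell, print(fetch_public_url()) after the two system_raw calls plays the role of the curl command.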

Keras with tensorflow throws ResourceExhaustedError

For research purposes, I am training a neural network that is updating its weights differently depending on the parity of the epoch:
1) If the epoch is even, update the weights of the NN with backpropagation.
2) If the epoch is odd, only update the model with update_weights_with_custom_function(), i.e. freeze the network for backpropagation.
Here is a simplified part of the code that implements this (notice the epochs=1):
for epoch in range(nb_epoch):
    if epoch % 2 == 0:
        model.trainable = True   # Unfreeze the model
    else:
        model.trainable = False  # Freeze the model
    model.compile(optimizer=optim, loss=gaussian_loss, metrics=['accuracy'])
    hist = model.fit(X_train, Y_train,
                     batch_size=batch_size,
                     epochs=1,
                     shuffle=True,
                     verbose=1,
                     callbacks=[tbCallBack, csv_epochs, early_stop],
                     validation_data=(X_val, Y_val))
    if epoch % 2 == 1:
        update_weights_with_custom_function()
Problem: after a few epochs, Keras throws a ResourceExhaustedError, but only with the TensorFlow backend, not with Theano. It seems that looping over compile() creates models without releasing them.
So what should I do? I know that K.clear_session() releases memory, but it requires saving the model and reloading it, which causes issues because load_model() does not work out of the box in my case.
I'm also open to other ways to do what I am trying to achieve (i.e. freezing a NN model depending on the parity of the epoch).
Summary: Keras with the TensorFlow backend throws a ResourceExhaustedError because I am looping over compile().
As Marcin Możejko pointed out, using eval() does exactly what I was trying to achieve.
I added a custom callback, which avoids the loop over compile().
The problem is now solved, even if the tensorflow issue was not addressed directly.