How to view train_on_batch tensorboard log files generated by Google Colab? - tensorflow

I know how to view TensorBoard plots on my local machine while my neural networks train in a local Jupyter Notebook, using the code below. What do I need to do differently when I train the neural network on Google Colab instead? I can't find any tutorials or examples online that use train_on_batch.
After defining my model (convnet)...
convnet.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(0.001),
                metrics=['accuracy']
                )
# create tensorboard graph data for the model
tb = tf.keras.callbacks.TensorBoard(log_dir='Logs/Exp_15',
                                    histogram_freq=0,
                                    batch_size=batch_size,
                                    write_graph=True,
                                    write_grads=False)
tb.set_model(convnet)
num_epochs = 3
batches_processed_counter = 0
for epoch in range(num_epochs):
    for batch in range(int(train_img.samples/batch_size)):
        batches_processed_counter = batches_processed_counter + 1
        # get next batch of images & labels
        X_imgs, X_labels = next(train_img)
        # train model, get cross entropy & accuracy for batch
        train_CE, train_acc = convnet.train_on_batch(X_imgs, X_labels)
        # validation images - just predict
        X_imgs_val, X_labels_val = next(val_img)
        val_CE, val_acc = convnet.test_on_batch(X_imgs_val, X_labels_val)
        # create tensorboard graph info for the cross entropy loss and training accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'train_loss': train_CE, 'train_acc': train_acc})
        # create tensorboard graph info for the cross entropy loss and VALIDATION accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'val_loss': val_CE, 'val_acc': val_acc})
        print('epoch', epoch, 'batch', batch, 'train_CE:', train_CE, 'train_acc:', train_acc)
        print('epoch', epoch, 'batch', batch, 'val_CE:', val_CE, 'val_acc:', val_acc)
tb.on_train_end(None)
I can see that the log file has been generated successfully within the Google Colab runtime. How do I view it in TensorBoard? I've seen solutions that describe downloading the log file to a local machine and viewing it in TensorBoard locally, but this doesn't display anything. Is there something missing in my code that would allow this to work in TensorBoard locally? And/or is there an alternative way to view the log data in TensorBoard within Google Colab?
In case it's important for the details of the solution, I'm on a Mac. Also, the tutorials I've seen online show how to use TensorBoard with Google Colab when training with fit, but I can't see how to adapt that to my code, which uses train_on_batch rather than fit.

Thanks to Dr Ryan Cunningham from Manchester Metropolitan University for the solution to this problem, which was the following:
%load_ext tensorboard
%tensorboard --logdir './Logs'
...which allows me to view the Tensorboard plots in the Google Colab document itself, and see the plots update while the NN is training.
So, the full set of code, to view the Tensorboard plots while the network is training is (after defining the neural network, which I've called convnet):
# compile the neural net after defining the loss, optimisation and
# performance metric
convnet.compile(loss='categorical_crossentropy',  # cross entropy is suited to
                                                  # multi-class classification
                optimizer=tf.keras.optimizers.Adam(0.001),
                metrics=['accuracy']
                )
# create tensorboard graph data for the model
tb = tf.keras.callbacks.TensorBoard(log_dir='Logs/Exp_15',
                                    histogram_freq=0,
                                    batch_size=batch_size,
                                    write_graph=True,
                                    write_grads=False)
tb.set_model(convnet)
%load_ext tensorboard
%tensorboard --logdir './Logs'
# iterate through the training set for x epochs,
# each time iterating through the batches,
# for each batch, train, calculate loss & optimise weights.
# (mini-batch approach)
num_epochs = 1
batches_processed_counter = 0
for epoch in range(num_epochs):
    for batch in range(int(train_img.samples/batch_size)):
        batches_processed_counter = batches_processed_counter + 1
        # get next batch of images & labels
        X_imgs, X_labels = next(train_img)
        # train model, get cross entropy & accuracy for batch
        train_CE, train_acc = convnet.train_on_batch(X_imgs, X_labels)
        # validation images - just predict
        X_imgs_val, X_labels_val = next(val_img)
        val_CE, val_acc = convnet.test_on_batch(X_imgs_val, X_labels_val)
        # create tensorboard graph info for the cross entropy loss and training accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'train_loss': train_CE, 'train_acc': train_acc})
        # create tensorboard graph info for the cross entropy loss and VALIDATION accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'val_loss': val_CE, 'val_acc': val_acc})
        print('epoch', epoch, 'batch', batch, 'train_CE:', train_CE, 'train_acc:', train_acc)
        print('epoch', epoch, 'batch', batch, 'val_CE:', val_CE, 'val_acc:', val_acc)
tb.on_train_end(None)
Note: it can take a few seconds after the cell has finished running before the cell output refreshes and shows the Tensorboard plots.
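The step bookkeeping in the loop, where every batch in every epoch gets its own position on the TensorBoard x-axis, can be sketched without TensorFlow. This is only an illustration: global_steps is a hypothetical helper, and the counts 5 and 10 are the ones mentioned in the code comments.

```python
def global_steps(num_epochs, batches_per_epoch):
    """Yield (epoch, batch, step) with step counting from 1 across all epochs."""
    step = 0
    for epoch in range(num_epochs):
        for batch in range(batches_per_epoch):
            step += 1  # plays the same role as batches_processed_counter
            yield epoch, batch, step


# 5 epochs of 10 batches each give one logged point per batch
steps = list(global_steps(5, 10))
print(len(steps), steps[0], steps[-1])
```

Because the step keeps increasing across epochs, points from later epochs do not overwrite earlier ones in the plot.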

!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('tensorboard --logdir /content/trainingdata/objectdetection/ckpt_output/trainingImatges/ --host 0.0.0.0 --port 6006 &')
get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
This starts TensorBoard on the log files you have created, opens an ngrok tunnel to it from Colab, and makes it accessible through a public URL provided by ngrok. The public URL is printed when you run the final command. It works with TF 1.13; I guess you can use the same approach for TF2 as well.
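The final curl command pipes ngrok's local API reply into a Python one-liner. A more readable version of that parsing step might look like the sketch below. It assumes only the JSON shape the one-liner itself relies on (a "tunnels" list whose entries have a "public_url"); first_public_url is a hypothetical helper and the sample host name is made up.

```python
import json


def first_public_url(tunnels_json):
    """Return the public URL of the first tunnel in an ngrok /api/tunnels reply."""
    tunnels = json.loads(tunnels_json).get("tunnels", [])
    if not tunnels:
        raise ValueError("no tunnels reported yet; is ngrok still starting up?")
    return tunnels[0]["public_url"]


# Illustrative payload only: the real reply contains more fields,
# and the host name below is invented for this example.
sample = '{"tunnels": [{"public_url": "https://example.ngrok.io"}]}'
print(first_public_url(sample))
```

Raising a clear error for an empty list helps when the cell runs before ngrok has finished starting, in which case the one-liner would fail with an IndexError.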

Related

Can't see GAN's losses in Tensorboard

I am developing a GAN and want to see the networks' losses in TensorBoard. I have added the callback to the fit call and I am launching TensorBoard from the main directory, but nothing appears in it. What else do I have to add?
Compile part of the GAN:
epochs = 15
gan = GAN(discriminator=discriminator, generator=generator, latent_dim=latent_dim)
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/", histogram_freq=1)
gan.compile(
    d_optimizer=keras.optimizers.Adam(learning_rate=0.00005),
    g_optimizer=keras.optimizers.Adam(learning_rate=0.0001),
    loss_fn=keras.losses.BinaryCrossentropy(),
)
gan.fit(
    dataset,
    epochs=epochs,
    callbacks=[GANMonitor(num_img=10, latent_dim=latent_dim), tb_callback],
)

Code worked fine one week ago, but keeps raising an error since yesterday: fine-tuning a BERT model via PyTorch on Colab

I am new to BERT. Two weeks ago I successfully fine-tuned a BERT model on an NLP classification task, though the outcome was not brilliant. Yesterday, however, when I tried to run the same code on the same data, it always raised an AttributeError saying: 'str' object has no attribute 'dim'. Please note that everything runs on Colab and uses PyTorch Transformers.
What should I do to fix it?
Here is one thing I tried when installing transformers, but it did not work. Instead of
!pip install transformers
I tried to use a previous transformers version:
!pip install --target lib --upgrade transformers==3.5.0
Any feedback will be greatly appreciated!
Please see the code and the error message as below:
Code:
train definition
# function to train the model
def train():
    model.train()
    total_loss, total_accuracy = 0, 0
    # empty list to save model predictions
    total_preds = []
    # iterate over batches
    for step, batch in enumerate(train_dataloader):
        # progress update after every 200 batches
        if step % 200 == 0 and not step == 0:
            print(' Batch {:>5,} of {:>5,}.'.format(step, len(train_dataloader)))
        # push the batch to gpu
        batch = [r.to(device) for r in batch]
        sent_id, mask, labels = batch
        # clear previously calculated gradients
        model.zero_grad()
        # get model predictions for the current batch
        preds = model(sent_id, mask)
        # compute the loss between actual and predicted values
        loss = cross_entropy(preds, labels)
        # add on to the total loss
        total_loss = total_loss + loss.item()
        # backward pass to calculate the gradients
        loss.backward()
        # clip the gradients to 1.0. It helps in preventing the exploding gradient problem
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        # update parameters
        optimizer.step()
        # update learning rate schedule
        # scheduler.step()
        # model predictions are stored on GPU. So, push it to CPU
        preds = preds.detach().cpu().numpy()
        # append the model predictions
        total_preds.append(preds)
    # compute the training loss of the epoch
    avg_loss = total_loss / len(train_dataloader)
    # predictions are in the form of (no. of batches, size of batch, no. of classes).
    # reshape the predictions in form of (number of samples, no. of classes)
    total_preds = np.concatenate(total_preds, axis=0)
    # returns the loss and predictions
    return avg_loss, total_preds
training process
# set initial loss to infinite
best_valid_loss = float('inf')
# empty lists to store training and validation loss of each epoch
train_losses = []
valid_losses = []
# for each epoch
for epoch in range(epochs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))
    # train model
    train_loss, _ = train()
    # evaluate model
    valid_loss, _ = evaluate()
    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')
    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')
Error message:
Epoch 1 / 10
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-41-c5138ddf6b25> in <module>()
12
13 #train model
---> 14 train_loss, _ = train()
15
16 #evaluate model
5 frames
/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py in linear(input, weight, bias)
1686 if any([type(t) is not Tensor for t in tens_ops]) and has_torch_function(tens_ops):
1687 return handle_torch_function(linear, tens_ops, input, weight, bias=bias)
-> 1688 if input.dim() == 2 and bias is not None:
1689 # fused op is marginally faster
1690 ret = torch.addmm(bias, input, weight.t())
AttributeError: 'str' object has no attribute 'dim'
As far as I remember, Colab used to ship an older transformers version, something like 2.11.0. Try:
!pip install transformers~=2.11.0
Change the version number until it works.
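One possible reason the same code breaks after a library upgrade, offered here as an assumption because the model class is not shown in the question: if the model unpacks the BERT output like a tuple and a newer transformers release returns a dict-like object instead, the unpacked names receive the dictionary keys, which are strings. The plain-Python sketch below reproduces that mechanism with an ordinary OrderedDict standing in for the model output; the key names are illustrative.

```python
from collections import OrderedDict

# Stand-in for a dict-like model output (not the real transformers class).
fake_output = OrderedDict(
    last_hidden_state=[[0.1, 0.2]],
    pooler_output=[0.3, 0.4],
)

# Tuple-style unpacking iterates over the KEYS of a dict, not its values,
# so cls_hs ends up holding a string.
_, cls_hs = fake_output
print(type(cls_hs).__name__, cls_hs)

# Reading the value by key returns the data that later layers expect.
cls_hs_fixed = fake_output["pooler_output"]
print(cls_hs_fixed)
```

A string passed on to a linear layer would fail exactly where the traceback points, at input.dim(). If this is the cause, pinning an older version as suggested above sidesteps the change in output type.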

Issue with Tensorboard and nothing logging

I was following Google's TensorBoard tutorial on hparams. However, when I try to apply it to my own model, nothing shows up in the logs. The main difference is that I used an Image Data Generator, but I do not see how that would affect the hyperparameters. I have included all the code used to set up the hyperparameters, but left out the model and the basic package imports for brevity.
# Load the TensorBoard notebook
%load_ext tensorboard
# Clear all logs
!rm -rf ./logs/
Here is what I have set up for the hyperparameters: just learning rate and weight decay. It is slightly modified from the tutorial, but largely the same style.
HP_lr = hp.HParam('learning_rate', hp.Discrete([3, 4, 5]))
HP_weight_decay= hp.HParam('l2_weight_decay', hp.Discrete([4, 5, 6]))
METRIC_ACCURACY = 'accuracy'
This is a little different to account for the values above, but those are simply variable names
# file writer
with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
    hp.hparams_config(
        hparams=[HP_lr, HP_weight_decay],
        metrics=[hp.Metric(METRIC_ACCURACY, display_name='Accuracy')],
    )
I have a function that builds the model taking an hparams argument. Besides using datagen.flow() in the model.fit, nothing changes.
def train_test_model(hparams):
    model = build_model(hparams)
    model.fit(datagen.flow(x_train, y_train, batch_size=64),
              epochs=1, verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, batch_size=64, verbose=1)
    return accuracy

# For each run log the metrics and hyperparameters used
def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record the values used in this trial
        accuracy = train_test_model(hparams)
        tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
Sets up the dictionary to be used by hp
session_num = 0
for learn_rate in HP_lr.domain.values:
    for wd in HP_weight_decay.domain.values:
        hparams = {
            HP_lr: 1*10**(-learn_rate),  # transform to something like 1e-3
            HP_weight_decay: 1*10**(-wd)
        }
        run_name = "run-%d" % session_num
        print('--- Starting trial: %s' % run_name)
        print({h.name: hparams[h] for h in hparams})
        run('logs/hparam_tuning/' + run_name, hparams)
        session_num += 1
%tensorboard --logdir logs/hparam_tuning
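The nested loops in the question turn each exponent in the discrete domains into a learning rate or weight decay and give every combination its own run directory. A minimal, TensorFlow-free sketch of that bookkeeping follows; build_trials is a hypothetical helper, and the exponents and directory layout mirror the ones in the question.

```python
from itertools import product


def build_trials(lr_exponents, wd_exponents, base_dir="logs/hparam_tuning"):
    """Return one dict per trial with its run directory and real-valued hparams."""
    trials = []
    grid = product(lr_exponents, wd_exponents)  # same order as the nested loops
    for session_num, (lr_exp, wd_exp) in enumerate(grid):
        trials.append({
            "run_dir": f"{base_dir}/run-{session_num}",
            "learning_rate": 10.0 ** (-lr_exp),  # e.g. exponent 3 -> 1e-3
            "l2_weight_decay": 10.0 ** (-wd_exp),
        })
    return trials


trials = build_trials([3, 4, 5], [4, 5, 6])
for trial in trials[:2]:
    print(trial)
```

Listing the trials up front makes it easy to check that each run writes to a distinct subdirectory of the directory passed to --logdir.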

TensorBoard Callback in Keras does not respect initial_epoch of fit?

I'm trying to train multiple models in parallel on a single graphics card. To achieve that, I need to resume training of models from saved weights, which is not a problem. The model.fit() method even has a parameter, initial_epoch, that lets me tell the model which epoch the loaded model is on. However, when I pass a TensorBoard callback to the fit() method in order to monitor the training of the models, all data in TensorBoard is shown at x=0.
Is there a way to overcome this and adjust the epoch in TensorBoard?
By the way: I'm running Keras 2.0.6 and TensorFlow 1.3.0.
self.callbacks = [TensorBoardCallback(log_dir='./../logs/'+self.model_name, histogram_freq=0, write_graph=True, write_images=False, start_epoch=self.step_num)]
self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'], epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose, callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5'%(self.model_name))
self.model.load_weights('./weights/%s.hdf5'%(self.model_name))
self.model.fit(x=self.data['X_train'], y=self.data['y_train'], batch_size=self.input_params[-1]['batch_size'], epochs=1, validation_data=(self.data['X_test'], self.data['y_test']), verbose=verbose, callbacks=self.callbacks, shuffle=self.hyperparameters['shuffle_data'], initial_epoch=self.step_num)
self.model.save_weights('./weights/%s.hdf5'%(self.model_name))
The resulting graph in TensorBoard looks like this, which is not what I was hoping for:
Update:
When passing epochs=10 to the first model.fit() the 10 epoch results are displayed in TensorBoard (see picture).
However, when reloading the model and running it (with the same callback attached), the on_epoch_end method of the callback never gets called.
It turns out that the epochs value I pass to model.fit() is the index of the final epoch, not the number of epochs to train for, so it has to be counted FROM the initial_epoch specified. So if initial_epoch=self.step_num, then epochs=self.step_num+10 if I want to train for 10 more epochs.
Say we have just started fitting our model and the first run trains for 30 epochs
(please ignore the other parameters and just look at epochs and initial_epoch):
model.fit(train_dataloader, validation_data=test_dataloader, epochs=30, steps_per_epoch=len(train_dataloader), callbacks=callback_list)
Now say that after 30 epochs we want to continue from the 31st epoch (you can see this in TensorBoard) with a different learning rate for our Adam optimizer (or any optimizer).
What we can do is:
model.optimizer.learning_rate = 0.0005
model.fit(train_dataloader, validation_data=test_dataloader, initial_epoch=30, epochs=55, steps_per_epoch=len(train_dataloader), callbacks=callback_list)
=> Here initial_epoch is where we left off training last time;
epochs = initial_epoch + the number of epochs we want to run in this second fit.
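The arithmetic behind resuming can be captured in a small helper. This is a sketch, not part of Keras: resume_fit_kwargs is a hypothetical function that only computes the two epoch arguments, using the numbers 30 and 55 from the example.

```python
def resume_fit_kwargs(initial_epoch, extra_epochs):
    """Build the epoch arguments for a Keras-style fit() call that resumes training.

    epochs is the index of the final epoch, so it must include the
    epochs that were already completed.
    """
    if initial_epoch < 0 or extra_epochs <= 0:
        raise ValueError("initial_epoch must be >= 0 and extra_epochs > 0")
    return {"initial_epoch": initial_epoch, "epochs": initial_epoch + extra_epochs}


# Resuming after 30 finished epochs for 25 more gives epochs=55
kwargs = resume_fit_kwargs(30, 25)
print(kwargs)
```

The returned dictionary could then be unpacked into the call, for example model.fit(..., **kwargs), so the two values cannot drift apart.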

Using batch size with TensorFlow Validation Monitor

I'm using tf.contrib.learn.Estimator to train a CNN with 20+ layers on a GTX 1080 (8 GB). My dataset is not very large, but my GPU runs out of memory with a batch size greater than 32. So I'm using a batch size of 16 for training and evaluating the classifier (the GPU also runs out of memory during evaluation if a batch_size is not specified).
# Configure the accuracy metric for evaluation
metrics = {
    "accuracy":
        learn.MetricSpec(
            metric_fn=tf.metrics.accuracy, prediction_key="classes"),
}
# Evaluate the model and print results
eval_results = classifier.evaluate(
    x=X_test, y=y_test, metrics=metrics, batch_size=16)
Now the problem is that after every 100 steps, only the training loss is printed on screen. I want to print the validation loss and accuracy as well, so I'm using a ValidationMonitor:
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    X_test,
    y_test,
    every_n_steps=50)
# Train the model
classifier.fit(
    x=X_train,
    y=y_train,
    batch_size=8,
    steps=20000,
    monitors=[validation_monitor])
Actual problem: My code crashes (out of memory) when I use ValidationMonitor. I think the problem might be solved if I could specify a batch size here as well, but I can't figure out how to do that. I want ValidationMonitor to evaluate my validation data in batches, as I do manually after training with classifier.evaluate. Is there a way to do that?
The ValidationMonitor's constructor accepts a batch_size arg that should do the trick.
You need to add config=tf.contrib.learn.RunConfig(save_checkpoints_secs=save_checkpoints_secs) in your model definition. save_checkpoints_secs can be replaced by save_checkpoints_steps, but you cannot set both.
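To illustrate what evaluating validation data in batches means in general, here is a framework-free sketch. It is not how ValidationMonitor is implemented; batched_accuracy and toy_predict are hypothetical names, and the toy data exists only to make the example runnable.

```python
def batched_accuracy(predict_fn, inputs, labels, batch_size):
    """Accuracy over the whole set, computed one batch at a time."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    correct = 0
    for start in range(0, len(inputs), batch_size):
        batch_x = inputs[start:start + batch_size]
        batch_y = labels[start:start + batch_size]
        predictions = predict_fn(batch_x)  # only this slice is passed to the model
        correct += sum(p == y for p, y in zip(predictions, batch_y))
    return correct / len(inputs) if inputs else 0.0


# Toy stand-in for a classifier: predicts 1 for non-negative inputs.
def toy_predict(batch):
    return [1 if x >= 0 else 0 for x in batch]


xs = [-2, -1, 0, 1, 2, 3, -3]
ys = [0, 0, 1, 1, 1, 0, 0]
print(batched_accuracy(toy_predict, xs, ys, batch_size=3))
```

Counting correct predictions per batch and dividing once at the end gives the same result for any batch size, including a final batch that is smaller than the rest.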