Issue with Tensorboard and nothing logging - tensorflow

I was following Google's TensorBoard tutorial with HParams here. However, when I try to implement that in my own model, nothing shows up in the logs. The main difference is that I used an ImageDataGenerator, but I do not see how that would affect the hyperparameters. I have included all the code used to set up the hyperparameters, but removed the model definition and the basic package imports for brevity.
# Load the TensorBoard notebook
%load_ext tensorboard
# Clear all logs
!rm -rf ./logs/
Here is what I have set up for the hyperparameters. Just learning rate and weight decay. Slightly augmented from the tutorial, but largely very much the same style.
HP_lr = hp.HParam('learning_rate', hp.Discrete([3, 4, 5]))
HP_weight_decay = hp.HParam('l2_weight_decay', hp.Discrete([4, 5, 6]))
METRIC_ACCURACY = 'accuracy'
This is a little different to account for the values above, but those are simply variable names
# file writer
with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
    hp.hparams_config(
        hparams=[HP_lr, HP_weight_decay],
        metrics=[hp.Metric(METRIC_ACCURACY, display_name='Accuracy')],
    )
I have a function that builds the model taking an hparams argument. Besides using datagen.flow() in the model.fit, nothing changes.
def train_test_model(hparams):
    model = build_model(hparams)
    model.fit(datagen.flow(x_train, y_train, batch_size=64),
              epochs=1, verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, batch_size=64, verbose=1)
    return accuracy
# For each run log the metrics and hyperparameters used
def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record the values used in this trial
        accuracy = train_test_model(hparams)
        tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
This sets up the dictionary of hyperparameter values to be used by hp for each run:
session_num = 0

for learn_rate in HP_lr.domain.values:
    for wd in HP_weight_decay.domain.values:
        hparams = {
            HP_lr: 1 * 10 ** (-learn_rate),  # transform to something like 1e-3
            HP_weight_decay: 1 * 10 ** (-wd)
        }
        run_name = "run-%d" % session_num
        print('--- Starting trial: %s' % run_name)
        print({h.name: hparams[h] for h in hparams})
        run('logs/hparam_tuning/' + run_name, hparams)
        session_num += 1
%tensorboard --logdir logs/hparam_tuning

Related

Training seq2seq model on Google Colab TPU with big dataset - Keras

I'm trying to train a sequence-to-sequence model for machine translation using Keras on a Google Colab TPU.
I have a dataset which I can load in memory, but I have to preprocess it to feed it to the model. In particular, I need to convert the target words to one-hot vectors, and with many examples I can't keep the entire conversion in memory, so I need to generate batches of data.
I'm using this function as a batch generator:
def generate_batch_bert(X_ids, X_masks, y, batch_size=1024):
    '''Generate a batch of data'''
    while True:
        for j in range(0, len(X_ids), batch_size):
            # batch of encoder and decoder data
            encoder_input_data_ids = X_ids[j:j+batch_size]
            encoder_input_data_masks = X_masks[j:j+batch_size]
            y_decoder = y[j:j+batch_size]

            # decoder target and input for teacher forcing
            decoder_input_data = y_decoder[:, :-1]
            decoder_target_seq = y_decoder[:, 1:]

            # batch of decoder target data
            decoder_target_data = to_categorical(decoder_target_seq, vocab_size_fr)

            # keep only batches with the right number of instances for training on TPU
            if encoder_input_data_ids.shape[0] == batch_size:
                yield ([encoder_input_data_ids, encoder_input_data_masks, decoder_input_data],
                       decoder_target_data)
The problem is that whenever I try to run the fit function as follows:
model.fit(x=generate_batch_bert(X_train_ids, X_train_masks, y_train, batch_size=batch_size),
          steps_per_epoch=train_samples // batch_size,
          epochs=epochs,
          callbacks=callbacks,
          validation_data=generate_batch_bert(X_val_ids, X_val_masks, y_val, batch_size=batch_size),
          validation_steps=val_samples // batch_size)
I get the following error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
Not sure what's wrong and how I can solve this problem.
EDIT
I tried loading a smaller amount of data into memory, so that the conversion of the target words to one-hot encoding doesn't crash the kernel, and it actually works. So there is obviously something wrong with how I generate batches.
It's hard to tell what's wrong since you don't provide your model definition nor any sample data. However, I'm fairly certain that you're running into the same TensorFlow bug that I recently got bitten by.
The workaround is to use the tensorflow.data API, which works much better with TPUs. Like this:
from tensorflow.data import Dataset
import tensorflow as tf

def map_fn(X_id, X_mask, y):
    decoder_target_data = tf.one_hot(y[1:], vocab_size_fr)
    return (X_id, X_mask, y[:-1]), decoder_target_data

...

X_ids = Dataset.from_tensor_slices(X_ids)
X_masks = Dataset.from_tensor_slices(X_masks)
y = Dataset.from_tensor_slices(y)

ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024)

model.fit(x=ds, ...)
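One caveat to add here, as an assumption on my part rather than something stated in the answer above: since the question passes steps_per_epoch, a finite Dataset can run out of batches partway through training, so it is common to also chain repeat() and prefetch() onto the pipeline, roughly like this:
# Sketch only: repeat the dataset indefinitely and overlap preprocessing with
# training; AUTOTUNE lets TensorFlow pick the prefetch buffer size.
ds = ds.repeat().prefetch(tf.data.experimental.AUTOTUNE)

model.fit(x=ds,
          steps_per_epoch=train_samples // batch_size,  # still required with repeat()
          epochs=epochs)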

How to view train_on_batch tensorboard log files generated by Google Colab?

I know how to view TensorBoard plots on my local machine whilst my neural networks train in a local Jupyter Notebook, using the code below. What do I need to do differently when I use Google Colab to train the neural network instead? I can't see any tutorials/examples online that use train_on_batch.
After defining my model (convnet)...
convnet.compile(loss='categorical_crossentropy',
                optimizer=tf.keras.optimizers.Adam(0.001),
                metrics=['accuracy'])

# create tensorboard graph data for the model
tb = tf.keras.callbacks.TensorBoard(log_dir='Logs/Exp_15',
                                    histogram_freq=0,
                                    batch_size=batch_size,
                                    write_graph=True,
                                    write_grads=False)
tb.set_model(convnet)

num_epochs = 3
batches_processed_counter = 0

for epoch in range(num_epochs):
    for batch in range(int(train_img.samples / batch_size)):
        batches_processed_counter = batches_processed_counter + 1

        # get next batch of images & labels
        X_imgs, X_labels = next(train_img)

        # train model, get cross entropy & accuracy for batch
        train_CE, train_acc = convnet.train_on_batch(X_imgs, X_labels)

        # validation images - just predict
        X_imgs_val, X_labels_val = next(val_img)
        val_CE, val_acc = convnet.test_on_batch(X_imgs_val, X_labels_val)

        # create tensorboard graph info for the cross entropy loss and training accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'train_loss': train_CE, 'train_acc': train_acc})

        # create tensorboard graph info for the cross entropy loss and VALIDATION accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'val_loss': val_CE, 'val_acc': val_acc})

        print('epoch', epoch, 'batch', batch, 'train_CE:', train_CE, 'train_acc:', train_acc)
        print('epoch', epoch, 'batch', batch, 'val_CE:', val_CE, 'val_acc:', val_acc)

tb.on_train_end(None)
I can see that the log file generates successfully within the Google Colab runtime. How do I view it in TensorBoard? I've seen solutions that describe downloading the log file to a local machine and viewing it in TensorBoard locally, but this doesn't display anything. Is there something I'm missing in my code to allow this to work in TensorBoard locally? And/or is there an alternative solution to view the log data in TensorBoard within Google Colab?
In case it's important for the details of the solution, I'm on a Mac. Also, the tutorials I've seen online show how to use TensorBoard with Google Colab when using fit, but I can't see how to modify my code, which uses train_on_batch rather than fit.
Thanks to Dr Ryan Cunningham from Manchester Metropolitan University for the solution to this problem, which was the following:
%load_ext tensorboard
%tensorboard --logdir './Logs'
...which allows me to view the Tensorboard plots in the Google Colab document itself, and see the plots update while the NN is training.
So the full set of code to view the TensorBoard plots while the network is training is (after defining the neural network, which I've called convnet):
# compile the neural net after defining the loss, optimisation and
# performance metric
convnet.compile(loss='categorical_crossentropy',  # cross entropy is suited to
                                                  # multi-class classification
                optimizer=tf.keras.optimizers.Adam(0.001),
                metrics=['accuracy'])

# create tensorboard graph data for the model
tb = tf.keras.callbacks.TensorBoard(log_dir='Logs/Exp_15',
                                    histogram_freq=0,
                                    batch_size=batch_size,
                                    write_graph=True,
                                    write_grads=False)
tb.set_model(convnet)

%load_ext tensorboard
%tensorboard --logdir './Logs'

# iterate through the training set for x epochs,
# each time iterating through the batches,
# for each batch, train, calculate loss & optimise weights.
# (mini-batch approach)
num_epochs = 1
batches_processed_counter = 0

for epoch in range(num_epochs):
    for batch in range(int(train_img.samples / batch_size)):
        batches_processed_counter = batches_processed_counter + 1

        # get next batch of images & labels
        X_imgs, X_labels = next(train_img)

        # train model, get cross entropy & accuracy for batch
        train_CE, train_acc = convnet.train_on_batch(X_imgs, X_labels)

        # validation images - just predict
        X_imgs_val, X_labels_val = next(val_img)
        val_CE, val_acc = convnet.test_on_batch(X_imgs_val, X_labels_val)

        # create tensorboard graph info for the cross entropy loss and training accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'train_loss': train_CE, 'train_acc': train_acc})

        # create tensorboard graph info for the cross entropy loss and VALIDATION accuracies
        # for every batch in every epoch (so if 5 epochs and 10 batches there should be 50 accuracies)
        tb.on_epoch_end(batches_processed_counter, {'val_loss': val_CE, 'val_acc': val_acc})

        print('epoch', epoch, 'batch', batch, 'train_CE:', train_CE, 'train_acc:', train_acc)
        print('epoch', epoch, 'batch', batch, 'val_CE:', val_CE, 'val_acc:', val_acc)

tb.on_train_end(None)
Note: it can take a few seconds after the cell has finished running before the cell output refreshes and shows the Tensorboard plots.
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip
get_ipython().system_raw('tensorboard --logdir /content/trainingdata/objectdetection/ckpt_output/trainingImatges/ --host 0.0.0.0 --port 6006 &')
get_ipython().system_raw('./ngrok http 6006 &')
! curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
This gives you a TensorBoard from the log files created. It creates a tunnel for the TensorBoard on Colab and makes it accessible through a public URL provided by ngrok. When you run the final command, the public URL is printed. It works with TF 1.13; I guess you can use the same approach for TF2 as well.

Hparams plugin with tf.keras (tensorflow 2.0)

I am trying to follow the example from the TensorFlow docs to set up hyperparameter logging. It also mentions that, if you use tf.keras, you can just use the callback hp.KerasCallback(logdir, hparams). However, if I use the callback I don't get my metrics (only the outcome).
The trick is to define the Hparams config with the path in which TensorBoard saves its validation logs.
So, if your TensorBoard callback is set up as:
log_dir = 'path/to/training-logs'
tensorboard_cb = TensorBoard(log_dir=log_dir)
Then you should set up Hparams like this:
hparams_dir = os.path.join(log_dir, 'validation')

with tf.summary.create_file_writer(hparams_dir).as_default():
    hp.hparams_config(
        hparams=HPARAMS,
        metrics=[hp.Metric('epoch_accuracy')]  # metric saved by tensorboard_cb
    )

hparams_cb = hp.KerasCallback(
    writer=hparams_dir,
    hparams=HPARAMS
)
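To make the wiring explicit (this last snippet is my own addition rather than part of the original answer, and the data names are placeholders), both callbacks are then passed to model.fit together:
# tensorboard_cb writes the epoch metrics, hparams_cb records the
# hyperparameter values for this run under the same log directory
model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=10,
          callbacks=[tensorboard_cb, hparams_cb])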
I managed, but I'm not entirely sure what the magic word was. Here is my flow in case it helps.
HP_NUM_LATENT = hp.HParam('num_latent_dim', hp.Discrete([2, 5, 100]))

hparams = {
    HP_NUM_LATENT: num_latent,
}

callbacks.append(hp.KerasCallback(log_dir, hparams))

model = create_simple_model(latent_dim=hparams[HP_NUM_LATENT])  # returns compiled model
model.fit(x, y, validation_data=validation_data,
          epochs=4,
          verbose=2,
          callbacks=callbacks)
Since I lost a couple of hours because of this, I would like to add to Julian's good remark about defining the hparams config: the tag of the metric you want to log with hparams, and possibly its group, as in hp.Metric(tag='epoch_accuracy', group='validation'), should match a metric that you capture with Keras model.fit(..., metrics=...). See the hparams_demo for a good example.
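For instance (my own illustration of that point, not code from the answer above): if the model is compiled with the Keras 'accuracy' metric, the TensorBoard callback logs it as 'epoch_accuracy' under the validation run, so the hparams config has to reference exactly that tag and group:
# Keras 'accuracy' is written by the TensorBoard callback as 'epoch_accuracy'
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# ...so the hparams config must point at the same tag/group
with tf.summary.create_file_writer(log_dir).as_default():
    hp.hparams_config(
        hparams=HPARAMS,
        metrics=[hp.Metric('epoch_accuracy', group='validation')],
    )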
I just want to add to the previous answers. If you are using TensorBoard in a notebook on Colab, the issue may not be due to your code, but due to how TensorBoard is run on Colab. And the solution is to kill the existing TensorBoard and launch it again.
Please correct me if I am wrong.
Sample code:
from tensorboard.plugins.hparams import api as hp

HP_LR = hp.HParam('learning_rate', hp.Discrete([1e-4, 5e-4, 1e-3]))
HPARAMS = [HP_LR]

# this METRICS does not seem to have any effect in my example, as
# hp uses epoch_accuracy and epoch_loss for both training and validation anyway.
METRICS = [hp.Metric('epoch_accuracy', group="validation", display_name='val_accuracy')]

# save the configuration
log_dir = '/content/logs/hparam_tuning'
with tf.summary.create_file_writer(log_dir).as_default():
    hp.hparams_config(hparams=HPARAMS, metrics=METRICS)

def fitness_func(hparams, seed):
    rng = random.Random(seed)
    # here we build the model
    model = tf.keras.Sequential(...)
    model.compile(..., metrics=['accuracy'])  # need to pass the metric of interest
    # set up callbacks
    _log_dir = os.path.join(log_dir, seed)
    tb_callbacks = tf.keras.callbacks.TensorBoard(_log_dir)  # log metrics
    hp_callbacks = hp.KerasCallback(_log_dir, hparams)       # log hparams
    # fit the model
    history = model.fit(
        ..., validation_data=(x_te, y_te), callbacks=[tb_callbacks, hp_callbacks])
rng = random.Random(0)
session_index = 0

# random search
num_session_groups = 4
sessions_per_group = 2

for group_index in range(num_session_groups):
    hparams = {h: h.domain.sample_uniform(rng) for h in HPARAMS}
    hparams_string = str(hparams)
    for repeat_index in range(sessions_per_group):
        session_id = str(session_index)
        session_index += 1
        fitness_func(hparams, session_id)
To check if there is any existing TensorBoard process, run the following in Colab:
!ps ax | grep tensorboard
Assume PID for the TensorBoard process is 5315. Then,
!kill 5315
and run
# of course, replace the dir below with your log_dir
%tensorboard --logdir='/content/logs/hparam_tuning'
In my case, after I reset TensorBoard as above, it can properly log the metrics specified in model.compile, i.e., accuracies.
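A side note from me (not part of the answer above): recent TensorBoard versions also ship a small notebook helper that lists the instances running in the Colab runtime, which saves grepping for the PID by hand:
# list TensorBoard instances known to this notebook runtime (ports and PIDs)
from tensorboard import notebook
notebook.list()

# then start (or reattach to) one on the hparams log directory
notebook.start("--logdir /content/logs/hparam_tuning")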

how to track percentage complete and average training iteration runtime in tensorboard?

I have what I think should be a simple problem but I can't seem to figure it out.
Let's say that I have something like this
with tf.Session(graph=self.training_graph) as sess:
    init = tf.global_variables_initializer()
    logger.info("initializing global variables")
    sess.run(init)

    # add the operations that distort input images according to the hyperparameters
    self._setup_meta_training_tensors()
    self._add_jpeg_decoding()
    self._add_input_distortions()

    evaluation_step, prediction = self._add_evaluation_step(
        self.train_final_tensor, self.train_ground_truth_input)

    self.merged = tf.summary.merge_all()
    self.train_writer = tf.summary.FileWriter(os.path.join(
        self.model.tensorboard_directory, 'train/'), sess.graph)
    self.validation_writer = tf.summary.FileWriter(os.path.join(
        self.model.tensorboard_directory, 'validation/'))
    self.train_saver = tf.train.Saver()

    for step in range(self.training_steps):
        start = time.time()

        train_bottlenecks, train_ground_truth = (
            self._get_random_distorted_bottlenecks(sess,
                self.training_batch_size,
                self.IMAGE_CATEGORY_TRAINING,
                self.train_bottleneck_tensor,
                self.train_resized_input_tensor))

        # Feed the bottlenecks and ground truth into the graph, and run a training
        # step. Capture training summaries for TensorBoard with the `merged` op.
        train_summary, _ = sess.run(
            [self.merged, self.train_step],
            feed_dict={self.train_bottleneck_input: train_bottlenecks,
                       self.train_ground_truth_input: train_ground_truth})
        train_time = time.time() - start

        self.train_writer.add_summary(train_summary, step)

        is_last_step = (step + 1 == self.training_steps)
        if (step % self.eval_step_interval) == 0 or is_last_step:
            train_accuracy, cross_entropy_value = sess.run(
                [evaluation_step, self.cross_entropy],
                feed_dict={self.train_bottleneck_input: train_bottlenecks,
                           self.train_ground_truth_input: train_ground_truth})

            validation_bottlenecks, validation_ground_truth, _ = (
                self._get_random_bottlenecks(sess,
                    self.validation_batch_size,
                    self.IMAGE_CATEGORY_VALIDATION,
                    self.train_bottleneck_tensor,
                    self.train_resized_input_tensor))

            validation_summary, validation_accuracy = sess.run(
                [self.merged, evaluation_step],
                feed_dict={self.train_bottleneck_input: validation_bottlenecks,
                           self.train_ground_truth_input: validation_ground_truth})

            self.validation_writer.add_summary(validation_summary, step)
Now my TensorBoard is tracking all sorts of variables relating to self.training_graph: accuracy, cross entropy, information about the weights, and so on.
All I want to do is have another graph in TensorBoard that tracks the average runtime of each training step. If I time the step (see train_time), how do I put those timings into an ever-growing series and show it in TensorBoard?
The issue seems to be that these values aren't part of my main model graph; they're separate values. If I make a new graph that simply appends new runtimes, it doesn't show up in TensorBoard. I could make them part of the graph, but that seems dumb: why would my complicated ML graph have a random part that calculates the average training iteration runtime?
I would use a helper library like https://github.com/lanpa/tensorboardX which abstracts away an annoying additional session call.
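For completeness, here is a minimal sketch of what that could look like with tensorboardX (my own illustration, not code from the answer above), writing the measured step time next to the existing logs:
# sketch: log the wall-clock time of each training step with tensorboardX,
# pointing the writer at a sibling directory so it appears alongside the runs
from tensorboardX import SummaryWriter

timing_writer = SummaryWriter(os.path.join(self.model.tensorboard_directory, 'timing/'))

for step in range(self.training_steps):
    start = time.time()
    # ... run the training step exactly as above ...
    train_time = time.time() - start
    timing_writer.add_scalar('train_time_seconds', train_time, step)

timing_writer.close()
In plain TF 1.x you could achieve the same thing without the extra library by constructing a tf.Summary protobuf by hand and passing it to the existing self.train_writer.add_summary(summary, step).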

load multiple models in Tensorflow

I have written the following convolutional neural network (CNN) class in TensorFlow [I have omitted some lines of code for clarity].
class CNN:
    def __init__(self,
                 num_filters=16,      # initial number of convolution filters
                 num_layers=5,        # number of convolution layers
                 num_input=2,         # number of channels in input
                 num_output=5,        # number of channels in output
                 learning_rate=1e-4,  # learning rate for the optimizer
                 display_step=5000,   # displays training results every display_step epochs
                 num_epoch=10000,     # number of epochs for training
                 batch_size=64,       # batch size for mini-batch processing
                 restore_file=None,   # restore file (default: None)
                 ):

        # define placeholders
        self.image = tf.placeholder(tf.float32, shape=(None, None, None, self.num_input))
        self.groundtruth = tf.placeholder(tf.float32, shape=(None, None, None, self.num_output))

        # builds CNN and computes prediction
        self.pred = self._build()

        # I have already created tensorflow session and saver objects
        self.sess = tf.Session()
        self.saver = tf.train.Saver()

        # also, I have defined the loss function and optimizer as
        self.loss = self._loss_function()
        self.optimizer = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)

        if restore_file is not None:
            print("model exists...loading from the model")
            self.saver.restore(self.sess, restore_file)
        else:
            print("model does not exist...initializing")
            self.sess.run(tf.initialize_all_variables())

    def _build(self):
        # builds CNN
        ...

    def _loss_function(self):
        # computes loss
        ...

    def train(self, train_x, train_y, val_x, val_y):
        # uses mini batch to minimize the loss
        self.sess.run(self.optimizer, feed_dict={self.image: sample, self.groundtruth: gt})

        # I save the session after n=10 epochs as:
        if epoch % n == 0:
            self.saver.save(sess, 'snapshot', global_step=epoch)

    # finally my predict function is
    def predict(self, X):
        return self.sess.run(self.pred, feed_dict={self.image: X})
I have trained two CNNs for two separate tasks independently. Each took around 1 day. Say, model1 and model2 are saved as 'snapshot-model1-10000' and 'snapshot-model2-10000' (with their corresponding meta files) respectively. I can test each model and compute its performance separately.
Now, I want to load these two models in a single script. I would naturally try to do it as below:
cnn1 = CNN(..., restore_file='snapshot-model1-10000',..........)
cnn2 = CNN(..., restore_file='snapshot-model2-10000',..........)
I encounter the error [The error message is long. I just copied/pasted a snippet of it.]
NotFoundError: Tensor name "Variable_26/Adam_1" not found in checkpoint files /home/amitkrkc/codes/A549_models/snapshot-hela-95000
[[Node: save_1/restore_slice_85 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/restore_slice_85/tensor_name, save_1/restore_slice_85/shape_and_slice)]]
Is there a way to load from these two files two separate CNNs? Any suggestion/comment/feedback is welcome.
Thank you,
Yes there is. Use separate graphs.
g1 = tf.Graph()
g2 = tf.Graph()

with g1.as_default():
    cnn1 = CNN(..., restore_file='snapshot-model1-10000', ...)

with g2.as_default():
    cnn2 = CNN(..., restore_file='snapshot-model2-10000', ...)
EDIT:
If you want them in the same graph, you'll have to rename some variables. One idea is to have each CNN in a separate scope and let the saver handle the variables in that scope, e.g.:
saver = tf.train.Saver(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='model1'))
and inside the CNN, wrap all of your construction in that scope:
with tf.variable_scope('model1'):
    ...
EDIT2:
Another idea is to rename the variables which the saver manages (since I assume you want to use your saved checkpoints without retraining everything). Saving allows different variable names in the graph and in the checkpoint; have a look at the documentation for initialization.
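A rough sketch of what that renaming could look like, under the assumption that the old checkpoint was written without scopes and the new graph builds its variables under a 'model1' scope (names here are illustrative, not from the original answer):
# map checkpoint variable names (without the scope prefix) to the rescoped
# variables in the current graph, then restore with that mapping
with tf.variable_scope('model1'):
    ...  # build the network so its variables are named 'model1/...'

model1_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='model1')
var_map = {v.op.name.replace('model1/', '', 1): v for v in model1_vars}

saver_1 = tf.train.Saver(var_list=var_map)
saver_1.restore(sess, 'snapshot-model1-10000')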
This should be a comment on the most up-voted answer, but I do not have enough reputation to do that. Anyway.
If you (or anyone who searched and got to this point) are still having trouble with the solution provided by lpp and you are using Keras, check the following quote from GitHub:
This is because Keras shares a global session if no default TF session is provided.
When model1 is created, it is on graph1.
When model1 loads weights, the weights live in the Keras global session, which is associated with graph1.
When model2 is created, it is on graph2.
When model2 loads weights, the global session does not know graph2.
A solution like the one below may help:
from tensorflow import Graph, Session
from keras.models import model_from_json

graph1 = Graph()
with graph1.as_default():
    session1 = Session()
    with session1.as_default():
        with open('model1_arch.json') as arch_file:
            model1 = model_from_json(arch_file.read())
        model1.load_weights('model1_weights.h5')
        # K.get_session() is session1

# do the same for graph2, session2, model2
You need to create 2 sessions and restore the 2 models separately. In order for this to work you need to do the following:
1a. When you're saving the models you need to add scopes to the variable names. That way you will know which variables belong to which model:
# The first model
tf.Variable(tf.zeros([self.batch_size]), name="model_1/Weights")
...
# The second model
tf.Variable(tf.zeros([self.batch_size]), name="model_2/Weights")
...
1b. Alternatively, if you have already saved the models, you can rename the variables by adding a scope with this script.
2. When you restore the different models you need to filter by variable name like this:
# The first model
sess_1 = tf.Session()
sess_1.run(tf.initialize_all_variables())
saver_1 = tf.train.Saver([v for v in tf.all_variables() if 'model_1' in v.name])
saver_1.restore(sess_1, weights_1_file)
sess_1.run(pred, feed_dict={image: X})
# The second model
sess_2 = tf.Session()
sess_2.run(tf.initialize_all_variables())
saver_2 = tf.train.Saver([v for v in tf.all_variables() if 'model_2' in v.name])
saver_2.restore(sess_2, weights_2_file)
sess_2.run(pred, feed_dict={image: X})
I encountered the same problem and could not solve it (without retraining) with any solution I found on the internet. So what I did is load each model in two separate threads which communicate with the main thread. The code is simple enough to write; you just have to be careful when you synchronize the threads.
In my case each thread received the input for its problem and returned the output to the main thread. It works without any observable overhead.
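A minimal sketch of that idea, assuming the CNN class from the question and using hypothetical queue-based plumbing of my own (not code from the original answer):
# each worker thread owns its own graph and session; the main thread talks to
# it through a pair of queues
import threading
import queue
import tensorflow as tf

def model_worker(restore_file, request_q, response_q):
    graph = tf.Graph()
    with graph.as_default():
        cnn = CNN(restore_file=restore_file)  # the CNN class from the question
        while True:
            X = request_q.get()
            if X is None:                     # sentinel: shut the worker down
                break
            response_q.put(cnn.predict(X))

req1, res1 = queue.Queue(), queue.Queue()
threading.Thread(target=model_worker,
                 args=('snapshot-model1-10000', req1, res1),
                 daemon=True).start()

req1.put(some_batch)   # hypothetical input batch
pred1 = res1.get()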
One way is to clear your session if you want to train or load multiple models in succession. You can easily do this using
from keras import backend as K

# load and use model 1
K.clear_session()

# load and use model 2
K.clear_session()
K.clear_session() destroys the current TF graph and creates a new one.
Useful to avoid clutter from old models / layers.
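As a concrete (hypothetical) illustration of that pattern, assuming both models were saved with model.save() to the file names below:
# load and use the models one after another, clearing the graph in between
from keras import backend as K
from keras.models import load_model

model1 = load_model('model1.h5')   # hypothetical file name
preds1 = model1.predict(x1)
K.clear_session()                  # drop model1's graph before loading model2

model2 = load_model('model2.h5')   # hypothetical file name
preds2 = model2.predict(x2)
K.clear_session()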