I am trying to follow the example from the TensorFlow docs and set up hyperparameter logging. The docs also mention that, if you use tf.keras, you can just use the callback hp.KerasCallback(logdir, hparams). However, if I use the callback I don't get my metrics (only the outcome).
The trick is to define the Hparams config with the path in which TensorBoard saves its validation logs.
So, if your TensorBoard callback is set up as:
log_dir = 'path/to/training-logs'
tensorboard_cb = TensorBoard(log_dir=log_dir)
Then you should set up Hparams like this:
hparams_dir = os.path.join(log_dir, 'validation')

with tf.summary.create_file_writer(hparams_dir).as_default():
    hp.hparams_config(
        hparams=HPARAMS,
        metrics=[hp.Metric('epoch_accuracy')]  # metric saved by tensorboard_cb
    )

hparams_cb = hp.KerasCallback(
    writer=hparams_dir,
    hparams=HPARAMS
)
I got it working, although I am not entirely sure which part was the magic word. Here is my flow in case it helps.
HP_NUM_LATENT = hp.HParam('num_latent_dim', hp.Discrete([2, 5, 100]))

hparams = {
    HP_NUM_LATENT: num_latent,
}

callbacks.append(hp.KerasCallback(log_dir, hparams))

model = create_simple_model(latent_dim=hparams[HP_NUM_LATENT])  # returns compiled model

model.fit(x, y, validation_data=validation_data,
          epochs=4,
          verbose=2,
          callbacks=callbacks)
Since I lost a couple of hours because of this, I would like to add to Julian's good remark about defining the hparams config: the tag of the metric you want to log with hparams, and possibly its group in hp.Metric(tag='epoch_accuracy', group='validation'), should match the tag of a metric that you capture with Keras model.fit(..., metrics=). See hparams_demo for a good example.
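For illustration, a sketch of what a matching pair could look like, assuming 'accuracy' is the Keras metric you care about:

# Keras/TensorBoard logs this metric as 'epoch_accuracy' under the 'train' and 'validation' subdirectories
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# ... so the hparams config should reference that same tag (and group)
hp.hparams_config(
    hparams=HPARAMS,
    metrics=[hp.Metric('epoch_accuracy', group='validation')],
)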
I just want to add to the previous answers. If you are using TensorBoard in a notebook on Colab, the issue may not be due to your code, but due to how TensorBoard is run on Colab. And the solution is to kill the existing TensorBoard and launch it again.
Please correct me if I am wrong.
Sample code:
import os
import random

import tensorflow as tf
from tensorboard.plugins.hparams import api as hp
HP_LR = hp.HParam('learning_rate', hp.Discrete([1e-4, 5e-4, 1e-3]))
HPARAMS = [HP_LR]
# this METRICS does not seem to have any effects in my example as
# hp uses epoch_accuracy and epoch_loss for both training and validation anyway.
METRICS = [hp.Metric('epoch_accuracy', group="validation", display_name='val_accuracy')]
# save the configuration
log_dir = '/content/logs/hparam_tuning'
with tf.summary.create_file_writer(log_dir).as_default():
    hp.hparams_config(hparams=HPARAMS, metrics=METRICS)
def fitness_func(hparams, seed):
    rng = random.Random(seed)
    # here we build the model
    model = tf.keras.Sequential(...)
    model.compile(..., metrics=['accuracy'])  # need to pass the metric of interest
    # set up callbacks
    _log_dir = os.path.join(log_dir, seed)
    tb_callbacks = tf.keras.callbacks.TensorBoard(_log_dir)  # log metrics
    hp_callbacks = hp.KerasCallback(_log_dir, hparams)       # log hparams
    # fit the model
    history = model.fit(
        ..., validation_data=(x_te, y_te), callbacks=[tb_callbacks, hp_callbacks])
rng = random.Random(0)
session_index = 0

# random search
num_session_groups = 4
sessions_per_group = 2

for group_index in range(num_session_groups):
    hparams = {h: h.domain.sample_uniform(rng) for h in HPARAMS}
    hparams_string = str(hparams)
    for repeat_index in range(sessions_per_group):
        session_id = str(session_index)
        session_index += 1
        fitness_func(hparams, session_id)
To check if there is any existing TensorBoard process, run the following in Colab:
!ps ax | grep tensorboard
Assume the PID of the TensorBoard process is 5315. Then run
!kill 5315
and run
# of course, replace the dir below with your log_dir
%tensorboard --logdir='/content/logs/hparam_tuning'
In my case, after I reset TensorBoard as above, it properly logs the metrics specified in model.compile, i.e., accuracies.
I fine-tuned a BERT model from Tensorflow hub to build a simple sentiment analyzer. The model trains and runs fine. On export, I simply used:
tf.saved_model.save(model, export_dir='models')
And this works just fine... until I reboot.
On a reboot, the model no longer loads. I've tried using a Keras loader as well as the Tensorflow Server, and I get the same error.
I get the following error message:
Not found: /tmp/tfhub_modules/09bd4e665682e6f03bc72fbcff7a68bf879910e/assets/vocab.txt; No such file or directory
The model is trying to load assets from the tfhub modules cache, which is wiped by reboot. I know I could persist the cache, but I don't want to do that because I want to be able to generate models and then copy them over to a separate application without worrying about the cache.
The crux of it is that I don't think it's necessary to look in the cache for the assets at all. The model was saved with an assets folder wherein vocab.txt was generated, so in order to find the assets it just needs to look in its own assets folder (I think). However, it doesn't seem to be doing that.
Is there any way to change this behaviour?
Added the code for building and exporting the model (it's not a clever model, just prototyping my workflow):
bert_model_name = "bert_en_uncased_L-12_H-768_A-12"
BATCH_SIZE = 64
EPOCHS = 1  # Initial

def build_bert_model(bert_model_name):
    input_layer = tf.keras.layers.Input(shape=(), dtype=tf.string, name="inputs")
    preprocessing_layer = hub.KerasLayer(
        map_model_to_preprocess[bert_model_name], name="preprocessing"
    )
    encoder_inputs = preprocessing_layer(input_layer)
    bert_model = hub.KerasLayer(
        map_name_to_handle[bert_model_name], name="BERT_encoder"
    )
    outputs = bert_model(encoder_inputs)
    net = outputs["pooled_output"]
    net = tf.keras.layers.Dropout(0.1)(net)
    net = tf.keras.layers.Dense(1, activation=None, name="classifier")(net)
    return tf.keras.Model(input_layer, net)

def main():
    train_ds, val_ds = load_sentiment140(batch_size=BATCH_SIZE, epochs=EPOCHS)
    steps_per_epoch = tf.data.experimental.cardinality(train_ds).numpy()

    init_lr = 3e-5
    optimizer = tf.keras.optimizers.Adam(learning_rate=init_lr)

    model = build_bert_model(bert_model_name)
    model.compile(optimizer=optimizer, loss='mse', metrics='mse')
    model.fit(train_ds, validation_data=val_ds, steps_per_epoch=steps_per_epoch)

    tf.saved_model.save(model, export_dir='models')
This problem comes from a TensorFlow bug triggered by versions /1 and /2 of https://tfhub.dev/tensorflow/bert_en_uncased_preprocess. The updated models tensorflow/bert_*_preprocess/3 (released last Friday) avoid this bug. Please update to the newest version.
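For example, if your handle map still points at an older version, switching the preprocessing entry to version 3 of the handle should be enough; a sketch using the map_model_to_preprocess dict from the question:

map_model_to_preprocess = {
    "bert_en_uncased_L-12_H-768_A-12":
        "https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3",
}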
The Classify Text with BERT tutorial has been updated accordingly.
Thanks for bringing this up!
I was following Google's TensorBoard tutorial with hparams here. However, when I try to implement that in my own model, nothing shows up in the logs. The main difference is that I used an ImageDataGenerator, but I do not see how that would affect the hyperparameters. I have included all the code used to set up the hyperparameters, but removed the model and the basic packages I imported for brevity.
# Load the TensorBoard notebook
%load_ext tensorboard
# Clear all logs
!rm -rf ./logs/
Here is what I have set up for the hyperparameters: just learning rate and weight decay. Slightly adjusted from the tutorial, but largely the same style.
HP_lr = hp.HParam('learning_rate', hp.Discrete([3, 4, 5]))
HP_weight_decay= hp.HParam('l2_weight_decay', hp.Discrete([4, 5, 6]))
METRIC_ACCURACY = 'accuracy'
This is a little different to account for the values above, but those are simply variable names.
# file writer
with tf.summary.create_file_writer('logs/hparam_tuning').as_default():
    hp.hparams_config(
        hparams=[HP_lr, HP_weight_decay],
        metrics=[hp.Metric(METRIC_ACCURACY, display_name='Accuracy')],
    )
I have a function that builds the model taking an hparams argument. Besides using datagen.flow() in the model.fit, nothing changes.
def train_test_model(hparams):
    model = build_model(hparams)
    model.fit(datagen.flow(x_train, y_train, batch_size=64),
              epochs=1, verbose=0)
    _, accuracy = model.evaluate(x_test, y_test, batch_size=64, verbose=1)
    return accuracy
# For each run log the metrics and hyperparameters used
def run(run_dir, hparams):
    with tf.summary.create_file_writer(run_dir).as_default():
        hp.hparams(hparams)  # record the values used in this trial
        accuracy = train_test_model(hparams)
        tf.summary.scalar(METRIC_ACCURACY, accuracy, step=1)
This sets up the dictionary to be used by hp:
session_num = 0

for learn_rate in HP_lr.domain.values:
    for wd in HP_weight_decay.domain.values:
        hparams = {
            HP_lr: 1 * 10 ** (-learn_rate),  # transform to something like 1e-3
            HP_weight_decay: 1 * 10 ** (-wd)
        }
        run_name = "run-%d" % session_num
        print('--- Starting trial: %s' % run_name)
        print({h.name: hparams[h] for h in hparams})
        run('logs/hparam_tuning/' + run_name, hparams)
        session_num += 1
%tensorboard --logdir logs/hparam_tuning
I am learning how to use the tf.data.Dataset API. I am using the sample code provided by Google for their Coursera TensorFlow class. Specifically, I am working with the c_dataset.ipynb notebook here.
This notebook has a model.train routine which looks like this:
model.train(input_fn = get_train(), steps = 1000)
The get_train() routine eventually calls code which uses the tf.data.Dataset api with this snippet of code:
filenames_dataset = tf.data.Dataset.list_files(filename)
# read lines from text files
# this results in a dataset of textlines from all files
textlines_dataset = filenames_dataset.flat_map(tf.data.TextLineDataset)
# Parse text lines as comma-separated values (CSV)
# this does the decoder function for each textline
dataset = textlines_dataset.map(decode_csv)
The comments give a pretty good explanation of what happens. Later this routine returns like so:
# return the features and label as a tensorflow node, these
# will trigger file load operations progressively only when
# needed.
return dataset.make_one_shot_iterator().get_next()
Is there any way to evaluate the result for one iteration? I tried something like this, but it fails.
# Try to read what it is using from the csv file.
one_batch_the_csv_file = get_train()
with tf.Session() as sess:
    result = sess.run(one_batch_the_csv_file)
    print(one_batch_the_csv_file)
Per the suggestion from Ruben below, I added this:
I moved on to the next set of labs in this class, where they introduce TensorBoard, and I get some graphs but still no inputs or outputs. With that said, here is a more complete set of source code.
# curious he did not do this
# I am guessing because the output is so verbose
tf.logging.set_verbosity(tf.logging.INFO) # putting back in since, tf.train.LoggingTensorHook mentions it
def train_and_evaluate(output_dir, num_train_steps):
    # Added this while trying to get input vals from csv.
    # This gives an error about scaffolding
    # summary_hook = tf.train.SummarySaverHook(
    #     SAVE_EVERY_N_STEPS,
    #     summary_op=tf.summary.merge_all())

    # To convert a model to distributed train and evaluate do four things
    estimator = tf.estimator.DNNClassifier(  # 1. Estimator
        model_dir = output_dir,
        feature_columns = feature_cols,
        hidden_units = [160, 80, 40, 20],
        n_classes = 2,
        config = tf.estimator.RunConfig().replace(save_summary_steps=2)  # 2. run config
        # ODD. He mentions we need a run config in the videos, but it was missing in the lab
        # notebook. Later I found the bug report which gave me this bit of code.
        # I got a working TensorBoard when I changed this from save_summary_steps=10 to 2.
    )

    # .. also need the train spec to tell the estimator how to get training data
    train_spec = tf.estimator.TrainSpec(
        input_fn = read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN),  # make sure you use the dataset api
        max_steps = num_train_steps)
        # training_hook=[summary_hook])  # Added this while trying to get input vals from csv.

    # ... also need this
    # serving and training-time inputs are often very different
    exporter = tf.estimator.LatestExporter('exporter', serving_input_receiver_fn = serving_input_fn)

    # .. also need an EvalSpec which controls the evaluation and
    # the checkpointing of the model since they happen at the same time
    eval_spec = tf.estimator.EvalSpec(
        input_fn = read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL),  # make sure you use the dataset api
        steps = None,  # evals on 100 batches
        start_delay_secs = 1,  # start evaluating after N seconds. orig was 1. 3 seemed to fail?
        throttle_secs = 10,  # eval no more than every 10 secs. Can not be more frequent than the checkpoint config specified in the run config.
        exporters = exporter)  # how to export the model for production.

    tf.estimator.train_and_evaluate(
        estimator,
        train_spec,  # 3. Train Spec
        eval_spec)   # 4. Eval Spec
OUTDIR = './model_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
TensorBoard().start(OUTDIR)
# need to let this complete before running next cell
# call the above routine
train_and_evaluate(OUTDIR, num_train_steps = 6000) # originally 2000. 1000 after reset shows only projectors
I do not know exactly what kind of information you want to extract, but if you are interested in step N, as a general answer:
If you want exactly those results, just run model.train(input_fn = get_train(), steps = N).
Check the train module functions here for specific content at a given step.
If you search for step you will find different classes:
CheckpointSaverHook: Saves checkpoints every N steps or seconds.
LoggingTensorHook: Prints the given tensors every N local steps, every N seconds, or at end.
ProfilerHook: Captures CPU/GPU profiling information every N steps or seconds.
SummarySaverHook: Saves summaries every N steps.
Etc. (there are more, just check what can be useful for you).
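If you only want to inspect what the input function actually produces, you can also evaluate the iterator tensors directly for one batch. This is a minimal sketch, assuming get_train() returns the (features, label) pair coming from dataset.make_one_shot_iterator().get_next(), as in the notebook:

# pull a single batch out of the input pipeline (TF 1.x graph mode)
features, label = get_train()

with tf.Session() as sess:
    feature_values, label_values = sess.run([features, label])
    print(feature_values)  # dict of column name -> numpy array for one batch
    print(label_values)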
I have written the following convolutional neural network (CNN) class in TensorFlow [I have tried to omit some lines of code for clarity].
class CNN:
    def __init__(self,
                 num_filters=16,      # initial number of convolution filters
                 num_layers=5,        # number of convolution layers
                 num_input=2,         # number of channels in input
                 num_output=5,        # number of channels in output
                 learning_rate=1e-4,  # learning rate for the optimizer
                 display_step=5000,   # displays training results every display_step epochs
                 num_epoch=10000,     # number of epochs for training
                 batch_size=64,       # batch size for mini-batch processing
                 restore_file=None,   # restore file (default: None)
                 ):

        # define placeholders
        self.image = tf.placeholder(tf.float32, shape=(None, None, None, self.num_input))
        self.groundtruth = tf.placeholder(tf.float32, shape=(None, None, None, self.num_output))

        # builds CNN and computes prediction
        self.pred = self._build()

        # I have already created a tensorflow session and saver objects
        self.sess = tf.Session()
        self.saver = tf.train.Saver()

        # also, I have defined the loss function and optimizer as
        self.loss = self._loss_function()
        self.optimizer = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)

        if restore_file is not None:
            print("model exists...loading from the model")
            self.saver.restore(self.sess, restore_file)
        else:
            print("model does not exist...initializing")
            self.sess.run(tf.initialize_all_variables())

    def _build(self):
        # builds CNN

    def _loss_function(self):
        # computes loss

    def train(self, train_x, train_y, val_x, val_y):
        # uses mini batch to minimize the loss
        self.sess.run(self.optimizer, feed_dict={self.image: sample, self.groundtruth: gt})

        # I save the session after n=10 epochs as:
        if epoch % n == 0:
            self.saver.save(self.sess, 'snapshot', global_step=epoch)

    # finally my predict function is
    def predict(self, X):
        return self.sess.run(self.pred, feed_dict={self.image: X})
I have trained two CNNs for two separate tasks independently. Each took around 1 day. Say, model1 and model2 are saved as 'snapshot-model1-10000' and 'snapshot-model2-10000' (with their corresponding meta files) respectively. I can test each model and compute its performance separately.
Now, I want to load these two models in a single script. I would naturally try to do as below:
cnn1 = CNN(..., restore_file='snapshot-model1-10000',..........)
cnn2 = CNN(..., restore_file='snapshot-model2-10000',..........)
I encounter the following error [the error message is long; I just copied/pasted a snippet of it]:
NotFoundError: Tensor name "Variable_26/Adam_1" not found in checkpoint files /home/amitkrkc/codes/A549_models/snapshot-hela-95000
[[Node: save_1/restore_slice_85 = RestoreSlice[dt=DT_FLOAT, preferred_shard=-1, _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save_1/Const_0, save_1/restore_slice_85/tensor_name, save_1/restore_slice_85/shape_and_slice)]]
Is there a way to load from these two files two separate CNNs? Any suggestion/comment/feedback is welcome.
Thank you,
Yes, there is. Use separate graphs.
g1 = tf.Graph()
g2 = tf.Graph()

with g1.as_default():
    cnn1 = CNN(..., restore_file='snapshot-model1-10000',..........)

with g2.as_default():
    cnn2 = CNN(..., restore_file='snapshot-model2-10000',..........)
EDIT:
If you want them in the same graph, you'll have to rename some variables. One idea is to have each CNN in a separate scope and let the saver handle the variables in that scope, e.g.:
saver = tf.train.Saver(tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='model1'))
and in the CNN wrap all your construction in that scope:
with tf.variable_scope('model1'):
    ...
EDIT2:
Another idea is to rename the variables which the saver manages (since I assume you want to use your saved checkpoints without retraining everything). Saving allows different variable names in the graph and in the checkpoint; have a look at the documentation for initialization.
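A rough sketch of that idea, assuming the graph variables were rebuilt under a 'model1' scope while the checkpoint still stores the un-prefixed names:

# collect the variables that now live under the 'model1' scope
vars_model1 = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope='model1')

# map the names stored in the checkpoint (without the prefix) to the renamed variables
name_map = {v.op.name.replace('model1/', '', 1): v for v in vars_model1}

saver1 = tf.train.Saver(name_map)
saver1.restore(sess, 'snapshot-model1-10000')  # sess is the session built for this graph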
This should be a comment to the most up-voted answer. But I do not have enough reputation to do that.
Anyway.
If you (or anyone who searched and got to this point) are still having trouble with the solution provided by lpp AND you are using Keras, check the following quote from GitHub.
This is because Keras shares a global session if no default TF session is provided.
When model1 is created, it is on graph1.
When model1 loads weights, the weights are on the Keras global session, which is associated with graph1.
When model2 is created, it is on graph2.
When model2 loads weights, the global session does not know graph2.
The solution below may help:
from tensorflow import Graph, Session
from keras.models import model_from_json

graph1 = Graph()
with graph1.as_default():
    session1 = Session()
    with session1.as_default():
        with open('model1_arch.json') as arch_file:
            model1 = model_from_json(arch_file.read())
        model1.load_weights('model1_weights.h5')
        # K.get_session() is session1

# do the same for graph2, session2, model2
You need to create 2 sessions and restore the 2 models separately. In order for this to work you need to do the following:
1a. When you're saving the models you need to add scopes to the variable names. That way you will know which variables belong to which model:
# The first model
tf.Variable(tf.zeros([self.batch_size]), name="model_1/Weights")
...
# The second model
tf.Variable(tf.zeros([self.batch_size]), name="model_2/Weights")
...
1b. Alternatively, if you already saved the models you can rename the variables by adding scope with this script.
2. When you restore the different models, you need to filter by variable name like this:
# The first model
sess_1 = tf.Session()
sess_1.run(tf.initialize_all_variables())
saver_1 = tf.train.Saver([v for v in tf.all_variables() if 'model_1' in v.name])
saver_1.restore(sess_1, weights_1_file)
sess_1.run(pred, feed_dict={image: X})
# The second model
sess_2 = tf.Session()
sess_2.run(tf.initialize_all_variables())
saver_2 = tf.train.Saver([v for v in tf.all_variables() if 'model_2' in v.name])
saver_2.restore(sess_2, weights_2_file)
sess_2.run(pred, feed_dict={image: X})
I encountered the same problem and could not solve it (without retraining) with any solution I found on the internet. So what I did was load each model in two separate threads which communicate with the main thread. It is simple enough to write the code; you just have to be careful when you synchronize the threads.
In my case each thread received the input for its problem and returned its output to the main thread. It works without any observable overhead.
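A minimal sketch of the idea: each worker thread owns its own graph and session and exchanges data with the main thread through queues (some_input is a placeholder for your actual data):

import queue
import threading

import tensorflow as tf

def model_worker(restore_file, request_q, response_q):
    # each worker builds its model inside its own graph, so the two
    # checkpoints never clash in a shared default graph
    graph = tf.Graph()
    with graph.as_default():
        cnn = CNN(restore_file=restore_file)  # the CNN class from the question
        while True:
            x = request_q.get()
            if x is None:          # sentinel to shut the worker down
                break
            response_q.put(cnn.predict(x))

req1, res1 = queue.Queue(), queue.Queue()
worker1 = threading.Thread(target=model_worker,
                           args=('snapshot-model1-10000', req1, res1),
                           daemon=True)
worker1.start()

req1.put(some_input)       # send an input to model 1
prediction1 = res1.get()   # block until the worker returns the output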
One way is to clear your session if you want to train or load multiple models in succession. You can easily do this using
from keras import backend as K
# load and use model 1
K.clear_session()
# load and use model 2
K.clear_session()
K.clear_session() destroys the current TF graph and creates a new one.
Useful to avoid clutter from old models / layers.
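A minimal sketch of that pattern (the model file names are placeholders):

from keras import backend as K
from keras.models import load_model

# load and use model 1
model1 = load_model('model1.h5')
preds1 = model1.predict(x)
K.clear_session()

# load and use model 2
model2 = load_model('model2.h5')
preds2 = model2.predict(x)
K.clear_session()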
I would like to run a given model both on the train set (is_training=True) and on the validation set (is_training=False), specifically with regard to how dropout is applied. Right now the prebuilt models expose a parameter is_training that is passed to the dropout layer when building the network. The issue is that if I call the method twice with different values of is_training, I will get two different networks that do not share weights (I think?). How do I get the two networks to share the same weights so that I can run the network I have trained on the validation set?
I wrote a solution following your comment to use Overfeat in train and test mode. (I couldn't test it, so please check whether it works.)
First some imports and parameters:
import tensorflow as tf
slim = tf.contrib.slim
overfeat = tf.contrib.slim.nets.overfeat
batch_size = 32
inputs = tf.placeholder(tf.float32, [batch_size, 231, 231, 3])
dropout_keep_prob = 0.5
num_classes = 1000
In train mode, we pass a normal scope to the function overfeat:
scope = 'overfeat'
is_training = True
output = overfeat.overfeat(inputs, num_classes, is_training,
dropout_keep_prob, scope=scope)
Then in test mode, we create the same scope but with reuse=True.
scope = tf.VariableScope(reuse=True, name='overfeat')
is_training = False
output = overfeat.overfeat(inputs, num_classes, is_training,
dropout_keep_prob, scope=scope)
You can just use a placeholder for is_training:
isTraining = tf.placeholder(tf.bool)
# create nn
net = ...
net = slim.dropout(net,
keep_prob=0.5,
is_training=isTraining)
net = ...
# training
sess.run([net], feed_dict={isTraining: True})
# testing
sess.run([net], feed_dict={isTraining: False})
It depends on the case; the solutions are different.
My first option would be to use a different process to do the evaluation. You only need to check that there is a new checkpoint and load those weights into the evaluation network (with is_training=False):
checkpoint = tf.train.latest_checkpoint(self.checkpoints_path)

# wait until a new checkpoint is available
while self.latest_checkpoint == checkpoint:
    time.sleep(30)  # sleep 30 seconds waiting for a new checkpoint
    checkpoint = tf.train.latest_checkpoint(self.checkpoints_path)

logging.info('Restoring model from {}'.format(checkpoint))
self.saver.restore(session, checkpoint)
self.latest_checkpoint = checkpoint
The second option is, after every epoch, to unload the graph and create a new evaluation graph. This solution wastes a lot of time loading and unloading graphs.
The third option is to share the weights. But feeding these networks with queues or datasets can lead to issues, so you have to be very careful. I only use this for Siamese networks.
with tf.variable_scope('the_scope') as scope:
    your_model(is_training=True)
    scope.reuse_variables()
    your_model(is_training=False)