How to overwrite checkpoint of same global step for Tensorboard? - tensorflow

If you resume training from a specific epoch (same global step) and run TensorBoard, TensorBoard (add_scalar) ends up plotting two points for that particular global step.
For example, I was trying to test whether changing the learning rate halfway through training would improve or worsen the accuracy:
Plot example
The values for the same time steps are plotted twice (I resumed from 15 epochs behind the latest epoch).
Searching the web, I cannot find any command that tells TensorBoard to simply overwrite the previously logged point with the new one. My expectation was that TensorBoard would overwrite the point at the same global step, but instead it plots both.

You can delete the log directory before running the script (before creating the model). It will then be recreated empty.
Or you can add a timestamp to the name of the log directory:
from tensorflow.keras.callbacks import TensorBoard
import time

NAME = 'my_cnn-{}'.format(int(time.time()))
tensorboard = TensorBoard(log_dir='logs/{}'.format(NAME))
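For the first option, a minimal sketch (assuming the Keras TensorBoard callback and the logs/ layout used above; the directory name is a placeholder) that clears stale event files before resuming:

import os
import shutil

from tensorflow.keras.callbacks import TensorBoard

LOG_DIR = 'logs/my_cnn'  # hypothetical log directory for this run

# Remove old event files so the resumed run does not double-plot the same steps
if os.path.exists(LOG_DIR):
    shutil.rmtree(LOG_DIR)

tensorboard = TensorBoard(log_dir=LOG_DIR)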

Related

Program crashed in the last step in test Tensorflow-gpu 2.0.0

I am using TensorFlow 2.0.0 and have split the dataset into a train set and a test set. The training and testing code is as follows:
for epoch in range(params.num_epochs):
    for step, (x_batch_train, y_batch_train) in enumerate(train_dist_dataset):
        # DO TRAINING HERE...
    if epoch % params.num_epoch_record == 0:
        for step, (x_test, y_test) in enumerate(test_dist_dataset):
            # DO TESTING HERE...
        checkpoint.step.assign_add(1)
        save_path = manager.save()
        logger.info("Saved checkpoint {}".format(save_path))
However, after the last batch of test data from enumerate(test_dist_dataset), the program crashes with:
F .\tensorflow/core/kernels/conv_2d_gpu.h:964] Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, kNumThreads, kTileSize, kTileSize, conjugate>, total_tiles_count, kNumThreads, 0, d.stream(), input, input_dims, output) status: Internal: invalid configuration argument
So why does this occur, and how can it be solved?
In my case, the problem was related to the batch size. I am using the NVIDIA docker image 19.12 and a data generator. The code works well with one GPU; the problem happened only with MirroredStrategy in model.predict.
The error happens when the total number of samples is not perfectly divisible by the batch_size. For example, if you have 5 samples and a batch_size of 2, the 3rd batch contains only one sample and causes the problem.
The solution is either to throw away the last incomplete batch or, as in my case, to add some dummy data to fill it up.
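If the test data comes through a tf.data pipeline, one way to drop the incomplete batch is the drop_remainder flag. A minimal sketch (the array shapes and strategy setup here are illustrative, not from the question):

import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# 5 samples with batch_size 2: the 3rd batch would contain only one sample
x_test_all = np.random.rand(5, 28, 28, 1).astype('float32')
y_test_all = np.array([0, 1, 0, 1, 0])

test_dataset = tf.data.Dataset.from_tensor_slices((x_test_all, y_test_all))
# drop_remainder=True discards the incomplete final batch
test_dataset = test_dataset.batch(2, drop_remainder=True)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)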

I trained a neural network but I cannot find where it got saved, I cannot find any .meta , .index, .data files

I was trying to follow this page https://www.tensorflow.org/tutorials/sequences/audio_recognition
I successfully executed the following command:
python tensorflow/examples/speech_commands/train.py
I used a virtual environment in Anaconda, with TensorFlow 1.14 and Python 3.6.
It took about 22 hours to train. It printed "/tmp/speech_commands_train/conv.ckpt-100" after every 100 iterations
(there were 18,000 in total),
but now when I try to find conv.ckpt-18000.meta, or just speech_commands_train, I cannot find it.
I am very new to this. This is my first effort in deep learning.
how the terminal looked when training ended
First, what do you mean by "where it saved": the logs, the trained model, or the weights?
In your case, you are just storing the weights at the given checkpoints, so you can access them at the paths mentioned in the tutorial:
I0730 16:54:41.813438 55030 train.py:252] Saving to "/tmp/speech_commands_train/conv.ckpt-100"
This is saving out the current trained weights to a checkpoint file. If your training script gets interrupted, you can look for the last saved checkpoint and then restart the script with -
You can also store logs using a summary file writer, and save the model using save_model or the TensorBoard callback with a log directory.
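For example, with a Keras model that could look roughly like the following (a sketch; the model, paths and log directory are placeholders, not part of the speech_commands tutorial):

import tensorflow as tf

# Hypothetical model, just to illustrate saving the model and writing logs;
# this is not part of the speech_commands tutorial.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation='softmax', input_shape=(20,))
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# The TensorBoard callback writes event logs under the given directory
tb_callback = tf.keras.callbacks.TensorBoard(log_dir='/tmp/my_logs')

# ... model.fit(x, y, callbacks=[tb_callback]) ...

# Save the whole model (architecture + weights) to an explicit, known path
model.save('/tmp/my_model.h5')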

Why and when do I need to use the global step in tensorflow

I am using TensorFlow, but I am not sure why I even need the global_step variable, or whether it is even necessary for training. I have something like this:
gradients_and_vars = optimizer.compute_gradients(value)
train_op = optimizer.apply_gradients(gradients_and_vars)
and then in my loop inside a session I do this:
_ = sess.run([train_op])
I am using a Queue to feed my data to the graph. Do I even have to instantiate a global_step variable?
My loop looks like this:
while not coord.should_stop():
So this loop stops, when it should stop. So why do I need the global_step at all?
You don't need the global step in all cases. But sometimes people want to stop training, tweak some code, and then continue training with the saved and restored model. Then it is often nice to know how long (i.e., for how many steps) the model has already been trained. Hence the global step.
Also, your learning-rate schedule might depend on how long the model has already been trained. Say you want to decay your learning rate every 100,000 steps. If you interrupted training in between and didn't keep track of the number of steps already taken, this becomes difficult.
Furthermore, if you are using TensorBoard, the global step is the central parameter for the x-axis of your charts.
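For instance, with the graph-mode API from the question, wiring the global step into a decaying learning rate could look roughly like this (a sketch, not the asker's actual setup; value stands for the loss tensor from the question and the decay numbers are arbitrary):

import tensorflow as tf

# Non-trainable counter that tracks how many training steps have been run
global_step = tf.Variable(0, trainable=False, name='global_step')

# Decay the learning rate every 100,000 steps based on global_step
learning_rate = tf.train.exponential_decay(
    0.01, global_step, decay_steps=100000, decay_rate=0.5, staircase=True)

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
gradients_and_vars = optimizer.compute_gradients(value)
# Passing global_step here makes apply_gradients increment it on every update
train_op = optimizer.apply_gradients(gradients_and_vars, global_step=global_step)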

Which file to be used for eval step in TEXTSUM?

I am working on the textsum model in TensorFlow, which does text summarization. I was following the commands specified in the README at github/textsum. It says that a file named validation, present in the data folder, is to be used in the eval step, but there was no validation file in the data folder.
I thought I would make one myself, and later realized that it should be a binary file. So I need to prepare a text file that will be converted to binary.
But that text file has to have a specific format. Will it be the same as that of the file used in the train step? Can I use the same file for the train step and the eval step?
The sequence of steps I followed:
Step 1: Train the model using the vocab file that was mentioned as "updated" for the toy dataset.
Step 2: Training continued for a while and then got "Killed" at running_avg_loss: 3.590769.
Step 3: Using the same data and vocab files for the eval step as had been used for training, I ran eval. It keeps running with running_avg_loss between 6 and 7.
I am doubtful about step 3: whether the same files should be used or not.
You don't have to run eval unless you are in fact testing your model after training, to determine how it does against a set of data it has never seen before. I have also been using it to determine whether I am starting to overfit the data.
So you will usually take 20-30% of your overall dataset and use it for the eval process. You then train against your training data. Once training is complete, you can run decode right away if you like, or you can run eval against the 20-30% you set aside from the start. Once you feel comfortable with the results, you can then run decode to get the final output.
Your binary format should be the same as your training data.
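As a rough illustration of the 80/20 split described above (plain Python; the file names are made up and this is not part of the textsum tooling):

import random

# Read the examples that would normally all go into the training file
with open('all_examples.txt') as f:
    examples = f.readlines()

random.shuffle(examples)
split = int(0.8 * len(examples))

# 80% for training, 20% held out for the eval step
with open('train_examples.txt', 'w') as f:
    f.writelines(examples[:split])
with open('eval_examples.txt', 'w') as f:
    f.writelines(examples[split:])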

"rewind" tensorflow training step

I occasionally hit a problem with training in TensorFlow and stochastic gradient descent where I load a mini-batch that wreaks havoc on my optimization op, pushing it to NaNs. This, of course, throws an error in the training process and forces me to start over. Even if I wrap the optimization op in a try statement, by the time an exception is raised the damage is done and I need to restart.
Does anyone have a good way of, essentially, rewinding optimization back to a valid state when it hits an error? I would think you could use checkpoints for this, but the docs on saving/restoring are so spotty that I'm not sure...
As you suggest, checkpoints are the way to do it. The key steps for your case are as follows:
First create a saver object after you've defined your graph:
saver = tf.train.Saver(max_to_keep=5, keep_checkpoint_every_n_hours=1)
Next, write out check points intermittently during training:
for step in range(max_steps):
    # ... some training steps here ...

    # Save the model every 100 iterations
    if step % 100 == 0:
        saver.save(sess, checkpoint_dir, global_step=step)
Finally, when you catch an error, reload the last good checkpoint:
# Restore the latest checkpoint, or explicitly specify a filename if you want to use some other logic
restore_fn = tf.train.latest_checkpoint(FLAGS.restore_dir)
print('Restoring from %s' % restore_fn)
saver.restore(sess, restore_fn)
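Putting the pieces together, one way to "rewind" on a NaN-induced failure might look like this (a sketch; the NaN check and error handling are illustrative additions, and sess, train_op, loss, saver, max_steps, checkpoint_dir and FLAGS.restore_dir are assumed to be set up as in the snippets above):

import numpy as np
import tensorflow as tf

for step in range(max_steps):
    try:
        _, loss_value = sess.run([train_op, loss])
        if np.isnan(loss_value):
            raise ValueError('loss became NaN')
    except (tf.errors.InvalidArgumentError, ValueError):
        # Roll the variables back to the last good checkpoint and keep going
        restore_fn = tf.train.latest_checkpoint(FLAGS.restore_dir)
        saver.restore(sess, restore_fn)
        continue

    # Save the model every 100 iterations, as above
    if step % 100 == 0:
        saver.save(sess, checkpoint_dir, global_step=step)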
Answering a different question:
Which optimizer are you using?
Big jumps, like you can get with simple gradient descent, shouldn't be possible with gradient clipping or an optimizer with a limited step size (like Adam).
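For reference, gradient clipping with the graph-mode API could look roughly like this (a sketch; the clip norm of 5.0 is an arbitrary choice and loss stands for your loss tensor):

import tensorflow as tf

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
gradients_and_vars = optimizer.compute_gradients(loss)

# Clip each gradient so a single bad mini-batch cannot push the weights too far
clipped = [(tf.clip_by_norm(g, 5.0), v)
           for g, v in gradients_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)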