How to see more evaluation steps in tensorboard - tensorflow

I want to see more evaluation steps in Tensorboard, while I'm training and evaluating my object detection (standard code in tensorflow object detection).
Here you can see what I mean by the number of evaluation steps. As you can see, it is fixed at 10 visualizations.
I can't find where to change and increase this parameter. Moreover, the visualizations shown are random rather than the last 10.
Is it possible to set a different number of visualizations?
And what can I do to see the last N evaluations instead of N random ones?
Thank you in advance.

I assume you're using this code:
https://github.com/tensorflow/models/tree/master/research/object_detection
(you should include that link to clarify in future questions, and if that assumption is wrong you should edit your question to specify what code you're using)
If you look at the trainer.py code at the bottom they have:
slim.learning.train(
    train_tensor,
    logdir=train_dir,
    master=master,
    is_chief=is_chief,
    session_config=session_config,
    startup_delay_steps=train_config.startup_delay_steps,
    init_fn=init_fn,
    summary_op=summary_op,
    number_of_steps=(
        train_config.num_steps if train_config.num_steps else None),
    save_summaries_secs=120,
    sync_optimizer=sync_optimizer,
    saver=saver)
It looks like they've hard-coded save_summaries_secs=120, which saves a summary every 120 seconds. That's the value to edit if you want to change how often the TensorBoard summaries are updated.
Edit: I've added the image to the question to help clarify. I believe the answer lies in tf.summary.image, which has a max_outputs argument that controls how many images from the batch are written to the summary. To choose a specific subset of images, write your own code to select them in whatever way you see fit (randomly, or in some order), then pass that new set of images to tf.summary.image.

You may want to consider looking at the eval_config section of the model config file.
eval_config: {
  num_examples: 100
  num_visualizations: 50
  # Note: The below line limits the evaluation process to 10 evaluations.
  # Remove the below line to evaluate indefinitely.
  #max_evals: 10
}
I'm guessing that max_evals is what you're looking for.

Related

Does `tf.data.Dataset.take()` return random sample?

Different calls of tf.data.Dataset.take() return different batches from the given dataset. Are those samples chosen randomly, or is there another mechanism at play?
This is all the more confusing because the documentation makes no reference to the randomness of the sampling.
Most probably, you are calling data.shuffle() before tf.data.Dataset.take().
Commenting that out should make the iterator behave as intended: it returns the same results on every iterator run.
Alternatively, you may have used an API that shuffles automatically without asking, such as image_dataset_from_directory. Its documentation says:
shuffle: Whether to shuffle the data. Default: True.
If set to False, sorts the data in alphanumeric order.
In that case you have to set shuffle=False explicitly when creating the dataset.
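As a rough illustration of why shuffling before take() changes the result on every pass, here is a toy, pure-Python stand-in (it is not tf.data itself). The class ReshufflingDataset and the helper take are hypothetical names made up for this sketch, and the assumption is that a shuffled dataset reorders its elements each time a new iterator is created:

```python
import itertools
import random


class ReshufflingDataset:
    """Toy stand-in for a dataset that may reshuffle on every new iteration."""

    def __init__(self, items, shuffle=True, seed=None):
        self._items = list(items)
        self._shuffle = shuffle
        self._rng = random.Random(seed)

    def __iter__(self):
        # Every new iterator starts from a fresh copy of the data.
        items = list(self._items)
        if self._shuffle:
            # The RNG state advances between passes, so each pass
            # generally sees a different order.
            self._rng.shuffle(items)
        return iter(items)


def take(dataset, n):
    """Return the first n elements of a fresh pass over the dataset."""
    return list(itertools.islice(iter(dataset), n))


if __name__ == "__main__":
    ordered = ReshufflingDataset(range(10), shuffle=False)
    print(take(ordered, 3), take(ordered, 3))    # same prefix on both passes

    shuffled = ReshufflingDataset(range(10), shuffle=True)
    print(take(shuffled, 3), take(shuffled, 3))  # prefixes usually differ
```

In this sketch take() itself is deterministic: it only ever returns a prefix. Any variation comes from the order produced upstream.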
I am a newbie to this domain, but from what I have seen in my notebook, take() does pick random samples. For instance, in the image shown here, I had only called image_dataset_from_directory() before calling take(), so no explicit shuffling preceded the take op, yet I still see different samples on every run. Please correct me if I am wrong; it will help my understanding as well.

TensorFlow Supervisor just stores the latest five models

I am using TensorFlow's Supervisor to train my own model. I followed the official guide and set save_model_secs to 600. Strangely, however, log_dir only ever contains the latest five models; models generated earlier are discarded automatically. I carefully read the source code in supervisor.py but cannot find the removal code or any mechanism that explains why only five models are kept throughout training. Does anyone have a hint? Any help is really appreciated.
tf.train.Supervisor has a saver argument. If it is not given, a default saver is used, which is configured to keep only the last five checkpoints. You can override this by passing your own tf.train.Saver object.
See here for the docs. There are essentially two ways of storing more checkpoints when creating the Saver:
Pass some large integer to the max_to_keep argument. If you have enough storage, passing 0 or None should result in all checkpoints being kept.
Saver also has an argument keep_checkpoint_every_n_hours. This gives you a separate "stream" of checkpoints that are kept indefinitely. For example, you could store checkpoints every 600 seconds (via the save_model_secs argument to Supervisor) and keep only the five most recent of those, while additionally keeping one checkpoint every, say, 30 minutes (0.5 hours) forever.
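The interplay of the two arguments can be sketched with a small pure-Python model. This is a simplified illustration, not the Saver's actual code: the function retained_checkpoints is a hypothetical name, times are seconds since training started, and the assumption is that a checkpoint pushed out of the "recent" window is kept permanently only if it is newer than the next scheduled keep time.

```python
def retained_checkpoints(save_times_secs, max_to_keep=5, keep_every_n_hours=None):
    """Simplified model of which checkpoints survive on disk.

    Returns (permanent, recent): checkpoints kept forever and the
    rolling window of the most recent ones.
    """
    recent = []
    permanent = []
    interval = None if keep_every_n_hours is None else keep_every_n_hours * 3600
    next_keep_time = interval

    for t in save_times_secs:
        recent.append(t)
        # max_to_keep of 0 or None means "never evict" in this model.
        if max_to_keep and len(recent) > max_to_keep:
            evicted = recent.pop(0)
            if next_keep_time is not None and evicted > next_keep_time:
                # Promote the evicted checkpoint to the permanent stream.
                permanent.append(evicted)
                next_keep_time += interval
            # Otherwise the evicted checkpoint is deleted.
    return permanent, recent


if __name__ == "__main__":
    # One save every 600 seconds for two hours.
    times = [600 * i for i in range(1, 13)]
    print(retained_checkpoints(times, max_to_keep=5, keep_every_n_hours=0.5))
    print(retained_checkpoints(times, max_to_keep=0))
```

Under these assumptions the rolling window always holds the five newest saves, while the permanent stream grows by roughly one checkpoint per half hour.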

Tensorflow Shuffle Batch Non Deterministic

I am trying to get deterministic behaviour from tf.train.shuffle_batch(). I could, instead, use tf.train.batch() which works fine (always the same order of elements), but I need to get examples from multiple tf-records and so I am stuck with shuffle_batch().
I am using:
random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)
data_entries = tf.train.shuffle_batch(
    [data], batch_size=batch_size, num_threads=1, capacity=512,
    seed=57, min_after_dequeue=32)
But every time I restart my script I get slightly different results (not completely different, but about 20% of the elements are in the wrong order).
Is there anything I am missing?
Edit: Solved it! See my answer below!
Maybe I misunderstood something, but you can collect multiple tf-records in a queue with tf.train.string_input_producer(), then read the examples into tensors and finally use tf.train.batch().
Take a look at CIFAR-10 input.
Answering my own question:
First the reason shuffle_batch is non deterministic:
The time until I request a batch is inherently random.
In that time, a random number of tensors are available.
Tensorflow calls a shuffle operation that is seeded but depending on the number of items, it will return a different order.
So no matter the seeding, the order is always different unless the number of elements is constant. The solution is therefore to keep the number of elements constant, but how do we do that?
By setting capacity=min_after_dequeue+batch_size. This forces TensorFlow to fill the queue to full capacity before dequeuing an item. Therefore, at the time of the shuffle operation, the queue always holds exactly capacity items, which is a constant number.
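The dependence on the number of buffered items can be reproduced with a toy model in plain Python. This is only an analogy: it does not use TensorFlow's queue, and the function first_batch is a hypothetical helper. The assumption is that the queue applies a seeded shuffle to whatever happens to be buffered at dequeue time.

```python
import random


def first_batch(num_buffered, batch_size=4, seed=57):
    """Shuffle a buffer of num_buffered items with a fixed seed
    and return the first batch_size items."""
    rng = random.Random(seed)
    buffer = list(range(num_buffered))
    rng.shuffle(buffer)
    return buffer[:batch_size]


if __name__ == "__main__":
    # Same seed and same buffer size: always the same batch.
    print(first_batch(48), first_batch(48))
    # Same seed but a different buffer size: generally a different batch.
    print(first_batch(40))
```

With a constant buffer size the seeded result is reproducible; once the size varies from run to run, the same seed no longer guarantees the same order.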
So why are we doing this? Because one tf.record contains many examples but we want examples from multiple tf.records. With a normal batch we would first get all the examples of one record and then of the next one. This also means we should set min_after_dequeue to something larger than the number of items in one tf.record. In my example, I have 50 examples in one file so I set min_after_dequeue=2048.
Alternatively, we can also shuffle the examples before creating the tf.records, but this was not possible for me because I read tf.records from multiple directories (each with their own dataset).
Last note: you should also use a batch size of 1 to be extra safe.

Remove data from tensorboard event files to make them smaller

When I train a model for multiple days with image summaries activated, my .tfevent files are huge (> 70 GiB).
I don't want to deactivate the image summaries, as they allow me to visualize the progress of the network during training. However, once the network is trained, I don't need that information anymore (in fact, I'm not even sure it is possible to visualize previous images with TensorBoard).
I would like to be able to remove them from the event file without losing other information such as the loss curve (which is useful for comparing models).
One solution would be to use two separate summaries (one for the images and one for the loss), but I would like to know if there is a better way.
It is certainly better to save the big summaries less often, as Terry has suggested, but if you already have a huge event file, you can still reduce its size by deleting some of the summaries.
I had this issue: I had saved a lot of image summaries that I no longer needed, so I wrote a script that copies the event file while keeping only the scalar summaries:
https://gist.github.com/serycjon/c9ad58ecc3176d87c49b69b598f4d6c6
The important stuff is:
for event in tf.train.summary_iterator(event_file_path):
    event_type = event.WhichOneof('what')
    if event_type != 'summary':
        writer.add_event(event)
    else:
        wall_time = event.wall_time
        step = event.step
        # possible types: simple_value, image, histo, audio
        filtered_values = [value for value in event.summary.value if value.HasField('simple_value')]
        summary = tf.Summary(value=filtered_values)
        filtered_event = tf.summary.Event(summary=summary,
                                          wall_time=wall_time,
                                          step=step)
        writer.add_event(filtered_event)
You can use this as a base for more complicated filtering, such as keeping only every 100th image summary or filtering based on the summary tag.
If you look at the event types in the log using @serycjon's loop, you'll see that the graph_def and meta_graph_def might be saved often.
I had 46 GB worth of logs that I reduced to 1.6 GB by removing all the graphs. You can leave one graph so that you can still view it in tensorboard.
I just handled this problem; I hope this is not too late.
My solution is to save the image summary only every 100 (or some other number of) training steps. The .tfevent file then grows much more slowly, and the final file is much smaller.

Learning rate doesn't change for AdamOptimizer in TensorFlow

I would like to see how the learning rate changes during training (print it out or create a summary and visualize it in tensorboard).
Here is a code snippet from what I have so far:
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
sess.run(tf.initialize_all_variables())
for i in range(0, 10000):
    sess.run(train_op)
    print(sess.run(optimizer._lr_t))
If I run the code I constantly get the initial learning rate (1e-3) i.e. I see no change.
What is a correct way for getting the learning rate at every step?
I would like to add that this question is really similar to mine. However, I cannot post my findings in the comment section there since I do not have enough rep.
I was asking myself the exact same question and wondering why it wouldn't change. Looking at the original paper (page 2), one sees that the self._lr step size (denoted alpha in the paper) is required by the algorithm but never updated. We also see that there is an alpha_t that is updated at every step t and should correspond to the self._lr_t attribute. But in fact, as you observe, evaluating the self._lr_t tensor at any point during training always returns the initial value, that is, _lr.
So your question, as I understood it, is how to get the alpha_t for TensorFlow's AdamOptimizer as described in section 2 of the paper and in the corresponding TF v1.2 API page:
alpha_t = alpha * sqrt(1 - beta_2^t) / (1 - beta_1^t)
BACKGROUND
As you observed, the _lr_t tensor doesn't change throughout training, which may lead to the false conclusion that the optimizer doesn't adapt (this can easily be tested by switching to the vanilla GradientDescentOptimizer with the same alpha). In fact, other values do change: a quick look at the optimizer's __dict__ shows the following keys: ['_epsilon_t', '_lr', '_beta1_t', '_lr_t', '_beta1', '_beta1_power', '_beta2', '_updated_lr', '_name', '_use_locking', '_beta2_t', '_beta2_power', '_epsilon', '_slots'].
By inspecting them through training, I noticed that only _beta1_power, _beta2_power and the _slots get updated.
Further inspecting the optimizer's code, in line 211, we see the following update:
update_beta1 = self._beta1_power.assign(
    self._beta1_power * self._beta1_t,
    use_locking=self._use_locking)
This basically means that _beta1_power, which is initialized with _beta1, is multiplied after every iteration by _beta1_t, which is also initialized with _beta1.
But here comes the confusing part: _beta1_t and _beta2_t never get updated, so effectively they hold the initial values (_beta1 and _beta2) through the whole training, contradicting the notation of the paper in the same way that _lr and _lr_t do. I guess there is a reason for this, but I personally don't know it. In any case, these are protected/private attributes of the implementation (they start with an underscore) and don't belong to the public interface (they may even change between TF versions).
So after this small background we can see that _beta1_power and _beta2_power are the original beta values raised to the power of the current training step, that is, the equivalent of the quantities written as beta^t in the paper. Going back to the definition of alpha_t in section 2 of the paper, we see that, with this information, it should be pretty straightforward to implement:
SOLUTION
optimizer = tf.train.AdamOptimizer()
# rest of the graph...
# ... somewhere in your session
# note that a0 comes from a scalar, whereas bb1 and bb2 come from tensors and thus have to be evaluated
a0, bb1, bb2 = optimizer._lr, optimizer._beta1_power.eval(), optimizer._beta2_power.eval()
at = a0 * (1 - bb2) ** 0.5 / (1 - bb1)
print(at)
The variable at holds the alpha_t for the current training step.
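To get a feel for how alpha_t behaves without running a session, the same formula can be evaluated in plain Python. This is a minimal sketch that does not touch TensorFlow: adam_alpha_t is a hypothetical helper name, and the values 1e-3, 0.9 and 0.999 are assumed here as the learning rate, beta1 and beta2.

```python
import math


def adam_alpha_t(alpha, beta1, beta2, t):
    """Effective Adam step size at training step t (t >= 1):
    alpha * sqrt(1 - beta2^t) / (1 - beta1^t)."""
    beta1_power = beta1 ** t  # plays the role of _beta1_power after t steps
    beta2_power = beta2 ** t  # plays the role of _beta2_power after t steps
    return alpha * math.sqrt(1.0 - beta2_power) / (1.0 - beta1_power)


if __name__ == "__main__":
    for step in (1, 10, 100, 1000, 10000):
        print(step, adam_alpha_t(1e-3, 0.9, 0.999, step))
```

As t grows, both beta powers tend to zero, so alpha_t approaches the base learning rate alpha regardless of the data.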
DISCLAIMER
I couldn't find a cleaner way of getting this value using only the optimizer's interface, but please let me know if one exists! I guess there is none, which actually calls into question the usefulness of plotting alpha_t, since it does not depend on the data.
Also, to complete this information, section 2 of the paper gives the formula for the weight updates, which is much more telling but also more plot-intensive. For a nice implementation of that, you may want to take a look at this answer from the post that you linked.
Hope it helps! Cheers,
Andres