As can be seen in this picture, my graph has a: "fraction_of_32_full" output attached to it. I haven't done this explicitly, and nowhere in my code have I set any limits of a size of 32.
The only data I am explicitly adding to my summary is the cost of each batch, yet when I view my TensorBoard visualisation, I see this:
As you can see, it contains three things. The cost, which I asked for, and two other variables which I havent asked for. The fraction of 25,000 full, and the fraction of 32 full.
My Questions Are:
What are these?
Why are they added to my summaries without me explicitly asking?
I can actually answer my own question here. I did some digging, and found the answers.
What are these?
These are measures of how full your queues are. I have two queues, a string input producer queue which reads my files, and a batch queue which batches my records. Both of the records are in the format: "fraction of x full" where x is the capacity of the queue.
The reason the string input producer is fraction of 32, is because if you look at the documentation here, you'll see the default capacity is 32.
Why are these added to my summaries without me explicitly asking.
This was a little trickier. If you look at the source code for the input string producer here, you'll see that although the input_string_producer doesn't explicitly ask for a summary name, it returns an input_producer, which has a default summary name of: summary_name="fraction_of_%d_full" % capacity. Check line 235 for this. Something similar happens here for the batch queue.
The reason these are being recorded without explicitly asking, is because they were created without explicitly asking, and then the line of code:
merged_summaries = tf.summary.merge_all()
merged all these summaries together, so when I called:
sess.run([optimizer, cost, merged_summaries], ..... )
writer.add_summary(s, batch)
I was actually asking for these to be recorded too.
I hope this answer helped some people.
Related
Different calls of tf.data.Dataset.take() return different batches from the given dataset. Are those samples chosen randomly or is there another mechanism at play?
This is all the more confusing that the documentation makes no reference as to the randomness of the sampling.
Most probably, you might be using data.shuffle() before tf.data.Dataset.take().
Commenting that out should make the iterator behave as intended: take the same results over and over for each iterator run.
-- Or if you used an api that automatically shuffles without asking like image_dataset_from_directory
shuffle: Whether to shuffle the data. Default: True.
If set to False, sorts the data in alphanumeric order.
You would have to explicitly set shuffle=False when creating the dataset
I am a newbie to this domain. But from what I have seen in my notebook is that the take() does pick random samples. For instance, in the image shown here, I had just called image_dataset_from_directory() before calling take(), so no shuffling preceded the take op, still I see different samples on every run. Pls correct me if I am wrong, will help my understanding as well.
I am trying to get deterministic behaviour from tf.train.shuffle_batch(). I could, instead, use tf.train.batch() which works fine (always the same order of elements), but I need to get examples from multiple tf-records and so I am stuck with shuffle_batch().
I am using:
random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)
data_entries = tf.train.shuffle_batch(
[data], batch_size=batch_size, num_threads=1, capacity=512,
seed=57, min_after_dequeue=32)
But every time I restart my script I get slightly different results (not completely different, but about 20% of the elements are in the wrong order).
Is there anything I am missing?
Edit: Solved it! See my answer below!
Maybe I misunderstood something, but you can collect multiple tf-records in a queue with tf.train.string_input_producer(), then read the examples into tensors and finally use tf.train.batch().
Take a look at CIFAR-10 input.
Answering my own question:
First the reason shuffle_batch is non deterministic:
The time until I request a batch is inherently random.
In that time, a random number of tensors are available.
Tensorflow calls a shuffle operation that is seeded but depending on the number of items, it will return a different order.
So no matter the seeding, the order is always different unless the number of elements is constant. So the solution is to keep the number of elements constant, but how we do it?
By setting capacity=min_after_dequeue+batch_size. This will force Tensorflow to fill up the queue until it reaches full capacity before dequeuing an item. Therefore, at the time of the shuffle operation, we have capacity many items which is a constant number.
So why are we doing this? Because one tf.record contains many examples but we want examples from multiple tf.records. With a normal batch we would first get all the examples of one record and then of the next one. This also means we should set min_after_dequeue to something larger than the number of items in one tf.record. In my example, I have 50 examples in one file so I set min_after_dequeue=2048.
Alternatively, we can also shuffle the examples before creating the tf.records, but this was not possible for me because I read tf.records from multiple directories (each with their own dataset).
Last Note: You should also use a batch size of 1 to be super save.
When I train a model for multiple days with image summary activated, my .tfevent files are huge ( > 70GiB).
I don't want to deactivate the image summary as it allows me to visualize the progress of the network during training. However, once the network is trained, I don't need those information anymore (in fact, I'm not even sure it is possible to visualize previous images with tensorboard).
I would like to be able to remove them from the event file without loosing other information like the loss curve (as it is useful to compare models together).
The solution would be to use two separate summary (one for the images and one for the loss) but I would like to know if there is a better way.
It is sure better to save the big summaries less often as Terry has suggested, but in case you already have an event file which is huge, you can still reduce its size by deleting some of the summaries.
I have had this issue, where I have saved a lot of image summaries, which I don't need now, so I have written a script to copy the eventfile, while only leaving the scalar summaries:
https://gist.github.com/serycjon/c9ad58ecc3176d87c49b69b598f4d6c6
The important stuff is:
for event in tf.train.summary_iterator(event_file_path):
event_type = event.WhichOneof('what')
if event_type != 'summary':
writer.add_event(event)
else:
wall_time = event.wall_time
step = event.step
# possible types: simple_value, image, histo, audio
filtered_values = [value for value in event.summary.value if value.HasField('simple_value')]
summary = tf.Summary(value=filtered_values)
filtered_event = tf.summary.Event(summary=summary,
wall_time=wall_time,
step=step)
writer.add_event(filtered_event)
you can use this as a base for more complicated stuff, like leaving only every 100-th image summary, filtering based on summary tag, etc.
If you look at the event types in the log using #serycjon's loop you'll see that the graph_def and meta_graph_def might be saved often.
I had 46 GB worth of logs that I reduced to 1.6 GB by removing all the graphs. You can leave one graph so that you can still view it in tensorboard.
Just handled this problem, hoping this is not too late.
My slolution is to save your image summary every 100(or other value) training steps, then the growth speed of the .tfevent's file size will be slow down, eventually the file size will be much smaller.
For monitoring my model's performance on my evaluation dataset, I'm using tf.train.string_input_producer for the filenames queue on .tfr files, then I feed the parsed examples to the tf.train.batch function, that produces batches of a fixed size.
Assume my evaluation dataset contains exactly 761 examples (a prime number). To read all the examples exactly once, I have to have a batch size that divides 761, but there is no such, except 1 that will be too slow and 761 that will not fit in my GPU. Any standard way for reading each example exactly once?
Actually, my dataset size is not 761, but there is no number in the reasonable range of 50-300 that divides it exactly. Also I'm working with many different datasets, and finding a number that approximately divides the number of examples in each dataset can be a hassle.
Note that using the num_epochs parameter to tf.train.string_input_producer does not solve the issue.
Thanks!
You can use reader.read_up_to as in this example. Your last batch will be smaller, so you need to make sure your network doesn't hard-wire batch-size anywhere
I would like to see how the learning rate changes during training (print it out or create a summary and visualize it in tensorboard).
Here is a code snippet from what I have so far:
optimizer = tf.train.AdamOptimizer(1e-3)
grads_and_vars = optimizer.compute_gradients(loss)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)
sess.run(tf.initialize_all_variables())
for i in range(0, 10000):
sess.run(train_op)
print sess.run(optimizer._lr_t)
If I run the code I constantly get the initial learning rate (1e-3) i.e. I see no change.
What is a correct way for getting the learning rate at every step?
I would like to add that this question is really similar to mine. However, I cannot post my findings in the comment section there since I do not have enough rep.
I was asking myself the exact same question, and wondering why wouldn't it change. By looking at the original paper (page 2), one sees that the self._lr stepsize (designed with alpha in the paper) is required by the algorithm, but never updated. We also see that there is an alpha_t that is updated for every t step, and should correspond to the self._lr_t attribute. But in fact, as you observe, evaluating the value for the self._lr_t tensor at any point during the training returns always the initial value, that is, _lr.
So your question, as I understood it, is how to get the alpha_t for TensorFlow's AdamOptimizer as described in section 2 of the paper and in the corresponding TF v1.2 API page:
alpha_t = alpha * sqrt(1-beta_2_t) / (1-beta_1_t)
BACKGROUND
As you observed, the _lr_t tensor doesn't change thorough the training, which may lead to the false conclusion that the optimizer doesn't adapt (this can be easily tested by switching to the vanilla GradientDescentOptimizer with the same alpha). And, in fact, other values do change: a quick look at the optimizer's __dict__ shows the following keys: ['_epsilon_t', '_lr', '_beta1_t', '_lr_t', '_beta1', '_beta1_power', '_beta2', '_updated_lr', '_name', '_use_locking', '_beta2_t', '_beta2_power', '_epsilon', '_slots'].
By inspecting them through training, I noticed that only _beta1_power, _beta2_power and the _slots get updated.
Further inspecting the optimizer's code, in line 211, we see the following update:
update_beta1 = self._beta1_power.assign(
self._beta1_power * self._beta1_t,
use_locking=self._use_locking)
Which basically means that _beta1_power, which is initialized with _beta1, will be multiplied by _beta_1_t after every iteration, which is also initialized with beta_1_t.
But here comes the confusing part: _beta1_t and _beta2_t never get updated, so effectively they hold the initial values (_beta1and _beta2) through the whole training, contradicting the notation of the paper in a similar fashion as _lr and lr_t do. I guess this is for a reason but I personally don't know why, in any case this are protected/private attributes of the implementation (as they start with an underscore) and don't belong to the public interface (they may even change among TF versions).
So after this small background we can see that _beta_1_power and _beta_2_power are the original beta values exponentiated to the current training step, that is, the equivalent to the variables referred with beta_tin the paper. Going back to the definition of alpha_t in the section 2 of the paper, we see that, with this information, it should be pretty straightforward to implement:
SOLUTION
optimizer = tf.train.AdamOptimizer()
# rest of the graph...
# ... somewhere in your session
# note that a0 comes from a scalar, whereas bb1 and bb2 come from tensors and thus have to be evaluated
a0, bb1, bb2 = optimizer._lr, optimizer._beta1_power.eval(), optimizer._beta2_power.eval()
at = a0* (1-bb2)**0.5 /(1-bb1)
print(at)
The variable at holds the alpha_t for the current training step.
DISCLAIMER
I couldn't find a cleaner way of getting this value by just using the optimizer's interface, but please let me know if it exists one! I guess there is none, which actually puts into question the usefulness of plotting alpha_t, since it does not depend on the data.
Also, to complete this information, section 2 of the paper also gives the formula for the weight updates, which is much more telling, but also more plot-intensive. For a very nice and good-looking implementation of that, you may want to take a look at this nice answer from the post that you linked.
Hope it helps! Cheers,
Andres