Tracking counts of examples used in training - tensorflow

I am trying to implement the CBOW word2vec model based on the skip-gram implementation in the TensorFlow repository:
https://github.com/tensorflow/tensorflow/blob/v0.10.0/tensorflow/models/embedding/word2vec.py
I have previously implemented the simplified version following the TensorFlow tutorials, so I understand that I will have to modify the data batching function as well as a small part of the graph to get the context embedding.
In the skipgram implementation, the data batching function is used in lines 348-351.
(words, counts, words_per_epoch, self._epoch, self._words, examples,
 labels) = word2vec.skipgram(filename=opts.train_data,
                             batch_size=opts.batch_size,
                             window_size=opts.window_size,
                             min_count=opts.min_count,
                             subsample=opts.subsample)
From my understanding, the variables assigned are as follows:
words: terms in the vocabulary
counts: associated counts of terms used in the corpus
words_per_epoch: total word count in the corpus
self._epoch: current count of epochs used
self._words: current count of training examples used
examples: current batch of training examples
labels: current batch of training labels
I have managed to replicate the tensor for words, counts, words_per_epoch, examples and labels. However, self._epoch and self._words have eluded me. If my understanding is correct, I need to be able to track the count of the training examples used. However, this is not provided by the sample batching function. The counts are later used in a multi-threaded manner to terminate the training loop, hence I can't simply use a loop to add up the counts.
I understand that bits of the tensorflow ops are implemented in C++. However, as I am not familiar with C++, I will have to replicate those parts using Python.
It would be great if I could get some suggestions on how to obtain the tensor for self._words. The tensor basically has to increment every time a new batch of examples/labels is requested. With that, I can simply use self._epoch = self._words // words_per_epoch to get the other tensor.

I figured out the trick while looking at the source code of tensorflow.models.embedding.word2vec_optimized.py, specifically how global_step is incremented when loss is called in lines 218-225.
In my case, I would have to do it like so:
# code to prepare the features and labels tensors
data_processed = tf.Variable(0, trainable=False, dtype=tf.int64)
epochs_processed = data_processed // data_per_epoch
inc_op = data_processed.assign_add(batch_size)
with tf.control_dependencies([inc_op]):
    features_batch, labels_batch = tf.train.batch([features, labels],
                                                  batch_size=batch_size)
In this case, the tensor data_processed will always be incremented by batch_size whenever features_batch or labels_batch is called. epochs_processed will also be incremented accordingly.
The use of tf.control_dependencies(control_inputs) is key here. It returns a context manager. The operations specified in control_inputs must be executed before the operations defined in the context.
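For anyone who wants to see the mechanism in isolation, here is a minimal, self-contained sketch of the same idea (the counter and op names are mine, not from the word2vec code): fetching a tensor created inside the tf.control_dependencies context forces the increment op to run first.
import tensorflow as tf

counter = tf.Variable(0, trainable=False, dtype=tf.int64)
inc_op = counter.assign_add(1)
with tf.control_dependencies([inc_op]):
    # tf.identity is created inside the context, so fetching `value`
    # forces inc_op to run before the variable is read.
    value = tf.identity(counter)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(value))  # 1
    print(sess.run(value))  # 2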

Related

Loss reduction in tf.distribute.MirroredStrategy()

I'm confused about using a distribution strategy and the correct way of doing reduction in loss functions.
I implemented a U-Net using tf.distribute.MirroredStrategy(). Everything works fine using default loss BinaryCrossentropy as follows:
with strategy.scope():
    model = build_network((size, size, 3), num_classes)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
                  loss=tf.keras.losses.BinaryCrossentropy())
However, I want to create custom loss functions. To start with, I wrote a wrapper containing BinaryCrossentropy, to get familiar with the correct way of using the reduction methods. I followed the instructions in https://www.tensorflow.org/tutorials/distribute/custom_training#define_the_loss_function
and used tf.nn.compute_average_loss in order to divide by the global batch size.
def loss_functions(loss_spec):
    if loss_spec == 'cross_entropy':
        def c_loss(truth, pred):
            my_loss = tf.keras.losses.BinaryCrossentropy(reduction=tf.keras.losses.Reduction.NONE)(truth, pred)
            my_loss = tf.math.reduce_mean(my_loss, axis=[1, 2])  # to compute the average across the two image dimensions
            my_loss = tf.nn.compute_average_loss(my_loss)  # sums up all items and divides by the global batch size
            return my_loss
        return c_loss
which is called in the following way:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
              loss=utils.loss_functions('cross_entropy'))
It also works, but I realised a difference by a factor of the number of replicas compared to using tf.keras.losses.BinaryCrossentropy(). I.e., when using two replicas, using BinaryCrossentropy() directly yields a loss twice as large as my custom loss. Thus, to get the same result, I would need to divide by the per-replica batch size instead of the global batch size, i.e., the way it should NOT be done according to the documentation.
However, the documentation refers to building your own training routine, whereas I am using the model.compile() and model.fit() methods.
Can anybody explain this behaviour to me?
UPDATE:
Using tf.nn.compute_average_loss, or any reduction over the batch axis, is not needed at all when using model.compile() and model.fit(); the reduction and scaling are done automatically. However, I still do not know how model.fit() works internally.
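For reference, here is a minimal sketch of what the custom loss then reduces to under compile()/fit() (this is my reading of the behaviour, not official guidance): return the per-example loss and let Keras handle the cross-replica averaging and scaling.
import tensorflow as tf

def c_loss(truth, pred):
    # Per-pixel binary cross-entropy, shape (batch, H, W); no manual batch reduction.
    per_pixel = tf.keras.losses.binary_crossentropy(truth, pred)
    # Average over the image dimensions only; compile()/fit() takes care of
    # averaging over the global batch across replicas.
    return tf.math.reduce_mean(per_pixel, axis=[1, 2])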
Thanks and cheers, everybody

Defining assignment function as variable in tensorflow?

I am training a neural network by SGD (batch size = 1). The inputs are randomly generated, and the labels are calculated based on the input. That is, the data does not have to be realistic, but the relationships between inputs and labels are specific. I will train my NN for only 1 epoch, but with many batches.
I have the following code:
training_input = tf.Variable(tf.zeros(...))
assign_training_input_with_random_values = training_input.assign(tf.random_normal(...))
# Create a session, initialize a bunch of variables, construct a neural network...
for batch in range(batch_number):
    sess.run(assign_training_input_with_random_values)
    # Train my neural network...
However I noticed that if I write the above code differently the speed goes down by a lot:
# Run the assignment operation directly without defining it as a variable
for batch in range(batch_number):
    sess.run(training_input.assign(tf.random_normal(...)))
    # Train my neural network...
The first snippet being significantly faster makes me worry that tensorflow is only randomizing when I define the assign_training_input_with_random_values variable, and the same training examples are fed to the NN over every batch afterwards. In this case, the NN will probably not generalize well. Meanwhile, the second snippet is slow because it is randomizing every batch. Is this actually the case or is there another reason for this?
First, the explanation for your observations
Computational difference between the 1st and 2nd solutions
It makes sense that your first solution is faster than the second. You define the assign operation once and then run that same op on every iteration. In the 2nd solution, however, you create a new op on every iteration, growing the computational graph over time, which causes your program to slow down.
Observation about the 1st solution
(Following @Y.Z.'s finding) Apparently the first solution does evaluate to different random arrays every time you run it. Therefore, the first solution is also valid.
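A quick way to verify this yourself (TF 1.x graph mode; the shape here is made up):
import tensorflow as tf

training_input = tf.Variable(tf.zeros(shape=[3, 2]))
assign_op = training_input.assign(tf.random_normal(shape=[3, 2]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(assign_op))  # one set of random values
    print(sess.run(assign_op))  # a different set: the op re-samples on every run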
Another way to implement this
An alternative way to implement this is to use a tf.placeholder and feed new values in on every iteration, as follows.
import tensorflow as tf
import numpy as np

training_input = tf.Variable(tf.zeros(shape=[3, 2]))
tf_random = tf.placeholder(shape=[3, 2], dtype=tf.float32)
assign_training_input_with_random_values = training_input.assign(tf_random)
# Create a session, initialize a bunch of variables, construct a neural network...
epoch = 0
with tf.Session() as sess:
    while epoch < 10:
        epoch += 1
        sess.run(assign_training_input_with_random_values,
                 feed_dict={tf_random: np.random.normal(size=(3, 2))})
Comparing Solution 1 vs My solution
So it turns out that neither your first solution nor my solution grows the graph. If you run the line
print([n.name for n in tf.get_default_graph().as_graph_def().node])
for your first solution and my solution (be careful to run tf.reset_default_graph() at the beginning), you'll see that the number of nodes remains constant regardless of the number of iterations. It appears that TensorFlow is smart enough to prune those old tf.random tensors that are no longer used.
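As an illustration, here is a small sketch of that check (the shape is arbitrary; the point is only that the node count stays constant across runs):
import tensorflow as tf

tf.reset_default_graph()
training_input = tf.Variable(tf.zeros(shape=[3, 2]))
assign_op = training_input.assign(tf.random_normal(shape=[3, 2]))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    nodes_before = len(tf.get_default_graph().as_graph_def().node)
    for _ in range(10):
        sess.run(assign_op)
    nodes_after = len(tf.get_default_graph().as_graph_def().node)
    print(nodes_before, nodes_after)  # equal: running an existing op adds no new nodes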

Must I call iter.get_next when using Tensorflow's Dataset?

I have converted code over from a queue based system to tensorflow's dataset. After the conversion, I'm seeing a loss of accuracy and times have increased. I am attributing this to an incorrect implementation on my end and I am currently trying to troubleshoot what may be the issue. Now through trial and error in this conversion I made some assumptions based on a number of articles and examples I came across and I just wanted to make sure that my current implementation is correct and that my assumptions were as well.
Previously I had a huge number of images and I would batch them into a queue, then pop my 100 images off the queue, perform processing and summarization, and then continue. I believed this loading into memory via the queue was potentially causing a bottleneck, so when I heard about the Dataset API, I figured it was worth a look. So I now retrieve all the image info and pass it to my method, where I then perform the batching via the Dataset batch method. The before and after are shown below. I had read that it wasn't necessary to call iter.get_next on the dataset as the ops would call it automatically, however with the accuracy I'm seeing at the end, I'm hesitant on whether this is true or not. Currently, as you can see, I just pass iter.initializer as an op to sess.run along with my other ops and pass the feed_dict. Any insight would be helpful as I'm somewhat new to this. Thanks!
Previous sample function when using queue:
(Mind you I would queue the images into a blob object and pass that subset to this method)
def get_summary(self, sess, images, labels, weights, keep_prob=1.0):
    feed_dict = {self._input_images: images, self._input_labels: labels,
                 self._input_weights: weights, self._is_training: False}
    summary, acc = sess.run([self._summary_op, self._accuracy], feed_dict=feed_dict)
    return summary, acc
Current sample function using Dataset API:
(Now prior to calling this, I populate my blob object with all data and use the batch functionality below -- notice that I never make a call to iter.get_next())
def get_summary(self, sess, images, labels, weights, keep_prob=1.0, batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices((self._input_images, self._input_labels,
                                                  self._input_weights)).repeat().batch(batch_size)
    iter = dataset.make_initializable_iterator()
    feed_dict = {self._input_images: images, self._input_labels: labels,
                 self._input_weights: weights, self._is_training: False}
    _, summary, acc = sess.run([iter.initializer, self._summary_op, self._accuracy], feed_dict=feed_dict)
    return summary, acc
From that code snippet, it looks like you never use the values from iter, so it should have no effect on your summaries. For example, you should be able to delete the lines that create the iterator, remove iter.initializer from the list passed to sess.run(), and get the same result.
To answer the broader question of "Must I call iter.get_next()?": in graph-based TensorFlow there must be a dataflow connection between the tf.data.Iterator and the tensor/operation that you pass to sess.run() in order to consume values from that iterator. If you are using the low-level TensorFlow API, the easiest way to achieve this is to call iter.get_next() to get one or more tf.Tensor objects, and then use those tensors as the input to your model.
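For illustration, here is a rough sketch of that low-level pattern (build_model is a hypothetical helper standing in for however your graph builds the summary and accuracy ops; images, labels, weights and batch_size are assumed to be the same values you already pass in):
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices((images, labels, weights)).repeat().batch(batch_size)
iterator = dataset.make_initializable_iterator()
image_batch, label_batch, weight_batch = iterator.get_next()  # the dataflow connection

# Hypothetical: build the summary/accuracy ops on top of the iterator's tensors
# instead of on placeholders.
summary_op, accuracy = build_model(image_batch, label_batch, weight_batch)

with tf.Session() as sess:
    sess.run(iterator.initializer)
    summary, acc = sess.run([summary_op, accuracy])  # no feed_dict needed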
However, if you are using the high-level tf.estimator API, your input_fn can return a tf.data.Dataset without creating a tf.data.Iterator (or calling Iterator.get_next()), and the Estimator API will take care of creating the iterator and calling get_next() for you.

Tensorflow batch normalization: difference between momentum and renorm_momentum

I want to replicate in TensorFlow a network built with the Lasagne library. I'm having some trouble with the batch normalization.
This is the Lasagne documentation for the batch normalization that is used:
http://lasagne.readthedocs.io/en/latest/modules/layers/normalization.html?highlight=batchNorm
In TensorFlow I found two functions for batch normalization:
https://www.tensorflow.org/api_docs/python/tf/nn/batch_normalization
https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization
The first one is simpler but does not let me choose the alpha parameter from Lasagne (the coefficient for the exponential moving average of batch-wise means and standard deviations computed during training). I tried using the second function, which has a lot more options, but there are two things I do not understand about it:
I am not clear about the difference between momentum and renorm_momentum. If I have an alpha of 0.9 in the Lasagne network, can I just set both TensorFlow momentums to 0.9 and expect the same behaviour?
The tf documentation notes:
when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. For example:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)
I do not really understand what is happening here and where I need to put something similar in my code. Can I just put this somewhere before I run the session? What parts of this code piece should I not copy literally but change depending on my code?
There is a big difference between tf.nn.batch_normalization and tf.layers.batch_normalization. See my answer here. So you have made the right choice by using the layers version. Now, on your questions:
renorm_momentum only has an effect if you use batch renormalization by setting the renorm argument to True. You can ignore this if you are using default batch normalization.
Short answer: You can literally copy that code snippet. Put it exactly where you would normally call optimizer.minimize.
Long answer on 2.: Batch normalization has two "modes": training and inference. During training, the mean and variance of the current minibatch are used. During inference, this is not desirable (e.g. you might not even use batches as input, so there would be no minibatch statistics). For this reason, moving averages over the minibatch means/variances are kept during training. These moving averages are then used for inference.
By default, Tensorflow only executes what it needs to. Those moving averages are not needed for training, so they normally would never be executed/updated. The tf.control_dependencies context manager forces Tensorflow to do the updates every time it computes whatever is in the code block (in this case the cost). Since the cost certainly needs to be computed exactly once per training step, this is a good way of making sure the moving averages are updated.
The code example seems a bit arcane, but in context it would really just be (as an example):
loss = ...
train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ....
becomes
loss = ...
with tf.control_dependencies(tf.get_collection(tf.GraphKeys.UPDATE_OPS)):
    train_step = SomeOptimizer().minimize(loss)
with tf.Session() as sess:
    ....
Finally, remember to pass the correct training argument to batch normalization so that either minibatch statistics or the moving averages are used, as intended.
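To tie the pieces together, here is a minimal end-to-end sketch (the network, shapes and momentum value are made up; note that TF's momentum is the decay of the moving averages, so if Lasagne's alpha is the weight given to the new batch statistics, the rough correspondence would be momentum ≈ 1 - alpha, which is worth double-checking against the Lasagne source):
import tensorflow as tf

inputs = tf.placeholder(tf.float32, [None, 32, 32, 3])
labels = tf.placeholder(tf.float32, [None, 10])
is_training = tf.placeholder(tf.bool)  # switches between minibatch stats and moving averages

x = tf.layers.conv2d(inputs, filters=16, kernel_size=3, padding='same')
x = tf.layers.batch_normalization(x, momentum=0.9, training=is_training)
x = tf.nn.relu(x)
logits = tf.layers.dense(tf.layers.flatten(x), 10)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits))

# The moving-average update ops live in UPDATE_OPS; tie them to the train op.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_step = tf.train.AdamOptimizer(1e-3).minimize(loss)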

Questions about tensorflow GetStarted tutorial

So I was reading the TensorFlow getting-started tutorial and I found it very hard to follow. There were a lot of explanations missing about each function and why they are necessary (or not).
1. In the tf.estimator section, what is the meaning of the "x_eval" and "y_eval" arrays, or what are they supposed to be? The x_train and y_train arrays give the desired output (which is the corresponding y coordinate) for a given x coordinate. But the x_eval and y_eval values are incorrect: for x=5, y should be -4, not -4.1. Where do those values come from? What do x_eval and y_eval mean? Are they necessary? How did they choose those values?
2. The difference between "input_fn" (what does "fn" even mean?) and "train_input_fn". I see that the only difference is one has
num_epochs=None, shuffle=True
num_epochs=1000, shuffle=False
but I don't understand what "input_fn" or "train_input_fn" are/do, or what's the difference between the two, or if both are necessary.
3. In the
estimator.train(input_fn=input_fn, steps=1000)
piece of code, I don't understand the difference between "steps" and "num_epochs". What's the meaning of each one? Can you have num_epochs=1000 and steps=1000 too?
The final question is: how do I get the W and the b? In the previous way of doing it (not using tf.estimator) they explicitly found that W=-1 and b=1. If I was building a more complex neural network, involving biases and weights, I would want to recover the actual values of the weights and biases. That's the whole point of why I'm using TensorFlow, to find the weights! So how do I recover them in the tf.estimator example?
These are just some of the questions that bugged me while reading the "getting started" tutorial. I personally think it leaves a lot to be desired, since it's very unclear what each thing does and at best you can only guess.
I agree with you that the tf.estimator is not very well introduced in this "getting started" tutorial. I also think that some machine learning background would help with understanding what happens in the tutorial.
As for the answers to your questions:
In machine learning, we usually minimize the loss of the model on the training set, and then we evaluate the performance of the model on the evaluation set. This is because it is easy to overfit the training set and get 100% accuracy on it, so using a separate validation set makes it impossible to cheat in this way.
Here (x_train, y_train) corresponds to the training set, where the global minimum is obtained for W=-1, b=1.
The validation set (x_eval, y_eval) doesn't have to follow the distribution of the training set perfectly. Although we can get a loss of 0 on the training set, we obtain a small loss on the validation set because we don't have exactly y_eval = -x_eval + 1.
input_fn means "input function". This is to indicate that the object input_fn is a function.
In tf.estimator, you need to provide an input function if you want to train the estimator (estimator.train()) or evaluate it (estimator.evaluate()).
Usually you want different transformations for training or evaluation, so you have two functions train_input_fn and eval_input_fn (the input_fn in the tutorial is almost equivalent to train_input_fn and is just confusing).
For instance, during training we want to train for multiple epochs (i.e. multiple passes over the dataset). For evaluation, we only need one pass over the validation data to compute the metrics we need.
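To make that concrete, here is roughly how the input functions look in the tutorial (the values are the ones I remember from the getting-started page, so treat them as illustrative):
import numpy as np
import tensorflow as tf

x_train = np.array([1., 2., 3., 4.])
y_train = np.array([0., -1., -2., -3.])
x_eval = np.array([2., 5., 8., 1.])
y_eval = np.array([-1.01, -4.1, -7., 0.])

# Used for training: repeat indefinitely (num_epochs=None) and shuffle.
input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=None, shuffle=True)
# Used for evaluation: a fixed number of passes and no shuffling, so the metrics are reproducible.
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_train}, y_train, batch_size=4, num_epochs=1000, shuffle=False)
eval_input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x_eval}, y_eval, batch_size=4, num_epochs=1000, shuffle=False)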
The number of epochs is the number of times we repeat the entire dataset. For instance if we train for 10 epochs, the model will see each input 10 times.
When we train a machine learning model, we usually use mini-batches of data. For instance if we have 1,000 images, we can train on batches of 100 images. Therefore, training for 10 epochs means training on 100 batches of data.
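A small sketch illustrating that arithmetic (the data here is synthetic; only the batch/epoch/step counts matter):
import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 1).astype(np.float32)
y = (2. * x[:, 0] + 1.).astype(np.float32)

# 1,000 examples with batch_size=100 -> 10 batches (steps) per epoch,
# so num_epochs=10 provides at most 100 steps in total.
input_fn = tf.estimator.inputs.numpy_input_fn(
    {"x": x}, y, batch_size=100, num_epochs=10, shuffle=True)

estimator = tf.estimator.LinearRegressor(
    feature_columns=[tf.feature_column.numeric_column("x", shape=[1])])
# steps=1000 is only an upper bound: training stops after 100 steps here
# because the input function runs out of data.
estimator.train(input_fn=input_fn, steps=1000)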
Once the estimator is trained, you can access the list of variables through estimator.get_variable_names() and the value of a variable through estimator.get_variable_value().
Usually we never need to do that, as we can for instance use the trained estimator to predict on new examples, using estimator.predict().
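For example (continuing with a trained linear estimator; the exact variable names depend on the model, and the ones in the comment are just what I have typically seen for a LinearRegressor):
# Inspect everything the estimator learned.
for name in estimator.get_variable_names():
    print(name, estimator.get_variable_value(name))
# For a simple linear model, the weight and bias usually live under names
# similar to 'linear/linear_model/x/weights' and 'linear/linear_model/bias_weights'.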
If you feel that the getting started is confusing, you can always submit a GitHub issue to tell the TensorFlow team and explain your point.