PTB rnn model one PTBModel object instead of three - tensorflow

In the PTB RNN model, three PTBModel objects are created, namely m, mvalid and mtest:
with tf.Graph().as_default(), tf.Session() as session:
    initializer = tf.random_uniform_initializer(-config.init_scale,
                                                config.init_scale)
    with tf.variable_scope("model", reuse=None, initializer=initializer):
        m = PTBModel(is_training=True, config=config)
    with tf.variable_scope("model", reuse=True, initializer=initializer):
        mvalid = PTBModel(is_training=False, config=config)
        mtest = PTBModel(is_training=False, config=eval_config)
My questions are:
Do all three objects live in the same graph? (It looks like they all live under the default graph.)
Do these three objects share the same placeholders, e.g., _input_data? Or is a different set of placeholders created with each PTBModel object, so that there are, for example, three _input_data placeholders within the same graph (one for feeding training data, another for validation and yet another for testing)?
Suppose I create only one PTBModel object. Would it be possible to reuse the _input_data placeholder used for training, change its shape, and use it for testing as well (where the 1st dimension, num_steps, is set to 1 at test time)?
Thanks!

Yes, these three objects live in the same graph.
The placeholders are different, and you need to feed the correct one if you want to evaluate a particular part of the graph.
It would be possible in theory, but it is not trivial. For example, you could have a training graph unrolled for 20 steps but use only a subset of the steps for evaluation. The other possibility might be to use the dynamic_rnn functionality.
In general, building a few copies of the graph is not very expensive and it might not be worth spending a lot of time on optimizing the number of allocated nodes.
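To illustrate the first two points, here is a minimal plain-Python toy, not TensorFlow code. It only mimics the bookkeeping: weights looked up by name are shared when reuse is on, while every model instance builds its own input placeholder. ToyGraph, ToyModel, the variable name and the batch and step sizes are all made up for the illustration.

```python
# Plain-Python toy, not TensorFlow: mimics how variable reuse shares weights
# while every model instance still builds its own input placeholder.

class ToyGraph:
    def __init__(self):
        self.variables = {}      # name -> shared weight container
        self.placeholders = []   # one entry per placeholder created

    def get_variable(self, name, reuse):
        if reuse:
            return self.variables[name]   # must already exist, like reuse=True
        if name in self.variables:
            raise ValueError("variable %s already exists" % name)
        self.variables[name] = {"name": name, "value": 0.0}
        return self.variables[name]

    def placeholder(self, shape):
        ph = {"id": len(self.placeholders), "shape": shape}
        self.placeholders.append(ph)
        return ph


class ToyModel:
    def __init__(self, graph, batch_size, num_steps, reuse):
        # Each model gets a fresh placeholder, even inside the same graph.
        self.input_data = graph.placeholder((batch_size, num_steps))
        # The weights are looked up by name, so they can be shared.
        self.weights = graph.get_variable("model/weights", reuse)


graph = ToyGraph()
# Hypothetical sizes; the real values come from config and eval_config.
m = ToyModel(graph, batch_size=20, num_steps=20, reuse=False)
mvalid = ToyModel(graph, batch_size=20, num_steps=20, reuse=True)
mtest = ToyModel(graph, batch_size=1, num_steps=1, reuse=True)

print(len(graph.variables))      # 1
print(len(graph.placeholders))   # 3
```

One graph ends up holding a single set of weights and three separate placeholders, which is why each model has to be fed through its own _input_data.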

Related

What's the difference between tf.GraphKeys.TRAINABLE_VARIABLES and tf.GraphKeys.UPDATE_OPS in tensorflow?

Here is the doc for tf.GraphKeys in tensorflow, for example TRAINABLE_VARIABLES: the subset of Variable objects that will be trained by an optimizer.
And I know tf.get_collection(), which can fetch the tensors you want from a collection.
When using tensorflow.contrib.layers.batch_norm(), the default value of the updates_collections parameter is GraphKeys.UPDATE_OPS.
How should we understand these collections, and what is the difference between them?
Besides, we can find more in ops.py.
These are two different things.
TRAINABLE_VARIABLES
TRAINABLE_VARIABLES is the collection of variables or training parameters which should be modified when minimizing the loss. For example, these can be the weights determining the function performed by each node in the network.
How do variables get added to this collection? This happens automatically when you define a new variable with tf.get_variable, unless you specify
tf.get_variable(..., trainable=False)
When would you want a variable to be untrainable? This happens from time to time. For example, occasionally you will want to use a two-step approach in which you first train the entire network on a large, generic dataset, then fine-tune the network on a smaller dataset which is specifically related to your problem. In such cases, you might want to fine-tune only part of the network, e.g., the last layer. Specifying some variables as untrainable is one of the ways to do this.
UPDATE_OPS
UPDATE_OPS is a collection of ops (operations performed when the graph runs, like multiplication, ReLU, etc.), not variables. Specifically, this collection maintains a list of ops which need to run before each training step.
How do ops get added to this collection?
By definition, update_ops occur outside the regular flow of training by loss minimization, so generally you will be adding ops to this collection only under special circumstances. For example, when performing batch normalization, you want to recompute the batch mean and variance before each training step, and this is how it's done. The mechanics of batch normalization using tf.contrib.layers.batch_norm are described in more detail in this article.
I disagree with the previous answer.
Actually, everything is an op in tensorflow. The variables in the TRAINABLE_VARIABLES collection are also ops, created by tf.get_variable or tf.Variable.
As for the UPDATE_OPS collection, it usually includes the moving mean and moving variance created in the tf.layers.batch_normalization function. These ops can also be regarded as variables, as their values are updated at each training step, just like the weights and biases.
The main difference is that the trainable variables participate in back propagation, while the variables in UPDATE_OPS do not. They only participate in inference in test mode, so no gradients are computed for the variables in UPDATE_OPS.
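To make the batch-normalization case concrete, here is a minimal plain-Python sketch with no TensorFlow involved. It contrasts a weight that changes through a gradient step with moving statistics that change through a separate, gradient-free update. The decay value, the learning rate and the tiny loss are assumptions chosen only for the illustration.

```python
# Plain-Python sketch (no TensorFlow): a trainable weight changes through a
# gradient step, while batch-norm moving statistics change through a separate
# update that involves no gradient.

def batch_mean(values):
    return sum(values) / len(values)

def batch_variance(values):
    mu = batch_mean(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

def update_moving_stat(moving, batch_stat, decay=0.9):
    """Exponential moving average of a batch statistic (decay is hypothetical)."""
    return decay * moving + (1.0 - decay) * batch_stat

weight = 1.0            # stands in for an entry of TRAINABLE_VARIABLES
moving_mean = 0.0       # state changed by an op of the UPDATE_OPS kind
moving_var = 1.0
learning_rate = 0.1

for batch in ([1.0, 3.0], [2.0, 4.0]):
    # The update runs first and needs no gradient ...
    moving_mean = update_moving_stat(moving_mean, batch_mean(batch))
    moving_var = update_moving_stat(moving_var, batch_variance(batch))
    # ... then the gradient step. The loss is mean((weight * x) ** 2) / 2,
    # so its derivative with respect to weight is mean(weight * x ** 2).
    grad = sum(weight * x * x for x in batch) / len(batch)
    weight -= learning_rate * grad

print(moving_mean, moving_var, weight)
```

The weight moves only because a loss gradient exists for it. The moving statistics move only because their update is executed explicitly alongside the training step.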

Avoiding weight sharing among certain layers in BucketingModule in mxnet?

I am using BucketingModule for training multiple small models/bots together. Here, the bucket key is bot_id. However, each bot has a separate set of target labels/classes (and hence a softmax layer of a different size for each bot).
Is there any way to train such a model in mxnet, where I want to share the weights for all the layers but one (softmax) among all the bots?
How would I initialize such a model using sym_gen method?
If, in the sym_gen method, I specify num_hidden=len(size_dict[bot]) for the softmax layer, i.e.,
pred = mx.sym.FullyConnected(data=pred, num_hidden=len(size_dict[bot]), name='pred')
pred = mx.sym.SoftmaxOutput(data=pred, label=label, name='softmax')
I get the error:
Inferred shape does not match shared_exec.arg_array's shape
which makes sense, as each bot has a different number of target classes.
This issue was posted and resolved here: https://github.com/apache/incubator-mxnet/issues/9042
You can make sym_gen(default_bucket_key) return a "master network" that contains all these FC layers of different shapes, and sym_gen(other_keys) return a subset of the master network with one particular FC. Note that for the master network, you probably need to use mx.sym.Group to group all outputs together so only one symbol is returned.
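The idea can be sketched as a plain-Python toy that tracks only argument names and shapes. It is not mxnet code, and it assumes that each bot's output layer gets its own name (here pred_<bot>) so the master network can hold all of them. The layer names, sizes and label sets are hypothetical.

```python
# Plain-Python toy, not mxnet: it tracks only argument names and shapes.
HIDDEN = 64          # hypothetical width of the shared layer
INPUT_DIM = 100      # hypothetical input size
size_dict = {"bot_a": ["yes", "no", "maybe"],
             "bot_b": ["buy", "sell"]}   # hypothetical label sets per bot

def arg_shapes(bucket_key, unique_head_names=True):
    """Return {argument name: shape} for the network of one bucket."""
    args = {"fc1_weight": (HIDDEN, INPUT_DIM)}      # shared by every bot
    bots = list(size_dict) if bucket_key == "master" else [bucket_key]
    for bot in bots:
        head = "pred_%s" % bot if unique_head_names else "pred"
        args[head + "_weight"] = (len(size_dict[bot]), HIDDEN)
    return args

def fits_master(master, bucket):
    """Toy check: every bucket argument must match the master by name and shape."""
    return all(master.get(name) == shape for name, shape in bucket.items())

# Master network with one uniquely named head per bot: every bucket fits.
master = arg_shapes("master")
print(all(fits_master(master, arg_shapes(bot)) for bot in size_dict))  # True

# A single shared name 'pred' makes the shapes collide between bots.
default = arg_shapes("bot_a", unique_head_names=False)
print(fits_master(default, arg_shapes("bot_b", unique_head_names=False)))  # False
```

The second check mirrors the situation in the question, where the same layer name carries a different shape for each bot.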

Reusing part of a tensorflow trained graph

So, I trained a tensorflow model with a few layers, more or less like this:
with tf.variable_scope('model1') as scope:
    inputs = tf.placeholder(tf.int32, [None, num_time_steps])
    embeddings = tf.get_variable('embeddings', (vocab_size, embedding_size))
    lstm = tf.nn.rnn_cell.LSTMCell(lstm_units)
    embedded = tf.nn.embedding_lookup(embeddings, inputs)
    _, state = tf.nn.dynamic_rnn(lstm, embedded, dtype=tf.float32, scope=scope)
    # more stuff on the state
Now, I wanted to reuse the embedding matrix and the lstm weights in another model, which is very different from this one except for these two components.
As far as I know, if I load them with a tf.train.Saver object, it will look for variables with the exact same names, but I'm using different variable_scopes in the two graphs.
In this answer, it is suggested to create the graph where the LSTM is trained as a superset of the other one, but I don't think it is possible in my case, given the differences in the two models. Anyway, I don't think it is a good idea to make one graph dependent on the other, if they do independent things.
I thought about changing the variable scope of the LSTM weights and embeddings in the serialized graph. I mean, where it originally read model1/Weights:0 or something, it would be another_scope/Weights:0. Is it possible and feasible?
Of course, if there is a better solution, it is also welcome.
I found out that the Saver can be initialized with a dictionary mapping variable names (without the trailing :0) in the serialized file to the variable objects I want to restore in the graph. For example:
varmap = {'model1/some_scope/weights': variable_in_model2,
          'model1/another_scope/weights': another_variable_in_model2}
saver = tf.train.Saver(varmap)
saver.restore(sess, path_to_saved_file)
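When there are many variables, the dictionary can be built programmatically instead of by hand. The following is a minimal sketch, assuming the variables keep the same names below the scope and only the scope prefix differs. build_restore_map and FakeVariable are hypothetical helpers, and the variable names are made up. The function only needs objects with a name attribute, so a namedtuple stands in for tf.Variable here.

```python
from collections import namedtuple

def build_restore_map(variables, saved_scope, new_scope):
    """Map checkpoint names under saved_scope to variables under new_scope."""
    varmap = {}
    prefix = new_scope + "/"
    for var in variables:
        name = var.name.split(":")[0]          # drop the trailing ":0"
        if name.startswith(prefix):
            # Swap the scope prefix, keep the rest of the name unchanged.
            varmap[saved_scope + "/" + name[len(prefix):]] = var
    return varmap

# Stand-in for tf.Variable so the sketch runs without TensorFlow.
FakeVariable = namedtuple("FakeVariable", ["name"])

model2_vars = [
    FakeVariable("model2/embeddings:0"),
    FakeVariable("model2/lstm_cell/kernel:0"),
    FakeVariable("model2/classifier/weights:0"),   # not part of the old model
]

# Restore only the two components that both models have in common.
varmap = build_restore_map(model2_vars[:2], "model1", "model2")
print(sorted(varmap))
# ['model1/embeddings', 'model1/lstm_cell/kernel']
```

The resulting dictionary has the same form as varmap above and would be handed to the Saver in the same way.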

Tracking counts of examples used in training

I am trying to implement the CBOW word2vec model based on the skip-gram implementation in the tensorflow repository:
https://github.com/tensorflow/tensorflow/blob/v0.10.0/tensorflow/models/embedding/word2vec.py
I have previously implemented the simplified version following the TensorFlow tutorials, so I understand that I will have to modify the data batching function as well as a small part of the graph to get the context embedding.
In the skipgram implementation, the data batching function is used in lines 348-351.
(words, counts, words_per_epoch, self._epoch, self._words, examples,
 labels) = word2vec.skipgram(filename=opts.train_data,
                             batch_size=opts.batch_size,
                             window_size=opts.window_size,
                             min_count=opts.min_count,
                             subsample=opts.subsample)
From my understanding, the variables assigned are as follows:
words: terms in the vocabulary
counts: associated counts of terms used in the corpus
words_per_epoch: total word count in the corpus
self._epoch: current count of epochs used
self._words: current count of training examples used
examples: current batch of training examples
labels: current batch of training labels
I have managed to replicate the tensor for words, counts, words_per_epoch, examples and labels. However, self._epoch and self._words have eluded me. If my understanding is correct, I need to be able to track the count of the training examples used. However, this is not provided by the sample batching function. The counts are later used in a multi-threaded manner to terminate the training loop, hence I can't simply use a loop to add up the counts.
I understand that bits of the tensorflow ops are implemented in C++. However, as I am not familiar with C++, I will have to replicate those parts using Python.
It would be great to get some suggestions on how to obtain the tensor for self._words. The tensor basically has to increment every time a new batch of examples/labels is fetched. With that, I can simply use self._epoch = self._words // words_per_epoch to get the other tensor.
I figured out the trick while looking at the source code for tensorflow.models.embedding.word2vec_optimized.py, specifically at how global_step is incremented when loss is called in lines 218-225.
In my case, I would have to do it like so:
# code to prepare the features and labels tensors
data_processed = tf.Variable(0, trainable=False, dtype=tf.int64)
epochs_processed = data_processed // data_per_epoch
inc_op = data_processed.assign_add(batch_size)
with tf.control_dependencies([inc_op]):
    features_batch, labels_batch = tf.train.batch([features, labels],
                                                  batch_size=batch_size)
In this case, the tensor data_processed will be incremented by batch_size whenever features_batch or labels_batch is evaluated. epochs_processed will also change accordingly.
The use of tf.control_dependencies(control_inputs) is key here. It returns a context manager. The operations specified in control_inputs must be executed before the operations defined in the context.
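For comparison, the same bookkeeping can be sketched in plain Python without TensorFlow: a counter tied to the batch fetch, safe to use from several threads, with the epoch derived by integer division. CountingBatcher and the toy data are hypothetical and only illustrate the counting logic, not the TensorFlow mechanics.

```python
import threading

class CountingBatcher:
    """Hands out batches and counts how many examples were handed out."""

    def __init__(self, data, batch_size):
        self._data = list(data)
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self.data_processed = 0

    def next_batch(self):
        # Read and increment under one lock so concurrent fetches
        # never lose a count.
        with self._lock:
            start = self.data_processed
            self.data_processed += self._batch_size
        n = len(self._data)
        # Wrap around at the end of the data, like starting a new epoch.
        return [self._data[(start + i) % n] for i in range(self._batch_size)]

    @property
    def epochs_processed(self):
        return self.data_processed // len(self._data)


batcher = CountingBatcher(data=range(10), batch_size=2)

def worker():
    for _ in range(5):
        batcher.next_batch()

# Four threads fetch five batches of two examples each: 40 examples in total.
threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(batcher.data_processed, batcher.epochs_processed)  # 40 4
```

Because the increment is part of the fetch itself, no fetch can happen without being counted, which is the role the control dependency plays in the graph version.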

Caching Computations in TensorFlow

Is there a canonical way to reuse computations from a previously-supplied placeholder in TensorFlow? My specific use case:
supply many inputs (using one placeholder) simultaneously, all of which are fed through a network to obtain smaller representations
define a loss based on various combinations of these smaller representations
train on one batch at a time, where each batch uses some subset of the inputs, without recomputing the smaller representations
Here is the goal in code, which is flawed because the same computations are carried out again and again:
X_in = some_fixed_data
combinations_in = large_set_of_combination_indices
for combination_batch_in in batches(combinations_in, batch_size=128):
    session.run(train_op, feed_dict={X: X_in, combinations: combination_batch_in})
Thanks.
The canonical way to share computed values across sess.run() calls is to use a Variable. In this case, you could set up your graph so that when the Placeholders are fed, they compute a new value of the representation that is saved into a Variable. A separate portion of the graph reads those Variables to compute the loss. This will not work if you need to compute gradients through the part of the graph that computes the representation. Computing those gradients will require recomputing every Op in the encoder.
This is the kind of thing that should be solved automatically with CSE (common subexpression elimination). I'm not sure what the support is in TensorFlow right now; it might be kind of spotty, but there's an optimizer_do_cse flag in the Graph options which defaults to false, and you can set it to true using GraphConstructorOptions. Here's a C++ example of using GraphConstructorOptions (sorry, couldn't find a Python one).
If that doesn't work, you could do "manual CSE", i.e., figure out which part is being needlessly recomputed, factor it out into a separate Tensor, and reference that tensor in all the calculations.
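The caching idea can be sketched in plain Python without TensorFlow: compute the representations once, keep them, and let every batch of combinations index into the stored values. The encode function, the data and the pairwise loss are hypothetical stand-ins. As noted above, this only helps when no gradient has to flow back through the encoder.

```python
calls = {"encode": 0}

def encode(x):
    """Hypothetical stand-in for the network that produces a representation."""
    calls["encode"] += 1
    return sum(x) / len(x)

def batches(items, batch_size):
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

X_in = [[1.0, 2.0], [3.0, 5.0], [0.0, 4.0]]       # made-up inputs
combinations_in = [(0, 1), (0, 2), (1, 2), (1, 0)]  # made-up index pairs

# Computed once and cached, instead of once per batch.
representations = [encode(x) for x in X_in]

losses = []
for batch in batches(combinations_in, batch_size=2):
    # Each batch only reads the cached values; the encoder is not rerun.
    loss = sum((representations[i] - representations[j]) ** 2
               for i, j in batch) / len(batch)
    losses.append(loss)

print(calls["encode"])  # 3
print(losses)           # [3.25, 5.125]
```

The encoder runs once per input rather than once per batch, which is the saving the question asks for.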