Avoiding weight sharing among certain layers in BucketingModule in mxnet?

I am using BucketingModule for training multiple small models/bots together. Here, the bucket key is bot_id. However, each bot has separate set of target labels/classes (and hence, different size of softmax layer for each bot).
Is there any way to train such a model in mxnet, where I want to share the weights for all the layers but one (softmax) among all the bots?
How would I initialize such a model using sym_gen method?
If, in the sym_gen method, I specify the bot-specific num_hidden for the softmax layer, i.e.,
pred = mx.sym.FullyConnected(data=pred, num_hidden=len(size_dict[bot]), name='pred')
pred = mx.sym.SoftmaxOutput(data=pred, label=label, name='softmax')
I get the error:
Inferred shape does not match shared_exec.arg_array's shape
which makes sense as each bot has different number of target classes.

This issue was posted and resolved here: https://github.com/apache/incubator-mxnet/issues/9042
You can make sym_gen(default_bucket_key) return a "master network" that contains all of these FC layers of different shapes, and have sym_gen(other_keys) return a subset of the master network with one particular FC. Note that for the master network you probably need to use mx.sym.Group to group all outputs together, so that only one symbol is returned.
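A minimal sketch of that idea, assuming the bot ids and their label-set sizes live in a hypothetical size_dict and the shared body is a single hidden layer (your real network will differ; all names here are placeholders):

import mxnet as mx

size_dict = {'bot_a': 5, 'bot_b': 12}   # hypothetical bot -> number of target classes
DEFAULT_KEY = 'master'                  # default_bucket_key for the BucketingModule

def sym_gen(bucket_key):
    data = mx.sym.Variable('data')
    label = mx.sym.Variable('softmax_label')
    # shared body: these weights are reused by every bucket
    shared = mx.sym.FullyConnected(data=data, num_hidden=128, name='shared_fc')
    shared = mx.sym.Activation(data=shared, act_type='relu', name='shared_relu')
    if bucket_key == DEFAULT_KEY:
        # master network: contains every bot-specific FC so all parameters get allocated
        outs = []
        for bot, n_cls in size_dict.items():
            pred = mx.sym.FullyConnected(data=shared, num_hidden=n_cls, name='pred_%s' % bot)
            outs.append(mx.sym.SoftmaxOutput(data=pred, label=label, name='softmax_%s' % bot))
        return mx.sym.Group(outs), ('data',), ('softmax_label',)
    # per-bot network: a subset of the master graph with one FC/softmax
    pred = mx.sym.FullyConnected(data=shared, num_hidden=size_dict[bucket_key],
                                 name='pred_%s' % bucket_key)
    out = mx.sym.SoftmaxOutput(data=pred, label=label, name='softmax_%s' % bucket_key)
    return out, ('data',), ('softmax_label',)

mod = mx.mod.BucketingModule(sym_gen, default_bucket_key=DEFAULT_KEY)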

Related

How to implement skip connections from MSG-GAN paper

I am trying to implement the technique described in the MSG-GAN paper:
https://arxiv.org/pdf/1903.06048.pdf
But I am having difficulty understanding some things, for example: how are the connections made from the generator to the discriminator? Are these literally Conv2D connections? (In that case, how would I insert the real images to train the discriminator?) Or does the discriminator have multiple outputs (one prediction per resolution, with the generator optimizing the average loss over the resolutions)?
How are the connections made from the generator to the discriminator?
The generator outputs an image at each resolution, and the discriminator concatenates each of them with the feature maps coming from its previous layer at the corresponding input block.
These are Conv2D connections literally?
They are just input tensors of shape (batch_size, H, W, 3), the same as an ordinary image input.
Does the discriminator have multiple outputs?
No. This is end-to-end training with all resolution outputs at the same time; otherwise it would just be like the Progressive Growing GAN, and there would be no reason for the concatenation operation at each input point (the beginning of each block) of the discriminator.
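A rough Keras-style sketch of the concatenation described above, with only two resolutions and made-up layer sizes (not taken from the paper's code). For real images you feed downsampled copies of the same image at every resolution:

import tensorflow as tf
from tensorflow.keras import layers

# hypothetical multi-resolution inputs: real or generated images at 16x16 and 8x8
img_16 = tf.keras.Input(shape=(16, 16, 3), name='img_16')
img_8 = tf.keras.Input(shape=(8, 8, 3), name='img_8')

# first discriminator block operates on the highest resolution
x = layers.Conv2D(64, 3, padding='same', activation='relu')(img_16)
x = layers.AveragePooling2D()(x)          # 16x16 -> 8x8 feature maps

# at the next block, concatenate the 8x8 image with the downsampled feature maps
x = layers.Concatenate()([x, img_8])
x = layers.Conv2D(128, 3, padding='same', activation='relu')(x)
x = layers.Flatten()(x)
out = layers.Dense(1)(x)                  # single real/fake score

discriminator = tf.keras.Model([img_16, img_8], out)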
This is only a partial answer.
I would say that if you had to implement this in Keras and you don't want each model (G and D) to be in one piece, it is actually easier to keep them as separate models and then train them with tf.GradientTape().
Does the discriminator have multiple outputs?
Yes. If you implement them as separate models, there will be multiple inputs and multiple outputs at multiple resolutions; only one of those is the final output.
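A skeletal example of such a GradientTape-based training step, assuming generator and discriminator are separate Keras models that take and return lists of multi-resolution tensors; the loss choice and all names are placeholders, not from the paper:

import tensorflow as tf

def train_step(generator, discriminator, g_opt, d_opt, real_imgs_multires, noise):
    # real_imgs_multires: list of the same real batch downsampled to each resolution
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fakes = generator(noise, training=True)           # list of fakes, one per resolution
        real_score = discriminator(real_imgs_multires, training=True)
        fake_score = discriminator(fakes, training=True)
        # non-saturating GAN losses (one choice among several)
        d_loss = tf.reduce_mean(tf.nn.softplus(fake_score) + tf.nn.softplus(-real_score))
        g_loss = tf.reduce_mean(tf.nn.softplus(-fake_score))
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))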

Training Tensorflow only one object

Following the TensorFlow documentation, I trained a model on 3 objects and got results (it can recognize these objects). When I show it other objects (not one of the 3), it does not work correctly.
I want to train on only one object (for example, a cup) and recognize only that object. Is it possible to do this with TensorFlow?
Your question doesn't provide enough details, but my guess is that you trained the network with a softmax activation and a Categorical or SparseCategorical cross-entropy loss. If that guess is right, such a network always assigns its prediction to one of the three classes, regardless of the actual data, i.e. there is no "none of the above" option.
In order to train a network to recognize only one class of objects, use a single output with only one channel and a sigmoid activation. Use a BinaryCrossentropy loss to train your model for the specific object, and provide a dataset that includes examples both with and without this object.
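A minimal Keras sketch of such a single-class setup; the architecture, input size and names are placeholders for illustration only:

import tensorflow as tf
from tensorflow.keras import layers

# hypothetical binary "cup / not cup" classifier
model = tf.keras.Sequential([
    layers.Conv2D(32, 3, activation='relu', input_shape=(128, 128, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation='sigmoid'),   # one channel: probability the object is present
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=['accuracy'])
# labels: 1 for images containing the object, 0 for images without it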

Difference between bidirectional_dynamic_rnn and stack_bidirectional_dynamic_rnn in Tensorflow

I am building a dynamic RNN network with stacking multiple LSTMs. I see there are 2 options
# cells_fw and cells_bw are lists of cells, e.g. LSTM cells
stacked_cell_fw = tf.contrib.rnn.MultiRNNCell(cells_fw)
stacked_cell_bw = tf.contrib.rnn.MultiRNNCell(cells_bw)
output = tf.nn.bidirectional_dynamic_rnn(
    stacked_cell_fw, stacked_cell_bw, INPUT,
    sequence_length=LENGTHS, dtype=tf.float32)
vs
output = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
    cells_fw, cells_bw, INPUT,
    sequence_length=LENGTHS, dtype=tf.float32)
What is the difference between the 2 approaches and is one better than the other?
If you want to have multiple layers that pass information backward or forward in time, there are two ways to design this. Assume the forward stack consists of two layers F1, F2 and the backward stack consists of two layers B1, B2.
If you use tf.nn.bidirectional_dynamic_rnn, the whole forward stack (F1 then F2) runs over the sequence independently of the whole backward stack (B1 then B2); the two directions are only combined in the final output.
If you use tf.contrib.rnn.stack_bidirectional_dynamic_rnn, the outputs of the forward and backward cells of each layer are concatenated and fed to both the forward and backward cells of the next layer. This means both F2 and B2 receive exactly the same input, and there is an explicit connection between backward and forward layers. In "Speech Recognition with Deep Recurrent Neural Networks", Graves et al. summarize this as follows:
... every hidden layer receives input from both the
forward and backward layers at the level below.
This connection only happens implicitly in the unstacked BiRNN (the first variant), namely when mapping back to the output. The stacked BiRNN usually performed better for my purposes, but I guess that depends on your problem setting. In any case it is worthwhile to try out both.
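To make the difference concrete: stack_bidirectional_dynamic_rnn is roughly what you would get by calling bidirectional_dynamic_rnn once per layer and concatenating the two directions in between. A sketch of that equivalence (TF 1.x style, as in the question; variable names are made up):

import tensorflow as tf

def manual_stacked_birnn(cells_fw, cells_bw, inputs, lengths):
    # roughly what stack_bidirectional_dynamic_rnn does internally
    layer_input = inputs
    for i, (cell_fw, cell_bw) in enumerate(zip(cells_fw, cells_bw)):
        with tf.variable_scope('bidir_layer_%d' % i):
            (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
                cell_fw, cell_bw, layer_input,
                sequence_length=lengths, dtype=tf.float32)
            # the concatenated forward+backward output feeds the next layer
            layer_input = tf.concat([out_fw, out_bw], axis=-1)
    return layer_input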
EDIT
In response to your comment: I base my answer on the documentation of the function tf.contrib.rnn.stack_bidirectional_dynamic_rnn which says:
Stacks several bidirectional rnn layers. The combined forward and
backward layer outputs are used as input of the next layer.
tf.bidirectional_rnn does not allow to share forward and backward
information between layers.
Also, I looked at the implementation available under this link.

Changing a trained network to keep only a subset of its output

Suppose I have a trained TensorFlow classification network for 20 classes as in PASCAL VOC 2007: aeroplane, bicycle, ..., car, cat, ..., person, ..., tvmonitor.
Now, I would like to have a sub-network for only a subset of the classes, e.g., 3 classes: car, cat, person.
Then, I can use this network for testing or for re-training/fine-tuning on a new dataset, only for the 3 classes.
It should be possible to extract this sub-network out of the original network, since it is only the last layer that will change. We need to discard the neurons/weights for the discarded classes.
My question: Is there an easy way to do this in TensorFlow?
It will be great if you can point to some sample code or similar solution.
I have googled, but have not come across any mention of this.
The symmetric problem, expanding the number of classes without discarding the original weights, can potentially be useful for some people, but my current focus is the one above.
If you want to keep only a few of the outputs, you could simply extract the corresponding slices from the last layer.
For example, let's assume the last layer is fully connected. Its weights are a tensor of size num_previous x num_output.
You want to keep only a few of these outputs, say outputs 1, 22, and 42. You can get the weights of your new fully connected layer as:
outputs_to_keep = [1, 22, 42]
new_W = tf.transpose(tf.gather(tf.transpose(old_W), outputs_to_keep))
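If the layer also has a bias, the corresponding entries can be taken the same way (old_b here is a hypothetical name for the old bias vector):

new_b = tf.gather(old_b, outputs_to_keep)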
It is possible to extract a pretrained subnet as you said. It is called transfer learning. There are different ways to do it, here you have one:
Find the layer you want to start from. You can use TensorBoard to find it and then use graph.get_tensor_by_name(). Usually you keep the convolutional layers and discard the fully connected ones.
Connect your new layers (normally fully connected ones) to the previous layer.
Freeze the variables (weights) of the pretrained layers using trainable=False. Alternatively, you can instruct the optimizer to update only the weights of the new layers, as sketched after this list.
Train your model with the new classes.
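A sketch of the optimizer-based variant of step 3 in TF 1.x; the names bottleneck, labels and the scope 'new_layers' are placeholders for however you wire up your own graph:

import tensorflow as tf

# assume `bottleneck` is the pretrained tensor found via graph.get_tensor_by_name(...)
# and `labels` is a placeholder holding the one-hot targets for the new classes
with tf.variable_scope('new_layers'):
    new_logits = tf.layers.dense(bottleneck, units=3, name='new_fc')   # 3 new classes

loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=new_logits))

# update only the variables that belong to the freshly added layers
new_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='new_layers')
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss, var_list=new_vars)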

PTB rnn model one PTBModel object instead of three

In the PTB RNN model, three PTBModel objects are created, namely m, mvalid and mtest:
with tf.Graph().as_default(), tf.Session() as session:
    initializer = tf.random_uniform_initializer(-config.init_scale,
                                                config.init_scale)
    with tf.variable_scope("model", reuse=None, initializer=initializer):
        m = PTBModel(is_training=True, config=config)
    with tf.variable_scope("model", reuse=True, initializer=initializer):
        mvalid = PTBModel(is_training=False, config=config)
        mtest = PTBModel(is_training=False, config=eval_config)
my questions are:
do all these three objects live in the same graph? (It looks like they all live under the default graph.)
do these three objects share the same placeholders, e.g., _input_data? Or is it the case that different sets of placeholders are created with each PTBModel object, so that for example there are three _input_data placeholders within the same graph (one _input_data used for feeding training data, another for validation and yet another for testing)?
suppose I only create one PTBModel object, would it be possible to reuse the _input_data placeholder used for training and change its shape and use it for testing as well (where the 1st dimension, num_steps, is set to 1 at test time)?
Thanks!
Yes, these three objects live in the same graph.
The placeholders are different and you need to use the correct one if you want to evaluate particular part of the graph.
It would in theory be possible, but it is not trivial. E.g. you could have a training graph unrolled for 20 steps but use only a subset of the steps for evaluation. Another possibility is to use the dynamic_rnn functionality.
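A sketch of the dynamic_rnn option: if the placeholder is built with None for both the batch and time dimensions, one graph can be fed the training batches and the single-step test batches alike (vocab_size, hidden_size and all names are assumed values, not from the tutorial code):

import tensorflow as tf

vocab_size, hidden_size = 10000, 200   # assumed config values
# None for both batch size and number of steps, so one graph serves training and testing
input_data = tf.placeholder(tf.int32, shape=[None, None], name='input_data')
embedding = tf.get_variable('embedding', [vocab_size, hidden_size])
inputs = tf.nn.embedding_lookup(embedding, input_data)

cell = tf.nn.rnn_cell.LSTMCell(hidden_size)
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)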
In general, building a few copies of the graph is not very expensive and it might not be worth spending a lot of time on optimizing the number of allocated nodes.