pruning tensorflow connections and weights (using cifar10 cnn) - tensorflow

I'm using tensorflow to run a cnn for image classification.
I use the TensorFlow cifar10 CNN implementation (tensorflow cifar10).
I want to decrease the number of connections, meaning I want to prune the low-weight connections.
How can I create a new graph (subgraph) without some of the neurons?

TensorFlow does not, as far as I have found, allow you to lock/freeze a particular kernel of a particular layer. The only way I've found to do this is to use the tf.assign() function, as shown in
How to freeze/lock weights of one Tensorflow variable (e.g., one CNN kernel of one layer)
It's fairly cave-man, but I've seen no other solution that works. Essentially, you have to .assign() the values every so often as you iterate through the data. Since this approach is so inelegant and brute-force, it's very slow. I do the .assign() every 100 batches.
Someone please post a better solution, and soon!
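As a rough sketch of that workaround (conv1_kernel, frozen_values, and the shapes are hypothetical stand-ins, not from the linked answer):

import tensorflow as tf

# Hypothetical stand-ins: in practice conv1_kernel is the layer's kernel
# variable and train_op is your usual training step.
conv1_kernel = tf.Variable(tf.truncated_normal([5, 5, 3, 64], stddev=0.1))
frozen_values = tf.constant(0.1, shape=[5, 5, 3, 64])
assign_op = tf.assign(conv1_kernel, frozen_values)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        # sess.run(train_op)  # the normal training step would go here
        # Periodically overwrite the kernel so training cannot drift it away.
        if step % 100 == 0:
            sess.run(assign_op)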

The cifar10 model you point to, and for that matter most models written in TensorFlow, do not model the weights (and hence connections) of individual neurons directly in the computation graph. For instance, for fully connected layers, all the connections between two layers, say with M neurons in the layer below and N neurons in the layer above, are modeled by one MxN weight matrix. If you wanted to completely remove a neuron and all of its outgoing connections from the layer below, you could simply slice out an (M-1)xN matrix by removing the relevant row, and multiply it with the corresponding M-1 activations of the neurons.
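A minimal sketch of that slicing idea (W, acts, and neuron_idx are illustrative names, not from the cifar10 code):

import tensorflow as tf

# Illustrative: W is an MxN weight matrix, acts a length-M activation vector.
M, N, neuron_idx = 4, 3, 2
W = tf.Variable(tf.random_normal([M, N]))
acts = tf.constant([1.0, 2.0, 3.0, 4.0])

# Remove row `neuron_idx` to delete that neuron and its outgoing connections.
pruned_W = tf.concat([W[:neuron_idx, :], W[neuron_idx + 1:, :]], axis=0)     # (M-1)xN
pruned_acts = tf.concat([acts[:neuron_idx], acts[neuron_idx + 1:]], axis=0)  # length M-1

# The layer above now sees only the remaining M-1 neurons.
layer_out = tf.matmul(tf.expand_dims(pruned_acts, 0), pruned_W)  # 1xN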

Another way is to add an additional mask to control the connections. The first step involves adding mask and threshold variables to the layers that need to undergo pruning. The mask variable has the same shape as the layer's weight tensor and determines which of the weights participate in the forward execution of the graph.
There is a pruning implementation under tensorflow/contrib/model_pruning that you can use to prune the model. Hope this helps you prune your model quickly.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/model_pruning
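The core of the mask idea, as a hedged sketch (the variable shapes and the threshold value are illustrative assumptions, not the contrib API):

import tensorflow as tf

# Illustrative: weights is the layer's trainable weight tensor.
weights = tf.Variable(tf.truncated_normal([256, 128], stddev=0.1))
inputs = tf.placeholder(tf.float32, [None, 256])

mask = tf.Variable(tf.ones_like(weights), trainable=False, name="mask")
threshold = 0.05  # example magnitude threshold, an assumption

# Only masked-in weights participate in the forward pass.
masked_weights = weights * mask
output = tf.matmul(inputs, masked_weights)

# Pruning step: zero out the mask wherever |weight| falls below the threshold.
update_mask = tf.assign(mask, tf.cast(tf.abs(weights) >= threshold, tf.float32))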

I think Google has an updated answer here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/model_pruning
Removing pruning nodes from the trained graph:
$ bazel build -c opt contrib/model_pruning:strip_pruning_vars
$ bazel-bin/contrib/model_pruning/strip_pruning_vars --checkpoint_path=/tmp/cifar10_train --output_node_names=softmax_linear/softmax_linear_2 --filename=cifar_pruned.pb
I suppose that cifar_pruned.pb will be smaller, since the pruned (i.e., zero-masked) variables are removed.

Related

What's the relationship between Tensorflow's dataflow graph and DNN?

As we know, a DNN is comprised of many layers, each consisting of many neurons that apply the same function to different parts of the input. Meanwhile, if we use TensorFlow to execute a DNN task, we get a dataflow graph generated automatically by TensorFlow, and we can use TensorBoard to visualize it as below. But there is no neuron in the layers. So I wonder: what is the relationship between the TensorFlow dataflow graph and a DNN? When a neuron of a DNN layer is mapped into the dataflow graph, how is it represented? What is the relationship between a neuron in a DNN and a node in TensorFlow (representing an operation)? I just started to learn about DNNs and TensorFlow, please help me arrange my thoughts. Thanks :)
You have to differentiate between the metaphoric representation of a DNN and its mathematical description. The math behind a classic neuron is the sum of the weighted inputs plus a bias (usually followed by an activation function applied to this result).
So in this case you have an input vector multiplied by a weight vector (containing trainable variables) and then summed up with a bias scalar (also trainable).
If you now consider a layer of neurons instead of a single one, the weights become a matrix and the bias a vector. So calculating a feed-forward layer is nothing more than a matrix multiplication followed by a vector addition.
This is the operation you can see in your TensorFlow graph.
You can actually build your neural network this way without any use of the so-called high-level APIs that provide the layer abstraction (many people did this in the early days of TensorFlow); a sketch follows below.
The actual "magic" that TensorFlow does for you is calculating and executing the derivatives of this forward pass in order to compute the updates for the weights.
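A minimal sketch of such a raw-ops feed-forward layer (the shapes are illustrative):

import tensorflow as tf

# One dense layer for a batch of inputs: y = activation(xW + b).
x = tf.placeholder(tf.float32, shape=[None, 784])              # input vectors
W = tf.Variable(tf.truncated_normal([784, 128], stddev=0.1))   # weight matrix
b = tf.Variable(tf.zeros([128]))                               # bias vector

# Matrix multiplication followed by a vector addition, then an activation.
y = tf.nn.relu(tf.matmul(x, W) + b)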

Merge weights of same model trained on 2 different computers using tensorflow

I was doing some research on training deep neural networks using TensorFlow. I know how to train a model. My problem is that I have to train the same model on 2 different computers with different datasets, then save the model weights. Later I have to merge the 2 weight files somehow. I have no idea how to merge them. Is there a function that does this, or should the weights be averaged?
Any help on this problem would be appreciated.
Thanks in advance.
There is literally no way to merge the weights; you cannot average or combine them in any way, as the result would not mean anything. What you could do instead is combine predictions, but for that the training classes have to be the same.
This is not a programming limitation but a theoretical one.
It is better to merge weight updates (gradients) during the training and keep a common set of weights rather than trying to merge the weights after individual trainings have completed. Both individually trained networks may find a different optimum and e.g. averaging the weights may give a network which performs worse on both datasets.
There are two things you can do:
Look at 'data parallel training': distributing forward and backward passes of the training process over multiple compute nodes, each of which has a subset of the entire data.
In this case, typically:
each node propagates a minibatch forward through the network
each node propagates the loss gradient backwards through the network
a 'master node' collects the gradients from the minibatches on all nodes and updates the weights correspondingly
the master node then distributes the weight updates back to the compute nodes to make sure each of them has the same set of weights
(There are variants of the above to avoid compute nodes idling too long while waiting for results from others.) The above assumes that the TensorFlow processes running on the compute nodes can communicate with each other during training.
Look at https://www.tensorflow.org/deploy/distributed for more details and an example of how to train networks over multiple nodes; a minimal sketch of the gradient-averaging step follows below.
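As a hedged sketch of that gradient-averaging step (the variable and the two losses are toy stand-ins; in practice the two losses would come from the two data shards, computed against the same shared variables):

import tensorflow as tf

# Hypothetical: w is the shared variable; loss_a and loss_b are losses
# computed on the two data shards against this same variable.
w = tf.Variable(1.0)
loss_a = tf.square(w - 2.0)  # stand-in for the loss on shard A
loss_b = tf.square(w - 4.0)  # stand-in for the loss on shard B

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
grads_a = optimizer.compute_gradients(loss_a)
grads_b = optimizer.compute_gradients(loss_b)

# Average the gradients variable by variable, then apply a single update,
# so every replica keeps one common set of weights.
averaged = [((ga + gb) / 2.0, va)
            for (ga, va), (gb, vb) in zip(grads_a, grads_b)]
train_op = optimizer.apply_gradients(averaged)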
If you really have to train the networks separately, look at ensembling; see e.g. this page: https://mlwave.com/kaggle-ensembling-guide/ . In a nutshell, you would train the individual networks on their own machines and then e.g. use the average or the maximum of the outputs of both networks as a combined classifier / predictor.
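For instance, a minimal prediction-averaging sketch (the probability arrays are illustrative; in practice they would be the two models' softmax outputs on the same test inputs):

import numpy as np

# Hypothetical: probs_a and probs_b are [num_examples, num_classes] softmax
# outputs from the two separately trained models on the same test set.
probs_a = np.array([[0.7, 0.3], [0.2, 0.8]])
probs_b = np.array([[0.6, 0.4], [0.4, 0.6]])

combined = (probs_a + probs_b) / 2.0
predictions = np.argmax(combined, axis=1)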

When predicting with an LSTM in Keras, is the hidden state still adjusted?

When I first train an LSTM in Keras on sequence data - my training data -
and then use model.predict() to make predictions with my test data as input, is the hidden state of the LSTM still being adjusted?
The basic operation of a neural network is to take an input (a vector), which is connected to the output by connections and, sometimes, other layers such as context layers. These connections are modelled as matrices and vary in strength; we call these weight matrices.
This means that the only thing we do when we feed data into the network is to put a vector in, multiply the values with the weight matrices and call the result the output. In special cases, like recurrent networks, we even keep some values stored in other vectors and combine those stored values with the current input.
During training we not only feed data into the network, we also compute an error value that we evaluate in a clever way so that it tells us how we should change the weight matrices we multiply our inputs (and possibly past inputs, for recurrent layers) with.
Therefore: yes, the basic execution behaviour stays the same for recurrent layers; the hidden state is still computed and carried along during prediction. We are just not updating the weights anymore.
There are layers that do behave differently during execution time, because they are treated as regularisers, i.e. methods that make training the network more efficient but are deemed unnecessary during execution. Examples of such layers are Noise and BatchNormalization. Almost all neural network layers (including recurrent ones) can include dropout, which is another form of regularisation that disables a random percentage of connections in the layer. This, too, is only done during training.
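An illustrative way to check this yourself (model and x_test are assumed to be a compiled Keras model and your test inputs):

# Weights before and after predict() should be identical.
weights_before = model.get_weights()
model.predict(x_test)
weights_after = model.get_weights()

assert all((before == after).all()
           for before, after in zip(weights_before, weights_after))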

Tensorflow LSTM Dropout Implementation

How specifically does tensorflow apply dropout when calling tf.nn.rnn_cell.DropoutWrapper() ?
Everything I read about applying dropout to RNNs references this paper by Zaremba et al., which says not to apply dropout to the recurrent connections. Neurons should be dropped out randomly before or after the LSTM layers, but not on the recurrent (timestep-to-timestep) connections. OK.
The question I have is: how are the neurons turned off with respect to time?
In the paper that everyone cites, it seems that a random 'dropout mask' is applied at each timestep, rather than generating one random 'dropout mask', reusing it across all the timesteps of a given layer being dropped out, and then generating a new 'dropout mask' for the next batch.
Further, and probably what matters more at the moment: how does TensorFlow do it? I've checked the TensorFlow API and tried searching around for a detailed explanation, but have yet to find one.
Is there a way to dig into the actual TensorFlow source code?
You can check the implementation here.
It uses the dropout op on the input into the RNNCell, and then on the output, with the keep probabilities you specify.
It seems like each sequence you feed in gets a new mask for the input and then one for the output, with no changes inside of the sequence.
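For reference, a minimal usage sketch of the wrapper (the keep probabilities and the inputs shape are illustrative):

import tensorflow as tf

# inputs is assumed to be a [batch, time, features] tensor.
inputs = tf.placeholder(tf.float32, [None, 20, 50])

cell = tf.nn.rnn_cell.LSTMCell(num_units=128)
# Dropout is applied to the cell's inputs and outputs,
# not to the recurrent state-to-state connections.
cell = tf.nn.rnn_cell.DropoutWrapper(cell,
                                     input_keep_prob=0.8,
                                     output_keep_prob=0.8)
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)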

Tensorflow: jointly training CNN + LSTM

There are quite a few examples on how to use LSTMs alone in TF, but I couldn't find any good examples on how to train CNN + LSTM jointly.
From what I see, it is not quite straightforward how to do such training, and I can think of a few options here:
First, I believe the simplest solution (or the most primitive one) would be to train the CNN independently to learn features and then to train the LSTM on CNN features without updating the CNN part, since one would probably have to extract and save these features in numpy and then feed them to the LSTM in TF. But in that scenario, one would probably have to use a differently labeled dataset for pretraining the CNN, which eliminates the advantage of end-to-end training, i.e. learning features for the final objective targeted by the LSTM (besides the fact that one has to have these additional labels in the first place).
A second option would be to concatenate all time slices in the batch dimension (4-D tensor), feed that to the CNN, then somehow repack those features into the 5-D tensor needed for training the LSTM, and then apply a cost function. My main concern is whether it is possible to do such a thing. Also, handling variable-length sequences becomes a little bit tricky. For example, in a prediction scenario you would only feed a single frame at a time. Thus, I would be really happy to see some examples if that is the right way of doing joint training. Besides that, this solution looks more like a hack, so if there is a better way to do it, it would be great if someone could share it.
Thank you in advance!
For joint training, you can consider using tf.map_fn as described in the documentation https://www.tensorflow.org/api_docs/python/tf/map_fn.
Let's assume that the CNN is built along similar lines as described here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10.py.
def joint_inference(sequence):
    # Run the CNN on every frame of the sequence via tf.map_fn.
    inference_fn = lambda image: inference(image)
    logit_sequence = tf.map_fn(inference_fn, sequence, dtype=tf.float32,
                               swap_memory=True)
    # Feed the per-frame features into an LSTM; note that dynamic_rnn
    # needs a dtype (or an initial state) to be specified.
    lstm_cell = tf.contrib.rnn.LSTMCell(128)
    outputs, final_state = tf.nn.dynamic_rnn(cell=lstm_cell,
                                             inputs=logit_sequence,
                                             dtype=tf.float32)
    # Project each example's per-timestep outputs to class logits.
    projection_fn = lambda state: tf.contrib.layers.linear(
        state, num_outputs=num_classes, activation_fn=tf.nn.sigmoid)
    projection_logits = tf.map_fn(projection_fn, outputs)
    return projection_logits
Warning: You might have to look into device placement, as described here: https://www.tensorflow.org/tutorials/using_gpu , if your model is larger than the memory the GPU can allocate.
An alternative would be to flatten the video batch to create an image batch, do a forward pass through the CNN, and reshape the features back for the LSTM; a rough sketch follows.
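A rough sketch of that flatten-and-reshape approach (all shapes and the cnn_features function are illustrative assumptions):

import tensorflow as tf

# Illustrative shapes for a video batch: [batch, time, height, width, channels].
batch, time, height, width, channels = 8, 16, 32, 32, 3
video = tf.placeholder(tf.float32, [batch, time, height, width, channels])

# Flatten time into the batch dimension so the CNN sees ordinary images.
frames = tf.reshape(video, [batch * time, height, width, channels])

# cnn_features is an assumed function producing [batch * time, feature_dim].
features = cnn_features(frames)

# Repack into the [batch, time, feature_dim] tensor the LSTM expects.
lstm_inputs = tf.reshape(features, [batch, time, -1])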