How to derive the shapes of tensors given the network architecture, number of outputs and number of samples - tensorflow

How can we calculate the shapes of the various tensors involved in the computation graph once we know the network architecture (number of hidden layers, number of units in each layer), the number of outputs, the number of inputs and the number of samples in the training set? (Assume the network is fully connected.)
For example, let's say there are 100 features, 10000 samples, 2 hidden layers (H0, H1) and 10 outputs. H0 has 500 units and H1 has 5000 units. Assume ReLU activation is used in H0/H1 and softmax is used for the output layer.
In this case, although the sequence of calculations that needs to happen is very clear, finding the correct shape for each constant/variable/placeholder is difficult.
I am trying to understand if there is a standard method which we can follow to do this.
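For the concrete numbers above, here is a minimal sketch of the shapes that fall out (written against the TF1-style placeholder API, since the question mentions placeholders; the variable names are made up). The general rule for a fully connected network is that each weight matrix has shape [units_in, units_out], each bias has shape [units_out], and the sample dimension is left as None:

    import tensorflow.compat.v1 as tf

    tf.disable_eager_execution()

    n_features, n_h0, n_h1, n_out = 100, 500, 5000, 10

    # The sample dimension is left as None so that any number of the
    # 10000 training samples can be fed in per step.
    X = tf.placeholder(tf.float32, shape=[None, n_features])  # [batch, 100]
    y = tf.placeholder(tf.float32, shape=[None, n_out])       # [batch, 10]

    # Each weight matrix is [units_in, units_out]; each bias is [units_out].
    W0 = tf.Variable(tf.truncated_normal([n_features, n_h0]))  # [100, 500]
    b0 = tf.Variable(tf.zeros([n_h0]))                         # [500]
    W1 = tf.Variable(tf.truncated_normal([n_h0, n_h1]))        # [500, 5000]
    b1 = tf.Variable(tf.zeros([n_h1]))                         # [5000]
    W2 = tf.Variable(tf.truncated_normal([n_h1, n_out]))       # [5000, 10]
    b2 = tf.Variable(tf.zeros([n_out]))                        # [10]

    h0 = tf.nn.relu(tf.matmul(X, W0) + b0)   # [batch, 500]
    h1 = tf.nn.relu(tf.matmul(h0, W1) + b1)  # [batch, 5000]
    logits = tf.matmul(h1, W2) + b2          # [batch, 10]
    probs = tf.nn.softmax(logits)            # [batch, 10]

With 10000 samples you could also fix the first dimension to 10000 instead of None, but leaving it unspecified lets the same graph accept mini-batches of any size.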

Related

Dynamic Unrolling of Simple Neural Nets using Keras

I am trying to replicate a neural net that computes the energy of molecules (image given below). The energy is the sum of bonded/non-bonded interactions and angle/dihedral strains. I have 4 separate neural networks that find the energy due to each of these, and the total energy is the sum of the energies due to each interaction; there may be hundreds of these. In my data set, I only know the total energy.
If my total energy is computed using multiple (an unknown number, decided by the molecule) forward passes on different neural networks, how do I get Keras to backpropagate through the dynamically constructed sum? A non-Keras TensorFlow method would work too. (I would have just summed together the outputs of the neural nets if I knew beforehand how many bonds there would be; the issue is having to unfold copies of the neural net at runtime.)
This is just an example image given in the paper:
In summary, the question is: "How do I implement dynamic unrolling and feed it to a sum in Keras?".
Keras layers can be given a shape of (None, actual-shape...) if one of the dimensions is not known. Then we can use a TensorFlow operation to sum over the variable-length axis with tf.reduce_sum(layer, axis=0) (or axis=1 if the batch dimension is counted as axis 0). So dynamic layer sizes are not hard to achieve in Keras.
However, if the input shapes pose more of a constraint, we can pass in a full-sized matrix with dummy 0 values appended, along with a mask matrix, and use tf.multiply to reject the dummy values; backpropagation will then work automatically, of course.
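A minimal sketch of the variable-length approach, assuming (hypothetically) that each bond is described by a fixed number of features; the names n_bond_features, per_bond, etc. are made up for illustration:

    import tensorflow as tf
    from tensorflow import keras

    n_bond_features = 8   # hypothetical per-bond feature size

    # The explicit None is the variable bond axis; Keras prepends the batch axis.
    bonds_in = keras.Input(shape=(None, n_bond_features))      # (batch, n_bonds, 8)

    # Dense layers act on the last axis only, so the same shared weights are
    # applied to every bond, however many there happen to be.
    h = keras.layers.Dense(32, activation="relu")(bonds_in)    # (batch, n_bonds, 32)
    per_bond = keras.layers.Dense(1)(h)                        # (batch, n_bonds, 1)

    # Sum the per-bond energies over the variable axis (axis 1 here, because
    # axis 0 is the batch axis). Gradients flow through the sum automatically.
    total_energy = keras.layers.Lambda(
        lambda t: tf.reduce_sum(t, axis=1))(per_bond)          # (batch, 1)

    model = keras.Model(bonds_in, total_energy)
    model.compile(optimizer="adam", loss="mse")

To train, each batch would need molecules padded to the same number of bonds (or a batch size of 1), which is exactly where the mask-and-tf.multiply variant described above becomes useful.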

Print the number of activations in a Tensorflow model

I am attempting to count the number of activations in a model, for example in a LeNet. How could I count the total number of activations?
There is a way to count the number of trainable parameters; however, there does not seem to be an option for counting the individual activations.
The number of activations depends on the layers of the model, for example (see the counting sketch after this list):
For a fully connected (Dense) layer, the number of activations is equal to the number of neurons.
For a convolutional layer, the number of activations is the number of filters times the spatial dimensions of the output feature maps (which depend on padding, input size, etc.).
For a recurrent layer, it depends, since LSTM/GRU cells have a more complicated structure. For a simple RNN it is just the number of neurons times the number of timesteps.
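There is no built-in counter for this, but a small helper along these lines can walk the layers of a tf.keras model and add up the sizes of their outputs (a sketch, assuming Keras 2.x-style single-output layers whose output_shape attribute is a plain tuple; the LeNet-style model is only a stand-in):

    import numpy as np
    from tensorflow import keras

    def count_activations(model):
        """Rough per-sample activation count: sum of every layer's output size,
        with the batch dimension (None) excluded."""
        total = 0
        for layer in model.layers:
            shape = layer.output_shape       # e.g. (None, 28, 28, 6) for a conv layer
            dims = [d for d in shape if d is not None]
            total += int(np.prod(dims))
        return total

    # A LeNet-style model, just to have something concrete to count.
    model = keras.Sequential([
        keras.layers.Conv2D(6, 5, activation="tanh", input_shape=(32, 32, 1)),
        keras.layers.AveragePooling2D(),
        keras.layers.Conv2D(16, 5, activation="tanh"),
        keras.layers.AveragePooling2D(),
        keras.layers.Flatten(),
        keras.layers.Dense(120, activation="tanh"),
        keras.layers.Dense(84, activation="tanh"),
        keras.layers.Dense(10, activation="softmax"),
    ])

    print(count_activations(model))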

How to prune neurons in neural network

Context:
Suppose we have a simple 3-layer feed-forward network. The hidden size of the first linear layer is 100000 -- W1[input_size, 100000], where input_size is a number much smaller than 100000. Some of the neurons won't learn anything. I want to select these neurons and shut them down using pruning.
Expected outcomes
After pruning the selected neurons, we will have a smaller network with fewer neurons in the first layer, say reduced to 500. And this smaller network turns out to have the same predictive capacity as the large one.
My implementation:
According to some criterion (some metrics applied to check weight similarities after each backpropagation update), I have cherry-picked the indices of the neurons I want to shut down, e.g., [1, 7, 8, ...].
Zero out the weights at those indices in W1: W1[:, [1, 7, 8, ...]] = 0. Then no information will be passed forward via these neurons to the next layer.
Will that be enough? Should I be manually intervening in the backpropagation as well? Zeroing out the weights only stops computations from passing forward, but for learning/updating the weights, backpropagation matters more. Since I am using PyTorch, it would be great if illustrations are provided in PyTorch, but other frameworks like TensorFlow or Keras are also fine.
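A minimal PyTorch sketch of the masking idea described above (the layer sizes and pruned indices are made up; note that nn.Linear stores its weight as [out_features, in_features], so pruning an output neuron means zeroing a row rather than a column):

    import torch
    import torch.nn as nn

    input_size, hidden_size = 64, 100000    # hypothetical sizes
    fc1 = nn.Linear(input_size, hidden_size)
    prune_idx = torch.tensor([1, 7, 8])      # neurons selected for pruning

    # Forward pass: zero out the rows (and bias entries) of the pruned neurons.
    with torch.no_grad():
        fc1.weight[prune_idx, :] = 0.0
        fc1.bias[prune_idx] = 0.0

    # Backward pass: zeroing alone is not enough, because gradient updates
    # would make the weights non-zero again. A gradient hook masks the
    # corresponding gradient entries so that (with plain SGD-style updates)
    # the pruned rows stay at zero.
    weight_mask = torch.ones_like(fc1.weight)
    weight_mask[prune_idx, :] = 0.0
    fc1.weight.register_hook(lambda grad: grad * weight_mask)

    bias_mask = torch.ones_like(fc1.bias)
    bias_mask[prune_idx] = 0.0
    fc1.bias.register_hook(lambda grad: grad * bias_mask)

PyTorch also ships torch.nn.utils.prune (for example prune.custom_from_mask), which maintains this kind of mask for you. Masking only zeroes the pruned neurons; to actually shrink the layer to 500 units you would afterwards copy the surviving rows into a new, smaller nn.Linear.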

TensorFlow's RNN units and cells

The constructor for many of the RNN classes (BasicRNNCell, LSTMCell, and so on) accepts an argument named num_units. This sets the number of units in the cell.
I thought this identified the number of elements the RNN should process in sequence. So if you want an RNN to process sequences of length N, you'd have N units per cell. Is this correct? What exactly is an RNN unit?
No, it's not correct.
num_units refers to the number of features your cell can represent. At each time step, you give it an input of a certain size (which you are calling "the number of elements the RNN should process in sequence"). This is like layer 0 of your neural network. This input is then processed into a hidden layer of size num_units, which is also the size of the cell's output.
What you call N is set by the shape of your input tensor. num_units is a hyperparameter of your model: the bigger it is, the more degrees of freedom your model has (more descriptive features).
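A small sketch (using tf.keras.layers.SimpleRNN rather than the older BasicRNNCell) showing that the sequence length and num_units are independent; the sizes here are made up:

    import tensorflow as tf

    batch_size, seq_len, n_features = 4, 25, 10   # made-up sizes
    num_units = 128                                # hyperparameter, unrelated to seq_len

    inputs = tf.random.normal([batch_size, seq_len, n_features])

    # The same cell is applied at each of the 25 time steps; num_units only
    # fixes the size of the hidden state (and hence of each per-step output).
    rnn = tf.keras.layers.SimpleRNN(num_units, return_sequences=True)
    outputs = rnn(inputs)
    print(outputs.shape)   # (4, 25, 128): one 128-dimensional output per time step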
Here, num_units refers to the number of units in the LSTM (or RNN) cell.
num_units can be interpreted as the analogue of the hidden layer in a feed-forward neural network. The number of nodes in the hidden layer of a feed-forward neural network is equivalent to the num_units LSTM units in an LSTM cell at every time step of the network. The following picture should clear up any confusion:
(cited from https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/)

What is num_units in tensorflow BasicLSTMCell?

In the MNIST LSTM examples, I don't understand what "hidden layer" means. Is it the imaginary layer formed when you represent an unrolled RNN over time?
Why is num_units = 128 in most cases?
From this brilliant article
num_units can be interpreted as the analogy of hidden layer from the feed forward neural network. The number of nodes in hidden layer of a feed forward neural network is equivalent to num_units number of LSTM units in a LSTM cell at every time step of the network.
See the image there too!
The number of hidden units is a direct representation of the learning capacity of a neural network -- it reflects the number of learned parameters. The value 128 was likely selected arbitrarily or empirically. You can change that value experimentally and rerun the program to see how it affects the training accuracy (you can get better than 90% test accuracy with a lot fewer hidden units). Using more units makes it more likely to perfectly memorize the complete training set (although it will take longer, and you run the risk of over-fitting).
The key thing to understand, which is somewhat subtle in the famous Colah's blog post (find "each line carries an entire vector"), is that X is an array of data (nowadays often called a tensor) -- it is not meant to be a scalar value. Where, for example, the tanh function is shown, it is meant to imply that the function is broadcast across the entire array (an implicit for loop) -- and not simply performed once per time-step.
As such, the hidden units represent tangible storage within the network, which is manifest primarily in the size of the weights array. And because an LSTM actually does have a bit of its own internal storage separate from the learned model parameters, it has to know how many units there are -- which ultimately needs to agree with the size of the weights. In the simplest case, an RNN has no internal storage -- so it doesn't even need to know in advance how many "hidden units" it is being applied to.
A good answer to a similar question here.
You can look at the source for BasicLSTMCell in TensorFlow to see exactly how this is used.
Side note: This notation is very common in statistics and machine-learning, and other fields that process large batches of data with a common formula (3D graphics is another example). It takes a bit of getting used to for people who expect to see their for loops written out explicitly.
The argument n_hidden of BasicLSTMCell is the number of hidden units of the LSTM.
As you said, you should really read Colah's blog post to understand LSTM, but here is a little heads up.
If you have an input x of shape [T, 10], you will feed the LSTM with the sequence of values from t=0 to t=T-1, each of size 10.
At each timestep, you multiply the input by a matrix of shape [10, n_hidden], and get an n_hidden-dimensional vector.
Your LSTM gets at each timestep t:
the previous hidden state h_{t-1}, of size n_hidden (at t=0, the previous state is [0., 0., ...])
the input, transformed to size n_hidden
it will sum these inputs and produce the next hidden state h_t of size n_hidden
From Colah's blog post:
If you just want to have working code, just stick with n_hidden = 128 and you will be fine.
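A short sketch (written with tf.keras.layers.LSTM rather than the older BasicLSTMCell) that makes these shapes visible; the factor of 4 appears because the LSTM stacks its input, forget, output and candidate blocks into one matrix:

    import tensorflow as tf

    T, input_dim, n_hidden = 20, 10, 128   # sequence length, input size, num_units

    x = tf.random.normal([1, T, input_dim])   # a batch containing one sequence
    lstm = tf.keras.layers.LSTM(n_hidden)
    h_T = lstm(x)                             # final hidden state

    kernel, recurrent_kernel, bias = lstm.get_weights()
    print(kernel.shape)            # (10, 512)  = (input_dim, 4 * n_hidden)
    print(recurrent_kernel.shape)  # (128, 512) = (n_hidden, 4 * n_hidden)
    print(h_T.shape)               # (1, 128)   = (batch, n_hidden)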
An LSTM keeps two pieces of information as it propagates through time:
A hidden state, which is the memory the LSTM accumulates using its (forget, input, and output) gates through time, and
The previous time-step output.
Tensorflow’s num_units is the size of the LSTM’s hidden state (which is also the size of the output if no projection is used).
To make the name num_units more intuitive, you can think of it as the number of hidden units in the LSTM cell, or the number of memory units in the cell.
Look at this awesome post for more clarity
Since I had some problems combining the information from the different sources, I created the graphic below, which shows a combination of the blog post (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and (https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/). I think the graphics there are very helpful, but there is an error in how number_units is explained.
Several LSTM cells form one LSTM layer. This is shown in the figure below. Since you are mostly dealing with data that is very extensive, it is not possible to feed everything into the model in one piece. Therefore, the data is divided into small pieces, called batches, which are processed one after the other until the batch containing the last part is read in. In the lower part of the figure you can see the input (dark grey), where the batches are read in one after the other, from batch 1 to batch batch_size. The cells LSTM cell 1 to LSTM cell time_step above represent the described cells of the LSTM model (http://colah.github.io/posts/2015-08-Understanding-LSTMs/). The number of cells is equal to the number of fixed time steps.

For example, if you take a text sequence with a total of 150 characters, you could divide it into 3 (batch_size) and have a sequence of length 50 per batch (the number of time_steps and thus of LSTM cells). If you then encoded each character one-hot, each element (dark grey box of the input) would represent a vector with the length of the vocabulary (number of features). These vectors would flow into the neural networks (green elements in the cells) in the respective cells and would change their dimension to the number of hidden units (number_units).

So the input has dimension (batch_size x time_step x features). The long-term memory (cell state) and the short-term memory (hidden state) have the same dimensions (batch_size x number_units). The light grey blocks that arise from the cells have a different dimension, because the transformations in the neural networks (green elements) took place with the help of the hidden units: (batch_size x time_step x number_units). The output can be returned from any cell, but mostly only the information from the last block (black border) is relevant (not in all problems), because it contains all the information from the previous time steps.
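The dimensions listed above can be checked directly; a sketch mirroring the text example (150 characters split into batch_size = 3 sequences of time_steps = 50, with a made-up one-hot vocabulary of 40, simulated here with random numbers):

    import tensorflow as tf

    batch_size, time_steps, features, number_units = 3, 50, 40, 128

    x = tf.random.normal([batch_size, time_steps, features])
    lstm = tf.keras.layers.LSTM(number_units, return_sequences=True, return_state=True)
    outputs, hidden_state, cell_state = lstm(x)

    print(outputs.shape)       # (3, 50, 128)  batch_size x time_step x number_units
    print(hidden_state.shape)  # (3, 128)      short-term memory (hidden state)
    print(cell_state.shape)    # (3, 128)      long-term memory (cell state)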
I think the term "num_hidden" is confusing for TF users. It actually has nothing to do with the unrolled LSTM cells; it is just the dimension of the tensor that each time-step input is transformed into before being fed into the LSTM cell.
This term, num_units or num_hidden_units, sometimes noted with the variable name nhid in implementations, means that the input to the LSTM cell is a vector of dimension nhid (or, for a batched implementation, a matrix of shape batch_size x nhid). As a result, the output (from the LSTM cell) is also of the same dimensionality, since the RNN/LSTM/GRU cell doesn't alter the dimensionality of its input vector or matrix.
As pointed out earlier, this term was borrowed from the feed-forward neural network (FFN) literature and has caused confusion when used in the context of RNNs. But the idea is that even RNNs can be viewed as FFNs at each time step. In this view, the hidden layer would indeed contain num_hidden units, as depicted in this figure:
Source: Understanding LSTM
More concretely, in the example below num_hidden_units (or nhid) would be 3, since the size of the hidden state (middle layer) is a 3-dimensional vector.
I think this is the correct answer to your question. LSTMs always cause confusion.
You can refer to this blog for more detail: Animated RNN, LSTM and GRU
Most LSTM/RNN diagrams just show the hidden cells but never the units of those cells. Hence, the confusion.
Each hidden layer has hidden cells, as many as the number of time steps.
And further, each hidden cell is made up of multiple hidden units, like in the diagram below. Therefore, the dimensionality of a hidden layer matrix in RNN is (number of time steps, number of hidden units).
The concept of a hidden unit is illustrated in this image: https://imgur.com/Fjx4Zuo.
Following SangLe's answer, I made a picture (see sources for the original pictures) showing cells as classically represented in tutorials (Source 1: Colah's blog) and an equivalent cell with 2 units (Source 2: Raimi Karim's post). I hope it clarifies the confusion between cells/units and what the network architecture really is.