Dense final layer vs. another RNN layer - TensorFlow

It is common to add a dense fully-connected layer as the last layer on top of a recurrent neural network (which has one or more layers) in order to learn the reduction to the final output dimensionality.
Let's say I need one output with a -1 to 1 range, in which case I would use a dense layer with a tanh activation function.
My question is: Why not add another recurrent layer instead with an internal size of 1?
It will behave differently (the single unit also propagates its state through time), but will it have any disadvantage compared to the dense layer?

If I understand correctly, the two alternatives you present do the exact same computation, so they should behave identically.
In TensorFlow, though, if you're using dynamic_rnn it is much easier when all time steps are identical, hence the usual pattern of processing the output with a dense layer rather than giving the last step a different cell.
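For concreteness, a minimal Keras sketch of the dense-head variant described above (the layer sizes and input shape are illustrative assumptions, not from the question):

import tensorflow as tf

# Recurrent stack whose last output is reduced to a single value in
# [-1, 1] by a dense layer with tanh, as discussed above.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(None, 10)),
    tf.keras.layers.LSTM(64),                     # keep only the last time step
    tf.keras.layers.Dense(1, activation="tanh"),  # learn the reduction to [-1, 1]
])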

Best way to add features after last convolutional layer, before fully-connected layers?

I am working on a regression problem related to chess. The output will depend on about 68 values that are given by Stockfish's static evaluation function (example output shown here), as well as the state of the board. However, the static eval features should not be passed through the CNN, only through the final fully-connected layers. Therefore I want to have some convolutional layers take the (one-hot encoded) board state down to a flat vector, then extend it with the other features before passing the full vector to a fully-connected layer.
How can I use TensorFlow to combine these two feature vectors (the result from the CNN and the other game-related features) within a single Layer type that can be added to a Sequential? I couldn't find anything in the docs that would handle this. Would subclassing Layer be the only way to go?
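A minimal sketch of one possible approach, using the Keras functional API with Concatenate instead of a Sequential (all layer sizes and variable names below are illustrative assumptions):

import tensorflow as tf

board_in = tf.keras.Input(shape=(8, 8, 12))   # one-hot encoded board state
eval_in = tf.keras.Input(shape=(68,))         # static-eval features, bypassing the CNN

x = tf.keras.layers.Conv2D(32, 3, activation="relu")(board_in)
x = tf.keras.layers.Conv2D(64, 3, activation="relu")(x)
x = tf.keras.layers.Flatten()(x)              # CNN output as a flat vector

merged = tf.keras.layers.Concatenate()([x, eval_in])  # extend with the extra features
out = tf.keras.layers.Dense(1)(merged)        # regression head

model = tf.keras.Model(inputs=[board_in, eval_in], outputs=out)

With the functional API no Layer subclass is needed; the two branches are merged by Concatenate and only the merged vector flows into the fully-connected layers.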

Fully Connected Layer dimensions

I have a few uncertainties regarding the fully connected layer of a convolutional neural network. Let's say the input is the output of a convolutional layer. I understand the previous layer is flattened. But can it have multiple channels? (For example, can the input to the fully connected layer be 16x16x3, i.e. 3 channels flattened into a vector of 768 elements?)
Next, I understand the equation for outputs is,
outputs = activation(inputs * weights' + bias)
Is there 1 weight per input? (for example, in the example above, would there be 768 weights?)
Next, how many biases are there? One per channel (so 3)? One no matter what? Something else?
Lastly, how do filters work in the fully connected layer? Can there be more than 1?
You might have a misunderstanding of how a fully connected neural network works. To get a better understanding, you could check some good tutorials, such as the online courses from Stanford HERE
To answer your first question: yes, whatever dimensions you have, you need to flatten it before sending to fully connected layers.
To answer your second question, you have to understand that a fully connected layer is actually a matrix multiplication followed by a vector addition:
output = input * weights + bias
where the input has dimension 1xIN, the weights have size INxOUT, and the output has size 1xOUT, so you have 1xIN * INxOUT = 1xOUT. Altogether you have IN*OUT weights (OUT weights for each input element), so in your example there would be 768*OUT weights rather than 768. You also need OUT biases, one per output unit (not one per channel), making the full equation 1xIN * INxOUT + 1xOUT (bias term).
There are no filters, since you are not doing convolution.
Note that a fully connected layer is also equivalent to a 1x1 convolution layer, and many implementations use the latter to implement fully connected layers, which can be confusing for beginners. For details, please refer to HERE
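To make the shapes concrete, here is a small NumPy sketch of the computation above using the 768-element example (the output size OUT = 10 is an illustrative choice):

import numpy as np

IN, OUT = 16 * 16 * 3, 10          # 768 flattened inputs, 10 output units

x = np.random.rand(1, IN)          # input row vector, shape 1xIN
weights = np.random.rand(IN, OUT)  # IN*OUT = 7680 weights, not 768
bias = np.random.rand(OUT)         # one bias per output unit, not per channel

output = x @ weights + bias        # (1xIN) * (INxOUT) + (1xOUT) = 1xOUT
assert output.shape == (1, OUT)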

Explanation of TensorFlow "dataflow graphs"

Google gave the following dataflow graph as an example without any explanation of the scenario itself (https://www.tensorflow.org/guide/graphs).
I cannot understand the use case of such a graph. Why do we need a logit layer on top of the ReLU layer? What is the use of the softmax (I cannot see any link between the output and the other nodes)? What are the meanings of the four parameters (two weights and two biases)? I would like to see a real case that matches this dataflow graph.
This graph shows a dense hidden layer followed by a logit layer and a softmax; it is basically a neural network with one hidden layer for classification. The four parameters are the weight matrix and bias vector of the hidden (ReLU) layer and of the logit layer, and the softmax turns the logits into class probabilities without any parameters of its own.
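For a concrete mapping from the graph to code, a sketch of such a network in Keras (the layer sizes are illustrative; the graph itself does not specify them):

import tensorflow as tf

inputs = tf.keras.Input(shape=(784,))
hidden = tf.keras.layers.Dense(128, activation="relu")(inputs)  # first weight/bias pair
logits = tf.keras.layers.Dense(10)(hidden)                      # second weight/bias pair (logit layer)
probs = tf.keras.layers.Softmax()(logits)                       # no parameters of its own

model = tf.keras.Model(inputs, probs)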

Is batchnorm used in neural networks that are not CNNs?

1.) Batchnorm is always used in deep convolutional neural networks. But is it also used outside of CNNs, i.e. in networks with just fully-connected layers?
2.) Is batchnorm used in shallow CNNs?
3.) If I have a CNN with an input image and an input array IN_array, the output is an array after the last fully-connected layer. I call this array FC_array. Now suppose I want to concatenate that FC_array with the IN_array:
CONCAT_array = tf.concat(values=[FC_array, IN_array], axis=-1)  # tf.concat requires an axis argument
Is it useful to have a batchnorm after the concat layer? Or should the batchnorm be applied just after the FC_array, before the concat layer?
For information, the IN_array is a tf.one_hot() vector.
Thank you
TL;DR: 1. Yes 2. Yes 3. No
TS;WM:
Batch normalization was a great invention by Sergey Ioffe and Christian Szegedy in early 2015. Back in those days, battling vanishing or exploding gradients was an everyday problem. Read the paper if you want to gain a deep understanding, but this quote from the abstract should give you some idea:
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.
They did in fact first use batch normalization for DCNNs, which allowed them to beat human performance in top-5 ImageNet classification, but any network with nonlinearities can benefit from batch normalization, including a network consisting only of fully-connected layers.
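As a concrete illustration of point 1, a minimal sketch of batch normalization in a purely fully-connected network (Keras; the sizes are illustrative):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, input_shape=(100,)),
    tf.keras.layers.BatchNormalization(),  # normalize the layer inputs, as in the paper
    tf.keras.layers.Activation("relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])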
Yes, it is used for shallow CNNs too. Any network with more than one layer can benefit from it, although the benefit is greater for deeper networks.
First of all, one-hot vectors should never be normalized. Normalization means you subtract the mean and divide by the standard deviation, creating a dataset with zero mean and unit variance. If you do this to a one-hot vector, the cross-entropy loss calculation will be completely off. Second, there is no point in normalizing a concat layer separately, since it does not change the values, it just concatenates them. Batch normalization is done on the input of a layer, so the layer after the concat, which receives the concatenated values, can do it if necessary.

When predicting with an LSTM in Keras, is the hidden state still adjusted?

When I first train an LSTM in Keras on sequence data - my training data -
and then use model.predict() to make predictions with my test data as input, is the hidden state of the LSTM still being adjusted?
The basic operation of a neural network is to take an input vector, which is connected to the output through connections and, sometimes, other layers such as context layers. These connections are modelled as matrices and vary in strength; we call them weight matrices.
This means that the only thing we do when feeding data into the network is to put a vector in, multiply the values by the weight matrices, and call the result the output. In special cases, like recurrent networks, we also keep some values stored in other vectors and combine this stored state with the current input.
During training we not only feed data into the network, we also compute an error value that we evaluate in a clever way so that it tells us how we should change the weight matrices we multiply our inputs (and possibly past inputs for recurrent layers) with.
Therefore: yes, the basic execution behaviour of recurrent layers does not change at prediction time, and the hidden state is still updated as each input is fed in. We are just not updating the weights anymore.
There are layers that do behave differently at execution time because they act as regularisers, i.e. methods that make training the network more effective but are deemed unnecessary during execution. Examples of such layers are Noise and BatchNormalization. Many neural network layers (including recurrent ones) also support dropout, another form of regularisation that disables a random percentage of connections in the layer. This too is only done during training.
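A small sketch of this train-vs-inference distinction in Keras (shapes and sizes are illustrative):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, dropout=0.2, input_shape=(None, 8)),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1),
])

x = tf.random.normal((4, 5, 8))      # batch of 4 sequences, 5 steps, 8 features
y_train = model(x, training=True)    # dropout active, batchnorm uses batch statistics
y_infer = model(x, training=False)   # what model.predict() does: dropout off,
                                     # batchnorm uses moving averages, weights fixed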