Multiple Activation Functions for multiple Layers (Neural Networks) - tensorflow

I have a binary classification problem for my neural network.
I already got good results using the ReLU activation function in my hidden layer and the sigmoid function in the output layer.
Now I'm trying to get even better results.
I added a second hidden layer with the ReLU activation function, and the results got even better.
I tried to use the leaky ReLU function for the second hidden layer instead of the ReLU function and got even better results, but I'm not sure if this is even allowed.
So I have something like this:
Hidden layer 1: ReLU activation function
Hidden layer 2: leaky ReLU activation function
Output layer: sigmoid activation function
I can't find many resources on it, and those I found always use the same activation function on all hidden layers.

If you mean the leaky ReLU, I can say that, in fact, the Parametric ReLU (PReLU) is the activation function that generalizes both the traditional rectified unit and the leaky ReLU. And yes, PReLU improves model fitting with no significant extra computational cost and little overfitting risk.
For more details, you can check out the paper Delving Deep into Rectifiers.
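For what it is worth, here is a minimal tf.keras sketch of this kind of setup (the layer sizes and input dimension are made up). LeakyReLU is available as a layer object, so it is added right after a Dense layer that has no built-in activation; PReLU could be dropped in the same way:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LeakyReLU

model = Sequential([
    Dense(64, activation="relu", input_shape=(20,)),  # hidden layer 1: ReLU
    Dense(32),                                        # hidden layer 2: linear pre-activation...
    LeakyReLU(),                                      # ...followed by leaky ReLU as its own layer
    # swap LeakyReLU() for tf.keras.layers.PReLU() to learn the negative slope instead
    Dense(1, activation="sigmoid"),                   # output layer: sigmoid for binary classification
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])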

Related

Should ReLU be used in LSTM hidden layers if targets contain negative values?

I'm aware that ReLU in an output layer will only produce non-negative values; should ReLU be used in the hidden layers, however, if the targets contain negative and positive values? (A linear regression model for time series.)
Simple LSTM Example:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(64, activation="relu", return_sequences=True))  # or without ReLU? (return_sequences=True so the next LSTM receives a sequence)
model.add(Dropout(0.2))
model.add(LSTM(32, activation="relu"))  # or without ReLU?
model.add(Dropout(0.2))
model.add(Dense(1))  # linear output, so negative predictions are possible
Additional info: the targets are daily percentage changes, so the distribution is mostly centered around 0, with range -10 < target < 10.
Yes, using ReLU is not an error here, but that does not mean another activation function would not give better results; if in doubt, I would still try other functions, such as leaky ReLU.
The reason ReLU is not wrong here is that by the time the activation is applied, the input has already been transformed by the layer's weights, so there is no loss of information: the negative input values have already been remapped by the network.
The only thing to keep in mind is that if you use ReLU as the activation of the output layer itself, you cannot generate negative predictions, for obvious reasons.
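To make that concrete, here is a tiny illustration (plain NumPy, made-up weights) showing that ReLU in a hidden layer does not prevent negative predictions as long as the output is linear: the output layer's weights can be negative even though the hidden activations are not.

import numpy as np

x = np.array([0.5, -1.2, 3.0])              # input features, some negative
W1, b1 = np.array([[1.0, -0.5, 0.2]]), 0.1  # hypothetical hidden-layer weights
hidden = np.maximum(0.0, W1 @ x + b1)       # ReLU: hidden activations are >= 0
W2, b2 = np.array([-2.0]), 0.0              # hypothetical (negative) output weight
y_hat = W2 @ hidden + b2                    # linear output: prediction is negative here
print(y_hat)                                # -> about -3.6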

How to decide on activation function?

Currently there are a lot of activation functions, like sigmoid, tanh, and ReLU (ReLU being the preferred choice), but I have a question about what needs to be considered when selecting a certain activation function.
For example: when we want to upsample a network in GANs, we prefer using LeakyReLU.
I am a newbie in this subject, and have not found a concrete solution as to which activation function to use in different situations.
My knowledge up until now:
Sigmoid : When you have a binary class to identify
Tanh : ?
ReLU : ?
LeakyReLU : When you want to upsample
Any help or article will be appreciated.
This is an open research question. The choice of activation is also very intertwined with the architecture of the model and the computation/resources available, so it's not something that can be answered in isolation. The paper Efficient Backprop by Yann LeCun et al. has a lot of good insights into what makes a good activation function.
That being said, here are some toy examples that may help get intuition for activation functions. Consider a simple MLP with one hidden layer and a simple classification task:
In the last layer we can use sigmoid in combination with the binary_crossentropy loss in order to use intuition from logistic regression - because we're just doing simple logistic regression on the learned features that the hidden layer gives to the last layer.
What types of features are learned depends on the activation function used in that hidden layer and the number of neurons in that hidden layer.
Here is what ReLU learns when using two hidden neurons:
https://miro.medium.com/max/2000/1*5nK725uTBUeoIA0XjEyA_A.gif
(on the left is what the decision boundary looks like in the feature space)
As you add more neurons you get more pieces with which to approximate the decision boundary; the same animation with 3 and then 10 hidden neurons shows the boundary becoming progressively more flexible.
Sigmoid and tanh produce similar decision boundaries (this is tanh: https://miro.medium.com/max/2000/1*jynT0RkGsZFqt3WSFcez4w.gif - sigmoid is similar), which are more continuous and sinusoidal.
The main difference is that sigmoid is not zero-centered, which makes it a poor choice for a hidden layer, especially in deep networks.
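To experiment with this yourself, here is a minimal sketch of the toy setup described above (assuming tf.keras and 2-D inputs; the sizes are made up). Swapping hidden_activation between "relu", "tanh", and "sigmoid", and changing hidden_units, is what produces the different decision boundaries shown in the animations:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def make_mlp(hidden_units=3, hidden_activation="relu"):
    # One hidden layer whose activation and width determine the learned features;
    # the sigmoid output with binary_crossentropy is logistic regression on those features.
    model = Sequential([
        Dense(hidden_units, activation=hidden_activation, input_shape=(2,)),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = make_mlp(hidden_units=10, hidden_activation="tanh")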

Is an output layer with 2 units and softmax ideal for binary classification using LSTM?

I am using an LSTM for binary classification and initially tried a model with 1 unit in the output (Dense) layer with sigmoid as the activation function.
However, it didn't perform well, and I saw a few notebooks where they used 2 units in the output layer (the layer immediately after the LSTM) with softmax as the activation function. Is there any advantage to using 2 output units with softmax instead of a single unit with sigmoid (for the purpose of binary classification)? I am using binary_crossentropy as the loss function.
Softmax may perform better than sigmoid here because a saturated sigmoid has a derivative close to zero (the vanishing gradient problem), which can make training harder. That might be the reason softmax performed better than sigmoid in those notebooks.
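For reference, here is a sketch of the two output heads being compared (the LSTM body and input shape are hypothetical placeholders). Note that the labels must match the head: 0/1 labels with the single sigmoid unit and binary_crossentropy, one-hot labels with the two softmax units and categorical_crossentropy.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Option A: one sigmoid unit, labels of shape (n, 1) with values 0/1
model_a = Sequential([LSTM(32, input_shape=(10, 1)), Dense(1, activation="sigmoid")])
model_a.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Option B: two softmax units, one-hot labels of shape (n, 2)
model_b = Sequential([LSTM(32, input_shape=(10, 1)), Dense(2, activation="softmax")])
model_b.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])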

When using Keras categorical_crossentropy loss, should you use softmax on the last layer?

Most examples I've seen implement softmax on the last layer. But I read that Keras categorical_crossentropy automatically applies softmax after the last layer so doing it is redundant and leads to reduced performance. Who is right?
By default, Keras categorical_crossentropy does not apply softmax to the output (see the categorical_crossentropy implementation and the TensorFlow backend call). However, if you use the backend function directly, there is the option of setting from_logits=True.
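As a concrete sketch (hypothetical 10-class model and input size), the two consistent setups in tf.keras look like this: either put softmax on the last layer and use the default loss, or leave the last layer as raw logits and tell the loss about it, for example via the CategoricalCrossentropy loss object, which also exposes from_logits.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Setup 1: softmax on the last layer; the default loss expects probabilities
m1 = Sequential([Dense(64, activation="relu", input_shape=(20,)),
                 Dense(10, activation="softmax")])
m1.compile(optimizer="adam", loss="categorical_crossentropy")

# Setup 2: no softmax on the last layer; the loss is told it receives logits
m2 = Sequential([Dense(64, activation="relu", input_shape=(20,)),
                 Dense(10)])
m2.compile(optimizer="adam",
           loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True))

The combination to avoid is mixing the two, e.g. softmax on the last layer together with from_logits=True.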

Keras binary_crossentropy vs categorical_crossentropy for multi class single label classification

I've been using binary cross-entropy but recently found out I may be better off using categorical cross-entropy.
For the problem I'm solving the following is true:
There are 10 possible classes.
A given input only maps to 1 label.
I'm getting much higher accuracies with binary cross-entropy. Should I switch to categorical cross-entropy?
At the moment I'm using standard accuracy (metrics=['accuracy']) and a sigmoid activation layer for the last layer. Can I keep these the same?
If I understand correctly, you have a multiclass problem and your classes are mutually exclusive. You should use categorical_crossentropy and change your output activation function to softmax.
binary_crossentropy, as the name suggests, must be used as a loss function only for 2-class problems.
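As a sketch under those assumptions (10 mutually exclusive classes; the input and hidden sizes below are made up), the output layer and loss would look like this. categorical_crossentropy expects one-hot labels, while sparse_categorical_crossentropy accepts integer labels directly.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Hypothetical 10-class, single-label classifier
model = Sequential([Dense(64, activation="relu", input_shape=(100,)),
                    Dense(10, activation="softmax")])  # one output unit per class
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# categorical_crossentropy expects one-hot labels, e.g.
# y_onehot = tf.keras.utils.to_categorical(y_int, num_classes=10);
# with integer labels you can instead use loss="sparse_categorical_crossentropy".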