I'm trying to build a regression model in Keras. The input will be a normalized greyscale image, and the output of the model will also be a normalized greyscale image.
I'm aware that a sigmoid at the output is not ideal for regression models, since it gives a value between [0, 1], which is good for classification problems.
But since my data is already normalized to [0, 1], would a sigmoid output be a good idea?
Thank you
Generally, the linear activation function is used for regression problems.
The sigmoid activation function is not used for normalizing data. It is mostly used for binary classification problems, where one of two values (0 or 1) needs to be predicted. Values below 0.5 are taken as class 0 and values above 0.5 as class 1.
If you have a multi-class classification problem, you may need to use the softmax activation function to get the probability of each class, and then use argmax() to pick the index with the highest probability as the predicted class.
You can follow this link for more details.
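For illustration, here is a minimal Keras sketch (the architecture and shapes are made up): a linear output is the usual choice for regression, while a sigmoid output can be swapped in when the targets are guaranteed to lie in [0, 1].

import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=(64, 64, 1))                       # normalized greyscale image
x = keras.layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
# Linear output: standard for regression (unbounded range).
# Swap activation=None for activation="sigmoid" if the targets are known to lie in [0, 1].
outputs = keras.layers.Conv2D(1, 3, padding="same", activation=None)(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")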
I am trying to build a 2-stage VQ-VAE-2 + PixelCNN as shown in the paper:
"Generating Diverse High-Fidelity Images with VQ-VAE-2" (https://arxiv.org/pdf/1906.00446.pdf).
I have 3 implementation questions:
The paper mentions:
We allow each level in the hierarchy to separately depend on pixels.
I understand the second latent space in the VQ-VAE-2 must be conditioned on a concatenation of the 1st latent space and a downsampled version of the image. Is that correct?
The paper "Conditional Image Generation with PixelCNN Decoders" (https://papers.nips.cc/paper/6527-conditional-image-generation-with-pixelcnn-decoders.pdf) says:
h is a one-hot encoding that specifies a class; this is equivalent to adding a class-dependent bias at every layer.
As I understand it, the condition is entered as a 1D tensor that is injected into the bias through a convolution. Now for a 2-stage conditional PixelCNN, one needs to condition on the class vector but also on the latent code of the previous stage. A possibility I see is to concatenate them and feed a 3D tensor. Does anyone see another way to do this?
The loss and optimization are unchanged with 2 stages. One simply adds the loss of each stage into a final loss that is optimized. Is that correct?
After discussing with one of the authors of the paper, I received answers to all these questions, which I share below.
Question 1
This is correct, but the downsampling of the image is implemented with strided convolutions rather than a non-parametric resize. This can be absorbed as part of the encoder architecture, in something like this (the number after each variable indicates its spatial dimension, so for example h64 is [B, 64, 64, D], and so on).
h128 = Relu(Conv2D(image256, stride=(2, 2)))
h64 = Relu(Conv2D(h128, stride=(2, 2)))
h64 = ResNet(h64)
Now for obtaining h32 and q32 we can do:
h32 = Relu(Conv2D(h64, stride=(2, 2)))
h32 = ResNet(h32)
q32 = Quantize(h32)
This way, the gradients flow all the way back to the image and hence we have a dependency between h32 and image256.
Throughout, you can use 1x1 convolutions to adjust the size of the last dimension (the feature channels), strided convolutions for down-sampling, and strided transposed convolutions for up-sampling the spatial dimensions.
So for this example of quantizing the bottom layer, you first need to upsample q32 spatially to 64x64, combine it with h64, and feed the result to the quantizer. For additional expressive power we inserted a residual stack in between as well. It looks like this:
hq32 = ResNet(Conv2D(q32, (1, 1)))
hq64 = Conv2DTranspose(hq32, stride=(2, 2))
h64 = Conv2D(concat([h64, hq64]), (1, 1))
q64 = Quantize(h64)
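For readers who want something runnable, here is a hedged Keras sketch of the same wiring. The filter counts, kernel sizes, residual stack and the placeholder quantize function are assumptions for illustration, not the authors' exact implementation.

import tensorflow as tf
from tensorflow import keras
L = keras.layers

def res_stack(h, filters, n_blocks=2):
    # Minimal residual stack placeholder (not the paper's exact architecture)
    for _ in range(n_blocks):
        skip = h
        h = L.Conv2D(filters, 3, padding="same", activation="relu")(h)
        h = L.Conv2D(filters, 1, padding="same")(h)
        h = L.Add()([skip, h])
    return L.ReLU()(h)

def quantize(h):
    # Placeholder for the vector-quantization bottleneck; identity here.
    # A real VQ layer (codebook lookup + straight-through estimator) would go in its place.
    return h

D = 128                                     # feature dimension (assumption)
image256 = keras.Input(shape=(256, 256, 3))

h128 = L.Conv2D(D, 4, strides=2, padding="same", activation="relu")(image256)
h64  = L.Conv2D(D, 4, strides=2, padding="same", activation="relu")(h128)
h64  = res_stack(h64, D)

# Top level
h32 = L.Conv2D(D, 4, strides=2, padding="same", activation="relu")(h64)
h32 = res_stack(h32, D)
q32 = quantize(h32)

# Bottom level: upsample q32, merge with h64, then quantize
hq32 = res_stack(L.Conv2D(D, 1)(q32), D)
hq64 = L.Conv2DTranspose(D, 4, strides=2, padding="same")(hq32)
h64c = L.Conv2D(D, 1)(L.Concatenate()([h64, hq64]))
q64  = quantize(h64c)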
Question 2
The original PixelCNN paper also describes how to do spatial conditioning using convolutions. Flattening the latent map and appending it to the class embedding as global conditioning is not a good idea. What you want to do is apply a transposed convolution to align the spatial dimensions, then a 1x1 convolution to match the feature dimension with the hidden representation of the PixelCNN, and then add it.
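A hedged sketch of that conditioning path; the shapes and filter counts are assumptions for illustration only.

from tensorflow import keras
L = keras.layers

pixelcnn_hidden = keras.Input(shape=(64, 64, 128))    # hidden representation inside the PixelCNN
prev_stage_latent = keras.Input(shape=(32, 32, 64))   # latent code from the previous stage

cond = L.Conv2DTranspose(64, 4, strides=2, padding="same")(prev_stage_latent)  # 32x32 -> 64x64
cond = L.Conv2D(128, 1)(cond)                          # match the hidden feature dimension
conditioned = L.Add()([pixelcnn_hidden, cond])         # acts as a location-dependent bias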
Question 3
It's a good idea to train them separately. Besides isolating the losses etc. and being able to tune appropriate learning rates for each stage, you will also be able to use the full memory capacity of your GPU/TPU for each stage. These priors do better and better with larger scale, so it's a good idea not to deny them that.
I am using a 1x1 convolution in a deep network to reduce a feature map x of shape Bx2CxHxW to BxCxHxW. I have three options:
1. x -> Conv (1x1) -> BatchNorm -> ReLU. The code will be output = ReLU(BN(Conv(x))). Reference: ResNet.
2. x -> BN -> ReLU -> Conv. The code will be output = Conv(ReLU(BN(x))). Reference: DenseNet.
3. x -> Conv. The code is output = Conv(x).
Which one is most commonly used for feature reduction? Why?
Since you are going to train your net end-to-end, whatever configuration you use, the weights will be trained to accommodate it.
BatchNorm?
I guess the first question you need to ask yourself is: do you want to use BatchNorm? If your net is deep and you are concerned with covariate shift, then you probably should have a BatchNorm, which rules out option no. 3.
BatchNorm first?
If your x is the output of another conv layer, then there's actually no difference between your first and second alternatives: your net is a cascade of ...-conv-BN-ReLU-conv-BN-ReLU-conv-..., so the grouping into (conv, BN, ReLU) triplets is only an "artificial" partitioning of the net, and up to the very first and last functions you can split things however you wish. Moreover, since batch norm is a linear operation at inference time (a scale plus a bias), it can be "folded" into an adjacent conv layer without changing the net, so you are basically left with conv-ReLU pairs.
So, there's not really a big difference between the first two options you highlighted.
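As an aside on the folding point above, here is a minimal NumPy sketch (assuming conv weights of shape [kh, kw, c_in, c_out] and per-output-channel BN statistics) of absorbing BN's scale and shift into the preceding convolution at inference time.

import numpy as np

def fold_bn_into_conv(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN(o) = gamma * (o - mean) / sqrt(var + eps) + beta, applied to o = conv(x) + b,
    # so the scale and shift can be absorbed into the conv weights and bias.
    scale = gamma / np.sqrt(var + eps)       # per-output-channel scale
    w_folded = w * scale                     # broadcasts over the last (c_out) axis
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded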
What else to consider?
Do you really need a ReLU when changing the dimension of the features? You can think of the dimensionality reduction as a linear mapping: decomposing the weight matrix applied to x into a lower-rank matrix that ultimately maps into a C-dimensional space instead of a 2C-dimensional one. If you consider it a linear mapping, you might omit the ReLU altogether.
See the Fast R-CNN SVD trick for an example.
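Here is a small hedged NumPy sketch of that low-rank idea (the sizes and target rank are arbitrary): a purely linear 2C -> C reduction can be factorized via truncated SVD, with no ReLU in between.

import numpy as np

C = 64
W = np.random.randn(2 * C, C)      # weights of a 1x1 conv, viewed as a 2C x C matrix

U, S, Vt = np.linalg.svd(W, full_matrices=False)
k = 32                             # target rank (assumption)
W1 = U[:, :k] * S[:k]              # first linear map: 2C -> k
W2 = Vt[:k, :]                     # second linear map: k -> C

approx = W1 @ W2                   # low-rank approximation of W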
Is it possible to fit or approximate multidimensional functions with neural networks?
Let's say I want to model the function f(x,y) = sin(x) + y from some given measurement data (f(x,y) is considered the ground truth and is not known). Also, if possible, some code examples written in TensorFlow or Keras would be great.
As said by @AndreHolzner, theoretically you can approximate any continuous function with a neural network as well as you want, on any compact subset of R^n, even with only one hidden layer.
However, in practice, the neural net may have to be very large for some functions, and can sometimes be untrainable (the optimal weights may be hard to find without getting stuck in a local minimum). So here are a few practical suggestions (unfortunately vague, because the details depend too much on your data and are hard to predict without multiple tries):
Keep the network not too big ("too big" is hard to define, unfortunately), otherwise you'll just overfit. You'll probably need a LOT of training samples.
A big number of reasonably-sized layers is usually better than a reasonable number of big layers.
If you have some priors about the function, use them: for instance, if you believe there is some kind of periodicity in f (like in your example, but it could be more complicated), you could apply the sin() function to some of the outputs of the first layer (not all of them, as that would give you a truly periodic output). If you suspect a polynomial of degree n, just augment your input x with x², ..., x^n and use a linear regression on that input, etc. It will be much easier than learning the weights. (A small sketch of this kind of input augmentation follows the code example below.)
The universal approximation theorem holds on any compact subset of R^n, not on the entire multidimensional space. In particular, you'll never be able to predict the value for an input that lies far outside the range of the training samples (say you trained on numbers from 0 to 100; don't test on 200, it will fail).
For an example of regression you can look here for instance. To regress a more complicated function, you'd need to use a more complicated function to get pred from x, for instance like this:
import tensorflow as tf

n_dimensions = 2   # e.g. (x, y) for f(x, y) = sin(x) + y
n_layers = 3

# Placeholder for the inputs; None lets the batch size vary
x = tf.placeholder(shape=[None, n_dimensions], dtype=tf.float32)
last_layer = x

# Add n_layers dense hidden layers
for i in range(n_layers):
    last_layer = tf.layers.dense(inputs=last_layer, units=128, activation=tf.nn.relu)

# Get the output prediction
pred = tf.layers.dense(inputs=last_layer, units=1, activation=None)

# Get the cost, training op, etc., just like in the linear regression example
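As mentioned in the list of suggestions above, here is a minimal hedged sketch (a hypothetical helper, assuming inputs of shape [batch, n_dimensions]) of augmenting the raw inputs with prior-based features before feeding them to the network or to a linear regression:

import numpy as np

def augment_with_priors(x, poly_degree=3):
    # x: array of shape [batch, n_dimensions]
    feats = [x, np.sin(x)]                                  # periodicity prior
    feats += [x ** d for d in range(2, poly_degree + 1)]    # polynomial prior
    return np.concatenate(feats, axis=1)

# Example: 4 samples of a 2-dimensional input become 4 x 8 augmented features
x = np.random.randn(4, 2)
print(augment_with_priors(x).shape)   # (4, 8)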
In the following TensorFlow function, we must feed in the activation of the artificial neurons in the final layer. That I understand. But I don't understand why it is called logits. Isn't logit a mathematical function?
loss_function = tf.nn.softmax_cross_entropy_with_logits(
    logits=last_layer,
    labels=target_output
)
Logits is an overloaded term which can mean many different things:
In math, logit is a function that maps probabilities ([0, 1]) to the real line ((-inf, inf)).
A probability of 0.5 corresponds to a logit of 0. Negative logits correspond to probabilities less than 0.5, positive logits to probabilities greater than 0.5.
In ML, it can be
the vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
Logits also sometimes refer to the element-wise inverse of the sigmoid function.
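A small numerical illustration of both senses, with made-up values: the statistical logit as log-odds, and the sigmoid as its element-wise inverse.

import numpy as np

def logit(p):
    return np.log(p / (1 - p))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(logit(0.5))            # 0.0
print(logit(0.8))            # positive, since 0.8 > 0.5
print(sigmoid(logit(0.3)))   # ~0.3: sigmoid inverts the logit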
Just adding this clarification so that anyone who scrolls down this far can at least get it right, since there are so many wrong answers upvoted.
Diansheng's answer and JakeJ's answer get it right.
A new answer posted by Shital Shah is an even better and more complete answer.
Yes, logit is a mathematical function in statistics, but the logit used in the context of neural networks is different. The statistical logit doesn't even make sense here.
I couldn't find a formal definition anywhere, but logit basically means:
The raw predictions which come out of the last layer of the neural network.
1. This is the very tensor on which you apply the argmax function to get the predicted class.
2. This is the very tensor which you feed into the softmax function to get the probabilities for the predicted classes.
Also, from a tutorial on the official TensorFlow website:
Logits Layer
The final layer in our neural network is the logits layer, which will return the raw values for our predictions. We create a dense layer with 10 neurons (one for each target class 0–9), with linear activation (the default):
logits = tf.layers.dense(inputs=dropout, units=10)
If you are still confused, the situation is like this:
raw_predictions = neural_net(input_layer)
predicted_class_index_by_raw = argmax(raw_predictions)
probabilities = softmax(raw_predictions)
predicted_class_index_by_prob = argmax(probabilities)
where predicted_class_index_by_raw and predicted_class_index_by_prob will be equal.
Another name for raw_predictions in the above code is logit.
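A quick numerical sanity check of that equality, with made-up logits:

import numpy as np

raw_predictions = np.array([2.0, -1.0, 0.5])                     # "logits"
probabilities = np.exp(raw_predictions) / np.sum(np.exp(raw_predictions))

print(np.argmax(raw_predictions) == np.argmax(probabilities))    # True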
As for the why logit... I have no idea. Sorry.
[Edit: See this answer for the historical motivations behind the term.]
Trivia
Although, if you want to, you can apply the statistical logit to the probabilities that come out of the softmax function.
If the probability of a certain class is p,
Then the log-odds of that class is L = logit(p).
Also, the probability of that class can be recovered as p = sigmoid(L), using the sigmoid function.
It's not very useful to calculate the log-odds, though.
Summary
In the context of deep learning, the logits layer means the layer that feeds into softmax (or another such normalization). The output of the softmax gives the probabilities for the classification task, and its input is the logits layer. The logits layer typically produces values from -infinity to +infinity, and the softmax layer transforms them to values from 0 to 1.
Historical Context
Where does this term come from? In the 1930s and 40s, several people were trying to adapt linear regression to the problem of predicting probabilities. However, linear regression produces output from -infinity to +infinity, while for probabilities our desired output is 0 to 1. One way to do this is to somehow map the probabilities in 0 to 1 onto -infinity to +infinity and then use linear regression as usual. One such mapping is the cumulative normal distribution, which was used by Chester Ittner Bliss in 1934; he called this the "probit" model, short for "probability unit". However, this function is computationally expensive and lacks some of the desirable properties for multi-class classification. In 1944, Joseph Berkson used the function log(p/(1-p)) to do this mapping and called it logit, short for "logistic unit". The term logistic regression derives from this as well.
The Confusion
Unfortunately, the term logits is abused in deep learning. From a purely mathematical perspective, logit is a function that performs the above mapping. In deep learning, people started calling the layer that feeds into the logit function the "logits layer". Then people started calling the output values of this layer "logits", creating confusion with logit the function.
TensorFlow Code
Unfortunately, TensorFlow code further adds to the confusion with names like tf.nn.softmax_cross_entropy_with_logits. What does logits mean here? It just means that the input of the function is supposed to be the output of the last neuron layer as described above. The _with_logits suffix is redundant, confusing and pointless. Functions should be named without regard to such very specific contexts, because they are simply mathematical operations that can be performed on values derived from many other domains. In fact, TensorFlow has another similar function, sparse_softmax_cross_entropy, where they fortunately forgot to add the _with_logits suffix, creating inconsistency and adding to the confusion. PyTorch, on the other hand, simply names its function without these kinds of suffixes.
Reference
The Logit/Probit lecture slides are one of the best resources for understanding logit. I have also updated the Wikipedia article with some of the above information.
Logit is a function that maps probabilities [0, 1] to [-inf, +inf].
Softmax is a function that maps [-inf, +inf] to [0, 1], similarly to the sigmoid. But softmax also normalizes the sum of the values (the output vector) to 1.
TensorFlow "with logits": it means that you are applying a softmax function to logit numbers to normalize them. The input vector (the logits) is not normalized and can range over [-inf, +inf].
This normalization is used for multi-class classification problems. For multi-label classification problems, sigmoid normalization is used instead, i.e. tf.nn.sigmoid_cross_entropy_with_logits.
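For illustration, a small hedged sketch contrasting the two losses; the logits and labels are made up.

import tensorflow as tf

logits = tf.constant([[2.0, -1.0, 0.5]])

# Multi-class: exactly one class is correct, so softmax + one-hot label
onehot_label = tf.constant([[1.0, 0.0, 0.0]])
multiclass_loss = tf.nn.softmax_cross_entropy_with_logits(labels=onehot_label, logits=logits)

# Multi-label: each class is an independent yes/no, so element-wise sigmoid
multilabel = tf.constant([[1.0, 0.0, 1.0]])
multilabel_loss = tf.nn.sigmoid_cross_entropy_with_logits(labels=multilabel, logits=logits)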
My personal understanding: in the TensorFlow domain, logits are the values to be used as input to softmax. I came to this understanding based on this TensorFlow tutorial.
https://www.tensorflow.org/tutorials/layers
Although it is true that logit is a function in maths (especially in statistics), I don't think that's the same 'logit' you are looking at. In the book Deep Learning by Ian Goodfellow, he mentions:
The function σ⁻¹(x) is called the logit in statistics, but this term is more rarely used in machine learning. σ⁻¹(x) stands for the inverse function of the logistic sigmoid function.
In TensorFlow, it is frequently seen as the name of the last layer. In Chapter 10 of the book Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron, I came across this paragraph, which states the logits layer clearly:
note that logits is the output of the neural network before going through the softmax activation function: for optimization reasons, we will handle the softmax computation later.
That is to say, although we use softmax as the activation function of the last layer in our design, for ease of computation we take the logits out separately. This is because it is more efficient (and numerically more stable) to calculate the softmax and the cross-entropy loss together. Remember that cross-entropy is a cost function, not used in forward propagation.
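A hedged sketch of that pattern in Keras (the layer sizes are arbitrary): the last layer emits raw logits, and the loss applies the softmax internally via from_logits=True.

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dense(10)                       # logits layer: no activation here
])
model.compile(
    optimizer="adam",
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)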
If you check the mathematical logit function, it converts values from the [0, 1] interval to the whole real line [-inf, +inf].
Sigmoid and softmax do exactly the opposite: they convert values from [-inf, +inf] back to [0, 1].
This is why, in machine learning, we may have logits before the sigmoid or softmax function (since the spaces match).
And this is why "we may call" anything in machine learning that goes in front of the sigmoid or softmax function a logit.
Here is a G. Hinton video using this term.
Here is a concise answer for future readers. TensorFlow's logit is defined as the output of a neuron without applying the activation function:
logit = w*x + b,
where x is the input, w the weight, and b the bias. That's it.
The following is not essential to the question.
For the historical background, read the other answers. Hats off to TensorFlow's "creatively" confusing naming convention. In PyTorch, there is only one CrossEntropyLoss, and it accepts un-activated outputs. Convolutions, matrix multiplications and activations are operations on the same level. The design is much more modular and less confusing. This is one of the reasons why I switched from TensorFlow to PyTorch.
logits
The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.
In addition, logits sometimes refer to the element-wise inverse of the sigmoid function. For more information, see tf.nn.sigmoid_cross_entropy_with_logits.
(from the official TensorFlow documentation)
They are basically the most complete learned representation you can get from the network, before it's been squashed down to apply to only the number of classes we are interested in. Check out how some researchers use them to train a shallow neural net based on what a deep network has learned: https://arxiv.org/pdf/1312.6184.pdf
It's kind of like how, when learning a subject in detail, you learn a great many minor points, but when teaching a student, you try to compress it to the simplest case. If the student then tried to teach, it would be quite difficult, but they would be able to describe it just well enough to use the language.
The logit (/ˈloʊdʒɪt/ LOH-jit) function is the inverse of the sigmoidal "logistic" function or logistic transform used in mathematics, especially in statistics. When the function's variable represents a probability p, the logit function gives the log-odds, or the logarithm of the odds p/(1 − p).
See here: https://en.wikipedia.org/wiki/Logit