Normalization usually means normalizing only the input layer,
while batch normalization is applied at each layer.
We do not learn any parameters in plain normalization.
But why do we need to learn parameters in batch normalization?
This has been answered in detail in https://stats.stackexchange.com/a/310761
Deep Learning Book, Section 8.7.1:
Normalizing the mean and standard deviation of a unit can reduce the expressive power of the neural network containing that unit. To
maintain the expressive power of the network, it is common to replace
the batch of hidden unit activations H with γH+β rather than simply
the normalized H. The variables γ and β are learned parameters that
allow the new variable to have any mean and standard deviation. At
first glance, this may seem useless — why did we set the mean to 0,
and then introduce a parameter that allows it to be set back to any
arbitrary value β?
The answer is that the new parametrization can represent the same family of functions of the input as the old parametrization, but the
new parametrization has different learning dynamics. In the old
parametrization, the mean of H was determined by a complicated
interaction between the parameters in the layers below H. In the new
parametrization, the mean of γH+β is determined solely by β. The new
parametrization is much easier to learn with gradient descent.
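The passage above can be illustrated with a minimal NumPy sketch. It assumes per-unit statistics taken over the batch axis; the function name `batch_norm`, the `eps` constant, and the example values of γ and β are choices made here for illustration, not taken from the book.

```python
import numpy as np

def batch_norm(H, gamma, beta, eps=1e-5):
    """Normalize each hidden unit over the batch axis, then rescale and shift."""
    mean = H.mean(axis=0)
    var = H.var(axis=0)
    H_hat = (H - mean) / np.sqrt(var + eps)  # zero mean, roughly unit std per unit
    return gamma * H_hat + beta              # gamma and beta would be learned in training

rng = np.random.default_rng(0)
# Hypothetical activations: batch of 64 examples, 4 hidden units, arbitrary mean and scale.
H = rng.normal(loc=3.0, scale=5.0, size=(64, 4))
gamma = np.full(4, 2.0)
beta = np.full(4, -1.0)

out = batch_norm(H, gamma, beta)
out_mean = out.mean(axis=0)  # matches beta, whatever the statistics of H were
out_std = out.std(axis=0)    # approximately gamma
```

Whatever the layers below do to the raw statistics of H, the output mean here is set by β alone, which is the change in learning dynamics the quote describes.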
In deep learning model training, inputs are generally passed in batches. For example, when training a model on a 512-dimensional input feature vector with a batch size of 4, we pass a [4, 512]-dimensional input. I am curious about the logical significance of instead passing the same input flattened across the batch and channel dimensions, i.e. [2048]. Logically, the locality structure will be destroyed, but will it significantly speed up my implementation? And can it affect performance?
In supervised learning, you would usually be working with data points (e.g. a feature vector or a multi-dimensional input such as an image) paired with some kind of ground truth (a label for classification tasks, or another multi-dimensional object altogether). Feeding your model a flattened tensor containing multiple data points would not make sense in terms of supervision. Assuming you do an inference this way, what would be the supervision signal at the output level of your model? Would you combine the labels as well? All of this seems to depend heavily on the use case: is there some kind of temporal coherence between the elements of the batch?
Performance-wise, this has no implications whatsoever. Tensors are already 'flattened' by design since their memory is laid out in contiguous memory buffers. The idea of multi-dimensionality is an abstraction layer provided by those libraries (namely NumPy's arrays and Torch's tensors) to allow for easier and more flexible control over data.
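As a small illustration of the memory-layout point, the NumPy sketch below uses the [4, 512] shape from the question. The array contents are placeholders. For a contiguous array, reshaping returns a view onto the same buffer rather than a copy.

```python
import numpy as np

# Placeholder batch: 4 examples with 512 features each, stored contiguously.
batch = np.arange(4 * 512, dtype=np.float32).reshape(4, 512)
flat = batch.reshape(-1)  # shape (2048,), no data is moved

is_contiguous = batch.flags["C_CONTIGUOUS"]
shares_memory = np.shares_memory(batch, flat)

# Writing through the flat view changes the 2-D array as well,
# because both are views onto the same underlying buffer.
flat[0] = -1.0
first_element = batch[0, 0]
```

The two shapes differ only in how indices are mapped onto the buffer, so flattening by itself does not change how much data the model has to process.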
I'm a beginner in time series analysis with deep learning, and I have been searching for LSTM examples in which more than one series (for example, one for each city or place) is used for training, to avoid fitting a separate model for each one. The main benefit, of course, is that you have more training data and lower computational costs. I have found an interesting piece of code that helps with modeling this problem using conditional/temporally-static variables (it's called cond-rnn). But wherever I search, some issues about how to sort the inputs appropriately remain unclear to me.
The context is that I have a target and a set of autoregressive inputs (features, lags, timesteps, whatever you call them), in which data from different series are stacked together. RF and GB are outperforming the LSTM on this task (which overfits, even when I use 100k+ samples, dropout or regularization), and I'm not sure if I'm using it appropriately.
Is it wrong to stack series together and have the input-target pairs randomly sorted (as in the figure)? Does the LSTM need to receive the inputs temporally sorted?
If so, do you have any advice on how to deal with the problem of providing new series (that start from the first time period) to the LSTM during training? This answer to a similar problem (but from another perspective) suggests using "places" as an input column, but I don't think it addresses the questions I posed here.
If I am correct, stratified k-fold is used so that the ratio of the dependent variable in each split is similar to that in the original data.
What I want to understand is why it is necessary or important to retain that ratio.
Is it necessary for fraud detection problems, where the data is highly imbalanced?
If yes, why?
Taken from https://stats.stackexchange.com/questions/49540/understanding-stratified-cross-validation
Cross-validation article in Encyclopedia of Database Systems says:
Stratification is the process of rearranging the data as to ensure each fold is a
good representative of the whole. For example in a binary classification problem
where each class comprises 50% of the data, it is best to arrange the data such
that in every fold, each class comprises around half the instances.
About the importance of the stratification, Kohavi (A study of cross-validation
and bootstrap for accuracy estimation and model selection) concludes that:
stratification is generally a better scheme, both in terms of bias and variance,
when compared to regular cross-validation.
All metrics are calculated against the true labels. If there is a bias in the system, say it predicts one label more often, a fold with more of that label would give artificially inflated results.
One way to take care of that is to ensure the distribution of true labels is very similar in each fold. Then the aggregated results are more indicative of system performance.
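To make the idea concrete, here is a hand-rolled NumPy sketch of stratified fold assignment on hypothetical, fraud-like data with 2% positives. The helper `stratified_folds` and the data are illustrative assumptions. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used instead.

```python
import numpy as np

def stratified_folds(y, n_splits, seed=0):
    """Assign each sample a fold index so every fold has a similar class ratio."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        # Shuffle the indices of this class, then deal them round-robin across folds.
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_of[idx] = np.arange(len(idx)) % n_splits
    return fold_of

# Hypothetical imbalanced labels: 20 positives among 1000 samples (2%).
y = np.array([1] * 20 + [0] * 980)
fold_of = stratified_folds(y, n_splits=5)

# Positive rate per fold: each fold receives 4 of the 20 positives.
positive_rate = [y[fold_of == k].mean() for k in range(5)]
```

With a purely random split of such data, some folds could end up with very few positives or none, which makes per-fold metrics such as recall noisy or undefined. Dealing each class out separately avoids that.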
In neural network forward propagation, I have often seen example vectors multiplied from the left (vector-matrix) and sometimes from the right (matrix-vector). Notation, some Tensorflow tutorials and the datasets I have found seem to prefer the former over the latter, contrary to the way in which linear algebra tends to be taught (the matrix-vector way).
Moreover, the two conventions imply transposed layouts for the parameters: problem variables enumerated along dimension 0, or neurons enumerated along dimension 0.
This confuses me and makes me wonder whether there is really a standard here or whether it has only been coincidence. If there is one, I would like to know if the standard follows from some deeper reasons. I would feel much better having an answer to this question.
(By the way, I know that you will normally use example matrices instead of vectors [or more complex things in conv nets, etc.] because of the use of minibatches, but the point still holds.)
Not sure if this answer is what you are looking for, but in the context of Tensorflow, the standard is to use a dense layer (https://www.tensorflow.org/api_docs/python/tf/layers/dense) which is a higher level abstraction that wraps up the affine transformation logic you are referring to.
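For what it's worth, the two conventions from the question compute the same affine transformation up to a transpose. The NumPy sketch below checks that on random data; the sizes and variable names are arbitrary choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons, batch_size = 3, 2, 5

W = rng.normal(size=(n_features, n_neurons))   # problem variables along dimension 0
b = rng.normal(size=n_neurons)
X = rng.normal(size=(batch_size, n_features))  # one example per row

# Vector-matrix convention: examples multiply the weights from the left.
out_rows = X @ W + b                 # shape (batch_size, n_neurons)

# Matrix-vector convention: weights (neurons along dimension 0) multiply
# column examples from the left.
out_cols = W.T @ X.T + b[:, None]    # shape (n_neurons, batch_size)

same = np.allclose(out_rows, out_cols.T)
```

Since the results agree up to a transpose, the choice is one of layout rather than mathematics. The row convention keeps the batch on the leading axis, which fits datasets stored one example per row.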
I came across a paper that implemented SVM using SMO. I planned to implement SVR (support vector regression) on the basis of it, using SMO. But I'm stuck. I want to ask how the initial values of the Lagrangian parameters are generated. Are they generated using a random function? I came across several implementations, and none of them made clear how the initial values are generated.
The initial parameters can be chosen at random, and SVR will eventually converge to the optimal ones. The second-order derivative is guaranteed to be positive in SVR, but in SVM it may not always be, which can hinder the optimization.