I am creating a logistic regression model using BigQuery ML and I would like to use L1 regularization (lasso). When I built the model using sklearn, I just specified that I wanted L1 regularization. On BQML, however, I need to specify a float number, according to the documentation, and I am confused about what this number should be. Can anyone explain it?
Use L1 regularization to encourage many of the uninformative coefficients in your model to be exactly 0.
L1 regularization, which penalizes the absolute value of all the weights, turns out to be quite efficient for wide models.
Try a regularization rate (lambda) between 0.1 and 0.3 and see if your model improves.
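For intuition: the float is the weight lambda placed on the L1 penalty term in the training objective. As a rough comparison sketch (an assumption, not an exact equivalence: sklearn parameterizes regularization strength as the inverse value C, and the exact scaling depends on dataset size):

from sklearn.linear_model import LogisticRegression

# BQML's l1_reg is a penalty weight (lambda), while sklearn exposes the
# inverse strength C, so lambda = 0.1 corresponds very roughly to
# C = 1 / 0.1 = 10.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0 / 0.1)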
I'm new to deep learning and I saw this for the first time: using MAE as the loss function and MSE as the metric. What is the purpose of this and what is gained?
model.compile(loss=tf.keras.losses.MeanAbsoluteError(), metrics=[tf.keras.metrics.MeanSquaredError()])
In some cases it is useful to have a loss function different from the metric you are going to evaluate.
Consider the case in which you want to denoise an image: you design a network that takes a noisy image as input and outputs its clean version. Here, your metric might be the Peak Signal-to-Noise Ratio (PSNR) or some sort of structural similarity (SSIM) between your output and the ground-truth clean image. During training, however, you might consider a different loss function, such as L1 (MAE), L2 (MSE), or even a perceptual loss such as the VGG loss, because these have been shown to lead to better results than directly optimizing for PSNR or SSIM.
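A minimal sketch of that pattern, assuming a tf.keras model for denoising with images scaled to [0, 1] (the model itself is not shown): train on MAE while monitoring PSNR.

import tensorflow as tf

# Train with an L1 (MAE) loss while monitoring PSNR as the metric.
def psnr_metric(y_true, y_pred):
    # tf.image.psnr expects images with a known maximum value.
    return tf.image.psnr(y_true, y_pred, max_val=1.0)

model.compile(optimizer="adam",
              loss=tf.keras.losses.MeanAbsoluteError(),
              metrics=[psnr_metric])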
I have been working on a project involving a sequence-to-sequence autoencoder for time-series forecasting, using tf.contrib.rnn.MultiRNNCell in the encoder and decoder. I am confused about which strategy to use to regularize my seq2seq model. Should I use L2 regularization in the loss, or DropoutWrapper (tf.contrib.rnn.DropoutWrapper) in the MultiRNNCell? Or can I use both strategies: L2 for the weights and biases (projection layer) and DropoutWrapper between cells in the MultiRNNCell?
Thanks in advance :)
You can use both dropout and L2 regularization at the same time, as is commonly done; they are quite different types of regularization. However, I would note that recent literature has suggested that batch normalization has replaced the need for dropout, as noted in the original paper on batch normalization:
https://arxiv.org/abs/1502.03167
From the abstract: "It also acts as a regularizer, in some cases eliminating the need for Dropout."
L2 regularization is still typically applied when batchnorm is in use. There's nothing stopping you from applying all three forms of regularization; the statement above only indicates that you might not see an improvement from dropout when batchnorm is already in use.
There are generally optimal values for the amount of L2 regularization to apply and the dropout keep probability. These are hyperparameters you tune by trial and error or a hyperparameter search algorithm.
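A minimal TF 1.x-style sketch of combining both (the cell sizes, keep probability, penalty weight, and base_loss are hypothetical placeholders):

import tensorflow as tf

# Dropout between stacked RNN cells via DropoutWrapper.
cells = [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.LSTMCell(128),
                                       output_keep_prob=0.8)
         for _ in range(2)]
multi_cell = tf.contrib.rnn.MultiRNNCell(cells)

# ... build the seq2seq graph and its base_loss, then add L2 on the
# weights (skipping biases, as is common practice).
l2 = 1e-4 * tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()
                      if "bias" not in v.name.lower()])
loss = base_loss + l2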
I'm new to TensorFlow and I find that the sample CNN programs use weight decay to avoid huge weights, while they do not always normalize the input in the first place.
Does the weight decay serve the same purpose as the input normalization?
What is the difference between them?
Weight decay is a type of regularisation used to control overfitting of the model. Weight decay is more commonly known as L2 regularisation. It is more common in shallow learning algorithms like linear regression and logistic regression. In deep learning (e.g., with CNNs), weight decay is less common; other regularisation methods like dropout are used instead.
Input normalisation, on the other hand, refers to zero-centering your input data and limiting its range. This procedure helps training converge quickly.
There is no general fixed rule on how these two concepts have to be applied, so you may have seen some variations of them; a contrast sketch follows below.
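A toy numpy sketch of the difference (the rates and data are made up): input normalisation transforms the data once before training, while weight decay shrinks the weights slightly on every update.

import numpy as np

# Input normalisation: done once, on the data.
X = np.random.randn(100, 3) * 5 + 10            # toy, badly scaled data
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # zero-centered, unit scale

# Weight decay (L2): done on the weights, at every gradient step.
lr, wd = 0.1, 0.01
w = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, 0.3, 0.3])
w = w * (1 - lr * wd) - lr * grad               # decay pulls w toward 0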
Weight decay is a regularization technique, such as L2 regularization, that results in gradient descent shrinking the weights on every iteration.
I am employing L1 regularization on my neural network parameters in Keras with keras.regularizers.l1(0.01) to obtain a sparse model. I am finding that, while many of my coefficients are close to zero, few of them are exactly zero.
Looking at the source code for the regularization, it seems that Keras simply adds the L1 norm of the parameters to the loss function.
This would be incorrect, because the parameters would almost certainly never go to zero (within floating-point error) as intended with L1 regularization. The L1 norm is not differentiable at zero, so subgradient methods need to be used, where parameters close enough to zero are set to zero during the optimization routine. See the soft-thresholding operator max(0, ..) here.
Does TensorFlow/Keras do this, or is this impractical to do with stochastic gradient descent?
EDIT: Also, here is a superb blog post explaining the soft-thresholding operator for L1 regularization.
So, despite @Joshua's answer, there are three other things worth mentioning:
There is no problem with the gradient at 0; Keras automatically sets it to 1, similarly to the relu case.
Remember that values smaller than 1e-6 are effectively equal to 0, as this is around float32 precision.
The problem of most values not being set to 0 might arise for computational reasons, due to the nature of a gradient-descent-based algorithm (and setting a high l1 value): oscillations occur because of the gradient discontinuity. To understand this, imagine that for a given weight w = 0.005, your learning rate is 0.01 and the gradient of the main loss is 0 w.r.t. w. Then your weight would be updated as follows:
w = 0.005 - 1 * 0.01 = -0.005 (because the gradient equals 1, as w > 0),
and after the second update:
w = -0.005 + 1 * 0.01 = 0.005 (because the gradient equals -1, as w < 0).
As you can see, the absolute value of w has not decreased even though you applied l1 regularization, and this happened due to the nature of the gradient-based algorithm. Of course, this is a simplified situation, but you can experience such oscillating behavior really often when using an l1 norm regularizer.
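You can reproduce this oscillation with a few lines of numpy (same toy numbers as above):

import numpy as np

# The main-loss gradient is 0; only the L1 subgradient sign(w) drives updates.
w, lr = 0.005, 0.01
for step in range(4):
    w = w - lr * np.sign(w)
    print(step, w)  # w flips between 0.005 and -0.005, never reaching 0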
Keras correctly implements L1 regularization. In the context of neural networks, L1 regularization simply adds the L1 norm of the parameters to the loss function (see CS231n).
While L1 regularization does encourage sparsity, it does not guarantee that the output will be sparse. The parameter updates from stochastic gradient descent are inherently noisy, so the probability that any given parameter is exactly 0 is vanishingly small.
However, many of the parameters of an L1-regularized network are often close to 0, so a rudimentary approach is to threshold small values to 0. There has been research exploring more advanced methods of generating sparse neural networks. In this paper, the authors simultaneously prune and train a neural network to achieve 90-95% sparsity on a number of well-known network architectures.
TL;DR:
The formulation in deep learning frameworks is correct, but currently we don't have a powerful solver/optimizer to solve it exactly with SGD or its variants. However, if you use proximal optimizers, you can obtain a sparse solution.
Your observation is right.
Almost all deep learning frameworks (including TF) implement L1 regularization by adding the absolute values of the parameters to the loss function. This is the Lagrangian form of L1 regularization and is correct.
However, the solver/optimizer is to blame. Even for the well-studied LASSO problem, where the solution should be sparse and the soft-thresholding operator does give us the sparse solution, a subgradient descent solver cannot get the exact sparse solution. This answer from Quora gives some insight into the convergence properties of subgradient descent:
"Subgradient descent has very poor convergence properties for non-smooth functions, such as the Lasso objective, since it ignores problem structure completely (it doesn't distinguish between the least squares fit and the regularization term) by just looking at subgradients of the entire objective. Intuitively, taking small steps in the direction of the (sub)gradient usually won't lead to coordinates equal to zero exactly."
If you use proximal operators, you can get a sparse solution. For example, have a look at the paper "Data-driven sparse structure selection for deep neural networks" (this one comes with MXNet code and is easy to reproduce!) or "Stochastic Proximal Gradient Descent with Acceleration Techniques" (this one gives more theoretical insight). I'm not sure whether the built-in proximal optimizers in TF (e.g., tf.train.ProximalAdagradOptimizer) lead to sparse solutions, but you may give them a try.
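For reference, a minimal numpy sketch of the soft-thresholding (proximal) operator for the L1 penalty; unlike a plain subgradient step, it sets small coordinates exactly to zero:

import numpy as np

def soft_threshold(w, step):
    # prox of step * ||w||_1: shrink toward zero, clipping at zero.
    return np.sign(w) * np.maximum(np.abs(w) - step, 0.0)

print(soft_threshold(np.array([0.005, -0.5, 2.0]), step=0.01))
# -> [ 0.   -0.49  1.99]: the small weight becomes exactly 0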
Another simple workaround is to zero out small weights (e.g., absolute value < 1e-4) after training, or after each gradient descent step, to force sparsity. This is just a handy heuristic and not theoretically rigorous.
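A sketch of that heuristic for a tf.keras model (the model object and the 1e-4 threshold are assumptions):

import numpy as np

# Post-hoc pruning: zero out weights whose magnitude is below a threshold.
for w in model.trainable_weights:
    values = w.numpy()
    values[np.abs(values) < 1e-4] = 0.0
    w.assign(values)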
Keras implements L1 regularization properly, but this is not a LASSO. For the LASSO, one would need a soft-thresholding function, as correctly pointed out in the original post. It would be very useful to have a function similar to keras.layers.ThresholdedReLU(theta=1.0), but with f(x) = x for x > theta or x < -theta, and f(x) = 0 otherwise. For the LASSO, theta would be equal to the learning rate times the regularization factor of the L1 function.
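Such a two-sided threshold is easy to sketch as a custom layer (the class name and default theta are hypothetical; note this thresholds activations, so it only illustrates the shape of f):

import tensorflow as tf

class TwoSidedThreshold(tf.keras.layers.Layer):
    """f(x) = x for |x| > theta, 0 otherwise."""
    def __init__(self, theta=1.0, **kwargs):
        super().__init__(**kwargs)
        self.theta = theta

    def call(self, x):
        return tf.where(tf.abs(x) > self.theta, x, tf.zeros_like(x))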
I would like to add both L1 and L2 regularization to my loss function. When I define a weight variable I choose the regularizer to use, but it seems I can only choose one.
regLosses=tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss=tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_,logits=y_conv))+tf.add_n(regLosses)
When I try to compute the losses manually with
weights=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
l1Loss=tf.reduce_sum(tf.abs(weights))
l2Loss=tf.nn.l2_loss(weights)
loss=tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_,logits=y_conv))+.1*l1Loss+.001*l2Loss
it doesn't work; I think it's because TRAINABLE_VARIABLES returns the variables, not the parameters. How do I fix this? Is my manual calculation of the L1 loss correct?
Thanks in advance
So I think I discovered the answer. Comments and review welcome.
When I create the weights I use:
W = tf.get_variable(name=name, shape=shape, regularizer=tf.contrib.layers.l1_regularizer(1.0))
Noting that the l1 regularization is simply the sum of the absolute values of the weights, and that l2 is the sum of their squares, I can do the following.
regLosses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
l1 = tf.reduce_sum(tf.abs(regLosses))
l2 = tf.reduce_sum(tf.square(regLosses))
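An alternative sketch, under the assumption that computing both penalties directly from the trainable variables is acceptable (tf.add_n handles variables of different shapes per variable; y_ and y_conv are from the question):

import tensorflow as tf  # TF 1.x-style sketch

weights = tf.trainable_variables()
l1 = tf.add_n([tf.reduce_sum(tf.abs(w)) for w in weights])
l2 = tf.add_n([tf.nn.l2_loss(w) for w in weights])  # sum(w**2) / 2

loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y_conv))
loss += 0.1 * l1 + 0.001 * l2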