Is the L1 regularization in Keras/Tensorflow *really* L1-regularization? - tensorflow

I am employing L1 regularization on my neural network parameters in Keras with keras.regularizers.l1(0.01) to obtain a sparse model. I am finding that, while many of my coefficients are close to zero, few of them are actually zero.
Upon looking at the source code for the regularization, it suggests that Keras simply adds the L1 norm of the parameters to the loss function.
This would be incorrect because the parameters would almost certainly never go to zero (within floating point error) as intended with L1 regularization. The L1 norm is not differentiable when a parameter is zero, so subgradient methods need to be used, in which the parameters are set to zero inside the optimization routine if they get close enough to zero. See the soft threshold operator max(0, ..) here.
Does Tensorflow/Keras do this, or is this impractical to do with stochastic gradient descent?
EDIT: Also here is a superb blog post explaining the soft thresholding operator for L1 regularization.

So despite Joshua's answer, there are three other things that are worth mentioning:
There is no problem connected with the gradient at 0: Keras automatically sets it to 1, similarly to the relu case.
Remember that values smaller than 1e-6 are actually equal to 0, as this is float32 precision.
The problem of most values not being set to 0 can arise for computational reasons, due to the nature of a gradient-descent-based algorithm (and a high l1 value): oscillations can occur because of the gradient discontinuity. To understand this, imagine that for a given weight w = 0.005 your learning rate is equal to 0.01 and the gradient of the main loss is equal to 0 w.r.t. w. Then your weight would be updated in the following manner:
w = 0.005 - 1 * 0.01 = -0.005 (because the gradient is equal to 1, as w > 0),
and after the second update:
w = -0.005 + 1 * 0.01 = 0.005 (because the gradient is equal to -1, as w < 0).
As you can see, the absolute value of w hasn't decreased even though you applied l1 regularization, and this is due to the nature of the gradient-based algorithm. Of course, this is a simplified situation, but you can experience such oscillating behaviour really often when using an l1 norm regularizer.
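To make the oscillation concrete, here is a minimal stand-alone sketch (plain numpy, toy values matching the example above) of a single weight driven only by the l1 subgradient:
import numpy as np

w, lr, l1 = 0.005, 0.01, 1.0      # toy weight, learning rate and l1 strength from the example
for step in range(4):
    grad = l1 * np.sign(w)        # subgradient of l1 * |w|; the main-loss gradient is 0 here
    w = w - lr * grad
    print(step, w)                # w keeps flipping between -0.005 and 0.005, never reaching 0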

Keras correctly implements L1 regularization. In the context of neural networks, L1 regularization simply adds the L1 norm of the parameters to the loss function (see CS231).
While L1 regularization does encourage sparsity, it does not guarantee that the output will be sparse. The parameter updates from stochastic gradient descent are inherently noisy. Thus, the probability that any given parameter is exactly 0 is vanishingly small.
However, many of the parameters of an L1-regularized network are often close to 0. A rudimentary approach would be to threshold small values to 0. There has been research exploring more advanced methods of generating sparse neural networks. In this paper, the authors simultaneously prune and train a neural network to achieve 90-95% sparsity on a number of well-known network architectures.
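For illustration, a rudimentary post-hoc thresholding pass over a trained Keras model could look like the sketch below (the 1e-3 cutoff is an arbitrary choice for the example, not a recommendation):
import numpy as np

def sparsify(model, threshold=1e-3):
    # Zero out every weight (and bias) whose magnitude is below the cutoff.
    for layer in model.layers:
        new_weights = [np.where(np.abs(w) < threshold, 0.0, w)
                       for w in layer.get_weights()]
        layer.set_weights(new_weights)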

TL;DR:
The formulation in deep learning frameworks is correct, but currently we don't have a powerful solver/optimizer to solve it EXACTLY with SGD or its variants. But if you use proximal optimizers, you can obtain a sparse solution.
Your observation is right.
Almost all deep learning frameworks (including TF) implement L1 regularization by adding the absolute values of the parameters to the loss function. This is the Lagrangian form of L1 regularization and IS CORRECT.
However, the SOLVER/OPTIMIZER is to blame. Even for the well-studied LASSO problem, where the solution should be sparse and the soft-threshold operator DOES give us the sparse solution, the subgradient descent solver CAN NOT get the EXACT SPARSE solution. This answer from Quora gives some insight into the convergence properties of subgradient descent, which says:
Subgradient descent has very poor convergence properties for
non-smooth functions, such as the Lasso objective, since it ignores
problem structure completely (it doesn't distinguish between the least
squares fit and the regularization term) by just looking at
subgradients of the entire objective. Intuitively, taking small steps
in the direction of the (sub)gradient usually won't lead to
coordinates equal to zero exactly.
If you use proximal operators, you can get a sparse solution. For example, you can have a look at the paper "Data-driven sparse structure selection for deep neural networks" (this one comes with MXNET code and is easy to reproduce!) or "Stochastic Proximal Gradient Descent with Acceleration Techniques" (this one gives more theoretical insight). I'm not sure whether the built-in proximal optimizer in TF (e.g. tf.train.ProximalAdagradOptimizer) can lead to sparse solutions, but you may give it a try.
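For reference, a minimal TF 1.x sketch of that built-in proximal optimizer (with a toy loss; whether it actually yields exact zeros is, as said, something you would have to try) might look like:
import tensorflow as tf  # TF 1.x graph mode

w = tf.Variable(tf.random_normal([10]))
data_loss = tf.reduce_sum(tf.square(w - 1.0))   # toy unregularized data loss; no L1 term in the loss
optimizer = tf.train.ProximalAdagradOptimizer(
    learning_rate=0.1,
    l1_regularization_strength=0.01)            # L1 is handled by the proximal step instead
train_op = optimizer.minimize(data_loss)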
Another simple workaround is to zero out small weights (i.e. absolute value < 1e-4) after training or after each gradient descent step to force sparsity. This is just a handy heuristic approach and not theoretically rigorous.

Keras implements L1 regularization properly, but this is not a LASSO. For the LASSO one would need a soft-thresholding function, as correctly pointed out in the original post. It would be very useful to have a function similar to keras.layers.ThresholdedReLU(theta=1.0), but with f(x) = x - theta for x > theta, f(x) = x + theta for x < -theta, and f(x) = 0 otherwise. For the LASSO, theta would be equal to the learning rate times the regularization factor of the L1 function.
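A sketch of such a soft-thresholding (shrinkage) operator, written as a plain TensorFlow function rather than a Keras layer, might be:
import tensorflow as tf

def soft_threshold(x, theta):
    # Shrink x towards zero by theta; everything within [-theta, theta] becomes exactly 0.
    return tf.sign(x) * tf.maximum(tf.abs(x) - theta, 0.0)

Applied to the weights after each gradient step, with theta equal to the learning rate times the L1 factor, this is the proximal update for the LASSO.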

Related

How to debug exploding gradient (covariance matrix) in Tensorflow 2.0 (TFP)

This question comes from the fact that I have never had to debug my models in TF this deeply before.
I'm running a variational inference with a full-rank Gaussian approximation using Tensorflow Probability. I noticed my optimization often explodes. Here is my loss curve.
I suspect numerical issues, as all the losses and the optimization process look reasonable and I don't observe any NaNs.
I use tfp.distributions.MultivariateNormalTriL with a covariance parameter transformed by tfp.bijectors.FillScaleTriL with the default diagonal shift. The condition number of the covariance matrix is reasonable. The variational inference is performed with the fit_surrogate_posterior function.
I optimize with SGD with momentum, using 10 samples per iteration.
Internally in Tensorflow Probability source code, the minimization objective uses a gradient tape:
with tf.GradientTape(watch_accessed_variables=trainable_variables is None) as tape:
    for v in trainable_variables or []:
        tape.watch(v)
    loss = loss_fn()
In order to solve my issue I would like to see the gradient through every operation.
My question is: how can I get more insight into which operation is exploding during the gradient computation? How can I get the value of the gradient at every tensor?
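One thing I can imagine is wrapping the same loss_fn in my own tape and logging per-variable gradient norms (a rough sketch with placeholder names), but I would like to know the proper way to do this:
import tensorflow as tf

def debug_step(loss_fn, trainable_variables, optimizer):
    with tf.GradientTape() as tape:
        loss = loss_fn()
    grads = tape.gradient(loss, trainable_variables)
    for v, g in zip(trainable_variables, grads):
        if g is not None:
            tf.print(v.name, "grad norm:", tf.norm(g))   # spot which variable blows up
    optimizer.apply_gradients(zip(grads, trainable_variables))
    return loss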
And if any of you faced a similar issue:
Is there a better way to prevent instabilities in the covariance matrix optimization?
Detailed explanations:
I observed that this explosion is caused by one parameter (though it is not always the same parameter that explodes). This can be checked simply by comparing the covariance matrix from two iterations before the explosion with the one from the iteration just before the loss explodes.
Note the last parameter. When I run the same optimization multiple times, it might happen that one of the "small" parameters (rows from 9 to the last) explodes at some point.
Thanks,
Mateusz

Multiple questions regarding the KL term in the ELBO loss with TensorFlow Probability

I have been trying to conduct a few experiments using TensorFlow Probability (TFP), and I got a few questions.
What is the proper value of the coefficient of the KL loss?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
As I go from 2d input to 3d images volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be of the same order of magnitude?
The current coefficient only takes care of the number of training samples, but not the network complexity or number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
I am trying to implement a neural network with the KL loss, without using keras.model.losses, due to some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0; the issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from the layers or weights of the network, in a TF construct?
Is batch normalization or group normalization still helpful in Bayesian deep learning?
In the paper by Blundell (2015), the coefficient is set to 1/M (where M is the number of mini-batches). In the example given by TFP, the coefficient is given as 1/mnist_data.train.num_examples. Why?
In the BBB paper eq. 8, they refer to M being the number of mini-batches. To be consistent with the non-stochastic gradient learning, it should be scaled by the number of mini-batches which is what is done by Graves. Another alternative is that done in eq. 9, where they scale it by \pi_i, where the sum of all the values in the set {\pi} sum to one.
In the TFP example, it does look like num_examples is the total number of independent samples within the training set, which is much larger than the number of batches. This goes by a few names, such as Safe Bayes or Tempering. Have a look at sec. 8 of this paper for some more discussion about the use of tempering within Bayesian inference and its suitability.
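As a concrete (hypothetical) illustration of the two scalings, with num_batches and num_examples standing in for M and the dataset size:
def scaled_kl(kl, num_batches, num_examples):
    kl_bbb = kl / num_batches    # Blundell-style: divide the KL by the number of mini-batches M
    kl_tfp = kl / num_examples   # TFP-example-style: a much smaller weight, i.e. a tempered posterior
    return kl_bbb, kl_tfp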
As I go from 2d input to 3d images volumes, the KL loss is still significantly larger (~1k) than the cross-entropy (~1), even after dividing by mnist_data.train.num_examples. Why?
The ELBO will always be larger than just your cross-entropy (which defines your likelihood). Have a look at how the KL divergence term in the ELBO is found (and at the full mean-field approach, where each weight/parameter is assumed to be independent).
Since the assumed posterior is factorised (each parameter is assumed to be independent), we can write the joint distribution as a product. This means that when you take the log while computing the KL between the approximate posterior and the prior, you can write it as a sum of KL terms, one per parameter. Since the KL is >= 0, for each parameter you add to your model you will be adding another positive term to your ELBO. This is likely why your loss is so much larger for your 3D model: there are more parameters.
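A tiny TFP sketch of this additivity (toy numbers, a fully factorised Gaussian posterior against a standard normal prior):
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions

posterior = tfd.Normal(loc=[0.1, -0.2, 0.3], scale=[0.5, 0.5, 0.5])  # one factor per weight
prior = tfd.Normal(loc=0., scale=1.)

kl_per_weight = tfd.kl_divergence(posterior, prior)  # one non-negative term per weight
total_kl = tf.reduce_sum(kl_per_weight)              # grows as you add parameters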
Another reason this could occur is if you have less data (your M is smaller, so the KL term is scaled down less).
What is the guideline for getting a proper value for this coefficient? Maybe the two loss terms should be of the same order of magnitude?
I am unsure of any specific guideline; for training you are primarily interested in the gradients. A large loss does not mean a large gradient. Have a look at the gradients contributed by the negative log likelihood and the KL term in your ELBO. If the KL term is too large, you probably need a more informative prior or more data (you could simply scale the KL term, but this feels a bit yucky for the Bayesian in me).
The current coefficient only takes care of the number of training samples, but not the network complexity or the number of parameters in the network, and I assume the KL loss increases with the complexity of the model.
Yes, as stated before, in general, more parameters == greater ELBO (for a mean-field approach as used in Bayes by Backprop).
I am trying to implement a neural network with the KL loss, without using keras.model.losses, due to some software production and hardware support limitations. I am trying to train my model with TF 1.10 and TFP 0.3.0; the issue is that for tf<=1.14, tf.keras.model does not support tf.layers inside the Keras model, so I can't use my original model straight away. Is there a way to get the KL loss, not from model.losses, but from the layers or weights of the network, in a TF construct?
I am unsure about the best way to tackle this part of it. I would be cautious about going to older versions where it isn't explicitly supported. They put those warnings/exceptions in for a reason.
Is batch normalization or group normalization still helpful in Bayesian deep learning?
For variational inference (as done in Bayes by Backprop) batch norm is fine. For sampling methods such as MCMC, batch normalization is no longer suitable. Have a look at https://arxiv.org/pdf/1908.03491v1.pdf for info on the suitability of batch norm with sampling methods for approximate Bayesian inference.

Cost function convergence in Tensorflow using softmax_cross_entropy_with_logits and "soft" labels/targets

I've found what is probably a rare case in Tensorflow, but I'm trying to train a classifier (linear or nonlinear) using KL divergence (cross entropy) as the cost function in Tensorflow, with soft targets/labels (labels that form a valid probability distribution but are not "hard" 1 or 0).
However, it is clear (tell-tale signs) that something is definitely wrong. I've tried both linear and nonlinear (dense neural network) forms, but no matter what I always get the same final value for my loss function regardless of network architecture (even if I train only a bias). Also, the cost function converges extremely quickly (within like 20-30 iterations) using L-BFGS (a very reliable optimizer!). Another sign something is amiss is that I can't overfit the data, and the validation set appears to have exactly the same loss value as the training set. However, strangely I do see some improvements when I increase the network architecture size and/or change the regularization loss. The accuracy improves with this as well (although not to the point that I'm happy with it or that I'd expect).
It DOES work as expected when I use the exact same code but send in one-hot encoded labels (not soft targets). An example of the cost function from training taken from Tensorboard is shown below. Can someone pitch me some ideas?
Ahh my friend, your problem is that with soft targets, especially ones that aren't close to 1 or zero, the cross entropy loss doesn't change significantly as the algorithm improves. One thing that will help you understand this problem is to take an example from your training data and compute its entropy... then you will know the lowest value your cost function can reach. This may shed some light on your problem. So for one of your examples, let's say the targets are [0.39019628, 0.44301641, 0.16678731]. Well, using the formula for cross entropy
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
but then using the targets "y_" in place of the predicted probabilities "y", we arrive at the true entropy value of 1.0266190072458234. If your predictions are just slightly off target... let's say they are [0.39511779, 0.44509024, 0.15979198], then the cross entropy is 1.026805558049737.
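You can verify these numbers yourself with a few lines of numpy:
import numpy as np

y_true = np.array([0.39019628, 0.44301641, 0.16678731])
y_pred = np.array([0.39511779, 0.44509024, 0.15979198])

entropy_floor = -np.sum(y_true * np.log(y_true))   # ~1.02662, the lowest value the loss can reach
cross_entropy = -np.sum(y_true * np.log(y_pred))   # ~1.02681, barely above the floor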
Now, as with most difficult problems, it's not just one thing but a combination of things. The loss function is being implemented correctly, but you made the "mistake" of doing what you should do in 99.9% of cases when training deep learning algorithms... you used 32-bit floats. In this particular case though, you will run out of significant digits that a 32-bit float can represent well before your training algorithm converges to a nice result. If I use your exact same data and code but only change the data types to 64-bit floats, you can see below that the results are much better: your algorithm continues to train well past 2000 iterations and you will see it reflected in your accuracy as well. In fact, you can see from the slope that if 128-bit floating point were supported, you could continue training and probably see advantages from it. You probably wouldn't need that precision in your final weights and biases... just during training to support continued optimization of the cost function.
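For what it's worth, a minimal TF 1.x sketch of that dtype change (a toy softmax model, not your exact code; the only difference from a float32 version is tf.float64 everywhere):
import tensorflow as tf  # TF 1.x graph mode, matching the snippet above

x = tf.placeholder(tf.float64, shape=[None, 10])   # features
y_ = tf.placeholder(tf.float64, shape=[None, 3])   # soft targets
W = tf.Variable(tf.zeros([10, 3], dtype=tf.float64))
b = tf.Variable(tf.zeros([3], dtype=tf.float64))
y = tf.nn.softmax(tf.matmul(x, W) + b)
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))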

Why do we clip_by_global_norm to obtain gradients when training an RNN

I am following this tutorial on RNN where on line 177 the following code is executed.
max_grad_norm = 10
....
grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), max_grad_norm)
optimizer = tf.train.GradientDescentOptimizer(self.lr)
self._train_op = optimizer.apply_gradients(
    zip(grads, tvars),
    global_step=tf.contrib.framework.get_or_create_global_step())
Why do we do clip_by_global_norm? How is the value of max_grad_norm decided?
The reason for clipping the norm is that otherwise it may explode:
There are two widely known issues with properly training recurrent
neural networks, the vanishing and the exploding gradient problems
detailed in Bengio et al. (1994). In this paper we attempt to improve
the understanding of the underlying issues by exploring these problems
from an analytical, a geometric and a dynamical systems perspective.
Our analysis is used to justify a simple yet effective solution. We
propose a gradient norm clipping strategy to deal with exploding
gradients
The above taken from this paper.
In terms of how to set max_grad_norm, you could play with it a bit to see how it affects your results. This is usually set to quite a small number (I have seen 5 in several cases). Note that tensorflow does not force you to specify this value. If you don't, it will specify it itself (as explained in the documentation).
The reason that exploding/vanishing gradients are common in rnn is that while doing backpropagation (this is called backpropagation through time), we need to multiply the gradient matrices all the way back to t=0 (that is, if we are currently at t=100, say the 100th character in a sentence, we need to multiply 100 matrices). Here is the equation for t=3:
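(The original equation image is not reproduced here; the standard BPTT expansion it shows, for the loss E_3 with hidden states s and prediction y_hat, is roughly of the form:)
dE_3/dW = sum_{k=0..3} (dE_3/dy_hat_3) * (dy_hat_3/ds_3) * (ds_3/ds_k) * (ds_k/dW),
where ds_3/ds_k = prod_{j=k+1..3} ds_j/ds_{j-1} is the product of Jacobians that can explode or vanish.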
(this equation is taken from here)
If the norm of the matrices is bigger than 1, it will eventually explode. If it is smaller than 1, it will eventually vanish. This may happen in usual neural networks as well if they have a lot of hidden layers. However, feed-forward neural networks usually don't have so many hidden layers, while the input sequences to an rnn can easily have many characters.

Gradient Descent vs Adagrad vs Momentum in TensorFlow

I'm studying TensorFlow and how to use it, even though I'm not an expert in neural networks and deep learning (just the basics).
Following tutorials, I don't understand the real and practical differences between the three optimizers for loss. I looked at the API and I understand the principles, but my questions are:
1. When is it preferable to use one instead of the others?
2. Are there important differences to know?
Here is a brief explanation based on my understanding:
momentum helps SGD to navigate along the relevant directions and softens the oscillations in the irrelevant ones. It simply adds a fraction of the direction of the previous step to the current step. This amplifies speed in the correct direction and softens oscillation in the wrong directions. This fraction is usually in the (0, 1) range. It also makes sense to use adaptive momentum: at the beginning of learning a big momentum will only hinder your progress, so it makes sense to use something like 0.01, and once all the high gradients have disappeared you can use a bigger momentum. There is one problem with momentum: when we are very close to the goal, our momentum is in most cases very high and it does not know that it should slow down. This can cause it to miss or oscillate around the minima.
nesterov accelerated gradient overcomes this problem by starting to slow down early. In momentum we first compute gradient and then make a jump in that direction amplified by whatever momentum we had previously. NAG does the same thing but in another order: at first we make a big jump based on our stored information, and then we calculate the gradient and make a small correction. This seemingly irrelevant change gives significant practical speedups.
AdaGrad or adaptive gradient allows the learning rate to adapt based on the parameters. It performs larger updates for infrequent parameters and smaller updates for frequent ones. Because of this it is well suited for sparse data (NLP or image recognition). Another advantage is that it basically eliminates the need to tune the learning rate. Each parameter has its own learning rate, and due to the peculiarities of the algorithm the learning rate is monotonically decreasing. This causes the biggest problem: at some point in time the learning rate is so small that the system stops learning.
AdaDelta resolves the problem of the monotonically decreasing learning rate in AdaGrad. In AdaGrad the learning rate is calculated approximately as one divided by the square root of the sum of squared gradients. At each stage you add another squared gradient to the sum, which causes the denominator to constantly increase. In AdaDelta, instead of summing all past squared gradients, it uses a sliding window which allows the sum to decrease. RMSprop is very similar to AdaDelta.
Adam or adaptive moment estimation is an algorithm similar to AdaDelta. But in addition to storing learning rates for each of the parameters, it also stores momentum changes for each of them separately.
A few visualizations:
I would say that SGD, Momentum and Nesterov are inferior to the last 3.
Salvador Dali's answer already explains about the differences between some popular methods (i.e. optimizers), but I would try to elaborate on them some more.
(Note that our answers disagree about some points, especially regarding ADAGRAD.)
Classical Momentum (CM) vs Nesterov's Accelerated Gradient (NAG)
(Mostly based on section 2 in the paper On the importance of initialization and momentum in deep learning.)
Each step in both CM and NAG is actually composed of two sub-steps:
A momentum sub-step - This is simply a fraction (typically in the range [0.9,1)) of the last step.
A gradient dependent sub-step - This is like the usual step in SGD - it is the product of the learning rate and the vector opposite to the gradient, while the gradient is computed where this sub-step starts from.
CM takes the gradient sub-step first, while NAG takes the momentum sub-step first.
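In update-rule form, the difference is only where the gradient is evaluated. Here is a runnable toy sketch (quadratic loss, made-up hyperparameters) contrasting the two:
def grad(w):                 # toy loss 0.5 * w**2, so the gradient is just w
    return w

mu, lr = 0.9, 0.1
w_cm = w_nag = 2.0
v_cm = v_nag = 0.0
for _ in range(50):
    v_cm = mu * v_cm - lr * grad(w_cm)                   # CM: gradient at the current point
    w_cm = w_cm + v_cm
    v_nag = mu * v_nag - lr * grad(w_nag + mu * v_nag)   # NAG: gradient at the looked-ahead point
    w_nag = w_nag + v_nag
print(w_cm, w_nag)           # compare how the two trajectories approach the minimum at 0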
Here is a demonstration from an answer about intuition for CM and NAG:
So NAG seems to be better (at least in the image), but why?
The important thing to note is that it doesn't matter when the momentum sub-step comes - it would be the same either way. Therefore, we might as well behave as if the momentum sub-step has already been taken.
Thus, the question actually is: Assuming that the gradient sub-step is taken after the momentum sub-step, should we calculate the gradient sub-step as if it started in the position before or after taking the momentum sub-step?
"After it" seems like the right answer, as generally, the gradient at some point θ roughly points you in the direction from θ to a minimum (with the relatively right magnitude), while the gradient at some other point is less likely to point you in the direction from θ to a minimum (with the relatively right magnitude).
Here is a demonstration (from the gif below):
The minimum is where the star is, and the curves are contour lines. (For an explanation about contour lines and why they are perpendicular to the gradient, see videos 1 and 2 by the legendary 3Blue1Brown.)
The (long) purple arrow is the momentum sub-step.
The transparent red arrow is the gradient sub-step if it starts before the momentum sub-step.
The black arrow is the gradient sub-step if it starts after the momentum sub-step.
CM would end up in the target of the dark red arrow.
NAG would end up in the target of the black arrow.
Note that this argument for why NAG is better is independent of whether the algorithm is close to a minimum.
In general, both NAG and CM often have the problem of accumulating more momentum than is good for them, so whenever they should change direction, they have an embarrassing "response time". The advantage of NAG over CM that we explained doesn't prevent the problem, but only makes the "response time" of NAG less embarrassing (but embarrassing still).
This "response time" problem is beautifully demonstrated in the gif by Alec Radford (which appeared in Salvador Dali's answer):
ADAGRAD
(Mostly based on section 2.2.2 in ADADELTA: An Adaptive Learning Rate Method (the original ADADELTA paper), as I find it much more accessible than Adaptive Subgradient Methods for Online Learning and Stochastic Optimization (the original ADAGRAD paper).)
In SGD, the step is given by - learning_rate * gradient, while learning_rate is a hyperparameter.
ADAGRAD also has a learning_rate hyperparameter, but the actual learning rate for each component of the gradient is calculated individually.
The i-th component of the t-th step is given by:
learning_rate
- --------------------------------------- * gradient_i_t
norm((gradient_i_1, ..., gradient_i_t))
while:
gradient_i_k is the i-th component of the gradient in the k-th step
(gradient_i_1, ..., gradient_i_t) is a vector with t components. It isn't intuitive (at least to me) that constructing such a vector makes sense, but that's what the algorithm does (conceptually).
norm(vector) is the Euclidean norm (aka l2 norm) of vector, which is our intuitive notion of the length of vector.
Confusingly, in ADAGRAD (as well as in some other methods) the expression that is multiplied by gradient_i_t (in this case, learning_rate / norm(...)) is often called "the learning rate" (in fact, I called it "the actual learning rate" in the previous paragraph). I guess this is because in SGD the learning_rate hyperparameter and this expression are one and the same.
In a real implementation, some constant would be added to the denominator, to prevent a division by zero.
E.g. if:
The i-th component of the gradient in the first step is 1.15
The i-th component of the gradient in the second step is 1.35
The i-th component of the gradient in the third step is 0.9
Then the norm of (1.15, 1.35, 0.9) is the length of the yellow line, which is:
sqrt(1.15^2 + 1.35^2 + 0.9^2) = 1.989.
And so the i-th component of the third step is: - learning_rate / 1.989 * 0.9
Note two things about the i-th component of the step:
It is proportional to learning_rate.
In the calculations of it, the norm is increasing, and so the learning rate is decreasing.
This means that ADAGRAD is sensitive to the choice of the hyperparameter learning_rate.
In addition, it might be that after some time the steps become so small, that ADAGRAD virtually gets stuck.
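In code, ADAGRAD's per-component bookkeeping can be sketched like this (toy quadratic loss, made-up hyperparameters):
import numpy as np

learning_rate, epsilon = 0.1, 1e-8
w = np.array([1.0, -2.0, 3.0])
sum_sq_grads = np.zeros_like(w)      # running sum of squared per-component gradients

def grad(w):                          # toy loss 0.5 * ||w||^2
    return w

for t in range(100):
    g = grad(w)
    sum_sq_grads += g ** 2                                        # = norm(...)^2, grows monotonically
    w -= learning_rate / (np.sqrt(sum_sq_grads) + epsilon) * g    # per-component step keeps shrinking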
ADADELTA and RMSProp
From the ADADELTA paper:
The idea presented in this paper was derived from ADAGRAD in order to improve upon the two main drawbacks
of the method: 1) the continual decay of learning rates
throughout training, and 2) the need for a manually selected
global learning rate.
The paper then explains an improvement that is meant to tackle the first drawback:
Instead of accumulating the sum of squared gradients over all
time, we restricted the window of past gradients that are accumulated
to be some fixed size w [...]. This ensures that learning continues
to make progress even after many iterations of updates have
been done.
Since storing w previous squared gradients is inefficient,
our methods implements this accumulation as an exponentially
decaying average of the squared gradients.
By "exponentially decaying average of the squared gradients" the paper means that for each i we compute a weighted average of all of the squared i-th components of all of the gradients that were calculated.
The weight of each squared i-th component is bigger than the weight of the squared i-th component in the previous step.
This is an approximation of a window of size w because the weights in earlier steps are very small.
(When I think about an exponentially decaying average, I like to visualize a comet's trail, which becomes dimmer and dimmer as it gets further from the comet.)
If you make only this change to ADAGRAD, then you will get RMSProp, which is a method that was proposed by Geoff Hinton in Lecture 6e of his Coursera Class.
So in RMSProp, the i-th component of the t-th step is given by:
learning_rate
- ------------------------------------------------ * gradient_i_t
sqrt(exp_decay_avg_of_squared_grads_i + epsilon)
while:
epsilon is a hyperparameter that prevents a division by zero.
exp_decay_avg_of_squared_grads_i is an exponentially decaying average of the squared i-th components of all of the gradients calculated (including gradient_i_t).
But as aforementioned, ADADELTA also aims to get rid of the learning_rate hyperparameter, so there must be more stuff going on in it.
In ADADELTA, the i-th component of the t-th step is given by:
sqrt(exp_decay_avg_of_squared_steps_i + epsilon)
- ------------------------------------------------ * gradient_i_t
sqrt(exp_decay_avg_of_squared_grads_i + epsilon)
while exp_decay_avg_of_squared_steps_i is an exponentially decaying average of the squared i-th components of all of the steps calculated (until the t-1-th step).
sqrt(exp_decay_avg_of_squared_steps_i + epsilon) is somewhat similar to momentum, and according to the paper, it "acts as an acceleration term". (The paper also gives another reason for why it was added, but my answer is already too long, so if you are curious, check out section 3.2.)
Adam
(Mostly based on Adam: A Method for Stochastic Optimization, the original Adam paper.)
Adam is short for Adaptive Moment Estimation (see this answer for an explanation about the name).
The i-th component of the t-th step is given by:
learning_rate
- ------------------------------------------------ * exp_decay_avg_of_grads_i
sqrt(exp_decay_avg_of_squared_grads_i) + epsilon
while:
exp_decay_avg_of_grads_i is an exponentially decaying average of the i-th components of all of the gradients calculated (including gradient_i_t).
Actually, both exp_decay_avg_of_grads_i and exp_decay_avg_of_squared_grads_i are also corrected to account for a bias toward 0 (for more about that, see section 3 in the paper, and also an answer in stats.stackexchange).
Note that Adam uses an exponentially decaying average of the i-th components of the gradients where most SGD methods use the i-th component of the current gradient. This causes Adam to behave like "a heavy ball with friction", as explained in the paper GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.
See this answer for more about how Adam's momentum-like behavior is different from the usual momentum-like behavior.
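Putting the pieces together, a single-parameter Adam loop (with the bias corrections mentioned above and the usual default hyperparameters) can be sketched as:
import numpy as np

learning_rate, beta1, beta2, epsilon = 0.001, 0.9, 0.999, 1e-8
w, m, v = 2.0, 0.0, 0.0              # parameter and its two exponentially decaying averages

def grad(w):                          # toy loss 0.5 * w**2
    return w

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g        # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2   # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)           # bias correction toward 0
    v_hat = v / (1 - beta2 ** t)
    w -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)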
Let's boil it down to a couple of simple questions:
Which optimizer would give me the best result/accuracy?
There is no silver bullet. Some optimizers for your task would probably work better than the others. There is no way to tell beforehand, you have to try a few to find the best one. Good news is that the results of different optimizers would probably be close to each other. You have to find the best hyperparameters for any single optimizer you choose, though.
Which optimizer should I use right now?
Maybe take AdamOptimizer and run it with learning_rate 0.001 and 0.0001. If you want better results, try running with other learning rates. Or try other optimizers and tune their hyperparameters.
Long story
There are a few aspects to consider when choosing your optimizer:
Ease of use (i.e. how fast you can find parameters that work for you);
Convergence speed (SGD is the baseline; the others are typically faster);
Memory footprint (typically between 0 and x2 sizes of your model);
Relation to other parts of the training process.
Plain SGD is the bare minimum that can be done: it simply multiplies the gradients by the learning rate and adds the result to the weights. SGD has a number of beautiful qualities: it has only 1 hyperparameter; it does not need any additional memory; it has minimal effect on the other parts of training. It also has 2 drawbacks: it might be too sensitive to learning rate choice and training can take longer than with other methods.
From these drawbacks of plain SGD we can see what the more complicated update rules (optimizers) are for: we sacrifice a part of our memory to achieve faster training and, possibly, simplify the choice of hyperparameters.
Memory overhead is typically insignificant and can be ignored, unless the model is extremely large, you are training on a GTX 760, or you are fighting for ImageNet leadership. Simpler methods like momentum or Nesterov accelerated gradient need 1.0x or less of the model size (the size of the model parameters). Second-order methods (Adam) might need twice as much memory and computation.
Convergence-speed-wise, pretty much anything is better than SGD, and anything beyond that is hard to compare. One note might be that AdamOptimizer is good at starting training almost immediately, without a warm-up.
I consider ease of use to be the most important factor in the choice of an optimizer. Different optimizers have a different number of hyperparameters and a different sensitivity to them. I consider Adam the simplest of all the readily available ones. You typically need to check 2-4 learning_rates between 0.001 and 0.0001 to figure out if the model converges nicely. For comparison, for SGD (and momentum) I typically try [0.1, 0.01, ... 10e-5]. Adam has 2 more hyperparameters that rarely have to be changed.
Relation between the optimizer and other parts of training. Hyperparameter tuning typically involves selecting {learning_rate, weight_decay, batch_size, dropout_rate} simultaneously. All of them are interrelated and each can be viewed as a form of model regularization. One, for example, has to pay close attention to whether exactly weight_decay or L2-norm is used, and possibly choose AdamWOptimizer instead of AdamOptimizer.