Is it really necessary to tune/optimize the learning rate when using ADAM optimizer? - tensorflow

Is it really necessary to optimize the initial learning rate when using ADAM as optimizer in tensorflow/keras? How can this be done (in tensorflow 2.x)?

It is. Like any other hyperparameter, the learning rate should be searched for. Your model may fail to learn if the learning rate is too large or too small, even with an optimizer like ADAM, which has nice properties regarding decay etc.
An example of how a model behaves under the ADAM optimizer as the learning rate varies can be seen in the article How to pick the best learning rate for your machine learning project
Looking for the right hyperparameters is called hyperparameter tuning. I am not using TF 2.* in my projects, so I will only give a reference to what TensorFlow itself offers: Hyperparameter Tuning with the HParams Dashboard
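To make the sensitivity concrete, here is a minimal sketch of a learning-rate sweep, assuming a hand-written Adam update in plain Python rather than the TensorFlow implementation. The toy objective f(x) = (x - 3)^2, the helper name adam_minimize and the candidate learning rates are all assumptions chosen for illustration.

```python
import math

def adam_minimize(grad, x0, lr, steps=200, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run plain Adam on a scalar parameter and return the final value."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective (an assumption for illustration): f(x) = (x - 3)^2
loss = lambda x: (x - 3.0) ** 2
grad = lambda x: 2.0 * (x - 3.0)

# Same optimizer, same step budget, only the initial learning rate changes.
results = {lr: loss(adam_minimize(grad, x0=0.0, lr=lr))
           for lr in (1e-4, 1e-2, 1e-1, 1.0)}
for lr, final_loss in results.items():
    print(f"lr={lr:g}  final loss={final_loss:.6f}")

best_lr = min(results, key=results.get)
print("best learning rate in this sweep:", best_lr)
```

With the smallest rate the parameter barely moves within the step budget, which is the "too small" failure mode described above. In TensorFlow 2.x the same kind of sweep would pass each candidate as the learning_rate argument of tf.keras.optimizers.Adam and compare validation loss.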

Related

Is there a Tensorflow or Keras equivalent to fastai's interp.plot_top_losses?

Is there a Tensorflow or Keras equivalent to fastai's interp.plot_top_losses? If not, how can I manually obtain the predictions with the greatest loss?
Thank you.
I found the answer, it is ktrain! It comes with a learning rate finder, learning rate schedules, ready-to-use pre-trained models and many more features inspired by fastai.
https://github.com/amaiya/ktrain
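For the manual route asked about in the question, a rough sketch is to compute the per-sample loss yourself and sort by it. The example below assumes a classifier that outputs class probabilities and uses cross-entropy. The helper top_losses and the probability values are made up for illustration and are not part of ktrain, Keras or fastai.

```python
import math

def top_losses(probs, labels, k=3):
    """Return the k (index, loss, predicted_class) triples with the largest loss.

    probs  : list of per-class probability lists (e.g. model predictions)
    labels : list of integer class ids
    """
    scored = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        loss = -math.log(max(p[y], 1e-12))            # per-sample cross-entropy
        pred = max(range(len(p)), key=p.__getitem__)  # argmax class
        scored.append((i, loss, pred))
    # Largest loss first
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

# Made-up probabilities standing in for real model output
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.05, 0.95]]
labels = [0, 0, 1, 1]

worst = top_losses(probs, labels, k=2)
for idx, loss, pred in worst:
    print(f"sample {idx}: loss={loss:.3f}, predicted={pred}, actual={labels[idx]}")
```

The returned indices can then be used to look up and display the corresponding inputs.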

Best case to use tensorflow

I followed all the steps mentioned in the article:
https://stackabuse.com/tensorflow-2-0-solving-classification-and-regression-problems/
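Then I compared the results with Linear Regression and found that its error (68) is lower than that of the TensorFlow model (84).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# X_train, X_test, y_train, y_test come from the train/test split in the article
logreg_clf = LinearRegression()
logreg_clf.fit(X_train, y_train)
pred = logreg_clf.predict(X_test)
# Root mean squared error on the test set
print(np.sqrt(mean_squared_error(y_test, pred)))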
Does this mean that if I have a large dataset, I will get better results than with linear regression?
What is the best situation - when I should be using tensorflow?
Answering your first question: neural networks are notorious for overfitting on smaller datasets. Here you are comparing a simple linear regression model with a neural network with two hidden layers on the test set, so it is not very surprising to see the MLP fall behind the linear regression model (assuming that you are working with a relatively small dataset). Larger datasets generally help neural networks learn more accurate parameters and generalize better.
Now coming to your second question: TensorFlow is basically a library for building deep learning models. Whenever you are working on a deep learning problem such as image recognition or Natural Language Processing, you need massive computational power and will be processing a ton of data to train your models. This is where TensorFlow comes in handy: it offers GPU support, which significantly speeds up training that would otherwise be practically impossible. Moreover, if you are building a product that has to be deployed in a production environment, you can make use of TensorFlow Serving, which helps you bring your models much closer to the customers.

Using scikit learn for Neural Networks vs Tensorflow in training

I was implementing some sample Neural networks and in most tutorials saw this statement.
Neural networks tend to work better on GPUs than on CPU.
The scikit-learn framework isn’t built for GPU optimization.
So does this statement ("work better") refer solely to the training phase of a neural network, or does it include the prediction part as well? I would greatly appreciate some explanation on this.
That statement refers to the training phase. The only issue here is that you can explore the search space of feasible models in a more efficient way using a GPU so you will probably find better models in less time. However, this is only related to computational costs and not to model predictive performance.

Optimizers in Tensorflow

From various examples of Tensorflow (translation, ptb) it seems that you need to explicitly change the learning rate when using GradientDescentOptimizer. But is that also the case when using more 'sophisticated' techniques like Adagrad, Adadelta etc.? Also, when we continue training the model from a saved instance, are the past values used by these optimizers saved in the model file?
It depends on the optimizer you are using. Vanilla SGD needs (accepts) individual adaptation of the learning rate, and so do some others. Adadelta, for example, does not. (https://arxiv.org/abs/1212.5701)
So this depends not so much on Tensorflow but rather on the mathematical background of the optimizer you are using.
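As an illustration of that mathematical background, here is a minimal sketch of the Adadelta update as described in the linked paper, written in plain Python for a single scalar parameter. The toy objective f(x) = x^2 and the helper name adadelta_minimize are assumptions for illustration. Library implementations may still expose a learning-rate argument as an extra scaling factor.

```python
import math

def adadelta_minimize(grad, x0, steps=200, rho=0.95, eps=1e-6):
    """Adadelta on a scalar parameter; note there is no learning-rate argument."""
    x = x0
    avg_sq_grad = 0.0   # running average of squared gradients, E[g^2]
    avg_sq_step = 0.0   # running average of squared updates,   E[dx^2]
    for _ in range(steps):
        g = grad(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # The step size is the ratio of the two running RMS values.
        step = -math.sqrt(avg_sq_step + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1 - rho) * step * step
        x += step
    return x

# Toy objective (an assumption for illustration): f(x) = x^2
loss = lambda x: x * x
grad = lambda x: 2.0 * x

x_start = 5.0
x_end = adadelta_minimize(grad, x_start)
print(f"loss before={loss(x_start):.4f}, after={loss(x_end):.4f}")
```

The only tunables are the decay rho and the small constant eps. The effective step size is derived from the history of gradients and updates.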
Furthermore: yes, saving and restarting the training does not reset the learning rates; training continues from the point saved.
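Why the saved optimizer values matter can be sketched with a toy example. The code below assumes a hand-written Adam step in plain Python and uses JSON as a stand-in for a checkpoint file. It is not TensorFlow's checkpoint format, and the names adam_step and train are hypothetical.

```python
import json
import math

def adam_step(state, grad_value, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a state dict holding x, m, v and step count t."""
    t = state["t"] + 1
    m = beta1 * state["m"] + (1 - beta1) * grad_value
    v = beta2 * state["v"] + (1 - beta2) * grad_value ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    x = state["x"] - lr * m_hat / (math.sqrt(v_hat) + eps)
    return {"x": x, "m": m, "v": v, "t": t}

grad = lambda x: 2.0 * (x - 3.0)   # toy objective f(x) = (x - 3)^2

def train(state, steps):
    """Run the given number of Adam steps and return the new state."""
    for _ in range(steps):
        state = adam_step(state, grad(state["x"]))
    return state

fresh = {"x": 0.0, "m": 0.0, "v": 0.0, "t": 0}
uninterrupted = train(fresh, 20)

# Train 10 steps, save the weight AND the optimizer values, reload, continue.
checkpoint = json.dumps(train(fresh, 10))
resumed = train(json.loads(checkpoint), 10)

# Restoring only the weight and resetting m, v, t gives a different trajectory.
half = train(fresh, 10)
weights_only = train({"x": half["x"], "m": 0.0, "v": 0.0, "t": 0}, 10)

print("resumed matches uninterrupted run:", resumed == uninterrupted)
print("weights-only restart x:", weights_only["x"], "vs", uninterrupted["x"])
```

In TensorFlow 2.x, including the optimizer in a tf.train.Checkpoint alongside the model is one way to have these accumulated values saved and restored.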

How do I choose an optimizer for my tensorflow model?

Tensorflow seems to have a large collection of optimizers. Is there any high-level guideline (or review paper) on which one is best adapted to specific classes of loss functions?
It depends on your datasets and NN models, but generally, I would start with Adam. Figure 2 in this paper (http://arxiv.org/abs/1412.6980) shows Adam works well.
Also, you can see a very nice animation from
http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html.