How do I choose an optimizer for my tensorflow model? - tensorflow

TensorFlow seems to have a large collection of optimizers. Is there any high-level guideline (or review paper) on which one is best adapted to specific classes of loss functions?

It depends on your dataset and NN model, but generally I would start with Adam. Figure 2 in this paper (http://arxiv.org/abs/1412.6980) shows that Adam works well.
Also, you can see a very nice animation of how different optimizers behave at
http://www.denizyuret.com/2015/03/alec-radfords-animations-for.html.
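To make that concrete, here is a minimal sketch of selecting Adam in tf.keras (TensorFlow 2.x); the architecture, loss, and learning rate are placeholders for your own problem:

```python
import tensorflow as tf

# A small feed-forward model; the layer sizes here are just placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Adam with its default learning rate of 1e-3; swapping in
# tf.keras.optimizers.SGD, Adagrad, RMSprop, etc. lets you compare
# optimizers on your own data.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)
```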

Related

Is it really necessary to tune/optimize the learning rate when using the ADAM optimizer?

Is it really necessary to optimize the initial learning rate when using ADAM as the optimizer in TensorFlow/Keras? How can this be done (in TensorFlow 2.x)?
It is. Like any hyperparameter, an optimal learning rate has to be searched for. Your model may fail to learn if the learning rate is too big or too small, even with an optimizer like ADAM, which has nice properties regarding decay etc.
An example of how a model behaves under the ADAM optimizer with respect to the learning rate can be seen in the article How to pick the best learning rate for your machine learning project.
Searching for the right hyperparameters is called hyperparameter tuning. I am not using TF 2.x in my projects, so I will point to what TensorFlow itself offers: Hyperparameter Tuning with the HParams Dashboard.
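As a rough illustration of such a search (much simpler than the HParams dashboard), here is a sketch of sweeping a few learning rates with tf.keras; the model, the dummy data, and the candidate rates are all made up for the example:

```python
import numpy as np
import tensorflow as tf

# Dummy data standing in for your real training/validation sets.
x_train = np.random.rand(256, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(256, 1))
x_val = np.random.rand(64, 20).astype("float32")
y_val = np.random.randint(0, 2, size=(64, 1))

def build_model(learning_rate):
    # Placeholder architecture; replace with your own model.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Sweep a few orders of magnitude and keep the rate with the best
# validation score; too big and the loss diverges, too small and it stalls.
for lr in [1e-2, 1e-3, 1e-4]:
    history = build_model(lr).fit(x_train, y_train,
                                  validation_data=(x_val, y_val),
                                  epochs=5, verbose=0)
    print(lr, max(history.history["val_accuracy"]))
```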

Which model (GPT2, BERT, XLNet, etc.) would you use for a text classification task? Why?

I'm trying to train a model for a sentence classification task. The input is a sentence (a vector of integers) and the output is a label (0 or 1). I've seen some articles here and there about using BERT and GPT2 for text classification tasks. However, I'm not sure which one I should pick to start with. Which of these recent NLP models (the original Transformer, BERT, GPT2, XLNet) would you use to start with, and why? I'd rather implement it in TensorFlow, but I'm flexible enough to go with PyTorch too.
Thanks!
It depends highly on your dataset, and it is part of the data scientist's job to find which model is most suitable for a particular task in terms of the selected performance metric, training cost, model complexity, etc.
When you work on the problem you will probably test all of the above models and compare them. Which one should you try first? Andrew Ng in "Machine Learning Yearning" suggests starting with a simple model so you can iterate quickly and test your ideas, data-preprocessing pipeline, etc.:
Don’t start off trying to design and build the perfect system. Instead, build and train a basic system quickly—perhaps in just a few days.
Following this suggestion, you can start with a simpler model such as ULMFiT as a baseline, verify your ideas, and then move on to more complex models and see how they improve your results.
Note that modern NLP models contain a large number of parameters, and it is difficult to train them from scratch without a large dataset. That's why you may want to use transfer learning: you can download a pre-trained model, use it as a basis, and fine-tune it on your task-specific dataset to achieve better performance and reduce training time.
I agree with Max's answer, but if the constraint is to use a state-of-the-art large pretrained model, there is a really easy way to do this: the HuggingFace library called pytorch-transformers. Whether you choose BERT, XLNet, or whatever, they're easy to swap out. Here is a detailed tutorial on using that library for text classification.
EDIT: I just came across this repo, pytorch-transformers-classification (Apache 2.0 license), which is a tool for doing exactly what you want.
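To give a feel for how easy the swapping is, here is a minimal sketch using the library's successor, transformers (pytorch-transformers was later renamed), with its TensorFlow classes; a recent library version is assumed, and the checkpoint name and num_labels=2 are just one choice for a binary task:

```python
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# Swapping architectures is just a matter of changing this string,
# e.g. "bert-base-uncased", "xlnet-base-cased", "distilbert-base-uncased".
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The classification head is freshly initialized; it still needs to be
# fine-tuned on your labeled sentences before its outputs mean anything.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name,
                                                             num_labels=2)

inputs = tokenizer("An example sentence to classify.", return_tensors="tf")
outputs = model(inputs)
print(outputs.logits)  # shape (1, 2): one raw score per class
```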
Well, as others mentioned, it depends on the dataset; multiple models should be tried and the best one chosen.
However, sharing my experience: XLNet beats all other models so far by a good margin. Hence, if learning is not the objective, I would simply start with XLNet, then try a few more down the line, and conclude. It just saves time in exploring.
The repo below is excellent for doing all of this quickly. Kudos to them.
https://github.com/microsoft/nlp-recipes
It uses Hugging Face transformers and makes them dead simple. 😃
I have used XLNet, BERT, and GPT2 for summarization tasks (English only). Based on my experience, GPT2 works the best among all 3 on short paragraph-size notes, while BERT performs better for longer texts (up to 2-3 pages). You can use XLNet as a benchmark.

How to choose the threshold of the output of a DNN in TensorFlow?

I am currently learning to build neural networks with TensorFlow, and the library provides a very convenient way to create one with the DNNClassifier estimator, as in this tutorial: https://www.tensorflow.org/get_started/premade_estimators.
However, I don't see how to choose the final threshold of the output layer before making the prediction:
For instance, let's say we have a binary classifier between 'KO' and 'OK'. The end of the neural network computes the probability of each class for a specific sample, for instance [0.4, 0.6] (so 40% that the answer is 'KO' and 60% that the answer is 'OK'). I assume that the DNN uses a threshold of 0.5 by default, so it will answer 'OK' here. But I want to change this threshold to 0.8, so that if the DNN is not at least 80% sure about 'OK', it will answer 'KO' (in order to tune the FP rate and the FN rate).
How can we do that?
Thanks in advance for your help.
The premade estimators are somewhat rigid. The DNNClassifier, for example, does not provide a mechanism to change the loss function or to obtain the logits/probabilities output by the classifier, as you've discovered.
To modify the logic of how predictions are generated, or to modify your loss function, you'll have to create a custom Estimator. This tutorial walks you through that process.
If you haven't invested too much time learning how to use the Estimator API yet, I recommend you also acquaint yourself with Keras, another high-level API for building and training deep learning models in TensorFlow; you might find it easier to build custom models with Keras rather than Estimators.
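Whichever API you end up with, the thresholding itself is straightforward once you have access to the predicted probabilities. A minimal sketch, assuming probabilities laid out as [P('KO'), P('OK')] like the example in the question:

```python
import numpy as np

# probs would come from something like model.predict(x) on a
# two-class softmax output; hard-coded here for illustration.
probs = np.array([[0.4, 0.6],
                  [0.1, 0.9]])

threshold = 0.8
# Answer 'OK' only when the model is at least 80% confident; otherwise 'KO'.
labels = np.where(probs[:, 1] >= threshold, "OK", "KO")
print(labels)  # ['KO' 'OK']
```

Raising the threshold trades false positives for false negatives, which is exactly the tuning the question asks about.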

How to predict using TensorFlow?

This is a newbie question for the tensorflow experts:
I'm reading a lot of data from a power transformer connected to an array of solar panels using Arduinos. My question is: can I use TensorFlow to predict the power generation in the future?
I am completely new to TensorFlow. If you can point me to something similar, I can start with that, or with any GitHub repo doing similar predictive modeling.
Edit: Kyle pointed me to the MNIST data, which I believe is an image dataset. Again, I'm not sure whether TensorFlow is the right computation library for this problem, or does it only work on image datasets?
thanks, Rajesh
You can certainly use TensorFlow to solve your problem.
TensorFlow™ is an open source software library for numerical computation using data flow graphs.
So it works not only on image datasets but on others as well. Don't worry about this.
As for prediction: first you need to train a model (such as linear regression) on your dataset, then predict. The tutorial code can be found on the TensorFlow homepage.
Get your hands dirty and you will find that it works on your dataset.
Good luck.
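To make that concrete, here is a minimal sketch of training a linear-regression-style model in tf.keras on made-up data; in your case the features would be the Arduino sensor readings and the target the measured power output:

```python
import numpy as np
import tensorflow as tf

# Toy data standing in for sensor readings (e.g. irradiance, temperature,
# hour of day) and the measured power output.
x = np.random.rand(500, 3).astype("float32")
true_weights = np.array([2.0, -1.0, 0.5], dtype="float32")  # made-up relation
y = x @ true_weights + np.random.normal(0, 0.05, 500).astype("float32")

# A single Dense unit with no activation is plain linear regression.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, epochs=10, verbose=0)

print(model.predict(x[:3]))  # predicted power for the first three samples
```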
You can absolutely use TensorFlow to predict time series. There are plenty of examples out there, like this one. And this is a really interesting one on using an RNN to predict basketball trajectories.
In general, TF is a very flexible platform for solving problems with machine learning. You can create any kind of network you can think of in it and train that network to act as a model for your process. Depending on what kind of cost you define and how you train it, you can build a network to classify data into categories, predict a time series a number of steps forward, and other cool stuff.
There is, sadly, no short answer for how to do this, but that's just because the possibilities are endless! Have fun!
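As one possible starting point, here is a sketch of a small recurrent model that predicts the next value of a univariate series from a sliding window; the synthetic sine-wave data, the window length, and the layer sizes are all arbitrary choices for illustration:

```python
import numpy as np
import tensorflow as tf

# Synthetic series standing in for, say, hourly power readings.
series = np.sin(np.linspace(0, 50, 1000)).astype("float32")

window = 24  # predict the next step from the previous 24 readings
x = np.stack([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
x = x[..., np.newaxis]  # (samples, timesteps, features), as RNN layers expect

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, input_shape=(window, 1)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=5, verbose=0)

# Forecast one step ahead from the most recent window.
print(model.predict(series[-window:].reshape(1, window, 1)))
```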

Optimizers in TensorFlow

From various TensorFlow examples (translation, ptb) it seems that you need to change the learning rate explicitly when using GradientDescentOptimizer. But is that also the case when using more 'sophisticated' techniques like Adagrad, Adadelta, etc.? Also, when we continue training a model from a saved instance, are the past values used by these optimizers saved in the model file?
It depends on the optimizer you are using. Vanilla SGD needs (accepts) individual adaptation of the learning rate; some others do too. Adadelta, for example, does not (https://arxiv.org/abs/1212.5701).
So this depends not so much on TensorFlow but rather on the mathematical background of the optimizer you are using.
Furthermore: yes, saving and restarting the training does not reset the optimizers' internal values; training continues from the point where it was saved.
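In current TensorFlow (2.x) you can see this explicitly: including the optimizer in a tf.train.Checkpoint saves its internal state (step count, Adam's moment estimates, Adagrad's accumulators, and so on) together with the model weights. A minimal sketch:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
optimizer = tf.keras.optimizers.Adam()

# Including the optimizer captures its slot variables, not just the weights.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)

# ... train for a while, then:
path = ckpt.save("/tmp/training_ckpt")

# Later, or in a fresh process: restore, and training resumes with the
# optimizer exactly where it left off.
ckpt.restore(path)
```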