Optimizer and Estimator in Neural Networks - tensorflow

When I started with neural networks, I thought I understood optimizers and estimators well.
Estimators: a Classifier classifies a value based on a sample set, and a Regressor predicts a value based on a sample set.
Optimizer: different optimizers (Adam, GradientDescentOptimizer) are used to minimise the loss function, which can be complex.
I understand that every estimator comes with a default optimizer that it uses internally to minimise the loss.
Now my question is: how do they fit together to optimize the training?

Short answer: the loss function links them together.
For example, if you are doing classification, your classifier takes an input and outputs a prediction. You can then calculate the loss by comparing the predicted class with the ground-truth class. The task of the optimizer is to minimize that loss by modifying the parameters of the classifier.
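To make the link concrete, here is a minimal sketch in plain Python. It is not how TensorFlow estimators are implemented internally. It uses a one-parameter-pair linear regressor instead of a classifier to keep the arithmetic short, and the toy data, learning rate and function names are assumptions chosen for illustration.

```python
# Toy data drawn from y = 2x + 1 (an assumption made for this illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def predict(w, b, x):
    """The 'estimator': maps an input to a prediction using parameters w, b."""
    return w * x + b

def mse_loss(w, b):
    """The loss: compares predictions with the ground truth."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradients(w, b):
    """Derivatives of the MSE loss with respect to w and b."""
    n = len(xs)
    dw = sum(2.0 * (predict(w, b, x) - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (predict(w, b, x) - y) for x, y in zip(xs, ys)) / n
    return dw, db

# The 'optimizer': plain gradient descent that only ever looks at the loss.
w, b = 0.0, 0.0
learning_rate = 0.05
initial_loss = mse_loss(w, b)
for _ in range(2000):
    dw, db = gradients(w, b)
    w -= learning_rate * dw
    b -= learning_rate * db
final_loss = mse_loss(w, b)
```

The model never talks to the optimizer directly: the optimizer only sees the loss and its gradients, and the model only sees its updated parameters.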

Related

Tensorflow loss converging but model fails to predict even on train data

I am using an ANN with TensorFlow to fit a simple known equation, Y=Sin(X) or Y=Cos(X). My loss function converges properly.
[Figure: loss function convergence graph] If the loss function converges, it should mean the model has fitted my training dataset well.
However, when I call predict on the training set itself, the model fails to predict even the training data, which is strange.
Here it can be seen that after the 200th value the model shows no sign of having been trained at all.
If the loss has converged, then the model should fit the training dataset perfectly, but that is not happening here. What is wrong in my code?
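import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# Training data: 1000 points of sin(x) over five periods
X = np.linspace(0,10*np.pi,1000)
Y = np.sin(X)
# One wide hidden ReLU layer followed by a linear output
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(500,input_shape=(1,),activation='relu'))
model.add(tf.keras.layers.Dense(1))
opt = tf.keras.optimizers.Adam(0.01)
model.compile(optimizer=opt,loss='mse')
r= model.fit(X.reshape(-1,1),Y,epochs=100)
# Plot the training loss
plt.plot(r.history['loss'])
# Predict on the training inputs and compare with the targets
Yhat = model.predict(X.reshape(-1,1)).flatten()
plt.plot(Y)
plt.plot(Yhat)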
It is the nature of your data.
It reminds me of the old result showing that a single-layer perceptron cannot compute even XOR.
Anyway, the reason here is that your model is shallow, and shallow networks are much less efficient than deep networks. To put this in perspective, a model like the one below
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(20,input_shape=(1,),activation='relu'))
model.add(tf.keras.layers.Dense(20,activation='relu'))
model.add(tf.keras.layers.Dense(1))
will likely perform better even though it has only about 1/3 of the parameters of the original model, because the deeper you go, the more complex the representations the model can create. The core thing to remember is:
A DEEP LEARNING MODEL DOESN'T BUILD NON-LINEAR DECISION BOUNDARIES, AS EACH AND EVERY
UNIT IS FUNDAMENTALLY DESIGNED TO CREATE A LINEAR DECISION BOUNDARY. So what does
it do? BY STACKING THOSE LINEAR DECISION BOUNDARIES, IT BUILDS A REPRESENTATION OF
THE DATA THAT IS LINEARLY SEPARABLE.
Also, the most important thing is to know your data. In this case, probabilistic models will give almost perfect results. You can implement them easily using TensorFlow Probability.
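The parameter comparison between the two models can be checked by hand. The sketch below assumes the standard count for a fully connected layer (one weight per input-output pair plus one bias per output); the helper name dense_params is made up for this illustration.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a dense layer: weights plus biases."""
    return n_in * n_out + n_out

# Original model from the question: 1 -> 500 -> 1
wide = dense_params(1, 500) + dense_params(500, 1)

# Deeper model from the answer: 1 -> 20 -> 20 -> 1
deep = dense_params(1, 20) + dense_params(20, 20) + dense_params(20, 1)

ratio = deep / wide
print(wide, deep, round(ratio, 2))  # 1501 481 0.32
```

So the deeper model has roughly a third of the parameters of the wide one.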

What loss function to use in Keras when metric is SparseTopKCategoricalAccuracy/TopKCategoricalAccuracy?

For multiclass classification problems, Keras and tf.keras have metrics like SparseTopKCategoricalAccuracy and TopKCategoricalAccuracy. However, if one uses loss functions like SparseCategoricalCrossentropy or CategoricalCrossentropy, they cannot achieve the max values for these two metrics.
What is a good loss function to use when one wants to maximize SparseTopKCategoricalAccuracy or TopKCategoricalAccuracy?
I understand that SparseTopKCategoricalAccuracy is not differentiable, just like Accuracy. I am trying to find a function that can approximate the smooth loss function and yield a higher number for SparseTopKCategoricalAccuracy.
Cross-entropy is not the best loss function when you deal with top-k accuracy, because it may be prone to overfitting on small datasets or with noisy labels.
As you have already pointed out, "smooth loss" functions have been developed for top-k classification with SVMs. To my knowledge, there is no "off-the-shelf" loss function in Keras/TF that is best suited for top-k. However, I suggest you try the Smooth Surrogate Loss (SSL) presented in the article and implemented in PyTorch for use with deep neural networks (see GitHub). It derives from multi-class SVMs, as SSL creates a margin between the correct top-k predictions and the incorrect ones. The training time of SSL is comparable to that of cross-entropy, thanks to a divide-and-conquer approach and the use of polynomials (see implementation).
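For reference, the metric under discussion can be sketched in a few lines of plain Python. This is an illustration of what top-k accuracy measures, not the Keras implementation; the function name and the example scores are assumptions. Because the result depends only on the ranking of the scores, small changes to the scores usually leave it unchanged, which is why it cannot be optimized directly by gradient descent.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the classes sorted by descending score, keeping the top k.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Hypothetical predicted scores for 3 samples and 3 classes, with sparse labels.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [2, 1, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

In this example none of the top-1 predictions is correct, while two of the three true labels appear among the top-2 predictions.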

diagnosis on training process of neural network

I am training an autoencoder DNN for a regression problem and need suggestions on how to improve the training process.
The total number of training samples is about 100,000. I use Keras to fit the model, setting validation_split = 0.1. After training, I plotted how the loss changes over training and got the following picture. [Figure: training and validation loss] As can be seen here, the validation loss is unstable and its mean value is very close to the training loss.
My question is: based on this, what is the next step I should try to improve the training process?
[Edit on 1/26/2019]
The details of network architecture are as follows:
It has 1 latent layer of 50 nodes. The input and output layers have 1000 nodes each. The activation of the hidden layer is ReLU. The loss function is MSE. For the optimizer, I use Adadelta with default parameter settings. I also tried setting lr=0.5, but got very similar results. The different features of the data have been scaled to between -10 and 10, with a mean of 0.
Judging by the graph provided, the network could not approximate the function that relates the input to the output.
If your features are too diverse, that is, one of them is large while the others have very small values, then you should normalize the feature vector. You can read more here.
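As a rough sketch of what normalizing the features can look like, the snippet below standardizes each feature column to zero mean and unit variance. Standardization is only one possible choice (the answer does not say which normalization it has in mind), and the helper name and the example values are assumptions.

```python
import math

def standardize(column):
    """Rescale one feature column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    variance = sum((v - mean) ** 2 for v in column) / len(column)
    std = math.sqrt(variance) or 1.0  # avoid dividing by zero for constant features
    return [(v - mean) / std for v in column]

# Hypothetical features on very different scales.
large_feature = [1000.0, 2000.0, 3000.0, 4000.0]
small_feature = [0.001, 0.002, 0.003, 0.004]

scaled_large = standardize(large_feature)
scaled_small = standardize(small_feature)
```

After scaling, both columns hold the same values, so neither dominates the loss simply because of its units. In practice the mean and standard deviation would be computed on the training split only and reused for validation data.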
For better training and testing results, you can follow these tips:
Use a small network. A network with one hidden layer is enough.
Apply activations in the input as well as the hidden layers. The output layer must have a linear activation. Use the ReLU activation function.
Prefer a small learning rate like 0.001. Use the RMSProp optimizer. It works fine on most regression problems.
If you are not using the mean squared error loss, use it.
Try slow and steady learning rather than fast learning.

Difference between fit and evaluate in Keras

I used 100000 samples to train a general model in Keras and achieved good performance. Then, for one particular sample, I want to use the trained weights as initialization and continue optimizing them to further reduce the loss on that sample.
However, a problem occurred. First, I loaded the trained weights with the Keras API, then evaluated the loss on that one sample; it is close to the validation loss seen during training, which I think is normal. However, when I use the trained weights as the initialization and further optimize them on that one sample with model.fit(), the loss is really strange: it is much higher than the evaluate result and only gradually becomes normal after several epochs.
I find it strange that, for the same sample and the same loaded weights, model.fit() and model.evaluate() return different results. I use batch normalization layers in my model and wonder whether they may be the reason. The result of model.evaluate() seems normal, as it is close to what I saw on the validation set before.
So what causes the difference between fit and evaluate? How can I solve it?
I think your core issue is that you are observing two different loss values during fit and evaluate. This has been discussed extensively in several other questions.
The loss reported by fit() includes contributions from:
Regularizers: L1/L2 regularization losses are added during training, increasing the loss value.
Batch norm variations: during training, batch norm normalizes with the mean and variance of the current batch while collecting running statistics, whereas evaluate uses those running statistics instead. According to the discussion linked here, this can happen irrespective of whether batch norm is set to trainable or not.
Multiple batches: the training loss is averaged over multiple batches, so if you take the average over the first 100 batches and evaluate on the 100th batch only, the results will be different.
Evaluate, on the other hand, just does forward propagation and gives you the loss value; nothing random there.
The bottom line is that you should not compare train and validation loss (or fit and evaluate loss) directly. Those functions do different things. Look at other metrics to check whether your model is training fine.
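The batch norm point is the one most likely to matter when fitting on a single sample. The sketch below is a simplified, scalar version of batch normalization in plain Python, not the Keras layer; the function names, the epsilon value and the running statistics are assumptions picked for illustration.

```python
import math

EPSILON = 1e-3  # small constant for numerical stability (illustrative value)

def batch_norm_training(batch, gamma=1.0, beta=0.0):
    """Training mode: normalize with the mean and variance of the current batch."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(var + EPSILON) + beta for v in batch]

def batch_norm_inference(batch, running_mean, running_var, gamma=1.0, beta=0.0):
    """Inference mode: normalize with statistics accumulated during training."""
    scale = math.sqrt(running_var + EPSILON)
    return [gamma * (v - running_mean) / scale + beta for v in batch]

# A "batch" that contains only one sample, as in the question.
single_sample = [3.0]

# Training mode: the batch mean equals the sample itself, so the output is beta.
train_out = batch_norm_training(single_sample)

# Inference mode with hypothetical running statistics (mean 1.0, variance 4.0).
eval_out = batch_norm_inference(single_sample, running_mean=1.0, running_var=4.0)
```

With one sample per batch, the training-mode output collapses to beta no matter what the input is, while the inference-mode output still depends on the input. The layers after batch norm therefore see very different activations in the two modes, which is consistent with a much higher loss at the start of fit() than in evaluate().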

sampled_softmax_loss vs negative sampling

I am working on a text autoencoder and want to use negative sampling to train the model. I want to know the difference between negative sampling and sampled softmax.
Thanks in advance
https://www.tensorflow.org/extras/candidate_sampling.pdf
According to TensorFlow, negative sampling relates to logistic loss, while sampled softmax relates to softmax.
Both of them, at the core, pick a sample of negative examples to compute the loss on and update gradients.
For your model, use a sampled loss if your output is very large (many classes) AND the regular loss is too slow to compute. If the output has few classes, there is not much gain. If the training is fast anyway, why bother with approximations?
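The difference between the two can be sketched for a single training example in plain Python. This is a simplified illustration, not the TensorFlow implementation: the function names are made up, the scores are assumed to be already computed for the true class and for a handful of sampled negative classes, and the correction for the sampling distribution described in the linked document is left out.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def negative_sampling_loss(true_score, negative_scores):
    """Logistic loss: the true class is a positive, each sampled class a negative."""
    loss = -log_sigmoid(true_score)
    for score in negative_scores:
        loss -= log_sigmoid(-score)
    return loss

def sampled_softmax_loss(true_score, negative_scores):
    """Softmax cross-entropy restricted to the true class plus the sampled classes."""
    candidates = [true_score] + list(negative_scores)
    largest = max(candidates)
    log_sum_exp = largest + math.log(sum(math.exp(s - largest) for s in candidates))
    return log_sum_exp - true_score

# Hypothetical scores: one true class and two sampled negative classes.
true_score = 2.0
negative_scores = [0.0, -1.0]

ns_loss = negative_sampling_loss(true_score, negative_scores)
ss_loss = sampled_softmax_loss(true_score, negative_scores)
```

Negative sampling treats every candidate as an independent binary decision, while sampled softmax makes the candidates compete within one normalized distribution. Both avoid computing scores for the full output vocabulary.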