Does the training loss diagram show over-fitting? Deep Q-learning - tensorflow

The diagram below shows the training loss against epoch. Based on the diagram, does it mean the model is over-fitting? If not, what is causing the spikes in the loss along the epochs? Overall, the loss shows a decreasing trend. How should I tune my settings in deep Q-learning?

Such a messy loss trajectory would usually mean that the learning rate is too high for the given smoothness of the loss function.
An alternative interpretation is that the loss function is not at all predictive of the success at the given task.

Related

GAN - loss and evaluation of model

I'm struggling with understanding how to "objectively" evaluate a GAN (that is, not simply look at what it generates saying "this looks good/bad").
My understanding is that the discriminator should get a head start and, in theory, discriminator loss and generator loss both ought to converge to 0.5 - at which point both are equally "good".
I'm currently training a model, and I get discriminator loss beginning at 0.7 but quickly converging toward 0.25, and generator loss beginning at 50 and converging toward 0.35 (possibly less with further training).
This doesn't entirely make sense. How can both be better than 0.5?
Are my loss functions incorrect, or what else am I missing? How should performance be measured?
In a GAN setting, it is normal for both losses to look better than 0.5, because you train only one of the networks at a time, so each update lets that network beat the other.
You can evaluate the generated output with metrics such as PSNR, SSIM, FID, L2, LPIPS, VGG, or something similar (depending on your particular task). How to evaluate an image objectively is still an ongoing area of research, and these metrics are generally also used as loss objectives in certain tasks.
I recommend looking at something like "Analysis and Evaluation of Image Quality Metrics".
I would recommend you track the generator metrics over time to see if it's improving, and obviously confirm that visually as well. You can use logging to see the metric changes, or a visualization tool such as TensorBoard or wandb.
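As an illustration of one of the metrics named above, here is a minimal NumPy sketch of PSNR using its standard definition, 10 * log10(MAX^2 / MSE). The function name `psnr` and the `max_value` default of 255 (8-bit images) are assumptions made for this example, not something from the original answer.

```python
import numpy as np

def psnr(reference, generated, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images.

    max_value is the largest possible pixel value (assumed 255 for 8-bit images).
    """
    reference = np.asarray(reference, dtype=np.float64)
    generated = np.asarray(generated, dtype=np.float64)
    mse = np.mean((reference - generated) ** 2)
    if mse == 0.0:
        # Identical images: the ratio is unbounded.
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)
```

Higher is better. Logging this value for a fixed set of reference/generated pairs after each epoch gives one curve to watch alongside the visual check.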

Is it possible to estimate the time needed to train a machine learning model given a size of data and hardware specification?

I am planning to make a small TensorFlow image classification project, which is expected to run on machines with low processing power, and one of the concerns I was asked about was the time needed to train the model.
The project is still in the conception stage and no clear boundary is made.
But assuming that we will use TensorFlow for Python, with a simple neural network for, say, a data set of n images, is there a way to estimate or predict the time required to train the model before performing the training, given the hardware in use?
I have asked one of my colleagues who works in NN, and he said that maybe we could calculate the time needed by measuring the time for the first epoch and estimating how many epochs are needed afterwards. Is this a valid way? If yes, is it even possible to estimate the number of epochs needed? And in either case, is there a way to calculate it before performing any training?
There is no definite way of finding the number of epochs after which the model converges. It is one of the hyperparameters.
Apart from the type of model you are training, convergence also depends on the distribution of data, and the optimizer you are using.
You can make a rough estimate by looking at the number of parameters in your model, checking the time for one epoch, and getting a rough idea from "experience" of the number of epochs. BUT you always have to look at the training and validation loss curves to check for convergence.
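The "time one epoch and extrapolate" idea can be sketched as follows. The helper `estimate_training_time` and the stand-in `dummy_epoch` are hypothetical names for this example; in a real project the callable would wrap something like a single-epoch training call, and `expected_epochs` remains a guess from experience.

```python
import time

def estimate_training_time(run_one_epoch, expected_epochs):
    """Time a single epoch and extrapolate linearly to expected_epochs.

    Returns (seconds_per_epoch, estimated_total_seconds).
    """
    start = time.perf_counter()
    run_one_epoch()
    seconds_per_epoch = time.perf_counter() - start
    return seconds_per_epoch, seconds_per_epoch * expected_epochs

def dummy_epoch():
    # Stand-in for one real training epoch (assumption for this sketch).
    time.sleep(0.01)

per_epoch, total = estimate_training_time(dummy_epoch, expected_epochs=50)
print(f"{per_epoch:.3f} s per epoch, roughly {total:.1f} s for 50 epochs")
```

The first epoch often includes one-time setup cost, so timing the second epoch may give a more representative figure. The estimate only covers time per epoch; it says nothing about how many epochs convergence will actually take.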

CNN model's train_loss goes down well but val_loss changes a lot

I'm using keras(tensorflow) to train my own CNN model.
As shown in the chart, the train_loss goes down well, but val_loss changes a lot from epoch to epoch.
What can be the reason, and what should I do to improve it?
This is typical behavior when training in deep learning. Think about it: your target loss is the training loss, so it is directly affected by the training process and, as you said, "goes down well". The validation loss is only affected indirectly, so naturally it will be more volatile in comparison.
When you are training, the model is attempting to estimate the real distribution of the data, but all it has to rely on is the distribution of the training dataset (which is similar but not the same).
The big spike at the end of your loss curve might be the result of over-fitting. If you are not using a decaying learning rate during training, I would suggest it.
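A decaying learning rate can be sketched in plain Python as an exponential schedule. The function name and the numbers below are assumptions for illustration; Keras provides a comparable built-in schedule (`tf.keras.optimizers.schedules.ExponentialDecay`), which is probably the more practical choice in the asker's setup.

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    """Learning rate after `step` steps: initial_lr * decay_rate ** (step / decay_steps)."""
    return initial_lr * decay_rate ** (step / decay_steps)

# Hypothetical settings: halve the learning rate every 10 epochs.
schedule = [exponential_decay(0.01, 0.5, 10, epoch) for epoch in range(0, 31, 10)]
print(schedule)
```

With these settings the rate starts at 0.01 and is halved every 10 epochs, so late updates are smaller and less likely to produce a large jump in the loss.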

High variability loss of neural networks

I'm getting really high variability in both the accuracy and loss between each epoch, as high as 10%. It happens to my accuracy all the time, and my loss when I start adding in dropout. However I really need the dropout, any ideas on how to smooth it out?
It is hard to say anything concrete without knowing what you do. But because you mentioned that your dataset is very small: 500 samples, I say that your 10% performance jumps are not surprising. Still a few ideas:
definitely use a bigger dataset if you can. If it is not possible to collect a bigger dataset, try to augment whatever you have.
try a smaller dropout and see how it goes, try different regularizers (dropout is not the only option)
your data is small, so you can afford to run more than 200 iterations
see how your model performs on the test set, it is possible that it just severely overfitted the data
Besides the fact that the data set is very small, during training with dropout regularization the loss function is no longer well defined, and I presume the accuracy is also biased. Therefore any tracked metric should be assessed without dropout. It seems that keras does not switch it off while calculating the accuracy during training.
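To see why metrics computed with dropout active are noisy, here is a minimal NumPy sketch of inverted dropout with an explicit `training` flag. The function and variable names are assumptions for this example and are not the Keras implementation.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a random fraction `rate` of units while training.

    When training is False the input is returned unchanged, so any metric
    computed in that mode is deterministic.
    """
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the kept units so the expected activation stays the same.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
x = np.ones(1000)
train_out_a = dropout(x, 0.5, True, rng)   # random mask
train_out_b = dropout(x, 0.5, True, rng)   # a different random mask
eval_out = dropout(x, 0.5, False, rng)     # identical to x
```

Two training-mode passes over the same input give different outputs, while the evaluation-mode pass is stable, which is why loss and accuracy are better judged on a separate evaluation pass.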

Neural network weights explode in linear unit

I am currently implementing a simple neural network and the backprop algorithm in Python with numpy. I have already tested my backprop method using central differences and the resulting gradient is equal.
However, the network fails to approximate a simple sine curve. The network has one hidden layer (100 neurons) with tanh activation functions and an output layer with a linear activation function. Each unit also has a bias input. The training is done by simple gradient descent with a learning rate of 0.2.
The problem arises from the gradient, which gets larger with every epoch, but I don't know why. Further, the problem is unchanged if I decrease the learning rate.
EDIT: I have uploaded the code to pastebin: http://pastebin.com/R7tviZUJ
There are two things you can try, maybe in combination:
Use a smaller learning rate. If it is too high, you may be overshooting the minimum in the current direction by a lot, and so your weights will keep getting larger.
Use smaller initial weights. This is related to the first item. A smaller learning rate would fix this as well.
I had a similar problem (with a different library, DL4J), even in the case of extremely simple target functions. In my case, the issue turned out to be the cost function. When I changed from negative log likelihood to Poisson or L2, I started to get decent results. (And my results got MUCH better once I added exponential learning rate decay.)
Looks like you don't use regularization. If you train your network long enough, it will start to learn the exact data rather than the abstract pattern.
There are a couple of methods to regularize your network, such as early stopping, putting a high cost on large gradients, or more complex ones like dropout. If you search the web or books you will probably find many options for this.
A learning rate that is too large can fail to converge, and can even DIVERGE; that is the point.
The gradient could diverge for this reason: when a step overshoots the position of the minimum, the resulting point could be not just a bit past it, but even further from it than the starting point, on the other side. Repeat the process, and it will continue to diverge. In other words, the rate of variation around the optimal position could be just too big compared to the learning rate.
Source: my understanding of the following video (watch near 7:30).
https://www.youtube.com/watch?v=Fn8qXpIcdnI&list=PLLH73N9cB21V_O2JqILVX557BST2cqJw4&index=10
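The overshooting argument can be checked on a toy problem. The sketch below is not the asker's network; it runs plain gradient descent on f(x) = x^2 (an assumption chosen for simplicity), where each update multiplies x by (1 - 2 * lr), so the iterates shrink for 0 < lr < 1 and grow while flipping sign for lr > 1.

```python
def gradient_descent(lr, x0=1.0, steps=20):
    """Minimize f(x) = x**2 with a fixed learning rate; return all iterates."""
    x = x0
    path = [x]
    for _ in range(steps):
        grad = 2.0 * x        # derivative of x**2
        x = x - lr * grad     # equivalent to x * (1 - 2 * lr)
        path.append(x)
    return path

small = gradient_descent(lr=0.2)   # factor 0.6 per step: converges to 0
large = gradient_descent(lr=1.1)   # factor -1.2 per step: jumps sides and grows

print(abs(small[-1]), abs(large[-1]))
```

With lr = 1.1 every step lands on the opposite side of the minimum and further away than before, which is the divergence described above.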