The GAN Gaussian simulator doesn't converge - mxnet

I am trying to reimplement the Gaussian simulator described in the GAN paper with mxnet.
However, there are two problems with my code.
First, the model doesn't converge very well, even after I added a learning rate scheduler.
This is how it looks after about 500 epochs, with the accuracy bouncing around 0.5 ~ 0.6.
Second, I don't know how to plot the output of the discriminative model. The current curve doesn't look like the one described in the paper.
Could anyone please offer a suggestion?

I think I solved both problems myself. The code behind the GitHub link should now be correct.
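
For the second problem (drawing the discriminator's output), here is a minimal sketch of one way to do it: evaluate the trained discriminator on a grid of inputs and plot D(x) with matplotlib. The function `discriminator_forward` below is a hypothetical stand-in for however the trained mxnet discriminator is exposed in the actual code.

```python
import numpy as np
import matplotlib.pyplot as plt

def discriminator_forward(x):
    # Hypothetical placeholder: replace with a forward pass through the
    # trained mxnet discriminator.
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 200)
d_out = np.array([discriminator_forward(x) for x in xs])  # D(x) in [0, 1]

plt.plot(xs, d_out, label="D(x)")
plt.axhline(0.5, linestyle="--", label="D(x) = 0.5 (value at convergence)")
plt.xlabel("x")
plt.legend()
plt.show()
```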

Related

RNN/GRU Increasing validation loss but decreasing mean absolute error

I am new to deep learning and I am trying to implement an RNN (with 2 GRU layers).
At first, the network seems to do its job quite well. However, I am currently trying to understand the loss and accuracy curves. I attached the pictures below. The dark blue line is the training set and the cyan line is the validation set.
After 50 epochs the validation loss increases. My assumption is that this indicates overfitting. However, I am unsure why the validation mean absolute error still decreases. Do you maybe have an idea?
One idea I had was that this could be caused by some big outliers in my dataset, so I already tried to clean it up. I also tried to scale it properly and added a few dropout layers for further regularization (rate=0.2). However, these are just normal dropout layers, because cuDNN does not seem to support recurrent_dropout from TensorFlow.
Remark: I am using the negative log-likelihood as the loss function and a TensorFlow Probability distribution as the output (dense) layer.
Any hints what I should investigate?
Thanks in advance
Edit: I also attached the non-probabilistic plot as recommended in the comment. It seems like here the mean absolute error behaves normally (it does not improve all the time).
What are the outputs of your model? It sounds pretty strange that you're using the negative log-likelihood (which basically "works" with distributions) as the loss function but the MAE as a metric, which is suited for deterministic continuous values.
I don't know what your task is, and perhaps this is meaningful in your specific case, but perhaps the strange behavior comes from there.
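
To illustrate the setup being discussed, here is a minimal, hedged sketch (not the asker's actual code) of a Keras model with GRU layers whose last layer is a TensorFlow Probability distribution, trained with the negative log-likelihood and monitored with MAE. The layer sizes and the Normal output head are assumptions.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

def negloglik(y_true, dist):
    # negative log-likelihood of the targets under the predicted distribution
    return -dist.log_prob(y_true)

model = tf.keras.Sequential([
    tf.keras.layers.GRU(64, return_sequences=True),
    tf.keras.layers.GRU(64),
    tf.keras.layers.Dropout(0.2),          # plain (non-recurrent) dropout
    tf.keras.layers.Dense(2),              # parameters of the output Normal
    tfp.layers.DistributionLambda(
        lambda t: tfd.Normal(loc=t[..., :1],
                             scale=1e-3 + tf.math.softplus(t[..., 1:]))),
])

# Keras converts the distribution output to a tensor (by default, a sample
# from it) before computing 'mae', so the NLL loss and the MAE metric measure
# different things and can move in different directions, which is the
# mismatch pointed out above.
model.compile(optimizer="adam", loss=negloglik, metrics=["mae"])
```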

Object Detection model stuck at low mAP

I am trying to reproduce the results of the SSDLite model reported in the MobileNetV2 paper (arXiv:1801.04381), which should achieve about 22.1% mAP on the COCO detection challenge. However, I am stuck at 9% mAP. This is strange behavior because the model does work somewhat, but is still far off from the reported result. Can this much of a gap be caused by hyperparameter/optimizer choices (I am using Adam instead of SGD), or is it almost certain that there is a bug in my implementation?
It is also worth mentioning that the model successfully overfits a small subset of the training set, but on the whole training set the loss seems to reach a plateau fairly quickly.
Has anyone encountered a similar problem?
Can this much of a gap be caused by hyperparameter/optimizer choices (I am using Adam instead of SGD), or is it almost certain that there is a bug in my implementation?
Even small changes in the hyperparameters and a different optimizer choice can have a large impact on the training and on the resulting precision of the classifier. So your low precision might not necessarily be due to a bug, but could also be due to a wrong parametrization.
It is also worth mentioning that the model successfully overfits a small subset of the training set, but on the whole training set the loss seems to reach a plateau fairly quickly.
It seems like you ran into a local optimum that only works for a subset of your data, which could also be a pointer to a suboptimal parametrization.
As @Matias Valdenegro also mentioned, to reproduce the exact result you might have to use the same parameters as in the original implementation.
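
As an illustration of how different the optimizer setup can be, here is a hedged sketch of switching from Adam to SGD with momentum and a decaying learning rate in Keras. The numbers are illustrative assumptions, not the values from the MobileNetV2 paper or the original implementation, and `detection_loss` is a placeholder for the SSD multibox loss used in the actual code.

```python
import tensorflow as tf

# illustrative schedule and momentum; the original training setup may differ
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.01, decay_steps=200_000)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# model.compile(optimizer=optimizer, loss=detection_loss)
```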

Tensorflow Estimator self repair training on overfitting

I'm getting some learning experience with TensorFlow's Estimator API. Doing a classification task on a small dataset with TensorFlow's tf.contrib.learn.DNNClassifier (I know there is tf.estimator.DNNClassifier, but I have to work with TensorFlow 1.2), I get the accuracy graph on my test dataset. I wonder why there are these negative peaks.
I thought they could occur because of overfitting and self-repairing. The next data point after a peak seems to have the same value as the point before it.
I tried to look into the code to find any proof that the Estimator's train function has such a mechanism, but did not find any.
So, is there such a mechanism, or are there other possible explanations?
I don't think that the Estimator's train function has any such mechanism.
Some possible theories:
Does your training restart at any point? It's possible that if you have some exponential moving average (EMA) in your model, the moving average has to be recomputed upon restart.
Is your input data randomized? If not, it's possible that a patch of input data is all misclassified, and again the EMA possibly smooths it out.
This is pretty mysterious to me. If you do find out what the real issue is, please do share!
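
If the second theory applies, shuffling the input pipeline is a cheap thing to try. Below is a minimal sketch in TF 1.x style (on TF 1.2 the Dataset API lives under tf.contrib.data rather than tf.data); the `features` and `labels` arrays are placeholders standing in for the asker's own data.

```python
import numpy as np
import tensorflow as tf

features = np.random.rand(1000, 10).astype(np.float32)       # placeholder data
labels = np.random.randint(0, 2, size=1000).astype(np.int32)  # placeholder labels

def input_fn(batch_size=32):
    dataset = tf.data.Dataset.from_tensor_slices(({"x": features}, labels))
    dataset = dataset.shuffle(buffer_size=1000)  # randomize example order
    dataset = dataset.repeat().batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()

# estimator.train(input_fn=input_fn, steps=10_000)  # with a tf.estimator-style classifier
```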

DQN cannot converge while the example with the same graph works

Here is the problem: I tried to rewrite a DQN example for CartPole-v0. The example converges in fewer than 2000 episodes; however, my version does not converge.
Therefore I used TensorBoard to help me figure out the problem. I found that the two pieces of code have the same graph.
What's more, I found that the loss curves are very different.
loss of the example
loss of mine
I wonder what the problem is. I would be grateful if anyone could help me with it! Thank you!
P.S. If the code is needed, I can upload it. Both versions are shorter than 200 lines.

Neural network weights explode in linear unit

I am currently implementing a simple neural network and the backprop algorithm in Python with numpy. I have already tested my backprop method using central differences, and the resulting gradients match.
However, the network fails to approximate a simple sine curve. The network has one hidden layer (100 neurons) with tanh activation functions and an output layer with a linear activation function. Each unit also has a bias input. The training is done by simple gradient descent with a learning rate of 0.2.
The problem is that the gradient grows larger with every epoch, but I don't know why. Furthermore, the problem persists if I decrease the learning rate.
EDIT: I have uploaded the code to pastebin: http://pastebin.com/R7tviZUJ
There are two things you can try, maybe in combination:
Use a smaller learning rate. If it is too high, you may be overshooting the minimum in the current direction by a lot, and so your weights will keep getting larger.
Use smaller initial weights. This is related to the first item. A smaller learning rate would fix this as well.
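
A minimal sketch (not the asker's code) of the second suggestion, using numpy as in the question: scale the initial weights by 1/sqrt(fan_in) and pair that with a learning rate smaller than the 0.2 from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 1, 100, 1   # sizes from the question: 1 -> 100 -> 1

# small initial weights: standard normal scaled by 1/sqrt(fan_in)
W1 = rng.standard_normal((n_in, n_hidden)) / np.sqrt(n_in)
b1 = np.zeros(n_hidden)
W2 = rng.standard_normal((n_hidden, n_out)) / np.sqrt(n_hidden)
b2 = np.zeros(n_out)

learning_rate = 0.01  # considerably smaller than the 0.2 in the question
```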
I had a similar problem (with a different library, DL4J), even in the case of extremely simple target functions. In my case, the issue turned out to be the cost function. When I changed from negative log likelihood to Poisson or L2, I started to get decent results. (And my results got MUCH better once I added exponential learning rate decay.)
It looks like you don't use regularization. If you train your network long enough, it will start to learn the exact data rather than the abstract pattern.
There are a couple of methods to regularize your network, for example: early stopping, putting a high cost on large weights, or more complex ones such as dropout. If you search the web or books you will probably find many options for this.
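
A hedged sketch of one of the options listed above, an L2 penalty on the weights ("weight decay"), added to a plain numpy gradient-descent update. `grad_W` stands in for whatever gradient the asker's backprop already computes.

```python
import numpy as np

def sgd_step_with_l2(W, grad_W, learning_rate=0.01, l2=1e-4):
    # the penalty term l2 * W pulls large weights back toward zero
    return W - learning_rate * (grad_W + l2 * W)

rng = np.random.default_rng(0)
W = rng.standard_normal((100, 1))
grad_W = rng.standard_normal((100, 1))  # placeholder gradient, for illustration
W = sgd_step_with_l2(W, grad_W)
```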
A learning rate that is too big can fail to converge, and can even DIVERGE; that is the point.
The gradient can diverge for this reason: when a step overshoots the position of the minimum, the resulting point may end up not just a bit past it, but at a greater distance from the minimum than before, only on the other side. Repeat the process, and it will continue to diverge. In other words, the rate of variation around the optimal position can simply be too big compared to the learning rate.
Source: my understanding of the following video (watch near 7:30).
https://www.youtube.com/watch?v=Fn8qXpIcdnI&list=PLLH73N9cB21V_O2JqILVX557BST2cqJw4&index=10
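
A tiny worked example of the divergence described in this answer: plain gradient descent on f(x) = x² (gradient 2x). With a step size below 1.0 the iterates shrink toward the minimum; above 1.0 each step overshoots and lands farther away on the other side.

```python
def gradient_descent(x0, lr, steps=5):
    x = x0
    for i in range(steps):
        x = x - lr * 2 * x  # gradient of x^2 is 2x
        print(f"step {i + 1}: x = {x:.3f}")
    return x

gradient_descent(x0=1.0, lr=0.4)  # converges: 0.200, 0.040, 0.008, ...
gradient_descent(x0=1.0, lr=1.5)  # diverges: -2.000, 4.000, -8.000, ...
```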