My model score vs. iteration graph is unstable. How can I improve it?
This is what I get
Here is my code
(five code snippets omitted)
Your network looks fairly stock / copy-and-pasted. I'm pretty sure I've seen this code before.
Without knowing much about your input data I'm not sure whether you're solving a classification problem, but first try switching the output to softmax with a negative log-likelihood (cross-entropy) loss.
The output activation and loss function you have now are mainly meant for binary classification.
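For illustration, here is a minimal sketch of that switch, assuming a Keras-style model (the layer sizes and class count are placeholders, not taken from the posted code):

# Hypothetical sketch: replacing a binary-style output (sigmoid + binary cross-entropy)
# with a multi-class output (softmax + cross-entropy, i.e. negative log-likelihood).
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 10  # assumption: replace with the real number of classes

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(32,)),  # placeholder hidden layer
    # before (binary-style): layers.Dense(1, activation="sigmoid")
    layers.Dense(NUM_CLASSES, activation="softmax"),          # one probability per class
])

# before: loss="binary_crossentropy"
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",  # NLL over the softmax outputs
              metrics=["accuracy"])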
You can also get rid of the ReNormalizeL2PerLayer. That might hinder the network from learning depending on your data.
It's also hard to help without knowing much about your input data, but sometimes zero-mean, unit-variance standardization isn't suitable for a data set. Consider switching to 0-to-1 scaling instead.
Lastly, for quick iteration times, consider overfitting on a small amount of data first when testing. That will show you whether there's any signal in your data and whether your network can learn at all.
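A minimal sketch of that overfitting sanity check, assuming a Keras-style fit and NumPy arrays (x_train, y_train and model are placeholders for your own data and network):

import numpy as np

# Hypothetical sanity check: try to overfit a tiny subset.
# If the network cannot drive the training loss close to zero on ~100 samples,
# there is likely a bug (or no learnable signal in the data).
n_debug = 100  # assumption: small enough to overfit quickly
idx = np.random.choice(len(x_train), size=n_debug, replace=False)
x_small, y_small = x_train[idx], y_train[idx]

history = model.fit(x_small, y_small, epochs=200, batch_size=16, verbose=0)
print("final training loss on tiny subset:", history.history["loss"][-1])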
Related
I started developing some LSTM models and now have some questions about normalization.
Let's say I have some time series data that roughly ranges between -500 and +500. Would it be better to scale the data to [-1, 1], or is [0, 1] a better choice? I tested it and [0, 1] seemed to train faster. Is there a wrong way to do it, or would a poor choice just make learning slower?
Second question: when do I normalize the data? I split the data into training and test sets; do I have to scale/normalize them separately? Maybe the training data only ranges from -200 to +300 while the test data ranges from -100 to +600. That's probably not good.
But on the other hand, if I scale/normalize the entire dataframe and split it afterwards, the data is consistent for training and test, but how do I handle genuinely new incoming data? The model was trained on scaled data, so I have to scale the new data as well, right? But what if a new value is 1000? The normalization would map it to something greater than 1, because it's larger than everything seen before.
To make a long story short, when do I normalize data and what happens to completely new data?
I hope I made my problem clear :D
Thank you very much!
Would like to know how to handle reality as well tbh...
On a serious note though:
1. How to normalize data
Usually, neural networks benefit from inputs that follow a standard Gaussian distribution (mean 0 and variance 1).
Techniques like Batch Normalization (simplifying a bit) help the network keep this property throughout its layers, which is why it's usually beneficial.
There are other approaches, like the ones you mentioned; to tell reliably what helps for a given problem and architecture you just have to experiment and measure.
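As an illustration only (a plain feed-forward stack, not your LSTM), this is roughly what Batch Normalization looks like in Keras; for an LSTM, normalizing the inputs as discussed in point 2 below is usually the more relevant part:

# Minimal Keras sketch of Batch Normalization between layers (all sizes are arbitrary).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, input_shape=(20,)),
    layers.BatchNormalization(),   # re-centers and re-scales activations per mini-batch
    layers.Activation("relu"),
    layers.Dense(1),
])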
2. What about test data?
The mean you subtract and the standard deviation you divide each instance by (or any other statistics gathered by whatever normalization scheme you use) should be computed from your training dataset. If you take them from the test set you introduce data leakage (information about the test distribution is incorporated into training), and you may get a false impression that your algorithm performs better than it does in reality.
So just compute statistics over training dataset and use them on incoming/validation/test data as well.
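A minimal sketch of that workflow in plain NumPy (the numbers are made up; the same idea applies if you use scikit-learn's StandardScaler or MinMaxScaler):

import numpy as np

# Hypothetical train/test values; in practice these come from your own split.
train = np.array([[-200.0], [150.0], [300.0]])
test = np.array([[-100.0], [600.0]])

# Compute statistics on the TRAINING data only.
mean, std = train.mean(axis=0), train.std(axis=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std                  # reuse the training statistics
new_scaled = (np.array([[1000.0]]) - mean) / std   # brand-new data uses them too

The same rule holds for min-max scaling: compute the min and max on the training set only, and accept that genuinely new data (like your 1000 example) can land outside [0, 1]; that is expected and usually not a problem.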
I am rather new to deep learning, please bear with me. I have a GAN, with model structure copy-pasted from: https://machinelearningmastery.com/how-to-develop-a-generative-adversarial-network-for-an-mnist-handwritten-digits-from-scratch-in-keras/
It will train for, say, 100-200 epochs with pretty OK results, then suddenly the generator loss drops to zero... here is an excerpt from the log:
epoch,step,gen_loss,discr_loss
...
189,25,0.208,0.712
189,26,3.925,1.501
189,27,0.269,1.400
189,28,7.814,2.536
189,29,0.000,3.387 // here?!?
189,30,0.000,7.903
189,31,16.118,7.745
189,32,16.118,8.059
189,33,16.118,8.059
189,34,16.118,8.059
... etc, it never recovers
Is this a problem of vanishing gradients? Anything else I’m missing?
In the blog post comments people discuss the GAN collapse problem; here is one of the comments:
There were problems with the discriminator collapsing to zero on occasions. This seems to be a known feature of GANs. Do any established GAN hacks help with this?
Looking at the discriminator after 100 epochs, it was in a confused state where everything passed into it was at circa 50% probability real/fake. I colour-coded some generated examples based on discriminator probability (red = fake, green = real, blue = unsure based on an arbitrary banding) and, as you mentioned, the subjective quality versus the discriminator output does not always tie up (example posted on LinkedIn). There was not enough spread in the discriminator probability output to make this meaningful.
GANs are very hard to train, and it is very common for either the generator or the discriminator to become so strong that the other can't improve any more. If you are, for instance, trying to generate pictures, I would recommend using progressive GANs, which improve stability a lot and allow you to go for high-resolution images.
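Regarding the "established GAN hacks" question above: one commonly cited, small stabilization trick (not something the linked tutorial necessarily uses) is one-sided label smoothing on the discriminator's real labels. A hedged Keras-style sketch, where d_model, real_images, fake_images and batch_size are placeholders:

import numpy as np

# Hypothetical sketch of one-sided label smoothing for the discriminator.
real_labels = np.full((batch_size, 1), 0.9)   # 0.9 instead of 1.0 keeps the discriminator less confident
fake_labels = np.zeros((batch_size, 1))       # fake labels stay at 0

d_loss_real = d_model.train_on_batch(real_images, real_labels)
d_loss_fake = d_model.train_on_batch(fake_images, fake_labels)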
When you are building a neural network in which the input values are known to have error, is there a way to incorporate this into the network? I.e. one input value may have a known small error, so its value is a good estimate, while another may have a larger standard error, so you are less confident in its true value.
Googling this question is not easy because mostly error messages or errors in the output pop up, so if someone here knows offhand that would be great, thanks!
One possibility would be to use some inverse of the error as a weight during training. Basically, when you calculate the loss of one input example during training, you multiply it by its weight. A higher weight leads to a higher loss and a higher impact on the gradient and the change of the weights.
By choosing, for example, 1 / standard error as the weight, a wrong prediction on an input with high uncertainty is not penalized as much as one on an example you are confident about.
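In Keras this kind of per-example weighting can be passed via the sample_weight argument of fit; a minimal sketch, assuming your per-example standard errors live in a NumPy array (model, x_train, y_train are placeholders):

import numpy as np

# Hypothetical per-example standard errors for the training inputs.
standard_errors = np.array([0.1, 0.5, 2.0, 0.2])

# Inverse-error weighting: confident examples get large weights, noisy ones small weights.
eps = 1e-8                                    # avoid division by zero
sample_weights = 1.0 / (standard_errors + eps)

model.fit(x_train, y_train, sample_weight=sample_weights, epochs=10)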
I want to predict stock price.
Normally, people would feed the input as a sequence of stock prices.
Then they would feed the output as the same sequence but shifted to the left.
When testing, they would feed the output of the prediction into the next input timestep like this:
I have another idea, which is to fix the sequence length, for example 50 timesteps.
The input and output are exactly the same sequence.
When training, I replace the last 3 elements of the input with zeros to let the model know that I have no input for those timesteps.
When testing, I would feed the model a sequence of 50 elements, the last 3 being zeros. The predictions I care about are the last 3 elements of the output.
Would this work or is there a flaw in this idea?
The main flaw of this idea is that it does not add anything to the model's learning, and it reduces its capacity, because you force your model to learn an identity mapping for the first 47 steps (50 - 3). Note that providing 0 as input is equivalent to providing no input to an RNN: a zero input, after being multiplied by a weight matrix, is still zero, so the only sources of information are the bias and the output from the previous timestep, both of which are already there in the original formulation. As for the second addition, requiring the output to reproduce the first 47 steps: there is nothing to be gained by learning that identity mapping, yet the network has to "pay the price" for it, since it must spend weights encoding this mapping in order not to be penalised.
So in short: yes, your idea will work, but it is nearly impossible to get better results this way compared to the original approach. You do not provide any new information and do not really modify the learning dynamics, yet you limit capacity by requiring an identity mapping to be learned per step; and since that is an extremely easy thing to learn, gradient descent will discover this relation first, before even trying to "model the future".
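For reference, a rough sketch of the "original approach" mentioned above (train on targets shifted one step to the left, feed predictions back in at test time); the architecture, sizes and names are placeholders, not anything from the question:

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

SEQ_LEN = 50  # assumption: 50 timesteps, as in the question

# Hypothetical model that predicts the next value at every timestep.
model = keras.Sequential([
    layers.LSTM(32, return_sequences=True, input_shape=(SEQ_LEN, 1)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Training: y is x shifted one step to the left, i.e. y[:, t] = x[:, t + 1].

# Testing: feed each prediction back in as the next input.
def predict_ahead(model, seed_seq, n_steps=3):
    seq = list(seed_seq)                  # seed_seq: the last SEQ_LEN observed values
    preds = []
    for _ in range(n_steps):
        x = np.array(seq[-SEQ_LEN:]).reshape(1, SEQ_LEN, 1)
        next_val = float(model.predict(x, verbose=0)[0, -1, 0])  # last timestep's prediction
        preds.append(next_val)
        seq.append(next_val)
    return preds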
So I trained a deep neural network on a multi-label dataset I created (about 20000 samples). I swapped softmax for sigmoid and try to minimize (using the Adam optimizer):
tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y_pred))
And I end up with this kind of prediction (pretty "constant"):
Prediction for Im1 : [ 0.59275776 0.08751075 0.37567005 0.1636796 0.42361438 0.08701646 0.38991812 0.54468459 0.34593087 0.82790571]
Prediction for Im2 : [ 0.52609032 0.07885984 0.45780018 0.04995904 0.32828355 0.07349177 0.35400775 0.36479294 0.30002621 0.84438241]
Prediction for Im3 : [ 0.58714485 0.03258472 0.3349618 0.03199361 0.54665488 0.02271551 0.43719986 0.54638696 0.20344526 0.88144571]
At first, I thought I just needed to find a threshold value for each class.
But I noticed that, for instance, among my 20000 samples, the 1st class appears about 10800 times, i.e. a ratio of 0.54, and that is roughly the value my prediction hovers around every time. So I think I need to find a way to tackle this "imbalanced dataset" issue.
I thought about reducing my dataset (undersampling) so that each class has about the same number of occurrences, but only 26 samples correspond to one of my classes... That would make me lose a lot of samples.
I read about oversampling, and about penalizing the rare classes more heavily, but did not really understand how these work.
Can someone share some explanations of these methods, please?
In practice, in TensorFlow, are there functions that help with that?
Any other suggestions?
Thank you :)
PS: Neural Network for Imbalanced Multi-Class Multi-Label Classification - this post raises the same problem but has no answer!
Well, having 10000 samples in one class and just 26 in a rare class will indeed be a problem.
However, what you're experiencing looks to me more like "the outputs don't even see the inputs", so the net just learns your output distribution.
To debug this I would create a reduced set (just for debugging purposes) with, say, 26 samples per class and then try to heavily overfit it. If you get correct predictions, my hypothesis is wrong. But if the net cannot even fit those few samples, then it is indeed an architecture/implementation problem and not due to the skewed distribution (which you will still need to fix afterwards, but its effect won't be as bad as your current results).
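A minimal sketch of that debugging step, assuming the data is in NumPy arrays with multi-hot labels (x_all, y_all are placeholders for your own dataset):

import numpy as np

# Hypothetical: build a tiny, roughly balanced debug set (~26 samples per class)
# and try to heavily overfit it. y_all is assumed to be a (n_samples, n_classes) 0/1 matrix.
per_class = 26
debug_idx = []
for c in range(y_all.shape[1]):
    class_idx = np.where(y_all[:, c] == 1)[0]
    take = min(per_class, len(class_idx))
    debug_idx.extend(np.random.choice(class_idx, size=take, replace=False))
debug_idx = np.unique(debug_idx)

x_debug, y_debug = x_all[debug_idx], y_all[debug_idx]
# Train for many epochs on this tiny set; if the predictions still look "constant",
# the problem is in the architecture/implementation rather than the class imbalance.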
Your problem is not the class imbalance, but rather just the lack of data. 26 samples is a very small dataset for practically any real machine learning task. Class imbalance by itself could easily be handled by ensuring that each minibatch contains at least one sample from every class (this leads to situations where some samples are used much more frequently than others, but who cares).
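A rough sketch of that minibatch idea, assuming multi-hot NumPy labels (x_all, y_all and batch_size are placeholders):

import numpy as np

# Hypothetical class-aware minibatch sampler for multi-hot labels
# (y_all: (n_samples, n_classes) 0/1 matrix).
def balanced_batch(x_all, y_all, batch_size=32):
    n_classes = y_all.shape[1]
    idx = []
    # First, one sample from every class (rare classes get re-used across batches, which is fine).
    for c in range(n_classes):
        class_idx = np.where(y_all[:, c] == 1)[0]
        if len(class_idx) > 0:
            idx.append(np.random.choice(class_idx))
    # Fill the rest of the batch uniformly at random.
    remaining = batch_size - len(idx)
    if remaining > 0:
        idx.extend(np.random.choice(len(y_all), size=remaining, replace=False))
    return x_all[np.array(idx)], y_all[np.array(idx)]

If you would rather keep uniform sampling and instead penalize mistakes on rare classes more heavily (the other option asked about in the question), TensorFlow's tf.nn.weighted_cross_entropy_with_logits with a per-class pos_weight is one built-in way to do that.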
However, with only 26 samples present, this approach (and any other) will quickly lead to overfitting. The problem could be partly mitigated with some form of data augmentation, but there are still too few samples to build something reasonable on.
So, my suggestion will be to collect more data.