Imbalanced Dataset for Multi Label Classification - tensorflow

So I trained a deep neural network on a multi-label dataset I created (about 20,000 samples). I switched softmax for sigmoid and am trying to minimize (using the Adam optimizer):
tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y_pred))
And I end up with this kind of prediction (pretty "constant"):
Prediction for Im1 : [ 0.59275776 0.08751075 0.37567005 0.1636796 0.42361438 0.08701646 0.38991812 0.54468459 0.34593087 0.82790571]
Prediction for Im2 : [ 0.52609032 0.07885984 0.45780018 0.04995904 0.32828355 0.07349177 0.35400775 0.36479294 0.30002621 0.84438241]
Prediction for Im3 : [ 0.58714485 0.03258472 0.3349618 0.03199361 0.54665488 0.02271551 0.43719986 0.54638696 0.20344526 0.88144571]
At first, I thought I just needed to find a threshold value for each class.
But I noticed that, for instance, among my 20,000 samples the 1st class appears in about 10,800 of them, so a 0.54 ratio, and that is the value my prediction hovers around every time. So I think I need to find a way to tackle this "imbalanced dataset" issue.
I thought about reducing my dataset (undersampling) to have about the same number of occurrences for each class, but only 26 samples correspond to one of my classes... That would make me lose a lot of samples...
I read about oversampling, and about penalizing even more the classes that are rare, but did not really understand how they work.
Can someone share some explanations about these methods, please?
In practice, in TensorFlow, are there functions that help with this?
Any other suggestions ?
Thank you :)
PS: Neural Network for Imbalanced Multi-Class Multi-Label Classification raises the same problem but has no answer!

Well, having 10,000 samples in one class and just 26 in a rare class will indeed be a problem.
However, what you are experiencing looks to me more like "the outputs don't even see the inputs", so the net just learns your output distribution.
To debug this, I would create a reduced set (just for debugging purposes) with, say, 26 samples per class and then try to heavily overfit it. If you get correct predictions, my hypothesis is wrong. But if the net cannot even learn those few samples by heart, then it is indeed an architecture/implementation problem and not the skewed distribution (which you will still need to deal with later, but it won't be as bad as your current results).
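In case it helps, here is a minimal sketch of that debugging step, assuming images and labels are in-memory numpy arrays; those names, like the 26-per-class size, are just placeholders for your own pipeline:

import numpy as np

# images: (N, ...) inputs, labels: (N, 10) multi-hot targets -- placeholder names.
def reduced_subset(images, labels, per_class=26, seed=0):
    """Pick up to `per_class` samples for every class (union over classes)."""
    rng = np.random.RandomState(seed)
    chosen = set()
    for c in range(labels.shape[1]):
        idx = np.where(labels[:, c] == 1)[0]          # samples where class c is positive
        rng.shuffle(idx)
        chosen.update(idx[:per_class].tolist())
    chosen = np.array(sorted(chosen))
    return images[chosen], labels[chosen]

# x_small, y_small = reduced_subset(images, labels)
# Train your existing network on (x_small, y_small) for many epochs: if it cannot
# drive the training loss close to zero and reproduce the true multi-hot vectors,
# the problem is in the architecture or input pipeline, not in the class imbalance.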

Your problem is not the class imbalance, but rather just the lack of data. 26 samples is a very small amount for practically any real machine learning task. A class imbalance could easily be handled by ensuring that each minibatch has at least one sample from every class (this means some samples will be used much more frequently than others, but who cares).
However, with only 26 samples present, this approach (or any other) will quickly lead to overfitting. The problem could be partly mitigated with some form of data augmentation, but there are still too few samples to build something reasonable.
So my suggestion is to collect more data.
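Regarding the question's "penalize the rare classes even more": TensorFlow has tf.nn.weighted_cross_entropy_with_logits, which is the sigmoid cross-entropy from the question with an extra per-class pos_weight. A minimal sketch (keyword names follow the TF 2.x API, older 1.x versions call the first argument targets; the pos_weight formula below is one common heuristic, not the only choice):

import numpy as np
import tensorflow as tf

# labels_np: (N, 10) multi-hot numpy array for the whole training set -- placeholder name.
pos_counts = labels_np.sum(axis=0)                     # how often each class is positive
neg_counts = labels_np.shape[0] - pos_counts
# Heuristic: weight each class's positives by (#negatives / #positives), so rare
# classes contribute roughly as much to the loss as frequent ones.
pos_weight = tf.constant(neg_counts / np.maximum(pos_counts, 1.0), dtype=tf.float32)

# y_ and y_pred are the same tensors as in the question (multi-hot targets and logits).
loss = tf.reduce_mean(
    tf.nn.weighted_cross_entropy_with_logits(labels=y_, logits=y_pred,
                                             pos_weight=pos_weight))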

Related

CNN: Unstable model score vs. iteration

My model score vs. iteration graph is unstable. How can I improve it?
This is what I get
Here is my code
Code 1
Code 2
Code 3
Code 4
Code 5
Your network looks fairly stock / copy-and-pasted; I'm pretty sure I've seen this code before.
Without knowing much about your input data, I'm not sure whether you're solving a classification problem, but first try switching the output to softmax with a negative log likelihood loss.
Your current output activation and loss function are mainly meant for binary classification.
You can also get rid of the ReNormalizeL2PerLayer step; depending on your data, it might hinder the network from learning.
It's also hard to help without knowing your input data, but sometimes zero-mean, unit-variance normalization is not suitable for a data set. Consider switching to a 0-to-1 scaling instead.
Lastly, for quick iteration times, consider overfitting on a small amount of data first when testing. That will help you see whether there is any signal in your data and whether your network can learn at all.
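Since the code itself isn't visible above, here is what that softmax / negative-log-likelihood switch looks like in Keras, purely as an illustration; the input shape, layer sizes and num_classes are assumptions, not taken from the question:

from tensorflow.keras import layers, models

num_classes = 10   # assumption -- replace with your number of classes

model = models.Sequential([
    # assumed 28x28 grayscale inputs -- replace with your own input shape
    layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    # softmax output instead of a sigmoid/binary head
    layers.Dense(num_classes, activation='softmax'),
])

# categorical cross-entropy is the negative log likelihood of the true class
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',   # expects integer class labels
              metrics=['accuracy'])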

Tensorflow / Keras: Normalize train / test / realtime Data or how to handle reality?

I started developing some LSTM-models and now have some questions about normalization.
Let's pretend I have some time series data that roughly ranges between +500 and -500. Would it be more realistic to scale the data from -1 to 1, or is 0 to 1 a better way? I tested it, and 0 to 1 seemed to be faster. Is there a wrong way to do it, or would one option just be slower to learn?
Second question: when do I normalize the data? I split the data into training and test data; do I have to scale / normalize them separately? Maybe the training data only ranges from -200 to +300 and the test data ranges from -100 to +600. That's not very good, I guess.
But on the other hand... if I scale / normalize the entire dataframe and split it after that, the data is fine for training and test, but how do I handle genuinely new incoming data? The model was trained on scaled data, so I have to scale the new data as well, right? But what if a new value is 1000? The normalization would turn it into something greater than 1, because it's a bigger number than everything seen before.
To make a long story short, when do I normalize data and what happens to completely new data?
I hope I could make it clear what my problem is :D
Thank you very much!
Would like to know how to handle reality as well tbh...
On a serious note though:
1. How to normalize data
Usually, neural networks benefit from data that follows a standard Gaussian distribution (mean 0 and variance 1).
Techniques like Batch Normalization (simplifying a bit) help the network keep this trait throughout all of its layers, which is usually beneficial.
There are other approaches, like the ones you mentioned; to tell reliably what helps for a given problem and architecture, you just have to test and measure.
2. What about test data?
The mean you subtract and the variance you divide each instance by (or any other statistic required by whatever normalization scheme you choose) should be computed from your training dataset. If you take them from the test set, you introduce data leakage (information about the test distribution is incorporated into training) and you may get the false impression that your algorithm performs better than it does in reality.
So compute the statistics over the training dataset only and reuse them on incoming / validation / test data.
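A minimal numpy sketch of that workflow (train and test are placeholder arrays standing in for your own splits; the same idea applies if you use min-max scaling instead of z-scoring):

import numpy as np

# train, test: numpy arrays of your series -- placeholder names.
mean = train.mean(axis=0)            # statistics come from the training split only
std = train.std(axis=0) + 1e-8       # small epsilon so we never divide by zero

train_n = (train - mean) / std
test_n = (test - mean) / std         # reuse the *training* statistics here

def normalize_new(x):
    """Apply the same transform to data arriving at inference time."""
    return (x - mean) / std

# A new value like 1000 simply maps to a large z-score; that's fine for z-scoring.
# With min-max scaling to [0, 1], values outside the training range land outside
# [0, 1] -- clip them or accept it, but never re-fit the scaler on new data.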

GAN generator loss goes to zero

I am rather new to deep learning, please bear with me. I have a GAN, with model structure copy-pasted from: https://machinelearningmastery.com/how-to-develop-a-generative-adversarial-network-for-an-mnist-handwritten-digits-from-scratch-in-keras/
It trains for, say, 100-200 epochs with pretty OK results, then suddenly the generator loss drops to zero... here is an excerpt from the log:
epoch,step,gen_loss,discr_loss
...
189,25,0.208,0.712
189,26,3.925,1.501
189,27,0.269,1.400
189,28,7.814,2.536
189,29,0.000,3.387 // here?!?
189,30,0.000,7.903
189,31,16.118,7.745
189,32,16.118,8.059
189,33,16.118,8.059
189,34,16.118,8.059
... etc, it never recovers
Is this a problem of vanishing gradients? Anything else I’m missing?
In the blog post comments people argue about the GAN collapse problem; here is one of the comments:
There were problems with the discriminator collapsing to zero on occasions. This seems to be a known feature of GANs. Do any established GAN hacks help with this?
Looking at the discriminator after 100 epochs, it was in a confused state where everything passed into it was circa 50% probability real/fake. I colour coded some generated examples based on discriminator probability (red = fake, green = real, blue = unsure based on an arbitrary banding) and as you mentioned the subjective versus discriminator output does not always tie up. (example posted on LinkedIn). There was not enough spread in discriminator probability output to make this meaningful.
GANs are very hard to train, and it is very common for the generator or the discriminator to become so strong that the other can't improve any more. If you are trying to generate pictures, for instance, I would recommend using progressive GANs, which improve stability a lot and allow you to go for high-resolution images.

Train / Test split % for Object Detection - what's the current recommendation?

Using the Tensorflow Object Detection API, what's the current recommendation / best practice around the train / test split percentage for labeled examples? I've seen a lot of conflicting info, anywhere from 70/30 to 95/5. Any recent real world experience is appreciated.
Traditional advice is ~70-75% training data and the rest test data. More recent articles do suggest a different split; I read 95/2.5/2.5 (train / test / dev for hyperparameter tuning) a lot these days.
I guess your optimal split depends on the amount of available data and the bias/variance characteristics. Poor performance on the training data may be caused by underfitting and call for more training data. If your model is fitting well or even overfitting, you should be able to move some of the training data over to the test data.
If you're stuck in the middle, you may also consider cross-validation as a computationally expensive but data-friendly option.
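If you go with a fixed split rather than cross-validation, the usual pattern is to carve the held-out portion off in two steps. A sketch with scikit-learn; the 80/10/10 ratio and the names X, y are just example assumptions, not a recommendation specific to the Object Detection API:

from sklearn.model_selection import train_test_split

# X: image file names or features, y: labels -- placeholders for your annotations.
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.20, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_hold, y_hold, test_size=0.50, random_state=42)
# -> 80% train, 10% validation (hyperparameter tuning), 10% test (final, untouched evaluation)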
It depends on the size of the dataset, as Andrew Ng suggests (train / dev (or val) / test):
If the size of the dataset is 100 to 10K ==> ~60/20/20
If the size of the dataset is 1M or more ==> 98/1/1 or 99.5/0.25/0.25
Note that these are not fixed rules, just suggestions.
The goal of the test set mentioned here is to give you an unbiased performance measurement of your work. In some work it is OK to have only two sets (they will then be called train/test, though the test set here actually works as a dev set; the ratio can be 70/30).

How should I test on a small dataset?

I use Weka to test machine learning algorithms on my dataset. I have 3800 rows and around 25 features. I am testing combinations of different features for prediction models, and they seem to predict worse than the simple OneR algorithm does when using cross-validation. Even C4.5 does not predict better; sometimes it does and sometimes it does not, depending on the features that are still able to classify.
But at a certain point I split my dataset into a test set and a training set (20/80), and testing on that test set, the C4.5 algorithm had a far higher accuracy than my OneR algorithm. I thought that, with the small size of the dataset, it was probably just a coincidence that it predicted so well (the target classes were still split up in roughly their original proportions), and that it is therefore more useful to use cross-validation on small datasets like these.
However, testing on another test set again gave a high accuracy for C4.5. So my question actually is: what is the best way to evaluate models when the dataset is pretty small?
I saw some posts where it is discussed, but I am still not sure what is the right way to do it.
It's almost always a good approach to test your model via Cross-Validation.
A rule of thumb is to use 10 fold cross validation.
In your case, 10 fold cross validation will do the following in Weka:
split your 3800 training instances into 10 sets of 380 instances
for each set (s = 1 .. 10) :
use the instances from s for testing and the other 9 sets for training a model (3420 training instances)
the result will be an average of the results obtained with the 10 models used.
Try to avoid evaluating your model with the "use training set" option, because that could result in a model that works very well on your existing data but has big problems with new instances (overfitting).
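Weka does all of this for you from the GUI ("Cross-validation, Folds 10"). Purely as an illustration of the same scheme outside Weka, here is a scikit-learn sketch, with DecisionTreeClassifier standing in for C4.5 and DummyClassifier as a crude stand-in for a OneR-style baseline (both are substitutions, not your exact Weka setup):

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

# X: (3800, ~25) feature matrix, y: class labels -- placeholders for your Weka data.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for name, clf in [("baseline", DummyClassifier(strategy="most_frequent")),
                  ("tree (C4.5-like)", DecisionTreeClassifier(random_state=42))]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")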