Dealing with imbalanced data by using weight


I have very imbalanced data and the goal is classification. At first I want to try undersampling the majority class. The class sizes are: class 1 with 600 samples, class 2 with 90, class 3 with 60 and class 4 with 96.
Using weights, with 2-fold cross-validation and a random forest model:
Why isn't the result better when I use weights?
This is my code: cfr = RandomForestClassifier(n_estimators=100, n_jobs=5, class_weight={1: 1, 2: 30, 3: 30, 4: 30})
Is there anything wrong in my code? Could you please guide me?

The real question is what your task is. Is your task to maximize the accuracy of the model, even though you have a huge disproportion of classes? If so, you should not undersample the test set. In fact, you never under- or oversample the test set. You might, however, in some cases add weights to particular classes, either to correct for the true priors (which might differ from the empirical ones) or for cost-sensitive learning.
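As a rough illustration of the class-weighting idea, here is a minimal sketch that derives weights from the class counts in the question. The function name balanced_class_weights is made up for this example; the formula n_samples / (n_classes * class_count) is the heuristic scikit-learn applies for class_weight="balanced".

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Class sizes from the question: 600, 90, 60 and 96 samples.
y = [1] * 600 + [2] * 90 + [3] * 60 + [4] * 96
weights = balanced_class_weights(y)
for cls in sorted(weights):
    print(cls, round(weights[cls], 4))
```

The resulting dict could be passed as class_weight to RandomForestClassifier, or class_weight="balanced" could be used directly. These weights come out much milder than the hand-picked 30 in the question, which may be worth trying. Either way, the weights only affect training; the test folds stay untouched.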

Related

Can you forecast with multiple trajectories?

I am new to time-series machine learning and have a, perhaps, trivial question.
I would like to forecast the temperature for a particular region. I could train a model using the hourly data points from the first 6 days of the week and then evaluate its performance on the final day. The training set would therefore have 144 data points (6*24) and the test set would have 24 data points (24*1). Likewise, I can train a new model for each of regions B-Z and evaluate their individual performances. My question is, can you train a SINGLE model for the predictions across multiple different regions? The region label should of course be an input, since it will affect the temperature evolution.
Can you train a single model that forecasts for multiple trajectories rather than just one? Also, what might be a good metric for evaluating its performance? I was going to use mean absolute error but maybe a correlation is better?

Yes, you can train on multiple series from different regions. What you are asking for is, in a sense, an ultimate goal of deep learning: one model that does everything and predicts every region correctly. However, a model that generalizes that broadly normally has to be very large (I'm talking about 100M+ parameters), and training it takes a correspondingly large amount of data, possibly terabytes or petabytes, plus hardware on the scale of a Google data center. As for your next question, the metric: plain RMS error or mean absolute error will work fine.
What you need to focus on is the training data. There is no super model that takes garbage and turns it into gold; it is garbage in, garbage out here too. You need a good dataset that represents the whole environment of the problem you are trying to solve. For example, suppose you want a model that predicts whether a glass breaks when you hit it with a hammer. You have maybe 10 data points for each type of glass, and all of them break when hammered. You train the model, and it simply predicts "break" every single time. Then you try it on bulletproof glass, which does not break, so your model is wrong. You therefore need data covering all the different types of glass before the model can predict correctly. Compare this to your 144 data points: I'm pretty sure that will not be enough in your case.
So I would say yes, you can build that one-model-fits-all, but there is a huge price to pay.
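To make the "region label as an input" idea concrete, here is a small sketch of how several per-region series could be pooled into one training table, together with the two error metrics mentioned above. The helper names (make_pooled_rows, mae, rmse), the lag-feature layout and the toy temperatures are all assumptions for illustration, not something from the question.

```python
import math

def make_pooled_rows(series_by_region, n_lags=3):
    """Turn several per-region series into one list of (features, target) rows.

    Features = one-hot region indicator + the previous n_lags values.
    """
    regions = sorted(series_by_region)
    rows = []
    for idx, region in enumerate(regions):
        one_hot = [1.0 if i == idx else 0.0 for i in range(len(regions))]
        values = series_by_region[region]
        for t in range(n_lags, len(values)):
            rows.append((one_hot + list(values[t - n_lags:t]), values[t]))
    return rows

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

# Toy hourly temperatures (hypothetical values) for two regions, 144 points each.
data = {
    "A": [10 + 5 * math.sin(2 * math.pi * h / 24) for h in range(144)],
    "B": [20 + 3 * math.sin(2 * math.pi * h / 24) for h in range(144)],
}
rows = make_pooled_rows(data, n_lags=3)
print(len(rows), len(rows[0][0]))  # number of rows, features per row

# Naive baseline: predict the most recent lag value (the last feature).
y_true = [target for _, target in rows]
y_pred = [features[-1] for features, _ in rows]
print(mae(y_true, y_pred), rmse(y_true, y_pred))
```

Any regressor can then be trained on the pooled rows. On the metric question, note that a correlation would not penalize a forecast that is off by a constant offset, whereas MAE and RMSE do, which is one reason to prefer them here.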

How to build a neural network that infers a set of values for the most important feature?

My task here is to find a way to get suggested values for the most important feature or features. When I change those features to the suggested values, I want the classification result to change as well.
Snapshot of dataset
The following is the procedures that I have tried so far:
Import dataset (shape: 1162 by 22)
Build a simple neural network (2 hidden layers)
Since the dependent variable is simply either 0 or 1 (classification problem), I onehot-encoded the variable. So it's either [0, 1] or [1,0]
After splitting into train & test data, I train my NN model and got accuracy of 77.8%
To know which feature (out of 21) is the most important one in the determination of either 0 or 1, I trained the data using Random Forest classifier (scikit-learn) and also got 77.8% accuracy and then used the 'feature_importances_' offered by the random forest classifier.
As a result, I found out that a feature named 'a_L4' ranks the highest in terms of relative feature importance.
The feature 'a_L4' is allowed to have a value from 0 to 360 since it represents an angle. In the original dataset, 'a_L4' takes only 12 values: [5, 50, 95, 120, 140, 160, 185, 230, 235, 275, 320, 345].
I augmented the original dataset by adding all 12 possible values for each case, giving a new dataset of shape (1162x12 by 22).
I imported the augmented dataset and tested it on the previously trained NN model. The result was a FAILURE. There was hardly any change in the classification, meaning almost no '1's switched to '0's.
My conclusion was that changing the values of 'a_L4' was not enough to bring a change in the classification. So I additionally did the same procedure again for the 2nd most important feature which in this case was 'b_L7_p1'.
So writing all the possible values that the two most important features can have, now the new dataset becomes the shape of (1162x12x6 by 22). 'b_L7_p1' is allowed to have 6 different values only, thus the multiplication by 6.
Again the result was a FAILURE.
So, my question is: what might I have done wrong in the procedure described above? Do I need to keep searching for more important features and augment the data with all the possible values they can have? Since this is a tedious task with multiple steps to be done manually, and it leads to a huge dataset, I wish there were a way to construct an inference-based NN model that can directly give out suggested values for a certain feature or features.
I am relatively new to this field of research, so could anyone please tell me some key words that I should search for? I cannot find any work or papers regarding this issue on Google.
Thanks in advance.

In this case I would approach the problem in the following way:
Normalize the whole dataset. As you can see from the dataset, your features have different scales. It is very important to bring all features to the same scale. Have a look at: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
The second thing I would do is train and evaluate a model (it can be whatever you want) to get a so-called baseline model.
Then, I would try PCA to see whether all features are needed. Maybe you are feeding redundant features to the model. See: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
For example, if you set n_components in PCA to 0.99, you reduce the number of features while retaining 99% of the explained variance.
Then I would train the model again to see whether there is any improvement. Note that normalization alone should already give an improvement.
If I wanted to see from the dataset itself which features are important, I would use: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html This selects a specified number of features based on some statistical test, for example: https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html (note that chi2 requires non-negative feature values, so apply it to the unscaled features rather than to the StandardScaler output).
Train a model and evaluate it again to see whether there is some improvement.
Also, you should be aware that NNs can perform feature engineering by themselves, so computing feature importance is redundant in a way.
Let me know whether you see any improvements.
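For the normalization step, here is a minimal standard-library sketch of what standardization does: subtract the per-column mean and divide by the per-column standard deviation. The function names and the sample rows are invented for illustration. In practice StandardScaler does this for you, and reusing the training statistics on the test data is an assumption about good practice, not something stated above.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Compute the per-column mean and (population) standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns so we never divide by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds

def standardize(rows, means, stds):
    """Apply (x - mean) / std column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical rows: an angle in degrees next to a small ratio-like feature.
train = [[5.0, 0.10], [95.0, 0.30], [185.0, 0.20], [275.0, 0.40]]
test = [[140.0, 0.25]]

means, stds = fit_standardizer(train)
train_scaled = standardize(train, means, stds)
test_scaled = standardize(test, means, stds)  # reuse the training statistics
print(train_scaled)
print(test_scaled)
```

After this step both columns have zero mean and unit variance on the training data, so the angle no longer dominates the second feature purely because of its larger numeric range.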

How to penalize the loss of one class more than the other in tensorflow for a multi class problem?

Let's say my model has two classes, Class 1 and Class 2. Both classes have an equal amount of training and testing data. But I want to penalize the loss of Class 1 more than that of Class 2, so that one class has fewer false positives than the other (I want the model to perform better for one class than for the other).
How do I achieve this in Tensorflow?

What you are looking for is probably weighted_cross_entropy_with_logits (tf.nn.weighted_cross_entropy_with_logits).
This is closely related to @Sazzad's answer, but specific to TensorFlow. To quote the documentation:
This is like sigmoid_cross_entropy_with_logits() except that
pos_weight, allows one to trade off recall and precision by up- or
down-weighting the cost of a positive error relative to a negative
error.
It accepts an additional argument, pos_weight. Also note that this is only for binary classification (each output is treated as a yes/no label), which is the case in the example you described. If there are other mutually exclusive classes besides the two, this would not work.
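To show what pos_weight does numerically, here is a plain-Python sketch of the per-example formula given in the TensorFlow documentation, pos_weight * y * -log(sigmoid(x)) + (1 - y) * -log(1 - sigmoid(x)). It is a naive version for illustration only; the real TensorFlow op uses a numerically stable rearrangement.

```python
import math

def weighted_sigmoid_cross_entropy(label, logit, pos_weight):
    """Per-example loss: pos_weight * y * -log(p) + (1 - y) * -log(1 - p)."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid of the raw logit
    return -(pos_weight * label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

# Same logit, same uncertainty, but a missed positive costs 5x more.
print(weighted_sigmoid_cross_entropy(label=1.0, logit=0.0, pos_weight=5.0))
print(weighted_sigmoid_cross_entropy(label=0.0, logit=0.0, pos_weight=5.0))
```

With pos_weight above 1, errors on positive examples are penalized more, which tends to raise recall for the positive class at the expense of precision; a value below 1 does the opposite.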

If I understand your question correctly, this is not a TensorFlow concept; you can write your own loss. For binary classification with true label y and predicted probability p, the loss is something like this:
loss = -( y * log(p) + (1 - y) * log(1 - p) )
Here class 0 and class 1 have the same weight in the loss. So you can give more weight to one of the two terms, for example:
loss = -( 5 * y * log(p) + (1 - y) * log(1 - p) )
Hope it answers your question.
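Since the question title mentions a multi-class problem, the same idea can be carried over to softmax outputs by scaling each example's cross-entropy with a weight for its true class. This is a sketch of one common approach, not something taken from the answers above; the function names and the weights are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_weighted_cross_entropy(true_class, logits, class_weights):
    """Cross-entropy of one example, scaled by the weight of its true class."""
    probs = softmax(logits)
    return -class_weights[true_class] * math.log(probs[true_class])

# Hypothetical weights: mistakes on class 0 cost five times as much.
class_weights = [5.0, 1.0, 1.0]
logits = [0.0, 0.0, 0.0]
print(class_weighted_cross_entropy(0, logits, class_weights))
print(class_weighted_cross_entropy(1, logits, class_weights))
```

Averaging this quantity over a batch gives a loss in which misclassified examples of the heavily weighted class contribute more to the gradient.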

Imbalanced Dataset for Multi Label Classification

So I trained a deep neural network on a multi-label dataset I created (about 20000 samples). I switched softmax for sigmoid and tried to minimize (using the Adam optimizer):
tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y_pred))
And I end up with this kind of prediction (pretty "constant"):
Prediction for Im1 : [ 0.59275776 0.08751075 0.37567005 0.1636796 0.42361438 0.08701646 0.38991812 0.54468459 0.34593087 0.82790571]
Prediction for Im2 : [ 0.52609032 0.07885984 0.45780018 0.04995904 0.32828355 0.07349177 0.35400775 0.36479294 0.30002621 0.84438241]
Prediction for Im3 : [ 0.58714485 0.03258472 0.3349618 0.03199361 0.54665488 0.02271551 0.43719986 0.54638696 0.20344526 0.88144571]
At first, I thought I just needed to find a threshold value for each class.
But I noticed that, for instance, among my 20000 samples, the 1st class appears about 10800 times, a ratio of 0.54, and that is the value my prediction hovers around every time. So I think I need to find a way to tackle this "imbalanced dataset" issue.
I thought about reducing my dataset (undersampling) to have about the same number of occurrences for each class, but only 26 samples correspond to one of my classes... That would make me lose a lot of samples...
I read about oversampling and about penalizing the rare classes even more, but did not really understand how that works.
Can someone share some explanations of these methods, please?
In practice, in TensorFlow, are there functions that help with that?
Any other suggestions ?
Thank you :)
PS: The post "Neural Network for Imbalanced Multi-Class Multi-Label Classification" raises the same problem but has no answer!

Well, having 10000 samples in one class and just 26 in a rare class will indeed be a problem.
However, what you experience seems to me more like "the outputs don't even see the inputs", and thus the net just learns your output distribution.
To debug this, I would create a reduced set (just for debugging purposes) with, say, 26 samples per class and then try to heavily overfit it. If you get correct predictions, my guess is wrong. But if the net cannot even fit those few samples, then it is indeed an architecture/implementation problem and not due to the skewed distribution (which you will still need to fix afterwards, but it will not be as bad as your current results).

Your problem is not the class imbalance but rather the lack of data. 26 samples is a very small dataset for practically any real machine learning task. A class imbalance could easily be handled by ensuring that each minibatch has at least one sample from every class (this means some samples will be used much more frequently than others, but who cares).
However, with only 26 samples this approach (like any other) will quickly lead to overfitting. The problem could be partly solved with some form of data augmentation, but there are still too few samples to build something reasonable.
So, my suggestion will be to collect more data.
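The minibatch idea mentioned above could look roughly like the following sketch. The function name balanced_batches, the set-of-labels representation and the toy label counts are assumptions for illustration; it samples with replacement, so rare samples get repeated often, which is exactly the overfitting risk described above.

```python
import random

def balanced_batches(labels, batch_size, n_batches, seed=0):
    """Yield index batches containing at least one sample of every class.

    labels[i] is the set of classes present in sample i (multi-label).
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, classes in enumerate(labels):
        for c in classes:
            by_class.setdefault(c, []).append(idx)
    if batch_size < len(by_class):
        raise ValueError("batch_size must be at least the number of classes")
    all_indices = list(range(len(labels)))
    for _ in range(n_batches):
        # One guaranteed sample per class, then fill up with random samples.
        batch = [rng.choice(members) for members in by_class.values()]
        batch += rng.choices(all_indices, k=batch_size - len(batch))
        rng.shuffle(batch)
        yield batch

# Toy multi-label data: class 2 is rare (2 samples out of 62).
labels = [{0}] * 50 + [{0, 1}] * 10 + [{2}] * 2
batches = list(balanced_batches(labels, batch_size=8, n_batches=5))
for batch in batches:
    present = set().union(*(labels[i] for i in batch))
    print(len(batch), sorted(present))
```

Each yielded batch is a list of indices into the dataset, so it can be used to slice the input and label arrays before a training step.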

How should I test on a small dataset?

I use Weka to test machine learning algorithms on my dataset. I have 3800 rows and around 25 features. I am testing combinations of different features for prediction models, and with cross-validation they seem to predict worse than the plain OneR algorithm does. Even C4.5 does not predict consistently better; sometimes it does and sometimes it does not, depending on which features are still available for classification.
But at some point I split my dataset into a test set and a training set (20/80), and when tested on the test set, the C4.5 algorithm had a far higher accuracy than my OneR algorithm. I thought that, with the small size of the dataset, it was probably just a coincidence that it predicted so well (the target classes were still split proportionally). And therefore, it is more useful to use cross-validation on small datasets like these.
However, testing on another test set also gave a high accuracy with C4.5. So my question actually is: what is the best way to test when the dataset is actually pretty small?
I saw some posts where it is discussed, but I am still not sure what is the right way to do it.

It's almost always a good approach to test your model via Cross-Validation.
A rule of thumb is to use 10 fold cross validation.
In your case, 10 fold cross validation will do the following in Weka:
split your 3800 training instances into 10 sets of 380 instances
for each set (s = 1 .. 10) :
use the instances from s for testing and the other 9 sets for training a model (3420 training instances)
the result will be an average of the results obtained with the 10 models used.
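The splitting steps above can be sketched in a few lines of plain Python. This only illustrates the index bookkeeping; the function name k_fold_indices is made up, and in practice the data is usually shuffled (and often stratified by class) before it is split.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; yield (train, test) index lists."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(k_fold_indices(3800, 10))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(fold_number, len(train), len(test))
```

For each fold a model is trained on the train indices and scored on the test indices; the reported result is the mean of the 10 scores.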
Try to avoid evaluating with the "use training set" option, because that could result in a model that works very well on your existing data but has big problems with new instances (overfitting).
