Is it correct to do a t-test on accuracy scores?

I have accuracy scores from two models on different datasets and languages; for example, I have this table for the two models.
Is it correct to take the average accuracy of model 1 and model 2 and run a significance t-test on that, to see which model does best on these datasets?
Kind regards

Good question.
The short answer is: it depends -- but you can't do a meaningful t-test on just two averaged values. You should look into a paired t-test instead. Basically: is the per-dataset difference between model 1's and model 2's accuracy significantly different from 0 on the whole?
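For instance, a minimal sketch in Python with SciPy (the accuracy values here are just placeholders; in your case each pair would be the two models' accuracy on the same dataset):

    # Paired t-test on per-dataset accuracies (placeholder values).
    from scipy import stats

    model1_acc = [0.81, 0.76, 0.88, 0.69, 0.73]   # model 1 accuracy per dataset
    model2_acc = [0.79, 0.74, 0.90, 0.66, 0.71]   # model 2 accuracy on the same datasets

    # ttest_rel tests whether the mean of the paired differences is zero
    t_stat, p_value = stats.ttest_rel(model1_acc, model2_acc)
    print(t_stat, p_value)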
Best of luck!


How to implement feature importance on nominal categorical features in tree-based classifiers?

I am using the scikit-learn XGBoost model for my binary classification problem. My data contains nominal categorical features (such as race), which should be one-hot encoded before being fed to tree-based models.
On the other hand, the feature_importances_ attribute of XGBoost gives the importance of each column in the trained model. So if I do the encoding and then look at the column importances, the result will include names like race_2 with their own importance scores.
What should I do to get a single score for each nominal feature? Can I take the average of the importance scores of the one-hot encoded columns that belong to one feature (like race_1, race_2 and race_3)?
First of all, if your goal is to select the most useful features for later training, I would advise you to use regularization in your model. In the case of xgboost, you can tune the parameter gamma so the model actually relies more on the "more useful" features (i.e. tune the minimum loss reduction required for the model to add a partition leaf). Here is a good article on implementing regularization in xgboost models.
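As a rough sketch (assuming the xgboost scikit-learn wrapper, with X and y standing in for your encoded features and labels), tuning gamma could look like this:

    # Sketch: searching over gamma (minimum loss reduction required to split)
    # so that less useful features stop producing splits. X and y are placeholders.
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV

    search = GridSearchCV(
        XGBClassifier(),
        param_grid={"gamma": [0, 0.1, 0.5, 1, 5]},
        cv=5,
        scoring="accuracy",
    )
    search.fit(X, y)
    print(search.best_params_)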
On the other hand, if you insist on doing feature importance, I would say that grouping the encoded variables and simply adding their scores is not a good decision. This would produce feature-importance results that do not consider the relationship between these dummy variables.
My suggestion would be to take a look at permutation importance for this. The basic idea is that you take your dataset, shuffle the values of the column whose importance you want to measure, score the already-trained model on the shuffled data, and record how much the performance drops. Repeat this for each column; the drop in performance caused by shuffling a column is a sign of its importance.
It is actually easier done than said: sklearn has this built in for you, check out the example provided here.
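A minimal sketch, assuming model is your fitted classifier and X_val / y_val are a held-out validation DataFrame and its labels:

    # Permutation importance: shuffle one column at a time on held-out data
    # and measure how much the score of the already-trained model drops.
    from sklearn.inspection import permutation_importance

    result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
    for name, drop in zip(X_val.columns, result.importances_mean):
        print(name, drop)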

Once a CNN is trained, should its outputs be deterministic?

I just trained a CNN with Tensorflow/Keras and saved it as a model. I tried running about 1000 inputs through it multiple times, and each time got a slightly different prediction accuracy. The accuracy was good, and I am not concerned with the performance; however, I thought that CNN models, once trained, should be deterministic. That is, any input will always be classified the same way. Is this not the case? Is there variability in the way a model can predict once trained? If not, hopefully I can assume that I have programmed some variability into my code unawares. Any help would be appreciated.
Once a CNN is trained, should its outputs be deterministic?
Well, in theory, yes. In practice, as Peter Duniho points out in his excellent explanatory comment, we can see very small deviations because of the way values are calculated, aggregated, etc.
In practice, the probability of such small deviations changing the predicted category (and therefore the accuracy) of a classification model is so small that I'd be almost certain something else is at play in your example, even over a sample size of 1000.
Have you left on some training-time regularisation such as dropout or batch normalisation? Are you certain you are evaluating precisely the same 1000 inputs each time? I'd suspect the issue is in the code rather than rounding errors.
Can you determine which specific classifications change between runs?
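One quick sanity check (a sketch, assuming model is the loaded Keras model and x_test holds exactly the same 1000 inputs each time):

    # Run the same inputs through the saved model twice and compare.
    import numpy as np

    preds_a = model.predict(x_test)
    preds_b = model.predict(x_test)

    print(np.allclose(preds_a, preds_b))  # probabilities identical up to float noise?
    print((preds_a.argmax(axis=1) == preds_b.argmax(axis=1)).all())  # same predicted classes?

If the predicted classes differ even here, something in the model or the code is still behaving stochastically at inference time (for example, a layer explicitly called with training=True).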

Batch structure for training a ranking model with contrastive loss?

How do I build my batches if I train a deep ranking model with e.g. a contrastive loss, where each query has 1 positive document and 2 negative samples?
So it is about a ranking loss, applied to e.g. the Quora question-pairs data or any other question/answer pairs, which I want to rank using a deep learning ranking model or just a Siamese network.
The data would look like this: https://github.com/NTMC-Community/MatchZoo/blob/master/matchzoo/datasets/toy/train.csv
Now, I assume that how the batches are built is crucial, right? Since for every question, all of its corresponding positive and negative answers need to be contained inside the same batch, right?
Different strategies can be used to build the batches and the triplets or pairs. Usually, the batches are built randomly, and then the hardest negative, or one of the hardest negatives, in the batch is picked.
So yes, positive and negative examples need to be contained inside a batch, and picking the negatives is crucial. But usually the effort goes into picking the right negatives inside the batch, rather than into building the batches in a specific way.
This blog post explaining how ranking losses work may be useful: https://gombru.github.io/2019/04/03/ranking_loss/
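For illustration, here is a small sketch of hardest-negative mining within a batch; the function name and shapes are my own assumptions, not taken from the linked post. q_emb, pos_emb and neg_emb are assumed to be the embeddings your ranking model produced for one batch.

    # Hardest-negative triplet-style loss inside one batch.
    # q_emb: (B, d) query embeddings, pos_emb: (B, d) positive document embeddings,
    # neg_emb: (B, n_neg, d) negative document embeddings; all assumed L2-normalised.
    import numpy as np

    def hardest_negative_loss(q_emb, pos_emb, neg_emb, margin=0.2):
        pos_sim = np.sum(q_emb * pos_emb, axis=1)           # (B,) query-positive similarity
        neg_sim = np.einsum("bd,bnd->bn", q_emb, neg_emb)   # (B, n_neg) query-negative similarities
        hardest = neg_sim.max(axis=1)                       # most similar (hardest) negative per query
        return np.maximum(0.0, margin - pos_sim + hardest).mean()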

Reducing false positive in CNN (Conv1D) text classification model

I created a char-based CNN model for text classification with Keras + TensorFlow, mainly using Conv1D, based on:
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
The model is performing very well, with 80%+ accuracy on the test data set. However, I'm having a problem with false positives. One of the reasons could be that the final layer is a Dense layer with a softmax activation function.
To give an idea of how the model performs: I trained it on a data set with 31 classes and 1021 samples, and accuracy is ~85% on a 25% test split.
However, if you include out-of-scope inputs, the performance is pretty bad (I didn't run another test set for this since it's pretty obvious just from testing by hand): every input gets a corresponding prediction. For example, a nonsense sentence like acasklncasdjsandjas can still end up in a class such as ask_promotion.
Are there any best practices for dealing with false positives in this case?
My ideas are to:
Implement a noise class whose samples are just totally random text. However, this doesn't seem to help, since the noise doesn't contain any pattern, so it would be difficult to train the model on it.
Replace softmax with something that doesn't force the output probabilities to sum to 1, so small values can stay small regardless of the other values (see the sketch below). I did some research on this, but there's not much information on changing the activation function for this specific case.
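For the second idea, one possible sketch is to end the network with independent per-class sigmoids trained with binary cross-entropy, and reject inputs whose best score falls below a threshold. The layer sizes and vocabulary size below are placeholders, not my actual architecture:

    # Per-class sigmoid outputs instead of softmax, so a nonsense input can
    # score low on every class. Layer sizes and vocabulary are placeholders.
    from tensorflow import keras
    from tensorflow.keras import layers

    num_classes = 31
    model = keras.Sequential([
        layers.Input(shape=(1014,), dtype="int32"),
        layers.Embedding(input_dim=70, output_dim=16),
        layers.Conv1D(64, 7, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(num_classes, activation="sigmoid"),  # sigmoid, not softmax
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # At inference: probs = model.predict(x); reject if probs.max(axis=1) < threshold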
That sounds like the issue of imbalanced data, where two classes have completely different support (the number of instances in each class). This issue is particularly crucial in hierarchical classification, in which some classes with a deep hierarchy tend to have many more instances than the others.
Anyway, let's simplify the issue to binary classification, and call the class with much more support Class-A and the one with less support Class-B. Generally speaking, there are two popular ways to circumvent this issue.
Under-sampling: you keep Class-B as is, then sample instances from Class-A until you have the same amount as Class-B. Combine these instances and train your classifier on them.
Over-sampling: you keep Class-A as is, then sample instances from Class-B (with replacement) until you have the same amount as Class-A, and proceed as in Choice 1. A short sketch of both choices follows below.
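A rough sketch of both choices, assuming df is a pandas DataFrame with a "label" column where "A" is the majority class and "B" the minority class:

    # Under- and over-sampling with scikit-learn's resample utility.
    import pandas as pd
    from sklearn.utils import resample

    class_a = df[df["label"] == "A"]   # majority class
    class_b = df[df["label"] == "B"]   # minority class

    # Choice 1: under-sample Class-A down to the size of Class-B
    a_down = resample(class_a, replace=False, n_samples=len(class_b), random_state=0)
    train_under = pd.concat([a_down, class_b])

    # Choice 2: over-sample Class-B (with replacement) up to the size of Class-A
    b_up = resample(class_b, replace=True, n_samples=len(class_a), random_state=0)
    train_over = pd.concat([class_a, b_up])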
For more information, please refer to this KDnuggets page:
https://www.kdnuggets.com/2017/06/7-techniques-handle-imbalanced-data.html
Hope this helps. :P

Using unlabeled dataset in Keras

Usually, when using Keras, the datasets used to train the neural network are labeled.
For example, if I have 100,000 rows of patients with 12 fields per row, then the last field will indicate whether the patient is diabetic or not (0 or 1).
And then, after training is finished, I can insert a new record and predict whether this person is diabetic or not.
But in the case of unlabeled datasets, where I cannot label the data for some reason, how can I train the neural network so that it knows what the normal records look like, and treats any new record that does not match them as malicious or not accepted?
This is called one-class learning and is usually done using autoencoders. You train an autoencoder on the training data to reconstruct the data itself; the labels in this case are the inputs themselves. This gives you a reconstruction error. https://en.wikipedia.org/wiki/Autoencoder
Now you can define a threshold on the reconstruction error that decides whether a record is benign or not. The hope is that the good data is reconstructed better than the bad data.
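A minimal sketch with Keras, assuming x_train holds only "normal" 12-field records scaled to [0, 1]; the layer sizes and the 95th-percentile threshold are arbitrary choices for illustration:

    # One-class setup: autoencoder trained to reconstruct normal records only.
    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    autoencoder = keras.Sequential([
        layers.Input(shape=(12,)),
        layers.Dense(8, activation="relu"),
        layers.Dense(4, activation="relu"),      # bottleneck
        layers.Dense(8, activation="relu"),
        layers.Dense(12, activation="sigmoid"),  # reconstruct the 12 input fields
    ])
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(x_train, x_train, epochs=50, batch_size=32)  # target = input

    # Threshold on reconstruction error: records above it are flagged as suspicious.
    train_err = np.mean((autoencoder.predict(x_train) - x_train) ** 2, axis=1)
    threshold = np.percentile(train_err, 95)

    def is_suspicious(record):
        rec = record.reshape(1, -1)
        err = np.mean((autoencoder.predict(rec) - rec) ** 2)
        return err > threshold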
Edit to answer the question about the difference in performance between supervised and unsupervised learning.
This cannot be said with any certainty, because I have not tried it and I do not know what the final accuracy is going to be. But as a rough estimate, supervised learning will perform better on data like the training data, because more information is supplied to the algorithm. However, if the actual data is quite different from the training data, the network will underperform in practice, while an autoencoder tends to deal better with different data. Additionally, as a rule of thumb you should have about 5,000 examples per class to train a neural network reliably, so labeling could take some time. But you will need some labeled data for testing anyway.
It sounds like you need to fit two different models:
a model for bad record detection
a model for prediction of a patient's likelihood to be diabetic
For both of these models, you will need labels. For the first model, your labels would indicate whether the record is good or bad (malicious); for the second, whether the patient is diabetic or not.
In order to detect bad records, you may find that simple logistic regression or SVM performs adequately.
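As a sketch, assuming you can hand-label a sample of records as good (0) or bad (1) and hold them as X and y:

    # Simple bad-record detector with logistic regression (labels required).
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))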