Predict probability of predicted class - pandas

ML beginner here.
I have a dataset containing GPA, GRE, TOEFL, SOP and LOR rankings (out of 5), etc. (all numerical), and a final column that states whether or not each applicant was admitted to a university (0 or 1), which is what we'll use as y_train.
I'm supposed to not just predict the labels, but also calculate the probability of each person getting admitted.
Edit: following the first comment, I built a Logistic Regression model, and with some googling I found 'predict_proba' in sklearn and tried implementing it. There weren't any syntax errors, but the values given by predict_proba were horribly wrong.
Link: https://github.com/tarunn2799/gre-pred/blob/master/GRE%20Admission%20Probability-%20Extraaedge.ipynb
Please help me find where I've gone wrong, and also share any tips to reduce the loss.
Thank you!

I read your notebook, but I'm confused about why you think the predict_proba values are horribly wrong.
Is the prediction accuracy poor, or is the format of the predict_proba output not what you expected?
You could use sklearn.metrics.accuracy_score() and sklearn.metrics.confusion_matrix() to check your predicted labels, or sklearn.metrics.roc_auc_score() to check the output of predict_proba. It is better to check both the train and test sets.
I think the format of predict_proba is correct; alternatively, you could try predict_log_proba() to get the log probabilities.
Hope this helps.
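The checks suggested above can be sketched roughly as follows. This is only an illustration on synthetic data: the dataset from the question is not reproduced here, so make_classification with 7 features stands in for it, and the variable names are assumptions. Note that predict_proba returns one column per class, ordered as in model.classes_, so the admission probability is the column for class 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the admissions data (assumption: 7 numeric features, binary target).
X, y = make_classification(n_samples=500, n_features=7, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# predict_proba has shape (n_samples, n_classes); columns follow model.classes_.
proba = model.predict_proba(X_test)
p_admit = proba[:, list(model.classes_).index(1)]  # probability of class 1 ("admitted")

# Check the hard labels and the probabilities separately.
labels = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, labels))
print("confusion matrix:\n", confusion_matrix(y_test, labels))
print("ROC AUC:", roc_auc_score(y_test, p_admit))
```

If the first column is read as the admission probability by mistake, the values will look inverted, which is one possible reason for output that seems "horribly wrong".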

Related

GAN generator loss goes to zero

I am rather new to deep learning, please bear with me. I have a GAN, with model structure copy-pasted from: https://machinelearningmastery.com/how-to-develop-a-generative-adversarial-network-for-an-mnist-handwritten-digits-from-scratch-in-keras/
It will train for, say, 100-200 epochs with pretty OK results, then suddenly the generator loss drops to zero. Here is an excerpt from the log:
epoch,step,gen_loss,discr_loss
...
189,25,0.208,0.712
189,26,3.925,1.501
189,27,0.269,1.400
189,28,7.814,2.536
189,29,0.000,3.387 // here?!?
189,30,0.000,7.903
189,31,16.118,7.745
189,32,16.118,8.059
189,33,16.118,8.059
189,34,16.118,8.059
... etc, it never recovers
Is this a problem of vanishing gradients? Anything else I’m missing?
In the blog post comments, people discuss the GAN collapse problem. Here is one of the comments:
There were problems with the discriminator collapsing to zero on occasions. This seems to be a known feature of GANs. Do any established GAN hacks help with this?
Looking at the discriminator after 100 epochs, it was in a confused state where everything passed into it was circa 50% probability real/fake. I colour coded some generated examples based on discriminator probability (red = fake, green = real, blue = unsure based on an arbitrary banding) and as you mentioned the subjective versus discriminator output does not always tie up. (example posted on linkedin). There was not enough spread in discriminator probability output to make this meaningful.
GANs are very hard to train, and it is very common for the generator or the discriminator to become so strong that the other can't improve. If you are, for instance, trying to generate pictures, I would recommend progressive GANs, which improve stability a lot and allow for high-resolution images.
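One observation on the log in the question: the value the generator loss gets stuck at, 16.118, equals -ln(1e-7). That is what a binary cross-entropy produces when the prediction is completely wrong and has been clipped at 1e-7. The sketch below only illustrates that arithmetic. It assumes the losses are binary cross-entropy, as in the linked tutorial, and that predictions are clipped at 1e-7; the clipped_bce helper is hypothetical and is not the Keras implementation.

```python
import math

EPSILON = 1e-7  # assumption: the clipping value applied to predictions


def clipped_bce(y_true, y_pred, eps=EPSILON):
    """Binary cross-entropy for one sample, with the prediction clipped to [eps, 1 - eps]."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))


# Generator labels its fakes as "real" (1.0).
print(round(clipped_bce(1.0, 0.0), 3))  # discriminator outputs exactly 0 for fakes -> 16.118
print(round(clipped_bce(1.0, 1.0), 3))  # discriminator outputs exactly 1 for fakes -> 0.0
```

Under those assumptions, a loss of exactly 0.000 or 16.118 suggests the discriminator output has saturated at 1 or 0, where the gradients through the sigmoid are effectively zero, which would be consistent with the run never recovering.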

How to calculate the features' influence on an NN's prediction?

There are about 20 features in my data and I wonder if there is any way to check how each column influences the final prediction. For example if it has negative impact on prediction, it will be better to get rid of it.
sklearn's DecisionTreeClassifier has the attribute feature_importances_, which you can inspect after fitting.
I also want to know the answer. Maybe this can help us.
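Since the question is about a neural network rather than a tree, one model-agnostic option is permutation importance: shuffle one column at a time on held-out data and measure how much the score drops. Below is a minimal sketch using sklearn.inspection.permutation_importance. The synthetic 20-feature data and the small MLPClassifier are assumptions standing in for the asker's data and network.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in (assumption): 20 features, only 5 of them informative.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# A small neural network; any fitted estimator with a score method would work here.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
net.fit(X_train, y_train)

# Shuffle each feature 10 times on the validation set and record the mean drop in accuracy.
result = permutation_importance(net, X_val, y_val, n_repeats=10, random_state=0)

# Features sorted from most to least influential.
ranking = result.importances_mean.argsort()[::-1]
for i in ranking[:5]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

A feature whose importance is near zero (or slightly negative) on validation data is a candidate for removal, though correlated features can mask each other, so it is worth re-checking the score after dropping anything.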

Stata output variable to matrix with ebalance

I'm using the ebalance Stata package to calculate post-stratification weights, and I'd like to convert the weights output (_webal, which is generated as a double with format %10.0g) to a matrix.
I'd like to normalize all weights in the "control" group, but I can't seem to convert the variable to a matrix in order to manipulate the weights individually (I'm a novice to Stata, so I was just going to do this using a loop––I'd normally just export and do this in R, but I have to calculate results within a bootstrap). I can, however, view the individual-level weights produced by the output, and I can use them to calculate sample statistics.
Any ideas, anyone? Thanks so much!
This is not an answer, but it doesn't fit within a comment box.
As a self-described novice in Stata, you are asking the wrong question.
Your problem is that you have a variable that you want to do some calculations on, and since you can't just use R and you don't know how to do those (unspecified) calculations directly in Stata, you have decided that the first step is to create a matrix from the variable.
Your question would be better phrased as a simple description of the relevant portions of your data and the calculation you need to do using that data (ebalance is an obscure distraction that probably lost you a few readers) and where you are stuck.
See also https://stackoverflow.com/help/mcve for a discussion of creating a minimal, complete example with a description of the results you expect for that example.

Imbalanced Dataset for Multi Label Classification

So I trained a deep neural network on a multi-label dataset I created (about 20000 samples). I switched softmax for sigmoid and tried to minimize (using the Adam optimizer):
tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y_pred))
And I end up with this kind of prediction (pretty "constant"):
Prediction for Im1 : [ 0.59275776 0.08751075 0.37567005 0.1636796 0.42361438 0.08701646 0.38991812 0.54468459 0.34593087 0.82790571]
Prediction for Im2 : [ 0.52609032 0.07885984 0.45780018 0.04995904 0.32828355 0.07349177 0.35400775 0.36479294 0.30002621 0.84438241]
Prediction for Im3 : [ 0.58714485 0.03258472 0.3349618 0.03199361 0.54665488 0.02271551 0.43719986 0.54638696 0.20344526 0.88144571]
At first, I thought I just needed to find a threshold value for each class.
But I noticed that, for instance, among my 20000 samples, the 1st class appears about 10800 times, a ratio of 0.54, and that is the value my prediction hovers around every time. So I think I need to find a way to tackle this "imbalanced dataset" issue.
I thought about reducing my dataset (undersampling) to have about the same number of occurrences for each class, but only 26 samples correspond to one of my classes... That would make me lose a lot of samples...
I read about oversampling and about penalizing the rare classes even more, but did not really understand how these work.
Can someone share some explanations of these methods, please?
In practice, in TensorFlow, are there functions that help with that?
Any other suggestions?
Thank you :)
PS: The post "Neural Network for Imbalanced Multi-Class Multi-Label Classification" raises the same problem but has no answer!
Well, having 10000 samples in one class and just 26 in a rare class will be indeed a problem.
However, what you experience, to me, seems more like "outputs don't even see the inputs" and thus the net just learns your output distribution.
To debug this I would create a reduced set (just for debugging purposes) with, say, 26 samples per class and then try to heavily overfit. If you get correct predictions, my thought is wrong. But if the net cannot even fit those few undersampled examples, then it is indeed an architecture/implementation problem and not due to the skewed distribution (which you will then still need to fix, but it will not be as bad as your current results).
Your problem is not the class imbalance, but rather just the lack of data. 26 samples are considered a very small dataset for practically any real machine learning task. A class imbalance could be easily handled by ensuring that each minibatch has at least one sample from every class (this leads to situations where some samples are used much more frequently than others, but who cares).
However, with only 26 samples, this approach (and any other) will quickly lead to overfitting. This problem could be partly solved with some form of data augmentation, but there are still too few samples to construct something reasonable.
So, my suggestion will be to collect more data.
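On the question's "penalizing the rare classes even more" idea: the usual form is to multiply the positive term of the sigmoid cross-entropy by a per-class weight, so that a missed rare positive costs more. TensorFlow has tf.nn.weighted_cross_entropy_with_logits, which takes a pos_weight argument for this. The sketch below reproduces the same formula in NumPy so the arithmetic is visible. The helper names and the negatives-over-positives weighting rule are assumptions, one common choice among several, and this will not by itself fix the lack of data described above.

```python
import numpy as np


def positive_class_weights(labels):
    """Per-class weight = (#negatives / #positives) for a multi-hot label matrix."""
    pos = labels.sum(axis=0)
    neg = labels.shape[0] - pos
    return neg / np.maximum(pos, 1)  # guard against classes with no positives


def weighted_sigmoid_cross_entropy(labels, logits, pos_weight):
    """Mean of -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    log_sig = -np.logaddexp(0.0, -logits)       # log(sigmoid(x)), numerically stable
    log_one_minus = -np.logaddexp(0.0, logits)  # log(1 - sigmoid(x))
    loss = -(pos_weight * labels * log_sig + (1.0 - labels) * log_one_minus)
    return loss.mean()


# Toy multi-hot labels: class 0 is common (3 of 4 samples), class 1 is rare (1 of 4).
y = np.array([[1, 0], [1, 0], [1, 1], [0, 0]], dtype=float)
w = positive_class_weights(y)
print(w)  # the rare class gets the larger weight

logits = np.zeros_like(y)  # an untrained net that outputs 0.5 everywhere
print(weighted_sigmoid_cross_entropy(y, logits, w))
```

With 26 positives out of 20000 the raw ratio would be very large, so in practice the weights are often capped or smoothed (for example with a square root); that is a tuning decision rather than a rule.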

Multivariate time-series RNN using Tensorflow. Is this possible with an LSTM cell or similar?

I am looking for examples of how to build a multivariate time-series RNN using Tensorflow. Is this possible with an LSTM cell or similar?
e.g. the data might look something like this:
Time,A,B,C,...
0,3.5,4.5,7.7,...
1,2.1,6.4,8.2,...
...
Any help much appreciated. Thanks, John
It depends on exactly what you mean, but yes, it should be possible. If you describe more specifically what your input and target data look like, somebody may be able to help. You can generally have sequential continuous or categorical input data and sequential continuous or categorical output data, or a mix of those. I would suggest you look at the tutorials and try out a few things, then ask again here.
Thanks. I have figured it out now. I misunderstood the docs 'inputs: A length T list of inputs, each a vector with shape [batch_size].'
The following link was useful:
https://m.reddit.com/r/MachineLearning/comments/3sok8k/tensorflow_basic_rnn_example_with_variable_length/
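For anyone landing here with the same shape confusion: recurrent layers generally expect each example to be a sequence of feature vectors, i.e. input of shape [batch, time steps, features] (this is the layout the Keras LSTM layer takes by default). A table like the one in the question, with one row per time step and one column per variable, can be cut into such windows as sketched below. The make_windows helper and the next-step prediction target are assumptions for illustration; the question does not state what the target is.

```python
import numpy as np


def make_windows(series, window):
    """Slice a [time, features] array into inputs [batch, window, features]
    and next-step targets [batch, features]."""
    n = series.shape[0] - window
    inputs = np.stack([series[i:i + window] for i in range(n)])
    targets = series[window:]  # the row that immediately follows each window
    return inputs, targets


# Toy multivariate series: 10 time steps, 3 variables (columns A, B, C).
series = np.arange(30, dtype=float).reshape(10, 3)
X, y = make_windows(series, window=4)
print(X.shape, y.shape)  # (6, 4, 3) (6, 3)
```

Each X[i] holds 4 consecutive rows of all three variables, and y[i] is the row that follows, so the multivariate case differs from the univariate one only in the size of the last axis.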