Binomial And Multinomial Classification in ML - data-science

I have a project in which my task is to build a network intrusion detection system that detects anomalies and attacks in the network.
There are two problems:
1. Binomial Classification: Activity is normal or attack
2. Multinomial classification: Activity is normal or DOS or PROBE or R2L or U2R
Before starting, I am confused by the terms binomial and multinomial classification.
Please help me understand them, and if possible share a short code example.
I searched for these terms on Google and YouTube but could not find a proper definition with code.
So far my code only does the following:
clean/transform/outlier detect/missing value treatment
model_selection/accuracy test
My next step is to build the binomial and multinomial classifiers.
Thanks for the help.

First, do not hesitate to post this kind of question on https://datascience.stackexchange.com/, since it is more a data science question than a coding issue.
Second, the answer is simple:
Binary (not binomial) classification means the target has only 2 classes.
=> In your case: Normal vs Attack.
Multiclass (also called multinomial) classification means the target has more than 2 classes, and each sample belongs to exactly one of them. (Multilabel classification is different: there each sample can carry several labels at once.)
=> In your case: Normal, DOS, PROBE, R2L and U2R.
You can find examples at https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
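As a rough illustration of the difference, the two problems can share the same features and differ only in the target column. The sketch below uses a handful of made-up labels (not taken from any real dataset) and shows how the five-class target and the two-class target could be derived from the same raw labels; any classifier could then be fitted on either target.

```python
# Hypothetical labels for a handful of network connections (illustrative only).
raw_labels = ["normal", "DOS", "PROBE", "normal", "R2L", "U2R", "DOS"]

# Multiclass target: one integer per class, five classes in total.
classes = ["normal", "DOS", "PROBE", "R2L", "U2R"]
class_to_index = {name: i for i, name in enumerate(classes)}
y_multiclass = [class_to_index[label] for label in raw_labels]

# Binary target: collapse every attack type into a single "attack" class (1).
y_binary = [0 if label == "normal" else 1 for label in raw_labels]

print(y_multiclass)
print(y_binary)
```

The binary model answers "is this an attack at all?", while the multiclass model additionally has to say which kind of attack it is.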


GPT2 paper clarification

In the GPT-2 paper, under Section 2, Page 3 it says,
Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective.
I didn't follow this line of reasoning. What is the logic behind concluding this?
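The underlying principle here is that if f is a function with domain D and S is a subset of D, then if d maximizes f over D and d happens to be in S, then d also maximizes f over S. (The same holds with "minimizes" in place of "maximizes", which is the relevant case for a loss function.)
In simpler words, an optimum over the whole domain is also an optimum over any subset that contains it.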
Now how does this apply to GPT-2? Let's look at how GPT-2 is trained.
First step: GPT-2 uses unsupervised training to learn the distribution of the next token in a sequence by examining examples in a huge corpus of existing text. By this point, it should be able to output valid words and complete things like "Hello ther" to "Hello there".
Second step: GPT-2 uses supervised training at specific tasks such as answering specific questions posed to it such as "Who wrote the book the origin of species?" Answer "Charles Darwin".
Question: Does the second step of supervised training undo general knowledge that GPT-2 learned in the first step?
Answer: No, the question-answer pair "Who wrote the book the origin of species? Charles Darwin." is itself valid English text that comes from the same distribution that the network is trying to learn in the first place. It may well even appear verbatim in the corpus of text from step 1. Therefore, these supervised examples are elements of the same domain (valid English text) and optimizing the loss function to get these supervised examples correct is working towards the same objective as optimizing the loss function to get the unsupervised examples correct.
In simpler words, supervised question-answer pairs or other specific tasks that GPT-2 was trained to do use examples from the same underlying distribution as the unsupervised corpus text, so they are optimizing towards the same goal and will have the same global optimum.
Caveat: you can still accidentally end up in a local minimum due to (over)training on these supervised examples that you might not have run into otherwise. However, GPT-2 was revolutionary in its field, and whether or not this happened with GPT-2, it still made significant progress over the state of the art before it.
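The quoted sentence can be checked on a toy case. The sketch below is only an illustration under simplifying assumptions: a "sequence" of two positions, a binary vocabulary, made-up true next-token probabilities, and a model that is free to choose its prediction at each position independently. The expected cross-entropy at each position is smallest when the prediction equals the true probability, so the candidate that minimizes the loss over all positions also minimizes the loss restricted to a subset of positions.

```python
import math
from itertools import product

def cross_entropy(p, q):
    """Expected negative log-likelihood when the truth is Bernoulli(p)
    and the model predicts Bernoulli(q)."""
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

# Made-up true probabilities of the next token at two positions.
p_true = (0.7, 0.2)

# Candidate models: every pair of predicted probabilities on a coarse grid.
grid = [i / 10 for i in range(1, 10)]
candidates = list(product(grid, grid))

def unsupervised_loss(q):
    # Loss evaluated at every position of the sequence.
    return sum(cross_entropy(p, qi) for p, qi in zip(p_true, q))

def supervised_loss(q):
    # Same loss, but evaluated only on the second position (the "answer").
    return cross_entropy(p_true[1], q[1])

best = min(candidates, key=unsupervised_loss)
print(best)
print(supervised_loss(best) == min(supervised_loss(q) for q in candidates))
```

With a model that cannot fit every position at once, or with finite data, the two minimizers need not coincide, which is the situation the caveat above is about.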

GAN generator loss goes to zero

I am rather new to deep learning, please bear with me. I have a GAN, with model structure copy-pasted from: https://machinelearningmastery.com/how-to-develop-a-generative-adversarial-network-for-an-mnist-handwritten-digits-from-scratch-in-keras/
It trains for, say, 100-200 epochs with pretty OK results, then suddenly the generator loss drops to zero. Here is an excerpt from the log:
epoch,step,gen_loss,discr_loss
...
189,25,0.208,0.712
189,26,3.925,1.501
189,27,0.269,1.400
189,28,7.814,2.536
189,29,0.000,3.387 // here?!?
189,30,0.000,7.903
189,31,16.118,7.745
189,32,16.118,8.059
189,33,16.118,8.059
189,34,16.118,8.059
... etc, it never recovers
Is this a problem of vanishing gradients? Anything else I’m missing?
In the comments on that blog post, people discuss the GAN collapse problem. Here is one comment:
There were problems with the discriminator collapsing to zero on occasions. This seems to be a known feature of GANs. Do any established GAN hacks help with this?
Looking at the discriminator after 100 epochs, it was in a confused state where everything passed into it was circa 50% probability real/fake. I colour coded some generated examples based on discriminator probability (red = fake, green = real, blue = unsure based on an arbitrary banding) and as you mentioned the subjective versus discriminator output does not always tie up. (example posted on LinkedIn). There was not enough spread in discriminator probability output to make this meaningful.
GANs are very hard to train, and it is common for the generator or the discriminator to become so strong that the other cannot improve. If you are generating pictures, for instance, I would recommend progressive GANs, which improve stability considerably and make high-resolution images feasible.
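One detail in the log may help with diagnosis: 16.118 is almost exactly -ln(1e-7), and 8.059 is half of that. Assuming the losses are Keras binary cross-entropy with the default clipping epsilon of 1e-7, and assuming the discriminator is updated on batches that are half real and half generated (both are assumptions about the setup, not something stated in the question), these values would be consistent with the discriminator output saturating at 0 ("fake") for every input. The arithmetic can be sketched as:

```python
import math

EPSILON = 1e-7  # Keras' default clipping epsilon (assumed to be unchanged here)

def clipped_bce(y_true, y_pred, eps=EPSILON):
    """Binary cross-entropy for one sample, prediction clipped to [eps, 1 - eps]."""
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# Generator step: generated images are labelled 1 ("real"),
# but the discriminator outputs 0 for them.
gen_loss = clipped_bce(1, 0.0)

# Discriminator step: half the batch is real (label 1), half is fake (label 0),
# and the discriminator outputs 0 for everything.
discr_loss = 0.5 * clipped_bce(1, 0.0) + 0.5 * clipped_bce(0, 0.0)

print(round(gen_loss, 3), round(discr_loss, 3))  # 16.118 8.059
```

If that reading is right, the generator receives essentially no useful gradient from a saturated discriminator, which would explain why training never recovers.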

Machine Learning Algorithm for multiple output features

I am looking for a machine learning algorithm that predicts multiple output variables. The output is something like a vector [A, ..., X], each element of which can be 0 or 1. I have training data with the required input features.
Which algorithm should I use in this case? With my limited knowledge, I know that multi-label classification can solve the problem where one output variable can take multiple values, like color. But here there are multiple output variables, each taking 0 or 1. Please let me know.
It is difficult to give an answer on which algorithm is the best without more information.
A perceptron-style neural network whose output layer has one binary (threshold function) neuron per output variable could be a good candidate.
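To make the suggested output layer concrete, here is a minimal NumPy sketch of a single layer with one sigmoid unit per output variable, each thresholded independently. The sizes, the random (untrained) weights and the input rows are all made up for illustration; a real model would learn `W` and `b` from the training data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_outputs = 4, 3                   # hypothetical sizes
W = rng.normal(size=(n_features, n_outputs))   # untrained, random weights
b = np.zeros(n_outputs)

def predict(X, threshold=0.5):
    """One sigmoid per output variable; each one is thresholded independently."""
    probs = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return probs, (probs >= threshold).astype(int)

X = rng.normal(size=(5, n_features))           # five made-up input rows
probs, labels = predict(X)
print(labels)                                  # a 5 x 3 array of 0s and 1s
```

Because every output has its own sigmoid, any combination of 0s and 1s is possible for a given input, which is what separates this setup from a softmax layer that picks exactly one class.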

Imbalanced Dataset for Multi Label Classification

So I trained a deep neural network on a multi-label dataset I created (about 20000 samples). I switched softmax for sigmoid and try to minimize (using the Adam optimizer):
tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y_, logits=y_pred))
And I end up with this kind of prediction (pretty "constant"):
Prediction for Im1 : [ 0.59275776 0.08751075 0.37567005 0.1636796 0.42361438 0.08701646 0.38991812 0.54468459 0.34593087 0.82790571]
Prediction for Im2 : [ 0.52609032 0.07885984 0.45780018 0.04995904 0.32828355 0.07349177 0.35400775 0.36479294 0.30002621 0.84438241]
Prediction for Im3 : [ 0.58714485 0.03258472 0.3349618 0.03199361 0.54665488 0.02271551 0.43719986 0.54638696 0.20344526 0.88144571]
At first, I thought I just needed to find a threshold value for each class.
But I noticed that, for instance, among my 20000 samples, the 1st class appears about 10800 times, so a ratio of 0.54, and that is the value my prediction hovers around every time. So I think I need to find a way to tackle this "imbalanced dataset" issue.
I thought about reducing my dataset (undersampling) to have about the same number of occurrences for each class, but only 26 samples correspond to one of my classes... That would make me lose a lot of samples...
I read about oversampling and about penalizing the rare classes even more, but did not really understand how they work.
Can someone share some explanations of these methods, please?
In practice, in TensorFlow, are there functions that help with that?
Any other suggestions?
Thank you :)
PS: "Neural Network for Imbalanced Multi-Class Multi-Label Classification": this post raises the same problem but had no answer!
Well, having 10000 samples in one class and just 26 in a rare class will indeed be a problem.
However, what you are seeing looks to me more like "the outputs don't even see the inputs", so the net just learns your output distribution.
To debug this, I would create a reduced set (just for debugging) with, say, 26 samples per class and then try to heavily overfit it. If you get correct predictions, my hunch is wrong. But if the net cannot even fit those undersampled samples, then it is an architecture/implementation problem and not due to the skewed distribution (which you will still need to fix, but it will not be as bad as your current results).
Your problem is not the class imbalance but simply the lack of data. 26 samples is a very small dataset for practically any real machine learning task. Class imbalance can be handled fairly easily by ensuring that each minibatch has at least one sample from every class (some samples will then be used much more often than others, but that is acceptable).
However, with only 26 samples, this approach (and any other) will quickly lead to overfitting. This can be partly mitigated with some form of data augmentation, but there are still too few samples to build something reasonable.
So my suggestion is to collect more data.
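On the question of how "penalizing the rare classes more" works: one common form is a per-class weight on the positive term of the sigmoid cross-entropy, which is what TensorFlow's tf.nn.weighted_cross_entropy_with_logits does through its pos_weight argument. The NumPy sketch below reimplements that formula on made-up data so the effect is visible; the rule used to choose the weights (negatives divided by positives per class) is one heuristic among several and is an assumption here, not something from the answers above.

```python
import numpy as np

def weighted_sigmoid_cross_entropy(labels, logits, pos_weight):
    """Per-element loss: the positive term is scaled by pos_weight,
    the negative term is left unchanged."""
    # log(sigmoid(x)) and log(1 - sigmoid(x)) in a numerically stable form.
    log_p = -np.logaddexp(0.0, -logits)
    log_not_p = -np.logaddexp(0.0, logits)
    return -(pos_weight * labels * log_p + (1.0 - labels) * log_not_p)

# Made-up multi-label targets: 8 samples, 2 classes (the second one is rare).
y = np.array([[1, 0], [1, 0], [0, 0], [1, 1],
              [0, 0], [1, 0], [0, 0], [1, 0]], dtype=float)

# Heuristic weight per class: number of negatives / number of positives.
n_pos = y.sum(axis=0)
pos_weight = (len(y) - n_pos) / n_pos

logits = np.zeros_like(y)  # stand-in for a model that outputs 0.5 everywhere
loss = weighted_sigmoid_cross_entropy(y, logits, pos_weight).mean()
print(pos_weight, loss)
```

With these weights, missing the single positive of the rare class costs as much as getting all seven of its negatives wrong, so the network can no longer lower the loss just by predicting the class frequency. It does not create information that 26 samples do not contain, so the overfitting concern raised above still applies.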

Has anyone managed to make Asynchronous advantage actor critic work with Mujoco experiments?

I'm using an open source implementation of A3C in TensorFlow which works reasonably well for Atari 2600 experiments. However, when I modify the network for MuJoCo, as outlined in the paper, the network refuses to learn anything meaningful. Has anyone managed to make any open source implementation of A3C work with continuous domain problems, for example MuJoCo?
I have implemented A3C with continuous actions for Pendulum, and it works well.
First, build your neural network so that it outputs a mean (mu) and a standard deviation (sigma) for selecting an action.
The essential part of handling continuous actions is to include a normal distribution. I'm using TensorFlow, so the code looks like:
import tensorflow as tf  # TensorFlow 1.x; tf.contrib does not exist in TensorFlow 2.x
# mu, sigma: outputs of the policy network; action, td_error: fed in from the rollout
normal_dist = tf.contrib.distributions.Normal(mu, sigma)
log_prob = normal_dist.log_prob(action)
exp_v = log_prob * td_error  # policy-gradient term, weighted by the TD error
entropy = normal_dist.entropy() # encourage exploration
exp_v = tf.reduce_sum(0.01 * entropy + exp_v)
actor_loss = -exp_v  # minimizing the negative maximizes the objective
When you want to sample an action, use the function TensorFlow provides:
sampled_action = normal_dist.sample(1)
The full Pendulum code can be found on my GitHub: https://github.com/MorvanZhou/tutorials/blob/master/Reinforcement_learning_TUT/10_A3C/A3C_continuous_action.py
I was hung up on this for a long time, hopefully this helps someone in my shoes:
Advantage Actor-critic in discrete spaces is easy: if your actor does better than you expect, increase the probability of doing that move. If it does worse, decrease it.
In continuous spaces though, how do you do this? The entire vector your policy function outputs is your move -- if you are on-policy, and you do better than expected, there's no way of saying "let's output that action even more!" because you're already outputting exactly that vector.
That's where Morvan's answer comes into play. Instead of outputting just an action, you output a mean and a std-dev for each output-feature. To choose an action, you pass your inputs in to create a mean/stddev for each output-feature, and then sample each feature from this normal distribution.
If you do well, you adjust the weights of your policy network to change the mean/stddev to encourage this action. If you do poorly, you do the opposite.
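The same idea can be written out without TensorFlow to show what the distribution object is computing. The NumPy sketch below is an illustration only: the policy outputs and the advantage are made-up numbers, and there is no network or gradient step. It shows the pieces that make up the actor loss: sampling from a per-dimension normal distribution, the log-probability of the sampled action, and the entropy bonus.

```python
import numpy as np

def gaussian_log_prob(action, mu, sigma):
    """Log-density of `action` under an independent Normal(mu, sigma) per dimension."""
    return -0.5 * ((action - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)

def gaussian_entropy(sigma):
    """Entropy of Normal(mu, sigma); it depends only on sigma."""
    return 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

rng = np.random.default_rng(0)

# Made-up policy outputs for a 2-dimensional action.
mu = np.array([0.3, -1.0])
sigma = np.array([0.5, 0.2])

action = rng.normal(mu, sigma)  # sample each action dimension
advantage = 1.5                 # hypothetical: the action did better than expected

# Same shape as the actor loss in the first answer: advantage-weighted
# log-probability plus a small entropy bonus (0.01 is the coefficient used there).
objective = np.sum(gaussian_log_prob(action, mu, sigma) * advantage
                   + 0.01 * gaussian_entropy(sigma))
actor_loss = -objective
print(action, actor_loss)
```

With a positive advantage, lowering this loss raises the log-probability of the sampled action, which moves the mean toward it. With a negative advantage the update pushes the other way, matching the description above.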