I'm starting to study machine learning, and I have basic knowledge of it. If I consider a generic machine learning algorithm M, I would like to know precisely what its inputs and outputs are. I'm not referring to an implementation in some programming language; I'm talking about the theory of machine learning.
Take the example of supervised learning. The input of M should be the collection of pairs (x, f(x)) related to the function f the algorithm must learn. From these, it builds some function h which approximates f. Should the output of M be h?
The output of an ML algorithm depends on the task; it is whatever you design it to be.
For example:
Regression: 1 value
Classification: n classes (with the probability that the input is a member of each class)
Text summarization: One word, one character, a batch of them or the whole text summarized.
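To make the question's framing concrete, here is a minimal sketch, assuming the simplest possible case: a learner M that takes a list of (x, f(x)) pairs and returns a function h. The name `learn`, the linear form of h, and the sample data are all illustrative assumptions, not a general definition of a learning algorithm.

```python
from typing import Callable, List, Tuple

Example = Tuple[float, float]          # one (x, f(x)) pair
Hypothesis = Callable[[float], float]  # the learned function h


def learn(examples: List[Example]) -> Hypothesis:
    """Hypothetical learner M: fits h(x) = a*x + b by least squares."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in examples)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    a = cov_xy / var_x
    b = mean_y - a * mean_x

    def h(x: float) -> float:
        # The output of M is this function, not a single number.
        return a * x + b

    return h


# Samples of an (assumed) unknown target f(x) = 2x + 1
training_pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
h = learn(training_pairs)
prediction = h(10.0)
```

In this view the input of M is the set of labeled pairs and the output is h; the per-task outputs listed above (a value, class probabilities, text) are what h itself returns.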
GPT2 paper clarification

In the GPT-2 paper, under Section 2, Page 3 it says,
Since the supervised objective is the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective.
I didn't follow this line of reasoning. What is the logic behind concluding this?
The underlying principle here is that if f is a function with domain D and S is a subset of D, then if d maximizes f over D and d happens to be in S, then d also maximizes f over S.
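In simpler words, a point that is optimal over the whole domain is also optimal over any subset that contains it (loosely, "a global maximum is also a local maximum").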
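The principle can be checked on a toy case. This is only an illustration with a made-up function f, domain D, and subset S; it says nothing about GPT-2's actual loss.

```python
def f(x: int) -> int:
    # Toy objective with its maximum at x = 3 (an arbitrary choice)
    return -(x - 3) ** 2


D = range(-10, 11)   # the full domain
S = [1, 2, 3, 4]     # a subset of D that happens to contain the maximizer

best_in_D = max(D, key=f)
best_in_S = max(S, key=f)

# d maximizes f over D and d is in S, so d also maximizes f over S
assert best_in_D in S
assert best_in_S == best_in_D
```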
Now how does this apply to GPT-2? Let's look at how GPT-2 is trained.
First step: GPT-2 uses unsupervised training to learn the distribution of the next token in a sequence by examining examples in a huge corpus of existing text. By this point, it should be able to output valid words and complete things like "Hello ther" to "Hello there".
Second step: consider supervised training on a specific task, such as answering a question posed to the model, e.g. "Who wrote the book the origin of species?" with the answer "Charles Darwin".
Question: Does the second step of supervised training undo general knowledge that GPT-2 learned in the first step?
Answer: No, the question-answer pair "Who wrote the book the origin of species? Charles Darwin." is itself valid English text that comes from the same distribution that the network is trying to learn in the first place. It may well even appear verbatim in the corpus of text from step 1. Therefore, these supervised examples are elements of the same domain (valid English text) and optimizing the loss function to get these supervised examples correct is working towards the same objective as optimizing the loss function to get the unsupervised examples correct.
In simpler words, supervised question-answer pairs or other specific tasks that GPT-2 was trained to do use examples from the same underlying distribution as the unsupervised corpus text, so they are optimizing towards the same goal and will have the same global optimum.
Binomial And Multinomial Classification in ML

I have a project in which my task is to build a network intrusion detection system to detect anomalies and attacks in the network.
There are two problems.
1. Binomial Classification: Activity is normal or attack
2. Multinomial classification: Activity is normal or DOS or PROBE or R2L or U2R
But before this, I am confused about the terms Binomial/Multinomial Classification.
Please help me understand them and, if possible, share a short piece of code, which would help me more.
I tried searching for these terms on Google/YouTube but couldn't find a proper definition with some code.
So far I have only done these things in my code:
clean/transform/outlier detect/missing value treatment
model_selection/accuracy test
So my next step is to do the Binomial/Multinomial Classification.
First, do not hesitate to post on https://datascience.stackexchange.com/ for this kind of question, which is more of a data science question than a coding issue.
Second, the answer is as simple as:
Binary (and not Binomial) Classification means only 2 targets to find.
=> In your case Normal vs Attack
Multiclass / Multinomial Classification means more than 2 mutually exclusive targets to find. (Multilabel is a different setting, where one sample can carry several labels at once.)
=> Your case: Normal, DOS, PROBE, R2L & U2R.
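As a rough sketch of the difference, the two problems use the same records and differ only in how the target column is encoded. The sample `activities` list below is made up for illustration; the class names come from the question.

```python
# Hypothetical activity labels for six network records
activities = ["normal", "DOS", "PROBE", "normal", "R2L", "U2R"]

# Multiclass target: one integer per class (5 mutually exclusive classes)
classes = ["normal", "DOS", "PROBE", "R2L", "U2R"]
class_index = {name: i for i, name in enumerate(classes)}
y_multiclass = [class_index[a] for a in activities]

# Binary target: 0 = normal, 1 = any kind of attack
y_binary = [0 if a == "normal" else 1 for a in activities]
```

Either target list can then be passed to whatever classifier you selected in your model_selection step; only the number of distinct target values changes.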
Machine Learning Algorithm for multiple output features

I am looking for a machine learning algorithm where I have multiple variables as output. It is something like a vector [A, ..., X], each element of which can have the value 0 or 1. I have data to train the model with the required input features.
Which algorithm should I use for such a case? With my limited knowledge, I know that multi-label classification can solve the problem where one output variable can take multiple values, like color. But this case has multiple output variables, each taking 0 or 1. Please let me know.
It is difficult to give an answer on which algorithm is the best without more information.
Why do we reverse input when feeding in seq2seq model in tensorflow (tf.reverse(inputs, [-1]))

Why do we reverse the input when feeding it into a seq2seq model in tensorflow (tf.reverse(inputs, [-1]))?
To the best of my knowledge, reversing the input arose from the paper Sequence to sequence learning with neural networks.
The idea originated in machine translation (I'm not sure how it plays out in other domains, e.g. chatbots). Think of the following scenario (borrowed from the original paper). You want to translate,
A B C -> alpha beta gamma delta
In this setting, we have to go through the full source sequence (ABC) before starting to predict alpha, where the translator might have forgotten about A by then. But when you do this as,
C B A -> alpha beta gamma delta
You have a strong communication link from A to alpha, where A is "probably" related to alpha in the translation.
Note: This entirely depends on your translation task. If the target language is written in the reverse order of the source language (e.g. think of translating from a subject-verb-object language to an object-verb-subject language), I think it's better to keep the original order.
Quoting the paper:
While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source sentences are reversed (the target sentences are not reversed). By doing so, the LSTM’s test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6.
While we do not have a complete explanation to this phenomenon, we believe that it is caused by the introduction of many short term dependencies to the dataset. Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. As a result, the problem has a large “minimal time lag” [17]. By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced. Thus, backpropagation has an easier time “establishing communication” between the source sentence and the target sentence, which in turn results in substantially improved overall performance.
Initially, we believed that reversing the input sentences would only lead to more confident predictions in the early parts of the target sentence and to less confident predictions in the later parts. However, LSTMs trained on reversed source sentences did much better on long sentences than LSTMs trained on the raw source sentences.
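The "minimal time lag" argument can be checked with a small calculation. This is a sketch under a strong simplifying assumption that is not from the paper: source and target have the same length, and the i-th source word corresponds to the i-th target word. The function name `alignment_distances` is made up for this example.

```python
def alignment_distances(source_len: int, reverse_source: bool) -> list:
    """Distance between each source word and its target word in the
    concatenated (source + target) sequence, assuming word i maps to word i."""
    distances = []
    for i in range(source_len):
        source_pos = (source_len - 1 - i) if reverse_source else i
        target_pos = source_len + i  # target words follow the source
        distances.append(target_pos - source_pos)
    return distances


source = ["A", "B", "C", "D"]
reversed_source = source[::-1]  # plain-Python analogue of reversing the input

n = len(source)
plain = alignment_distances(n, reverse_source=False)
flipped = alignment_distances(n, reverse_source=True)

average_plain = sum(plain) / n
average_flipped = sum(flipped) / n
```

Under this assumption the average distance is the same in both cases, but the smallest distance drops from n to 1 once the source is reversed, which matches the explanation quoted above.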
Inference on several inputs in order to calculate the loss function

I am modeling a perceptual process in tensorflow. In the setup I am interested in, the modeled agent is playing a resource game: it has to choose 1 out of n resources, relying only on the label that a classifier gives to each resource. Each resource is an ordered pair of two reals. The classifier only sees the first real, but payoffs depend on the second. There is a function mapping the first to the second.
Anyway, ideally I'd like to train the classifier in the following way:
In each run, the classifier gives labels to n resources.
The agent then gets the payoff of the resource corresponding to the highest label in some predetermined ranking (say, A > B > C > D), and randomly in case of draw.
The loss is taken to be the normalized absolute difference between the payoff thus obtained and the maximum payoff in the set of resources. I.e., (Payoff_max - Payoff) / Payoff_max
For this to work, one needs to run inference n times, once for each resource, before calculating the loss. Is there a way to do this in tensorflow? If I am tackling the problem in the wrong way feel free to say so, too.
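For reference, the procedure described above can be written out in plain Python. This is only a sketch of the loss computation: the threshold classifier `classify`, the sample resources, and the function name `episode_loss` are hypothetical, the maximum payoff is assumed to be positive, and nothing here is differentiable or TensorFlow-specific.

```python
import random

RANKING = ["A", "B", "C", "D"]  # best label first, as in A > B > C > D


def classify(first: float) -> str:
    # Hypothetical stand-in for the classifier: it sees only the first real
    if first >= 0.75:
        return "A"
    if first >= 0.5:
        return "B"
    if first >= 0.25:
        return "C"
    return "D"


def episode_loss(resources, classifier, rng) -> float:
    # One inference per resource: n inferences before the loss is computed
    labels = [classifier(first) for first, _ in resources]
    best_rank = min(RANKING.index(label) for label in labels)
    tied = [i for i, label in enumerate(labels)
            if RANKING.index(label) == best_rank]
    chosen = rng.choice(tied)  # random choice in case of a draw
    payoff = resources[chosen][1]
    payoff_max = max(second for _, second in resources)
    return (payoff_max - payoff) / payoff_max


# Each resource is (first, second); the payoff is the second real
resources = [(0.1, 1.0), (0.5, 4.0), (0.9, 2.0)]
loss = episode_loss(resources, classify, random.Random(0))
```

Here the labels are D, B, A, so the agent picks the third resource (payoff 2.0) while the best payoff is 4.0, giving a loss of 0.5.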
I don't have much knowledge of the ML aspects of this, but from a programming point of view, I can see doing it in two ways. One is by copying your model n times. All the copies can share the same variables. The output of all of these copies would go into some function that determines the highest label. As long as this function is differentiable, variables are shared, and n is not too large, it should work. You would need to feed all n inputs together. Note that backprop will run through each copy and update your weights n times. This is generally not a problem, but if it is, I heard about some fancy tricks one can do by using partial_run.
Another way is to use tf.while_loop. It is pretty clever - it stores activations from each run of the loop and can do backprop through them. The only tricky part should be to accumulate the inference results before feeding them to your loss. Take a look at TensorArray for this. This question can be helpful: Using TensorArrays in the context of a while_loop to accumulate values