GPT2 paper clarification - gpt-2

In the GPT-2 paper, under Section 2, Page 3 it says,
Since the supervised objective is the the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective.
I didn't follow this line of reasoning. What is the logic behind concluding this?

The underlying principle here is that if f is a function with domain D and S is a subset of D, then if d maximizes f over D and d happens to be in S, then d also maximizes f over S.
In simpler words, "a global maximum is also a local maximum": a point that is optimal over the whole domain stays optimal over any subset that contains it.
Now how does this apply to GPT-2? Let's look at how GPT-2 is trained.
First step: GPT-2 uses unsupervised training to learn the distribution of the next token (roughly, the next piece of a word) in a sequence by examining examples in a huge corpus of existing text. By this point, it should be able to output valid words and to complete things like "Hello ther" to "Hello there".
Second step: GPT-2 uses supervised training on specific tasks, such as answering questions posed to it: "Who wrote the book the origin of species?" Answer: "Charles Darwin".
Question: Does the second step of supervised training undo general knowledge that GPT-2 learned in the first step?
Answer: No, the question-answer pair "Who wrote the book the origin of species? Charles Darwin." is itself valid English text that comes from the same distribution that the network is trying to learn in the first place. It may well even appear verbatim in the corpus of text from step 1. Therefore, these supervised examples are elements of the same domain (valid English text) and optimizing the loss function to get these supervised examples correct is working towards the same objective as optimizing the loss function to get the unsupervised examples correct.
In simpler words, supervised question-answer pairs or other specific tasks that GPT-2 was trained to do use examples from the same underlying distribution as the unsupervised corpus text, so they are optimizing towards the same goal and will have the same global optimum.
Caveat: you can still accidentally end up in a local minimum due to (over)training on these supervised examples, one that you might not have run into otherwise. However, GPT-2 was revolutionary in its field, and whether or not this happened with GPT-2, it still made significant progress over the prior state of the art.
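To make the "subset of the sequence" argument concrete, here is a minimal numeric sketch. It assumes the objective is a sum of per-position cross-entropy terms, and that the supervised task only scores some positions (the answer tokens). The distributions, the position indices and the names `true_dists`, `answer_positions` and `loss` are all made up for illustration; nothing here comes from the paper itself.

```python
import math

def cross_entropy(p, q):
    """Expected negative log-likelihood of model q under the true distribution p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true next-token distributions at four positions of one sequence
# (vocabulary of 3 symbols). These numbers are invented for the example.
true_dists = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.05, 0.05, 0.9],
]
answer_positions = [2, 3]  # assumption: only these positions count for the task

def loss(model_dists, positions):
    """Sum of per-position cross-entropies over the given positions."""
    return sum(cross_entropy(true_dists[t], model_dists[t]) for t in positions)

all_positions = range(len(true_dists))
perfect_model = true_dists                           # minimizes every per-position term
other_model = [[1 / 3, 1 / 3, 1 / 3]] * len(true_dists)  # some other model

# The model that minimizes the full (unsupervised) sum...
print(loss(perfect_model, all_positions) <= loss(other_model, all_positions))
# ...also minimizes the sum restricted to the answer positions.
print(loss(perfect_model, answer_positions) <= loss(other_model, answer_positions))
```

Each per-position term is smallest when the model matches the true distribution at that position, so a model that minimizes the whole sum also minimizes any partial sum. That is one way to read the sentence from the paper.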

Related

Understanding Time2Vec embedding for implementing this as a keras layer

The Time2Vec paper (the relevant theory is in section 4) shows an approach for including a time embedding among the features to improve model performance. I would like to give this a try. I found an implementation as a Keras layer, which I changed a little bit. Basically it creates two matrices for one feature:
(1) linear = w * x + b
(2) periodic = sin(w * x + b)
Currently I choose this feature manually. There are a few things in the paper I don't understand. The first is the term k, the number of sinusoids. The authors use up to 64 sinusoids. What does this mean? I have just 1 sinusoid at the moment, right? Secondly, I'm about to put every feature I have through the sine transformation; for my dataset that would make 6 periodic (sinusoid) features. The authors use only one linear term. How should I choose the feature for the linear term? Unfortunately the code from the paper is not available anymore. Has anyone worked with time embeddings, or even with this particular approach?
To my limited understanding, the linear transformation of time is a fixed element of the produced embedding, and the parameter k lets you select how many different learned periodic time representations you want to use in your model. So the resulting embedding has a size of k+1 elements.
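A minimal sketch of that reading, in plain Python rather than as a Keras layer: one scalar time value goes in, and a vector of k+1 values comes out (one linear term, then k sines, each with its own frequency and phase). The function name `time2vec`, the random weights and k = 6 are assumptions for illustration; in a real layer `omegas` and `phis` would be trainable weights.

```python
import math
import random

def time2vec(tau, omegas, phis):
    """Map a scalar time value to a vector with len(omegas) = k + 1 elements.

    Element 0 is the linear term w*tau + b; elements 1..k are sin(w*tau + b),
    each with its own frequency w and phase b.
    """
    linear = omegas[0] * tau + phis[0]
    periodic = [math.sin(w * tau + p) for w, p in zip(omegas[1:], phis[1:])]
    return [linear] + periodic

k = 6  # number of sinusoids; picked arbitrarily for this example
rng = random.Random(0)
# Random values stand in for what would be learned parameters.
omegas = [rng.uniform(-1, 1) for _ in range(k + 1)]
phis = [rng.uniform(-1, 1) for _ in range(k + 1)]

embedding = time2vec(3.5, omegas, phis)
print(len(embedding))  # k + 1 elements
```

Under this reading, "k sinusoids" means k different (frequency, phase) pairs applied to the same time input, not one sine per dataset feature.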

How is hashing implemented in SGNN (Self-Governing Neural Networks)?

So I've read the paper named Self-Governing Neural Networks for On-Device Short Text Classification which presents an embedding-free approach to projecting words into a neural representation. To quote them:
The key advantage of SGNNs over existing work is that they surmount the need for pre-trained word embeddings and complex networks with huge parameters. [...] our method is a truly embedding-free approach unlike majority of the widely-used state-of-the-art deep learning techniques in NLP
Basically, from what I understand, they proceed as follows:
1. You'd first need to compute n-grams (side question: is that skip-gram in the older n-gram sense, or skip-gram as in word2vec? I assume the former for what follows) on words' characters to obtain a featurized representation of the words in a text. As an example, with 4-grams you could get a 1M-dimensional sparse feature vector per word. Fortunately, the vector is sparse, so it doesn't need to be fully held in memory: it's almost one-hot (or count-vectorized, or tf-idf-vectorized n-grams with lots of zeros).
2. Then you'd need to hash those sparse n-gram vectors using locality-sensitive hashing (LSH). They seem to use random projection, from what I've understood. Also, instead of full n-gram vectors, they use tuples of an n-gram feature index and its value for each non-zero n-gram feature (which is also by definition a "sparse matrix", computed on the fly, for example from a default dictionary of non-zero features instead of a full vector).
3. I found an implementation of random projection in scikit-learn. From my tests, it doesn't seem to yield a binary output, although the whole thing uses sparse on-the-fly computations within scikit-learn's sparse matrices, as expected for a memory-efficient (dictionary-like, non-zero features only) implementation, I guess.
What doesn't work in all of this, and where my question lies, is how they could end up with binary features from the sparse projection (the hashing). They seem to be saying that the hashing is done at the same time as computing the features, which is confusing. I would have expected the hashing to come in the order I wrote above, as steps 1-2-3, but their steps 1 and 2 seem to be somehow merged.
My confusion arises mostly from the paragraphs starting with the phrase "On-the-fly Computation." at page 888 (PDF's page 2) of the paper in the right column.
I'd like to bring my school project to a successful conclusion (trying to mix BERT with SGNNs instead of using word embeddings). So, how would you demystify that? More precisely, how could a similar random hashing projection be achieved with scikit-learn, TensorFlow, or PyTorch? I've researched this a lot while trying to connect the dots, but the paper doesn't give the implementation details that I'd like to reproduce. I at least know that the SGNN uses 80 fourteen-dimensional LSH projections on character-level n-grams of words (is my understanding right in the first place?).
Thanks!
EDIT: after starting to code, I realized that the output of scikit-learn's SparseRandomProjection() looks like this:
[0.7278244729081154,
-0.7278244729081154,
0.0,
0.0,
0.7278244729081154,
0.0,
...
]
For now, this looks fine. It's close to binary, and it could be cast to an integer instead of a float by using the right ratio in the first place. I still wonder about the skip-gram thing; I assume n-grams of the characters of words for now, but that's probably wrong. Will post code to GitHub soon.
EDIT #2: I coded something here, but with n-grams instead of skip-grams: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer
More discussion threads on this here: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/issues?q=is%3Aissue
First of all, thanks for your implementation of the projection layer, it helped me get started with my own.
I read your discussion with #thinline72, and I agree with him that the features are calculated in the whole line of text, char by char, not word by word. I am not sure this difference in features is too relevant, though.
Answering your question: I interpret that they do steps 1 and 2 separately, as you suggested and did. Right, in the article excerpt that you include, they talk about hashing both in feature construction and projection, but I think those are 2 different hashes. And I interpret that the first hashing (feature construction) is automatically done by the CountVectorizer method.
Feel free to take a look at my implementation of the paper, where I built the end-to-end network and trained on the SwDA dataset, as split in the SGNN paper. I obtain a max of 71% accuracy, which is somewhat lower than the paper claims. I also used the binary hasher that #thinline72 recommended, and nltk's implementation of skipgrams (I am quite certain the SGNN paper is talking about "old" skipgrams, not "word2vec" skipgrams).
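For anyone trying to picture how feature construction and projection could be merged "on the fly", here is one possible sketch using only the standard library. It is an assumption about how such a scheme could work, not the paper's actual implementation: each entry of the projection matrix is derived from a hash of the (bit index, n-gram) pair, so neither the 1M-dimensional feature vector nor the matrix is ever stored, and the binary output is the sign of the dot product. The names `char_ngrams`, `hashed_sign` and `binary_projection`, the use of MD5, and plain trigrams instead of skip-grams are all choices made for illustration.

```python
import hashlib

def char_ngrams(text, n=3):
    """Character n-grams of a text (plain n-grams, not skip-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def hashed_sign(feature, bit_index):
    """Pseudo-random +1/-1 entry of an implicit projection matrix.

    The entry is derived from a hash of (bit_index, feature), so the matrix
    never has to be materialized. Assumed scheme, not taken from the paper.
    """
    digest = hashlib.md5(f"{bit_index}:{feature}".encode("utf-8")).digest()
    return 1 if digest[0] % 2 == 0 else -1

def binary_projection(text, n_bits=14, n=3):
    """Project sparse n-gram counts to n_bits binary values via sign(w . x)."""
    counts = {}
    for gram in char_ngrams(text, n):
        counts[gram] = counts.get(gram, 0) + 1
    bits = []
    for b in range(n_bits):
        # Dot product between the sparse count vector and row b of the matrix.
        dot = sum(hashed_sign(gram, b) * c for gram, c in counts.items())
        bits.append(1 if dot > 0 else 0)
    return bits

print(binary_projection("hello there"))
```

The output is deterministic for a given text and has exactly `n_bits` entries in {0, 1}. With scikit-learn, a comparable effect could presumably be had by thresholding the output of SparseRandomProjection at zero, at the cost of storing the projection matrix.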

Why do we reverse the input when feeding it to a seq2seq model in TensorFlow (tf.reverse(inputs, [-1]))?

Why do we reverse the input when feeding it to a seq2seq model in TensorFlow (tf.reverse(inputs, [-1]))?
training_predictions, test_predictions = seq2seq_model(
    tf.reverse(inputs, [-1]),  # reverse the inputs along the last axis
    targets,
    keep_prob,
    batch_size,
    seq_length,
    len(answerswords2int),
    len(questionswords2int),
    encoding_embedding_size,
    decoding_embedding_size,
    rnn_size,
    num_layers,
    questionswords2int)
To the best of my knowledge, reversing the input comes from the paper Sequence to Sequence Learning with Neural Networks.
The idea originated in machine translation (I'm not sure how it plays out in other domains, e.g. chatbots). Think of the following scenario (borrowed from the original paper). You want to translate,
A B C -> alpha beta gamma delta
In this setting, we have to go through the full source sequence (ABC) before starting to predict alpha, where the translator might have forgotten about A by then. But when you do this as,
C B A -> alpha beta gamma delta
You have a strong communication link from A to alpha, where A is "probably" related to alpha in the translation.
Note: This depends entirely on your translation task. If the target language is written in the reverse order of the source language (e.g. translating from a subject-verb-object language to an object-verb-subject language), I think it's better to keep the original order.
While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source sentences are reversed (the target sentences are not reversed). By doing so, the LSTM’s test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6.
While we do not have a complete explanation to this phenomenon, we believe that it is caused by the introduction of many short term dependencies to the dataset. Normally, when we concatenate a source sentence with a target sentence, each word in the source sentence is far from its corresponding word in the target sentence. As a result, the problem has a large “minimal time lag” [17]. By reversing the words in the source sentence, the average distance between corresponding words in the source and target language is unchanged. However, the first few words in the source language are now very close to the first few words in the target language, so the problem’s minimal time lag is greatly reduced. Thus, backpropagation has an easier time “establishing communication” between the source sentence and the target sentence, which in turn results in substantially improved overall performance.
Initially, we believed that reversing the input sentences would only lead to more confident predictions in the early parts of the target sentence and to less confident predictions in the later parts. However, LSTMs trained on reversed source sentences did much better on long sentences than LSTMs trained on the raw source sentences.
Paper: https://arxiv.org/abs/1409.3215
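The "average distance is unchanged, but the minimal time lag is reduced" point from the quoted passage can be checked with a small sketch. It assumes, purely for illustration, that source and target have the same length n and that source word i aligns with target word i; real translations are not this tidy. `reverse_sources` and `lags` are hypothetical helper names, and the list reversal stands in for what tf.reverse(inputs, [-1]) does on a [batch, time] tensor.

```python
def reverse_sources(batch):
    """Reverse each source sequence, like tf.reverse(inputs, [-1]) on [batch, time] data."""
    return [seq[::-1] for seq in batch]

def lags(source_order, n):
    """Distance from each source word to its aligned target word in the
    concatenated (source + target) sequence, assuming word i aligns with word i."""
    position = {word: pos for pos, word in enumerate(source_order)}
    # Target word i sits at position n + i in the concatenated sequence.
    return [(n + i) - position[i] for i in range(n)]

n = 4
original = list(range(n))                  # source words 0..3 in reading order
reversed_ = reverse_sources([original])[0]

print(lags(original, n))   # [4, 4, 4, 4]
print(lags(reversed_, n))  # [1, 3, 5, 7]
```

Both lists average to 4, but after reversal the first source word is only one step away from the first target word. One practical note: tf.reverse flips the whole axis, padding included, so with padded batches tf.reverse_sequence (which takes the true sequence lengths) may be the safer choice.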

Does LSTM cell update take into account the current input?

I have a question raised by studying LSTMs. At the following link I found a very useful explanation of the LSTM mechanism. Parts and equations from this blog post have been reproduced on several other web pages about LSTM (including Wikipedia). However, when reading the original LSTM paper, something doesn't match. My question is about the update of the cell state. In the blog it is defined by the equation for C_t, which takes into account both the previous output h_{t-1} and the current input x_t.
In the paper, equation (6) tells me that the cell state at time t, s_c(t), depends on the term g(net_c(t)), which is the analog of C~_t in the blog's equation. Equation (6) is the following (the term y^in is the input gate):
s_c(t) = s_c(t-1) + y^in(t) * g(net_c(t))
As the blog's equations show, C~_t depends on both the previous output h_{t-1} and the current input x_t. However, net_c(t) in the paper doesn't take into account the current input x_t.
Indeed, the definition of net_c(t) is the following (equation 4 in the paper):
net_c(t) = sum_u w_{cu} * y^u(t-1)
where y^u(t-1) is the output value of unit u at time t-1.
So my question is whether there is an error in the paper or in the blog, since the blog's version is the one I've usually found in courses, tutorials, and all practical material, including the TensorFlow implementation!
Note that the same question arises for the computation of the input gate i_t.
PS. The cited paper is the first one about LSTM. The forget gate was introduced by another paper; however, the equations mentioned here are the same in both papers.
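For reference, this is the blog-style formulation written out as a single-unit (scalar) sketch, to make explicit where x_t enters. The weights are hypothetical and chosen only for illustration, and the function name `lstm_step` and the `params` layout are mine; it follows the commonly taught equations, not the notation of the original paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a single-unit LSTM in the blog-style formulation.

    params maps each of "f", "i", "c", "o" to a tuple (w_h, w_x, b).
    """
    def pre(name):
        w_h, w_x, b = params[name]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))              # forget gate
    i_t = sigmoid(pre("i"))              # input gate
    c_tilde = math.tanh(pre("c"))        # candidate state: uses h_prev AND x_t
    c_t = f_t * c_prev + i_t * c_tilde   # cell state update
    o_t = sigmoid(pre("o"))              # output gate
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Hypothetical weights, for illustration only.
params = {name: (0.5, 1.0, 0.0) for name in ("f", "i", "c", "o")}
h_a, c_a = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, params=params)
h_b, c_b = lstm_step(x_t=-1.0, h_prev=0.0, c_prev=0.0, params=params)
print(c_a, c_b)  # the two cell states differ because x_t feeds the gates and the candidate
```

With the same previous state and different inputs, the new cell state differs, which is exactly the dependence on x_t that the question is asking about.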

What is the output of a machine learning algorithm?

I'm starting to study machine learning and have only basic knowledge of it. If I consider a generic machine learning algorithm M, I would like to know precisely what its inputs and outputs are. I'm not referring to an implementation in some programming language. I'm talking about the theory of machine learning.
Take the example of supervised learning. The input of M should be the collection of pairs related to the function f the algorithm must learn. So, it will build some function h which approximates f. The output of M should be h?
And what about unsupervised machine learning?
The output of ML algorithms is whatever you want it to be.
For example:
Regression: 1 value
Classification: n classes (each with the probability that the input is a member of that class)
Text summarization: One word, one character, a batch of them or the whole text summarized.
As you see, the output will be what you need it to be.
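To tie this back to the question's framing for the supervised case, here is a minimal sketch in which the learning algorithm M is a function that takes the collection of (x, y) pairs and returns a hypothesis h, and h in turn returns a single value (the regression case above). Ordinary least squares on one variable is used only because it is short; the name `fit_line` and the toy data are made up for illustration.

```python
def fit_line(pairs):
    """A learning algorithm M: takes (x, y) pairs and returns a hypothesis h."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x

    def h(x):
        """The learned approximation of f."""
        return slope * x + intercept

    return h

training_pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]  # samples of f(x) = 2x + 1
h = fit_line(training_pairs)
print(h(3.0))  # 7.0
```

So there are two "outputs" in play: M outputs the function h, and h outputs a prediction whose shape depends on the task (one value, class probabilities, text, and so on).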