How is an input translated to the input units of a NN - tensorflow

I am quite new to machine learning and neural nets. I've used the following model for sentiment analysis of short texts. I generally understand how signals are computed, all the way to the output layer. What I don't understand is how the inputs are produced. When the model classifies a word, how is that word translated into the 512 input units? What features of the word does the model assess, and how is that decided?
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(256, activation='sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

When the model classifies a word, how is that word translated to the 512 input units?
As you already noticed, before any kind of written information (single words, sentences or whole texts) can be processed by a neural network, it must be encoded into a vector representation. This is called an embedding or a representation, and finding suitable embeddings is a subfield of Natural Language Processing (NLP) research.
Over the years a number of different representations have been published. For single words there is e.g. Word2Vec, in which a neural network "learns" the embedding based on the semantic similarity of the words: words that appear in similar contexts should end up close to each other in the vector space.
The simplest embedding for a sentence is a bag-of-words embedding. This means we count how many distinct words occur in our corpus of sentences (say N) and transform each sentence into a vector of length N, where each index of the vector represents a word and the value at that index is the number of occurrences of that word in the sentence.
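For illustration, here is a minimal sketch of such a count-based encoding using the Keras Tokenizer. The sentences and vocabulary size are made up, but this is also the kind of preprocessing that could produce the (max_words,) input vectors expected by the model in the question (the question itself doesn't show its preprocessing):
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["the movie was great", "the movie was terrible, really terrible"]

max_words = 10                       # hypothetical vocabulary size
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(sentences)

# Each sentence becomes a vector of length max_words holding word counts.
x = tokenizer.texts_to_matrix(sentences, mode='count')
print(x.shape)                       # (2, 10)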
Of course there are many more sophisticated text embeddings.

There are multiple methods by which you can obtain the vector embedding of a word.
Count based methods: PMI, PPMI and SVD
Prediction based methods: CBOW and Skip-Gram
The count-based methods create a co-occurrence matrix of words of shape Vocabulary × Vocabulary, where each word is represented by some count of how often it co-occurs with other words within a neighborhood of K words.
The prediction-based models train on a corpus and create a vector embedding based on how close the contexts of two words are.
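As a toy illustration of the count-based approach, here is a sketch that builds a raw co-occurrence matrix with a window of K = 1; PMI/PPMI reweighting and an SVD factorization would then be applied on top of it (the corpus and variable names are made up):
import numpy as np

corpus = [["i", "like", "deep", "learning"],
          ["i", "like", "nlp"]]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

K = 1                                          # neighborhood size
co = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - K), min(len(sent), i + K + 1)):
            if j != i:
                co[idx[w], idx[sent[j]]] += 1

# The row co[idx["like"]] is the raw count-based vector for the word "like".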

Related

Sorting a list of arbitrary size using attention / transformers?

Seq2Seq neural network architectures can work with sequences of arbitrary size, either via iteration, as in RNNs, or via parallelism, as in Transformers or other Attention (Query/Key/Value) mechanisms. It is relatively easy to create a model that can be trained to find the maximum of a list. For instance, with LSTM this 77-parameter model does the trick well:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

model = Sequential()
model.add(Dense(1, input_shape=(None, 1), activation='relu'))
model.add(LSTM(2, return_sequences=True, activation='relu'))
model.add(LSTM(2, return_sequences=False, activation='relu'))
model.add(Dense(1, activation='gelu'))
and surely it is possible to do it with an even smaller RNN. For Attention, a 93-parameter model also does the job:
import tensorflow as tf

number = tf.keras.Input(shape=(None, 1))
tinput = tf.keras.layers.Dense(4)(number)
toutput = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=2)(tinput, tinput, tinput)
reduction = tf.keras.layers.Lambda(lambda x: tf.reduce_mean(x, axis=1))(toutput)
result = tf.keras.layers.Dense(1)(reduction)
model = tf.keras.Model(inputs=number, outputs=result)
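For reference, a rough sketch of how such a model might be fitted to regress the maximum of each sequence on synthetic data (the data shapes and hyperparameters below are illustrative, not taken from the question):
import numpy as np

rows, cols = 10000, 20
tiempo = np.random.uniform(1, 100, size=(rows, cols)).astype("float32")
x = tiempo[..., None]                       # (rows, cols, 1)
y = tiempo.max(axis=1, keepdims=True)       # target: the maximum of each sequence

model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=10, batch_size=64, validation_split=0.1)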
Now, while LSTMs obviously do not have mechanisms to see the entire series and then produce an exact quartile function, a median, or a sorting of the whole list, the situation is different with Attention. Could one in principle expect it to find the median of a dataset and, perhaps, even to produce the full ordered series?
How should it be done? Do I need a complete transformer, using the decoder to produce the series? Or could we just assign a "position" to each element as the output of an encoder?
A problem I find when experimenting with transformers here is that they seem to learn, on one side, to recognise the input sequence and, on the other, to produce a "translated" output sequence, so the output always differs from the input at some decimal level. It is noticeable when you scale the input sequence, say from
tiempo=np.random.uniform(1,10000,size=(rows,cols))
to
tiempo=np.random.uniform(1,100,size=(rows,cols))
as then it needs to relearn, while a purely decision-based network would work with both inputs.

What is the network structure inside a Tensorflow Embedding Layer?

The TensorFlow Embedding layer (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) is easy to use, and there are plenty of articles on "how to use" Embedding (https://machinelearningmastery.com/what-are-word-embeddings/, https://www.sciencedirect.com/topics/computer-science/embedding-method).
However, I want to know the actual implementation of the "Embedding Layer" in TensorFlow or PyTorch.
Is it a word2vec?
Is it a CBOW?
Is it a special Dense Layer?
Structure-wise, both the Dense layer and the Embedding layer are hidden layers with neurons in them. The difference is in the way they operate on the given inputs and the weight matrix.
A Dense layer performs operations on its weight matrix: it multiplies it with the inputs, adds biases, and applies an activation function. An Embedding layer, in contrast, uses the weight matrix as a look-up dictionary.
The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, it looks up these integers in an internal dictionary, and it returns the associated vectors. It’s effectively a dictionary lookup.
from keras.layers import Embedding
embedding_layer = Embedding(1000, 64)
Here 1000 is the number of words in the dictionary and 64 is the dimensionality of the vectors those words are mapped to. Intuitively, the Embedding layer, just like any other layer, will try to find a vector of 64 real numbers [n1, n2, ..., n64] for each word. This vector will represent the semantic meaning of that particular word, and it is learned during training via backpropagation, just like the weights of any other layer.
When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just as with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit. Once fully trained, the embedding space will show a lot of structure—a kind of structure specialized for the specific problem for which you’re training your model.
-- Deep Learning with Python by F. Chollet
Edit - How is backpropagation used to train the look-up matrix of the Embedding layer?
The Embedding layer is similar to a linear (Dense) layer without any activation function. Theoretically, the Embedding layer also performs a matrix multiplication, but it doesn't add any non-linearity via an activation function. So backpropagation through the Embedding layer is the same as through any linear layer. In practice, however, no actual matrix multiplication is done in the Embedding layer, because the inputs are effectively one-hot encoded, and multiplying a one-hot vector by the weight matrix is just a look-up of one row.
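To make that concrete, here is a small numerical sketch (not the internal TensorFlow implementation) showing that a one-hot matrix multiplication and a row look-up give the same vector:
import numpy as np

vocab_size, embed_dim = 5, 3
W = np.random.randn(vocab_size, embed_dim)   # the layer's weight matrix

token_id = 2
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

via_matmul = one_hot @ W       # "linear layer without an activation"
via_lookup = W[token_id]       # what the Embedding layer actually does
assert np.allclose(via_matmul, via_lookup)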

TF-Hub Elmo uses which word embedding to concatenate with characters in Highway layer

I understand that ELMo uses a CNN over characters for its character embeddings. However, I do not understand how the character embeddings are concatenated with word embeddings in the Highway network. In the ELMo paper most of the evaluations use GloVe word embeddings together with the CNN character embeddings, which makes sense since the word embeddings are mentioned there.
But for pre-trained models, like the one on TF-Hub, which word embeddings do we concatenate with the character embeddings in the Highway layer?
Please help me understand if you can.
The concatenation happens inside the https://tfhub.dev/google/elmo/3 model. When using the word_emb output, one can get the embedding for each token in the input. That embedding can be used for classification or other modeling tasks, similar to BERT/Transformer based models. The model also provides direct access to some of the hidden states of the LSTM through lstm_outputs1 and lstm_outputs2.
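As a rough sketch of pulling those outputs (assuming the TF1-style hub.Module API that this module exposes; the exact output shapes may differ):
import tensorflow.compat.v1 as tf
import tensorflow_hub as hub
tf.disable_eager_execution()

elmo = hub.Module("https://tfhub.dev/google/elmo/3", trainable=False)
outputs = elmo(["the quick brown fox"], signature="default", as_dict=True)

word_emb = outputs["word_emb"]        # character-CNN token representations
lstm_1 = outputs["lstm_outputs1"]     # hidden states of the first biLSTM layer
lstm_2 = outputs["lstm_outputs2"]     # hidden states of the second biLSTM layer
elmo_vec = outputs["elmo"]            # weighted sum of the three layers

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(word_emb).shape)   # e.g. (1, num_tokens, 512)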

How to setup a neural network architecture for binary classification

I am reading through the TensorFlow tutorials on neural networks and I came across the architecture part, which is a bit confusing. Can someone explain to me why the following settings were used in this code?
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
model.summary()
Vocab_size?
The value of 16 for the Embedding?
And the choice of units: I get the intuition behind the last Dense layer, because it is binary classification (1), but why 16 units in the second layer?
Are the 16 in the Embedding and the 16 units in the first Dense layer related? Should they be equal?
Could someone explain this paragraph too?
The first layer is an Embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).
source:
Classify movie reviews: binary classification
vocab_size: All words in your corpus (in this case IMDB) are sorted by frequency and the top 10,000 words are kept. The rest of the vocabulary is ignored. E.g. "This is really Fancyyyyyyy" would be converted into ==> [8 7 9]. As you may guess, the word Fancyyyyyyy is dropped because it is not among the top 10,000 words.
pad_sequences: Converts all sentences to the same length. In the training corpus the document lengths differ, so all of them are padded/truncated to seq_len = 256. After this step, your output is [batch_size * seq_len].
Embedding: Each word is converted to a vector with 16 dimensions. As a result, the output of this step is a tensor of size [batch_size * seq_len * embedding_dim].
GlobalAveragePooling1D: Converts your sequence of size [batch_size * seq_len * embedding_dim] into [batch_size * embedding_dim] by averaging over the sequence dimension.
unit: is the output size of the Dense (MLP) layer. It converts [batch_size * embedding_dim] into [batch_size * unit].
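To see these shapes for yourself, here is a small sketch with toy random data (the batch size and seq_len below are made up; the layers mirror the tutorial's model):
import numpy as np
from tensorflow import keras

vocab_size, seq_len, batch_size = 10000, 256, 32
x = np.random.randint(0, vocab_size, size=(batch_size, seq_len))

emb = keras.layers.Embedding(vocab_size, 16)(x)               # (32, 256, 16)
pooled = keras.layers.GlobalAveragePooling1D()(emb)           # (32, 16)
hidden = keras.layers.Dense(16, activation='relu')(pooled)    # (32, 16)
out = keras.layers.Dense(1, activation='sigmoid')(hidden)     # (32, 1)
print(emb.shape, pooled.shape, hidden.shape, out.shape)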
The first layer is vocab_size because each word is represented as an index into the vocabulary. For example, if the input word is 'word', which is the 500th word in the vocabulary, the input is a vector of length vocab_size with all zeros except a one at index 500. This is commonly referred to as a 'one hot' representation.
The embedding layer essentially takes this huge input vector and condenses it into a smaller vector (in this case, length 16) that encodes some of the information about the word. The specific embedding weights are learned during training just like those of any other neural network layer. I'd recommend reading up on word embeddings. The length of 16 is a bit arbitrary here but can be tuned. One could do away with this embedding layer, but then the model would have less expressive power (it would just be logistic regression, which is a linear model).
Then, as you said, the last layer simply predicts the class of the review based on these pooled embeddings.

Weights update in Tensorflow embedding layer with pretrained fasttext weights

I'm not sure if my understanding is correct but...
While training a seq2seq model, one of the reasons I want to initialize the embedding layers with a set of pre-trained fastText weights is to reduce the number of unknown words at test time (these unknown words are not in the training set). Since the pre-trained fastText model has a larger vocabulary, at test time an unknown word can be represented by a fastText out-of-vocabulary word vector, which is supposed to point in a similar direction as semantically similar words in the training set.
However, the initial fastText weights in the embedding layer will be updated through the training process (updating the weights gives better results). So I am wondering whether the updated embedding weights distort the semantic-similarity relationships between words and undermine the representation of the fastText out-of-vocabulary word vectors (and likewise the relationship between the updated embedding weights and the word vectors whose corresponding IDs never appeared in the training data)?
Would it be a better solution if the input IDs were first mapped to distributed vectors extracted from the pre-trained model (kept fixed during training), and these pre-trained word vectors were then fed through a lookup table into the embedding layers (whose weights are updated during training)?
Any suggestions will be appreciated!
You are correct about the problem: when using pre-trained vectors and fine-tuning them in your final model, the words that are infrequent or haven't appeared in your training set won't get any updates.
Now, you can usually test how much of an issue this is for your particular case. E.g. if you have a validation set, try fine-tuning and not fine-tuning the weights and see what the difference in model performance on the validation set is.
If you see a big difference in performance on the validation set when you are not fine-tuning, here are a few ways to handle this:
a) Add a linear transformation layer after the non-trainable embeddings. Fine-tuning embeddings in many cases amounts to an affine transformation of the space, so one can capture this in a separate layer that can also be applied at test time.
E.g., where A is the pre-trained embedding matrix:
embeds = tf.nn.embedding_lookup(A, tokens)
X = tf.get_variable("X", [embed_size, embed_size])
b = tf.get_variable("b", [embed_size])
embeds = tf.matmul(embeds, X) + b
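For what it's worth, a rough TF2/Keras equivalent of option (a): a frozen Embedding layer initialized from the pre-trained matrix, followed by a trainable Dense layer that learns the affine transformation (the matrix and token IDs below are stand-ins):
import numpy as np
import tensorflow as tf

A = np.random.randn(10000, 300).astype("float32")   # stand-in for the pre-trained matrix
tokens = tf.constant([[1, 5, 42]])                   # toy batch of token IDs

frozen = tf.keras.layers.Embedding(
    input_dim=A.shape[0], output_dim=A.shape[1],
    embeddings_initializer=tf.keras.initializers.Constant(A),
    trainable=False)
affine = tf.keras.layers.Dense(A.shape[1])           # the learned affine transformation

embeds = affine(frozen(tokens))                      # shape (1, 3, 300)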
b) Keep the pre-trained embeddings in a non-trainable embedding matrix A. Add a trainable embedding matrix B that has a smaller vocabulary of popular words from your training set and the same embedding size. Look words up both in A and B (if a word is outside B's vocab, use ID = 0, for example), concatenate the results, and use that as the input to your model. This way your model will learn to rely mostly on A and sometimes on B for the popular words in your training set.
fixed_embeds = tf.nn.embedding_lookup(A, tokens)             # frozen pre-trained vectors
B = tf.get_variable("B", [smaller_vocab_size, embed_size])   # trainable, popular words only
# Map tokens outside B's smaller vocabulary to ID 0
oov_tokens = tf.where(tf.less(tokens, smaller_vocab_size), tokens,
                      tf.zeros(tf.shape(tokens), dtype=tokens.dtype))
dyn_embeds = tf.nn.embedding_lookup(B, oov_tokens)
embeds = tf.concat([fixed_embeds, dyn_embeds], 1)            # concatenate both representations