document clustering using word2vec - k-means

I am doing document clustering using Word2vec (genism library)
The following steps that I am doing,
Cleaning and tokenizing data, let's say I have 50000 data
Generating vector representations of the documents using the word2vec model. Here, for each word, having a word embedding vector. (i.e. word2vec model size 300)
Then getting word embedding vector for each document by Mean of Word Embeddings (MOWE), that's mean by adding all embedding vector for each word in a document by dividing total word number in that document. (dimension of the embedding vector of entire corpus: 50000X300)
Next be measuring cosine similarity matrix on the entire dataset(50000X300), get cosine similarity matrix( dimension 50000X50000)
In the final step, I am sending this cosine similarity matrix to kmeans algorithm. Here, I am using scipy library for kmeans algorithm.
I have confusion in step 4.
My question is, do I need to calculate cosine similarity matrix for kmeans algorithm?
Or Instead of cosine similarity can I fed embedding vector for the entire dataset(50000X300) to kmeans algorithm?
Which one should I follow and why?

Related

What is the network structure inside a Tensorflow Embedding Layer?

Tensoflow Embedding Layer (https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) is easy to use,
and there are massive articles talking about
"how to use" Embedding (https://machinelearningmastery.com/what-are-word-embeddings/, https://www.sciencedirect.com/topics/computer-science/embedding-method)
.
However, I want to know the Implemention of the very "Embedding Layer" in Tensorflow or Pytorch.
Is it a word2vec?
Is it a Cbow?
Is it a special Dense Layer?
Structure wise, both Dense layer and Embedding layer are hidden layers with neurons in it. The difference is in the way they operate on the given inputs and weight matrix.
A Dense layer performs operations on the weight matrix given to it by multiplying inputs to it ,adding biases to it and applying activation function to it. Whereas Embedding layer uses the weight matrix as a look-up dictionary.
The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, it looks up these integers in an internal dictionary, and it returns the associated vectors. It’s effectively a dictionary lookup.
from keras.layers import Embedding
embedding_layer = Embedding(1000, 64)
Here 1000 means the number of words in the dictionary and 64 means the dimensions of those words. Intuitively, embedding layer just like any other layer will try to find vector (real numbers) of 64 dimensions [ n1, n2, ..., n64] for any word. This vector will represent the semantic meaning of that particular word. It will learn this vector while training using backpropagation just like any other layer.
When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just as with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit. Once fully trained, the embedding space will show a lot of structure—a kind of structure specialized for the specific problem for which you’re training your model.
-- Deep Learning with Python by F. Chollet
Edit - How "Backpropagation" is used to train the look-up matrix of the Embedding Layer ?
Embedding layer is similar to the linear layer without any activation function. Theoretically, Embedding layer also performs matrix multiplication but doesn't add any non-linearity to it by using any kind of activation function. So backpropagation in the Embedding layer is similar to as of any linear layer. But practically, we don't do any matrix multiplication in the embedding layer because the inputs are generally one hot encoded and the matrix multiplication of weights by a one-hot encoded vector is as easy as a look-up.

Size of input and output layers in Keras implementation of an RNN Language Model

As part of my thesis, I am trying to build a recurrent Neural Network Language Model.
From theory, I know that the input layer should be a one-hot vector layer with a number of neurons equal to the number of words of our Vocabulary, followed by an Embedding layer, which, in Keras, it apparently translates to a single Embedding layer in a Sequential model. I also know that the output layer should also be the size of our vocabulary so that each output value maps 1-1 to each vocabulary word.
However, in both the Keras documentation for the Embedding layer (https://keras.io/layers/embeddings/) and in this article (https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/#comment-533252), the vocabulary size is arbitrarily augmented by one for both the input and the output layers! Jason gives an explenation that this is due to the implementation of the Embedding layer in Keras but that doesn't explain why we would also use +1 neuron in the output layer. I am at the point of wanting to order the possible next words based on their probabilities and I have one probability too many that I do not know to which word to map it too.
Does anyone know what is the correct way of acheiving the desired result? Did Jason just forget to subtrack one from the output layer and the Embedding layer just needs a +1 for implementation reasons (I mean it's stated in the official API)?
Any help on the subject would be appreciated (why is Keras API documentation so laconic?).
Edit:
This post Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2? made me think that Jason does in fact have it wrong and that the size of the Vocabulary should not be incremented by one when our word indices are: 0, 1, ..., n-1.
However, when using Keras's Tokenizer our word indices are: 1, 2, ..., n. In this case, the correct approach is to:
Set mask_zero=True, to treat 0 differently, as there is never a
0 (integer) index input in the Embedding layer and keep the
vocabulary size the same as the number of vocabulary words (n)?
Set mask_zero=True but augment the vocabulary size by one?
Not set mask_zero=True and keep the vocabulary size the same as the
number of vocabulary words?
the reason why we add +1 leads to the possibility that we can encounter a chance to see an unseen word(out of our vocabulary) during testing or in production, it is common to consider a generic term for those UNKNOWN and that is why we add a OOV word in front which resembles all out of vocabulary words.
Check this issue on github which explains it in detail:
https://github.com/keras-team/keras/issues/3110#issuecomment-345153450

How to use fasttext vectors in a tensorflow embedding layer?

I just struggle to find out how I can use Fasttext wordvectors for OOV words in a keras/tensorflow embedding layer. There is nothing out there. Maybe someone has thought of that too and has some hints for me?
The way via word embedding look up works via indices like
tf.nn.embedding_lookup(word_embeddings, x)
And you could have an index for one OOV. But how can I assign a specific vector (from a different and custom source like fasttext) at runtime?
I imagine a function which can customly assign a vector to the UNK index for a OOV word.
Related to that:
Assign custom word vector to UNK token during prediction?
Using subword information in OOV token from fasttext in word embedding layer (keras/tensorflow)
You can do the embedding lookup \ computation outside of tensorflow, and use the embedded text as input to the model (so the input wont be a sequence of word indices, but a sequence of vectors)

Weights update in Tensorflow embedding layer with pretrained fasttext weights

I'm not sure if my understanding is correct but...
While training a seq2seq model, one of the purpose I want to initiated a set of pre-trained fasttext weights in the embedding layers is to decrease the unknown words in the test environment (these unknown words are not in training set). Since pre-trained fasttext model has larger vocabulary, during test environment, the unknown word can be represented by fasttext out-of-vocabulary word vectors, which supposed to have similar direction of the semantic similar words in the training set.
However, due to the fact that the initial fasttext weights in the embedding layers will be updated through the training process (updating weights generates better results). I am wondering if the updated embedding weights would distort the relationship of semantic similarity between words and undermine the representation of fasttext out-of-vocabulary word vectors? (and, between those updated embedding weights and word vectors in the initial embedding layers but their corresponding ID didn't appear in the training data)
If the input ID can be distributed represented vectors extracted from pre-trained model and, then, map these pre-trained word vectors (fixed weights while training) via a lookup table to the embedding layers (these weights will be updated while training), would it be a better solution?
Any suggestions will be appreciated!
You are correct about the problem: when using pre-trained vector and fine-tuning them in your final model, the words that are infrequent or hasn't appear in your training set won't get any updates.
Now, usually one can test how much of the issue for your particular case this is. E.g. if you have a validation set, try fine-tuning and not fine-tuning the weights and see what's the difference in model performance on validation set.
If you see a big difference in performance on validation set when you are not fine-tuning, here is a few ways to handle this:
a) Add a linear transformation layer after not-trainable embeddings. Fine-tuning embeddings in many cases does affine transformations to the space, so one can capture this in a separate layer that can be applied at test time.
E.g. A is pre-trained embedding matrix:
embeds = tf.nn.embedding_lookup(A, tokens)
X = tf.get_variable("X", [embed_size, embed_size])
b = tf.get_vairable("b", [embed_size])
embeds = tf.mul(embeds, X) + b
b) Keep pre-trained embeddings in the not-trainable embedding matrix A. Add trainable embedding matrix B, that has a smaller vocab of popular words in your training set and embedding size. Lookup words both in A and B (and if word is out of vocab use ID=0 for example), concat results and use it input to your model. This way you will teach your model to use mostly A and sometimes rely on B for popular words in your training set.
fixed_embeds = tf.nn.embedding_lookup(A, tokens)
B = tf.get_variable("B", [smaller_vocab_size, embed_size])
oov_tokens = tf.where(tf.less(tokens, smaller_vocab_size), tokens, tf.zeros(tf.shape(tokens), dtype=tokens.dtype))
dyn_embeds = tf.nn.embedding_lookup(B, oov_tokens)
embeds = tf.concat([fixed_embeds, dyn_embeds], 1)

Vector representation in multidimentional time-series prediction in Tensorflow

I have a large data set (~30 million data-points with 5 features) that I have reduced using K-means down to 200,000 clusters. The data is a time-series with ~150,000 time-steps. The data on which I would like to train the model is the presence of particular clusters at each time-step. The purpose of the predictive model is generate a generalized sequence similar to generating syntactically correct sentences from a model trained on word sequences. The easiest way to think about this data is that I'm trying to predict the pixels in the next video frame from pixels in the current video frame in order to generate a new sequence of frames that approximate the original sequence.
The raw and sparse representation at each time-step would be 200,000 binary values representing which clusters are present or not at that time step. Note, no more than 200 clusters may be present in any one time-step and thus this representation is extremely sparse.
What is the best representation to convert this sparse vector to a dense vector that would be more suitable to time-series prediction using Tensorflow?
I initially had in mind a RNN / LSTM trained on the vectors at each time-step, but due to the size of the training vector I'm now wondering if a convolution approach would be more suitable.
Note, I have not actually used tensorflow beyond some simple tutorials, but have have previously used OpenCV ML functions. Please consider me a novice in your responses.
Thank you.