How to train a reverse embedding, like vec2word? - tensorflow

how do you train a neural network to map from a vector representation, to one hot vectors? The example I'm interested in is where the vector representation is the output of a word2vec embedding, and I'd like to map onto the the individual words which were in the language used to train the embedding, so I guess this is vec2word?
In a bit more detail; if I understand correctly, a cluster of points in embedded space represents similar words. Thus if you sample from points in that cluster, and use it as the input to vec2word, the output should be a mapping to similar individual words?
I guess I could do something similar to an encoder-decoder, but does it have to be that complicated/use so many parameters?
There's this TensorFlow tutorial, how to train word2vec, but I can't find any help to do the reverse? I'm happy to do it using any deeplearning library, and it's OK to do it using sampling/probabilistic.
Thanks a lot for your help, Ajay.

One easiest thing that you can do is to use the nearest neighbor word. Given a query feature of an unknown word fq, and a reference feature set of known words R={fr}, then you can find out what is the nearest fr* for fq, and use the corresponding fr* word as fq's word.

Related

How to pass a list of numbers as a single feature to a neural network?

I am trying to cluster sentences by clustering the sentence embedding of them taken from fasttext model. Each sentence embedding has 300 dimensions, and I want to reduce them to 50 (say). I tried t-SNE, PCA, UMAP. I wanted to see how Auto Encoder works for my data.
Now passing those 300 numbers for each sentence as separate features to the NN would make sense or they should be passed as a single entity? If so, is there an way to pass a list as a feature to NN?
I tried passing the 300 numbers as individual features and with the output I tried clustering. Could get very few meaningful clusters rest were either noise or clusters with no similar sentences but being grouped (But with other techniques like UMAP I could get far more meaningful clusters in more number). Any leads would be helpful. Thanks in advance :)

How is hashing implemented in SGNN (Self-Governing Neural Networks)?

So I've read the paper named Self-Governing Neural Networks for On-Device Short Text Classification which presents an embedding-free approach to projecting words into a neural representation. To quote them:
The key advantage of SGNNs over existing work is that they surmount the need for pre-trained word embeddings and complex networks with huge parameters. [...] our method is a truly embedding-free approach unlike majority of the widely-used state-of-the-art deep learning techniques in NLP
Basically, from what I understand, they proceed as follow:
You'd first need to compute n-grams (side-question: is that skip-gram like old skip-gram, or new skip-gram like word2vec? I assume it's the first one for what remains) on words' characters to obtain a featurized representation of words in a text, so as an example, with 4-grams you could yield a 1M-dimensional sparse feature vector per word. Hopefully, it's sparse so memory needn't to be fully used for that because it's almost one-hot (or count-vectorized, or tf-idf vectorized ngrams with lots of zeros).
Then you'd need to hash those n-grams sparse vectors using Locality-sensitive hashing (LSH). They seem to use Random Projection from what I've understood. Also, instead of ngram-vectors, they instead use tuples of n-gram feature index and its value for non-zero n-gram feature (which is also by definition a "sparse matrix" computed on-the-fly such as from a Default Dictionary of non-zero features instead of a full vector).
I found an implementation of Random Projection in scikit-learn. From my tests, it doesn't seem to yield a binary output, although the whole thing is using sparse on-the-fly computations within scikit-learn's sparse matrices as expected for a memory-efficient (non-zero dictionnary-like features) implementation I guess.
What doesn't work in all of this, and where my question lies, is in how they could end up with binary features from the sparse projection (the hashing). They seem to be saying that the hashing is done at the same time of computing the features, which is confusing, I would have expected the hashing to come in the order I wrote above as in 1-2-3 steps, but their steps 1 and 2 seems to be somehow merged.
My confusion arises mostly from the paragraphs starting with the phrase "On-the-fly Computation." at page 888 (PDF's page 2) of the paper in the right column. Here is an image depicting the passage that confuses me:
I'd like to convey my school project to a success (trying to mix BERT with SGNNs instead of using word embeddings). So, how would you demystify that? More precisely, how could a similar random hashing projection be achieved with scikit-learn, or TensorFlow, or with PyTorch? Trying to connect the dots here, I've significantly researched but their paper doesn't give implementation details, which is what I'd like to reproduce. I at least know that the SGNN uses 80 fourten-dimensionnal LSHes on character-level n-grams of words (is my understanding right in the first place?).
Thanks!
EDIT: after starting to code, I realized that the output of scikit-learn's SparseRandomProjection() looks like this:
[0.7278244729081154,
-0.7278244729081154,
0.0,
0.0,
0.7278244729081154,
0.0,
...
]
For now, this looks fine, it's closer to binary but it would still be castable to an integer instead of a float by using the good ratio in the first place. I still wonder about the skip-gram thing, I assume n-gram of characters of words for now but it's probably wrong. Will post code soon to GitHub.
EDIT #2: I coded something here, but with n-grams instead of skip-grams: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer
More discussion threads on this here: https://github.com/guillaume-chevalier/SGNN-Self-Governing-Neural-Networks-Projection-Layer/issues?q=is%3Aissue
First of all, thanks for your implementation of the projection layer, it helped me get started with my own.
I read your discussion with #thinline72, and I agree with him that the features are calculated in the whole line of text, char by char, not word by word. I am not sure this difference in features is too relevant, though.
Answering your question: I interpret that they do steps 1 and 2 separately, as you suggested and did. Right, in the article excerpt that you include, they talk about hashing both in feature construction and projection, but I think those are 2 different hashes. And I interpret that the first hashing (feature construction) is automatically done by the CountVectorizer method.
Feel free to take a look at my implementation of the paper, where I built the end-to-end network and trained on the SwDA dataset, as split in the SGNN paper. I obtain a max of 71% accuracy, which is somewhat lower than the paper claims. I also used the binary hasher that #thinline72 recommended, and nltk's implementation of skipgrams (I am quite certain the SGNN paper is talking about "old" skipgrams, not "word2vec" skipgrams).

how to reduce the dimension of the document embedding?

Let us assume that I have a set of document embeddings. (D)
Each of document embedding is consisting of N number of word vectors where each of these pre-trained vector has 300 dimensions.
The corpus would be represented as [D,N,300].
My question is that, what would be the best way to reduce [D,N,300] to [D,1, 300]. How should I represent the document in a single vector instead of N vectors?
Thank you in advance.
I would say that what you are looking for is doc2vec. Using this you can convert the whole document into a one, 300-dimensional vector. You can use it like this:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(your_documents)]
model = Doc2Vec(documents, vector_size=300, window=2, min_count=1, workers=4)
This will train the model on your data and you will be able to represent each document with only one vector as you specified in the question.
You can run inferrance with:
vector = model.infer_vector(doc_words)
I hope this is helpful :)
It's fairly common and fairly (perhaps surprisingly) effective to simply average the word vectors.
Good question but all the answers will result in the some loss of information. The best way for you is to use a Bi-LSTM/GRU layer and provide your word embeddings as input to that layer. And take the output of last time step.
The output of last timestep will have all the contextual information of document both in forward and backward direction. And hence, this is the best way to get what you want as the model learns the representation.
Note that, the larger the document, the more loss of information.

Visualize Gensim's Phrases' vectors in 2D

I'm using the Phrases class and want to visualize the vectors in a 2D space. In order to do this with Word2Vec I've used T-SNE and it worked perfectly. When I'm trying to do the same with Phrases it doesn't make any sense (words appear next to irrelevant words).
Any suggestions on how to visualize the Phrases output?
As suggested/reported on the gensim mailing list, the key problem was that merely wrapping a corpus in Phrases results in an iterator that offers only one pass over the data. The Word2Vec model needs a corpus over which it can make multiple passes to do its vocabulary-discovery then multiple-passes of training. (If closely watching INFO-level logging, there should be indications that 'training' ended almost instantly in such a situation.)

how to find similar words for a certain word in tensorflow_word2vec like using model.most_similar in gensim?

I've using tensorflow to build word2vec model,reference here:https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/tutorials/word2vec/word2vec_basic.py#L118
my question is that, how can i find top n similar words for a certain word.I know in gensim, I can save and load word2vec model,and then use model.most_similar to find what I want.but how in tensorflow and even more is there any way to save model in tensorflow since i find what i get is only an embedding vector,is that right?
I think as long as you have computed the weight vector for each token, then you can manipulate all the tokens in the vector space. You can simply calculate the cosine similarity between each vector and then sort by score. For your reference, you can look at the source code of most_similar method implemented in gensim word2vec model. Hope this helps.