Inverse transform word count vector to original document - tensorflow

I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse transform these word count vectors to the original documents. CountVectorizer does have a function inverse_transform(X), but this only gives you back the unique non-zero tokens.
As far as I know CountVectorizer doesn't have any implementation of a mapping back to the original documents.
Anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a Tensorflow or any other module for this?

CountVectorizer is "lossy": for a document such as "This is the amazing string in amazing program", it only stores the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.) and loses the position information.
So by reversing it you can create a document that has the same words repeated the same number of times, but their original sequence in the document cannot be recovered.
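A quick sketch of that information loss (assuming a recent scikit-learn that provides get_feature_names_out; the example document is the one above):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is the amazing string in amazing program"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# The count matrix only records how often each vocabulary word occurs.
print(dict(zip(vectorizer.get_feature_names_out(), X.toarray()[0])))

# inverse_transform only recovers the unique non-zero tokens,
# not their order or their repetitions in the original text.
print(vectorizer.inverse_transform(X))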

Related

How to train data of different lengths in machine learning?

I am analyzing the text of some literary works and I want to look at the distance between certain words in the text. Specifically, I am looking for parallelism.
Since I can't know the specific number of tokens in a text, I can't simply put all the words of a text into the training data, because the input would not be uniform across all training samples.
For example, the text:
“I have a dream that my four little children will one day live in a nation where they will not be judged by the color of their skin but by the content of their character. I have a dream today."
Is not the same text length as
"My fellow Americans, ask not what your country can do for you, ask what you can do for your country."
So I could not make a column out of each word and then assign the distance in a row, because the lengths would be different.
How could I go about representing this in training data? I was under the assumption that training data had to be the same type and length.
To solve this problem you can use pad_sequences. The process is: first convert the textual data into numeric vectors using a technique such as TF-IDF or any other algorithm, then use the shape of the result to find the maximum length you have, and pass that maximum to pad_sequences. Here is how you use it:
from keras.preprocessing.sequence import pad_sequences
padded_data = pad_sequences(your_data, maxlen=your_max_length, padding='post', truncating='post')
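For a more concrete picture, here is a hypothetical end-to-end sketch (the Tokenizer step and all variable names are illustrative, not part of the original answer):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = [
    "I have a dream that my four little children will one day live in a nation",
    "My fellow Americans, ask not what your country can do for you",
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)   # lists of word indices, different lengths

max_len = max(len(s) for s in sequences)          # the "maximum length" mentioned above
padded = pad_sequences(sequences, maxlen=max_len, padding='post', truncating='post')
print(padded.shape)                               # (2, max_len): uniform training data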

how to find closeness between two keras pad_sequences?

I am writing a small proof of concept where I turn a catalog into a JSON that has a URL and a label that describes the web page. I read this JSON in Python, tokenize it, and create padded sequences with pad_sequences.
I then need to compare some free-flow texts to find which index of the padded sequences has the most words from the free-flow text.
I am generating padded sequences from that text too, but I'm not sure how I can compare the two sequences for closeness.
Please help.
You can use cosine similarity or euclidean distance to compare two vectors.
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/CosineSimilarity
https://www.tutorialexample.com/calculate-euclidean-distance-in-tensorflow-a-step-guide-tensorflow-tutorial/
For sequences, you can first map each one to an embedding of the same length and then compare those vectors.
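A minimal sketch of both measures with plain TensorFlow ops (assuming the two documents have already been embedded to equal-length vectors, e.g. by averaging word embeddings; the numbers are made up):

import tensorflow as tf

# Two document vectors of the same length.
a = tf.constant([0.1, 0.7, 0.2, 0.5])
b = tf.constant([0.2, 0.6, 0.1, 0.4])

# Cosine similarity: 1.0 means identical direction, 0.0 means orthogonal.
cos_sim = tf.reduce_sum(a * b) / (tf.norm(a) * tf.norm(b))

# Euclidean distance: smaller means closer.
euclid = tf.norm(a - b)

print(float(cos_sim), float(euclid))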

one_hot Vs Tokenizer for Word representation

I have seen in many blogs people using one_hot (from tf.keras.preprocessing.text.one_hot) to convert a string of words into an array of numbers that represent indices. This does not ensure uniqueness of the indices, whereas the Tokenizer class (tf.keras.preprocessing.text.Tokenizer) does.
Then why is one_hot preferred over Tokenizer?
Update: I learned that one_hot uses hashing to convert words into numbers, but I don't see its importance, since we can use the Tokenizer class to do the same thing with more accuracy.
Not sure what you mean by uniqueness; I expect it has to do with the sequential relationship between the words, which of course is lost with one-hot encoding. One-hot encoding is used when the number of words is limited: if, say, you have 10 words in the vocabulary, you create 10 new features, which is fine for most neural networks to process. If you have other features in your data set besides the word sequences, say numeric ordinal parameters, you can still create a single-input model.
However, if you have 10,000 words in the vocabulary, you would create 10,000 new features, which at best will take a lot of processing. So in the case of a large vocabulary it is best to use a "dense" encoding rather than the sparse encoding generated by one-hot encoding. You can use the results of the tokenizer encoding as input to a Keras Embedding layer, which will encode the words into an N-dimensional space, where N is a value you specify. If you have additional ordinal features, your model will then need multiple inputs to process the data; perhaps that is why some people prefer to one-hot encode the words.
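For reference, a hypothetical sketch of the dense route described above (Tokenizer indices feeding a Keras Embedding layer; the texts, vocabulary size, and layer sizes are made up):

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["the quick brown fox", "the lazy dog sleeps"]

# Tokenizer assigns each word a unique integer index (no hashing collisions).
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(texts)
padded = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=6)

# The Embedding layer maps each index into a dense 16-dimensional vector
# instead of a sparse 10,000-dimensional one-hot feature.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10000, output_dim=16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
print(model(padded).shape)  # (2, 1)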

Understanding what happens during a single epoch of DBOW

I am using Distributed Bag of Words (DBOW) and I'm curious what happens during a single epoch. Does DBOW cycle through all documents (a batch) or through a subset of documents (a mini-batch)? In addition, for a given document, DBOW will randomly sample a word from a text window and learn the weights to associate that target word with the surrounding words in the window. Does this mean that DBOW may not go through all the text in a document?
I've gone through the GENSIM (https://github.com/RaRe-Technologies/gensim) code to identify if there is a parameter for batch, but no luck.
One epoch of PV-DBOW training in gensim Doc2Vec will iterate through all the texts, and then for each text iterate through all their words, attempting to predict each word in turn, and then back-propagating corrections for that predicted word immediately. That is, there's no "mini-batching" at all: each target word is an individual prediction/back-propagation.
(There is a sort-of-batching in how groups-of-texts are sent to worker threads, which can change the ordering somewhat, but each individual training-example presented to the neural-network is corrected individually, so no SGD-mini-batching is occurring.)
The words of each text are considered in order, and only skipped if (a) the word appeared fewer than min_count times; (b) the word is very-frequent and is chosen for random dropping via the value of the sample parameter. So you can generally think of the training as including all significant words of every document.
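For context, a minimal gensim sketch of PV-DBOW training (assuming gensim 4.x, where the trained document vectors live under model.dv; the corpus and parameter values here are made up; dm=0 selects DBOW):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["this", "is", "the", "first", "document"], tags=[0]),
    TaggedDocument(words=["and", "this", "is", "the", "second", "one"], tags=[1]),
]

# dm=0 selects PV-DBOW; min_count and sample control which words are skipped,
# as described above. Each epoch iterates over every text in `corpus`.
model = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, sample=1e-3, epochs=20, workers=4)
print(model.dv[0][:5])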

how to reduce the dimension of the document embedding?

Let us assume that I have a set of document embeddings. (D)
Each document embedding consists of N word vectors, where each of these pre-trained vectors has 300 dimensions.
The corpus would be represented as [D,N,300].
My question is: what would be the best way to reduce [D, N, 300] to [D, 1, 300]? How should I represent each document as a single vector instead of N vectors?
Thank you in advance.
I would say that what you are looking for is doc2vec. Using this you can convert the whole document into a single 300-dimensional vector. You can use it like this:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# your_documents should be an iterable of token lists (one list of words per document)
documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(your_documents)]
model = Doc2Vec(documents, vector_size=300, window=2, min_count=1, workers=4)
This will train the model on your data and you will be able to represent each document with only one vector as you specified in the question.
You can run inference with:
vector = model.infer_vector(doc_words)
I hope this is helpful :)
It's fairly common and fairly (perhaps surprisingly) effective to simply average the word vectors.
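A hypothetical numpy sketch of that averaging, assuming the corpus is already a [D, N, 300] array (the names and sizes are illustrative):

import numpy as np

D, N = 4, 50
corpus = np.random.rand(D, N, 300)          # stand-in for your pre-trained word vectors

# Average the N word vectors of each document; keepdims preserves the middle axis.
doc_vectors = corpus.mean(axis=1, keepdims=True)
print(doc_vectors.shape)                    # (4, 1, 300), i.e. [D, 1, 300]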
Good question, but all the answers will result in some loss of information. The best way for you is to use a Bi-LSTM/GRU layer, provide your word embeddings as input to that layer, and take the output of the last time step.
The output of the last time step carries the contextual information of the document in both the forward and backward directions, so this is the best way to get what you want, as the model learns the representation.
Note that the larger the document, the greater the loss of information.
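A hypothetical Keras sketch of that approach (the hidden size of 150 is made up and chosen so the concatenated forward/backward states give a 300-dimensional document vector; with the default return_sequences=False the layer already returns only the last time step):

import tensorflow as tf

N = 50  # words per document, each with a 300-dimensional pre-trained embedding

inputs = tf.keras.Input(shape=(N, 300))
# Bidirectional LSTM: by default only the last time step is returned,
# concatenating the forward and backward states into one vector per document.
doc_vector = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(150))(inputs)
model = tf.keras.Model(inputs, doc_vector)

print(model.output_shape)  # (None, 300): one vector per document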