Question
Is word frequency available from the TextVectorization instance? The TextVectorization documentation says it determines the frequency of each string, but there appears to be no method to get how many times a word appeared in the corpus it was adapted to.
it will analyze the dataset, determine the frequency of individual string values, and create a 'vocabulary' from them.
If the TextVectorization instance has already analysed the corpus and holds the word frequencies, we could reuse them for statistics, e.g. for negative sampling. The counts can be computed manually, as below, but it would be preferable to use information that is already available, especially when the corpus is large.
from collections import Counter

corpus = "..."
words = corpus.split()
total = len(words)
# Relative frequency of each word in the corpus
probabilities = {word: count / total for word, count in Counter(words).items()}
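For comparison, a minimal sketch (assuming TensorFlow 2.x) of what an adapted TextVectorization layer does expose: get_vocabulary() returns the tokens ordered by descending frequency, but not the counts themselves, which is why a manual count like the one above is still needed.

import tensorflow as tf

corpus = tf.constant(["the cat sat", "the cat ran", "the dog ran"])
vectorizer = tf.keras.layers.TextVectorization()
vectorizer.adapt(corpus)

# Tokens ordered by descending frequency; the first two entries are reserved
# for padding ('') and out-of-vocabulary ('[UNK]'). No counts are returned.
print(vectorizer.get_vocabulary())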
Related
I'm using tf.keras.preprocessing.text.Tokenizer to build a vocabulary for my corpus (5 million documents). The tokenizer finds 145K tokens. The issue is that the embedding layer has far too many parameters.
What's a simple way to force the tokenizer to only consider the top N most common words? Then anything outside of that would get an OOV token.
Solution
As indicated by @MarcoCerliani in the comments, you can simply change the num_words parameter on the tokenizer. That won't change the tokenizer's internal data; it will instead return the OOV token for any word outside the num_words most common words.
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(huge_corpus)
len(tokenizer.word_counts)
>>> (some huge vocabulary size)
# Change vocabulary size
tokenizer.num_words = 100
tokenizer.texts_to_sequences(texts)
>>> (any word in `texts` not within the 100 most common words
>>> will get `1`, the out-of-vocabulary index)
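Equivalently, num_words can be passed at construction time (a small sketch, reusing the hypothetical huge_corpus and texts from above); fit_on_texts still records the full word_counts, and only the conversion methods respect the limit:

# Index 0 is reserved and the OOV token takes index 1, so slightly fewer
# than 100 real words survive the cut.
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(huge_corpus)
sequences = tokenizer.texts_to_sequences(texts)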
I have a term-document matrix (X) of shape (6, 25931). The first 5 documents are my source documents and the last document is my target document. Each column holds the counts of a different word from the vocabulary. I want to get the cosine similarity of the last document with each of the other documents.
But since SVD produces an S of size (min(6, 25931),), if I use S to reduce my X, I get a 6 x 6 matrix. In this case, I feel that I am losing too much information, since I am reducing a vector of size (25931,) to one of size (6,).
And when you think about it, the number of documents is usually much smaller than the number of vocabulary words. In that case, using SVD to reduce dimensionality will always produce vectors of size (number of documents,).
According to everything that I have read, when SVD is used like this on a term-document matrix, it's called LSA.
Am I implementing LSA correctly?
If this is correct, then is there any other way to reduce the dimensionality and get denser vectors where the size of the compressed vector is greater than (6,)?
P.S.: I also tried using fit_transform from sklearn.decomposition.TruncatedSVD, which expects the input to be of shape (n_samples, n_features), which is why the shape of my term-document matrix is (6, 25931) and not (25931, 6). I kept getting a (6, 6) matrix, which initially confused me, but it makes sense once I remembered the math behind SVD.
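For reference, a quick shape check with synthetic data of the same dimensions (the only assumption is TruncatedSVD with its default randomized solver) confirms why the reduced matrix comes out as 6 x 6:

import numpy as np
from sklearn.decomposition import TruncatedSVD

X = np.random.rand(6, 25931)  # 6 documents, 25931 vocabulary terms

svd = TruncatedSVD(n_components=6)
X_reduced = svd.fit_transform(X)
print(X_reduced.shape)  # (6, 6): a rank-6 matrix cannot yield more informative components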
If the objective of the exercise is to find the cosine similarity, then the following approach can help. It only attempts to address that objective, not to comment on the definitions of Latent Semantic Analysis or Singular Value Decomposition raised in the question.
Let us first import all the required libraries. Please install them if they are not already available on your machine.
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
Let us generate some sample data for this exercise.
df = {'sentence': ['one two three','two three four','four five','six seven eight nine ten']}
df = pd.DataFrame(df, columns = ['sentence'])
The first step is to get the exhaustive list of all possible features, so collate all of the content in one place.
all_content = [' '.join(df['sentence'])]
Let us now build a vectorizer and fit it. Note that the arguments to the vectorizer are not explained here, as the focus is on solving the problem.
vectorizer = TfidfVectorizer(encoding = 'latin-1',norm = 'l2', min_df = 0.03, ngram_range = (1,2), max_features = 5000)
vectorizer.fit(all_content)
We can inspect the vocabulary to see if it makes sense. If needed, one could add stop words to the vectorizer above and check that they are indeed suppressed.
print(vectorizer.vocabulary_)
Let us vectorize two of the sentences so that we can compute their cosine similarity.
s1Tokens = vectorizer.transform(df.iloc[1,])
s2Tokens = vectorizer.transform(df.iloc[2,])
Finally, the cosine similarity can be computed as follows.
cosine_similarity(s1Tokens , s2Tokens)
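As a usage note (a sketch assuming, as in the original question, that the last row of the data plays the role of the target document), the same vectorizer can score the target against all source documents in a single call:

# Vectorize every sentence; the final row is treated as the target document.
all_vectors = vectorizer.transform(df['sentence'])

# One cosine similarity score per source document.
print(cosine_similarity(all_vectors[-1], all_vectors[:-1]))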
This term came up a few times in the Tensorflow Dev Summit, and it shows up in the Tensorflow Extended documentation, but without any sort of definition. After a fair amount of googling, I don't see reference to it in any Statistics-related setting. Searching the Tensorflow repositories produces a few hits, but they're similarly unhelpful. The term does seem to be used in Chemistry, Psychology, and Linguistics, but those definitions appear to be unrelated.
Per the 2017 TFX paper http://stevenwhang.com/tfx_paper.pdf, TFX can calculate a number of stats on a dataset, including:
"The expected valency of the feature in each example, i.e., minimum
and maximum number of values."
We can also look at the code that calculates valency in TFX. From what I can tell, it's designed to run on a feature that is an array, and counts the minimum and maximum number of values within that array for that feature:
# Extract the valency information of the feature.
valency = ''
if feature.HasField('value_count'):
  if (feature.value_count.min == feature.value_count.max and
      feature.value_count.min == 1):
    valency = 'single'
  else:
    min_value_count = ('[%d' % feature.value_count.min
                       if feature.value_count.HasField('min') else '[0')
    max_value_count = ('%d]' % feature.value_count.max
                       if feature.value_count.HasField('max') else 'inf)')
    valency = min_value_count + ',' + max_value_count
from: https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/utils/display_util.py#L68
As discussed in this blog,
Valency indicates the number of values required per training example.
In the case of categorical features, single indicates that each
training example must have exactly one category for the feature.
More broadly speaking, it applies to features with multiple values per example (in my opinion not that common in machine learning), e.g. lists and arrays. In that case, valency refers to the minimum and maximum number of values in these data types. For lists, one can compute the valency by applying np.min()/np.max() to the list lengths across all available examples of the feature, as sketched below.
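For example, a minimal sketch (NumPy only, with made-up feature values) of computing the valency of a multi-valued feature from the list lengths:

import numpy as np

# Hypothetical multi-valued feature: each training example carries a list of tags.
examples = [["a", "b"], ["c"], ["d", "e", "f"]]

lengths = [len(values) for values in examples]
print(np.min(lengths), np.max(lengths))  # valency range: 1 to 3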
After experimenting with both numerical and categorical features, it turns out that a value (e.g. "single") only appears in the "Valency" column when the value in the corresponding "Presence" column is "optional" (tfdv 1.6.0).
I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse transform the word count vectors back into the original documents. CountVectorizer does have an inverse_transform(X) method, but it only gives you back the unique non-zero tokens.
As far as I know CountVectorizer doesn't have any implementation of a mapping back to the original documents.
Anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a Tensorflow or any other module for this?
CountVectorizer is "lossy", i.e. for a document such as:
This is the amazing string in amazing program
it only stores the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.) and loses the position information.
So by reversing it, you can create a document with the same words repeated the same number of times, but their original sequence in the document cannot be recovered.
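A minimal sketch (using the example document above) of what inverse_transform does and does not recover:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["This is the amazing string in amazing program"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# inverse_transform returns only the unique non-zero tokens per document;
# the counts live in X, and the original word order is gone entirely.
print(vectorizer.inverse_transform(X))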
Reading the TensorFlow word2vec model output, how can I output the words related to a specific word?
Reading the source (https://github.com/tensorflow/tensorflow/blob/r0.11/tensorflow/examples/tutorials/word2vec/word2vec_basic.py) shows how the image is plotted.
But is there a data structure (e.g. a dictionary) created as part of training the model that gives access to the n words closest to a given word?
For example, if word2vec generated this image:
image src: https://www.tensorflow.org/versions/r0.11/tutorials/word2vec/index.html
In this image the words 'to', 'he', 'it' are contained in the same cluster. Is there a function which takes 'to' as input and outputs 'he', 'it' (in this case n=2)?
This approach applies to word2vec in general. If you can save the word2vec vectors to a text/binary file, like the Google/GloVe word vectors, then all you need is gensim.
To install:
Via github
Python code:
from gensim.models import KeyedVectors

# Load vectors saved in the word2vec text/binary format
# (Word2Vec.load_word2vec_format in older gensim versions).
word_vectors = KeyedVectors.load_word2vec_format(fname)

# The ten most similar words to 'good', as (word, cosine similarity) pairs
for word, similarity in word_vectors.most_similar('good', topn=10):
    print(word, similarity)
However, this searches over all the words in the vocabulary to produce the results. Approximate nearest neighbour (ANN) methods give you results faster, with a trade-off in accuracy.
In recent gensim versions, Annoy can be used to perform the ANN search; see the gensim notebooks for more information.
FLANN is another library for approximate nearest neighbours.
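To make the Annoy route concrete, here is a minimal sketch, assuming gensim 4.x (older versions expose AnnoyIndexer under gensim.similarities.index) and the separately installed annoy package, reusing word_vectors from the snippet above:

from gensim.similarities.annoy import AnnoyIndexer

# Build the approximate index once; more trees means better recall but a slower build.
indexer = AnnoyIndexer(word_vectors, num_trees=100)

# Same query as before, answered by the Annoy index instead of an exhaustive search.
print(word_vectors.most_similar('good', topn=10, indexer=indexer))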
I will assume that you don't want to use gensim, and would prefer to stick with tensorflow. In that case, I'll offer two options
Option 1 - Tensorboard:
If you are just trying to do this from an exploratory standpoint, I would suggest using TensorBoard's embedding visualizer to search for the closest embeddings. It provides a nice interface, and you can use both cosine and Euclidean distances with a set number of neighbors.
Link to Tensorflow documentation
Option 2 - Direct Calculation
Within the word2vec_basic.py file, there is an example of how they are calculating closest words, and you could go ahead and use that if you mess with the function a little bit. The following is found in the graph itself:
# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
    normalized_embeddings, valid_dataset)
similarity = tf.matmul(
    valid_embeddings, normalized_embeddings, transpose_b=True)
Then, during training (every 10000 steps) they run this next bit of code while the session is active. Calling similarity.eval() evaluates the similarity tensor in the graph into a literal NumPy array.
# Note that this is expensive (~20% slowdown if computed every 500 steps)
if step % 10000 == 0:
  sim = similarity.eval()
  for i in xrange(valid_size):
    valid_word = reverse_dictionary[valid_examples[i]]
    top_k = 8  # number of nearest neighbors
    nearest = (-sim[i, :]).argsort()[1:top_k + 1]
    log_str = "Nearest to %s:" % valid_word
    for k in xrange(top_k):
      close_word = reverse_dictionary[nearest[k]]
      log_str = "%s %s," % (log_str, close_word)
    print(log_str)
If you want to adapt this for yourself, you will have to do some finessing, changing reverse_dictionary[valid_examples[i]] to the index (or indices) of the word(s) you want the k closest words for.
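For example, a minimal sketch of that adaptation, assuming final_embeddings, dictionary (word to index) and reverse_dictionary (index to word) from word2vec_basic.py are available after training:

import numpy as np

def nearest_words(word, k=2):
    # Normalize rows (a no-op if final_embeddings is already normalized).
    normalized = final_embeddings / np.linalg.norm(final_embeddings, axis=1, keepdims=True)
    sims = normalized @ normalized[dictionary[word]]
    # Skip the first hit, which is the query word itself.
    nearest = (-sims).argsort()[1:k + 1]
    return [reverse_dictionary[i] for i in nearest]

print(nearest_words('to', k=2))  # e.g. words clustered near 'to', such as 'he', 'it'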
Get gensim and use the similar_by_word method on a gensim.models.Word2Vec model.
similar_by_word takes 3 parameters:
word - the input word
topn - the number of most similar words to return (optional, default=10)
restrict_vocab (optional, default=None)
Example
import gensim, nltk

class FileToSent(object):
    """A class to load a text file efficiently, one sentence per line."""
    def __init__(self, filename):
        self.filename = filename
        # To remove stop words (optional)
        self.stop = set(nltk.corpus.stopwords.words('english'))

    def __iter__(self):
        for line in open(self.filename, 'r', encoding='utf-8'):
            yield [w for w in line.lower().split() if w not in self.stop]
Then depending on your input sentences (sentence_file.txt),
sentences = FileToSent('sentence_file.txt')
model = gensim.models.Word2Vec(sentences=sentences, min_count=2, hs=1)
print(model.similar_by_word('hack', 2))  # Get the two most similar words to 'hack'
# [(u'debug', 0.967338502407074), (u'patch', 0.952264130115509)] (output specific to my dataset)