Word2vec classification and clustering in TensorFlow

I am trying to cluster some sentences by similarity (maybe cosine similarity) and then possibly use a classifier to assign text to predefined classes.
My idea is to use TensorFlow to generate word embeddings, average them for each sentence, and then feed the averages to a clustering/classification algorithm.
Does TensorFlow provide a ready-to-use word2vec implementation?
Would a bag-of-words model produce good output?

No, TensorFlow does not provide a ready-to-use word2vec implementation, but it does have a tutorial on word2vec.
Yes, a bag of words can produce surprisingly good output (though not state-of-the-art), and it has the benefit of being much faster. I have a small amount of data (tens of thousands of sentences) and have achieved F1 scores above 0.90 for classification.
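For illustration, here is a minimal sketch of the averaging idea from the question: average pre-trained word vectors per sentence and train a simple classifier on the result. The tiny embedding dictionary, sentences, and classifier choice are toy stand-ins, not something from the answer:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy stand-in for learned word2vec vectors (normally loaded after training).
    dim = 4
    embeddings = {"good": np.ones(dim), "bad": -np.ones(dim), "movie": np.zeros(dim)}

    def sentence_vector(sentence):
        # Average the vectors of the words we know; zero vector if none are known.
        vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    sentences = ["good movie", "bad movie", "good good", "bad bad"]
    labels = [1, 0, 1, 0]
    X = np.vstack([sentence_vector(s) for s in sentences])
    clf = LogisticRegression().fit(X, labels)
    print(clf.predict([sentence_vector("a good one")]))  # -> [1]

The same sentence vectors could be fed to a clustering algorithm (e.g. KMeans) instead of a classifier.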

Related

How to implement feature importance on nominal categorical features in tree-based classifiers?

I am using the scikit-learn XGBoost model for my binary classification problem. My data contains nominal categorical features (such as race), which need to be one-hot encoded before being fed to tree-based models.
On the other hand, the feature_importances_ attribute of XGBoost gives the importance of each column in the trained model. So if I do the encoding and then read the feature importances, the result includes names like race_2, each with its own importance score.
What should I do to solve this problem and get a single score for each nominal feature? Can I take the average of the importance scores of the one-hot encoded columns that belong to one feature (like race_1, race_2 and race_3)?
First of all, if your goal is to select the most useful features for later training, I would advise you to use regularization in your model. In the case of XGBoost, you can tune the parameter gamma so that the model relies more on the genuinely useful features (i.e. tune the minimum loss reduction required for the model to add a partition leaf). Here is a good article on implementing regularization in XGBoost models.
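For illustration, a minimal sketch of setting that parameter (the dataset is synthetic and the values are arbitrary, not a recommendation):

    from sklearn.datasets import make_classification
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=300, n_features=8, random_state=0)

    # gamma (a.k.a. min_split_loss): minimum loss reduction required to add a
    # partition leaf; raising it makes the trees more conservative (regularization).
    model = XGBClassifier(gamma=1.0, n_estimators=100, max_depth=4)
    model.fit(X, y)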
On the other hand, if you insist on computing feature importance, I would say grouping the encoded variables and simply adding their scores is not a good decision. This would produce feature-importance results that do not account for the relationship between these dummy variables.
My suggestion would be to look at permutation importance for this. The basic idea: take your validation data, shuffle the values of the column whose importance you want to measure, evaluate the already-trained model on the shuffled data, and record how much the score drops. Repeat this for each column; the size of the drop in model performance is a sign of that feature's importance.
It is actually easier done than said: sklearn has this built in for you; check out the example provided here.
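A minimal sketch using scikit-learn's built-in helper (the dataset and model here are toy placeholders, not from the question):

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=6, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

    # Shuffle one column at a time on held-out data and measure the score drop.
    result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
    for i in np.argsort(result.importances_mean)[::-1]:
        print(f"feature {i}: {result.importances_mean[i]:.3f} "
              f"+/- {result.importances_std[i]:.3f}")

For a one-hot encoded nominal feature, a common variant is to permute all of its dummy columns together (e.g. race_1, race_2 and race_3 as one group) so the category receives a single score.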

How can I evaluate FaceNet embeddings for face verification on LFW?

I am trying to create a script that evaluates a model on the LFW dataset. The pipeline is: read a pair of images (using the LFW annotation list), track and crop the face, align it, pass it through a pre-trained FaceNet model (a .pb frozen graph in TensorFlow) and extract the features. The feature vector size is (1, 128) and the input image is (160, 160).
To evaluate on the verification task, I am using a Siamese architecture. That is, I am passing a pair of images (same or different person) through two identical models ([2 x FaceNet]; this is equivalent to passing a batch of two images through a single network) and calculating the Euclidean distance between the embeddings. Finally, I am training a linear SVM classifier, using the pair labels, to output 0 when the embedding distance is small and 1 otherwise. This way I am trying to learn a threshold to use at test time.
Using this architecture I am getting a score of at most 60%. On the other hand, using the same setup with other models (e.g. VGG-Face), where the features are 4096-dimensional [fc7:0] (not embeddings), I get 90%. I definitely cannot replicate the scores reported online (99.x%), but with the embeddings the score is very low. Is there something wrong with the pipeline in general? How can I evaluate the embeddings for verification?
Never mind, the approach is correct; the FaceNet model that is available online is poorly trained, and that is the reason for the poor score. Since this model is trained on a different dataset than the one described in the paper, the verification score will be lower than expected. However, if you set a constant threshold to the desired value, you can probably increase true positives at the cost of F1 score.
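A rough sketch of that constant-threshold step, assuming you already have the 128-d embeddings for each image pair and 0/1 pair labels (the arrays below are random stand-ins just to make the snippet runnable):

    import numpy as np

    # emb_a, emb_b: (n_pairs, 128) embeddings; labels: 1 = same person, 0 = different
    rng = np.random.default_rng(0)
    emb_a, emb_b = rng.normal(size=(100, 128)), rng.normal(size=(100, 128))
    labels = rng.integers(0, 2, size=100)

    dist = np.linalg.norm(emb_a - emb_b, axis=1)  # Euclidean distance per pair

    # Pick the constant threshold that maximizes verification accuracy.
    thresholds = np.linspace(dist.min(), dist.max(), 200)
    acc = [np.mean((dist < t) == (labels == 1)) for t in thresholds]
    best_t = thresholds[int(np.argmax(acc))]
    print(f"best threshold {best_t:.3f}, accuracy {max(acc):.3f}")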
You can also use a similarity search engine: approximate kNN search libraries such as Faiss or Nmslib, cloud-ready open-source similarity search tools such as Milvus, or a production-ready managed service such as Pinecone.io.
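For illustration, a minimal Faiss sketch for exact L2 nearest-neighbour search over face embeddings (the vectors here are random stand-ins; in real use they would be the 128-d FaceNet embeddings):

    import numpy as np
    import faiss  # pip install faiss-cpu

    d = 128  # embedding dimension
    gallery = np.random.rand(1000, d).astype("float32")  # enrolled face embeddings
    queries = np.random.rand(5, d).astype("float32")     # embeddings to verify

    index = faiss.IndexFlatL2(d)        # exact (brute-force) L2 index
    index.add(gallery)                  # add the gallery vectors
    D, I = index.search(queries, 1)     # nearest neighbour and its squared L2 distance
    print(I.ravel(), D.ravel())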

Tensorflow embeddings

I know what embeddings are and how they are trained. While going through TensorFlow's documentation, I came across two different articles, and I wish to know exactly what the difference between them is.
link 1: Tensorflow | Vector Representations of words
In the first tutorial, they have explicitly trained embeddings on a specific dataset. There is a distinct session run to train those embeddings. I can then later on save the learnt embeddings as a numpy object and use the
tf.nn.embedding_lookup() function while training an LSTM network.
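For illustration, a minimal TF 1.x sketch of that first approach: load a previously saved numpy embedding matrix into the graph as a frozen variable and look up word vectors with it (the random matrix is a stand-in for the saved file, and the sizes are hypothetical):

    import numpy as np
    import tensorflow as tf  # TF 1.x style, as in the question

    # Stand-in for the embeddings saved earlier, e.g. np.load("my_embeddings.npy").
    pretrained = np.random.rand(10000, 128).astype(np.float32)   # [vocab, dim]
    word_ids = tf.placeholder(tf.int32, [None])                   # word indices of a batch

    # A frozen variable initialized from the pre-trained matrix.
    embedding_matrix = tf.get_variable("embedding_matrix",
                                       initializer=pretrained,
                                       trainable=False)
    word_vectors = tf.nn.embedding_lookup(embedding_matrix, word_ids)
    # word_vectors can now feed an LSTM, e.g. via tf.nn.dynamic_rnn.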
link 2: Tensorflow | Embeddings
In this second article however, I couldn't understand what is happening.
word_embeddings = tf.get_variable("word_embeddings",
                                  [vocabulary_size, embedding_size])
embedded_word_ids = tf.gather(word_embeddings, word_ids)
This is given under the "training embeddings" section. My doubt is: does the gather function train the embeddings automatically? I am not sure, since this op ran very fast on my PC.
More generally: what is the right way to convert words into vectors in TensorFlow (link 1 or link 2) for training a seq2seq model? Also, how do I train embeddings for a seq2seq dataset, given that my data consists of separate sequences rather than one continuous sequence of words (as in the link 1 dataset)?
Alright, anyway, I have found the answer to this question and I am posting it so that others might benefit from it.
The first link is more of a tutorial that steps you through the process of exactly how the embeddings are learnt.
In practical cases, such as training seq2seq or other encoder-decoder models, we use the second approach, where the embedding matrix is tuned along with the rest of the model during training.
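A minimal TF 1.x sketch of that second approach (the loss and data are toy placeholders, just to make the point that the lookup itself does not train anything: the optimizer updates word_embeddings together with the rest of the model via backpropagation):

    import numpy as np
    import tensorflow as tf  # TF 1.x style, matching the snippet above

    vocabulary_size, embedding_size = 50, 8
    word_ids = tf.placeholder(tf.int32, shape=[None])
    targets = tf.placeholder(tf.float32, shape=[None])

    # The embedding matrix is just another trainable variable.
    word_embeddings = tf.get_variable("word_embeddings",
                                      [vocabulary_size, embedding_size])
    embedded = tf.nn.embedding_lookup(word_embeddings, word_ids)  # same idea as tf.gather

    # A toy "model" on top; its gradients flow back into word_embeddings.
    predictions = tf.reduce_sum(embedded, axis=1)
    loss = tf.reduce_mean(tf.square(predictions - targets))
    train_op = tf.train.AdamOptimizer(0.01).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for _ in range(100):
            sess.run(train_op, {word_ids: np.random.randint(0, vocabulary_size, 32),
                                targets: np.random.rand(32)})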

Use Tensorflow LSTM PTB example for scoring sentences

I am trying to use an LSTM trained according to the TensorFlow LSTM example. That example computes perplexity over the whole test set, but I need to use the trained model to score (get the log-likelihood of) each sentence separately, in order to score hypotheses from an STT decoder's output. I modified the reader a bit and used this code:
mtests = list()
test_input = list()
with tf.name_scope("Test"):
    for test_data_item in test_data:
        test_input.append(PTBInput(config=eval_config, data=test_data_item, name="TestInput"))
    with tf.variable_scope("Model", reuse=True, initializer=initializer):
        for test_input_item in test_input:
            mtests.append(PTBModel(is_training=False, config=eval_config,
                                   input_=test_input_item))

sv = tf.train.Supervisor(logdir=FLAGS.model_dir)
with sv.managed_session() as session:
    checkpoint = tf.train.latest_checkpoint(FLAGS.model_dir)
    sv.saver.restore(session, checkpoint)
    sys.stderr.write("model restored\n")
    for mtest in mtests:
        score, test_perplexity = run_epoch_test(session, mtest)
        print(score)
Using that code, I get the score of each sentence independently. With 5 sentences it works fine, but if I pass 1,000 sentences it becomes extremely slow and uses a lot of memory, because I create 1,000 mtest model instances. Could you suggest another way to reach my goal? Thank you.
It seems like the model can take a batch of inputs, whose size is set to 20 by default. You should be able to feed a larger batch of sentences to a single test model and get the output for all of them, without having to create multiple model instances. This will probably involve some experimenting with the reader, which you are already familiar with.
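A rough sketch of that idea, reusing the names from the question (run_epoch_test is the asker's own function; that it can return per-sentence scores when the loss is not averaged across the batch is an assumption here, not something from the original example):

    # Assumptions: sentences are padded to a common length, and the model's loss
    # is computed per example (not averaged across the batch), so one row of the
    # loss corresponds to one sentence.
    eval_config.batch_size = 64  # hypothetical; large enough to amortize graph cost

    with tf.name_scope("Test"):
        test_input = PTBInput(config=eval_config, data=test_data, name="TestInput")
        with tf.variable_scope("Model", reuse=True, initializer=initializer):
            mtest = PTBModel(is_training=False, config=eval_config, input_=test_input)

    sv = tf.train.Supervisor(logdir=FLAGS.model_dir)
    with sv.managed_session() as session:
        sv.saver.restore(session, tf.train.latest_checkpoint(FLAGS.model_dir))
        scores, _ = run_epoch_test(session, mtest)  # one model instance for all sentences
        for score in scores:
            print(score)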

Vector representation in multidimensional time-series prediction in TensorFlow

I have a large data set (~30 million data points with 5 features) that I have reduced using K-means down to 200,000 clusters. The data is a time series with ~150,000 time-steps. The data on which I would like to train the model is the presence of particular clusters at each time-step. The purpose of the predictive model is to generate a generalized sequence, similar to generating syntactically correct sentences from a model trained on word sequences. The easiest way to think about this data is that I'm trying to predict the pixels of the next video frame from the pixels of the current frame, in order to generate a new sequence of frames that approximates the original sequence.
The raw and sparse representation at each time-step would be 200,000 binary values representing which clusters are present or not at that time step. Note, no more than 200 clusters may be present in any one time-step and thus this representation is extremely sparse.
What is the best representation to convert this sparse vector to a dense vector that would be more suitable to time-series prediction using Tensorflow?
I initially had in mind an RNN/LSTM trained on the vectors at each time-step, but due to the size of the training vector I'm now wondering whether a convolutional approach would be more suitable.
Note, I have not actually used TensorFlow beyond some simple tutorials, but I have previously used OpenCV's ML functions. Please consider me a novice in your responses.
Thank you.
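Not an answer from the original thread, but for illustration, a minimal TF 1.x sketch of one common way to densify such a step: treat each active cluster ID as a token, look up a learned embedding for it, and sum (or average) the embeddings of the clusters present at that time-step. All sizes and names here are hypothetical:

    import numpy as np
    import tensorflow as tf  # TF 1.x style, matching the rest of the thread

    num_clusters, embed_dim, max_active = 200000, 64, 200

    # Active cluster IDs for each time-step, padded/truncated to max_active.
    # (In practice you would reserve a padding ID and mask it out; omitted here.)
    active_ids = tf.placeholder(tf.int32, shape=[None, max_active])  # [batch, 200]

    cluster_embeddings = tf.get_variable("cluster_embeddings",
                                         [num_clusters, embed_dim])
    # Dense per-step representation: sum of the active clusters' embeddings.
    step_vectors = tf.reduce_sum(
        tf.nn.embedding_lookup(cluster_embeddings, active_ids), axis=1)  # [batch, 64]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        demo = np.random.randint(0, num_clusters, size=(2, max_active))
        print(sess.run(step_vectors, {active_ids: demo}).shape)  # (2, 64)

The resulting dense step_vectors could then feed an RNN/LSTM or a convolutional model, and the embedding matrix would be trained along with it.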