How to project my word2vec model in Tensorflow - tensorflow

I am new to using word embedding and want to know how i can project my model in Tensorflow. I was looking at the tensorflow website and it only accepts tsv file (vector/metadata), but don't know how to generate the required tsv files. I have tried looking it up and can't find any solutions regrading this. Will I try saving my model in a tsv file format, will i need to do some transformations? Any help will be appreciated.
I have saved my model as the following files, and just load it up when I need to use it:
word2vec.model
word2vec.model.wv.vectors.npy

Assuming you're trying to load some pre-trained Gensim word embeddings into a model, you can do this directly with the following code..
import numpy
import tensorflow as tf
from gensim.models import KeyedVectors
# Load the word-vector model
wvec_fn = 'wvecs.kv'
wvecs = KeyedVectors.load(wvec_fn, mmap='r')
vec_size = wvecs.vector_size
vocab_size = len(wvecs.vocab)
# Create the embedding matrix where words are indexed alphabetically
embedding_mat = numpy.zeros(shape=(vocab_size, vec_size), dtype='int32')
for idx, word in enumerate(sorted(wvecs.vocab)):
embedding_mat[idx] = wvecs.get_vector(word)
# Setup the embedding matrix for tensorflow
with tf.variable_scope("input_layer"):
embedding_tf = tf.get_variable(
"embedding", [vocab_size, vec_size],
initializer=tf.constant_initializer(embedding_mat),
trainable=False)
# Integrate this into your model
batch_size = 32 # just for example
seq_length = 20
input_data = tf.placeholder(tf.int32, [batch_size, seq_length])
inputs = tf.nn.embedding_lookup(embedding_tf, input_data)
If you've save a model instead of just the KeyedVectors, you may need to modify the code to load the model and then access the KeyedVectors with model.wv.

Related

Tensorflow Recommender - ScaNN passing embedding as query

I want to pass a query embedding to ScaNN instead of a model, what data type should I use for this?
My query would look like this [1, 0.3, 0.4]
My candidate embedding would be something like:
[[0.2, 1, .4],
[0.3,0.1,0.56]]
All the examples I see are passing an query model, not the embedding itself.
I tried passing a numpy array but it didn't work
Embeddings are just lists of vectors which your model produces. In this case using the tf.keras.layers.Embedding layer.
self._embeddings = {}
# Compute embeddings for string features
for feature_name in str_features:
vocabulary = vocabularies[feature_name]
self._embeddings[feature_name] = tf.keras.Sequential([
tf.keras.layers.StringLookup(
vocabulary=vocabulary, mask_token=None),
tf.keras.layers.Embedding(len(vocabulary) + 1,
self.embedding_dimension)
])
You can also use another model such as a Sentence Transformer to create embeddings.
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)
You do not need to pass the model to ScaNN, you can pass it the embeddings directly as well as mentioned in the documentation here
Here is a sample code snippet on how to pass embeddings directly to scann
import pandas as pd
from sklearn import preprocessing, metrics
df = pd.read_csv("./data/mydata.csv")
# normalization
df_np = preprocessing.normalize(df.iloc[:,1:], norm=norm)
num_neighbors = 100
# creating searcher
k = int(np.sqrt(df_np.shape[0]))
searcher = scann.scann_ops_pybind.builder(df_np, num_neighbors, "dot_product").tree(
num_leaves=k,
num_leaves_to_search=int(k/20),
training_sample_size=2500).score_brute_force(2).reorder(7).build()
Here is a blog post on using ScaNN
ScaNN optimization and configuration

Tensorflow: Classifying images in batches

I have followed this TensorFlow tutorial to classify images using transfer learning approach. Using almost 16,000 manually classified images (with about 40/60 split of 1/0) added on top of the pre-trained MobileNet V2 model, my model achieved 96% accuracy on the hold out test set. I then saved the resulting model.
Next, I would like to use this trained model to classify new images. To do so, I have adapted one of the portions of the tutorial's code (in the end where it says #Retrieve a batch of images from the test set) in the way described below. The code works, however, it only processes one batch of 32 images and that's it (there are hundreds of images in the source folder). What am I missing here? Please advise.
# Import libraries
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import preprocessing
from tensorflow.keras.preprocessing import image_dataset_from_directory
import matplotlib.pyplot as plt
import numpy as np
import os
# Load saved model
model = tf.keras.models.load_model('/model')
# Re-compile model
base_learning_rate = 0.0001
model.compile(optimizer=tf.keras.optimizers.Adam(lr=base_learning_rate),
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics=['accuracy'])
# Define paths
PATH = 'Data/'
new_dir = os.path.join(PATH, 'New_images') # New_images must contain at least one class (sub-folder)
IMG_SIZE = (640, 640)
BATCH_SIZE = 32
new_dataset = image_dataset_from_directory(new_dir, shuffle=True, batch_size=BATCH_SIZE, image_size=IMG_SIZE)
# Retrieve a batch of images from the test set
image_batch, label_batch = new_dataset.as_numpy_iterator().next()
predictions = model.predict_on_batch(image_batch).flatten()
# Apply a sigmoid since our model returns logits
predictions = tf.nn.sigmoid(predictions)
predictions = tf.where(predictions < 0.5, 0, 1)
print('Predictions:\n', predictions.numpy())
len(new_dataset) # equals 25, i.e., there are 25 batches
Replace this code:
# Retrieve a batch of images from the test set
image_batch, label_batch = new_dataset.as_numpy_iterator().next()
predictions = model.predict_on_batch(image_batch).flatten()
with this one:
predictions = model.predict(new_dataset,batch_size=BATCH_SIZE).flatten()
tf.data.Dataset objects can be directly passed to the method predict(). Reference

Tensorflow 2.0 save preprocessing tonkezier for nlp into tensorflow server

I have trained a tensforflow 2.0 keras model to make some natural language processing.
What I am doing basically is get the title of different news and predicting in what category they belong. In order to do that I have to tokenize the sentences and then add 0 to fill the array to have the same lenght that I defined:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
max_words = 1500
tokenizer = Tokenizer(num_words=max_words )
tokenizer.fit_on_texts(x.values)
X = tokenizer.texts_to_sequences(x.values)
X = pad_sequences(X, maxlen = 32)
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, GRU,InputLayer
numero_clases = 5
modelo_sentimiento = Sequential()
modelo_sentimiento.add(InputLayer(input_tensor=tokenizer.texts_to_sequences, input_shape=(None, 32)))
modelo_sentimiento.add(Embedding(max_palabras, 128, input_length=X.shape[1]))
modelo_sentimiento.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2, return_sequences=True))
modelo_sentimiento.add(LSTM(256, dropout=0.2, recurrent_dropout=0.2))
modelo_sentimiento.add(Dense(numero_clases, activation='softmax'))
modelo_sentimiento.compile(loss = 'categorical_crossentropy', optimizer='adam',
metrics=['acc',f1_m,precision_m, recall_m])
print(modelo_sentimiento.summary())
Now once trained I want to deploy it for example in tensorflow serving, but I don't know how to save this preprocessing(tokenizer) into the server, like make a scikit-learn pipeline, it is possible to do it here? or I have to save the tokenizer and make the preprocessing by my self and then call the model trained to predict?
Unfortunately, you won't be able to do something as elegant as a sklearn Pipeline with Keras models (at least I'm not aware of) easily. Of course you'd be able to create your own Transformer which will achieve the preprocessing you need. But given my experience trying to incorporate custom objects in sklearn pipelines, I don't think it's worth the effort.
What you can do is save the tokenizer along with metadata using,
with open('tokenizer_data.pkl', 'wb') as handle:
pickle.dump(
{'tokenizer': tokenizer, 'num_words':num_words, 'maxlen':pad_len}, handle)
And then load it when you want to use it,
with open("tokenizer_data.pkl", 'rb') as f:
data = pickle.load(f)
tokenizer = data['tokenizer']
num_words = data['num_words']
maxlen = data['maxlen']

Embedding feature vectors in Tensorflow

In text processing there is embedding to show up (if I understood it correctly) the database words as vector (after dimension reduction).
now, I am wondering, is there any method like this to show extracted features via CNN?
for example: consider we have a CNN and train and test sets. we want to train the CNN with train set and meanwhile see the extracted features (from dense layer) corresponding class labels via CNN in the embedding section of tensorboard.
the purpose of this work is seeing the features of input data in every batch and understand how close or far are they from together. and finally, in the trained model, we can find out accuracy of our classifier (like softmax or etc.).
thank you in advance for your help.
I have taken help of Tensorflow documentation.
For in depth information on how to run TensorBoard and make sure you are logging all the necessary information, see TensorBoard: Visualizing Learning.
To visualize your embeddings, there are 3 things you need to do:
1) Setup a 2D tensor that holds your embedding(s).
embedding_var = tf.get_variable(....)
2) Periodically save your model variables in a checkpoint in LOG_DIR.
saver = tf.train.Saver()
saver.save(session, os.path.join(LOG_DIR, "model.ckpt"), step)
3) (Optional) Associate metadata with your embedding.
If you have any metadata (labels, images) associated with your embedding, you can tell TensorBoard about it either by directly storing a projector_config.pbtxt in the LOG_DIR, or use our python API.
For instance, the following projector_config.ptxt associates the word_embedding tensor with metadata stored in $LOG_DIR/metadata.tsv:
embeddings {
tensor_name: 'word_embedding'
metadata_path: '$LOG_DIR/metadata.tsv'
}
The same config can be produced programmatically using the following code snippet:
from tensorflow.contrib.tensorboard.plugins import projector
# Create randomly initialized embedding weights which will be trained.
vocabulary_size = 10000
embedding_size = 200
embedding_var = tf.get_variable('word_embedding', [vocabulary_size,
embedding_size])
# Format: tensorflow/tensorboard/plugins/projector/projector_config.proto
config = projector.ProjectorConfig()
# You can add multiple embeddings. Here we add only one.
embedding = config.embeddings.add()
embedding.tensor_name = embedding_var.name
# Link this tensor to its metadata file (e.g. labels).
embedding.metadata_path = os.path.join(LOG_DIR, 'metadata.tsv')
#Use the same LOG_DIR where you stored your checkpoint.
summary_writer = tf.summary.FileWriter(LOG_DIR)
# The next line writes a projector_config.pbtxt in the LOG_DIR. TensorBoard will
# read this file during startup.
projector.visualize_embeddings(summary_writer, config)

How to extract features from tensorflow slim model VGG when doing forward run?

I have trained a model for classification using TensorFlow slim model vgg, using CASIA(a face recognition dataset) as training dataset.
I want to test the model by using LFW dataset, it is a face matching task. so I need to extract the net features like fc7/fc8, not the softmax layer, and compare the distance between the features, to determine whether they are the same person.
How can I extract the features of a slim model?
Here is part of the training code.
import tensorflow as tf
from tensorflow.contrib.slim.python.slim.nets import vgg
slim = tf.contrib.slim
FLAGS = tf.app.flags.FLAGS
def tower_loss(scope):
images, labels = read_and_decode()
with slim.arg_scope(vgg.vgg_arg_scope()):
logits, end_points = vgg.vgg_16(images, num_classes=FLAGS.num_classes)
_ = cal_loss(logits, labels)
losses = tf.get_collection('losses', scope)
total_loss = tf.add_n(losses, name='total_loss')
return total_loss
You can try using tf.get_default_graph().get_tensor_by_name("VGG16/fc16:0") or whatever tensor name of the specific feature you want to extract.
To verify the name of the tensors you are extracting, you can try
for operation in graph.get_operations():
print operation.values()
Remember to put :0 at the end of the names as they indicate the item you're retrieving is a tensor.
Get end_points of slim model and extract the feature.