Trainable USE-lite-based classifier with SentencePiece input - tensorflow

I have heard that it is possible to use the pretrained Universal Sentence Encoder (USE), a neural language model from TF-hub, as part of a trainable model, e.g. a sentence classifier. Some versions of USE rely on the SentencePiece sub-word tokenizer, which I also need. There are only minimal instructions online for how to do this.
Here is how to use USE-lite with SentencePiece:
- https://tfhub.dev/google/universal-sentence-encoder-lite/2
Here is how to train a classifier based on a pretrained USE model:
- http://hunterheidenreich.com/blog/google-universal-sentence-encoder-in-keras/
- https://www.youtube.com/watch?v=gnz1CUzb5qo
And here is how to measure sentence similarity using both USE-lite and SentencePiece:
- https://github.com/tensorflow/hub/blob/master/examples/colab/semantic_similarity_with_tf_hub_universal_encoder_lite.ipynb
I have successfully reproduced the above pieces separately. I have then tried to combine them into a single POC that builds a classifier model based on USE-lite and SentencePiece, but I cannot see how to do it. I am currently stuck on the part where I modify the trainable classifier's first layer(s). I have tried to make it accept either (1) SentencePiece token IDs (where I tokenize the text outside of the Tensorflow graph) or (2) raw text (using SentencePiece as an Op inside the Tensorflow graph). After that point, it should feed the tokenized text forward into the USE-lite model, either in a lambda or in some other way. Finally, the output of USE-lite should be fed into a dense layer (or two?) ending in softmax for computing class probabilities.
I am relatively new to Tensorflow. I imagine that the above sources would be sufficient for a more experienced Tensorflow developer to merge and make work for my use-case. Let me know if you can provide any pointers. Thanks.
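For option (1), tokenizing outside the graph, the USE-lite page and notebook linked above feed the module a sparse batch made of three parts: values, indices and dense_shape. The following is a minimal sketch of that conversion step only, assuming the token IDs have already been produced by SentencePiece. The function name to_sparse_format and the example IDs are hypothetical and not part of any library.

```python
def to_sparse_format(token_ids):
    """Flatten ragged per-sentence token IDs into (values, indices, dense_shape)."""
    # Longest sentence determines the second dimension of the sparse batch.
    max_len = max((len(ids) for ids in token_ids), default=0)
    # All token IDs, sentence after sentence, in one flat list.
    values = [tok for ids in token_ids for tok in ids]
    # One [row, column] pair per token: row = sentence, column = position.
    indices = [
        [row, col]
        for row, ids in enumerate(token_ids)
        for col in range(len(ids))
    ]
    dense_shape = (len(token_ids), max_len)
    return values, indices, dense_shape


# Hypothetical token IDs for two sentences of different length.
batch = [[5, 17, 9], [42]]
values, indices, dense_shape = to_sparse_format(batch)
```

These three lists are what would be fed to the module's inputs in place of raw text; the dense softmax layers would then sit on top of the module's sentence embedding output.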

Related

Deploying tensorflow RNN models (other than keras LSTM) to a microcontroller without unrolling the network?

Goal
I want to compare different types of RNN tflite-micro models, built using tensorflow, on a microcontroller based on their accuracy, model size and inference time. I have also created my own custom RNN cell that I want to compare with the LSTM cell, GRU cell, and SimpleRNN cell. I create the tensorflow model using tf.keras.layers.RNN(Cell(...)).
Problem
I have successfully deployed a keras LSTM-RNN using tf.keras.layers.LSTM(...), but when I create the same model using tf.keras.layers.RNN(tf.keras.layers.LSTMCell(...)) and deploy it to the microcontroller, it does not work. I trained both networks with a batch size of 64, then copied the weights and biases to a model whose batch_size is fixed to 1, because tflite-micro does not support dynamic batch sizes.
When the keras LSTM layer is converted to a tflite model, it creates a fused operator called UnidirectionalSequenceLSTM. The network created with an RNN layer using the LSTMCell does not have that UnidirectionalSequenceLSTM operator; instead it has a reshape and a while operator. The first network has only 1 subgraph but the second has 3 subgraphs.
When I run that second model on the microcontroller, two things go wrong:
- the interpreter returns the same result for different inputs
- the interpreter fails on some inputs, reporting an error with the while loop saying that int32 is not supported (which is in the while operator, and can't be quantized to int8)
LSTM tflite-model visualized with Netron
RNN(LSTMCell) tflite-model visualized with Netron
Bad solution (10x model size)
I figured out that by unrolling the second network I can successfully deploy it and get correct results on the microcontroller. However, that increases the model size 10x which is really bad as we are trying to deploy the model on a resource constrained device.
Better solution?
I have explained the problem using the example of the LSTM layer (works) and LSTM cell in an RNN layer (does not work), but I want to be able to deploy a model using the GRU cell, SimpleRNN cell, and of course the custom cell that I have created. And all those have the same problem as the network created with the LSTM cell.
What can I do?
Do I have to create a special fused operator? Maybe even one for each cell I want to compare? How would I do that?
Can I use the interface into the conversion infrastructure for user-defined RNN implementations mentioned here: https://www.tensorflow.org/lite/models/convert/rnn? As I understand the documentation, this would only work for user-defined LSTM implementations, not user-defined RNN implementations as the title suggests.

Tensorflow Hub Image Modules: Clarity on Preprocessing and Output values

Many thanks for support!
I currently use TF Slim - and TF Hub seems like a very useful addition for transfer learning. However the following things are not clear from the documentation:
1. Is preprocessing done implicitly? Is this based on "trainable=True/False" parameter in constructor of module?
module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1", trainable=True)
When I use Tf-slim I use the preprocess method:
inception_preprocessing.preprocess_image(image, img_height, img_width, is_training)
2. How do I get access to AuxLogits for an Inception model? They seem to be missing:
import tensorflow_hub as hub
import tensorflow as tf
img = tf.random_uniform([10,299,299,3])
module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1", trainable=True)
outputs = module(dict(images=img), signature="image_feature_vector", as_dict=True)
The output is
dict_keys(['InceptionV3/Mixed_6b', 'InceptionV3/MaxPool_5a_3x3', 'InceptionV3/Mixed_6c', 'InceptionV3/Mixed_6d', 'InceptionV3/Mixed_6e', 'InceptionV3/Mixed_7a', 'InceptionV3/Mixed_7b', 'InceptionV3/Conv2d_2a_3x3', 'InceptionV3/Mixed_7c', 'InceptionV3/Conv2d_4a_3x3', 'InceptionV3/Conv2d_1a_3x3', 'InceptionV3/global_pool', 'InceptionV3/MaxPool_3a_3x3', 'InceptionV3/Conv2d_2b_3x3', 'InceptionV3/Conv2d_3b_1x1', 'default', 'InceptionV3/Mixed_5b', 'InceptionV3/Mixed_5c', 'InceptionV3/Mixed_5d', 'InceptionV3/Mixed_6a'])
These are excellent questions; let me try to give good answers also for readers less familiar with TF-Slim.
1. Preprocessing is not done by the module, because it is a lot about your data, and not so much about the CNN architecture within the module. The module only handles transforming input values from the canonical [0,1] range into whatever the pre-trained CNN within the module expects.
Lengthy rationale: Preprocessing of images for CNN training usually consists of decoding the input JPEG (or whatever), selecting a (reasonably large) random crop from it, random photometric and geometric transformations (distort colors, flip left/right, etc.), and resizing to the common image size for a batch of training inputs. The TensorFlow Hub modules that implement https://tensorflow.org/hub/common_signatures/images leave all of that to your code around the module.
The primary reason is that the suitable random transformations depend a lot on your training task, but not on the architecture or trained state weights of the module. For example, color distortions will help if you classify cars vs dogs, but probably not for ripe vs unripe bananas, and so on.
Also, a batch of images that have been decoded but not yet cropped/resized is hard to represent as a single tensor (unless you make it a 1-D tensor of encoded strings, but that brings other problems, such as breaking backprop into module inputs for advanced uses).
Bottom line: The Python code using the module needs to do image preprocessing (except scaling values), for example, as in https://github.com/tensorflow/hub/blob/master/examples/image_retraining/retrain.py
The slim preprocessing methods conflate the dataset-specific random transformations (tuned for Imagenet!) with the re-scaling to the architecture's value range (which the Hub module does for you). That means they are not directly applicable here.
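To make the split concrete, here is a minimal NumPy sketch of the value scaling only, assuming decoded uint8 pixels. The helper names are hypothetical. The [-1, 1] mapping shown for comparison is the rescaling step that, as far as I can tell, the slim Inception preprocessing applies after its random transformations; with a Hub module you would stop at [0, 1] and let the module do the rest.

```python
import numpy as np


def to_unit_range(image_uint8):
    """Convert decoded uint8 pixels (0..255) to float32 in [0, 1].

    This is the canonical input range the Hub image modules expect.
    """
    return image_uint8.astype(np.float32) / 255.0


def slim_style_rescale(image_unit):
    """Map [0, 1] to [-1, 1] (assumed slim Inception rescaling).

    Shown for comparison only: do NOT apply this before a Hub module,
    which performs its own rescaling internally.
    """
    return (image_unit - 0.5) * 2.0


# A tiny 1x1 "image" with three channels covering the full value range.
pixels = np.array([[[0, 128, 255]]], dtype=np.uint8)
unit = to_unit_range(pixels)          # what you feed to the Hub module
rescaled = slim_style_rescale(unit)   # what slim would feed to its own CNN
```

Cropping, flipping and color distortion would still happen in your own code before to_unit_range's output is batched.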
2. Indeed, auxiliary heads are missing from the initial set of modules published under tfhub.dev/google/..., but I expect them to work fine for re-training anyways.
More details: Not all architectures have auxiliary heads, and even the original Inception paper says their effect was "relatively minor" [Szegedy&al. 2015; §5]. Using an image feature vector module for a custom classification task would burden the module consumer code with checking for aux features and, if found, putting aux logits and a loss term on top.
This complication did not seem to pull its weight, but more experiments might refute that assessment. (Please share in a GitHub issue if you know of any.)
For now, the only way to put an aux head onto https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1 is to copy&paste some lines from https://github.com/tensorflow/models/blob/master/research/slim/nets/inception_v3.py (search "Auxiliary head logits") and apply that to the "InceptionV3/Mixed_6e" output that you saw.
3. You didn't ask, but: For training, the module's documentation recommends passing hub.Module(..., tags={"train"}), or else batch norm operates in inference mode (and dropout, if the module had any).
Hope this explains how and why things are.
Arno (from the TensorFlow Hub developers)

Is it possible to create a trainable variable in keras like in tensorflow?

Good morning everyone;
I'm trying to implement a model where the neural network's inputs are based on a trainable vocabulary matrix (each row in the matrix represents a word entry in the vocabulary). I'm using keras (tensorflow backend). I was wondering if it's possible to define a trainable variable (without adding a custom layer), such that this variable is trained along with the neural network, like a tensorflow variable?
Could you please give a short example of how I can do it?
Thanks in advance.
The neural network's inputs are based on a trainable vocabulary matrix (each row in the matrix represents a word entry in the vocabulary)
This is the definition of a Word Embedding.
There is already an embedding layer in Keras, so you don't have to reimplement it.
You can find an easy example of how to use it here.

Tensorflow embeddings

I know what embeddings are and how they are trained. Precisely, while referring to the tensorflow's documentation, I came across two different articles. I wish to know what exactly is the difference between them.
link 1: Tensorflow | Vector Representations of words
In the first tutorial, they have explicitly trained embeddings on a specific dataset. There is a distinct session run to train those embeddings. I can then later on save the learnt embeddings as a numpy object and use the tf.nn.embedding_lookup() function while training an LSTM network.
link 2: Tensorflow | Embeddings
In this second article however, I couldn't understand what is happening.
word_embeddings = tf.get_variable("word_embeddings",
    [vocabulary_size, embedding_size])
embedded_word_ids = tf.gather(word_embeddings, word_ids)
This is given under the training embeddings sections. My doubt is: does the gather function train the embeddings automatically? I am not sure since this op ran very fast on my pc.
Generally: What is the right way to convert words into vectors (link 1 or link 2) in tensorflow for training a seq2seq model? Also, how do I train the embeddings for a seq2seq dataset, since the data for my task is in the form of separate sequences, unlike the continuous sequence of words in the link 1 dataset?
Alright! anyway, I have found the answer to this question and I am posting it so that others might benefit from it.
The first link is more of a tutorial that steps you through the process of exactly how the embeddings are learnt.
In practical cases, such as training seq2seq models or any other encoder-decoder models, we use the second approach, where the embedding matrix gets tuned appropriately while the model gets trained.
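To see why the gather op itself is fast and does no training, here is a minimal NumPy sketch of the second approach (not TensorFlow code; the sizes, IDs, learning rate and the all-ones upstream gradient are made up for illustration). The lookup is plain row indexing; the matrix only changes when an optimizer step applies the gradient that flows back to the rows that were looked up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_size = 6, 3

# The trainable embedding matrix: one row per vocabulary entry.
word_embeddings = rng.normal(size=(vocabulary_size, embedding_size))
before = word_embeddings.copy()

# Forward pass: gathering rows is just indexing, no learning happens here.
word_ids = np.array([1, 4, 1])
embedded = word_embeddings[word_ids]  # shape (3, embedding_size)

# Backward pass: pretend the rest of the model sent back this gradient
# with respect to the gathered rows (hypothetical values).
upstream_grad = np.ones_like(embedded)

# The gradient for the matrix is zero except in the rows that were looked up;
# np.add.at accumulates contributions for IDs that occur more than once.
grad = np.zeros_like(word_embeddings)
np.add.at(grad, word_ids, upstream_grad)

# One plain gradient-descent step, as an optimizer would apply during training.
learning_rate = 0.1
word_embeddings -= learning_rate * grad
```

Only rows 1 and 4 move in this step; row 1 receives twice the gradient because its ID appears twice in the batch.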

Building a conversational model using TensorFlow

I'd like to build a conversational model that can predict a sentence from the previous sentences using TensorFlow LSTMs. The example provided in the TensorFlow tutorial can be used to predict the next word in a sentence.
https://www.tensorflow.org/versions/v0.6.0/tutorials/recurrent/index.html
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)
    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
Can I use the same technique to predict the next sentence? Is there any working example of how to do this?
You want to use the Sequence-to-sequence model. Instead of having it learn to translate sentences from a source language to a target language you have it learn responses to previous utterances in the conversation.
You can adapt the example seq2seq model in tensorflow by using the analogy that the source language 'English' is your set of previous sentences and target language 'French' are your response sentences.
In theory you could use the basic LSTM you were looking at by concatenating your training examples with a special symbol like this:
hello there ! __RESPONSE hi , how can i help ?
Then during testing you run it forward with a sequence up to and including the __RESPONSE symbol and the LSTM can carry it the rest of the way.
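As a rough sketch of that data preparation in plain Python (the helper names are hypothetical, and the language model itself is left out), the training strings and the test-time prompt could be built like this:

```python
RESPONSE_TOKEN = "__RESPONSE"


def make_training_example(utterance, reply):
    """Join an utterance and its reply into one training sequence."""
    return f"{utterance} {RESPONSE_TOKEN} {reply}"


def make_prompt(utterance):
    """Build the test-time input: everything up to and including the symbol."""
    return f"{utterance} {RESPONSE_TOKEN}"


def extract_reply(generated):
    """Return whatever follows the symbol in a full generated sequence."""
    _, _, reply = generated.partition(RESPONSE_TOKEN)
    return reply.strip()


example = make_training_example("hello there !", "hi , how can i help ?")
prompt = make_prompt("hello there !")
```

At test time the LSTM would be run over the words of prompt and then sampled word by word; extract_reply recovers the response from the full sequence.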
However, the seq2seq model above should be much more accurate and powerful because it has a separate encoder / decoder and includes an attention mechanism.
A sentence is composed of words, so you can indeed predict the next sentence by predicting words sequentially. There are models, such as the one described in this paper, that build embeddings for entire paragraphs, which can be useful for your purpose. Of course there is Neural Conversational Model work that probably directly fits your need. TensorFlow doesn't ship with working examples of these models, but the recurrent models that come with TensorFlow should give you a good starting point for implementing them.