TextVectorization layer in a deep neural network - TensorFlow

I'm training a neural network with TensorFlow for an NLP problem, and I'm using the TextVectorization layer as the first layer. One parameter of this layer gives me problems: standardize. It has some default options, but it also accepts a custom callable. I have written a function that I want to use in this layer. My function receives a string as input and returns a string as output, which is not compatible with the input and output expected of the callable in this layer.
What input and output must this callable have?
If there is a way, how can I convert my function?
This is my function:
import nltk
from nltk.sentiment.util import *
from nltk.stem.wordnet import WordNetLemmatizer
import re
nltk.download('wordnet')
def my_preprocessing(text):
    # split on whitespace, lemmatize each token, drop purely numeric tokens
    text = re.split("\\s+", text)
    wnl = WordNetLemmatizer()
    stemmed_words = [wnl.lemmatize(word) for word in text]
    testo = [w for w in stemmed_words if not w.isnumeric()]
    return ' '.join(testo)
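A minimal sketch of the kind of callable the layer can work with, assuming that standardize is handed a tf.string tensor (a batch of texts) rather than a Python str and must return a string tensor of the same shape. The name tensor_standardize and the regular expressions are illustrative and not part of the question. It only approximates the whitespace handling and number removal with tf.strings ops; WordNet lemmatization has no tf.strings counterpart, so in this sketch it would have to be applied to the raw Python strings before they reach the layer.

```python
import tensorflow as tf

def tensor_standardize(inputs):
    # inputs is a tf.string tensor, so plain str methods are not available here.
    # Drop tokens made up only of digits (approximation of w.isnumeric()).
    without_numbers = tf.strings.regex_replace(inputs, r"\b[0-9]+\b", "")
    # Collapse the whitespace runs left behind, then trim both ends.
    collapsed = tf.strings.regex_replace(without_numbers, r"\s+", " ")
    return tf.strings.strip(collapsed)

# Hypothetical usage: pass the tensor-based callable to the layer and adapt it.
vectorizer = tf.keras.layers.TextVectorization(standardize=tensor_standardize)
vectorizer.adapt(tf.constant(["I have 2 cats and 10 dogs"]))
```

If lemmatization is required, one option is to run my_preprocessing over the Python strings first and hand the already cleaned texts to the layer.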

Related

Irreproducible results Tensorflow

I have a very basic code that tries to create a single-layered Dense neural net and predicts the output for a deterministic input. The code is as follows:
import tensorflow as tf
from tensorflow.keras import layers
model = tf.keras.models.Sequential()
model.add(layers.Dense(units = 10))
import numpy as np
inp = np.ones((1,10))
model.predict(inp)
But the output that I am getting isn't deterministic. I think it is related to initializing the weights and biases. So, how do I fix this without writing the initializing function from scratch?
Set the global seed before initializing the model: tf.random.set_seed(42)
You can also set a seed for specific parts of the model, e.g. the kernel_initializer of a Dense layer, but with this approach you may miss other initializers, which will still be nondeterministic. In your case, setting it globally is the best solution.
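As a rough sketch of this suggestion, the snippet below seeds everything, rebuilds the model from the question twice, and compares the two predictions. It swaps in tf.keras.utils.set_random_seed, which seeds Python's random module and NumPy in addition to calling tf.random.set_seed; this assumes a TensorFlow version that ships that helper. The function name build_and_predict is made up for the example.

```python
import numpy as np
import tensorflow as tf

def build_and_predict(seed):
    # Seed Python, NumPy and TensorFlow before any weights are created.
    tf.keras.utils.set_random_seed(seed)
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Dense(units=10))
    return model.predict(np.ones((1, 10)), verbose=0)

first = build_and_predict(42)
second = build_and_predict(42)
# Compare the two runs; they should agree if seeding took effect.
print(np.allclose(first, second))
```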

TF.js import error with model created using TF Lite Model Maker

I've created a model using the tutorial at https://www.tensorflow.org/lite/tutorials/model_maker_image_classification and exported it in the TF.js format:
import os
import matplotlib.pyplot as plt
import tensorflow as tf
from tflite_model_maker import image_classifier, model_spec
from tflite_model_maker.config import ExportFormat, QuantizationConfig
from tflite_model_maker.image_classifier import DataLoader
image_path = tf.keras.utils.get_file(
    'flower_photos.tgz',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    extract=True)
image_path = os.path.join(os.path.dirname(image_path), 'flower_photos')
data = DataLoader.from_folder(image_path)
train_data, test_data = data.split(0.9)
model = image_classifier.create(train_data)
loss, accuracy = model.evaluate(test_data)
# Export model to TF.js format
model.export(export_dir='.', export_format=ExportFormat.TFJS)
When loading this model in TF.js using tf.loadLayersModel I get the following error:
Uncaught (in promise) Error: Unknown layer: HubKerasLayerV1V2.
This may be due to one of the following reasons:
1. The layer is defined in Python, in which case it needs to be
ported to TensorFlow.js or your JavaScript code.
2. The custom layer is defined in JavaScript, but is not registered
properly with tf.serialization.registerClass()
I guess the error is due to reason (1), but how can I port the HubKerasLayerV1V2 layer to TF.js?
I believe this is caused by the model converter having trouble with a partial Graph inside of a Layers model.
You can probably fix this by serializing the model to the normal SavedModel format and exporting the HDF5. Once you have the .h5 output, use the TensorFlow.js converter (tensorflowjs_converter) to create a pure Graph model. Then try loading it with tf.loadGraphModel instead.

Error occurred when initializing NLClassifier: Type mismatch for input tensor serving_default_input_type_ids:0. Requested STRING, got INT32

I'm trying to learn how to use some ML stuff for Android. I got the Text Classification demo working, and it seems to work fine. So then I tried creating my own model.
The code I used to create my own model was this:
import numpy as np
import os
from tflite_model_maker import model_spec
from tflite_model_maker import text_classifier
from tflite_model_maker.config import ExportFormat
from tflite_model_maker.text_classifier import AverageWordVecSpec
from tflite_model_maker.text_classifier import DataLoader
import tensorflow as tf
assert tf.__version__.startswith('2')
tf.get_logger().setLevel('ERROR')
spec = model_spec.get('mobilebert_classifier')
train_data = DataLoader.from_csv(
    filename='/path to file/train.csv',
    text_column='sentence',
    label_column='label',
    model_spec=spec,
    is_training=True)
model = text_classifier.create(train_data, model_spec=spec, epochs=10)
model.export(export_dir='average_word_vec')
The code appeared to run fine and it created a model.tflite file for me. I then replaced the demo tflite file with mine. But when I run the demo I get the following error:
java.lang.AssertionError: Error occurred when initializing NLClassifier: Type mismatch for input tensor serving_default_input_type_ids:0. Requested STRING, got INT32.
at org.tensorflow.lite.task.text.nlclassifier.NLClassifier.initJniWithByteBuffer(Native Method)
at org.tensorflow.lite.task.text.nlclassifier.NLClassifier.access$100(NLClassifier.java:67)
at org.tensorflow.lite.task.text.nlclassifier.NLClassifier$2.createHandle(NLClassifier.java:223)
at org.tensorflow.lite.task.core.TaskJniUtils.createHandleFromLibrary(TaskJniUtils.java:91)
at org.tensorflow.lite.task.text.nlclassifier.NLClassifier.createFromBufferAndOptions(NLClassifier.java:219)
at org.tensorflow.lite.task.text.nlclassifier.NLClassifier.createFromFileAndOptions(NLClassifier.java:175)
at org.tensorflow.lite.task.text.nlclassifier.NLClassifier.createFromFile(NLClassifier.java:150)
at org.tensorflow.lite.examples.textclassification.client.TextClassificationClient.load(TextClassificationClient.java:44)
at org.tensorflow.lite.examples.textclassification.MainActivity.lambda$onStart$1$MainActivity(MainActivity.java:67)
at org.tensorflow.lite.examples.textclassification.-$$Lambda$MainActivity$eJaQnJq74KcmPEczFE5swJIGydg.run(Unknown Source:2)
What am I missing?
In your code you trained a MobileBERT model, but exported it to the path average_word_vec?
spec = model_spec.get('mobilebert_classifier')
model.export(export_dir='average_word_vec')
One possibility is that you are using the average_word_vec model but added MobileBERT metadata, so the preprocessing doesn't match.
Could you follow the Model Maker tutorial and try again?
https://colab.sandbox.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/lite/g3doc/tutorials/model_maker_text_classification.ipynb
Make sure to change the export path.

Load pretrained model on TF-Hub to calculate Word Mover's Distance (WMD) on Gensim or spaCy

I'd like to calculate Word Mover's Distance with Universal Sentence Encoder on TensorFlow Hub embedding.
I have tried the example on spaCy for WMD-relax, which loads 'en' model from spaCy, but I couldn't find another way to feed other embeddings.
In gensim, it seems that it only accepts load_word2vec_format file (file.bin) or load file (file.vec).
As far as I know, someone has written a BERT-to-token-embeddings tool based on PyTorch, but it doesn't generalize to other models on TF-Hub.
Is there any other approach to transfer pretrained models on tf-hub to spaCy format or word2vec format?
You need two different things.
First, tell spaCy to use an external vector for your documents, spans or tokens. This can be done by setting the user_hooks:
- user_hooks["vector"] is for the document vector
- user_span_hooks["vector"] is for the span vector
- user_token_hooks["vector"] is for the token vector
Assume you have a function that retrieves the vectors for a Doc/Span/Token from TF Hub (all of them have the property text):
import numpy as np
import spacy
import tensorflow_hub as hub
# TFHUB_URL is a placeholder for the URL of the model you want to load
model = hub.load(TFHUB_URL)
def embed(element):
    # get the text
    text = element.text
    # then get your vector back. The signature is for batches/arrays
    results = model([text])
    # get the first element because we queried with just one text
    result = np.array(results)[0]
    return result
You can write the following pipe component, which tells spaCy how to retrieve the custom embedding for documents, spans and tokens:
def overwrite_vectors(doc):
    doc.user_hooks["vector"] = embed
    doc.user_span_hooks["vector"] = embed
    doc.user_token_hooks["vector"] = embed
    # a pipeline component must hand the doc back
    return doc
# add this to your nlp pipeline to get it on every document
nlp = spacy.blank('en') # or any other Language
nlp.add_pipe(overwrite_vectors)
For your question related to the custom distance, there is a user hook also for this one:
def word_mover_similarity(a, b):
    vector_a = a.vector
    vector_b = b.vector
    # your distance score needs to be converted to a similarity score
    similarity = TODO_IMPLEMENT(vector_a, vector_b)
    return similarity
def overwrite_similarity(doc):
    doc.user_hooks["similarity"] = word_mover_similarity
    doc.user_span_hooks["similarity"] = word_mover_similarity
    doc.user_token_hooks["similarity"] = word_mover_similarity
    # a pipeline component must hand the doc back
    return doc
# as before, add this to the pipeline
nlp.add_pipe(overwrite_similarity)
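The TODO_IMPLEMENT placeholder is left open on purpose. Purely as an illustration, one common way to turn a non-negative distance into a similarity is 1 / (1 + distance), which maps a distance of 0 to a similarity of 1 and decays toward 0 as the distance grows. The sketch below uses the Euclidean distance between two vectors as a stand-in for the real Word Mover's Distance; both function names are hypothetical.

```python
import math

def euclidean_distance(vector_a, vector_b):
    # Stand-in distance; a real setup would compute WMD here instead.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(vector_a, vector_b)))

def distance_to_similarity(distance):
    # Maps [0, inf) onto (0, 1]: identical inputs give 1.0.
    return 1.0 / (1.0 + distance)

distance = euclidean_distance([0.0, 3.0], [4.0, 0.0])
print(distance, distance_to_similarity(distance))
```

Other conversions (for example exp(-distance)) would work as well; which one is appropriate depends on how the similarity scores are consumed downstream.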
I have an implementation of the TF Hub Universal Sentence Encoder that uses the user_hooks in this way: https://github.com/MartinoMensio/spacy-universal-sentence-encoder-tfhub
Here is the implementation of WMD in spaCy. You can create a WMD object and load your own embeddings:
import numpy
from wmd import WMD
embeddings_numpy_array = ...  # your array with word vectors
calc = WMD(embeddings_numpy_array, ...)
Or, as shown in this example, you can create your own class:
import spacy
spacy_nlp = spacy.load('en_core_web_lg')
class SpacyEmbeddings(object):
    def __getitem__(self, item):
        return spacy_nlp.vocab[item].vector # here you can return your own vector instead
calc = WMD(SpacyEmbeddings(), documents)
...
...
calc.nearest_neighbors("some text")
...
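To return your own vectors instead, the lookup object only needs the same __getitem__ shape as SpacyEmbeddings. The following is a minimal sketch backed by a plain dict of precomputed vectors; the class name, the toy vectors and the zero-vector fallback for unknown keys are all assumptions of this example, and which keys WMD actually passes in depends on the library.

```python
class PrecomputedEmbeddings:
    """Lookup object that serves vectors computed elsewhere (e.g. with TF Hub)."""

    def __init__(self, vectors, dimension):
        self._vectors = dict(vectors)
        self._dimension = dimension

    def __getitem__(self, item):
        # Unknown keys fall back to a zero vector (an assumption of this sketch).
        return self._vectors.get(item, [0.0] * self._dimension)

embeddings = PrecomputedEmbeddings({"cat": [0.1, 0.2], "dog": [0.3, 0.4]}, dimension=2)
print(embeddings["cat"], embeddings["bird"])
```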

Tensorflow: Keras, Estimators and custom input function

TF 1.4 made Keras an integral part of TensorFlow.
When trying to create Estimators from Keras models with a custom input function (i.e., not using tf.estimator.inputs.numpy_input_fn), things do not work because TensorFlow cannot fuse the model with the input function.
I am using tf.keras.estimator.model_to_estimator
keras_estimator = tf.keras.estimator.model_to_estimator(
    keras_model=keras_model,
    config=run_config)
train_spec = tf.estimator.TrainSpec(input_fn=train_input_fn,
                                    max_steps=self.train_steps)
eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn,
                                  steps=None)
tf.estimator.train_and_evaluate(keras_estimator, train_spec, eval_spec)
and I get the following error message:
Cannot find %s with name "%s" in Keras Model. It needs to match '
'one of the following:
I found some reference for this topic here (strangely enough, it's hidden in the TF docs in the master branch; compare to this).
If you have the same issue - see my answer below. Might save you several hours.
So here is the deal. You must make sure that your custom Input Function returns a dictionary of {inputs} and a dictionary of {outputs}.
The dictionary keys must match the names of your Keras input/output layers.
From TF docs:
First, recover the input name(s) of Keras model, so we can use them as the
feature column name(s) of the Estimator input function
This is correct.
Here is how I did this:
# Get the input and output names of the Keras model to fuse them into the infrastructure.
keras_input_names_list = keras_model.input_names
keras_target_names_list = keras_model.output_names
Now that you have the names, you need to go to your own input function and change it so that it delivers two dictionaries with the corresponding input and output names.
In my example, before the change, the input function returned [image_batch],[label_batch]. This is basically a bug, because it is stated that the input_fn returns a dictionary and not a list.
To solve this, we need to wrap it up into a dict:
image_batch_dict = dict(zip(keras_input_names_list, [image_batch]))
label_batch_dict = dict(zip(keras_target_names_list, [label_batch]))
Only now will TF be able to connect the input function to the Keras input layers.
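The wrapping step can be sketched as a small helper that runs without TensorFlow. The helper name to_named_batches and the layer names "input_1" and "dense_1" are hypothetical; in a real input function the names would come from keras_model.input_names and keras_model.output_names, and the batches would be tensors rather than strings.

```python
def to_named_batches(input_names, output_names, feature_batches, label_batches):
    # zip() silently truncates on a length mismatch, so check explicitly.
    if len(input_names) != len(feature_batches):
        raise ValueError("one feature batch is needed per Keras input name")
    if len(output_names) != len(label_batches):
        raise ValueError("one label batch is needed per Keras output name")
    features = dict(zip(input_names, feature_batches))
    labels = dict(zip(output_names, label_batches))
    return features, labels

# Placeholder strings stand in for the real image and label tensors.
features, labels = to_named_batches(
    ["input_1"], ["dense_1"], ["<image_batch>"], ["<label_batch>"])
print(features, labels)
```

The explicit length check is a design choice of this sketch: a missing batch then fails loudly inside the input function instead of surfacing later as a name lookup error in the Estimator.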