How to create an NLP processing pipeline with Keras - tensorflow

I regularly use scikit-learn pipelines to streamline model processing, and I'm wondering about the easiest way to do something similar with Keras in TensorFlow 2.0.
What I'd like to do is deploy a Keras model as an API endpoint, then submit a piece of text in a numpy array to it and have it tokenized, padded, and predicted. But I don't know the shortest path to do this.
Here's some sample code:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Dense, Flatten
import numpy as np
sample_words = [
    'The sky is blue',
    'The sky delivers us many gifts',
    'Wise men appreciate gifts for what they are, not what they are not',
    'Wherever you go, there you are',
    'Don\'t pass judgment onto others, or you will quickly be judged yourself'
]
y = np.array([1, 0, 1, 1, 0])
tokenizer = Tokenizer(num_words=10)
tokenizer.fit_on_texts(sample_words)
train_sequences = tokenizer.texts_to_sequences(sample_words)
train_sequences = pad_sequences(train_sequences, maxlen=7)
mod = Sequential([
    Embedding(10, 2, input_length=7),
    Flatten(),
    Dense(3, activation='relu'),
    Dense(1, activation='sigmoid')
])
mod.compile(optimizer='adam', loss='binary_crossentropy')
mod.fit(train_sequences, y)
The idea is that if I have a web form and someone submits the words 'The sky is pretty today', I can wrap them in a numpy array, send it to the endpoint (which will be set up on Google Cloud), and have it tokenized, padded, and predicted.
In scikit-learn it would be as simple as pipe = make_pipeline(tokenizer, mod), and then go from there.
I have a feeling there are some solutions involving tf.data.Datasets, but I was hoping Keras had something built in that was more user friendly.

Keras makes this easy in the sense that there is no need to explicitly build a pipeline.
A Keras model uses the TensorFlow backend to create a computation graph, which can loosely be thought of as the equivalent of a scikit-learn pipeline.
Thus your mod is itself equivalent to a pipeline with the operations Embedding -> Flatten -> Dense -> Dense, and mod.compile() generates the TensorFlow computation graph.
Everything then comes together in mod.fit(), where you plug your inputs into the model (i.e. the pipeline) and train on your data.
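To make the pipeline analogy concrete, prediction on new text then only requires repeating the same preprocessing steps before calling the trained model (an illustrative sketch reusing the tokenizer, pad_sequences and mod objects from the question):
new_text = ['The sky is pretty today']
seq = tokenizer.texts_to_sequences(new_text)   # tokenize with the fitted tokenizer
seq = pad_sequences(seq, maxlen=7)             # pad to the training length
mod.predict(seq)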
In order to make tokenization part of your model, the TextVectorization layer can be used.
This layer has basic options for managing text in a Keras model. It transforms a batch of strings (one sample = one string) into either a list of token indices (one sample = 1D tensor of integer token indices) or a dense representation (one sample = 1D tensor of float values representing data about the sample's tokens).
Code snapshot (the layer must be adapted to a corpus before it can map words to indices; in older TF 2.x releases it lives under tf.keras.layers.experimental.preprocessing):
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization
max_features = 5000  # maximum vocabulary size
max_len = 4          # output sequence length
vectorize_layer = TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_len
)
# build the vocabulary from a training corpus first
vectorize_layer.adapt(["foo", "bar", "baz"])
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
model.add(vectorize_layer)
input_data = [["foo qux bar"], ["qux baz"]]
model.predict(input_data)
>>>
array([[2, 1, 4, 0],
       [1, 3, 0, 0]])
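If you fold that layer into the question's model, the deployed endpoint can accept raw strings directly. A minimal sketch, not from the original answer, assuming a TF version where TextVectorization is exposed in tf.keras.layers:
import numpy as np
import tensorflow as tf
from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import TextVectorization, Embedding, Flatten, Dense

texts = np.array([['The sky is blue'], ['Wherever you go, there you are']])
y = np.array([1, 0])

# tokenization and padding now live inside the model
vectorize_layer = TextVectorization(max_tokens=10, output_mode='int',
                                    output_sequence_length=7)
vectorize_layer.adapt(texts.ravel())

mod = Sequential([
    Input(shape=(1,), dtype=tf.string),
    vectorize_layer,
    Embedding(10, 2),
    Flatten(),
    Dense(3, activation='relu'),
    Dense(1, activation='sigmoid')
])
mod.compile(optimizer='adam', loss='binary_crossentropy')
mod.fit(texts, y, epochs=1)

# the served model can now be sent raw text, e.g. from a web form
mod.predict(np.array([['The sky is pretty today']]))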

Related

Using Sparse Tensors as Input for Autoencoders

I have a one-hot-encoded sparse matrix which can't be transformed into a normal (dense) matrix due to its size.
I would like to reduce the dimensions using an autoencoder. Currently I am trying to use Tensorflow and its Keras library for that.
The Tensorflow docs state that sparse tensors exist and that they can be used in Keras (see https://www.tensorflow.org/guide/sparse_tensor).
The problem is that none of the autoencoders I've found on the internet seem to work with sparse tensors.
I have prepared a small code example which stops after the first training epoch with the error message: "Failed to convert elements of SparseTensor to Tensor. Consider casting elements to a supported type.".
My questions are:
Do you have an idea how to improve the code, or ideally an example which I can look up?
If not: do you have other ideas on how to do what I would like to do (e.g. another library, another method, etc.)?
Code Example:
# necessary imports
import tensorflow as tf
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Input, Dense, ActivityRegularization
from tensorflow.keras import backend as K
from tensorflow.keras import regularizers
# example one-hot-encoded matrix with 10 records, each one out of 4 distinct categories
sparse_tensor = tf.sparse.SparseTensor(indices=[[0,3], [1,3], [2,0], [3,1], [4,0], [5,2], [6,2], [7,1], [8,3], [9,1]],
                                       values=[1 for i in range(10)],
                                       dense_shape=[10, 4])
encoder = Sequential([
    Input(shape=(4,), sparse=True),
    Dense(1, activation='relu'),
    ActivityRegularization(l1=1e-3)
])
decoder = Sequential([
    Dense(4, activation='sigmoid', input_shape=(1,)),
])
autoencoder = Sequential([encoder, decoder])
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')
autoencoder.fit(x=sparse_tensor, y=sparse_tensor, epochs=5, batch_size=5, shuffle=True)

How to dilate y_true inside a custom metric in keras/tensorflow?

I am trying to code a custom metric for a U-Net model implemented using keras/tensorflow. In the metric I need to use the OpenCV function cv2.dilate on the ground truth. When I tried to use it, I got an error because y_true is a tensor and cv2.dilate expects a numpy array.
Any idea on how to implement this?
I tried to convert the tensor to a numpy array, but it is not working.
I searched for a TensorFlow implementation of cv2.dilate but couldn't find one.
One possibility, if you are using a simple rectangular kernel in your dilation, is to use tf.nn.max_pool2d as a replacement.
import numpy as np
import tensorflow as tf
import cv2
image = np.random.random((28,28))
kernel_size = 3
# OpenCV dilation works with grayscale image, with H,W dimensions
dilated_cv = cv2.dilate(image, np.ones((kernel_size, kernel_size), np.uint8))
# TensorFlow max pooling works with batch and channels: B,H,W,C dimensions
image_w_batch_and_channels = image[None,...,None]
dilated_tf = tf.nn.max_pool2d(image_w_batch_and_channels, kernel_size, 1, "SAME")
# checking that the results are equal
np.allclose(dilated_cv, dilated_tf[0,...,0])
However, given that you mention that you are applying dilation on the ground truth, this dilation does not need to be differentiable. In that case, you can wrap your dilation in a tf.numpy_function:
from functools import partial
# be sure to pass the correct output type (Tout); tf.float64 works in this specific case because numpy defaults to float64, but it might be different in your case
dilated_tf_npfunc = tf.numpy_function(
    partial(cv2.dilate, kernel=np.ones((kernel_size, kernel_size), np.uint8)),
    [image],
    tf.float64
)
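If the dilation is needed inside a Keras metric, a minimal sketch along these lines could work (the dilated_iou name and the IoU formulation are illustrative, assuming y_true and y_pred are B,H,W,1 float tensors):
def dilated_iou(y_true, y_pred, kernel_size=3):
    # dilate the ground truth with max pooling (equivalent to a rectangular kernel)
    y_true_dilated = tf.nn.max_pool2d(y_true, kernel_size, 1, "SAME")
    # simple IoU between the dilated ground truth and the prediction
    intersection = tf.reduce_sum(y_true_dilated * y_pred, axis=[1, 2, 3])
    union = tf.reduce_sum(y_true_dilated + y_pred, axis=[1, 2, 3]) - intersection
    return tf.reduce_mean(intersection / (union + 1e-7))

# model.compile(optimizer='adam', loss='binary_crossentropy', metrics=[dilated_iou])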

Tensorflow Recommender - ScaNN passing embedding as query

I want to pass a query embedding to ScaNN instead of a model. What data type should I use for this?
My query would look like this: [1, 0.3, 0.4]
My candidate embedding would be something like:
[[0.2, 1, .4],
[0.3,0.1,0.56]]
All the examples I see pass a query model, not the embedding itself.
I tried passing a numpy array, but it didn't work.
Embeddings are just lists of vectors which your model produces, in this case using the tf.keras.layers.Embedding layer.
self._embeddings = {}
# Compute embeddings for string features
for feature_name in str_features:
    vocabulary = vocabularies[feature_name]
    self._embeddings[feature_name] = tf.keras.Sequential([
        tf.keras.layers.StringLookup(
            vocabulary=vocabulary, mask_token=None),
        tf.keras.layers.Embedding(len(vocabulary) + 1,
                                  self.embedding_dimension)
    ])
You can also use another model such as a Sentence Transformer to create embeddings.
from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings)
You do not need to pass the model to ScaNN; you can pass it the embeddings directly, as mentioned in the documentation here.
Here is a sample code snippet on how to pass embeddings directly to ScaNN:
import numpy as np
import pandas as pd
import scann
from sklearn import preprocessing, metrics

df = pd.read_csv("./data/mydata.csv")
# normalization ('l2' norm assumed)
df_np = preprocessing.normalize(df.iloc[:, 1:], norm="l2")
num_neighbors = 100
# creating the searcher
k = int(np.sqrt(df_np.shape[0]))
searcher = scann.scann_ops_pybind.builder(df_np, num_neighbors, "dot_product").tree(
    num_leaves=k,
    num_leaves_to_search=int(k/20),
    training_sample_size=2500).score_brute_force(2).reorder(7).build()
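Once the searcher is built, the query embedding itself can be passed in directly as a numpy array; a small sketch using the vectors from the question (assuming they have the same dimensionality as the candidate embeddings):
query = np.array([1, 0.3, 0.4], dtype=np.float32)
neighbors, distances = searcher.search(query)
# or search with a batch of queries at once
queries = np.array([[0.2, 1, 0.4], [0.3, 0.1, 0.56]], dtype=np.float32)
neighbors, distances = searcher.search_batched(queries)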
Here is a blog post on using ScaNN: ScaNN optimization and configuration

Learning a Categorical Variable with TensorFlow Probability

I would like to use TFP to write a neural network where the outputs are the probabilities of a categorical variable with 3 classes, and train it using the negative log-likelihood.
As I'm moving my first steps with TF and TFP, I started with a toy model where the input layer has only 1 unit receiving a null input, and the output layer has 3 units with softmax activation function. The idea is that the biases should learn (up to an additive constant) the log of the probabilities.
Below is my code; true_p are the true parameters I use to generate the data and would like to learn, while learned_p is what I get from the NN.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from functions import nll
from tensorflow.keras.optimizers import SGD
import tensorflow.keras.layers as layers
import tensorflow_probability as tfp
tfd = tfp.distributions
# params
true_p = np.array([0.1, 0.7, 0.2])
n_train = 1000
# training data
x_train = np.array(np.zeros(n_train)).reshape((n_train,))
y_train = np.array(np.random.choice(len(true_p), size=n_train, p=true_p)).reshape((n_train,))
# model
input_layer = layers.Input(shape=(1,))
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
# training
hist_p = model_p.fit(x=x_train, y=y_train, batch_size=100, epochs=3000, verbose=0)
# check result
learned_p = np.round(model_p.layers[1].call(tf.constant([0], shape=(1, 1))).numpy(), 3)
learned_p
With this setup, I get the result:
>>> learned_p
array([[0.005, 0.989, 0.006]], dtype=float32)
I over-estimate the second category and can't really distinguish between the first and the third one. What's worse, if I plot the probabilities at the end of each epoch, it looks like they are converging monotonically to the vector [0, 1, 0], which doesn't make sense (it seems to me the gradient should push in the opposite direction once I start to over-estimate).
I really can't figure out what's going on here, but have the feeling I'm doing something plain wrong. Any idea? Thank you for your help!
For the record, I also tried using other optimizers like Adam or Adagrad playing with the hyper-params, but with no luck.
I'm using Python 3.7.9, TensorFlow 2.3.1 and TensorFlow probability 0.11.1
I believe the default argument to Categorical is not the vector of probabilities, but the vector of logits (values you'd take softmax of to get probabilities). This is to help maintain precision in internal Categorical computations like log_prob. I think you can simply eliminate the softmax activation function and it should work. Please update if it doesn't!
EDIT: alternatively you can replace the tfd.Categorical with
lambda p: tfd.Categorical(probs=p)
but you'll lose the aforementioned precision gains. Just wanted to clarify that passing probs is an option, just not the default.
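A sketch of both variants applied to the model above (not from the original answer; only the two relevant lines change):
# Option 1: drop the softmax so the Dense layer emits logits,
# which is what tfd.Categorical expects by default
p_layer = layers.Dense(len(true_p))(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)

# Option 2: keep the softmax and pass probabilities explicitly
# (this loses the precision benefit of working in logit space)
# p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
# p_y = tfp.layers.DistributionLambda(lambda p: tfd.Categorical(probs=p))(p_layer)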

Memory error while creating large one hot encoding for lstm

I am trying to build a character-level LSTM model using Keras, and for that I need to create a one-hot encoding of the characters to feed into the model. I have around 1,000 characters in each line, with around 160,000 lines.
I tried to create a numpy array of zeros and set the corresponding entries to 1, but I am getting a memory error due to the large size of the matrix. Is there any other way to do this?
Sure:
Create batches. Only process, say, 10,000 entries (characters) at a time, computing and feeding them into your neural network just before they're needed (say, by using a generator instead of a list). Keras has a fit_generator training function to do this.
Group chunks of data together. Instead of representing a line as a matrix of the one-hot encodings of its characters, use the sum/max of all those columns to produce a single vector for the line. Now each line is only a single vector, with dimensionality equal to the number of unique characters in your data set. E.g., instead of [[0, 0, 1], [0, 1, 0], [0, 0, 1]], use [0, 1, 1] to represent the entire line.
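For the first option, a minimal generator sketch (the names lines, char_to_idx and n_chars are hypothetical stand-ins for your own data and character vocabulary):
import numpy as np

def one_hot_batches(lines, char_to_idx, n_chars, batch_size=32, maxlen=1000):
    # yield one-hot encoded batches lazily so that only batch_size lines
    # are ever materialised in memory at the same time
    while True:
        for start in range(0, len(lines), batch_size):
            batch = lines[start:start + batch_size]
            x = np.zeros((len(batch), maxlen, n_chars), dtype=np.float32)
            for i, line in enumerate(batch):
                for t, ch in enumerate(line[:maxlen]):
                    x[i, t, char_to_idx[ch]] = 1.0
            yield x  # also yield the matching labels for supervised training

# model.fit(one_hot_batches(lines, char_to_idx, n_chars), steps_per_epoch=...)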
Perhaps an easier and more intuitive solution is to add a custom one-hot encoding layer in your Keras model architecture.
def build_model(self, batch_size, print_summary=False):
    X = Input(shape=(self.sequence_length,), batch_size=batch_size)
    embedding = OneHotEncoding(num_classes=self.vocab_size+1,
                               sequence_length=self.sequence_length)(X)
    encoder = Bidirectional(CuDNNLSTM(units=self.recurrent_units,
                                      return_sequences=True))(embedding)
    ...
where we can define the OneHotEncoding layer as follows:
from tensorflow.keras.layers import Lambda
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer  # for creating custom layers

class OneHotEncoding(Layer):
    def __init__(self, num_classes=None, sequence_length=None):
        if num_classes is None or sequence_length is None:
            raise ValueError("Can't leave params #num_classes or #sequence_length empty")
        super(OneHotEncoding, self).__init__()
        self.num_classes = num_classes
        self.sequence_length = sequence_length

    def encode(self, inputs):
        return K.one_hot(indices=inputs,
                         num_classes=self.num_classes)

    def call(self, inputs):
        return Lambda(function=self.encode,
                      input_shape=(self.sequence_length,))(inputs)
Here we are utilizing the fact that the Keras model is fed the training samples in appropriate batch sizes (with the standard fit function), so the full one-hot matrix is never materialized at once and no MemoryError is raised.