How to package string labels into a SavedModel - TensorFlow

I have string labels such as "cat" and "dog". Can I feed string labels directly to deep learning models in TensorFlow and get string labels back as predictions? I am looking for the equivalent of sklearn's LabelEncoder (from sklearn.preprocessing import LabelEncoder).
If this is not possible, is there a way to pack the labels into the SavedModel protobuf file and retrieve them by index at serving time? I am using the Estimator's export_savedmodel API. Is assets_extra the right way? The approach at https://github.com/tensorflow/serving/issues/55 does not use the SavedModel format.

The typical way to handle label data in deep learning is to embed the labels in a vector space. Language models do it routinely with word embeddings. TensorFlow provides embedding lookup operations that you can use for your purposes.
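For illustration, a minimal TF 1.x-style sketch of the two pieces together, mapping string labels to integer ids in-graph and then embedding them (the label list, embedding size, and variable name are placeholders, not from the question):
import tensorflow as tf

# Hypothetical label set, for illustration only.
labels = tf.constant(["cat", "dog", "bird"])

# Map string labels to integer ids inside the graph.
table = tf.contrib.lookup.index_table_from_tensor(labels)
label_ids = table.lookup(tf.constant(["dog", "cat"]))

# Embed the integer ids into a dense, trainable vector space.
embedding_matrix = tf.get_variable("label_embeddings", shape=[3, 8])
label_vectors = tf.nn.embedding_lookup(embedding_matrix, label_ids)

# The reverse mapping (index -> string) can be baked into the serving graph
# with tf.contrib.lookup.index_to_string_table_from_tensor(labels).
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(label_vectors))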

Related

Can I feed categorical data in Keras embedding layer without encoding the data?

I am trying to feed multi-column categorical data into a Keras embedding layer. Can I feed categorical data into a Keras embedding layer without encoding it?
If not, which encoding method is preferable for retrieving contextual information from the categorical data?
No, you cannot feed categorical data into a Keras embedding layer without encoding the data.
There are a couple of ways to encode the data:
Integer Encoding: Where each unique label is mapped to an integer.
One Hot Encoding: Where each label is mapped to a binary vector.
Learned Embedding: Where a distributed representation of the categories is learned.
The preferred method for retrieving contextual information from categorical data is the learned embedding method (a small sketch follows below).
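For example, a minimal sketch of integer encoding followed by a learned embedding in Keras (the column values, embedding size, and tiny model are illustrative assumptions, not from the question):
import numpy as np
import tensorflow as tf

# Hypothetical categorical column, for illustration only.
colors = np.array(["red", "green", "blue", "green"])

# Integer encoding: map each unique label to an integer id.
vocab = {label: idx for idx, label in enumerate(np.unique(colors))}
color_ids = np.array([[vocab[c]] for c in colors])  # shape (4, 1)

# Learned embedding: the Embedding layer turns each id into a trainable vector.
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=4, input_length=1),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
print(model.predict(color_ids).shape)  # (4, 1)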
Alternatively, you could use any of the pretrained embeddings below:
GloVe embeddings (https://nlp.stanford.edu/projects/glove/)
Word2Vec.
ConceptNet (https://github.com/commonsense/conceptnet-numberbatch)
ELMo embeddings (https://github.com/yuanxiaosc/ELMo)
ELMo embeddings code usage example:
import tensorflow_hub as hub
import tensorflow as tf
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=True)
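If it helps, a rough usage sketch on top of that module (assuming the module's default text signature and a TF 1.x session; the sentence is just a placeholder):
# Run the module on raw sentences; the "elmo" output holds per-token embeddings.
embeddings = elmo(["the cat is on the mat"], signature="default", as_dict=True)["elmo"]
with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(embeddings).shape)  # roughly (1, num_tokens, 1024)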

Image data augmentation in TF 2.0 -- rotation

I am training a TensorFlow model with multiple images as input and a segmentation mask as output. I want to perform random rotation augmentation in my dataset pipeline.
I have a list of parallel image file names (input and output files aligned), which I convert into a tf.data Dataset using tf.data.Dataset.from_generator, and then use Dataset.map to load the images with tf.io.read_file and tf.image.decode_png.
How can I perform random rotations on the input-output image pairs? I tried the random_transform function of the ImageDataGenerator class, but it expects NumPy data as input and does not work on Tensors (and since TensorFlow does not run the data pipeline eagerly, I cannot convert them to NumPy either). I suppose I can use tf.numpy_function, but I expect there should be a simpler solution for this simple problem.
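One possible route, sketched along the lines the question already mentions (tf.numpy_function wrapping a SciPy rotation; the file paths, angle range, and decoding details are placeholders, not from the question):
import numpy as np
import scipy.ndimage
import tensorflow as tf

def random_rotate_pair(image, mask):
    # Draw one angle so the image and its mask stay aligned.
    angle = np.random.uniform(-30.0, 30.0)
    image = scipy.ndimage.rotate(image, angle, reshape=False, order=1)
    mask = scipy.ndimage.rotate(mask, angle, reshape=False, order=0)
    return image.astype(np.float32), mask.astype(np.float32)

def load_and_augment(image_path, mask_path):
    image = tf.image.decode_png(tf.io.read_file(image_path))
    mask = tf.image.decode_png(tf.io.read_file(mask_path))
    image, mask = tf.numpy_function(
        random_rotate_pair, [image, mask], [tf.float32, tf.float32])
    return image, mask

# Placeholder file lists; the question builds these with from_generator instead.
ds = tf.data.Dataset.from_tensor_slices((["img_0.png"], ["mask_0.png"]))
ds = ds.map(load_and_augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)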

Image feature extraction with TF2 Keras API and TF Dataset

How should I use a tf Dataset in order to run model.predict(data) and still have access to the other features of the tf Dataset?
For example, my tf dataset has this format:
(tensor<(100,224,224,3)>, tensor<(100,)>) -> preprocessed images as tf.float32, uuids of the images as tf.string
If I extract the feature vector like this:
for image_data, uuids in ds.batch(100):
    features = model.predict(image_data)  # -> I get an array of features
At this moment, features is an array of shape (100, 2048) and uuids is a tensor of shape (100,) with dtype tf.string.
How can I combine them in order to write the feature vectors to disk?
From my understanding, I need to have both of them in the same format: either both as tensors, so I can continue using TF code and save the feature vectors as a TFRecord, or get the uuids as strings from the uuid tensor, so I can use Python code and save the array to a file using numpy.tofile.
So my questions are:
- How can I make the features a tensor?
- Or can I get the string values from the uuid tensor?
- Does anything sound wrong in what I am trying to do? Is there a more optimal way to create the input pipeline? Or did I misunderstand the usage of the Keras API and tf Dataset?
If I use a plain Python pipeline I can successfully save the array to a file. But I would like to use tf Dataset because I think it will be faster and more optimized thanks to its parallel map function, batching, and autotuning of the parallel calls.
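For reference, a rough sketch of one way to pair the two in eager TF 2.x, decoding the uuid tensor to Python strings and writing each feature vector with NumPy (the model, dataset, and file naming are assumed placeholders):
import numpy as np

for image_data, uuids in ds.batch(100):
    features = model.predict(image_data)              # np.ndarray of shape (100, 2048)
    ids = [u.decode("utf-8") for u in uuids.numpy()]  # tf.string tensor -> Python strings
    for uid, vec in zip(ids, features):
        # Placeholder persistence: one raw binary file per uuid.
        vec.astype(np.float32).tofile(uid + ".bin")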

Using keras.layers.Add() in a keras.sequential model

Using TF 2.0 and TensorFlow Probability (tfp) layers, I have constructed a keras.Sequential model. I would like to export it for serving with TensorFlow Serving, and I would like to include the preprocessing and post-processing steps in the servable.
My preprocessing steps are fairly simple: filling NAs with explicit values, encoding a few strings as floats, normalizing inputs, and denormalizing outputs. For training, I have been doing the pre/post processing with pandas and NumPy.
I know that I can export my Keras model's weights, wrap the keras.Sequential model's architecture in a bigger TensorFlow graph, use low-level ops like tf.math.subtract(inputs, vector_of_feature_means) for the pre/post processing operations, define tf.placeholders for my inputs and outputs, and make a servable, but I feel like there has to be a cleaner way of doing this.
Is it possible to use keras.layers.Add() and keras.layers.Multiply() in a keras.Sequential model for explicit preprocessing steps, or is there some more standard way of doing these things?
The standard and efficient way of doing these things, as per my understanding, is to use TensorFlow Transform. This does not mean we need the entire TFX pipeline in order to use TF Transform; TF Transform can be used standalone as well.
TensorFlow Transform creates a Beam transformation graph and injects these transformations as constants in the TensorFlow graph. Because the transformations are represented as constants in the graph, they are consistent across training and serving. The advantages of that consistency are:
It eliminates training-serving skew.
It eliminates the need for preprocessing code in the serving system, which improves latency.
Sample Code for TF Transform is mentioned below:
Code for Importing all the Dependencies:
try:
  import tensorflow_transform as tft
  import apache_beam as beam
except ImportError:
  print('Installing TensorFlow Transform. This will take a minute, ignore the warnings')
  !pip install -q tensorflow_transform
  print('Installing Apache Beam. This will take a minute, ignore the warnings')
  !pip install -q apache_beam
  import tensorflow_transform as tft
  import apache_beam as beam

import tensorflow as tf
import tensorflow_transform.beam as tft_beam
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema
Below is the preprocessing function where we specify all the transformations. As of now, TF Transform doesn't provide a direct API for missing-value imputation, so only for that do we have to write our own code using low-level APIs.
def preprocessing_fn(inputs):
  """Preprocess input columns into transformed columns."""
  # Since we are modifying some features and leaving others unchanged, we
  # start by setting `outputs` to a copy of `inputs`.
  outputs = inputs.copy()

  # Scale numeric columns to have range [0, 1].
  for key in NUMERIC_FEATURE_KEYS:
    outputs[key] = tft.scale_to_0_1(outputs[key])

  for key in OPTIONAL_NUMERIC_FEATURE_KEYS:
    # This is a SparseTensor because it is optional. Here we fill in a default
    # value when it is missing.
    dense = tf.sparse_to_dense(outputs[key].indices,
                               [outputs[key].dense_shape[0], 1],
                               outputs[key].values, default_value=0.)
    # Reshaping from a batch of vectors of size 1 to a batch of scalars.
    dense = tf.squeeze(dense, axis=1)
    outputs[key] = tft.scale_to_0_1(dense)

  # For all categorical columns except the label column, we generate a
  # vocabulary but do not modify the feature. This vocabulary is instead
  # used in the trainer, by means of a feature column, to convert the feature
  # from a string to an integer id.
  for key in CATEGORICAL_FEATURE_KEYS:
    tft.vocabulary(inputs[key], vocab_filename=key)

  # For the label column we provide the mapping from string to index.
  table = tf.contrib.lookup.index_table_from_tensor(['>50K', '<=50K'])
  outputs[LABEL_KEY] = table.lookup(outputs[LABEL_KEY])

  return outputs
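To connect this to a runnable pipeline, a rough sketch of the standalone Beam step might look like the following (assuming raw_data and raw_data_metadata have already been built, as in the linked census tutorial):
import tempfile

# Analyze the data with Beam and apply preprocessing_fn; the analyzed
# quantities (min/max ranges, vocabularies) end up as constants in transform_fn.
with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
  transformed_dataset, transform_fn = (
      (raw_data, raw_data_metadata)
      | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))

transformed_data, transformed_metadata = transformed_dataset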
You can refer to the links below for detailed information and for the TF Transform tutorial:
https://www.tensorflow.org/tfx/transform/get_started
https://www.tensorflow.org/tfx/tutorials/transform/census

TensorFlow input pipeline for deployment on CloudML

I'm relatively new to TensorFlow and I'm having trouble modifying some of the examples to use batch/stream processing with input functions. More specifically, what is the 'best' way to modify this script to make it suitable for training and serving deployment on Google Cloud ML?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification.py
Something akin to this example:
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator/trainer
I can package it up and train it in the cloud, but I can't figure out how to apply even the simple vocab_processor transformations to an input tensor. I know how to do it with pandas, but there I can't apply the transformation to batches (using the chunk_size parameter). I would be very happy if I could reuse my pandas preprocessing pipelines in TensorFlow.
I think you have 3 options:
1) You cannot reuse pandas preprocessing pipelines in TF. However, you could start TF with the output of your pandas preprocessing. So you could build a vocab and convert the text words to integers, and save a new preprocessed dataset to disk. Then read the integer data (which is encoding your text) in TF to do training.
2) You could build a vocab outside of TF in pandas. Then inside TF, after reading the words, you can make a table to map the text to integers (see the sketch after this list). But if you are going to build a vocab outside of TF, you might as well do the transformation at the same time outside of TF, which is option 1.
3) Use tensorflow_transform. You can call tft.string_to_int() on the text column to automatically build the vocab and convert it to integers. The output of tensorflow_transform is preprocessed data in tf.Example format, so training can then start from the tf.Example files. This is again option 1, but with tf.Example files. If you want to run prediction on raw text data, this option allows you to export a graph that has the same text preprocessing built in, so you don't have to manage the preprocessing step at prediction time. However, this option is the most complicated, as it introduces two additional ideas: tf.Example files and Beam pipelines.
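As a rough illustration of option 2 (referenced above), assuming the pandas step wrote a one-word-per-line vocab.txt, the in-graph lookup table (TF 1.x) could look like this; the file name and sample words are placeholders:
import tensorflow as tf

# Hypothetical vocabulary file produced by the pandas preprocessing;
# unknown words fall into an out-of-vocabulary bucket.
table = tf.contrib.lookup.index_table_from_file(
    vocabulary_file='vocab.txt', num_oov_buckets=1)

words = tf.constant([['this', 'is', 'a', 'test']])
word_ids = table.lookup(words)

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(word_ids))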
For examples of tensorflow_transform see https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft
and
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/reddit_tft