TensorFlow Estimator exporter when training data is from TFRecords and inference is from raw data

Background
0) I'm working on an NLP model that I would like to export.
1) I have training data in the form of TFRecords.
2) I would like to export my model and host it in a Flask app, so the data that comes in is raw text.
3) I handle all my pre-processing (tokenization and such) as part of my TensorFlow graph.
Question
1) Given that I do the data loading (tf.data.Dataset creation and pre-processing) as part of the TensorFlow graph, would the raw text that comes in break the process (specifically at the tf.data.Dataset creation step)?
2) Would it make more sense to just load in raw text instead of tf.data.Dataset data?

Never mind: I completely forgot that you have an input_fn that you feed datasets through during training, and a separate serving_input_receiver_fn that accepts raw data at serving time.
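For anyone who lands here, a minimal sketch of that split (TF 1.x Estimator API; parse_and_preprocess and preprocess_raw_text are hypothetical stand-ins for your own parsing and tokenization ops):

import tensorflow as tf

def train_input_fn():
    # Training: read serialized examples from TFRecords through tf.data.
    dataset = tf.data.TFRecordDataset(['train.tfrecord'])  # hypothetical path
    dataset = dataset.map(parse_and_preprocess)  # hypothetical parse + tokenize fn
    return dataset.shuffle(1000).batch(32).repeat()

def serving_input_receiver_fn():
    # Serving: accept raw text directly; no tf.data pipeline is involved.
    text = tf.placeholder(dtype=tf.string, shape=[None], name='text')
    features = preprocess_raw_text(text)  # hypothetical: the same tokenization ops
    return tf.estimator.export.ServingInputReceiver(
        features=features, receiver_tensors={'text': text})

# estimator.train(input_fn=train_input_fn, steps=1000)
# estimator.export_savedmodel('export_dir', serving_input_receiver_fn)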

Related

How to load large datasets of NumPy arrays in order to train a CNN model in TensorFlow 2.1.0

I'm training a convolutional neural network (CNN) model for a binary classification task in TensorFlow 2.1.0.
The feature of each instance is a 4-dimensional NumPy array with shape (50, 50, 50, 2), in which each element is a float32.
The label of each instance is 1 or 0.
My largest training dataset can contain up to ~100 million instances.
To train the model efficiently, is it best to serialize my training data and store it in a set of files in TFRecord format, then load them with tf.data.TFRecordDataset() and parse them with the dataset's map() method?
If so, could you show me an example of how to serialize the feature-label pairs and store them in TFRecord files, and then how to load and parse them?
I did not find an appropriate example on the TensorFlow website.
Or is there a better way to store and load huge datasets? Thanks very much.
There are many ways to build an efficient data pipeline without TFRecords; click this link, it was very useful.
To extract images from a directory efficiently, click this link.
Hope this helps.
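For reference, here is a minimal sketch of the serialize/parse round trip the question asks about (file names, shard layout, and batch sizes are illustrative, and some_instances stands in for your own iterable of (feature, label) pairs):

import tensorflow as tf

def serialize_example(feature, label):
    # feature: float32 NumPy array of shape (50, 50, 50, 2); label: 0 or 1.
    example = tf.train.Example(features=tf.train.Features(feature={
        'feature': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[feature.tobytes()])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }))
    return example.SerializeToString()

# Writing: in practice you would split ~100M instances across many shards.
with tf.io.TFRecordWriter('shard-0.tfrecord') as writer:
    for feature, label in some_instances:  # hypothetical iterable of pairs
        writer.write(serialize_example(feature, label))

def parse_fn(serialized):
    parsed = tf.io.parse_single_example(serialized, {
        'feature': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    })
    feature = tf.reshape(
        tf.io.decode_raw(parsed['feature'], tf.float32), (50, 50, 50, 2))
    return feature, parsed['label']

dataset = (tf.data.TFRecordDataset(['shard-0.tfrecord'])
           .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
           .shuffle(1024).batch(32).prefetch(tf.data.experimental.AUTOTUNE))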

TensorFlow Keras GCP ML Engine model serving

I'm working on an image classifier with TensorFlow Estimator + Keras, retraining the last layer of the pretrained inception_v3 application on GCP ML Engine.
The Keras model is exported with tf.keras.estimator.model_to_estimator. The input function receives the path of an image stored on GCP Cloud Storage, opens the image with tf.image.decode_jpeg, and returns a dataset in the format dict(zip(['inception_v3_input'], [image])), label.
I'm trying to define the tf.estimator.export.ServingInputReceiver but I'm having some trouble with it.
The model serves predictions correctly with the predict method, using the input function without the labels.
My idea was to reuse the input function to decode the image, passing only the path of the image on Cloud Storage to the prediction, also for the Google endpoint, but I can't figure out how to do it.
Thanks for your help.
If I'm understanding correctly, your question is how to get the file from Cloud Storage, considering that you want to decode the image this way:
image_decoded = tf.image.decode_jpeg(image_string)
So, in this case, you can use:
image_string = file_io.FileIO(filename, mode='rb').read()  # read the raw bytes, not a file object
By importing file_io first:
from tensorflow.python.lib.io import file_io
According to the comments on this question about reading input data from GCS, using the read_file function should provide the same result, since "there was a bunch of work done to abstract file io and file systems, so all the io functionality works consistently". So you can also try the tf.read_file function.
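Putting the pieces together, a sketch of a ServingInputReceiver that accepts only the image path and reuses that decoding logic (the 299x299 size matches inception_v3; the pixel scaling and the 'inception_v3_input' feature name are assumptions based on the input format described in the question):

import tensorflow as tf

def serving_input_receiver_fn():
    # The endpoint receives a batch of gs:// image paths as strings.
    image_path = tf.placeholder(dtype=tf.string, shape=[None], name='image_path')

    def _decode(path):
        image_string = tf.read_file(path)  # works with gs:// paths on ML Engine
        image = tf.image.decode_jpeg(image_string, channels=3)
        image = tf.image.resize_images(image, [299, 299])
        return tf.cast(image, tf.float32) / 255.0  # assumed scaling

    images = tf.map_fn(_decode, image_path, dtype=tf.float32)
    return tf.estimator.export.ServingInputReceiver(
        features={'inception_v3_input': images},
        receiver_tensors={'image_path': image_path})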

Data augmentation in Tensorflow using Estimator API and TFRecords dataset

I'm using TensorFlow 1.3's Estimator API to perform some image classification. Since I have a considerable amount of data, I gave TFRecords a go. I saved the file and can read the examples into a Dataset using a parser function inside the input_fn of the estimator model. So far so good.
The issue is when I want to do some image augmentation (rotating and shearing in this case).
1) I tried using tf.contrib.keras.preprocessing.image.random_shear and the like. It turns out Keras doesn't like the format of TF's shape ('Dimension'), and I can't cast it to a list because its arguments are the axis indexes, not the actual values.
2) Then I tried using tf.contrib.image.rotate and tf.contrib.image.transform with random values in my chosen range. This time I get the error NotFoundError: Op type not registered 'ImageProjectiveTransform' in binary running on MYPC. Make sure the Op and Kernel are registered in the binary running in this process., which is an open issue (https://github.com/tensorflow/tensorflow/issues/9672). At the moment I can't move away from Windows, so I would be very interested in possible alternatives.
3) I searched for a way to read the TFRecords, transform them to NumPy arrays, and do the augmentation with other tools, but I can't find a way to do that from within the input_fn, where I have no access to the session.
Thanks!
Have you tried using the function from the answer to this question: tensorflow: how to rotate an image for data augmentation?
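If arbitrary angles are not strictly required, one alternative is to avoid tf.contrib.image entirely (and with it the Windows issue) by composing core tf.image ops inside the parser. A sketch, with a hypothetical feature spec:

import tensorflow as tf

def augment(image):
    # Random rotation by 0, 90, 180, or 270 degrees; rot90 accepts a
    # scalar tensor k in recent 1.x releases.
    k = tf.random_uniform([], minval=0, maxval=4, dtype=tf.int32)
    image = tf.image.rot90(image, k=k)
    image = tf.image.random_flip_left_right(image)
    return image

def parse_fn(serialized):
    # Hypothetical feature spec; adjust to your TFRecord layout.
    features = tf.parse_single_example(serialized, {
        'image': tf.FixedLenFeature([], tf.string),
        'label': tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features['image'], channels=3)
    return augment(image), features['label']

Shearing and arbitrary-angle rotation still need tf.contrib.image (or a tf.py_func fallback into scipy/PIL), so this only covers the rotation part.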

TensorFlow input pipeline for deployment on CloudML

I'm relatively new to TensorFlow and I'm having trouble modifying some of the examples to use batch/stream processing with input functions. More specifically, what is the 'best' way to modify this script to make it suitable for training and serving deployment on Google Cloud ML?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification.py
Something akin to this example:
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator/trainer
I can package it up and train it in the cloud, but I can't figure out how to apply even the simple vocab_processor transformations to an input tensor. I know how to do it with pandas, but there I can't apply the transformation to batches (using the chunk_size parameter). I would be very happy if I could reuse my pandas preprocessing pipelines in TensorFlow.
I think you have 3 options:
1) You cannot reuse pandas preprocessing pipelines in TF. However, you could start TF with the output of your pandas preprocessing. So you could build a vocab and convert the text words to integers, and save a new preprocessed dataset to disk. Then read the integer data (which is encoding your text) in TF to do training.
2) You could build a vocab outside of TF in pandas. Then inside TF, after reading the words, you can make a table to map the text to integers (see the sketch after the example links below). But if you are going to build a vocab outside of TF, you might as well do the transformation at the same time outside of TF, which is option 1.
3) Use tensorflow_transform. You can call tft.string_to_int() on the text column to automatically build the vocab and convert to integers. The output of tensorflow_transform is preprocessed data in tf.example format. Then training can start from the tf.example files. This is again option 1 but with tf.example files. If you want to run prediction on raw text data, this option allows you to make an exported graph that has the same text preprocessing built in, so you don't have to manage the preprocessing step at prediction time. However, this option is the most complicated as it introduces two additional ideas: tf.example files and beam pipelines.
For examples of tensorflow_transform see https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft
and
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/reddit_tft
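As an illustration of option (2), a minimal sketch of an in-graph lookup table, assuming the vocab was built offline (e.g. with pandas) and written one token per line to a hypothetical vocab.txt:

import tensorflow as tf

words = tf.placeholder(tf.string, shape=[None])  # e.g. tokens from tf.string_split
table = tf.contrib.lookup.index_table_from_file(
    vocabulary_file='vocab.txt', num_oov_buckets=1)  # OOV words get a bucket id
ids = table.lookup(words)

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(ids, feed_dict={words: ['hello', 'world']}))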

Tensorflow Transfer Learning with Input Pipeline

I want to use transfer learning with Google's Inception network for an image recognition problem. I am using retrain.py from the TensorFlow example source for inspiration.
In retrain.py, the Inception graph is loaded and a feed dict is used to feed the new images into the model's input layer. However, I have my data serialized in TFRecord files and have been using an input pipeline to feed in my inputs, as demonstrated here.
So I have a tensor images which returns my input data in batches when run. But how can I feed these images into Inception? I can't use a feed dict since my inputs are tensors, not NumPy arrays. My two ideas are:
1) simply call sess.run() on each batch to convert it to a NumPy array, and then use a feed dict to pass it to Inception.
2) replace the input node in the Inception graph with my own batch input tensor
I think (1) would work, but it seems a little inelegant. (2) seems more natural to me, but I can't do exactly that because TensorFlow graphs can only be appended to and not otherwise modified.
Is there a better approach?
You can implement option (2), replacing the input node, but you will need to modify retrain.py to do so. The tf.import_graph_def() function supports a limited form of modification to the imported graph, by remapping tensors in the imported graph to existing tensors in the target graph.
This line in retrain.py calls tf.import_graph_def() to import the Inception model, where jpeg_data_tensor becomes the tensor that you feed with input data:
bottleneck_tensor, jpeg_data_tensor, resized_input_tensor = (
    tf.import_graph_def(graph_def, name='', return_elements=[
        BOTTLENECK_TENSOR_NAME, JPEG_DATA_TENSOR_NAME,
        RESIZED_INPUT_TENSOR_NAME]))
Instead of retrieving jpeg_data_tensor from the imported graph, you can remap it to an input pipeline that you construct yourself:
# Output of a training pipeline, returning a `tf.string` tensor containing
# a JPEG-encoded image.
jpeg_data_tensor = ...

bottleneck_tensor, resized_input_tensor = (
    tf.import_graph_def(
        graph_def,
        input_map={JPEG_DATA_TENSOR_NAME: jpeg_data_tensor},
        return_elements=[BOTTLENECK_TENSOR_NAME, RESIZED_INPUT_TENSOR_NAME]))
Wherever you previously fed jpeg_data_tensor, you no longer need to feed it, because the inputs will be read from the input pipeline you constructed. (Note that you might need to handle resized_input_tensor as well... I'm not intimately familiar with retrain.py, so some restructuring might be necessary.)
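For completeness, a sketch of the variant that note hints at: if your pipeline already yields decoded, resized image batches, you could remap the resized-input tensor instead of the JPEG data tensor (hypothetical shapes; the same caveat about restructuring retrain.py applies):

# Hypothetical: a pipeline yielding float32 batches already sized for Inception.
image_batch = ...  # shape [batch, 299, 299, 3]

bottleneck_tensor, = tf.import_graph_def(
    graph_def,
    input_map={RESIZED_INPUT_TENSOR_NAME: image_batch},
    return_elements=[BOTTLENECK_TENSOR_NAME])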