TensorFlow input pipeline for deployment on CloudML - pandas

I'm relatively new to TensorFlow and I'm having trouble modifying some of the examples to use batch/stream processing with input functions. More specifically, what is the 'best' way to modify this script to make it suitable for training and serving deployment on Google Cloud ML?
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/learn/text_classification.py
Something akin to this example:
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/census/estimator/trainer
I can package it up and train it in the cloud, but I can't figure out how to apply even the simple vocab_processor transformations to an input tensor. I know how to do the preprocessing with pandas, but there I can't apply the transformation to batches (using the chunksize parameter). I would be very happy if I could reuse my pandas preprocessing pipelines in TensorFlow.

I think you have 3 options:
1) You cannot reuse pandas preprocessing pipelines directly in TF. However, you could start TF from the output of your pandas preprocessing: build a vocab and convert the text words to integers in pandas, and save the new preprocessed dataset to disk. Then read the integer data (which encodes your text) in TF to do training.
2) You could build a vocab outside of TF in pandas. Then inside TF, after reading the words, you can make a lookup table to map the text to integers (see the sketch after the links below). But if you are going to build a vocab outside of TF, you might as well do the transformation at the same time outside of TF, which is option 1.
3) Use tensorflow_transform. You can call tft.string_to_int() on the text column to automatically build the vocab and convert the text to integers. The output of tensorflow_transform is preprocessed data in tf.Example format, so training can start from the tf.Example files. This is again option 1, but with tf.Example files. If you want to run prediction on raw text data, this option lets you export a graph that has the same text preprocessing built in, so you don't have to manage the preprocessing step at prediction time. However, this option is the most complicated, as it introduces two additional concepts: tf.Example files and Beam pipelines.
For examples of tensorflow_transform see https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/criteo_tft
and
https://github.com/GoogleCloudPlatform/cloudml-samples/tree/master/reddit_tft
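For option 2, a minimal sketch of the in-graph lookup, assuming TF 1.x ('vocab.txt' is a hypothetical file written by your pandas preprocessing, one word per line):
import tensorflow as tf

# Map raw words to integer ids inside the TF graph, using a vocabulary
# that was built outside of TF. 'vocab.txt' is a hypothetical file name.
table = tf.contrib.lookup.index_table_from_file(
    vocabulary_file='vocab.txt', num_oov_buckets=1)  # unknown words get one extra bucket
words = tf.constant([['some', 'example', 'words']])
ids = table.lookup(words)

with tf.Session() as sess:
    sess.run(tf.tables_initializer())
    print(sess.run(ids))

And for option 3, a sketch of the tensorflow_transform preprocessing function ('text' and 'label' are placeholder feature names from your schema):
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # tft.string_to_int builds the vocab and maps each word to its id.
    return {
        'text_ids': tft.string_to_int(inputs['text']),
        'label': inputs['label'],
    }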

Related

Image feature extraction with TF2 Keras API and TF Dataset

How should I use a tf Dataset in order to run model.predict(data) and still have access to the other features of the tf Dataset?
For example, my tf dataset has this format:
(tensor<(100,224,224,3)>, tensor<(100,)>) -> preprocessed images as tf.float32, uuids of the images as tf.string
If I extract the feature vector like this:
for image_data, uuids in ds.batch(100):
    features = model.predict(image_data)  # -> I get an array of features
At this point, features is an array of shape (100, 2048) and uuids is a tf.string tensor of shape (100,).
How can I combine them in order to write the feature vectors to disk?
From my understanding, I need to have both of them in the same format: either both as tensors, so I can keep using TF code and save each feature vector as a TFRecord, or get the uuid as a string from the uuid tensor, so I can use plain Python code and save the array to a file using numpy.tofile.
So my questions are:
- How can I make the features a tensor?
- Or can I get the string value from the uuid tensor?
- Does anything sound wrong in what I'm trying to do? Is there a more optimal way to create the input pipeline? Or did I misunderstand the usage of the Keras API and tf Dataset?
If I use a pure Python pipeline I can successfully save the array to a file. But I would like to use tf Dataset because I think it's going to be faster and more optimized, due to its parallel map function, batching, and autotuning of the parallel calls.
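One way to bridge the two formats, sketched under the assumption of TF2 eager execution (model, ds, and the output layout are placeholders from the question):
import numpy as np

# Minimal sketch: in eager mode, iterating the dataset yields EagerTensors,
# so the uuid strings can be pulled out with .numpy() and decoded.
for image_data, uuids in ds:
    features = model.predict(image_data)              # numpy array, e.g. (100, 2048)
    ids = [u.decode('utf-8') for u in uuids.numpy()]  # tf.string -> Python str
    for uid, vec in zip(ids, features):
        vec.tofile('features/' + uid + '.bin')        # hypothetical output path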

Hyperparameter tuning for TensorFlow with hyper-engine

I found the hyper-engine Python tool at
https://github.com/maxim5/hyper-engine.
The examples only use MNIST:
https://github.com/maxim5/hyper-engine/tree/master/hyperengine/examples.
How can I feed my own data, as in the example below:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/notebooks/5_DataManagement/build_an_image_dataset.ipynb
HyperEngine supports custom data providers; the closest example is this one: it generates word pairs from text, not images, but the API is more or less clear. Basically, you only need to implement the next_batch method:
def next_batch(self, batch_size):
    pass
So if you want to train your network on a set of images on disk, you simply need to write an iterator over the files and yield numpy arrays when the next batch is requested.
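A minimal sketch of such a provider, assuming JPEG images in a directory and ignoring labels (the class name and path pattern are hypothetical):
import glob
import numpy as np
from PIL import Image

# HyperEngine only needs a next_batch(batch_size) method returning numpy arrays.
class ImageProvider:
    def __init__(self, image_dir, size=(64, 64)):
        self.paths = sorted(glob.glob(image_dir + '/*.jpg'))
        self.size = size
        self.cursor = 0

    def next_batch(self, batch_size):
        batch = []
        for _ in range(batch_size):
            path = self.paths[self.cursor % len(self.paths)]  # wrap around at the end
            self.cursor += 1
            image = Image.open(path).resize(self.size)
            batch.append(np.asarray(image, dtype=np.float32) / 255.0)
        return np.stack(batch)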
But there is a but: currently, HyperEngine accepts only numpy arrays from next_batch. The example you refer to works with the TF queue API, and its read_images function produces tensors, so you can't simply copy that code. Hopefully, there will be better support for the various TensorFlow APIs (estimators, the Dataset API, queues, etc.) in the future.

Data augmentation in Tensorflow using Estimator API and TFRecords dataset

I'm using TensorFlow 1.3's Estimator API to perform some image classification. Since I have a considerable amount of data, I gave TFRecords a go. I saved the file and can read the examples into a Dataset using a parser function inside the input_fn of the estimator model. So far so good.
The issue is when I want to do some image augmentation (rotating and shearing in this case).
1) I tried using tf.contrib.keras.preprocessing.image.random_shear and the like. It turns out Keras doesn't like the format of TF's shape ('Dimension'), and I can't cast it to a list because its arguments are the axis indexes, not the actual values.
2) Then I tried using tf.contrib.image.rotate and tf.contrib.image.transform with random values in my chosen range. This time I get the error NotFoundError: Op type not registered 'ImageProjectiveTransform' in binary running on MYPC. Make sure the Op and Kernel are registered in the binary running in this process., which is an open issue (https://github.com/tensorflow/tensorflow/issues/9672). At the moment I can't move away from Windows, so I would be very interested in possible alternatives.
3) I searched for a way to read the TFRecords, convert them to numpy arrays, and do the augmentation with other tools, but I can't find a way to do that from within the input_fn, where I don't have access to the session.
Thanks!
Have you tried using the function from the answer to this question: tensorflow: how to rotate an image for data augmentation?
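For reference, a minimal sketch of that kind of workaround: wrapping a scipy rotation in tf.py_func so it runs inside the input_fn without the tf.contrib.image ops (the angle range is arbitrary):
import numpy as np
import scipy.ndimage
import tensorflow as tf

# Rotate by a random angle using scipy, wrapped in tf.py_func so it can be
# applied per-example inside the Estimator's input_fn; this avoids the
# unregistered ImageProjectiveTransform op entirely.
def random_rotate(image):
    def _rotate(img):
        angle = np.random.uniform(-15.0, 15.0)  # degrees; range is arbitrary
        return scipy.ndimage.rotate(img, angle, reshape=False).astype(np.float32)
    rotated = tf.py_func(_rotate, [image], tf.float32)
    rotated.set_shape(image.get_shape())  # py_func drops static shape info
    return rotated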

Tensorflow Transfer Learning with Input Pipeline

I want to use transfer learning with Google's Inception network for an image recognition problem. I am using retrain.py from the TensorFlow example source for inspiration.
In retrain.py, the Inception graph is loaded and a feed dict is used to feed the new images into the model's input layer. However, I have my data serialized in TFRecord files and have been using an input pipeline to feed in my inputs, as demonstrated here.
So I have a tensor, images, which returns my input data in batches when run. But how can I feed these images into Inception? I can't use a feed dict, since my inputs are tensors, not NumPy arrays. My two ideas are:
1) simply call sess.run() on each batch to convert it to a NumPy array, and then use a feed dict to pass it to Inception.
2) replace the input node in the Inception graph with my own batch input tensor
I think (1) would work, but it seems a little inelegant. (2) seems more natural to me, but I can't do exactly that because TensorFlow graphs can only be appended to and not otherwise modified.
Is there a better approach?
You can implement option (2), replacing the input node, but you will need to modify retrain.py to do so. The tf.import_graph_def() function supports a limited form of modification to the imported graph, by remapping tensors in the imported graph to existing tensors in the target graph.
This line in retrain.py calls tf.import_graph_def() to import the Inception model, where jpeg_data_tensor becomes the tensor that you feed with input data:
bottleneck_tensor, jpeg_data_tensor, resized_input_tensor = (
    tf.import_graph_def(graph_def, name='', return_elements=[
        BOTTLENECK_TENSOR_NAME, JPEG_DATA_TENSOR_NAME,
        RESIZED_INPUT_TENSOR_NAME]))
Instead of retrieving jpeg_data_tensor from the imported graph, you can remap it to an input pipeline that you construct yourself:
# Output of a training pipeline, returning a `tf.string` tensor containing
# a JPEG-encoded image.
jpeg_data_tensor = ...
bottleneck_tensor, resized_input_tensor = (
    tf.import_graph_def(
        graph_def,
        input_map={JPEG_DATA_TENSOR_NAME: jpeg_data_tensor},
        return_elements=[BOTTLENECK_TENSOR_NAME, RESIZED_INPUT_TENSOR_NAME]))
Wherever you previously fed jpeg_data_tensor, you no longer need to feed it, because the inputs will be read from the input pipeline you constructed. (Note that you might need to handle resized_input_tensor as well... I'm not intimately familiar with retrain.py, so some restructuring might be necessary.)

How can I feed a numpy array to a prefetch and buffer pipeline of TensorFlow

I tried to follow the CIFAR-10 example. However, I want to replace the file reading with a NumPy array. There are a few benefits to doing that:
Simpler code (I want to remove the binary file parsing)
Simpler graph and visualization --> easier to explain to other audiences
A small performance improvement (from skipping I/O and parsing)?
What would be a simple way to do it?
You need to get the tensor reshaped_image by either:
giving it a name
or finding its default name, with TensorBoard for instance
reshaped_image = tf.cast(read_input.uint8image, tf.float32, name="float_image")
Then you can feed your numpy array using a feed_dict like:
# Note the ':0' suffix: get_tensor_by_name expects a tensor name, not an op name.
reshaped_image = tf.get_default_graph().get_tensor_by_name("float_image:0")
sess.run(loss, feed_dict={reshaped_image: your_numpy})
The same goes for labels.