Best way to process terabytes of data on gcloud ml-engine with keras - tensorflow

I want to train a model on about 2TB of image data on gcloud storage. I saved the image data as separate tfrecords and tried to use the tensorflow data api following this example
https://medium.com/#moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
But it seems like keras' model.fit(...) doesn't support validation for tfrecord datasets based on
https://github.com/keras-team/keras/pull/8388
Is there a better approach for processing large amounts of data with keras from ml-engine that I'm missing?
Thanks a lot!

If you are willing to use tf.keras instead of actual Keras, you can instantiate a TFRecordDataset with the tf.data API and pass that directly to model.fit(). Bonus: you get to stream directly from Google Cloud storage, no need to download the data first:
# Construct a TFRecordDataset
ds_train tf.data.TFRecordDataset('gs://') # path to TFRecords on GCS
ds_train = ds_train.shuffle(1000).batch(32)
model.fit(ds_train)
To include validation data, create a TFRecordDataset with your validation TFRecords and pass that one to the validation_data argument of model.fit(). Note: this is possible as of TensorFlow 1.9.
Final note: you'll need to specify the steps_per_epoch argument. A hack that I use to know the total number of examples in all TFRecordfiles, is to simply iterate over the files and count:
import tensorflow as tf
def n_records(record_list):
"""Get the total number of records in a collection of TFRecords.
Since a TFRecord file is intended to act as a stream of data,
this needs to be done naively by iterating over the file and counting.
See https://stackoverflow.com/questions/40472139
Args:
record_list (list): list of GCS paths to TFRecords files
"""
counter = 0
for f in record_list:
counter +=\
sum(1 for _ in tf.python_io.tf_record_iterator(f))
return counter
Which you can use to compute steps_per_epoch:
n_train = n_records([gs://path-to-tfrecords/record1,
gs://path-to-tfrecords/record2])
steps_per_epoch = n_train // batch_size

Related

Image feature extraction with TF2 Keras API and TF Dataset

How should I use a tf Dataset in order run model.predict(data) and have access to the other features of the tf Dataset?
For example: my tf dataset has this format:
(tensor<(100,224,224,3)>, tensor<(100,)>) -> preprocesses images as tf.float32, uuids of the images as tf.string
If I extract the feature vector like this:
for image_data, uuids in ds.batch(100):
features = model.predict(data[0]) -> I get an array of features.
At this moment features is an array of (100, 2048) and uuids is a tensor of (100,) tf.string
How can I combine them in order to write the feature vectors to disk?
From my understanding, I need to have both of them in the same format, either both tensors so I can continue using tf code and save the feature vector as a tfrecord, either to get the uuid as a string from the uuid tensor so I can use python code and save the array in the file using numpy.tofile.
So my questions are:
- How can I make the features to be a tensor?
- Or can I get the string value from the tensor uuid?
- Does anything sounds wrong in what I try to do? Is there a more optimal way to create the input pipeline? Or did I misunderstood the usage of Keras API and tf dataset?
If I use a python pipeline I can successfully save the array in a file. But I would like to use tf dataset because I think it's gonna be faster and more optimized due to it's parallel map function, batching and autotuning the parallel calls.

How to extract Tensorflow trained weights from graph.pbtxt to raw data

I have trained a custom neural network with the function:
tf.estimator.train_and_evaluate
After correct training, it contains the following files:
checkpoint
events.out.tfevents.1538489166.ti
model.ckpt-0.data-00000-of-00002
model.ckpt-0.index
model.ckpt-10.data-00000-of-00002
model.ckpt-10.index eval
graph.pbtxt
model.ckpt-0.data-00001-of-00002
model.ckpt-0.meta
model.ckpt-10.data-00001-of-00002
model.ckpt-10.meta
Now I need to export the weights and biases of every layer, into a raw data structure, e.g. an array, numpy.
I have read multiple pages on TensorFlow, and on other topics, but neither can find this question. The first thing I would assume to put the fils together into graph.pd with the freeze.py as suggested here:
Tensorflow: How to convert .meta, .data and .index model files into one graph.pb file
But then still the main question is unsolved.
If you wish to evaluate tensors alone, you can check out this question. But if you wish to e.g. deploy your network, you can take a look at TensorFlow serving, which is probably the most performant one right now. Or if you want to export this network to other frameworks and use them there, you can actually use ONNX for this purpose.
If saving weights and biases in a numpy array is your strict requirement, you can follow this example:
# In a TF shell, define all requirements and call the model function
y = model(x, is_training=False, reuse=tf.AUTO_REUSE) # For example
Once you call this function, you can see all the variables in the graph by running
tf.global_variables()
You need to restore all these variables from the latest checkpoint (say ckpt_dir) and then execute each of these variables to get the latest values.
checkpoint = tf.train.latest_checkpoint('./model_dir/')
fine_tune = tf.contrib.slim.assign_from_checkpoint_fn(checkpoint,
tf.global_variables(),
ignore_missing_vars=True)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
gv = sess.run(tf.global_variables())
Now gv will be a list of all the values of your variables (weights and biases); You can access any individual component via indexing - gv[5] etc. Or you can convert the entire thing into an array and save using numpy.
np.save('my_weights', np.array(gv))
This will save all your weights and biases in your current working directory as a numpy array - 'my_weights.npy'.
Hope this helps.

Tensorflow Keras GCP ML engine model serving

I'm working on an image classifier with tensorflow estimator + keras retraining the last layer of a pretrained application inception_v3 on GCP ML engine.
The keras model is exported with tf.keras.estimator.model_to_estimator and the input function receive the path of the image stored on GCP cloud storage open the image with tf.image.decode_jpeg and return a dataset with the following format dict(zip(['inception_v3_input'], [image])), label
I'm trying to define the tf.estimator.export.ServingInputReceiver but I'm having some trouble defining it.
The model is serving correctly the prediction with the predict method using the input function without the labels.
My idea was to reuse the input_function to decode the image passing only the path of the image on cloud storage to the prediction also for the google endpoint, but I can't understand how to do it.
Thank's for your help
If I'm understanding correctly, your question is how to get the file from Cloud Storage, considering that you want to decode the image this way:
image_decoded = tf.image.decode_jpeg(image_string)
So, in this case, you can use:
image_string = file_io.FileIO(filename, mode='r')
By importing file_io first:
from tensorflow.python.lib.io import file_io
According to the comments on this question about reading input data from GCS, using the file_read function should provide the same results since " there was a bunch of work done to abstract file io and file systems, so there all the io functionality works consistently". So you can try also with read_file function.

How to augment data in tensorflow tfrecords?

I am storing my data using tfrecords and I read them as tensors using Dataset API and then I use the Estimator API to perform training. Now, I want to do online data-augmentation on each item in the dataset, but after trying for a while I cannot find a way out to do it. I want randomly flipping, randomly rotation and other manipulators.
I am following the instructions given in this tutorial with a custom estimator which is the my CNN and I am not sure where the data augmentation step occurs.
Using TFRecords doesn't prevent you from doing data augmentation.
Following the tutorial you linked in your comment, here is what roughly happens:
You create the dataset from the TFRecords files, and parse the file to get an image and a label
dataset = tf.data.TFRecordDataset(filenames=filenames)
dataset = dataset.map(parse)
You can now apply a new preprocessing function to do some data augmentation during training
# Only do it when we are training
if train:
dataset = dataset.map(train_preprocess)
The train_preprocess function can be something like this:
def train_preprocess(image, label):
flip_image = tf.image.random_flip_left_right(image)
# Other transformations...
return flip_image, label

Tensorflow Transfer Learning with Input Pipeline

I want to use transfer learning with Google's Inception network for an image recognition problem. I am using retrain.py from the TensorFlow example source for inspiration.
In retrain.py, the Inception graph is loaded and a feed dict is used to feed the new images into the model's input layer. However, I have my data serialized in TFRecord files and have been using an input pipeline to feed in my inputs, as demonstrated here.
So I have a tensor images which returns my input data in batches when run. But how can I feed these images into Inception? I can't use a feed dict since my inputs are tensors, not NumPy arrays. My two ideas are
1) simply call sess.run() on each batch to convert it to a NumPy array, and then use a feed dict to pass it to Inception.
2) replace the input node in the Inception graph with my own batch input tensor
I think (1) would work, but it seems a little inelegant. (2) seems more natural to me, but I can't do exactly that because TensorFlow graphs can only be appended to and not otherwise modified.
Is there a better approach?
You can implement option (2), replacing the input node, but you will need to modify retrain.py to do so. The tf.import_graph_def() function supports a limited form of modification to the imported graph, by remapping tensors in the imported graph to existing tensors in the target graph.
This line in retrain.py calls tf.import_graph_def() to import the Inception model, where jpeg_data_tensor becomes the tensor that you feed with input data:
bottleneck_tensor, jpeg_data_tensor, resized_input_tensor = (
tf.import_graph_def(graph_def, name='', return_elements=[
BOTTLENECK_TENSOR_NAME, JPEG_DATA_TENSOR_NAME,
RESIZED_INPUT_TENSOR_NAME]))
Instead of retrieving jpeg_data_tensor from the imported graph, you can remap it to an input pipeline that you construct yourself:
# Output of a training pipeline, returning a `tf.string` tensor containing
# a JPEG-encoded image.
jpeg_data_tensor = ...
bottleneck_tensor, resized_input_tensor = (
tf.import_graph_def(
graph_def,
input_map={JPEG_DATA_TENSOR_NAME: jpeg_data_tensor},
return_elements=[BOTTLENECK_TENSOR_NAME, RESIZED_INPUT_TENSOR_NAME]))
Wherever you previously fed jpeg_data_tensor, you no longer need to need it, because the inputs will be read from the input pipeline you constructed. (Note that you might need to handle resized_input_tensor as well... I'm not intimately familiar with retrain.py, so some restructuring might be necessary.)