How to augment data in tensorflow tfrecords? - tensorflow

I am storing my data using tfrecords and I read them as tensors using Dataset API and then I use the Estimator API to perform training. Now, I want to do online data-augmentation on each item in the dataset, but after trying for a while I cannot find a way out to do it. I want randomly flipping, randomly rotation and other manipulators.
I am following the instructions given in this tutorial with a custom estimator which is the my CNN and I am not sure where the data augmentation step occurs.

Using TFRecords doesn't prevent you from doing data augmentation.
Following the tutorial you linked in your comment, here is what roughly happens:
You create the dataset from the TFRecords files, and parse the file to get an image and a label
dataset = tf.data.TFRecordDataset(filenames=filenames)
dataset = dataset.map(parse)
You can now apply a new preprocessing function to do some data augmentation during training
# Only do it when we are training
if train:
dataset = dataset.map(train_preprocess)
The train_preprocess function can be something like this:
def train_preprocess(image, label):
flip_image = tf.image.random_flip_left_right(image)
# Other transformations...
return flip_image, label

Related

How is data from tf.data generated and passed to the model

In the book Hands-On ML with Scikit-Learn, Tensorflow and Keras, the author explains using the Data API to manipulate, transform and pass data to the model efficiently, he writes the following function:
def csv_reader(filepaths, batch_size=32):
dataset = tf.data.Dataset.list_files(filepaths)
dataset = dataset.interleave(lambda filepath:
tf.data.TextLineDataset(filepath).skip(1), cycle_length=5)
dataset = dataset.shuffle(10000).repeat(1)
return dataset.batch(batch_size).prefetch(1)
Then : train_set = csv_reader_dataset(train_filepaths)
and: model.fit(train_set, epochs=10)
What I don't understand is the part where he creates the actual train_set from the function, isn't that way he only has one batch of data? He says that we create a training set once and don't need to repeat it as it will be taken care of by Keras but I don't see how.
A tf.data.Dataset is like a blueprint for how to get your data. To use a dataset to read data, you create an iterator over the dataset. The same dataset can be used to create multiple iterators, which will each iterate over the whole dataset. So Keras just needs one dataset, then can use the dataset to iterate over your data multiple times.

How does tf.keras.Model tell between features and label(s) in tf.data.Dataset and in TFRecords?

I'm trying to create tfrecords files from CSV data, then I want to use tf.data.TFRecordDataset() to create Dataset from them, and then feed the Dataset to tf.keras.Model. (In fact I'm using spark-tensorflow-connector to create tfrecords files directly from Spark Dataframes.)
In the fit() method of tf.keras.Model, the argument x is the Input data. It could be:
A tf.data dataset. Should return a tuple of either (inputs, targets)
or (inputs, targets, sample_weights).
Q1: Is this where tf.keras.Model knows where to separate the features and the labels? i.e., the features are the inputs, the labels are the targets.
However in some examples, I could not see any "tuple" in the building of either the tfrecords files or the tf.data.Dataset. For example, in the following example,
def convert_to_tfrecord(input_files, output_file):
"""Converts a file to TFRecords."""
print('Generating %s' % output_file)
with tf.io.TFRecordWriter(output_file) as record_writer:
for input_file in input_files:
data_dict = read_pickle_from_file(input_file)
data = data_dict[b'data']
labels = data_dict[b'labels']
num_entries_in_batch = len(labels)
for i in range(num_entries_in_batch):
example = tf.train.Example(features=tf.train.Features(
feature={
'image': _bytes_feature(data[i].tobytes()),
'label': _int64_feature(labels[i])
}))
record_writer.write(example.SerializeToString())
...
# Read dataset from tfrecords
dataset = tf.data.TFRecordDataset(tfrecords_files)
Q2: So how does this tf.keras.models.Sequential() model know where to find the features and where to find the labels? Why the model wouldn't take 'label' as a data feature?
You need to consider the full code example, i.e. the other files where the training is done etc. The main thing is the parse_and_decode function in this file which parses the TFRecords file (without such a parse function, the data cannot be interpreted) and returns a tuple image, label for each piece of data. This function is then mapped over the dataset in the create_datasets function.
As such, the dataset that is given to model.fit is actually a dataset of tuples, and to the best of my knowledge, this is exactly what the model will assume if you provide a tf.data.Dataset as input to the fit function -- a dataset of tuples inputs, labels. So the first will be taken as input to the model, the second as target for the loss function.
In this example,
feature={
'image': _bytes_feature(data[i].tobytes()),
'label': _int64_feature(labels[i])
}))
Here, image and label is a tuple where image has byte type and labels have int64 type. YOu can read more here.

Why can't I use my dataset anymore after using InceptionV3?

I'm currently working on video-captioning (frame-sequence to natural language).
I recently started using tf.data.Dataset class instead of feed_dict argument in tensorflow.
My goal is to feed this frames to a pretrained CNN (inceptionv3), extract the feature vector and then feed it to my RNN seq2seq network.
I've got a problem of tensorflow types after mapping my Dataset with the inception model: the dataset is then totally unusable, neither via dataset.batch() or dataset.take(). I can't even make a one shot iterator !
Here is how I proceed to build my Dataset:
Step 1: I first extract the same number of frames for every videos. I store all of it into a numpy array. Its shape is (nb_videos, nb_frames, width, height, channels)
Note that in this dataset, every video has the same size and has 3 color channels.
Step 2: Then I create a tf.data.Dataset object using this big numpy array
Note that printing this dataset via python gives:
With n_videos=2; width=240; height=320; channels=3
I already don't understand what "DataAdapter" stands for
At this point; I can create a one shot iterator but using dataset.batch(1) returns:
I don't understand why "?" and not "1" shape..
Step 3: I use the map function on dataset to resize all the frames of all the videos to 299*299*3 (required to use InceptionV3)
At this point, I can use the data in my dataset and make a one shot iterator.
Step 4: I use the map function again to extract every features using InceptionV3 pretrained model.
The problem occurs at this point:
Printing the dataset gives:
Ok looks good
However, it's now impossible to make a one shot iterator for this dataset
Step1 :
X_train_slice, Y_train = build_dataset(number_of_samples)
Step 2:
X_train = tf.data.Dataset.from_tensor_slices(X_train_slice)
Step 3:
def format_video(video):
frames = tf.image.resize_images(video, (299,299))
frames = tf.keras.applications.inception_v3.preprocess_input(frames)
return frames
X_train = X_train.map(lambda video: format_video(video))
Step 4:
Inception model:
image_model = tf.keras.applications.InceptionV3(include_top=False,
weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
For the tf.reduce_mean; see how-to-get-pool3-features-of-inception-v3-model-using-keras (SO)
def extract_video_features(video):
batch_features = image_features_extract_model(video)
batch_features = tf.reduce_mean(batch_features, axis=(1, 2))
return batch_features
X_train = X_train.map(lambda video: extract_video_features(video))
Creating the iterator:
iterator = X_train.make_one_shot_iterator()
Here is the output:
ValueError: Failed to create a one-shot iterator for a dataset.
`Dataset.make_one_shot_iterator()` does not support datasets that capture
stateful objects, such as a `Variable` or `LookupTable`. In these cases, use
`Dataset.make_initializable_iterator()`. (Original error: Cannot capture a
stateful node (name:conv2d/kernel, type:VarHandleOp) by value.)
I don't really get it: it asks me to use a initializable_iterator but this kind of iterator is dedicated for placeholder. Here, I've got raw data !
You're using the pipelines wrong.
The idea of tf.data is to provide input pipelines to a model, not to contain the model itself. What you're trying to do it fit the model as a step of the pipeline (your step 4), but, as the error shows, this won't work.
What you should do instead is build the model as you are doing and then call model.predict on the input data, to obtain the features you want (as computed values). If you want to add further computation, add it in the model, since the predict call will run the model and return the values of the output layers.
Side note: image_features_extract_model = tf.keras.Model(new_input, hidden_layer) is completely irrelevant, given the choice you made for input and output tensors: the input is image_model's input and the output is image_model's output, so image_features_extract_model is identical to image_model.
The final code should be:
X_train_slice, Y_train = build_dataset(number_of_samples)
X_train = tf.data.Dataset.from_tensor_slices(X_train_slice)
def format_video(video):
frames = tf.image.resize_images(video, (299,299))
frames = tf.keras.applications.inception_v3.preprocess_input(frames)
return frames
X_train = X_train.map(lambda video: format_video(video))
image_model = tf.keras.applications.InceptionV3(include_top=False,
weights='imagenet')
bottlenecks = image_model.predict(X_train)
# Do something with your bottlenecks

The established way to use TF Dataset API in Keras is to feed `model.fit` with `make_one_shot_iterator()`, But this iterator only good for one Epoch

Edit:
To clarify why this question is different from the suggested duplicates, this SO question follows up on those suggested duplicates, on what exactly is Keras doing with the techniques described in those SO questions. The suggested duplicates specify using a dataset API make_one_shot_iterator() in model.fit, my follow up is that make_one_shot_iterator() can only go through the dataset once, however in the solutions given, several epochs are specified.
This is a follow up to these SO questions
How to Properly Combine TensorFlow's Dataset API and Keras?
Tensorflow keras with tf dataset input
Using tf.data.Dataset as training input to Keras model NOT working
Where "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator". Each example has a TF dataset one shot iterator fed into Kera's model.fit.
An example is given below
# Load mnist training data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
training_set = tfdata_generator(x_train, y_train,is_training=True)
model = # your keras model here
model.fit(
training_set.make_one_shot_iterator(),
steps_per_epoch=len(x_train) // 128,
epochs=5,
verbose = 1)
However, according the the Tensorflow Dataset API guide (here https://www.tensorflow.org/guide/datasets ) :
A one-shot iterator is the simplest form of iterator, which only
supports iterating once through a dataset
So it's only good for 1 epoch. However, the codes in the SO questions specify several epochs, with the code example above specifying 5 epochs.
Is there any explanation for this contradiction? Does Keras somehow know that when the one shot iterator has gone through the dataset, it can re-initialize and shuffle the data?
You can simply pass dataset object to model.fit, Keras will handle iteration.
Considering one of pre-made datasets:
train, test = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train[0], train[1]))
This will create dataset object from training data of cifar10 dataset. In this case parse function isn't needed.
If you create dataset from path containing images of list of numpy arrays you'll need one.
dataset = tf.data.Dataset.from_tensor_slices((image_path, labels_path))
In case you'll need a function to load actual data from filename. Numpy array can be handled the same way just without tf.read_file
def parse_func(filename):
f = tf.read_file(filename)
image = tf.image.decode_image(f)
label = #get label from filename
return image, label
Then you can shuffle, batch, and map any parse function to this dataset. You can control how many examples will be preloaded with shuffle buffer. Repeat controls epoch count and better be left None, so it will repeat indefinitely. You can use either plain batch function or combine with
dataset = dataset.shuffle().repeat()
dataset.apply(tf.data.experimental.map_and_batch(map_func=parse_func, batch_size,num_parallel_batches))
Then dataset object can be passed to model.fit
model.fit(dataset, epochs, steps_per_epoch). Note that steps_per_epoch is a necessary parameter in this case, it will define when to start new epoch. So you'll have to know epoch size in advance.

Best way to process terabytes of data on gcloud ml-engine with keras

I want to train a model on about 2TB of image data on gcloud storage. I saved the image data as separate tfrecords and tried to use the tensorflow data api following this example
https://medium.com/#moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36
But it seems like keras' model.fit(...) doesn't support validation for tfrecord datasets based on
https://github.com/keras-team/keras/pull/8388
Is there a better approach for processing large amounts of data with keras from ml-engine that I'm missing?
Thanks a lot!
If you are willing to use tf.keras instead of actual Keras, you can instantiate a TFRecordDataset with the tf.data API and pass that directly to model.fit(). Bonus: you get to stream directly from Google Cloud storage, no need to download the data first:
# Construct a TFRecordDataset
ds_train tf.data.TFRecordDataset('gs://') # path to TFRecords on GCS
ds_train = ds_train.shuffle(1000).batch(32)
model.fit(ds_train)
To include validation data, create a TFRecordDataset with your validation TFRecords and pass that one to the validation_data argument of model.fit(). Note: this is possible as of TensorFlow 1.9.
Final note: you'll need to specify the steps_per_epoch argument. A hack that I use to know the total number of examples in all TFRecordfiles, is to simply iterate over the files and count:
import tensorflow as tf
def n_records(record_list):
"""Get the total number of records in a collection of TFRecords.
Since a TFRecord file is intended to act as a stream of data,
this needs to be done naively by iterating over the file and counting.
See https://stackoverflow.com/questions/40472139
Args:
record_list (list): list of GCS paths to TFRecords files
"""
counter = 0
for f in record_list:
counter +=\
sum(1 for _ in tf.python_io.tf_record_iterator(f))
return counter
Which you can use to compute steps_per_epoch:
n_train = n_records([gs://path-to-tfrecords/record1,
gs://path-to-tfrecords/record2])
steps_per_epoch = n_train // batch_size