How is data from tf.data generated and passed to the model

In the book Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow, the author explains how to use the Data API to manipulate, transform, and pass data to the model efficiently. He writes the following function:
def csv_reader_dataset(filepaths, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=5)
    dataset = dataset.shuffle(10000).repeat(1)
    return dataset.batch(batch_size).prefetch(1)
Then: train_set = csv_reader_dataset(train_filepaths)
and: model.fit(train_set, epochs=10)
What I don't understand is the part where he creates the actual train_set from the function: doesn't that mean he only gets one batch of data? He says that we create the training set once and don't need to repeat it, as that will be taken care of by Keras, but I don't see how.

A tf.data.Dataset is like a blueprint for how to get your data. To read data from a dataset, you create an iterator over it. The same dataset can be used to create multiple iterators, each of which will iterate over the whole dataset. So Keras needs only one dataset; it can then use that dataset to iterate over your data multiple times.
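For illustration, here is a minimal sketch (TF 2.x eager mode, toy data) showing that each loop over the same dataset gets a fresh iterator, which is exactly what Keras does once per epoch:
import tensorflow as tf

dataset = tf.data.Dataset.range(5).shuffle(5).batch(2)

for epoch in range(2):  # two full passes over the data, two iterators
    for batch in dataset:
        print(epoch, batch.numpy())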

Related

How is keras predict working with datasets

I am new to using tf datasets with Keras. Since you just hand over one object, I don't understand what actually happens. If I hand over a dataset to model.predict, how does it know which elements of this object to use, and how? A dataset can be a complex structure with many kinds of nested structures and levels, I think; what happens if I pass a dataset which has more "columns" than the dataset the model was trained on? Are the structure, names, or levels somehow saved from the dataset during training, to be remembered when making predictions?
If tf.keras.Model.fit() receives a tf.data.Dataset as input, it assumes that the dataset returns a tuple of either (inputs, targets) or (inputs, targets, sample_weights). The inputs part may itself be a complex structure of sub-inputs (like a tuple of image and label for a conditional VAE, for instance).
If the dataset does not match your model's inputs, fit() will simply fail.
See the comment on the fit() function in the TF source code.
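As a minimal illustration of that contract (a sketch with toy data, not from the question), each dataset element below is an (inputs, targets) tuple, and predict simply ignores the targets part:
import numpy as np
import tensorflow as tf

x = np.random.rand(100, 4).astype("float32")
y = np.random.randint(0, 2, size=(100, 1)).astype("float32")

# Each element of ds is an (inputs, targets) tuple.
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(32)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, activation="sigmoid")])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(ds, epochs=2)    # fails if the element structure doesn't match the model
preds = model.predict(ds)  # uses only the inputs part of each element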

Why can't I use my dataset anymore after using InceptionV3?

I'm currently working on video-captioning (frame-sequence to natural language).
I recently started using the tf.data.Dataset class instead of the feed_dict argument in tensorflow.
My goal is to feed these frames to a pretrained CNN (InceptionV3), extract the feature vectors, and then feed them to my RNN seq2seq network.
I've got a problem with tensorflow types after mapping my Dataset with the inception model: the dataset then becomes totally unusable, via neither dataset.batch() nor dataset.take(). I can't even make a one-shot iterator!
Here is how I proceed to build my Dataset:
Step 1: I first extract the same number of frames for every video and store them all in a numpy array. Its shape is (nb_videos, nb_frames, width, height, channels)
Note that in this dataset, every video has the same size and has 3 color channels.
Step 2: Then I create a tf.data.Dataset object using this big numpy array
Note that printing this dataset via Python gives:
With n_videos=2; width=240; height=320; channels=3
I already don't understand what "DataAdapter" stands for.
At this point, I can create a one-shot iterator, but using dataset.batch(1) returns:
I don't understand why the shape is "?" and not "1".
Step 3: I use the map function on the dataset to resize all the frames of all the videos to 299×299×3 (required by InceptionV3).
At this point, I can use the data in my dataset and make a one-shot iterator.
Step 4: I use the map function again to extract the features using the InceptionV3 pretrained model.
The problem occurs at this point:
Printing the dataset gives:
OK, looks good.
However, it's now impossible to make a one-shot iterator for this dataset.
Step 1:
X_train_slice, Y_train = build_dataset(number_of_samples)
Step 2:
X_train = tf.data.Dataset.from_tensor_slices(X_train_slice)
Step 3:
def format_video(video):
    frames = tf.image.resize_images(video, (299, 299))
    frames = tf.keras.applications.inception_v3.preprocess_input(frames)
    return frames

X_train = X_train.map(lambda video: format_video(video))
Step 4:
Inception model:
image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
For the tf.reduce_mean, see how-to-get-pool3-features-of-inception-v3-model-using-keras (SO)
def extract_video_features(video):
    batch_features = image_features_extract_model(video)
    batch_features = tf.reduce_mean(batch_features, axis=(1, 2))
    return batch_features

X_train = X_train.map(lambda video: extract_video_features(video))
Creating the iterator:
iterator = X_train.make_one_shot_iterator()
Here is the output:
ValueError: Failed to create a one-shot iterator for a dataset.
`Dataset.make_one_shot_iterator()` does not support datasets that capture
stateful objects, such as a `Variable` or `LookupTable`. In these cases, use
`Dataset.make_initializable_iterator()`. (Original error: Cannot capture a
stateful node (name:conv2d/kernel, type:VarHandleOp) by value.)
I don't really get it: it asks me to use an initializable_iterator, but that kind of iterator is meant for placeholders. Here, I've got raw data!
You're using the pipelines wrong.
The idea of tf.data is to provide input pipelines to a model, not to contain the model itself. What you're trying to do is run the model as a step of the pipeline (your step 4), but, as the error shows, this won't work.
What you should do instead is build the model as you are doing and then call model.predict on the input data to obtain the features you want (as computed values). If you want to add further computation, add it in the model, since the predict call will run the model and return the values of the output layers.
Side note: image_features_extract_model = tf.keras.Model(new_input, hidden_layer) is redundant, given the choice you made for input and output tensors: the input is image_model's input and the output is image_model's output, so image_features_extract_model is identical to image_model.
The final code should be:
X_train_slice, Y_train = build_dataset(number_of_samples)
X_train = tf.data.Dataset.from_tensor_slices(X_train_slice)

def format_video(video):
    frames = tf.image.resize_images(video, (299, 299))
    frames = tf.keras.applications.inception_v3.preprocess_input(frames)
    return frames

X_train = X_train.map(lambda video: format_video(video))

image_model = tf.keras.applications.InceptionV3(include_top=False,
                                                weights='imagenet')
bottlenecks = image_model.predict(X_train)
# Do something with your bottlenecks
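For instance, if the next step is the seq2seq RNN, one possibility (a sketch only, assuming bottlenecks comes back as a numpy array aligned with Y_train, and seq2seq_model is your hypothetical RNN) is to wrap the precomputed features in a fresh input pipeline:
# Sketch: build a new Dataset from the precomputed features.
features_dataset = tf.data.Dataset.from_tensor_slices((bottlenecks, Y_train))
features_dataset = features_dataset.shuffle(1000).batch(32)
# seq2seq_model.fit(features_dataset, ...)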

The established way to use the TF Dataset API in Keras is to feed `model.fit` with `make_one_shot_iterator()`, but this iterator is only good for one epoch

Edit:
To clarify why this question is different from the suggested duplicates: this question follows up on those duplicates, asking what exactly Keras is doing with the techniques described in them. The suggested duplicates use a Dataset API make_one_shot_iterator() in model.fit; my follow-up is that make_one_shot_iterator() can only go through the dataset once, yet the solutions given specify several epochs.
This is a follow up to these SO questions
How to Properly Combine TensorFlow's Dataset API and Keras?
Tensorflow keras with tf dataset input
Using tf.data.Dataset as training input to Keras model NOT working
Where "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator". Each example has a TF dataset one shot iterator fed into Kera's model.fit.
An example is given below
# Load mnist training data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
training_set = tfdata_generator(x_train, y_train, is_training=True)

model = # your keras model here
model.fit(
    training_set.make_one_shot_iterator(),
    steps_per_epoch=len(x_train) // 128,
    epochs=5,
    verbose=1)
However, according to the TensorFlow Dataset API guide (https://www.tensorflow.org/guide/datasets):
A one-shot iterator is the simplest form of iterator, which only
supports iterating once through a dataset
So it's only good for one epoch. However, the code in those SO questions specifies several epochs, with the example above specifying 5.
Is there any explanation for this contradiction? Does Keras somehow know that when the one-shot iterator has gone through the dataset, it can re-initialize it and shuffle the data?
You can simply pass the dataset object to model.fit; Keras will handle the iteration.
Considering one of pre-made datasets:
train, test = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train[0], train[1]))
This will create a dataset object from the training data of the cifar10 dataset. In this case a parse function isn't needed.
You will need one if you create the dataset from paths to images or from a list of numpy arrays:
dataset = tf.data.Dataset.from_tensor_slices((image_path, labels_path))
In that case you'll need a function to load the actual data from the filename. Numpy arrays can be handled the same way, just without tf.read_file:
def parse_func(filename):
    f = tf.read_file(filename)
    image = tf.image.decode_image(f)
    label = # get label from filename
    return image, label
Then you can shuffle, batch, and map any parse function onto this dataset. You control how many examples are preloaded with the shuffle buffer size. The repeat count controls the epoch count and is best left as None, so the dataset repeats indefinitely. You can use either the plain batch function or combine it with map_and_batch:
dataset = dataset.shuffle(buffer_size).repeat()
dataset = dataset.apply(tf.data.experimental.map_and_batch(
    map_func=parse_func, batch_size=batch_size,
    num_parallel_batches=num_parallel_batches))
Then the dataset object can be passed to model.fit as model.fit(dataset, epochs=epochs, steps_per_epoch=steps_per_epoch). Note that steps_per_epoch is a necessary parameter in this case; it defines when to start a new epoch, so you'll have to know the epoch size in advance.
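Putting those pieces together, a minimal sketch of the whole pattern might look like this (names such as image_paths and num_examples are assumptions, and parse_func is the function sketched above):
dataset = tf.data.Dataset.from_tensor_slices(image_paths)
dataset = dataset.shuffle(buffer_size=1000).repeat()  # no count: repeats indefinitely
dataset = dataset.apply(tf.data.experimental.map_and_batch(
    map_func=parse_func, batch_size=32, num_parallel_batches=4))

# steps_per_epoch tells Keras where one epoch ends on an endless dataset.
model.fit(dataset, epochs=10, steps_per_epoch=num_examples // 32)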

Using feed_dict is more than 5x faster than using dataset API?

I created a dataset in TFRecord format for testing. Every entry contains 200 columns, named C1 - C199, each being a list of strings, and a label column to denote the labels. The code to create the data can be found here: https://github.com/codescv/tf-dist/blob/8bb3c44f55939fc66b3727a730c57887113e899c/src/gen_data.py#L25
Then I used a linear model to train the data. The first approach looks like this:
dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.map(parse_tfrecord, num_parallel_calls=5)
dataset = dataset.repeat(num_epochs)
dataset = dataset.batch(batch_size)
features, labels = dataset.make_one_shot_iterator().get_next()
logits = tf.feature_column.linear_model(features=features,
                                        feature_columns=columns,
                                        cols_to_vars=cols_to_vars)
train_op = ...

with tf.Session() as sess:
    sess.run(train_op)
The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single.py
When I run the code above, I get 0.85 steps/sec (batch size being 1024).
In the second approach, I manually fetch batches from the Dataset into Python and then feed them to a placeholder, like this:
example = tf.placeholder(dtype=tf.string, shape=[None])
features = tf.parse_example(
    example,
    features=tf.feature_column.make_parse_example_spec(
        columns + [tf.feature_column.numeric_column(
            'label', dtype=tf.float32, default_value=0)]))
labels = features.pop('label')
train_op = ...

dataset = tf.data.TFRecordDataset(data_file).repeat().batch(batch_size)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    data_batch = sess.run(next_batch)
    sess.run(train_op, feed_dict={example: data_batch})
The full code can be found here: https://github.com/codescv/tf-dist/blob/master/src/lr_single_feed.py
When I run the code above, I get 5 steps/sec. That is 5x faster than the first approach. This is what I do not understand, because theoretically the second should be slower due to the extra serialization/deserialization of data batches.
Thanks!
There is currently (as of TensorFlow 1.9) a performance issue when using tf.data to map and batch tensors that have a large number of features with a small amount of data in each. The issue has two causes:
The dataset.map(parse_tfrecord, ...) transformation will execute O(batch_size * num_columns) small operations to create a batch. By contrast, feeding a tf.placeholder() to tf.parse_example() will execute O(1) operations to create the same batch.
Batching many tf.SparseTensor objects using dataset.batch() is much slower than directly creating the same tf.SparseTensor as the output of tf.parse_example().
Improvements to both these issues are underway, and should be available in a future version of TensorFlow. In the meantime, you can improve the performance of the tf.data-based pipeline by switching the order of the dataset.map() and dataset.batch() and rewriting the dataset.map() to work on a vector of strings, like the feeding based version:
dataset = tf.data.TFRecordDataset(data_file)
dataset = dataset.prefetch(buffer_size=batch_size*10)
dataset = dataset.repeat(num_epochs)

# Batch first to create a vector of strings as input to the map().
dataset = dataset.batch(batch_size)

def parse_tfrecord_batch(record_batch):
    features = tf.parse_example(
        record_batch,
        features=tf.feature_column.make_parse_example_spec(
            columns + [
                tf.feature_column.numeric_column(
                    'label', dtype=tf.float32, default_value=0)]))
    labels = features.pop('label')
    return features, labels

# NOTE: Parallelism might not be as useful, because the individual map function
# now does more work per invocation, but you might want to experiment with this.
dataset = dataset.map(parse_tfrecord_batch)

# Add a prefetch at the end to pipeline execution.
dataset = dataset.prefetch(1)

features, labels = dataset.make_one_shot_iterator().get_next()
# ...
EDIT (2018/6/18): To answer your questions from the comments:
Why is dataset.map(parse_tfrecord, ...) O(batch_size * num_columns), not O(batch_size)? If parsing requires enumeration of the columns, why doesn't parse_example take O(num_columns)?
When you wrap TensorFlow code in a Dataset.map() (or other functional transformation) a constant number of extra operations per output are added to "return" values from the function and (in the case of tf.SparseTensor values) "convert" them to a standard format. When you directly pass the outputs of tf.parse_example() to the input of your model, these operations aren't added. While they are very small operations, executing so many of them can become a bottleneck. (Technically the parsing does take O(batch_size * num_columns) time, but the constants involved in parsing are much smaller than executing an operation.)
Why do you add a prefetch at the end of the pipeline?
When you're interested in performance, this is almost always the best thing to do, and it should improve the overall performance of your pipeline. For more information about best practices, see the performance guide for tf.data.

How to augment data in tensorflow tfrecords?

I am storing my data using tfrecords, reading them as tensors using the Dataset API, and then using the Estimator API to perform training. Now, I want to do online data augmentation on each item in the dataset, but after trying for a while I cannot find a way to do it. I want random flipping, random rotation, and other manipulations.
I am following the instructions given in this tutorial with a custom estimator, which is my CNN, and I am not sure where the data augmentation step occurs.
Using TFRecords doesn't prevent you from doing data augmentation.
Following the tutorial you linked in your comment, here is roughly what happens:
You create the dataset from the TFRecord files and parse the records to get an image and a label:
dataset = tf.data.TFRecordDataset(filenames=filenames)
dataset = dataset.map(parse)
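The parse function itself isn't shown in the excerpt; here is a hedged sketch of what it might look like for image/label TFRecords (the feature names are assumptions, not taken from the tutorial):
def parse(serialized_example):
    features = tf.parse_single_example(
        serialized_example,
        features={
            'image': tf.FixedLenFeature([], tf.string),  # encoded image bytes
            'label': tf.FixedLenFeature([], tf.int64),
        })
    image = tf.image.decode_jpeg(features['image'], channels=3)
    label = tf.cast(features['label'], tf.int32)
    return image, label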
You can now apply a new preprocessing function to do some data augmentation during training
# Only do it when we are training
if train:
    dataset = dataset.map(train_preprocess)
The train_preprocess function can be something like this:
def train_preprocess(image, label):
    flip_image = tf.image.random_flip_left_right(image)
    # Other transformations...
    return flip_image, label
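Since the question also asks for random rotation, one way to add it is sketched below; note that tf.image.rot90 only covers multiples of 90 degrees (arbitrary angles would need something like tf.contrib.image.rotate in TF 1.x), and this addition is an assumption about what you want rather than part of the tutorial:
def train_preprocess(image, label):
    image = tf.image.random_flip_left_right(image)
    # Rotate by a random number of quarter turns (0 to 3).
    k = tf.random_uniform([], minval=0, maxval=4, dtype=tf.int32)
    image = tf.image.rot90(image, k=k)
    return image, label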