TF Dataset from Keras Sequence Class - tensorflow

I thought I would share something that took me a while to figure out: easily wrapping an existing Keras Sequence Class with a TF Dataset object. After following tutorials and migrating from TF 1.X and Keras to TF 2.X I finally figured out how to do it with minimal code. Hopefully I'm not the only one who struggled with this and others will find this helpful :)
A few assumptions:
Sequence class loads data and labels
Labels have the same shape (apart from channels) as the source data (i.e. this is something I use for training U-Nets)
Data format is channels last
import tensorflow as tf
def DatasetFromSequenceClass(sequenceClass, stepsPerEpoch, nEpochs, batchSize, dims=[512,512,3], n_classes=2, data_type=tf.float32, label_type=tf.float32):
# eager execution wrapper
def DatasetFromSequenceClassEagerContext(func):
def DatasetFromSequenceClassEagerContextWrapper(batchIndexTensor):
# Use a tf.py_function to prevent auto-graph from compiling the method
tensors = tf.py_function(
func,
inp=[batchIndexTensor],
Tout=[data_type, label_type]
)
# set the shape of the tensors - assuming channels last
tensors[0].set_shape([batchSize, dims[0], dims[1], dims[2]]) # [samples, height, width, nChannels]
tensors[1].set_shape([batchSize, dims[0], dims[1], n_classes]) # [samples, height, width, nClasses for one hot]
return tensors
return DatasetFromSequenceClassEagerContextWrapper
# TF dataset wrapper that indexes our sequence class
#DatasetFromSequenceClassEagerContext
def LoadBatchFromSequenceClass(batchIndexTensor):
# get our index as numpy value - we can use .numpy() because we have wrapped our function
batchIndex = batchIndexTensor.numpy()
# zero-based index for what batch of data to load; i.e. goes to 0 at stepsPerEpoch and starts cound over
zeroBatch = batchIndex % stepsPerEpoch
# load data
data, labels = sequenceClass[zeroBatch]
# convert to tensors and return
return tf.convert_to_tensor(data), tf.convert_to_tensor(labels)
# create our data set for how many total steps of training we have
dataset = tf.data.Dataset.range(stepsPerEpoch*nEpochs)
# return dataset using map to load our batches of data, use TF to specify number of parallel calls
return dataset.map(LoadBatchFromSequenceClass, num_parallel_calls=tf.data.experimental.AUTOTUNE)
With that function, you can then update your training to look something like this:
# load our data as tensorflow datasets
training = DatasetFromSequenceClass(trainingSequence, training_steps, nEpochs, batchSize, dims=shp, n_classes=nClasses)
validation = DatasetFromSequenceClass(validationSequence, validation_steps, nEpochs, batchSize, dims=shp, n_classes=nClasses)
# train
model_object.fit(training,
steps_per_epoch=training_steps,
validation_data=validation,
validation_steps=validation_steps,
epochs=nEpochs,
callbacks=callbacks,
verbose=1)
From here there are lots of other options for the Dataset API (like prefetch), but this should be a good starting point.

Related

Training with Dataset API and numpy array yields completely different results

I have a CNN regression model and feature comes in (2000, 3000, 1) shape, where 2000 is total number of samples with each being a (3000, 1) 1D array. Batch size is 8, 20% of the full dataset is used for validation.
However, zip feature and label into tf.data.Dataset gives completely different scores from feeding numpy arrays directly in.
The tf.data.Dataset code looks like:
# Load features and labels
features = np.array(features) # shape is (2000, 3000, 1)
labels = np.array(labels) # shape is (2000,)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
# Training model
model.fit(train_dataset, validation_data=val_dataset,
batch_size=8, epochs=1000)
The numpy code looks like:
# Load features and labels
features = np.array(features) # exactly the same as previous
labels = np.array(labels) # exactly the same as previous
# Training model
model.fit(x=features, y=labels, shuffle=True, validation_split=0.2,
batch_size=8, epochs=1000)
Except for this, other code is exactly the same, for example
# Set global random seed
tf.random.set_seed(0)
np.random.seed(0)
# No preprocessing of feature at all
# Load model (exactly the same)
model = load_model()
# Compile model
model.compile(
optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
loss=tf.keras.losses.MeanSquaredError(),
metrics=[tf.keras.metrics.mean_absolute_error, ],
)
The former method via tf.data.Dataset API yields mean absolute error (MAE) around 10-3 on both training and validation set, which looks quite suspicious as the model doesn't have any drop-out or regularization to prevent overfitting. On the other hand, feeding numpy arrays right in gives training MAE around 0.1 and validation MAE around 1.
The low MAE of tf.data.Dataset method looks super suspicious however I just couldn't figure out anything wrong with the code. Also I could confirm the number of training batches is 200 and validation batches is 50, meaning I didn't use the training set for validation.
I tried to vary the global random seed or use some different shuffle seeds, which didn't change the results much. Training was done on NVIDIA V100 GPUs, and I tried tensorflow version 2.9, 2.10, 2.11 which didn't make much difference.
The problem lies in the default behaviour of "shuffle" method of tf.data.Dataset, more specificially the reshuffle_each_iteration argument which is by default True. Meaning if I implement the following code:
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(buffer_size=2000)
dataset = dataset.batch(8)
train_dataset = dataset.take(200)
val_dataset = dataset.skip(200)
model.fit(train_dataset, validation_data=val_dataset, batch_size=8, epochs=1000)
The dataset would actually be shuffle after each epoch though it might not look so apparently so. As a result, the validation data would leak into training set (in fact there would be no distinguish between these two sets as the order is shuffled every epoch).
So make sure to set reshuffle_each_iteration to False if you would like to shuffle the dataset and then do train-val split.
UPDATE: TensorFlow confirms this issue and warning would be added in future docs.
PS: It's a hard lesson for me, as I have been using the model for analysing the results for several months (as a graduating MPhil student).

tensor slicing in tensorflow

I want to do the same numpy operation as follow to make a custom layer
img=cv2.imread('img.jpg') # img.shape =>(600,600,3)
mask=np.random.randint(0,2,size=img.shape[:2],dtype='bool')
img2=np.expand_dims(img,axis=0) #img.shape => (1,600,600,3)
img2[:,mask,:].shape # => (1, 204030, 3)
this is my first attemp but I failed. I can't do the same operation for for tensorflow tensors
class Sampling_layer(keras.layers.Layer):
def __init__(self,sampling_matrix):
super(Sampling_layer,self).__init__()
self.sampling_matrix=sampling_matrix
def call(self,input_img):
return input_img[:,self.sampling_matrix,:]
More Explanations:
I want to define a keras layer so that given a batch of images it use a sampling matrix and give me a batch of sampled vectors for the images.The sampling matrix is a random boolean matrix the same size as the image. The slicing operation I used is straight forward for numpy arrays and works perfectly. but I can't get it done with tensors in tensorflow. I tried to use loops to perform the operation I want manually but I failed.
You can do the following.
import numpy as np
import tensorflow as tf
# Batch of images
img=np.random.normal(size=[2,600,600,3]) # img.shape =>(600,600,3)
# You'll need to match the first 3 dimensions of mask with the img
# for that we'll repeat the first axis twice
mask=np.random.randint(0,2,size=img.shape[1:3],dtype='bool')
mask = np.repeat(np.expand_dims(mask, axis=0), 2, axis=0)
# Defining input layers
inp1 = tf.keras.layers.Input(shape=(600,600,3))
mask_inp = tf.keras.layers.Input(shape=(600,600))
# The layer you're looking for
out = tf.keras.layers.Lambda(lambda x: tf.boolean_mask(x[0], x[1]) )([inp1, mask])
model = tf.keras.models.Model([inp1, mask_inp], out)
# Predict on sample data
toy_out = model.predict([img, mask])
Note that both your images and mask needs to have the same batch size. I couldn't find a solution to make this work without repeating the mask on batch axis to match the batch size of images. This is the only possible solution that came to my mind, (assuming that your mask changes for every batch of data).

Training seq2seq model on Google Colab TPU with big dataset - Keras

I'm trying to train a sequence to sequence model for machine translation using Keras on Google Colab TPU.
I have a dataset which I can load in memory but I have to preprocess to it to feed it to the model. In particular I need to convert the target words to one hot vectors and with many examples I can't load the entire conversion in memory, so I need to make batches of data.
I'm using this function as a batch generator:
def generate_batch_bert(X_ids, X_masks, y, batch_size = 1024):
''' Generate a batch of data '''
while True:
for j in range(0, len(X_ids), batch_size):
# batch of encoder and decoder data
encoder_input_data_ids = X_ids[j:j+batch_size]
encoder_input_data_masks = X_masks[j:j+batch_size]
y_decoder = y[j:j+batch_size]
# decoder target and input for teacher forcing
decoder_input_data = y_decoder[:,:-1]
decoder_target_seq = y_decoder[:,1:]
# batch of decoder target data
decoder_target_data = to_categorical(decoder_target_seq, vocab_size_fr)
# keep only with the right amount of instances for training on TPU
if encoder_input_data_ids.shape[0] == batch_size:
yield([encoder_input_data_ids, encoder_input_data_masks, decoder_input_data], decoder_target_data)
The problem is that whenever I try to run the fit function as follows:
model.fit(x=generate_batch_bert(X_train_ids, X_train_masks, y_train, batch_size = batch_size),
steps_per_epoch = train_samples//batch_size,
epochs=epochs,
callbacks = callbacks,
validation_data = generate_batch_bert(X_val_ids, X_val_masks, y_val, batch_size = batch_size),
validation_steps = val_samples//batch_size)
I get the following error:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/tensor_util.py:445 make_tensor_proto
raise ValueError("None values not supported.")
ValueError: None values not supported.
Not sure what's wrong and how I can solve this problem.
EDIT
I tried loading less amount of data in memory so that the conversion to one hot encoding of the target words doesn't crash the kernel and it actually works. So there is obviously something wrong on how I generate batches.
It's hard to tell what's wrong since you don't provide your model
definition nor any sample data. However, I'm fairly certain that you're
running into the same
TensorFlow bug
that I recently got bitten by.
The workaround is to use the tensorflow.data API which works much
better with TPUs. Like this:
from tensorflow.data import Dataset
import tensorflow as tf
def map_fn(X_id, X_mask, y):
decoder_target_data = tf.one_hot(y[1:], vocab_size_fr)
return (X_id, X_mask, y[:-1]), decoder_target_data
...
X_ids = Dataset.from_tensor_slices(X_ids)
X_masks = Dataset.from_tensor_slices(X_masks)
y = Dataset.from_tensor_slices(y)
ds = Dataset.zip((X_ids, X_masks, y)).map(map_fn).batch(1024)
model.fit(x = ds, ...)

How to perform sklearn style train-test split on feature and label tensors using built in tensorflow methods?

Reposting my original question since even after significant improvements to clarity, it was not revived by the community.
I am looking for a way to split feature and corresponding label data into train and test using TensorFlow inbuilt methods. My data is already in two tensors (i.e. tf.Tensor objects), named features and labels.
I know how to do this easily for numpy arrays using sklearn.model_selection as shown in this post. Additionally, I was pointed to this method which requires the data to be in a single tensor. Also, I need the train and test sets to be disjoint, unlike in this method (meaning they can't have common data points after the split).
I am looking for a way to do the same using built-in methods in Tensorflow.
There may be too many conditions in my requirement, but basically what is needed is an equivalent method to sklearn.model_selection.train_test_split() in Tensorflow such as the below:
import tensorflow as tf
X_train, X_test, y_train, y_test = tf.train_test_split(features,
labels,
test_size=0.1,
random_state=123)
You can achieve this by using TF in the following way
from typing import Tuple
import tensorflow as tf
def split_train_test(features: tf.Tensor,
labels: tf.Tensor,
test_size: float,
random_state: int = 1729) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor]:
# Generate random masks
random = tf.random.uniform(shape=(tf.shape(features)[0],), seed=random_state)
train_mask = random >= test_size
test_mask = random < test_size
# Gather values
train_features, train_labels = tf.boolean_mask(features, mask=train_mask), tf.boolean_mask(labels, mask=train_mask)
test_features, test_labels = tf.boolean_mask(features, mask=test_mask), tf.boolean_mask(labels, mask=test_mask)
return train_features, test_features, train_labels, test_labels
What we are doing here is first creating a random uniform tensor with the size of the length of the data.
Then we follow by creating boolean masks according to the ratio given by test_size and finally we extract the relevant part for train/test using tf.boolean_mask

The established way to use TF Dataset API in Keras is to feed `model.fit` with `make_one_shot_iterator()`, But this iterator only good for one Epoch

Edit:
To clarify why this question is different from the suggested duplicates, this SO question follows up on those suggested duplicates, on what exactly is Keras doing with the techniques described in those SO questions. The suggested duplicates specify using a dataset API make_one_shot_iterator() in model.fit, my follow up is that make_one_shot_iterator() can only go through the dataset once, however in the solutions given, several epochs are specified.
This is a follow up to these SO questions
How to Properly Combine TensorFlow's Dataset API and Keras?
Tensorflow keras with tf dataset input
Using tf.data.Dataset as training input to Keras model NOT working
Where "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator". Each example has a TF dataset one shot iterator fed into Kera's model.fit.
An example is given below
# Load mnist training data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
training_set = tfdata_generator(x_train, y_train,is_training=True)
model = # your keras model here
model.fit(
training_set.make_one_shot_iterator(),
steps_per_epoch=len(x_train) // 128,
epochs=5,
verbose = 1)
However, according the the Tensorflow Dataset API guide (here https://www.tensorflow.org/guide/datasets ) :
A one-shot iterator is the simplest form of iterator, which only
supports iterating once through a dataset
So it's only good for 1 epoch. However, the codes in the SO questions specify several epochs, with the code example above specifying 5 epochs.
Is there any explanation for this contradiction? Does Keras somehow know that when the one shot iterator has gone through the dataset, it can re-initialize and shuffle the data?
You can simply pass dataset object to model.fit, Keras will handle iteration.
Considering one of pre-made datasets:
train, test = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train[0], train[1]))
This will create dataset object from training data of cifar10 dataset. In this case parse function isn't needed.
If you create dataset from path containing images of list of numpy arrays you'll need one.
dataset = tf.data.Dataset.from_tensor_slices((image_path, labels_path))
In case you'll need a function to load actual data from filename. Numpy array can be handled the same way just without tf.read_file
def parse_func(filename):
f = tf.read_file(filename)
image = tf.image.decode_image(f)
label = #get label from filename
return image, label
Then you can shuffle, batch, and map any parse function to this dataset. You can control how many examples will be preloaded with shuffle buffer. Repeat controls epoch count and better be left None, so it will repeat indefinitely. You can use either plain batch function or combine with
dataset = dataset.shuffle().repeat()
dataset.apply(tf.data.experimental.map_and_batch(map_func=parse_func, batch_size,num_parallel_batches))
Then dataset object can be passed to model.fit
model.fit(dataset, epochs, steps_per_epoch). Note that steps_per_epoch is a necessary parameter in this case, it will define when to start new epoch. So you'll have to know epoch size in advance.