How to perform sklearn style train-test split on feature and label tensors using built in tensorflow methods? - tensorflow

Reposting my original question since even after significant improvements to clarity, it was not revived by the community.
I am looking for a way to split feature and corresponding label data into train and test using TensorFlow inbuilt methods. My data is already in two tensors (i.e. tf.Tensor objects), named features and labels.
I know how to do this easily for numpy arrays using sklearn.model_selection as shown in this post. Additionally, I was pointed to this method which requires the data to be in a single tensor. Also, I need the train and test sets to be disjoint, unlike in this method (meaning they can't have common data points after the split).
I am looking for a way to do the same using built-in methods in Tensorflow.
There may be too many conditions in my requirement, but basically what is needed is an equivalent method to sklearn.model_selection.train_test_split() in Tensorflow such as the below:
import tensorflow as tf
X_train, X_test, y_train, y_test = tf.train_test_split(features,
labels,
test_size=0.1,
random_state=123)

You can achieve this by using TF in the following way
from typing import Tuple
import tensorflow as tf
def split_train_test(features: tf.Tensor,
labels: tf.Tensor,
test_size: float,
random_state: int = 1729) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor]:
# Generate random masks
random = tf.random.uniform(shape=(tf.shape(features)[0],), seed=random_state)
train_mask = random >= test_size
test_mask = random < test_size
# Gather values
train_features, train_labels = tf.boolean_mask(features, mask=train_mask), tf.boolean_mask(labels, mask=train_mask)
test_features, test_labels = tf.boolean_mask(features, mask=test_mask), tf.boolean_mask(labels, mask=test_mask)
return train_features, test_features, train_labels, test_labels
What we are doing here is first creating a random uniform tensor with the size of the length of the data.
Then we follow by creating boolean masks according to the ratio given by test_size and finally we extract the relevant part for train/test using tf.boolean_mask

Related

Problem with shapes of experimental Tensorflow dataset

I am trying to store numpy arrays in a Tensorflow dataset. The model fits correctly when using the numpy arrays as train and test data but not when I store the numpy arrays in a single Tensorflow dataset. The problem is with the dimensions of the dataset. Something is wrong even though shapes seem OK at first sight.
After trying multiple things to reshape my Tensorflow dataset, I am still unable to get it working. My code is the following:
train_x.shape
Out[54]: (7200, 40)
train_y.shape
Out[55]: (7200,)
dataset = tf.data.Dataset.from_tensor_slices((x,y))
print(dataset)
Out[56]: <TensorSliceDataset shapes: ((40,), ()), types: (tf.int32, tf.int32)>
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
history = model.fit(dataset, epochs=EPOCHS, batch_size=256)
sparse_softmax_cross_entropy_with_logits
logits.get_shape()))
ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (40, 1351)).
I have seen this answer but I am sure it doesn't apply here. I must use sparse_categorical_crossentropy. I am inspiring myself from this example where I want to store the train and test data in a Tensorflow dataset. I also want to store the arrays in a dataset as I will have to use it later.
You can't use batch_size with model.fit() when using a tf.data.Dataset. Instead use tf.data.Dataset.batch(). You'll have to change your code as follows for it to work.
import numpy as np
import tensorflow as tf
# Some toy data
train_x = np.random.normal(size=(7200, 40))
train_y = np.random.choice([0,1,2], size=(7200))
dataset = tf.data.Dataset.from_tensor_slices((train_x,train_y))
dataset = dataset.batch(256)
#### - Define your model here - ####
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
history = model.fit(dataset, epochs=EPOCHS)

Custom optimizer with multiple loss function evaluations

I want to implement a custom optimization algorithm for TF models.
I have read the following sources
tf documentation on custom optimizers
tf SGD implementation
keras documentation on custom models
towardsdatascience guide on custom optimizers
However lot of questions remain.
It seems like it is not possible to evaluate the loss function multiple times (for different weight settings) before applying a gradient step, when using the custom optimizer API. For example in a line-search type of algorithm this is necessary.
I tried to do all steps manually.
Assume I have setup my model and my optimization problem like this
from tensorflow.keras import layers
from tensorflow.keras import losses
from tensorflow.keras import models
model = models.Sequential()
model.add(layers.Dense(15, input_dim=10))
model.add(layers.Dense(20))
model.add(layers.Dense(1))
x_train, y_train = get_train_data()
loss = losses.MeanSquaredError()
def val_and_grads(weights):
model.set_weights(weights)
with tf.GradientTape() as tape:
val = loss(y_train, model(x_train))
grads = tape.gradient(val, model.trainable_variables)
return val, grads
initial_weights = model.get_weights()
optimal_weigths = my_fancy_optimization_algorithm(val_and_grads, initial_weights)
However my function val_and_grads needs a list of weights and returns a list of gradients from my_fancy_optimization_algorithms point of view that seems unnatural.
I could warp val_and_grads to "stack" the returned gradients and "split" the passed weights like this
def wrapped_val_and_grad(weights):
grads = val_and_grads(split_weights(weights))
return stack_grads(grads)
however that seems very inefficient.
Anyway, I do not like this approach since it seems that I would loose out out on a lot of the surrounding tensorflow infrastructure (printing of current loss function values and metrics during learning, tensorboard stuff, ...).
I could also pack the above in a custom model with a tailored train_step like this
def CustomModel(keras.Model):
def train_step(self, data):
x_train, y_train = data
def val_and_grads(weights):
self.set_weights(weights)
with tf.GradientTape() as tape:
val = loss(y_train, self(x_train))
grads = tape.gradient(val, self.trainable_variables)
return val, grads
trainable_vars = self.trainable_variables
old_weights = self.get_weights()
update = my_fancy_update_finding_algorithm(val_and_grads, self.get_weights()) # this can do multiple evaluations of the model
self.set_weights(old_weights) # restore the weights
self.optimizer.apply_gradients(zip(update, trainable_vars))
Here I would need a accompanying custom optimizer that does nothing else than updating the current weights by adding the update (new_weigths = current_weights + update).
I am still unsure if this is the best way to go.
If someone can comment on the snippets and ideas above, guide me to any other resource that I should consider or provide new approaches and other feedback I would be very glad.
Thanks all.
Franz
EDIT:
Sadly I did not get any response here so far. Maybe my question is not concrete enough. As a first smaller question:
Given the model and val_and_grads in the first listing. How would I efficiently calculate the norm of the WHOLE gradient? What I do so far is
import numpy as np
_, grads = val_and_grad(model.get_weights())
norm_grads = np.linalg.norm(np.concatenate([grad.numpy().flatten() for grad in grad]))
This surely cannot be the "right" way.

Learning a Categorical Variable with TensorFlow Probability

I would like to use TFP to write a neural network where the output are the probabilities of a categorical variable with 3 classes, and train it using the negative log-likelihood.
As I'm moving my first steps with TF and TFP, I started with a toy model where the input layer has only 1 unit receiving a null input, and the output layer has 3 units with softmax activation function. The idea is that the biases should learn (up to an additive constant) the log of the probabilities.
Here below is my code, true_p are the true parameters I use to generate the data and I would like to learn, while learned_p is what I get from the NN.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from functions import nll
from tensorflow.keras.optimizers import SGD
import tensorflow.keras.layers as layers
import tensorflow_probability as tfp
tfd = tfp.distributions
# params
true_p = np.array([0.1, 0.7, 0.2])
n_train = 1000
# training data
x_train = np.array(np.zeros(n_train)).reshape((n_train,))
y_train = np.array(np.random.choice(len(true_p), size=n_train, p=true_p)).reshape((n_train,))
# model
input_layer = layers.Input(shape=(1,))
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
# training
hist_p = model_p.fit(x=x_train, y=y_train, batch_size=100, epochs=3000, verbose=0)
# check result
learned_p = np.round(model_p.layers[1].call(tf.constant([0], shape=(1, 1))).numpy(), 3)
learned_p
With this setup, I get the result:
>>> learned_p
array([[0.005, 0.989, 0.006]], dtype=float32)
I over-estimate the second category, and can't really distinguish between the first and the third one. What's worst, if I plot the probabilities at the end of each epoch, it looks like they are converging monotonically to the vector [0,1,0], which doesn't make sense (it seems to me the gradient should push in the opposite direction once I start to over-estimate).
I really can't figure out what's going on here, but have the feeling I'm doing something plain wrong. Any idea? Thank you for your help!
For the record, I also tried using other optimizers like Adam or Adagrad playing with the hyper-params, but with no luck.
I'm using Python 3.7.9, TensorFlow 2.3.1 and TensorFlow probability 0.11.1
I believe the default argument to Categorical is not the vector of probabilities, but the vector of logits (values you'd take softmax of to get probabilities). This is to help maintain precision in internal Categorical computations like log_prob. I think you can simply eliminate the softmax activation function and it should work. Please update if it doesn't!
EDIT: alternatively you can replace the tfd.Categorical with
lambda p: tfd.Categorical(probs=p)
but you'll lose the aforementioned precision gains. Just wanted to clarify that passing probs is an option, just not the default.

TF Dataset from Keras Sequence Class

I thought I would share something that took me a while to figure out: easily wrapping an existing Keras Sequence Class with a TF Dataset object. After following tutorials and migrating from TF 1.X and Keras to TF 2.X I finally figured out how to do it with minimal code. Hopefully I'm not the only one who struggled with this and others will find this helpful :)
A few assumptions:
Sequence class loads data and labels
Labels have the same shape (apart from channels) as the source data (i.e. this is something I use for training U-Nets)
Data format is channels last
import tensorflow as tf
def DatasetFromSequenceClass(sequenceClass, stepsPerEpoch, nEpochs, batchSize, dims=[512,512,3], n_classes=2, data_type=tf.float32, label_type=tf.float32):
# eager execution wrapper
def DatasetFromSequenceClassEagerContext(func):
def DatasetFromSequenceClassEagerContextWrapper(batchIndexTensor):
# Use a tf.py_function to prevent auto-graph from compiling the method
tensors = tf.py_function(
func,
inp=[batchIndexTensor],
Tout=[data_type, label_type]
)
# set the shape of the tensors - assuming channels last
tensors[0].set_shape([batchSize, dims[0], dims[1], dims[2]]) # [samples, height, width, nChannels]
tensors[1].set_shape([batchSize, dims[0], dims[1], n_classes]) # [samples, height, width, nClasses for one hot]
return tensors
return DatasetFromSequenceClassEagerContextWrapper
# TF dataset wrapper that indexes our sequence class
#DatasetFromSequenceClassEagerContext
def LoadBatchFromSequenceClass(batchIndexTensor):
# get our index as numpy value - we can use .numpy() because we have wrapped our function
batchIndex = batchIndexTensor.numpy()
# zero-based index for what batch of data to load; i.e. goes to 0 at stepsPerEpoch and starts cound over
zeroBatch = batchIndex % stepsPerEpoch
# load data
data, labels = sequenceClass[zeroBatch]
# convert to tensors and return
return tf.convert_to_tensor(data), tf.convert_to_tensor(labels)
# create our data set for how many total steps of training we have
dataset = tf.data.Dataset.range(stepsPerEpoch*nEpochs)
# return dataset using map to load our batches of data, use TF to specify number of parallel calls
return dataset.map(LoadBatchFromSequenceClass, num_parallel_calls=tf.data.experimental.AUTOTUNE)
With that function, you can then update your training to look something like this:
# load our data as tensorflow datasets
training = DatasetFromSequenceClass(trainingSequence, training_steps, nEpochs, batchSize, dims=shp, n_classes=nClasses)
validation = DatasetFromSequenceClass(validationSequence, validation_steps, nEpochs, batchSize, dims=shp, n_classes=nClasses)
# train
model_object.fit(training,
steps_per_epoch=training_steps,
validation_data=validation,
validation_steps=validation_steps,
epochs=nEpochs,
callbacks=callbacks,
verbose=1)
From here there are lots of other options for the Dataset API (like prefetch), but this should be a good starting point.

The established way to use TF Dataset API in Keras is to feed `model.fit` with `make_one_shot_iterator()`, But this iterator only good for one Epoch

Edit:
To clarify why this question is different from the suggested duplicates, this SO question follows up on those suggested duplicates, on what exactly is Keras doing with the techniques described in those SO questions. The suggested duplicates specify using a dataset API make_one_shot_iterator() in model.fit, my follow up is that make_one_shot_iterator() can only go through the dataset once, however in the solutions given, several epochs are specified.
This is a follow up to these SO questions
How to Properly Combine TensorFlow's Dataset API and Keras?
Tensorflow keras with tf dataset input
Using tf.data.Dataset as training input to Keras model NOT working
Where "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator". Each example has a TF dataset one shot iterator fed into Kera's model.fit.
An example is given below
# Load mnist training data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
training_set = tfdata_generator(x_train, y_train,is_training=True)
model = # your keras model here
model.fit(
training_set.make_one_shot_iterator(),
steps_per_epoch=len(x_train) // 128,
epochs=5,
verbose = 1)
However, according the the Tensorflow Dataset API guide (here https://www.tensorflow.org/guide/datasets ) :
A one-shot iterator is the simplest form of iterator, which only
supports iterating once through a dataset
So it's only good for 1 epoch. However, the codes in the SO questions specify several epochs, with the code example above specifying 5 epochs.
Is there any explanation for this contradiction? Does Keras somehow know that when the one shot iterator has gone through the dataset, it can re-initialize and shuffle the data?
You can simply pass dataset object to model.fit, Keras will handle iteration.
Considering one of pre-made datasets:
train, test = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train[0], train[1]))
This will create dataset object from training data of cifar10 dataset. In this case parse function isn't needed.
If you create dataset from path containing images of list of numpy arrays you'll need one.
dataset = tf.data.Dataset.from_tensor_slices((image_path, labels_path))
In case you'll need a function to load actual data from filename. Numpy array can be handled the same way just without tf.read_file
def parse_func(filename):
f = tf.read_file(filename)
image = tf.image.decode_image(f)
label = #get label from filename
return image, label
Then you can shuffle, batch, and map any parse function to this dataset. You can control how many examples will be preloaded with shuffle buffer. Repeat controls epoch count and better be left None, so it will repeat indefinitely. You can use either plain batch function or combine with
dataset = dataset.shuffle().repeat()
dataset.apply(tf.data.experimental.map_and_batch(map_func=parse_func, batch_size,num_parallel_batches))
Then dataset object can be passed to model.fit
model.fit(dataset, epochs, steps_per_epoch). Note that steps_per_epoch is a necessary parameter in this case, it will define when to start new epoch. So you'll have to know epoch size in advance.