from two tensors how to create a generator over rows - tensorflow

Suppose I have two tensors X_train (inputs) and Y_train (targets), where every row holds one sample.
From those two tensors, I want to create a data generator to use in model.fit(), because I want to use workers.
The fit call should look as follows:
model.fit(generator(X_train, y_train),
          workers=os.cpu_count(),
          max_queue_size=10)
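A minimal sketch of one way to do this (my own assumption, not a confirmed answer from the thread): wrap the two tensors in a tf.keras.utils.Sequence, which is the input type Keras's workers/max_queue_size machinery is designed around.

import math
import os
import numpy as np
import tensorflow as tf

class RowGenerator(tf.keras.utils.Sequence):
    """Yields (inputs, targets) batches, one row per sample."""
    def __init__(self, x, y, batch_size=32):
        # numpy views make slicing cheap and the object picklable for workers
        self.x, self.y = np.asarray(x), np.asarray(y)
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return math.ceil(len(self.x) / self.batch_size)

    def __getitem__(self, idx):
        lo = idx * self.batch_size
        return self.x[lo:lo + self.batch_size], self.y[lo:lo + self.batch_size]

model.fit(RowGenerator(X_train, Y_train),
          workers=os.cpu_count(),
          max_queue_size=10)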

Related

How do I add an embedding layer in Keras starting from a pd dataframe?

I am trying to build a neural network in Keras that uses both categorical and numerical inputs to predict student grades ranging from 0 to 20.
My dataset is already split into train and test sets (two separate dataframes). I split the training set into numerical and categorical attributes: there are 17 categorical attributes and 16 numerical ones. Each categorical column only contains 3-4 categories, so I have used one-hot encoding to transform them. However, that creates many extra columns, and I would like to experiment with embeddings since they are more efficient.
I don't understand what I need to do in order to feed the categorical inputs into the neural model.
This is what my basic neural network looks like:
input = keras.layers.Input(shape=(58,))  # 58 columns after one-hot encoding
hidden1 = keras.layers.Dense(300, activation="relu")(input)
hidden2 = keras.layers.Dense(300, activation="relu")(hidden1)
concat = keras.layers.Concatenate()([input, hidden2])
output = keras.layers.Dense(21, activation="softmax")(concat)
model = keras.Model(inputs=[input], outputs=[output])
How can I expand it to include an embedding layer? Can I embed all the categorical columns together or would I need to add a layer for each?
I am using sparse categorical crossentropy as my loss function, but I guess I could use a different one now that the categorical inputs have been vectorized?
model.compile(loss="sparse_categorical_crossentropy", optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
I am very new to ML and NNs so apologies if my question is unclear.
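For what it's worth, a sketch of the usual pattern (my own illustration, not an answer from the thread): give each integer-encoded categorical column its own Embedding layer, flatten the embeddings, and concatenate them with the numerical input. The cardinalities and embedding sizes below are assumptions.

import tensorflow as tf
from tensorflow import keras

num_input = keras.layers.Input(shape=(16,), name="numeric")  # 16 numerical attributes

# one Embedding per categorical column (inputs are integer codes, not one-hot)
cat_inputs, cat_embeddings = [], []
for i, cardinality in enumerate([4] * 17):  # assumed: 17 columns, ~4 categories each
    inp = keras.layers.Input(shape=(1,), name=f"cat_{i}")
    emb = keras.layers.Embedding(input_dim=cardinality, output_dim=2)(inp)
    cat_inputs.append(inp)
    cat_embeddings.append(keras.layers.Flatten()(emb))

merged = keras.layers.Concatenate()(cat_embeddings + [num_input])
hidden1 = keras.layers.Dense(300, activation="relu")(merged)
hidden2 = keras.layers.Dense(300, activation="relu")(hidden1)
output = keras.layers.Dense(21, activation="softmax")(hidden2)

model = keras.Model(inputs=cat_inputs + [num_input], outputs=[output])

Note that sparse_categorical_crossentropy is still appropriate here: the loss constrains the targets (integer grades 0-20), not how the inputs are encoded.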

How to perform an sklearn-style train-test split on feature and label tensors using built-in TensorFlow methods?

Reposting my original question since even after significant improvements to clarity, it was not revived by the community.
I am looking for a way to split feature and corresponding label data into train and test using TensorFlow inbuilt methods. My data is already in two tensors (i.e. tf.Tensor objects), named features and labels.
I know how to do this easily for numpy arrays using sklearn.model_selection as shown in this post. Additionally, I was pointed to this method which requires the data to be in a single tensor. Also, I need the train and test sets to be disjoint, unlike in this method (meaning they can't have common data points after the split).
I am looking for a way to do the same using built-in methods in TensorFlow.
There may be too many conditions in my requirement, but basically what is needed is an equivalent of sklearn.model_selection.train_test_split() in TensorFlow, such as the one below:
import tensorflow as tf
X_train, X_test, y_train, y_test = tf.train_test_split(features,
                                                        labels,
                                                        test_size=0.1,
                                                        random_state=123)
You can achieve this using TF in the following way:
from typing import Tuple
import tensorflow as tf

def split_train_test(features: tf.Tensor,
                     labels: tf.Tensor,
                     test_size: float,
                     random_state: int = 1729) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor]:
    # Generate random masks
    random = tf.random.uniform(shape=(tf.shape(features)[0],), seed=random_state)
    train_mask = random >= test_size
    test_mask = random < test_size
    # Gather values
    train_features, train_labels = tf.boolean_mask(features, mask=train_mask), tf.boolean_mask(labels, mask=train_mask)
    test_features, test_labels = tf.boolean_mask(features, mask=test_mask), tf.boolean_mask(labels, mask=test_mask)
    return train_features, test_features, train_labels, test_labels
What we are doing here is first creating a uniform random tensor with the same length as the data.
Then we build boolean masks according to the ratio given by test_size, and finally extract the relevant rows for train/test with tf.boolean_mask. Since each row is assigned independently, the split matches test_size only in expectation, but the train and test sets are guaranteed to be disjoint.
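A quick usage sketch, with hypothetical feature/label tensors:

features = tf.random.normal(shape=(1000, 8))
labels = tf.random.uniform(shape=(1000,), maxval=2, dtype=tf.int32)

X_train, X_test, y_train, y_test = split_train_test(features,
                                                    labels,
                                                    test_size=0.1,
                                                    random_state=123)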

The established way to use the TF Dataset API in Keras is to feed `model.fit` with `make_one_shot_iterator()`, but this iterator is only good for one epoch

Edit:
To clarify why this question is different from the suggested duplicates: it follows up on them, asking what exactly Keras does with the techniques they describe. The suggested duplicates say to feed a Dataset API make_one_shot_iterator() into model.fit; my follow-up is that make_one_shot_iterator() can only go through the dataset once, yet the solutions given specify several epochs.
This is a follow up to these SO questions
How to Properly Combine TensorFlow's Dataset API and Keras?
Tensorflow keras with tf dataset input
Using tf.data.Dataset as training input to Keras model NOT working
Where "Starting from Tensorflow 1.9, one can pass tf.data.Dataset object directly into keras.Model.fit() and it would act similar to fit_generator". Each example has a TF dataset one shot iterator fed into Kera's model.fit.
An example is given below
# Load mnist training data
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
training_set = tfdata_generator(x_train, y_train, is_training=True)

model = # your keras model here
model.fit(
    training_set.make_one_shot_iterator(),
    steps_per_epoch=len(x_train) // 128,
    epochs=5,
    verbose=1)
However, according to the TensorFlow Dataset API guide (https://www.tensorflow.org/guide/datasets):
A one-shot iterator is the simplest form of iterator, which only
supports iterating once through a dataset
So it is only good for one epoch. However, the code in those SO questions specifies several epochs, with the example above specifying 5.
Is there any explanation for this contradiction? Does Keras somehow know that when the one shot iterator has gone through the dataset, it can re-initialize and shuffle the data?
You can simply pass the dataset object to model.fit; Keras will handle the iteration.
Considering one of the pre-made datasets:
train, test = tf.keras.datasets.cifar10.load_data()
dataset = tf.data.Dataset.from_tensor_slices((train[0], train[1]))
This creates a dataset object from the training data of the cifar10 dataset. In this case a parse function isn't needed.
If you create the dataset from a path containing images, or from a list of numpy arrays, you will need one:
dataset = tf.data.Dataset.from_tensor_slices((image_path, labels_path))
In that case you need a function that loads the actual data from the filename. A numpy array can be handled the same way, just without tf.read_file:
def parse_func(filename):
    f = tf.read_file(filename)
    image = tf.image.decode_image(f)
    label = ...  # get label from filename
    return image, label
Then you can shuffle, batch, and map any parse function onto this dataset. The shuffle buffer controls how many examples are preloaded. repeat controls the epoch count and is best left as None, so the dataset repeats indefinitely. You can either use the plain batch function or combine mapping and batching:
dataset = dataset.shuffle(buffer_size).repeat()
dataset = dataset.apply(tf.data.experimental.map_and_batch(map_func=parse_func,
                                                           batch_size=batch_size,
                                                           num_parallel_batches=num_parallel_batches))
Then the dataset object can be passed to model.fit:
model.fit(dataset, epochs=epochs, steps_per_epoch=steps_per_epoch)
Note that steps_per_epoch is a necessary parameter here, since it defines when a new epoch starts; the repeated dataset itself has no end, so you have to know the epoch size in advance.
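Putting those pieces together, a minimal end-to-end sketch (the model architecture and batch size are my own assumptions):

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()

dataset = tf.data.Dataset.from_tensor_slices((x_train / 255.0, y_train))
dataset = dataset.shuffle(buffer_size=1024).repeat().batch(128)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# steps_per_epoch tells Keras when an epoch ends, since the repeated dataset never does
model.fit(dataset, epochs=5, steps_per_epoch=len(x_train) // 128)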

keras compile with dataset and flexible loss/metrics

I'm porting a bunch of code from tf.estimator.Estimator API to tf.keras using tf.data.Datasets and I'm hoping to stay as close to the provided compile/fit as possible. I'm being frustrated by compile's loss and metrics args.
Essentially, I'd like to use a loss function which uses multiple outputs and labels in a non-additive way, i.e. I want to provide
def custom_loss(all_labels, model_outputs):
    """
    Args:
        all_labels: all labels in the dataset, as a single tensor, tuple or dict
        model_outputs: all outputs of the model, as a single tensor, tuple or dict

    Returns:
        single loss tensor to be averaged.
    """
    ...
I can't provide this to compile because, as far as I'm aware, it only supports weighted sums of per-output/label losses and makes assumptions about the shape of each label based on the corresponding model output. I can't create it separately and use model.add_loss because I never have explicit access to a labels tensor if I want to let model.fit handle dataset iteration. I've considered flattening/concatenating all outputs and labels together, but then I can't monitor multiple metrics.
I can write my own training loop using model.train_on_batch, but that forces me to replicate behaviour already implemented in fit such as dataset iteration, callbacks, validation, distribution strategies etc.
As an example, I'd like to replicate the following estimator.
def model_fn(features, labels, mode):
    outputs = get_outputs(features)  # dict
    loss = custom_loss(labels, outputs)
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
    eval_metric_ops = {
        'a_mean': tf.metrics.mean(outputs['a'])
    }
    return tf.estimator.EstimatorSpec(
        loss=loss, train_op=train_op, mode=mode, eval_metric_ops=eval_metric_ops)

estimator = tf.estimator.Estimator(model_fn=model_fn)
estimator.train(dataset_fn)
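One commonly suggested workaround (a sketch under my own assumptions, not an answer from the thread): feed the labels in as an extra model input so the graph can see them, register the custom loss via model.add_loss, and compile without a loss argument. The shapes and the toy loss below are placeholders.

import tensorflow as tf
from tensorflow import keras

features_in = keras.Input(shape=(8,), name="features")  # hypothetical shape
labels_in = keras.Input(shape=(1,), name="labels")      # labels enter as an input

outputs = keras.layers.Dense(1, name="a")(features_in)  # stand-in for get_outputs

def custom_loss(all_labels, model_outputs):
    # placeholder for the real non-additive loss
    return tf.reduce_mean(tf.square(all_labels - model_outputs))

model = keras.Model([features_in, labels_in], outputs)
model.add_loss(custom_loss(labels_in, outputs))
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3))  # no loss= needed

# the dataset then yields only "inputs", with the labels packed inside them,
# e.g. elements of the form {"features": f, "labels": l}

This keeps fit's dataset iteration, callbacks, and validation intact, at the cost of a slightly unusual model signature.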

binarize input for pytorch

May I ask how to make data loaded in PyTorch become binarized once it is loaded?
TensorFlow can do this through:
train_data = mnist.input_data.read_data_sets(data_directory, one_hot=True)
How can PyTorch achieve the one_hot=True effect?
The data_loader I have now is:
torch.set_default_tensor_type('torch.FloatTensor')
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data/', train=True, download=True,
                   transform=transforms.Compose([
                       # transforms.RandomHorizontalFlip(),
                       transforms.ToTensor()])),
    batch_size=batch_size, shuffle=False)
I want to make the data in train_loader binarized. What I am doing now is, after loading the data:
for data, _ in train_loader:
    data = torch.round(data)
    data = Variable(data)
i.e. using the torch.round() function. Is this correct?
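(As an aside, a minimal sketch of doing the binarization inside the transform pipeline itself, using torchvision's transforms.Lambda; this is my own illustration, not part of the original question or answer:)

import torch
from torchvision import datasets, transforms

# round each pixel inside the transform, so batches arrive already 0/1-valued
binarize = transforms.Compose([
    transforms.ToTensor(),
    transforms.Lambda(lambda x: torch.round(x)),
])

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data/', train=True, download=True, transform=binarize),
    batch_size=64, shuffle=False)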
The one-hot encoding idea is used for classification. It sounds like you are perhaps trying to create an autoencoder.
If you are creating an autoencoder then there is no need to round, as BCELoss can handle values between 0 and 1. Note that when training it is better not to apply the sigmoid and instead use BCEWithLogitsLoss, as it provides numerical stability.
Here is an example of an autoencoder with MNIST
If instead you are attempting to do classification, then there is no need for a one-hot vector; you simply output a number of neurons equal to the number of classes, i.e. for MNIST output 10 neurons, and then pass them to CrossEntropyLoss along with a LongTensor of the expected class values.
Here is an example of classification on MNIST
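A minimal sketch of that classification setup (the dummy batch and shapes are assumptions on my part):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # 10 outputs for MNIST's 10 classes
criterion = nn.CrossEntropyLoss()

images = torch.rand(64, 1, 28, 28)     # dummy batch of MNIST-shaped images
targets = torch.randint(0, 10, (64,))  # class indices as a LongTensor, no one-hot needed

logits = model(images)                 # raw scores; CrossEntropyLoss applies log-softmax itself
loss = criterion(logits, targets)
loss.backward()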