I was trying out TensorFlow's built-in pandas_input_fn() with a pandas DataFrame that I named training_examples.
It's a very simple DataFrame describing one set of features and labels. It is then passed as the x argument to pandas_input_fn(), as shown below, which, if I understand the docs correctly, should return an input function with the data already parsed into features and labels.
input_function = tf.estimator.inputs.pandas_input_fn(
    x=training_examples,
    y=None,
    batch_size=128,
    num_epochs=1,
    shuffle=True,
    queue_capacity=1000,
    num_threads=1,
    target_column='y'
)
However, when I then try to pass this function to the .train() method, I get the error shown below:
ValueError: You must provide a labels Tensor. Given: None. Suggested
troubleshooting steps: Check that your data contain your label feature. Check
that your input_fn properly parses and returns labels.
I'm not sure what I'm doing wrong.
train_input_function zips up its own tuple of features and labels. You're on the right track in your comments.
x = training_examples[[feature_column_list]]
y = training_examples[label_column_name]
Working with the full dataset (before splitting into train and test), I find it works well to produce train and test input functions like so. This makes use of sklearn's train_test_split function with 'stratify' to make sure each label category appears in the right ratio in both splits.
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(x, y, stratify=y)
At this point you can specify your input functions.
train_input_fn = tf.estimator.inputs.pandas_input_fn(x=train_x, y=train_y, shuffle=True, num_epochs=whatever, batch_size=whatever)
test_input_fn = tf.estimator.inputs.pandas_input_fn(x=test_x, y=test_y, shuffle=False, batch_size=1)
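Once those are defined, you just hand the input functions to the estimator. A minimal usage sketch, assuming estimator is an already-constructed tf.estimator model (e.g. a LinearClassifier); the name is a placeholder, not from the question:

# Train on the stratified training split, then evaluate on the held-out split.
estimator.train(input_fn=train_input_fn)
eval_metrics = estimator.evaluate(input_fn=test_input_fn)
print(eval_metrics)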
Try target_column=None and use the actual y column explicitly, i.e. y=training_examples['label/target'].
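In other words, something along these lines (a minimal sketch; 'label/target' stands in for whatever your label column is actually called):

# Split the DataFrame into features and the label column, then pass both.
features = training_examples.drop(columns=['label/target'])
labels = training_examples['label/target']

input_function = tf.estimator.inputs.pandas_input_fn(
    x=features,
    y=labels,
    batch_size=128,
    num_epochs=1,
    shuffle=True)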
To put it simply, I'd like to be able to use a Keras dataset created from a local image directory to train an autoencoder. To clarify, this is a model that approximates the identity function for images: ideally, the output is exactly equal to the input.
The dataset is too large to fit in memory, so converting the dataset to a numpy array with np.concatenate will not help me here.
Or in other words, I'd like an Identity image dataset, where the label for each image in the dataset is exactly equal to the image itself.
Here's my (non-working) sample code:
train_ds, validate_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    labels=None,
    validation_split=0.1,
    subset="both",
    shuffle=True,
    seed=123,
    image_size=(img_height, img_width),
    batch_size=batch_size,
    crop_to_aspect_ratio=True)
history = autoencoder.fit(
    x=train_ds,
    y=train_ds,
    validation_data=(validate_ds, validate_ds),
    epochs=epochs,
    batch_size=16
)
The image_dataset_from_directory function gives me a dataset of images with no labels. So far so good.
The second command fails with the error message:
ValueError: `y` argument is not supported when using dataset as input.
On the other hand, if I exclude the y variable I get this error:
ValueError: Target data is missing. Your model was compiled with loss=binary_crossentropy, and therefore expects target data to be provided in `fit()`.
That is not at all surprising, because there are no labels, as I requested none. Yet it won't let me use the dataset itself as the labels, which is what I need to do.
Any help would be appreciated.
While there are ways to modify the dataset, I think the best option is to write a custom model class. This is modified from the official tutorial:
class Autoencoder(tf.keras.Model):
    def train_step(self, data):
        # Unpack the data. Its structure depends on your model and
        # on what you pass to `fit()`.
        x = data  # CHANGE 1: changed from x, y = data

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)  # Forward pass
            # Compute the loss value
            # (the loss function is configured in `compile()`)
            loss = self.compiled_loss(x, y_pred, regularization_losses=self.losses)  # CHANGE 2: replaced y by x as label

        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)
        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))
        # Update metrics (includes the metric that tracks the loss)
        self.compiled_metrics.update_state(x, y_pred)  # CHANGE 3: like change 2
        # Return a dict mapping metric names to current value
        return {m.name: m.result() for m in self.metrics}

    def test_step(self, data):
        # CHANGED in the same way
        x = data
        # Compute predictions
        y_pred = self(x, training=False)
        # Updates the metrics tracking the loss
        self.compiled_loss(x, y_pred, regularization_losses=self.losses)
        # Update the metrics.
        self.compiled_metrics.update_state(x, y_pred)
        # Return a dict mapping metric names to current value.
        # Note that it will include the loss (tracked in self.metrics).
        return {m.name: m.result() for m in self.metrics}
This is for the functional API (tf.keras.Model). In case you are using a Sequential model, you should inherit from that instead. You can use this as a direct replacement for the normal model constructor.
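For example, hypothetical usage with the functional API might look like the sketch below. The layer stack and loss are placeholders, not the asker's actual architecture; the point is that Autoencoder replaces tf.keras.Model and fit() is called with the unlabelled dataset only:

# Placeholder encoder/decoder; swap in your real architecture.
inputs = tf.keras.Input((img_height, img_width, 3))
x = tf.keras.layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
outputs = tf.keras.layers.Conv2D(3, 3, padding="same")(x)

autoencoder = Autoencoder(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="mse")

# No y argument and no (validate_ds, validate_ds) tuple needed: the custom
# train_step/test_step above use the inputs themselves as the targets.
history = autoencoder.fit(train_ds, validation_data=validate_ds, epochs=epochs)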
Another option could be to use train_zipped = tf.data.Dataset.zip((train_ds, train_ds)) to create an (input, target) dataset that you can put directly into the usual model and loss function. Personally, I don't like the duplication. Also, I'm not sure whether this will behave correctly for shuffled data (will both copies of train_ds be shuffled in the same way?).
You could circumvent this by setting shuffle=False in image_dataset_from_directory, and then use train_zipped = train_zipped.shuffle(buffer_size) instead. However, in my experience this is very slow.
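A sketch of that zip-based alternative, assuming the dataset is loaded unshuffled so both copies line up (buffer_size is a placeholder you'd set yourself):

train_ds = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    labels=None,
    shuffle=False,           # keep both copies in the same order
    image_size=(img_height, img_width),
    batch_size=batch_size)

train_zipped = tf.data.Dataset.zip((train_ds, train_ds))  # (input, target) pairs
train_zipped = train_zipped.shuffle(buffer_size)          # shuffle after zipping

history = autoencoder.fit(train_zipped, epochs=epochs)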
In my project, I have a number of cases where I have a Dataset instance and I need to get predictions from some model on every item in the dataset.
The model.predict() API is optimized perfectly for this, as shown in the documentation. However, there seems to be one major catch. I also happen to need the labels to compare with the predicted values, i.e. the dataset contains x,y pairs, and I'd like to end up with (y_predicted, y) pairs after the prediction is complete. This does not seem to be possible with the predict() API though, and I can't think of a clean way to 'split' the dataset so that the x's are fed into the model and the y's are retained to be joined back up with the predicted y's.
EDIT: I know it's quite simple to do by iterating over the dataset manually and calling the model directly, e.g.
for x, y in dataset:
    y_pred = model(x)
    result.append((y, y_pred))
However, this seems like it will be a fair bit slower than using the built-in predict(), as TensorFlow won't be able to multi-thread/optimize the input pipeline.
Does anyone have a good way to accomplish this?
Given the concerns you mentioned, it may be best to override predict to suit your needs. You don't actually need to override that function itself, though, only predict_step, which it calls. Just use this class instead of Model:
class MyModel(tf.keras.Model):
    def predict_step(self, data):
        x, y = data
        return self(x, training=False), y
If your model is currently Sequential, inherit from that instead. Basically the only change I made from the default implementation is to add , y to the model call result.
Note that this also makes some assumptions, such as that your dataset consists of (input, label) batch pairs. You may need to adapt it slightly to your needs. Here is a minimal example:
import tensorflow as tf
import numpy as np
(imgs, lbls), (te_imgs, te_lbls) = tf.keras.datasets.mnist.load_data()
imgs = imgs.astype(np.float32).reshape((-1, 784)) / 255.
te_imgs = te_imgs.astype(np.float32).reshape((-1, 784)) / 255.
lbls = lbls.astype(np.int32)
te_lbls = te_lbls.astype(np.int32)
tr_data = tf.data.Dataset.from_tensor_slices((imgs, lbls)).shuffle(60000).batch(128)
te_data = tf.data.Dataset.from_tensor_slices((te_imgs, te_lbls)).batch(128)
class MyModel(tf.keras.Model):
    def predict_step(self, data):
        x, y = data
        return self(x, training=False), y
inp = tf.keras.Input((784,))
logits = tf.keras.layers.Dense(10)(inp)
model = MyModel(inp, logits)
opt = tf.keras.optimizers.Adam()
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(loss=loss, optimizer=opt)
something = model.predict(te_data)
print(something[0].shape, something[1].shape)
This shows ((10000, 10), (10000,)) -- predict now returns a tuple of (outputs, labels). This can be confirmed by inspecting the returned labels and comparing them to the images in the test set.
I have a big dataset which I want to use in order to train my convolutional autoencoder.
Like every autoencoder, my convolutional autoencoder needs to be trained with x = y, i.e. the same x_train and x_test passed as both the x and y arguments,
for example:
autoencoder.fit(x_train, x_train,
                epochs=50,
                batch_size=256,
                shuffle=True,
                validation_data=(x_test, x_test))
How can I use image_dataset_from_directory to fit my autoencoder?
How can I set up image_dataset_from_directory with the same x and y parameters (as I mentioned above)?
You can get the result you want by using ImageDataGenerator.flow_from_directory. Documentation is here. If you set class_mode="input", the generator's y output is identical to the x input. The documentation specifically states this is useful for autoencoders.
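A minimal sketch of that approach, reusing the data_dir, image size and batch size placeholders from the question (the validation_split/subset pair is optional and just mirrors the usual train/validation setup):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1./255, validation_split=0.1)

train_gen = datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode="input",      # y is the image itself
    subset="training")

val_gen = datagen.flow_from_directory(
    data_dir,
    target_size=(img_height, img_width),
    batch_size=batch_size,
    class_mode="input",
    subset="validation")

autoencoder.fit(train_gen, validation_data=val_gen, epochs=50)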
Since the link posted by "dogvarog" provides a full example, I just want to add my two cents to simplify the answer and also show a train/validate split method when working with a Dataset:
df_auto_train = df_train.map(lambda x, y: (x, x))
X_auto_train = df_auto_train.take(int(0.9*len(df_auto_train)))
X_auto_validate = df_auto_train.skip(int(0.9*len(df_auto_train)))
Just use the "map" function to change the labels to the images themselves.
If you want a train + validation split, you can't use validation_split with a Dataset; use "take" and "skip" instead, as in the sketch below.
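Putting it together, a hedged sketch of the whole pipeline, assuming df_train comes from image_dataset_from_directory with inferred labels that the map then throws away (directory and size names are the question's placeholders):

df_train = tf.keras.utils.image_dataset_from_directory(
    data_dir,
    image_size=(img_height, img_width),
    batch_size=batch_size)

df_auto_train = df_train.map(lambda x, y: (x, x))   # labels become the images

n_batches = len(df_auto_train)
X_auto_train = df_auto_train.take(int(0.9 * n_batches))
X_auto_validate = df_auto_train.skip(int(0.9 * n_batches))

autoencoder.fit(X_auto_train, validation_data=X_auto_validate, epochs=50)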
Reposting my original question since even after significant improvements to clarity, it was not revived by the community.
I am looking for a way to split feature and corresponding label data into train and test using TensorFlow inbuilt methods. My data is already in two tensors (i.e. tf.Tensor objects), named features and labels.
I know how to do this easily for numpy arrays using sklearn.model_selection as shown in this post. Additionally, I was pointed to this method which requires the data to be in a single tensor. Also, I need the train and test sets to be disjoint, unlike in this method (meaning they can't have common data points after the split).
I am looking for a way to do the same using built-in methods in Tensorflow.
There may be too many conditions in my requirement, but basically what is needed is an equivalent method to sklearn.model_selection.train_test_split() in Tensorflow such as the below:
import tensorflow as tf

X_train, X_test, y_train, y_test = tf.train_test_split(features,
                                                        labels,
                                                        test_size=0.1,
                                                        random_state=123)
You can achieve this using TF in the following way:
from typing import Tuple
import tensorflow as tf


def split_train_test(features: tf.Tensor,
                     labels: tf.Tensor,
                     test_size: float,
                     random_state: int = 1729) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor]:
    # Generate random masks
    random = tf.random.uniform(shape=(tf.shape(features)[0],), seed=random_state)
    train_mask = random >= test_size
    test_mask = random < test_size

    # Gather values
    train_features, train_labels = tf.boolean_mask(features, mask=train_mask), tf.boolean_mask(labels, mask=train_mask)
    test_features, test_labels = tf.boolean_mask(features, mask=test_mask), tf.boolean_mask(labels, mask=test_mask)

    return train_features, test_features, train_labels, test_labels
What we are doing here is first creating a random uniform tensor with the same length as the data.
Then we create boolean masks according to the ratio given by test_size, and finally we extract the relevant parts for train/test using tf.boolean_mask.
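A hypothetical usage example with random data, just to show the call signature (the split is only approximately 90/10 because the mask is drawn randomly):

features = tf.random.normal(shape=(1000, 8))
labels = tf.random.uniform(shape=(1000,), maxval=2, dtype=tf.int32)

train_x, test_x, train_y, test_y = split_train_test(features, labels, test_size=0.1)
print(train_x.shape, test_x.shape)   # roughly 900 and 100 rows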
I want to get a confusion matrix,
but for that I need the set of predicted items and labels.
How can I get this data from tflearn, for example for this example (Pannous speech_data): https://github.com/llSourcell/tensorflow_speech_recognition_demo/blob/master/demo.py
Thanks!
model.fit(trainX, trainY, n_epoch=10, validation_set=(testX, testY), show_metric=True, batch_size=batch_size)

_y = model.predict(X)
predictions.append(_y)
labels.append(trainY)
bp()
confusionMat = tf.confusion_matrix(labels, predictions, num_classes=classes, dtype=tf.int32, name=None, weights=None)
print(np.matrix(confusionMat))
_y = model.predict(X)  # predictions
y = train_Y            # I think this is the actual labels data

tf.confusion_matrix(
    labels,            # put y here
    predictions,       # put _y here
    num_classes=None,
    dtype=tf.int32,
    name=None,
    weights=None
)
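For completeness, a hedged sketch of how these pieces might fit together in this TF1/tflearn setting: model.predict returns class probabilities, so take the argmax before building the matrix, and evaluate the resulting tensor in a session. This assumes testY is one-hot encoded; if it already holds class indices, drop the second argmax.

import numpy as np
import tensorflow as tf

probs = model.predict(testX)              # shape: (n_samples, n_classes)
predictions = np.argmax(probs, axis=1)    # predicted class indices
labels = np.argmax(testY, axis=1)         # true class indices (from one-hot)

conf_mat = tf.confusion_matrix(labels, predictions, num_classes=classes)
with tf.Session() as sess:
    print(sess.run(conf_mat))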