Problem with shapes of experimental Tensorflow dataset - numpy

I am trying to store numpy arrays in a Tensorflow dataset. The model fits correctly when I use the numpy arrays directly as train and test data, but not when I store them in a single Tensorflow dataset. The problem is with the dimensions of the dataset: something is wrong even though the shapes look fine at first sight.
After trying multiple things to reshape my Tensorflow dataset, I am still unable to get it working. My code is the following:
train_x.shape
Out[54]: (7200, 40)
train_y.shape
Out[55]: (7200,)
dataset = tf.data.Dataset.from_tensor_slices((train_x, train_y))
print(dataset)
Out[56]: <TensorSliceDataset shapes: ((40,), ()), types: (tf.int32, tf.int32)>
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
history = model.fit(dataset, epochs=EPOCHS, batch_size=256)
... in sparse_softmax_cross_entropy_with_logits
    logits.get_shape()))
ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (40, 1351)).
I have seen this answer, but I am sure it doesn't apply here: I must use sparse_categorical_crossentropy. I am basing my code on this example, where I want to store the train and test data in a Tensorflow dataset. I also need to keep the arrays in a dataset because I will have to use it later.

You can't pass batch_size to model.fit() when the input is a tf.data.Dataset. Instead, batch the dataset with tf.data.Dataset.batch(). You'll have to change your code as follows for it to work.
import numpy as np
import tensorflow as tf
# Some toy data
train_x = np.random.normal(size=(7200, 40))
train_y = np.random.choice([0,1,2], size=(7200))
dataset = tf.data.Dataset.from_tensor_slices((train_x,train_y))
dataset = dataset.batch(256)
#### - Define your model here - ####
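# For example (an assumption; the question does not show the real architecture),
# any classifier over the 3 toy classes above would do:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(40,)),
    tf.keras.layers.Dense(3, activation='softmax')  # 3 toy classes
])
optimizer = tf.keras.optimizers.Adam()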
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
history = model.fit(dataset, epochs=EPOCHS)

Related

Learning a Categorical Variable with TensorFlow Probability

I would like to use TFP to write a neural network whose outputs are the probabilities of a categorical variable with 3 classes, and train it using the negative log-likelihood.
As I'm taking my first steps with TF and TFP, I started with a toy model where the input layer has only 1 unit receiving a null input, and the output layer has 3 units with a softmax activation function. The idea is that the biases should learn (up to an additive constant) the log of the probabilities.
Below is my code; true_p are the true parameters I use to generate the data and would like to learn, while learned_p is what I get from the NN.
import numpy as np
import tensorflow as tf
from tensorflow import keras
from functions import nll  # user-defined negative log-likelihood loss
from tensorflow.keras.optimizers import SGD
import tensorflow.keras.layers as layers
import tensorflow_probability as tfp
tfd = tfp.distributions
# params
true_p = np.array([0.1, 0.7, 0.2])
n_train = 1000
# training data
x_train = np.array(np.zeros(n_train)).reshape((n_train,))
y_train = np.array(np.random.choice(len(true_p), size=n_train, p=true_p)).reshape((n_train,))
# model
input_layer = layers.Input(shape=(1,))
p_layer = layers.Dense(len(true_p), activation=tf.nn.softmax)(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(p_layer)
model_p = keras.models.Model(inputs=input_layer, outputs=p_y)
model_p.compile(SGD(), loss=nll)
# training
hist_p = model_p.fit(x=x_train, y=y_train, batch_size=100, epochs=3000, verbose=0)
# check result
learned_p = np.round(model_p.layers[1].call(tf.constant([0], shape=(1, 1))).numpy(), 3)
learned_p
With this setup, I get the result:
>>> learned_p
array([[0.005, 0.989, 0.006]], dtype=float32)
I over-estimate the second category and can't really distinguish between the first and the third. What's worse, if I plot the probabilities at the end of each epoch, they appear to converge monotonically to the vector [0, 1, 0], which doesn't make sense (it seems to me the gradient should push in the opposite direction once I start to over-estimate).
I really can't figure out what's going on here, but I have the feeling I'm doing something plainly wrong. Any idea? Thank you for your help!
For the record, I also tried other optimizers such as Adam and Adagrad and played with the hyper-parameters, but with no luck.
I'm using Python 3.7.9, TensorFlow 2.3.1 and TensorFlow probability 0.11.1
I believe the default argument to Categorical is not the vector of probabilities but the vector of logits (the values you'd take the softmax of to get probabilities). This helps maintain precision in internal Categorical computations such as log_prob. I think you can simply remove the softmax activation function and it should work. Please update if it doesn't!
EDIT: alternatively you can replace the tfd.Categorical with
lambda p: tfd.Categorical(probs=p)
but you'll lose the aforementioned precision gains. Just wanted to clarify that passing probs is an option, just not the default.
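For reference, here is a minimal sketch of the logits variant, assuming the rest of the model stays as in the question (only the two layers concerned change):
# Pass raw logits to the Categorical distribution: no softmax on the Dense layer
logit_layer = layers.Dense(len(true_p))(input_layer)
p_y = tfp.layers.DistributionLambda(tfd.Categorical)(logit_layer)
The learned probabilities can then be recovered by applying a softmax to the learned logits.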

fine-tuning huggingface DistilBERT for multi-class classification on custom dataset yields weird output shape on prediction

I'm trying to fine-tune huggingface's implementation of distilbert for multi-class classification (100 classes) on a custom dataset, following the tutorial at https://huggingface.co/transformers/custom_datasets.html.
I'm doing so using Tensorflow, fine-tuning in native tensorflow; that is, I use the following part of the tutorial for dataset creation:
import tensorflow as tf
train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    train_labels
))
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    val_labels
))
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    test_labels
))
And this one for fine-tuning:
from transformers import TFDistilBertForSequenceClassification
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
Everything seems to go fine with fine-tuning, but when I try to predict on the test dataset with model.predict(test_dataset) (2000 examples), the model seems to yield one prediction per token rather than one prediction per sequence...
That is, instead of getting an output of shape (1, 2000, 100), I get an output of shape (1, 1024000, 100), where 1024000 is the number of test examples (2000) times the sequence length (512).
Any hint on what's going on here?
(Sorry if this is naive, I'm very new to tensorflow).
I had exactly the same problem. I do not know why it happens, as it should be the right code judging by the tutorial.
But for me it worked to build numpy arrays out of train_encodings and pass them directly to the fit method instead of creating the Dataset.
x1 = np.array(list(dict(train_encodings).values()))[0]  # typically the input_ids
x2 = np.array(list(dict(train_encodings).values()))[1]  # typically the attention_mask
model.fit([x1, x2], train_labels, epochs=20)
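A slightly more explicit variant of the same idea, assuming the tokenizer produced the usual input_ids and attention_mask keys, is to pull the arrays out by name and pass them as a dict:
x_ids = np.array(train_encodings['input_ids'])
x_mask = np.array(train_encodings['attention_mask'])
model.fit({'input_ids': x_ids, 'attention_mask': x_mask}, train_labels, epochs=20)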

How to perform sklearn style train-test split on feature and label tensors using built in tensorflow methods?

Reposting my original question since, even after significant improvements to clarity, it was not revived by the community.
I am looking for a way to split feature and corresponding label data into train and test using TensorFlow inbuilt methods. My data is already in two tensors (i.e. tf.Tensor objects), named features and labels.
I know how to do this easily for numpy arrays using sklearn.model_selection as shown in this post. Additionally, I was pointed to this method which requires the data to be in a single tensor. Also, I need the train and test sets to be disjoint, unlike in this method (meaning they can't have common data points after the split).
I am looking for a way to do the same using built-in methods in Tensorflow.
There may be too many conditions in my requirement, but basically what is needed is an equivalent method to sklearn.model_selection.train_test_split() in Tensorflow such as the below:
import tensorflow as tf
X_train, X_test, y_train, y_test = tf.train_test_split(features,
                                                        labels,
                                                        test_size=0.1,
                                                        random_state=123)
You can achieve this by using TF in the following way:
from typing import Tuple
import tensorflow as tf
def split_train_test(features: tf.Tensor,
                     labels: tf.Tensor,
                     test_size: float,
                     random_state: int = 1729) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor, tf.Tensor]:
    # Generate random masks
    random = tf.random.uniform(shape=(tf.shape(features)[0],), seed=random_state)
    train_mask = random >= test_size
    test_mask = random < test_size
    # Gather values
    train_features, train_labels = tf.boolean_mask(features, mask=train_mask), tf.boolean_mask(labels, mask=train_mask)
    test_features, test_labels = tf.boolean_mask(features, mask=test_mask), tf.boolean_mask(labels, mask=test_mask)
    return train_features, test_features, train_labels, test_labels
What we are doing here is first creating a uniform random tensor with one value per row of the data.
Then we build boolean masks according to the ratio given by test_size, and finally we extract the corresponding train/test rows with tf.boolean_mask.
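A quick usage sketch with dummy tensors (the shapes here are just for illustration):
features = tf.random.normal(shape=(1000, 40))
labels = tf.random.uniform(shape=(1000,), maxval=3, dtype=tf.int32)
X_train, X_test, y_train, y_test = split_train_test(features, labels, test_size=0.1)
print(X_train.shape, X_test.shape)  # roughly a 90/10 split of the 1000 rows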

Tf.Dataset with Keras returning a ValueError

Getting a ValueError related to shape when passing a Tensorflow Dataset into Keras's model.fit function.
My dataset's X_train has shape (100 samples x 62 features) and Y_train has shape (100 samples x 1 label).
Reproducible code below:
import numpy as np
from tensorflow.keras import layers, Sequential, optimizers
from tensorflow.data import Dataset
num_samples = 100
num_features = 62
num_labels = 1
batch_size = 32
steps_per_epoch = int(num_samples/batch_size)
X_train = np.random.rand(num_samples,num_features)
Y_train = np.random.rand(num_samples, num_labels)
final_dataset = Dataset.from_tensor_slices((X_train, Y_train))
model = Sequential()
model.add(layers.Dense(256, activation='relu',input_shape=(num_features,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(num_labels, activation='softmax'))
model.compile(optimizer=optimizers.Adam(0.001), loss='categorical_crossentropy',metrics=['accuracy'])
history = model.fit(final_dataset,epochs=10,batch_size=batch_size,steps_per_epoch = steps_per_epoch)
The error is:
ValueError: Error when checking input: expected dense_input to have shape (62,) but got array with shape (1,)
Why is dense_input getting an array with shape (1,)? I am clearly passing it an X_train of shape (n_samples, n_features).
Interestingly, the error goes away if I apply batch(some number) to the dataset, but it seems like I am missing something.
This is intended behavior.
When you use a Tensorflow Dataset, you shouldn't specify the batch_size in the fit method of Model. Instead, as you mentioned, you have to generate the batches on the dataset itself.
As mentioned here in the documentation
batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32. Do not specify the batch_size if your data is in the form of symbolic tensors, dataset, dataset iterators, generators, or keras.utils.Sequence instances (since they generate batches).
The classic approach is therefore to do as you did: generate the batches with the dataset.
Also use repeat if you want to perform multiple epochs. On the .fit side you'll have to specify steps_per_epoch to indicate how many batches make up one epoch, and epochs for your number of epochs.
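Applied to the code in the question, that would look roughly like this (keeping the existing batch_size and steps_per_epoch values):
final_dataset = final_dataset.batch(batch_size).repeat()
history = model.fit(final_dataset, epochs=10, steps_per_epoch=steps_per_epoch)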

How do I resolve an InvalidArgumentError in Classifier model?

New to TensorFlow, so apologies for the newbie question.
I'm following this tutorial, but instead of image data I am using numerical data.
Load the dataset:
train_dataset_url = "xxx.csv"
train_dataset_fp = tf.keras.utils.get_file(
    fname=os.path.basename(train_dataset_url),
    origin=train_dataset_url)
Make training dataset:
batch_size = 32
train_dataset = tf.contrib.data.make_csv_dataset(
    train_dataset_fp,
    batch_size,
    column_names=column_names,
    label_name=label_name,
    num_epochs=1)
Define the classifier model using:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(1,)),
    tf.keras.layers.Dense(10, activation=tf.nn.relu),
    tf.keras.layers.Dense(4)
])
But when I "test" the model with the same inputs:
predictions = model(features)
I receive the error:
InvalidArgumentError: cannot compute MatMul as input #0(zero-based) was expected to be a float tensor but is a int32 tensor [Op:MatMul]
It's possible I have missed something fundamental. I feel like I need to specify a type somewhere.
The data you feed into the model is, I assume, a numpy array. The error states that the model requires a tensor with dtype float32 or float64, but you are providing an int32 numpy array. So, wherever you create a numpy array, set its dtype to float32.
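For example, assuming features here is the int32 batch coming from the question's dataset, casting it (or building the array as float32 in the first place) resolves the error:
# Cast the int32 features to float32 before calling the model
features = tf.cast(features, tf.float32)
predictions = model(features)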