Tf.Dataset with Keras returning a ValueError - tensorflow

Getting a ValueError related to shape when passing Tensorflow Dataset into a Keras's model.fit function.
My dataset's X_train has shape (100 samples x 62 features) and Y_train is (100 samples x 1 label
Reproducible code below:
import numpy as np
from tensorflow.keras import layers, Sequential, optimizers
from tensorflow.data import Dataset
num_samples = 100
num_features = 62
num_labels = 1
batch_size = 32
steps_per_epoch = int(num_samples/batch_size)
X_train = np.random.rand(num_samples,num_features)
Y_train = np.random.rand(num_samples, num_labels)
final_dataset = Dataset.from_tensor_slices((X_train, Y_train))
model = Sequential()
model.add(layers.Dense(256, activation='relu',input_shape=(num_features,)))
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(num_labels, activation='softmax'))
model.compile(optimizer=optimizers.Adam(0.001), loss='categorical_crossentropy',metrics=['accuracy'])
history = model.fit(final_dataset,epochs=10,batch_size=batch_size,steps_per_epoch = steps_per_epoch)
The error is:
ValueError: Error when checking input: expected dense_input to have shape (62,) but got array with shape (1,)
Why is the dense_input getting an array with shape (1,)? I am clearly passing it an X_train of shape (n_samples, n_features).
Interestingly the error goes away if I were to apply a batch(some number) function to the dataset, but seems like I am missing something.

It's an intended behavior.
When you use Tensorflow Dataset , you shouldn't specify the batch_size in the fit method of 'Model'. Instead as you mentioned you have to generate the batches using the function with tensorflow dataset.
As mentioned here in the documentation
batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32. Do not specify the batch_size if your data is in the form of symbolic tensors, dataset, dataset iterators, generators, or keras.utils.Sequence instances (since they generate batches).
The classical behavior is therefore to do as you did: generate the batches with the dataset.
Also use repeat if you want to perform multiple epochs. On the .fit side you'll have to specify the steps_per_epoch to indicate how many batch is one epoch and epochs for your number of epochs.

Related

Problem with shapes of experimental Tensorflow dataset

I am trying to store numpy arrays in a Tensorflow dataset. The model fits correctly when using the numpy arrays as train and test data but not when I store the numpy arrays in a single Tensorflow dataset. The problem is with the dimensions of the dataset. Something is wrong even though shapes seem OK at first sight.
After trying multiple things to reshape my Tensorflow dataset, I am still unable to get it working. My code is the following:
train_x.shape
Out[54]: (7200, 40)
train_y.shape
Out[55]: (7200,)
dataset = tf.data.Dataset.from_tensor_slices((x,y))
print(dataset)
Out[56]: <TensorSliceDataset shapes: ((40,), ()), types: (tf.int32, tf.int32)>
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
history = model.fit(dataset, epochs=EPOCHS, batch_size=256)
sparse_softmax_cross_entropy_with_logits
logits.get_shape()))
ValueError: Shape mismatch: The shape of labels (received (1,)) should equal the shape of logits except for the last dimension (received (40, 1351)).
I have seen this answer but I am sure it doesn't apply here. I must use sparse_categorical_crossentropy. I am inspiring myself from this example where I want to store the train and test data in a Tensorflow dataset. I also want to store the arrays in a dataset as I will have to use it later.
You can't use batch_size with model.fit() when using a tf.data.Dataset. Instead use tf.data.Dataset.batch(). You'll have to change your code as follows for it to work.
import numpy as np
import tensorflow as tf
# Some toy data
train_x = np.random.normal(size=(7200, 40))
train_y = np.random.choice([0,1,2], size=(7200))
dataset = tf.data.Dataset.from_tensor_slices((train_x,train_y))
dataset = dataset.batch(256)
#### - Define your model here - ####
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
history = model.fit(dataset, epochs=EPOCHS)

Resnet based (Tensorflow Keras) Siamese Model providing `nan` validation loss in training when using TripletHardLoss (Semi too)

I have a model which I built on top of ResNet. I am using 25k Similar type of Images. My images have text as well as some diagram. When I used the Euclidean Distance + Binary loss, I got an accuracy of 95% with Inception but same with Triplet Hard/ Semi Hard Loss gave me nan loss and almost 0 accuracy. Please tell me if there is something wrong with the code structure.
import tensorflow_addons as tfa
from tensorflow.keras.applications.resnet50 import preprocess_input as res50_pre, ResNet50
shape = (224,224,3)
lr = 0.001
loss = tfa.losses.TripletSemiHardLoss()
epochs = 50
batch_size = 128 #254 gives 'log' referenced before assignment error
datagen = ImageDataGenerator(preprocessing_function=res50_pre,validation_split=0.2)
train_data = datagen.flow_from_dataframe(df,x_col='path',y_col='label',class_mode='sparse',target_size=(224,224),
batch_size=batch_size,subset='training',seed=SEED)
val_data = datagen.flow_from_dataframe(df,x_col='path',y_col='label',class_mode='sparse',target_size=(224,224),
batch_size=batch_size,subset='validation',seed=SEED)
base_model = ResNet50(weights='imagenet',input_shape=shape,include_top=False,pooling='avg')
base_model.trainable = True
inputs = keras.Input(shape=shape)
x = base_model(inputs,training=True)
outputs = keras.layers.Lambda(lambda x: tf.math.l2_normalize(x, axis=1))(x) # L2 normalize embeddings
model = keras.Model(inputs, outputs)
for layer in model.layers: # set all the parameters trainable
layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(lr),loss=loss,metrics=['accuracy'])
history = model.fit(train_data,epochs=epochs,steps_per_epoch=len(train_data)//batch_size,validation_data=val_data,verbose=2)
My group has values like 1,2,3 [Not in order and some missing] which represent the same type of data. I used Sparse after converting the value to str(1), str(3) etc.
My DataFrame looks like this:
increase batch size to reduce probability of a mini batch not including any triplets.
Edit: I published a package for generating TF/Keras balanced batches to solve this problem https://github.com/ma7555/kerasgen

Stateful LSTM Tensorflow Invalid Input_h Shape Error

I am experimenting with stateful LSTM on a time-series regression problem by using TensorFlow. I apologize that I cannot share the dataset.
Below is my code.
train_feature = train_feature.reshape((train_feature.shape[0], 1, train_feature.shape[1]))
val_feature = val_feature.reshape((val_feature.shape[0], 1, val_feature.shape[1]))
batch_size = 64
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(50, batch_input_shape=(batch_size, train_feature.shape[1], train_feature.shape[2]), stateful=True))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='adam',
loss='mse',
metrics=[tf.keras.metrics.RootMeanSquaredError()])
model.fit(train_feature, train_label,
epochs=10,
batch_size=batch_size)
When I run the above code, after the end of the first epoch, I will get an error as follows.
InvalidArgumentError: [_Derived_] Invalid input_h shape: [1,64,50] [1,49,50]
[[{{node CudnnRNN}}]]
[[sequential_1/lstm_1/StatefulPartitionedCall]] [Op:__inference_train_function_1152847]
Function call stack:
train_function -> train_function -> train_function
However, the model will be successfully trained if I change the batch_size to 1, and change the code for model training to the following.
total_epochs = 10
for i in range(total_epochs):
model.fit(train_feature, train_label,
epochs=1,
validation_data=(val_feature, val_label),
batch_size=batch_size,
shuffle=False)
model.reset_states()
Nevertheless, with a very large data (1 million rows), the model training will take a very long time since the batch_size is 1.
So, I wonder, how to train a stateful LSTM with a batch size larger than 1 (e.g. 64), without getting the invalid input_h shape error?
Thanks for your answers.
The fix is to ensure batch size never changes between batches. They must all be the same size.
Method 1
One way is to use a batch size that perfectly divides your dataset into equal-sized batches. For example, if total size of data is 1500 examples, then use a batch size of 50 or 100 or some other proper divisor of 1500.
batch_size = len(data)/proper_divisor
Method 2
The other way is to ignore any batch that is less than the specified size, and this can be done using the TensorFlow Dataset API and setting the drop_remainder to True.
batch_size = 64
train_data = tf.data.Dataset.from_tensor_slices((train_feature, train_label))
train_data = train_data.repeat().batch(batch_size, drop_remainder=True)
steps_per_epoch = len(train_feature) // batch_size
model.fit(train_data,
epochs=10, steps_per_epoch = steps_per_epoch)
When using the Dataset API like above, you will need to also specify how many rounds of training count as an epoch (essentially how many batches to count as 1 epoch). A tf.data.Dataset instance (the result from tf.data.Dataset.from_tensor_slices) doesn't know the size of the data that it's streaming to the model, so what constitutes as one epoch has to be manually specified with steps_per_epoch.
Your new code will look like this:
train_feature = train_feature.reshape((train_feature.shape[0], 1, train_feature.shape[1]))
val_feature = val_feature.reshape((val_feature.shape[0], 1, val_feature.shape[1]))
batch_size = 64
train_data = tf.data.Dataset.from_tensor_slices((train_feature, train_label))
train_data = train_data.repeat().batch(batch_size, drop_remainder=True)
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(50, batch_input_shape=(batch_size, train_feature.shape[1], train_feature.shape[2]), stateful=True))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='adam',
loss='mse',
metrics=[tf.keras.metrics.RootMeanSquaredError()])
steps_per_epoch = len(train_feature) // batch_size
model.fit(train_data,
epochs=10, steps_per_epoch = steps_per_epoch)
You can also include the validation set as well, like this (not showing other code):
batch_size = 64
val_data = tf.data.Dataset.from_tensor_slices((val_feature, val_label))
val_data = val_data.repeat().batch(batch_size, drop_remainder=True)
validation_steps = len(val_feature) // batch_size
model.fit(train_data, epochs=10,
steps_per_epoch=steps_per_epoch,
validation_steps=validation_steps)
Caveat: This means a few datapoints will never be seen by the model. To get around that, you can shuffle the dataset each round of training, so that the datapoints left behind each epoch changes, giving everyone a chance to be seen by the model.
buffer_size = 1000 # the bigger the slower but more effective shuffling.
train_data = tf.data.Dataset.from_tensor_slices((train_feature, train_label))
train_data = train_data.shuffle(buffer_size=buffer_size, reshuffle_each_iteration=True)
train_data = train_data.repeat().batch(batch_size, drop_remainder=True)
Why the error occurs
Stateful RNNs and their variants (LSTM, GRU, etc.) require fixed batch size. The reason is simply because statefulness is one way to realize Truncated Backprop Through Time, by passing the final hidden state for a batch as the initial hidden state of the next batch. The final hidden state for the first batch has to have exactly the same shape as the initial hidden state of the next batch, which requires that batch size stay the same across batches.
When you set the batch size to 64, model.fit will use the remaining data at the end of an epoch as a batch, and this may not have up to 64 datapoints. So, you get such an error because the batch size is different from what the stateful LSTM expects. You don't have the problem with batch size of 1 because any remaining data at the end of an epoch will always contain exactly 1 datapoint, so no errors. More generally, 1 is always a divisor of any integer. So, if you picked any other divisor of your data size, you should not get the error.
In the error message you posted, it appears the last batch has size of 49 instead of 64. On a side note: The reason the shapes look different from the input is because, under the hood, keras works with the tensors in time_major (i.e. the first axis is for steps of sequence). When you pass a tensor of shape (10, 15, 2) that represents (batch_size, steps_per_sequence, num_features), keras reshapes it to (15, 10, 2) under the hood.

Keras CNN overfitting for more than four classes

I'm trying to train a classifier on Google QuickDraw drawings using Keras:
import numpy as np
from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D
from tensorflow.keras.models import Sequential
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=5, data_format="channels_last", activation="relu", input_shape=(28, 28, 1)))
model.add(MaxPooling2D(data_format="channels_last"))
model.add(Conv2D(filters=16, kernel_size=3, data_format="channels_last", activation="relu"))
model.add(MaxPooling2D(data_format="channels_last"))
model.add(Flatten(data_format="channels_last"))
model.add(Dense(units=128, activation="relu"))
model.add(Dense(units=64, activation="relu"))
model.add(Dense(units=4, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
x = np.load("./x.npy")
y = np.load("./y.npy")
model.fit(x=x, y=y, batch_size=100, epochs=40, validation_split=0.2)
The input data is a 4d array with 12000 normalized images (28 x 28 x 1) per class. The output data is an array of one hot encoded vectors.
If I train this model on four classes, it produces convincing results:
(red is training data, blue is validation data)
I know the model is slightly overfitted. However, I want to keep the architecture as simple as possible, so I accepted that.
My problem is that as soon as I add just one arbitrary class, the model starts to overfit extremely:
I tried many different things to prevent it from overfitting such as Batch Normalization, Dropout, Kernel Regularizers, much more training data and different batch sizes, none of which caused any significant improvement.
What could be the reason why my CNN overfits so much?
EDIT: This is the code I used to create x.npy and y.npy:
import numpy as np
from tensorflow.keras.utils import to_categorical
files = ['cat.npy', 'dog.npy', 'apple.npy', 'banana.npy', 'flower.npy']
SAMPLES = 12000
x = np.concatenate([np.load(f'./data/{f}')[:SAMPLES] for f in files]) / 255.0
y = np.concatenate([np.full(SAMPLES, i) for i in range(len(files))])
# (samples, rows, cols, channels)
x = x.reshape(x.shape[0], 28, 28, 1).astype('float32')
y = to_categorical(y)
np.save('./x.npy', x)
np.save('./y.npy', y)
The .npy files come from here.
The problem lies with how the data split is done. Notice that there are 5 classes and you do 0.2 validation split. By default there's no shuffling and in your code you feed the data in a sequential order. What that means:
Training data consists entirely of 4 classes: 'cat.npy', 'dog.npy', 'apple.npy', 'banana.npy'. That's the 0.8 training split.
Test data is 'flower.npy'. That's your 0.2 validation split. The model was never trained on this so it gets terrible accuracy.
Such results are only possible thanks to the fact that the validation_split=0.2, so you get close to perfect class separation.
Solution
x = np.load("./x.npy")
y = np.load("./y.npy")
# Shuffle the data!
p = np.random.permutation(len(x))
x = x[p]
y = y[p]
model.fit(x=x, y=y, batch_size=100, epochs=40, validation_split=0.2)
if my hypothesis is correct, setting the validation_split to e.g. 0.5 should also get you much better results (though it's not a solution).

How do I resolve an InvalidArgumentError in Classifier model?

New to TensorFlow, so apologies for newbie question.
Following this tutorial but instead of using image data I am using numerical data.
Load the dataset:
train_dataset_url = "xxx.csv"
train_dataset_fp = tf.keras.utils.get_file(
fname=os.path.basename(train_dataset_url),
origin=train_dataset_url)
Make training dataset:
batch_size = 32
train_dataset = tf.contrib.data.make_csv_dataset(
train_dataset_fp,
batch_size,
column_names=column_names,
label_name=label_name,
num_epochs=1)
Train classified model using:
model = tf.keras.Sequential([
tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(1,)),
tf.keras.layers.Dense(10, activation=tf.nn.relu),
tf.keras.layers.Dense(4)
])
But when I "test" the model with the same inputs:
predictions = model(features)
I receive the error:
InvalidArgumentError: cannot compute MatMul as input #0(zero-based) was expected to be a float tensor but is a int32 tensor [Op:MatMul]
It's possible I have missed something fundamental. I feel like I need to specify a type somewhere.
The data which you feed in the model is a numpy array according to my assumption . The error states that the model requires a tensor with dtype=float32 or float64. You are providing a int32 numpy array. So, wherever you create a numpy array, just mention the dtype as float32.