Training loss is NaN - but training data is all in range without null - TensorFlow

When I execute model.fit(x_train_lstm, y_train_lstm, epochs=3, shuffle=False, verbose=2)
I always get a loss of nan:
Epoch 1/3
73/73 - 5s - loss: nan - accuracy: 0.5417 - 5s/epoch - 73ms/step
Epoch 2/3
73/73 - 5s - loss: nan - accuracy: 0.5417 - 5s/epoch - 74ms/step
Epoch 3/3
73/73 - 5s - loss: nan - accuracy: 0.5417 - 5s/epoch - 73ms/step
My x_train has shape (2475, 48) and y_train has shape (2475,).
From these I derive an input training set of shape (2315, 160, 48): 2315 training samples, a lookback time window of 160, and 48 features.
Correspondingly, y_train is 0 or 1, with shape (2315, 1).
All values are in the range (-1, 1):
My model is like this:
Model: "sequential_2"
Layer (type) Output Shape Param #
lstm_6 (LSTM) (None, 160, 128) 90624
dropout_4 (Dropout) (None, 160, 128) 0
lstm_7 (LSTM) (None, 160, 64) 49408
dropout_5 (Dropout) (None, 160, 64) 0
lstm_8 (LSTM) (None, 32) 12416
dense_2 (Dense) (None, 1) 33
Total params: 152,481
Trainable params: 152,481
Non-trainable params: 0
I tried different numbers of LSTM units (48, 60, 128, 160); none of them work.
I checked my training data: all values are in the range (-1, 1).
There are no nulls in my dataset: x_train.isnull().values.any() outputs False.
Now I have no idea what else to try.
My model code is:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.layers import Dropout
def create_model(win = 100, features = 9):
    model = Sequential()
    # Dropout layers sit between the LSTM layers in the summary above; their rate is not shown
    model.add(LSTM(units=128, activation='relu', input_shape=(win, features),
                   return_sequences=True))
    model.add(LSTM(units=64, activation='relu', return_sequences=True))
    # no need to return sequences from the last LSTM layer
    model.add(LSTM(units=32, activation='relu'))
    # adding the output layer
    model.add(Dense(units=1, activation='sigmoid'))
    # may also try mean_squared_error
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
here I plot some train_y samples:

Two things: try normalizing your time series data, and note that using relu as the activation function for LSTM layers is not 'conventional' (the Keras default is tanh). Check this post for further insights. An example:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from tensorflow.keras.layers import Dropout
import tensorflow as tf
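# Normalization layer: adapt() computes the per-feature mean and variance of x
layer = tf.keras.layers.Normalization(axis=-1)
x = tf.random.normal((500, 100, 9))
y = tf.random.uniform((500, ), dtype=tf.int32, maxval=2)
layer.adapt(x)
def create_model(win = 100, features = 9):
    model = Sequential()
    model.add(tf.keras.Input(shape=(win, features)))
    model.add(layer)
    model.add(LSTM(units=128, activation='tanh', return_sequences=True))
    model.add(LSTM(units=64, activation='tanh', return_sequences=True))
    # last LSTM layer returns only its final state (32 units, as in the question's summary)
    model.add(LSTM(units=32, activation='tanh'))
    model.add(Dense(units=1, activation='sigmoid'))
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
model = create_model()
model.fit(x, y, epochs=20)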

2022.Mar.17 update
After further debugging, I eventually found that the problem was caused by a newly added feature that contains np.inf. After I removed those rows, the problem was resolved and I can see the loss value now:
6/6 [==============================] - 2s 50ms/step - loss: 0.6936 - accuracy: 0.5176
Note that np.inf is signed, so make sure both np.inf and -np.inf are removed:
all_dataset = all_dataset[all_dataset.feature_issue != np.inf]
all_dataset = all_dataset[all_dataset.feature_issue != -np.inf]
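It may help to see why a NaN check alone misses this: infinity is not NaN, yet an infinite input can turn into NaN inside the network (for example through inf - inf or 0 * inf). A minimal sketch of a finiteness check with NumPy, assuming the features are available as a float array; the array `features` below is made-up illustration data, not the dataset from this post:

```python
import numpy as np

# Hypothetical feature matrix: 4 rows, 3 features; values are made up.
features = np.array([
    [0.1, -0.5, 0.3],
    [0.2, np.inf, 0.0],
    [-0.9, 0.4, -np.inf],
    [0.5, 0.5, 0.5],
])

# A NaN check alone reports nothing here, because inf is not NaN.
has_nan = bool(np.isnan(features).any())

# np.isfinite is False for NaN, +inf and -inf, so one mask covers all three.
finite_rows = np.isfinite(features).all(axis=1)
bad_columns = np.where(~np.isfinite(features).all(axis=0))[0]

# Keep only the rows where every feature is finite.
clean = features[finite_rows]

print(has_nan)
print(bad_columns)
print(clean.shape)
```

Listing `bad_columns` first is a quick way to find which newly added feature introduces the non-finite values before deciding whether to drop rows or fix the feature itself.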
After some debugging, I found that 2 of my newly added features actually cause the problem. So the problem comes from the data, but unlike in other reports, my data contains no NaN and no out-of-range values (originally I thought all the data just needed to be normalized).
But I can't tell the reason yet, as they look good.
I will continue researching this tomorrow; any advice is appreciated!


Keras model classifies images as the same class

I am using Keras' pre-trained VGGNet16 as my base model, but I need to add layers on the end to make it work for my data. I've got the data pre-processed and formatted, so I'll jump to the part of the code actually involving the CNN.
import tensorflow as tf
from tensorflow.keras.layers import Dense, Dropout, Activation, Flatten, Conv2D, MaxPooling2D, AveragePooling2D
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import tensorflow.keras as K
input_t = K.Input(shape=(224,224,3))
base_model = K.applications.VGG16(include_top=False,weights="imagenet",input_tensor=input_t)
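for layer in base_model.layers:
    layer.trainable = False
model = K.models.Sequential()
# layers as listed in the model summary below; the Dense activation is not shown in the post
model.add(base_model)
model.add(Flatten())
model.add(Dense(5))
model.compile(loss = 'binary_crossentropy',
              optimizer = Adam(learning_rate=1e-5),
              metrics = ['accuracy'])
# x_train and y_train are assumed names; the start of this call is missing from the post
model.fit(x_train, y_train,
          epochs = 3,
          validation_split = 0.1,
          validation_data =(x_test,y_test))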
This code yields the following output.
Epoch 1/3
14/14 [==============================] - 48s 3s/step - loss: 0.5887 - accuracy: 0.2287 - val_loss: 0.4951 - val_accuracy: 0.3000
Epoch 2/3
14/14 [==============================] - 48s 3s/step - loss: 0.5170 - accuracy: 0.2220 - val_loss: 0.4972 - val_accuracy: 0.2800
Epoch 3/3
14/14 [==============================] - 48s 3s/step - loss: 0.4982 - accuracy: 0.2265 - val_loss: 0.4975 - val_accuracy: 0.2200
Model: "sequential"
Layer (type) Output Shape Param #
vgg16 (Functional) (None, 7, 7, 512) 14714688
flatten (Flatten) (None, 25088) 0
dense (Dense) (None, 5) 125445
Total params: 14,840,133
Trainable params: 125,445
Non-trainable params: 14,714,688
Test loss: 0.4997282326221466
Test accuracy: 0.2720000147819519
All of my test images are being classified into the same class. Is there any reason for this?
Edit: Modified question upon realizing the model is running properly
Perhaps this isn't a great solution, but I fixed this by making the last six layers trainable and keeping the others frozen:
for layer in base_model.layers[:-6]:
    layer.trainable = False

Adapt Text Classifier Neural Net To Accept Multiple Categories

I'm trying to adjust the text classifier neural net in this Keras/TensorFlow tutorial to output multiple categories (more than 2). I think I can change the output layer to use a 'softmax' activation, but I'm not sure how to adjust the input layer.
Tutorial Link:
The tutorial is using movie review data and the only two categories are positive or negative so the model only uses an output layer with activation set to 'sigmoid'.
I have 16 categories represented using one-hot encoding.
Tutorial Example:
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
My Attempt:
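model = keras.Sequential()
model.add(keras.layers.Embedding(10000, 16))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(16, activation='softmax'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(x_train[:5000], y_train[:5000],
                    validation_data=(x_train[5000:], y_train[5000:]))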
ValueError: A target array with shape (5000, 1) was passed for an output of shape (None, 16) while using as loss binary_crossentropy. This loss expects targets to have the same shape as the output.
Data Shapes:
x_train[:5000] (5000, 2000)
y_train[:5000] (5000,16)
x_train[5000:] (1934, 2000)
y_train[5000:] (1934,16)
Model Summary:
Model: "sequential_16"
Layer (type) Output Shape Param #
embedding_15 (Embedding) (None, None, 16) 160000
global_average_pooling1d_15 (None, 16) 0
dense_30 (Dense) (None, 16) 272
dense_31 (Dense) (None, 16) 272
Total params: 160,544
Trainable params: 160,544
Non-trainable params: 0
Binary_crossentropy is for binary classification, but what you're looking for is categorical_crossentropy. Binary_crossentropy expects your y matrix to be (n_samples x 1), with values of 0 or 1. Categorical_crossentropy expects your y matrix to be (n_samples x n_categories), with the correct category labeled as 1 and the other categories labeled as 0. It sounds like the way you did one-hot-encoding would be correct, so you probably just need to change your loss function.
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
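To make the shape requirement concrete, here is a minimal NumPy sketch of what categorical cross-entropy computes for one-hot targets. It is an illustration of the formula, not the Keras implementation; the helper names, the random logits, and the example labels are all made up for this sketch:

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # y_true: one-hot, shape (n_samples, n_categories)
    # y_pred: probabilities of the same shape, each row summing to 1
    y_pred = np.clip(y_pred, eps, 1.0)
    return -(y_true * np.log(y_pred)).sum(axis=1)

n_samples, n_categories = 4, 16
rng = np.random.default_rng(0)
logits = rng.normal(size=(n_samples, n_categories))  # stand-in for the last layer's pre-activations
labels = np.array([0, 3, 15, 7])                     # hypothetical integer class ids
y_true = np.eye(n_categories)[labels]                # one-hot targets, shape (4, 16)

y_pred = softmax(logits)
loss = categorical_crossentropy(y_true, y_pred)      # one loss value per sample

print(y_true.shape, y_pred.shape, loss.shape)
```

Because the targets are one-hot, the sum over categories keeps only the log-probability of the correct class, which is why the target matrix must have one column per category, matching the 16-unit softmax output.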

CNN implementation using Keras and Tensorflow

I have created a CNN model using Keras and I am training it on the MNIST dataset. I got a reasonable accuracy of around 98%, which is what I expected:
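model = Sequential()
model.add(Conv2D(64, 5, activation="relu", input_shape=(28, 28, 1)))
model.add(MaxPooling2D())
model.add(Conv2D(64, 5, activation="relu"))
model.add(MaxPooling2D())
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(10, activation='softmax'))
# optimizer assumed to be Adam, matching the TensorFlow version below; the start of this call is missing
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(data.x_train, data.y_train,
          batch_size=256, validation_data=(data.x_test, data.y_test))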
Now I want to build the same model, but using vanilla Tensorflow, here is how I did that:
X = tf.placeholder(shape=[None, 784], dtype=tf.float32, name="X")
Y = tf.placeholder(shape=[None, 10], dtype=tf.float32, name="Y")
net = tf.reshape(X, [-1, 28, 28, 1])
net = tf.layers.conv2d(
net, filters=64, kernel_size=5, padding="valid", activation=tf.nn.relu)
net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)
net = tf.layers.conv2d(
net, filters=64, kernel_size=5, padding="valid", activation=tf.nn.relu)
net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)
net = tf.contrib.layers.flatten(net)
net = tf.layers.dense(net, name="dense1", units=256, activation=tf.nn.relu)
model = tf.layers.dense(net, name="output", units=10)
And here is how I train/test it:
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y, logits=model)
opt = tf.train.AdamOptimizer().minimize(loss)
accuracy = tf.cast(tf.equal(tf.argmax(model, 1), tf.argmax(Y, 1)), tf.float32)
with tf.Session() as sess:
    # variables must be initialized before the first training step
    sess.run(tf.global_variables_initializer())
    for batch in range(data.get_number_of_train_batches(batch_size)):
        x, y = data.get_next_train_batch(batch_size)
        sess.run([loss, opt], feed_dict={X: x, Y: y})
    for batch in range(data.get_number_of_test_batches(batch_size)):
        x, y = data.get_next_test_batch(batch_size)
        sess.run(accuracy, feed_dict={X: x, Y: y})
But the resulting accuracy of the model dropped to ~80%. What are the principal differences between my implementations of that model in Keras and TensorFlow? Why does the accuracy vary so much?
I don't see any mistakes in your code. Note that your current model is heavily parameterized for such a simple problem because of the Dense layers, which introduce over 260k trainable parameters:
Layer (type) Output Shape Param #
conv2d_3 (Conv2D) (None, 24, 24, 64) 1664
max_pooling2d_3 (MaxPooling2 (None, 12, 12, 64) 0
conv2d_4 (Conv2D) (None, 8, 8, 64) 102464
max_pooling2d_4 (MaxPooling2 (None, 4, 4, 64) 0
flatten_2 (Flatten) (None, 1024) 0
dense_2 (Dense) (None, 256) 262400
dense_3 (Dense) (None, 10) 2570
Total params: 369,098
Trainable params: 369,098
Non-trainable params: 0
Below, I will run your code with:
minor adaptations to make the code work with the MNIST dataset in keras.datasets
a simplified model: basically I remove the 256-node Dense layer, drastically reducing the number of trainable parameters, and introduce some dropout for regularization.
With these changes, both models achieve 90%+ validation set accuracy after the first epoch. So it seems the problem you encountered has to do with an ill-posed optimization problem which leads to highly variable outcomes, and not with a bug in your code.
# Import the datasets
import numpy as np
from keras.datasets import mnist
from keras.utils import to_categorical
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Add channel dimension
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)
# One-hot encode the labels
y_train = to_categorical(y_train, num_classes=None)
y_test = to_categorical(y_test, num_classes=None)
batch_size = 64
# Fit model using Keras
import keras
import numpy as np
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout
from keras.models import Sequential
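model = Sequential()
model.add(Conv2D(32, 5, activation="relu", input_shape=(28, 28, 1)))
model.add(MaxPool2D())
model.add(Conv2D(32, 5, activation="relu"))
model.add(MaxPool2D())
model.add(Flatten())
model.add(Dropout(0.25))  # rate taken from the TensorFlow version below
model.add(Dense(10, activation='softmax'))
# optimizer assumed to be Adam, matching the TensorFlow version; the start of this call is missing
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=32, validation_data=(x_test, y_test), epochs=1)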
Train on 60000 samples, validate on 10000 samples
Epoch 1/1
60000/60000 [==============================] - 35s 583us/step - loss: 1.5217 - acc: 0.8736 - val_loss: 0.0850 - val_acc: 0.9742
Note that the number of trainable parameters is now just a fraction of the amount in your model:
Layer (type) Output Shape Param #
conv2d_3 (Conv2D) (None, 24, 24, 32) 832
max_pooling2d_3 (MaxPooling2 (None, 12, 12, 32) 0
conv2d_4 (Conv2D) (None, 8, 8, 32) 25632
max_pooling2d_4 (MaxPooling2 (None, 4, 4, 32) 0
flatten_2 (Flatten) (None, 512) 0
dropout_1 (Dropout) (None, 512) 0
dense_2 (Dense) (None, 10) 5130
Total params: 31,594
Trainable params: 31,594
Non-trainable params: 0
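The parameter counts in the two summaries can be checked by hand. The sketch below is plain Python arithmetic (no Keras), assuming 28x28x1 inputs, 5x5 "valid" convolutions and 2x2 pooling as in the code above; the helper function names are made up for this illustration:

```python
def conv2d_params(kernel, in_channels, filters):
    # kernel*kernel weights per input channel per filter, plus one bias per filter
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    # one weight per input per unit, plus one bias per unit
    return inputs * units + units

def valid_conv_then_pool(size, kernel, pool=2):
    # a "valid" convolution shrinks each side by kernel-1, then pooling divides it by pool
    return (size - kernel + 1) // pool

def count(filters, hidden_units=None):
    size = valid_conv_then_pool(28, 5)      # 28 -> 24 -> 12
    size = valid_conv_then_pool(size, 5)    # 12 -> 8 -> 4
    flat = size * size * filters            # length of the flattened feature vector
    total = conv2d_params(5, 1, filters) + conv2d_params(5, filters, filters)
    if hidden_units is not None:
        total += dense_params(flat, hidden_units)
        flat = hidden_units
    total += dense_params(flat, 10)         # 10-way output layer
    return total

original = count(filters=64, hidden_units=256)
simplified = count(filters=32)
print(original, simplified)  # 369098 31594
```

In the original model the 1024 -> 256 Dense layer alone accounts for 262,400 of the 369,098 parameters, which is why removing it shrinks the model so much.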
Now, doing the same with TensorFlow:
# Fit model using TensorFlow
import tensorflow as tf
X = tf.placeholder(shape=[None, 28, 28, 1], dtype=tf.float32, name="X")
Y = tf.placeholder(shape=[None, 10], dtype=tf.float32, name="Y")
net = tf.layers.conv2d(
X, filters=32, kernel_size=5, padding="valid", activation=tf.nn.relu)
net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)
net = tf.layers.conv2d(
net, filters=32, kernel_size=5, padding="valid", activation=tf.nn.relu)
net = tf.layers.max_pooling2d(net, pool_size=2, strides=2)
net = tf.contrib.layers.flatten(net)
net = tf.layers.dropout(net, rate=0.25)
model = tf.layers.dense(net, name="output", units=10)
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=Y, logits=model)
opt = tf.train.AdamOptimizer().minimize(loss)
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(model, 1), tf.argmax(Y, 1)), tf.float32))
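with tf.Session() as sess:
    # variables must be initialized before the first training step
    sess.run(tf.global_variables_initializer())
    L = []
    l_ = 0
    for i in range(x_train.shape[0] // batch_size):
        x, y = x_train[i*batch_size:(i+1)*batch_size],\
            y_train[i*batch_size:(i+1)*batch_size]
        l, _ = sess.run([loss, opt], feed_dict={X: x, Y: y})
        l_ += np.mean(l)
    # average training loss over all batches of the epoch
    L.append(l_ / (x_train.shape[0] // batch_size))
    print('Training loss: {:.3f}'.format(L[-1]))
    acc = []
    for j in range(x_test.shape[0] // batch_size):
        x, y = x_test[j*batch_size:(j+1)*batch_size],\
            y_test[j*batch_size:(j+1)*batch_size]
        acc.append(sess.run(accuracy, feed_dict={X: x, Y: y}))
    print('Test set accuracy: {:.3f}'.format(np.mean(acc)))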
Training loss: 0.519
Test set accuracy: 0.968
Possible improvements to your models:
I have used CNNs on different problems and always got good improvements in effectiveness with regularization techniques, the best ones with dropout.
I suggest using Dropout on the Dense layers and, if needed, with a lower probability on the convolutional ones.
Data augmentation on the input data is also very important, but its applicability depends on the problem domain.
P.S.: in one case I had to change the optimizer from Adam to SGD with momentum, so experimenting with the optimizer makes sense. Gradient clipping can also be considered when your network stalls and doesn't improve; that may be a numerical issue.
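For readers unfamiliar with gradient clipping, here is a minimal NumPy sketch of one common variant, clipping by global norm. It illustrates the idea only and is not how any particular framework implements it; the function name and the example gradients are made up. (Keras optimizers also accept `clipnorm` and `clipvalue` arguments for built-in clipping.)

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Norm of all gradients taken together, as if concatenated into one vector.
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if global_norm <= max_norm:
        return [g.copy() for g in grads], global_norm
    # Rescale every gradient by the same factor so the direction is preserved.
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

# Made-up gradients for two parameter tensors.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))

print(norm_before, norm_after)
```

Scaling all gradients by one shared factor keeps the update direction unchanged and only limits the step size, which is the usual motivation when very large gradients are suspected of destabilizing training.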

Tensorflow with Keras: ValueError - expected dense_84 to have 2 dimensions, but got array with shape (100, 9, 1)

I am trying to use Tensorflow through Keras to build a network that uses time-series data to predict the next value, but I'm getting this error:
ValueError: Error when checking target: expected dense_84 to have 2 dimensions, but got array with shape (100, 9, 1)
What is causing this? I've tried reshaping the data as other posts have suggested, but to no avail so far. Here is the code:
import keras
import numpy as np
import os
from keras import losses
from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten, Dropout
from keras.layers.convolutional import Conv1D, Conv2D
# add the desktop to our path so we can access the data
# import data
data = np.genfromtxt("C:\\Users\\user\\Desktop\\aapl_blocks_10.csv",
# separate into inputs and outputs
X = data[:, :9]
X = np.expand_dims(X, axis=2) # reshape (409, 9) to (409, 9, 1) for network
Y = data[:, 9]
# separate into test and train data
X_train = X[:100]
X_test = X[100:]
Y_train = Y[:100]
Y_test = Y[100:]
# set parameters
batch_size = 20;
# define model
model = Sequential()
input_shape=(9, 1),
# train model
model.fit(X_train, Y_train, epochs=10, batch_size=batch_size)
# evaluate model
model.evaluate(X_test, Y_test, batch_size=batch_size)
And here is the model summary:
Layer (type) Output Shape Param #
conv1d_43 (Conv1D) (None, 9, 20) 120
flatten_31 (Flatten) (None, 180) 0
dropout_14 (Dropout) (None, 180) 0
dense_83 (Dense) (None, 10) 1810
activation_29 (Activation) (None, 10) 0
dense_84 (Dense) (None, 1) 11
Total params: 1,941
Trainable params: 1,941
Non-trainable params: 0
If there's a proper way to be formatting the data, or maybe a proper way to stack these layers, I would love to know.
I suspect you need to squeeze the channel dimension from the output, i.e. the labels have shape (batch_size, 9) and you're comparing them against the output of a dense layer with 1 channel, which has shape (batch_size, 9, 1). Solution: squeeze/flatten before calculating the loss.
A note on squeeze vs Flatten: in this case, the result of squeezing (removing an axis of size 1) and flattening (turning something of shape (batch_size, n, m, ...) into shape (batch_size, n*m*...)) will be the same. Squeeze might be slightly more appropriate here, since if you accidentally squeeze an axis whose size is not 1 you'll get an error (a good thing), as opposed to having your program run with unexpected behaviour. I don't use Keras much, though, and couldn't find a 'Squeeze' layer, just a squeeze function, and I'm not entirely sure how to integrate it.
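The difference between the two operations can be seen with plain NumPy arrays. This is a small sketch on made-up zero arrays and says nothing about how to wire the operation into a Keras model:

```python
import numpy as np

batch_size = 4
outputs = np.zeros((batch_size, 9, 1))

# With a trailing axis of size 1, both operations give shape (4, 9).
squeezed = np.squeeze(outputs, axis=-1)
flattened = outputs.reshape(batch_size, -1)

# With a trailing axis of size 2, squeeze refuses, while flattening silently
# produces a different shape.
wide = np.zeros((batch_size, 9, 2))
try:
    np.squeeze(wide, axis=-1)
    squeeze_failed = False
except ValueError:
    squeeze_failed = True

flattened_wide = wide.reshape(batch_size, -1)   # shape (4, 18)

print(squeezed.shape, flattened.shape, squeeze_failed, flattened_wide.shape)
```

The error from squeeze is the safety net described above: a shape mistake surfaces immediately instead of flowing into the loss.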

Training loss and accuracy remain constant after adding attention mechanism

I trained a bidirectional LSTM on the IMDB dataset for sentiment analysis using Keras with TensorFlow as the backend. This is the example in Keras. During training, accuracy quickly jumps to 90% and above on the training set and 84% on the validation set. So far, so good. But when I add a custom attention decoder layer and train the network, the training and validation accuracy stay constant from epoch 1 to 10.
Below is my code for training on the IMDB dataset with the custom attention decoder layer.
max_features = 20000
maxlen = 80
batch_size = 32
timesteps = 80
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = x_train[:5000]
y_train = y_train[:5000]
x_test = x_test[:5000]
y_test = y_test[:5000]
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print (x_train[0]) #Shape(5000, 80)
print (y_train[0]) #Shape (5000,)
print('Build model...')
def modelnmt():
    input_ = Input(shape=(80,), dtype='float32')
    print (input_.get_shape())
    input_embed = Embedding(max_features, 128 ,input_length=80)(input_)
    print (input_embed.get_shape())
    rnn_encoded = Bidirectional(LSTM(encoder_units, return_sequences=True),
    print (rnn_encoded.get_shape())
    y_hat = AttentionDecoder(decoder_units,
    y_adec = Reshape((80,))(y_adec)
    y_hat = Dense(1, activation='sigmoid')(y_adec)
    model = Model(inputs=input_, outputs=y_hat)
    return model
model = modelnmt()
# try using different optimizers and different optimizer configs
print('Train...')
model.fit(x_train, y_train,
          epochs=10,  # matches the 'Epoch 1/10' log below
          validation_data=(x_test, y_test))
And here is the output:
Layer (type) Output Shape Param #
input_1 (InputLayer) (None, 80) 0
embedding_1 (Embedding) (None, 80, 128) 2560000
bidirectional_1 (Bidirection (None, 80, 128) 98816
attention_decoder_1 (Attenti (None, 80, 1) 58050
reshape_1 (Reshape) (None, 80) 0
dense_1 (Dense) (None, 1) 81
Total params: 2,716,947
Trainable params: 2,716,947
Non-trainable params: 0
Epoch 1/10
5000/5000 [==============================] - 289s - loss: 0.6955 - acc: 0.5056 - val_loss: 0.6935 - val_acc: 0.4956
Epoch 2/10
5000/5000 [==============================] - 348s - loss: 0.6944 - acc: 0.4956 - val_loss: 0.6936 - val_acc: 0.4956
I will explain the model briefly. First the words are embedded, then passed to a bidirectional LSTM, and then to the attention decoder. The attention decoder outputs one number per timestep, i.e. shape (None, 80, 1). These numbers are then reshaped and passed to a Dense layer to calculate the overall sentiment of the sentence (a probability). The output of the attention decoder can later be used to visualize the contribution of each word in the sentence.
What can be possible reasons for such output?