I am trying to train a classifier with two labels to predict. For some reason, my validation loss and my training loss are always stuck at 0 when training. What could be the reason? Is there something wrong with how I call the loss functions? Are they appropriate ones for multi-label classification?
from sklearn.model_selection import train_test_split
from tensorflow import keras

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=12, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=12, shuffle=True)

model = keras.Sequential([
    #keras.layers.Flatten(batch_input_shape=(None,24)),
    keras.layers.Dense(first_neurona, activation='relu'),
    keras.layers.Dense(second_neurona, activation='relu'),
    keras.layers.Dense(third_neurona, activation='relu'),
    keras.layers.Dense(fourth_neurona, activation='relu'),
    keras.layers.BatchNormalization(),  # we normalize the input data
    keras.layers.Dropout(0.25),
    keras.layers.Dense(2, activation='softmax'),
    #keras.layers.BatchNormalization() # for multi-class problems we use softmax? 2 classes: forehand or backhand
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(X_train, y_train, epochs=n_epochs, batch_size=batchSize, validation_data=(X_val, y_val))
test_loss, test_acc = model.evaluate(X_test, y_test)
EDIT:
The shapes of my training data:
X_train shape: (280, 14)   X_val shape: (94, 14)
y_train shape: (280, 2)    y_val shape: (94, 2)
The parameters used when calling the function:
first neuron units: 4
second neuron units: 8
learning rate = 0.0001
epochs = 1000
batch_size = 32
Also the metrics plots: [loss and validation-accuracy plots not reproduced here; they are referred to in the answers below]
The softmax activation function should be the last step in multi-class classification: it converts logits to probabilities and should not be followed by any further transformation.
Pay attention to the output layer: the last layer should have the same number of units (output_units) as there are target classes in the training dataset, because the model outputs the probability of belonging to each class, and you then select the argmax as the most probable one. For this reason the labels are one-hot encoded before model.fit(): each sample then yields a slice of per-class probabilities from which you pick the argmax. Alternatively, keep integer labels and use sparse_categorical_crossentropy (see the sketch after these notes).
For this kind of trouble it is often suggested to decrease the learning_rate, to guard against exploding gradients.
Be careful with activation='relu' and dying/vanishing gradients: LeakyReLU or PReLU are better choices, or use gradient clipping.
P.S. I cannot test this without your data.
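To make the label-encoding point concrete, here is a minimal sketch assuming integer class labels; the data, layer sizes, and epoch count are illustrative, not taken from the question:

import numpy as np
from tensorflow import keras

# Toy stand-in data: 100 samples, 14 features, integer labels 0/1.
X = np.random.rand(100, 14).astype('float32')
y = np.random.randint(0, 2, size=(100,))

def make_model():
    # LeakyReLU instead of plain ReLU, per the advice above.
    return keras.Sequential([
        keras.layers.Dense(8, input_shape=(14,)),
        keras.layers.LeakyReLU(),
        keras.layers.Dense(2, activation='softmax'),  # one unit per class
    ])

# Option 1: one-hot labels with categorical_crossentropy.
m1 = make_model()
m1.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
m1.fit(X, keras.utils.to_categorical(y, num_classes=2), epochs=2, verbose=0)

# Option 2: integer labels with sparse_categorical_crossentropy.
m2 = make_model()
m2.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
m2.fit(X, y, epochs=2, verbose=0)

# Either way, predictions are per-class probabilities; take the argmax per sample.
classes = np.argmax(m1.predict(X, verbose=0), axis=-1)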
From the plot of validation accuracy it is clear that it is changing, so the loss must be changing as well. I suspect you are plotting the wrong values for loss, which is why I wanted to see the actual model.fit printed training output. Make sure you have these assignments for the losses (note the square brackets, since history.history is a dict):
val_loss = history.history['val_loss']
train_loss = history.history['loss']
I also recommend you increase the learning_rate to .001.
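For reference, a minimal plotting sketch, assuming history is the return value of the model.fit() call above:

import matplotlib.pyplot as plt

train_loss = history.history['loss']      # dict access, not a function call
val_loss = history.history['val_loss']

plt.plot(train_loss, label='loss')
plt.plot(val_loss, label='val_loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()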
The code below is adapted from the TensorFlow tutorials:
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(len(x_train), -1)

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10)
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, verbose=0)
model.summary()
gives Output Shape (32, 10), whereas this code
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10)
])

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam',
              loss=loss_fn,
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=1, verbose=0)
model.summary()
gives Output Shape (None, 10).
I'm aware that 32 means the batch size and 10 means the number of output classes. I'd just like to know where the None comes from when the input shape is clear and fixed.
The first dimension is the number of samples (the batch size). Since the model should be flexible and work with any number of samples or any batch size, it is represented as None. So don't worry about it: your model does not care about the first dimension.
For example, in your case the input shape is (28, 28) and the output is (10,). The model treats (None, 28, 28) and (None, 10) as its input and output shapes. That means you can feed the model any number of samples, but each input sample must be (28, 28), and the model gives you back the same (unknown) number of samples, each with 10 labels. This is why you don't include the batch size in the input_shape parameter of your first layer.
Another example of the varying first dimension is training vs. prediction. For training you may pass an input array of, say, (10, 28, 28), meaning 10 samples of size 28x28. But when you want a prediction via model.predict() you may pass a single sample like (1, 28, 28). Since the first dimension varies over the model's life cycle, it is set to None.
The first model shows (32, 10) because you called model.summary() after model.fit() and you didn't specify input_shape in your first layer, so Keras infers the shapes from the training procedure. model.fit() uses a default batch_size of 32, so the summary shows the batch size.
But if you set input_shape, which should not include the batch size, the model is created with None as the first dimension.
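To see this directly, a minimal sketch (the untrained weights don't matter, only the shapes):

import numpy as np
import tensorflow as tf

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(10)
])
model.summary()  # Output Shape (None, 10): the batch dimension stays flexible

# The same model accepts any number of samples along the first axis.
print(model.predict(np.zeros((1, 28, 28))).shape)   # (1, 10)
print(model.predict(np.zeros((64, 28, 28))).shape)  # (64, 10)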
I'm learning deep learning with Keras and trying to compare the results (accuracy) with machine learning algorithms from sklearn (i.e. random forest, k-neighbors).
It seems that with Keras I'm getting worse results.
I'm working on a simple classification problem: the iris dataset.
My Keras code looks like this:
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense

samples = datasets.load_iris()
X = samples.data
y = samples.target
df = pd.DataFrame(data=X)
df.columns = samples.feature_names
df['Target'] = y

# prepare data
X = df[df.columns[:-1]]
y = df[df.columns[-1]]

# one-hot encoding
encoder = LabelEncoder()
y1 = encoder.fit_transform(y)
y = pd.get_dummies(y1).values

# split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# build model
model = Sequential()
model.add(Dense(1000, activation='tanh', input_shape=((df.shape[1]-1),)))
model.add(Dense(500, activation='tanh'))
model.add(Dense(250, activation='tanh'))
model.add(Dense(125, activation='tanh'))
model.add(Dense(64, activation='tanh'))
model.add(Dense(32, activation='tanh'))
model.add(Dense(9, activation='tanh'))
model.add(Dense(y.shape[1], activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train)
score, acc = model.evaluate(X_test, y_test, verbose=0)

# results:
# score = 0.77
# acc = 0.711
I have tried adding layers and/or changing the number of units per layer and/or changing the activation function (to relu), but it seems that the results never exceed 0.85.
With sklearn's random forest or k-neighbors I'm getting results (on the same dataset) above 0.95.
What am I missing?
With sklearn I put in little effort and got good results, while with Keras I made a lot of tweaks but the results are not as good as sklearn's. Why is that?
How can I get the same results with Keras?
In short, you need:
ReLU activations
Simpler model
Data normalization
More epochs
In detail:
The first issue here is that nowadays we never use activation='tanh' for the intermediate network layers. In such problems, we practically always use activation='relu'.
The second issue is that you have built quite a large Keras model, and it may very well be the case that, with only 100 iris samples in your training set, you have too little data to effectively train such a large model. Try reducing drastically both the number of layers and the number of nodes per layer. Start simpler.
Large neural networks really thrive when we have lots of data, but in cases of small datasets, like here, their expressiveness and flexibility may become a liability instead, compared with simpler algorithms, like RF or k-nn.
The third issue is that, in contrast to tree-based models like random forests, neural networks generally require normalizing the data, which you don't do. Truth is that knn also requires normalized data, but in this special case, since all iris features are on the same scale, it does not affect the performance negatively.
Last but not least, you seem to run your Keras model for only one epoch (the default value if you don't specify anything in model.fit); this is somewhat equivalent to building a random forest with a single tree (which, BTW, is still much better than a single decision tree).
All in all, with the following changes in your code:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
model = Sequential()
model.add(Dense(150, activation='relu', input_shape = ((df.shape[1]-1),)))
model.add(Dense(150, activation='relu'))
model.add(Dense(y.shape[1], activation='softmax'))
model.fit(X_train, y_train, epochs=100)
and everything else as is, we get:
score, acc = model.evaluate(X_test, y_test, verbose=0)
acc
# 0.9333333373069763
We can do better: use slightly more training data and stratify them, i.e.
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,  # a few more samples for training
                                                    stratify=y)
And with the same model & training epochs you can get a perfect accuracy of 1.0 in the test set:
score, acc = model.evaluate(X_test, y_test, verbose=0)
acc
# 1.0
(Details might differ due to some randomness imposed by default in such experiments).
Adding some dropout might help you improve accuracy. See Tensorflow's documentation for more information.
Essentially, you add a Dropout layer in the same way you added those Dense() layers:
from keras.layers import Dropout
model.add(Dropout(0.2))
Note: the parameter 0.2 means that 20% of the inputs to the next layer are randomly zeroed out during training, which reduces the interdependencies between units and thus reduces overfitting.
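For instance, a minimal sketch of where such layers could sit in the simplified iris model above; the placement and rate are illustrative:

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(150, activation='relu', input_shape=(4,)))  # 4 iris features
model.add(Dropout(0.2))  # drop 20% of this layer's outputs during training
model.add(Dense(150, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(3, activation='softmax'))  # 3 iris classes
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])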
I'm trying to experiment with a simple TensorFlow model built with keras, but I can't figure out why I'm getting such poor predictions. Here's the model:
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense

x_train = np.asarray([[.5], [1.0], [.4], [5], [25]])
y_train = np.asarray([.25, .5, .2, 2.5, 12.5])

opt = keras.optimizers.Adam(lr=0.01)

model = Sequential()
model.add(Dense(1, activation="relu", input_shape=x_train.shape[1:]))
model.add(Dense(9, activation="relu"))
model.add(Dense(1, activation="relu"))
model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mean_squared_error'])
model.fit(x_train, y_train, shuffle=True, epochs=10)

print(model.predict(np.asarray([[5]])))
As you can see, it should learn to divide the input by two. However, the loss is 32.5705, and over a few epochs it refuses to change whatsoever (even if I do something crazy like 100 epochs, it's always that loss). Is there anything you can see that I'm doing horribly wrong here? The prediction for any value, it seems, is 0.
It also seems to switch randomly between performing as expected and the weird behavior described above. I re-ran it and got a loss of 0.0019 after 200 epochs, but if I re-run it with all the same parameters a second later, the loss stays at 30 like before. What's going on here?
Some reasons I can think of:
the training set is too small
the learning rate is high
the last layer should just be a linear layer
for some runs the ReLU units are dying (see the dead ReLU problem), and your network weights don't change after that, so you see the same loss value
In the last case, a tanh activation may provide better conditioning for optimization.
I made a few changes to your code based on what I commented, and I get decent results.
import keras
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
x_train = np.random.random((50000, 1))#np.asarray([[.5], [1.0], [.4], [5], [25]])
y_train = x_train /2. #TODO: add small amount of noise to y #np.asarray([.25, .5, .2, 2.5, 12.5])
opt = keras.optimizers.Adam(lr=0.0005, clipvalue=0.5)
model = Sequential()
model.add(Dense(1, activation="tanh", input_shape=x_train.shape[1:]))
model.add(Dense(9, activation="tanh"))
model.add(Dense(1, activation=None))
model.compile(loss='mean_squared_error', optimizer=opt, metrics=['mean_squared_error'])
model.fit(x_train, y_train, shuffle=True, epochs=10)
print(model.predict(np.asarray([[.4322]])))
Output:
[[0.21410337]]
Currently I've been training like so:
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          shuffle='batch',
          validation_data=(x_test, y_test),
          callbacks=callbacks)
These metrics are output every epoch so I know how well the model is performing:
from sklearn.metrics import confusion_matrix
predictions = model.predict(x_test)
y_test = np.argmax(y_test, axis=-1)
predictions = np.argmax(predictions, axis=-1)
c = confusion_matrix(y_test, predictions)
print('Confusion matrix:\n', c)
print('sensitivity', c[0, 0] / (c[0, 1] + c[0, 0]))
print('specificity', c[1, 1] / (c[1, 1] + c[1, 0]))
Depending on my architecture, I get better results at epoch 93 or 155; then it gets worse. So clearly my metrics are wrong.
How do I learn from the sensitivity and specificity results each epoch?
To learn from the sensitivity and specificity results, you could write a custom loss function where the loss is computed depending on your confusion matrix results. Alternatively, you could try the class_weight parameter in keras model.fit() and assign different weights to classes depending on which ones your model finds harder to learn.
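As one concrete illustration of both ideas, here is a minimal sketch; the class indices (0/1), weights, and the callback name SensSpecCallback are assumptions for a binary, one-hot-encoded problem like the confusion-matrix code above:

import numpy as np
from sklearn.metrics import confusion_matrix
from tensorflow import keras

class SensSpecCallback(keras.callbacks.Callback):
    """Print sensitivity and specificity on held-out data after each epoch."""
    def __init__(self, x_val, y_val):
        super().__init__()
        self.x_val = x_val
        self.y_val = np.argmax(y_val, axis=-1)  # assumes one-hot labels

    def on_epoch_end(self, epoch, logs=None):
        preds = np.argmax(self.model.predict(self.x_val, verbose=0), axis=-1)
        c = confusion_matrix(self.y_val, preds)
        print(f" epoch {epoch}: sensitivity={c[0, 0] / (c[0, 1] + c[0, 0]):.3f}"
              f" specificity={c[1, 1] / (c[1, 1] + c[1, 0]):.3f}")

# Usage: weight the harder class more heavily (the weights here are illustrative).
# model.fit(x_train, y_train, epochs=epochs,
#           class_weight={0: 1.0, 1: 2.0},
#           callbacks=[SensSpecCallback(x_test, y_test)])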
I am using Keras to perform landmark detection, specifically locating parts of the body in a picture of a human. I have gathered around 2,000 training samples and am using rmsprop w/ mse loss function. After training my CNN, I am left with loss: 3.1597e-04 - acc: 1.0000 - val_loss: 0.0032 - val_acc: 1.0000.
I figured this would mean my model would perform well on the test data; however, the predicted points are instead way off from the labeled points. Any ideas or help would be greatly appreciated!
import pickle
import numpy as np
import pandas as pd
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dropout, Dense
from keras import optimizers

IMG_SIZE = 96
NUM_KEYPOINTS = 15
NUM_EPOCHS = 50
NUM_CHANNELS = 1
TESTING = True

def load(test=False):
    # load data from CSV file (fname is defined elsewhere in the script)
    df = pd.read_csv(fname)

    # convert Image to numpy arrays
    df['Image'] = df['Image'].apply(lambda im: np.fromstring(im, sep=' '))
    df = df.dropna()  # drop rows with missing values

    X = np.vstack(df['Image'].values) / 255.  # scale pixel values to [0, 1]
    X = X.reshape(X.shape[0], IMG_SIZE, IMG_SIZE, NUM_CHANNELS)
    X = X.astype(np.float32)

    y = df[df.columns[:-1]].values
    y = (y - (IMG_SIZE / 2)) / (IMG_SIZE / 2)  # scale target coordinates to [-1, 1]
    X, y = shuffle(X, y, random_state=42)  # shuffle train data
    y = y.astype(np.float32)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
    return X_train, X_test, y_train, y_test
def build_model():
    # construct the neural network
    model = Sequential()
    model.add(Conv2D(16, (3, 3), activation='relu', input_shape=(IMG_SIZE, IMG_SIZE, NUM_CHANNELS)))
    model.add(MaxPooling2D(2, 2))
    model.add(Conv2D(32, (3, 3), activation='relu'))
    model.add(MaxPooling2D(2, 2))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(2, 2))
    model.add(Flatten())
    model.add(Dropout(0.5))
    model.add(Dense(500, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(NUM_KEYPOINTS * 2))
    return model
if __name__ == '__main__':
    X_train, X_test, y_train, y_test = load(test=TESTING)
    model = build_model()
    sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(optimizer=sgd, loss='mse', metrics=['accuracy'])
    hist = model.fit(X_train, y_train, epochs=NUM_EPOCHS, verbose=1, validation_split=0.2)

    # save the model
    model.save_weights("/output/model_weights.h5")
    histFile = open("/output/training_history", "wb")
    pickle.dump(hist.history, histFile)
According to this question, How does keras define "accuracy" and "loss"?, your "accuracy" is defined as categorical accuracy, which makes absolutely no sense for your problem.
After training you are left with a 10x difference between your training loss and validation loss, which would suggest overfitting (hard to say for sure without a graph and some examples).
To start fixing it:
Use a metric that makes sense in your context, one where you understand what it does and how it's computed.
Take random examples where the metric is very good and where it is very bad, and manually validate that that is really the case (otherwise you need a different metric).
In your case I would imagine a metric based on the distance between the desired locations and the predicted ones. This is not built in and you would have to implement it yourself (see the sketch below).
Always be suspicious if the model says it's perfect.
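A minimal sketch of such a distance metric, assuming the targets are flattened (x, y) coordinate pairs as in the code above; the name mean_keypoint_distance is made up for illustration:

import tensorflow as tf

def mean_keypoint_distance(y_true, y_pred):
    # Reshape flat (x1, y1, x2, y2, ...) vectors into (num_keypoints, 2) pairs
    # and average the Euclidean distance between true and predicted points.
    true_pts = tf.reshape(y_true, (-1, NUM_KEYPOINTS, 2))
    pred_pts = tf.reshape(y_pred, (-1, NUM_KEYPOINTS, 2))
    return tf.reduce_mean(tf.norm(true_pts - pred_pts, axis=-1))

model.compile(optimizer=sgd, loss='mse', metrics=[mean_keypoint_distance])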
It is impossible to tell from your question alone, but I will venture a guess based on the implications of your data split.
Typically, when you split your data into more than two sets, you use all but one of them to train on some parameter or another. For example, the first split is used to choose the model weights, the second split to choose the model architecture, etc. Presumably you are tuning something with your 'validation' set, otherwise you wouldn't have it. Thus, the problem is almost certainly overfitting. The usual way to detect overfitting is to compare the accuracy of your model on the data used to train it in any way (here, your 'training' and 'validation' splits) against its accuracy on a split your model has never touched, your 'test' split.
So, per your question-comment, "I assume if the validation accuracy is that high then there is no overfitting, right?": no. If your model's accuracy on any data that was used to train anything at all is higher than its accuracy on data the model has never touched in any way, shape, or form, then you've overfit. Which seems to be the case with you.
On the other hand, it may be the case that you've simply not shuffled your data. It's impossible to tell without a look at the training/testing pipeline. A quick check for the train/test gap is sketched below.
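A minimal sketch of that check, assuming the variables and the compile() call from the code above (evaluate returns the loss plus the one compiled metric):

# A large gap between these numbers indicates overfitting.
train_loss, train_acc = model.evaluate(X_train, y_train, verbose=0)
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print("train mse:", train_loss, " test mse:", test_loss)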