tensorflow automatic accuracy calculation for multilabel classifier

I am fitting a multilabel classifier to (train_x, train_y) while monitoring the loss and accuracy on a validation set (val_x, val_y):
classification_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0002),
                             loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                             metrics=['accuracy'])
classification_model.fit(train_x, train_y,
                         validation_data=(val_x, val_y),
                         epochs=10,
                         batch_size=10)
This gives the following output:
Epoch 1/10
50/50 [==============================] - ETA: 0s - loss: 0.1186 - accuracy: 0.7094
Epoch 1: val_loss improved from 0.15329 to 0.11998, saving model to best_classification_model.tf
50/50 [==============================] - 12s 186ms/step - loss: 0.1186 - accuracy: 0.7094 - val_loss: 0.1200 - val_accuracy: 0.6280
Epoch 2/10
50/50 [==============================] - ETA: 0s - loss: 0.0848 - accuracy: 0.7776
Epoch 2: val_loss improved from 0.11998 to 0.10281, saving model to best_classification_model.tf
50/50 [==============================] - 8s 167ms/step - loss: 0.0848 - accuracy: 0.7776 - val_loss: 0.1028 - val_accuracy: 0.7200
Epoch 3/10
50/50 [==============================] - ETA: 0s - loss: 0.0652 - accuracy: 0.8176
Epoch 3: val_loss improved from 0.10281 to 0.09259, saving model to best_classification_model.tf
50/50 [==============================] - 10s 202ms/step - loss: 0.0652 - accuracy: 0.8176 - val_loss: 0.0926 - val_accuracy: 0.7560
Epoch 4/10
50/50 [==============================] - ETA: 0s - loss: 0.0522 - accuracy: 0.8236
Epoch 4: val_loss improved from 0.09259 to 0.08710, saving model to best_classification_model.tf
50/50 [==============================] - 10s 206ms/step - loss: 0.0522 - accuracy: 0.8236 - val_loss: 0.0871 - val_accuracy: 0.7480
Epoch 5/10
50/50 [==============================] - ETA: 0s - loss: 0.0418 - accuracy: 0.8337
Epoch 5: val_loss improved from 0.08710 to 0.08441, saving model to best_classification_model.tf
50/50 [==============================] - 10s 209ms/step - loss: 0.0418 - accuracy: 0.8337 - val_loss: 0.0844 - val_accuracy: 0.7640
I am wondering how this accuracy is actually calculated.
Does it count the total number of correct labels, or the total number of rows for which all labels are correct?
And what is a 'correct label'? Is (internally) the maximum taken per output row?
To clarify what I mean with each option:
The total number of correct labels: for each image, 20 labels are output, of which some are 0 and some are 1. Report the total number of correct labels (= number of correct 0s + number of correct 1s) and divide it by the total number of labels (= 20 * num_images). I don't think this is what happens, as that would probably lead to a higher accuracy: just predicting 0s everywhere would already give an accuracy of over 90%, and that is not what happens, even after training for a longer time.
The total number of rows for which all labels are correct: count the number of images for which all labels (the 0s and the 1s) are correct and divide by the number of images.
The model output and the validation labels look as follows:
>>> classification_model.predict(val_x) # shape: (250, 20)
array([[ -9.385, -5.443, -8.274, ..., 1.936, -11.607, -1.867],
[-10.523, 3.074, -7.765, ..., -2.925, -10.35 , -2.602],
[ -7.872, -7.525, -4.877, ..., -6.434, -9.063, -8.485],
...,
[ -6.04 , -4.826, 3.537, ..., -5.68 , -7.276, -6.05 ],
[ -5.734, -6.399, -5.288, ..., -5.495, -6.673, 0.06 ],
[ -9.458, -7.892, 1.391, ..., -6.422, -9.75 , -7.702]],
dtype=float32)
>>> val_y # also shape: (250,20)
array([[0., 0., 0., ..., 0., 0., 1.],
[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 1.],
[0., 0., 1., ..., 0., 0., 0.]])

When you use 'accuracy', you are trusting Keras to automatically select a metric for you from among BinaryAccuracy, CategoricalAccuracy, and SparseCategoricalAccuracy. You'll get burned by enough corner cases that aren't picked up that you'll find it easier to just be explicit. So I'd go with
metrics = [tf.keras.metrics.BinaryAccuracy(threshold=???)]
BinaryAccuracy is computed as if every label were part of one big bucket. So if you have two images with 10 possible labels each in a multi-label setting, you have 20 individual predictions, and binary accuracy is just (TP + TN) / 20. If it helps, think of it as a mean over all label positions ("reduce_mean over everything") rather than the fraction of rows where every label matches ("reduce_mean of per-row reduce_all").
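To make the "one big bucket" behaviour concrete, here is a minimal sketch of my own (toy values; it feeds logits with threshold=0.0, which is explained further below):
import tensorflow as tf

y_true = tf.constant([[0., 1., 1.], [0., 0., 1.]])
y_pred = tf.constant([[-2.,  3., -1.], [-4.,  2.,  5.]])   # logits

# BinaryAccuracy pools all 6 label slots: 4 of the 6 are classified correctly.
m = tf.keras.metrics.BinaryAccuracy(threshold=0.0)
print(m(y_true, y_pred).numpy())   # ~0.6667

# The per-row "all labels correct" interpretation would give 0.0 here,
# since neither row has every label right.
row_ok = tf.reduce_all(tf.cast(y_pred > 0.0, y_true.dtype) == y_true, axis=-1)
print(tf.reduce_mean(tf.cast(row_ok, tf.float32)).numpy())   # 0.0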
But Keras often doesn't document these corner cases. You can read the code yourself (which gets harder if you use "accuracy" rather than instantiating an actual object). Or run experiments.
Further, since you are outputting logits rather than probabilities (i.e. there is no logistic/sigmoid layer at the end of your model), your model outputs -inf to indicate 0% confidence, 0 to indicate 50% confidence, and +inf to indicate 100% confidence. It's your decision where to place the threshold. The typical answer is 50% confidence, but if you want to tune recall/precision/specificity you can move it up or down. To get a 50% threshold when your model outputs logits, you should set the threshold at 0.0, because a logit of 0.0 corresponds to a probability of 50%.
>>> tf.sigmoid(0.0)
<tf.Tensor: shape=(), dtype=float32, numpy=0.5>
You do it like this:
m = tf.keras.metrics.BinaryAccuracy(threshold=0.0)
Here's an example of where not controlling the threshold properly for logits will burn you.
In [12]: import tensorflow as tf
In [13]: y_true = tf.convert_to_tensor([[0,0,0],[0,1,1]])
In [14]: y_pred = tf.convert_to_tensor([[-.1, -.1, -.1], [.1, .1, .1]])
In [15]: m = tf.keras.metrics.BinaryAccuracy(threshold=0.0)
In [16]: m(y_true, y_pred)
Out[16]: <tf.Tensor: shape=(), dtype=float32, numpy=0.8333334>
In [17]: m = tf.keras.metrics.BinaryAccuracy() # default threshold is 0.5
In [18]: m(y_true, y_pred)
Out[18]: <tf.Tensor: shape=(), dtype=float32, numpy=0.6666667>
Sorry something this simple is a pain. Welcome to Tensorflow.

Related

What's the difference between using Dataset and ndarray in fit method in Tensorflow 2?

As a newbie to TF, I feel a little confused about the usage of BatchDataset when training a model.
Let's use MNIST as an example. In this classification task, we can load the data and feed the ndarrays x_train and y_train directly into the model.
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5)
The training results are:
Epoch 1/5
2021-02-17 15:43:02.621749: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
1/1875 [..............................] - ETA: 0s - loss: 2.2977 - accuracy: 0.0938WARNING:tensorflow:Callbacks method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0000s vs `on_train_batch_end` time: 0.0010s). Check your callbacks.
1875/1875 [==============================] - 2s 1ms/step - loss: 0.3047 - accuracy: 0.9117
Epoch 2/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1473 - accuracy: 0.9569
Epoch 3/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.1097 - accuracy: 0.9673
Epoch 4/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0905 - accuracy: 0.9724
Epoch 5/5
1875/1875 [==============================] - 2s 1ms/step - loss: 0.0759 - accuracy: 0.9764
And we can also use tf.data.Dataset.from_tensor_slices to generate a BatchDataset and feed it into the fit function.
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
train_ds = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).shuffle(10000).batch(32)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(32)
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(train_ds, epochs=5)
The results of the training process are as follows.
Epoch 1/5
2021-02-17 15:30:34.698718: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll
1875/1875 [==============================] - 3s 1ms/step - loss: 0.2969 - accuracy: 0.9140
Epoch 2/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1462 - accuracy: 0.9566
Epoch 3/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.1087 - accuracy: 0.9669
Epoch 4/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0881 - accuracy: 0.9730
Epoch 5/5
1875/1875 [==============================] - 3s 1ms/step - loss: 0.0765 - accuracy: 0.9759
The model can be trained successfully with both methods, but is there any difference between them? Does using a Dataset for training have additional advantages? If there is no difference between the two methods in this case, what is the typical use case for building a Dataset for training, and when should that method be used?
Thank you.
When we use Model.fit(x=None, y=None, ...), we can pass the training data as plain NumPy arrays, as a keras.utils.Sequence, or as a tf.data dataset.
In the following, we pass each part of the training pair (x and y) separately, as plain NumPy arrays, to the fit function.
# data
(x_train, y_train), (_, _) = tf.keras.datasets.mnist.load_data()
# fit
model.fit(x = x_train, y = y_train, ...
# check
print(x_train.shape, y_train.shape)
print(type(x_train), type(y_train))
# (60000, 28, 28) (60000,)
# <class 'numpy.ndarray'> <class 'numpy.ndarray'>
With tf.data and Sequence, on the other hand, we pass each training pair together as a tuple, while the underlying data are still ndarrays. According to the doc,
A tf.data dataset. Should return a tuple of either (inputs, targets)
A generator or keras.utils.Sequence returning (inputs, targets)
i.e.
# data
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(10000).batch(2)
# check
next(iter(train_ds))
(<tf.Tensor: shape=(2, 28, 28), dtype=uint8, numpy= array([[[...], [[...]]], dtype=uint8)>,
<tf.Tensor: shape=(2,), dtype=uint8, numpy=array([7, 8], dtype=uint8)>)
And that's why, if x is a tf.data, generator, or keras.utils.Sequence instance, y should not be specified (since targets will be obtained from x).
# fit
model.fit(train_ds, ...
Among these three, a tf.data pipeline is the most efficient approach, followed by a generator. When the dataset is small enough, the first approach (passing x and y directly) is usually chosen. But when the dataset gets big enough, you would reach for tf.data or a generator to build an efficient input pipeline. So the choice depends entirely on your situation.
From Keras's post:
NumPy arrays, just like Scikit-Learn and many other Python-based libraries. This is a good option if your data fits in memory.
TensorFlow Dataset objects. This is a high-performance option that is more suitable for datasets that do not fit in memory and that are streamed from disk or from a distributed filesystem.
Python generators that yield batches of data (such as custom subclasses of the keras.utils.Sequence class).
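A typical place where the Dataset route pays off is when you chain shuffling, batching and prefetching so that input preparation overlaps with training. A minimal sketch (the buffer sizes here are illustrative choices of mine, not anything from the question):
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(10000)                  # reshuffle the sample order each epoch
            .batch(32)                       # mini-batches of 32
            .prefetch(tf.data.AUTOTUNE))     # overlap data preparation with training

# model.fit(train_ds, epochs=5)  # the same model as above would consume this directly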

In tensorflow, why is there only one validation loss, when there are many mini-batch of validation data?

In tensorflow, if we provide validation_data in .fit(), we get validation loss. But there is only one validation loss even if the validation dataset has many mini-batches. So I was wondering how tensorflow calculates the loss for validation.
For example:
import tensorflow as tf
import numpy as np
import pandas as pd

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, input_shape=(4,)),
    tf.keras.layers.Dense(1)
])
model.compile(loss='MAE')

df = pd.DataFrame(np.random.rand(1000, 5), columns=['A', 'B', 'C', 'D', 'E'])
data = tf.data.Dataset.from_tensor_slices(df)
data = data.map(lambda x: (x[:4], x[-1:]))
train = data.take(900).batch(10)
val = data.skip(900).batch(10)

model.fit(train, validation_data=val, epochs=50)
this will give:
Epoch 1/50
90/90 [==============================] - 0s 2ms/step - loss: 0.4025 - val_loss: 0.3321
Epoch 2/50
90/90 [==============================] - 0s 635us/step - loss: 0.3114 - val_loss: 0.3065
Epoch 3/50
90/90 [==============================] - 0s 765us/step - loss: 0.2906 - val_loss: 0.2919
Epoch 4/50
90/90 [==============================] - 0s 689us/step - loss: 0.2784 - val_loss: 0.2807
Epoch 5/50
90/90 [==============================] - 0s 629us/step - loss: 0.2709 - val_loss: 0.2738
...
There is only one validation loss, even though there are 10 validation mini-batches in the validation dataset. Does tensorflow take just one mini-batch to calculate the loss? Does it calculate y_pred for each batch individually and then compute the loss over the entire validation data? Or does it calculate 10 losses for the 10 mini-batches and then take a summary statistic?
from here
For training loss, keras does a running average over the batches. For validation loss, a conventional average over all the batches in validation data is performed. The training accuracy is the average of the accuracy values for each batch of training data during training.
I would take that answer with a grain of salt. The code is not easy to follow, but it seems that the validation loss is simply an average over the batches. You could use some synthetic data to verify that if your life depends on it ;-)
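If you want to check this yourself, here is a small synthetic sketch of my own (the shapes and names are made up): compare the value model.evaluate reports on the batched validation set with a plain per-sample mean absolute error computed over all validation samples. If Keras averages over the whole validation set rather than over a single batch, the two numbers should agree.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, input_shape=(4,)),
    tf.keras.layers.Dense(1)
])
model.compile(loss='MAE')

x_val = np.random.rand(100, 4).astype('float32')
y_val = np.random.rand(100, 1).astype('float32')
val = tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(10)

reported = model.evaluate(val, verbose=0)                          # what fit() would log as val_loss
manual = np.mean(np.abs(model.predict(x_val, verbose=0) - y_val))  # plain mean over all 100 samples
print(reported, manual)   # the two should match to floating-point precision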

Can relu be used at the last layer of a neural network?

I hope to find an answer to clarify my doubt. I created a convolutional-autoencoder this way:
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D
from tensorflow.keras.models import Model

input_layer = Input((1, 200, 4))
# encoder
x = Conv2D(64, (1, 3), activation='relu', padding='same')(input_layer)
x = MaxPooling2D((1, 2), padding='same')(x)
x = Conv2D(32, (1, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((1, 2), padding='same')(x)
x = Conv2D(32, (1, 3), activation='relu', padding='same')(x)
encoded = MaxPooling2D((1, 2), padding='same')(x)
# decoder
x = Conv2D(32, (1, 3), activation='relu', padding='same')(encoded)
x = UpSampling2D((1, 2))(x)
x = Conv2D(32, (1, 3), activation='relu', padding='same')(x)
x = UpSampling2D((1, 2))(x)
x = Conv2D(64, (1, 3), activation='relu')(x)
x = UpSampling2D((1, 2))(x)
decoded = Conv2D(4, (1, 3), activation='sigmoid', padding='same')(x)

autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mae',
                    metrics=['mean_squared_error'])
But when I fit the model with the decoder's last activation being sigmoid, as above, the model loss decreases only slightly (and remains unchanged at later epochs), and so does the mean_squared_error (using default Adam settings):
autoencoder.fit(train, train, epochs=100, batch_size=256, shuffle=True,
                validation_data=(test, test), callbacks=callbacks_list)
Epoch 1/100
97/98 [============================>.] - ETA: 0s - loss: 12.3690 - mean_squared_error: 2090.8232
Epoch 00001: loss improved from inf to 12.36328, saving model to weights.best.hdf5
98/98 [==============================] - 6s 65ms/step - loss: 12.3633 - mean_squared_error: 2089.3044 - val_loss: 12.1375 - val_mean_squared_error: 2029.4445
Epoch 2/100
97/98 [============================>.] - ETA: 0s - loss: 12.3444 - mean_squared_error: 2089.8032
Epoch 00002: loss improved from 12.36328 to 12.34172, saving model to weights.best.hdf5
98/98 [==============================] - 6s 64ms/step - loss: 12.3417 - mean_squared_error: 2089.1536 - val_loss: 12.1354 - val_mean_squared_error: 2029.4530
Epoch 3/100
97/98 [============================>.] - ETA: 0s - loss: 12.3461 - mean_squared_error: 2090.5886
Epoch 00003: loss improved from 12.34172 to 12.34068, saving model to weights.best.hdf5
98/98 [==============================] - 6s 63ms/step - loss: 12.3407 - mean_squared_error: 2089.1526 - val_loss: 12.1351 - val_mean_squared_error: 2029.4374
Epoch 4/100
97/98 [============================>.] - ETA: 0s - loss: 12.3320 - mean_squared_error: 2087.0349
Epoch 00004: loss improved from 12.34068 to 12.34050, saving model to weights.best.hdf5
98/98 [==============================] - 6s 63ms/step - loss: 12.3405 - mean_squared_error: 2089.1489 - val_loss: 12.1350 - val_mean_squared_error: 2029.4448
But both the loss and the mean_squared_error decrease quickly when I change the decoder's last activation to relu:
Epoch 1/100
97/98 [============================>.] - ETA: 0s - loss: 9.8283 - mean_squared_error: 1267.3282
Epoch 00001: loss improved from inf to 9.82359, saving model to weights.best.hdf5
98/98 [==============================] - 6s 64ms/step - loss: 9.8236 - mean_squared_error: 1266.0548 - val_loss: 8.4972 - val_mean_squared_error: 971.0208
Epoch 2/100
97/98 [============================>.] - ETA: 0s - loss: 8.1906 - mean_squared_error: 910.6423
Epoch 00002: loss improved from 9.82359 to 8.19058, saving model to weights.best.hdf5
98/98 [==============================] - 6s 62ms/step - loss: 8.1906 - mean_squared_error: 910.5417 - val_loss: 7.6558 - val_mean_squared_error: 811.6011
Epoch 3/100
97/98 [============================>.] - ETA: 0s - loss: 7.3522 - mean_squared_error: 736.2031
Epoch 00003: loss improved from 8.19058 to 7.35255, saving model to weights.best.hdf5
98/98 [==============================] - 6s 61ms/step - loss: 7.3525 - mean_squared_error: 736.2403 - val_loss: 6.8044 - val_mean_squared_error: 650.5342
Epoch 4/100
97/98 [============================>.] - ETA: 0s - loss: 6.6166 - mean_squared_error: 621.1281
Epoch 00004: loss improved from 7.35255 to 6.61435, saving model to weights.best.hdf5
98/98 [==============================] - 6s 61ms/step - loss: 6.6143 - mean_squared_error: 620.6105 - val_loss: 6.2180 - val_mean_squared_error: 572.2390
I want to verify whether it is valid to use an all-relu activation scheme like this in the network architecture, being a novice to deep learning.
What you have asked raises another, very fundamental question. Ask yourself: "What do I actually want the model to do?" Predict a real value, or a value within a certain range? You will get your answer.
But before that, let me give you a brief overview of what activation functions are all about and why we use them.
The main goal of activation functions is to introduce non-linearity into your model. Since a combination of linear functions is itself a linear function, without activation functions a neural network is nothing but a giant linear function, and as such it cannot learn any non-linear behavior at all. This is the primary purpose of using an activation function.
Another purpose is to limit the range of a neuron's output. The following image shows the Sigmoid and ReLU activation functions (the image is collected from here).
These two graphs show exactly what kind of limits they impose on the values passed through them. The Sigmoid function restricts the output to lie between 0 and 1, so we can think of it as mapping some input value to a probability. Where can we use it? For binary classification: if we assign 0 and 1 to the two classes and use a Sigmoid in the output layer, it gives us the probability that an example input belongs to a certain class.
Now on to ReLU. What does it do? It only allows non-negative values through. As you can see, all negative values on the horizontal axis are mapped to 0 on the vertical axis, while for positive values the 45-degree straight line shows that it leaves them as they are. Basically, it clips negative values to 0 and passes non-negative values through unchanged. Mathematically: relu(value) = max(0, value).
Now picture a situation: say you want to predict real values that can be positive, zero, or even negative. Would you use a ReLU activation in the output layer just because it looks cool? No, obviously not: if you did, the model would never be able to predict any negative value, since all negative values get trimmed down to 0.
Coming to your case, I believe this model should predict values that are not limited to the range 0 to 1; it should produce real-valued predictions.
Hence, when you use the sigmoid function, you are basically forcing the model to output values between 0 and 1, which in most cases is not a valid prediction. The model therefore produces large loss and MSE values, because it is being forced to predict something that is nowhere near the actual correct output.
When you use ReLU instead, it performs better, because ReLU doesn't change any non-negative value. The model is then free to predict any non-negative value, so nothing stops it from predicting values close to the actual outputs.
That said, I think the model is meant to predict intensity values, which presumably range from 0 to 255. In that case no negative values come out of your model anyway, so technically there is no need for a ReLU activation in the last layer, as it will not even see any negative values to filter out (if I am not mistaken). But you can still use it, as the official TensorFlow documentation does; it only serves as a safeguard so that no negative values can come out, and ReLU does nothing to non-negative values anyway.
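For completeness, here is a hedged sketch of the two internally consistent setups implied above (the 0-255 intensity assumption and the variable names are mine, not the asker's):
import tensorflow as tf

# Option A: keep the sigmoid output layer, but scale the targets into [0, 1] first,
# e.g. train_scaled = train / 255.0, then fit(train_scaled, train_scaled, ...).
out_sigmoid = tf.keras.layers.Conv2D(4, (1, 3), activation='sigmoid', padding='same')

# Option B: keep the raw 0-255 targets and use an output that can cover that range,
# e.g. relu (non-negative) or linear (unbounded).
out_relu = tf.keras.layers.Conv2D(4, (1, 3), activation='relu', padding='same')
out_linear = tf.keras.layers.Conv2D(4, (1, 3), activation='linear', padding='same')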
You can use the relu function as the activation in the final layer; you can see this in the autoencoder example on the official TensorFlow site, here.
Use a sigmoid/softmax activation in the final output layer when you are solving classification problems where your labels are class values.

make accuracy appear in my results and interpret the loss and val_loss values

I'm new to tensorflow and I'm trying to learn it from examples on github. I found an example, but the loss and val_loss results are greater than 1 (you can see below that they are between 700 and 800), while normally in other examples the loss and val_loss are between 0 and 1.
In addition, I would like to know how to make the accuracy appear.
This is the code:
https://github.com/simoninithomas/DNN-Speech-Recognizer/blob/master/train_utils.py
Thank you!
def train_model(input_to_softmax,
                pickle_path,
                save_model_path,
                train_json='train_corpus.json',
                valid_json='valid_corpus.json',
                minibatch_size=20,
                spectrogram=True,
                mfcc_dim=13,
                optimizer=SGD(lr=0.02, decay=1e-6, momentum=0.9, nesterov=True, clipnorm=5),
                epochs=20,
                verbose=1,
                sort_by_duration=False,
                max_duration=10.0):
    # create a class instance for obtaining batches of data
    audio_gen = AudioGenerator(minibatch_size=minibatch_size,
                               spectrogram=spectrogram, mfcc_dim=mfcc_dim,
                               max_duration=max_duration,
                               sort_by_duration=sort_by_duration)
    # add the training data to the generator
    audio_gen.load_train_data(train_json)
    audio_gen.load_validation_data(valid_json)
    # calculate steps_per_epoch
    num_train_examples = len(audio_gen.train_audio_paths)
    steps_per_epoch = num_train_examples // minibatch_size
    # calculate validation_steps
    num_valid_samples = len(audio_gen.valid_audio_paths)
    validation_steps = num_valid_samples // minibatch_size
    # add CTC loss to the NN specified in input_to_softmax
    model = add_ctc_loss(input_to_softmax)
    # CTC loss is implemented elsewhere, so use a dummy lambda function for the loss
    model.compile(loss={'ctc': lambda y_true, y_pred: y_pred}, optimizer=optimizer)
    # make results/ directory, if necessary
    if not os.path.exists('results'):
        os.makedirs('results')
    # add checkpointer
    checkpointer = ModelCheckpoint(filepath='results/' + save_model_path, verbose=0)
    # train the model
    hist = model.fit_generator(generator=audio_gen.next_train(),
                               steps_per_epoch=steps_per_epoch,
                               epochs=epochs,
                               validation_data=audio_gen.next_valid(),
                               validation_steps=validation_steps,
                               callbacks=[checkpointer], verbose=verbose)
    # save model loss
    with open('results/' + pickle_path, 'wb') as f:
        pickle.dump(hist.history, f)
Epoch 1/20
106/106 [==============================] - 302s - loss: 839.6881 - val_loss: 744.7609
Epoch 2/20
106/106 [==============================] - 276s - loss: 767.3973 - val_loss: 727.8361
Epoch 3/20
106/106 [==============================] - 272s - loss: 752.6904 - val_loss: 720.8375
Epoch 4/20
106/106 [==============================] - 261s - loss: 751.8432 - val_loss: 728.3446
Epoch 5/20
106/106 [==============================] - 261s - loss: 752.1302 - val_loss: 733.3166
Epoch 6/20
106/106 [==============================] - 264s - loss: 752.3786 - val_loss: 722.4345
Epoch 7/20
106/106 [==============================] - 265s - loss: 752.7827 - val_loss: 723.2651
Epoch 8/20
106/106 [==============================] - 263s - loss: 752.5077 - val_loss: 736.0229
Epoch 9/20
106/106 [==============================] - 263s - loss: 752.5616 - val_loss: 731.2018
The loss that you are using is described in this pdf.
When you say accuracy it could mean a lot of things:
Single-unit accuracy: average it over the labels you have (NOTE: you have multiple labels for the same data point, since it's a temporal classification). [Would be between 0 and 1]
Error rate: could be defined as the edit distance between the predicted labels and the true labels. [Would be between 0 and MAX_LABELS, averaged over data points]
Precision of the labels, averaged over all timesteps and data points.
There is no reason for such a quantity to be between 0 and 1.
Your loss, on the other hand, is a connectionist temporal classification (CTC) loss. It predicts either a label or a blank at every timestep, and cross entropy is then applied on top of those labels. The cross entropy of two probability distributions is a non-negative quantity and is not restricted to lie between 0 and 1.
Therefore this is not an issue. If you would like to see accuracies, you will have to take some test data and make predictions, then compute the accuracy (as defined above) against the expected labels using whatever metric you want. After your prediction step you can technically use any metric defined in Tensorflow: https://www.tensorflow.org/api_docs/python/tf/metrics/.
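As a rough illustration (this is my own sketch, not code from the linked repository, and the argument names are assumptions), one common accuracy-like quantity for CTC models is a label error rate: greedy-decode the network output and take the edit distance to the true label sequences:
import tensorflow as tf

def label_error_rate(y_pred, input_lengths, labels_sparse):
    # y_pred: (batch, time, num_classes) output of the CTC model
    # input_lengths: (batch,) number of valid time steps per example
    # labels_sparse: tf.SparseTensor of true label sequences (int64 values)
    decoded, _ = tf.nn.ctc_greedy_decoder(
        tf.transpose(y_pred, [1, 0, 2]),      # the CTC ops expect time-major input
        tf.cast(input_lengths, tf.int32))
    # normalized edit distance between decoded and true label sequences
    return tf.reduce_mean(
        tf.edit_distance(tf.cast(decoded[0], tf.int64), labels_sparse))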

Trouble with a simple custom Keras metric function: can't return argmax?

I'd like to create a custom metric, and I've learned that within a custom metric function everything is a tensor and I need to use the special backend functions. To wrap my head around this, I tried a three-class classification example where I simply return the argmax as the custom metric:
import numpy as np
from tensorflow import keras
from tensorflow.keras import models
from tensorflow.keras import backend as K

def custom(y_true, y_pred):
    return K.argmax(y_pred)

# Neural Network
model = models.Sequential()
model.add(keras.layers.Embedding(len(np.unique(X.values)), 4))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(3, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy', custom])
model.fit(X_train.values, y_train.values, epochs=4)
To my surprise I'm getting floating point values in the output!
Epoch 1/4
1023/1023 [==============================] - 0s 276us/step - loss: 0.3560 - acc: 0.3294 - custom: 1.1867
Epoch 2/4
1023/1023 [==============================] - 0s 52us/step - loss: 0.3368 - acc: 0.3343 - custom: 1.9687
Epoch 3/4
1023/1023 [==============================] - 0s 47us/step - loss: 0.3225 - acc: 0.3324 - custom: 1.9374
Epoch 4/4
1023/1023 [==============================] - 0s 47us/step - loss: 0.3173 - acc: 0.3275 - custom: 1.2825
This clearly isn't doing what I expected, and I don't know why.
Question: Why is my custom metric, which just returns the argmax, not returning an integer vector representing the argmaxes, and instead returning a floating point number?
PS: I modified the custom function to print
def custom(y_true, y_pred):
    x = K.argmax(y_pred)
    x = K.print_tensor(x, message="x is: ")
    return x
and I get output like this
Epoch 4/4
x is: [2 2 0...]
32/1023 [..............................] - ETA: 0s - loss: 0.3113 - acc: 0.2500 - custom: 1.0000x is: [2 0 0...]
x is: [0 0 0...]
x is: [0 0 0...]
x is: [2 0 2...]
Which again isn't making any sense to me. Does anyone know what's happening under the hood?
The argmax function is working correctly: for each sample, argmax returns an integer class index. But the value reported for the metric is the mean of the metric's output over the batch (and a running average over batches within the epoch).
The Keras documentation says:
Returns: Single tensor value representing the mean of the output array across all datapoints.
So if your custom metric function returns an array of class values for a given batch, the model will report the average of those values.
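A tiny sketch of what effectively shows up in the progress bar for that metric (an illustration with made-up predictions, not the questioner's data):
import numpy as np

y_pred = np.array([[0.1, 0.2, 0.7],    # argmax = 2
                   [0.8, 0.1, 0.1],    # argmax = 0
                   [0.2, 0.5, 0.3]])   # argmax = 1
per_sample = np.argmax(y_pred, axis=-1)   # [2, 0, 1] -- the integer vector you expected
print(per_sample.mean())                  # 1.0 -- the float Keras reports for the batch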