Tensorflow: sparse categorical crossentropy and precision metric incompatibility - tensorflow

I'm training a classification model, and I've decided to switch from categorical crossentropy loss function to sparse categorical crossentropy to potentially use less memory and have faster trainings. My training computes precision and recall metrics.
However, when I switch to sparse crossentropy, precision metric starts to fail. The thing is that SparseCategoricalCrossentropy expects true labels to be scalars, while predicted labels to be vectors of size "number of classes", and precision metrics raises an exception of "shape mistmatch" type.
A minimal example to show this (the same model works without the precision score, and fails during the second training with added precision score computation):
import numpy as np
import tensorflow as tf
x = np.arange(0, 20)
y = np.zeros_like(x)
for i in range(len(x)):
if x[i] % 2 == 0:
y[i] = 0 # Even number
else:
y[i] = 1 # Odd number
n_classes = len(np.unique(y))
model = tf.keras.Sequential(
[
tf.keras.layers.Dense(10, input_shape=(1,)),
tf.keras.layers.Dense(n_classes, activation="softmax"),
]
)
print("Train without precision metric")
model.compile(
optimizer="adam",
loss="sparse_categorical_crossentropy",
)
model.fit(x, y, epochs=2)
print("Train with precision metric")
model.compile(
optimizer="adam",
loss="sparse_categorical_crossentropy",
metrics=[tf.keras.metrics.Precision()],
)
model.fit(x, y, epochs=2)
The output is
Metal device set to: Apple M1 Pro
2022-09-20 18:47:20.254419: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2022-09-20 18:47:20.254522: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)
2022-09-20 18:47:20.324585: W tensorflow/core/platform/profile_utils/cpu_utils.cc:128] Failed to get CPU frequency: 0 Hz
Train without precision metric
Epoch 1/2
2022-09-20 18:47:20.441786: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
1/1 [==============================] - ETA: 0s - loss: 5.9380
1/1 [==============================] - 0s 205ms/step - loss: 5.9380
Epoch 2/2
1/1 [==============================] - ETA: 0s - loss: 5.8844
1/1 [==============================] - 0s 4ms/step - loss: 5.8844
Train with precision metric
Epoch 1/2
systemMemory: 16.00 GB
maxCacheSize: 5.33 GB
Traceback (most recent call last):
File "/Users/dima/dev/learn/datascience/test-sparse-precision.py", line 35, in <module>
model.fit(x, y, epochs=2)
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/var/folders/_0/2yc8qfs11xq2vykxzkkngq4m0000gn/T/__autograph_generated_filedw4nh8_p.py", line 15, in tf__train_function
retval_ = ag__.converted_call(ag__.ld(step_function), (ag__.ld(self), ag__.ld(iterator)), None, fscope)
ValueError: in user code:
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/engine/training.py", line 1051, in train_function *
return step_function(self, iterator)
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/engine/training.py", line 1040, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/engine/training.py", line 1030, in run_step **
outputs = model.train_step(data)
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/engine/training.py", line 894, in train_step
return self.compute_metrics(x, y, y_pred, sample_weight)
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/engine/training.py", line 987, in compute_metrics
self.compiled_metrics.update_state(y, y_pred, sample_weight)
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/engine/compile_utils.py", line 501, in update_state
metric_obj.update_state(y_t, y_p, sample_weight=mask)
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/utils/metrics_utils.py", line 70, in decorated
update_op = update_state_fn(*args, **kwargs)
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/metrics/base_metric.py", line 140, in update_state_fn
return ag_update_state(*args, **kwargs)
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/metrics/metrics.py", line 818, in update_state **
return metrics_utils.update_confusion_matrix_variables(
File "/Users/dima/sw/mambaforge/envs/data-science/lib/python3.10/site-packages/keras/utils/metrics_utils.py", line 619, in update_confusion_matrix_variables
y_pred.shape.assert_is_compatible_with(y_true.shape)
ValueError: Shapes (None, 2) and (None, 1) are incompatible
It occurs on two different environments: Tensorflow 2.9.2 from Apple for M1, and on Tensorflow 2.8.0 on Ubuntu.
Does anyone know how to deal with this besides writing my own metric class?

As mentioned by you and here, We can use SparseCategoricalCrossentropy loss if we have labels as integers and CategoricalCrossentropy loss if we have labels in one-hot representation.
But to fix the above mentioned arror, you can use binarycrossentropy loss as there are binary labels(0,1) and change the final layer arguments as below:
model = tf.keras.Sequential(
[
tf.keras.layers.Dense(10, input_shape=(1,)),
tf.keras.layers.Dense(1, activation="sigmoid"),
]
)
print("Train without precision metric")
model.compile(
optimizer="adam",
loss="BinaryCrossentropy",
)
model.fit(x, y, epochs=2)
Output:
Train without precision metric
Epoch 1/2
1/1 [==============================] - 0s 475ms/step - loss: 0.8964
Epoch 2/2
1/1 [==============================] - 0s 12ms/step - loss: 0.8776
<keras.callbacks.History at 0x7f438e6ce190>
and to check the precision score:
print("Train with precision metric")
model.compile(
optimizer="adam",
loss="BinaryCrossentropy",
metrics=[tf.keras.metrics.Precision()],
)
model.fit(x, y, epochs=2)
Output:
Train with precision metric
Epoch 1/2
1/1 [==============================] - 1s 636ms/step - loss: 0.8595 - precision: 0.5263
Epoch 2/2
1/1 [==============================] - 0s 11ms/step - loss: 0.8420 - precision: 0.5263
<keras.callbacks.History at 0x7f438e627e50>

Related

Invalid Argument Error when using Tensorboard callback

I have used a Tensorboard callback in fitting a model consisting of one embedding layer and one SimpleRNN layer. The model performs binary sentiment classification for 9600 input text sequences. They have been tokenised and padded in advance.
# 1. Remove previous logs
!rm -rf ./logs/
# 2. Change to Py_file_dir
os.chdir(...)
# input_dim = 43489 (size of tokenizer word dictionary); output_dim = 100 (GloVe 100d embeddings); input_length = 1403 (length of longest text sequence).
# xtr_pad is padded, tokenised text sequences. nrow = 9600, ncol = input_length = 1403.
model= Sequential()
model.add(Embedding(input_dim, output_dim, input_length= input_length,
weights= [Embedding_matrix], trainable= False))
model.add(SimpleRNN(200))
model.add(Dense(1, activation= 'sigmoid'))
model.compile(loss='binary_crossentropy', optimizer= 'adam', metrics=['accuracy'])
tb = TensorBoard(histogram_freq=1, log_dir= 'tbcallback_prac')
tr_results= model.fit(xtr_pad, ytr, epochs= 2, batch_size= 64, verbose= 1,
validation_split= 0.2, callbacks= [tb])
# In command prompt enter: tensorboard --logdir tbcallback_prac
I have run this on Jupyterlab and on the first time the model trains without issue. I was able to view the Tensorboard statistics on local host.
However when I run this same code a second time, i.e. removing logs and fitting model it completed the first epoch of training, but returns this error before the 2nd epoch begins.
Train on 7680 samples, validate on 1920 samples
Epoch 1/2
7680/7680 [==============================] - ETA: 0s - loss: 0.2919 - accuracy: 0.9004
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
<ipython-input-12-a1cde9b5b1f4> in <module>()
7 tb = TensorBoard(histogram_freq=1, log_dir= 'tbcallback_prac')
8 tr_results= model.fit(xtr_pad, ytr, epochs= 2, batch_size= 64, verbose= 1,
----> 9 validation_split= 0.2, callbacks= [tb])
...
InvalidArgumentError: You must feed a value for placeholder tensor 'embedding_input' with dtype float and shape [?,1403]
[[{{node embedding_input}}]]
Note 1403 is the length of all padded, tokenised sequences in training input 'xtr'.
Thanks in advance for any help!
I have no issue but I think that is a dimensions problem when working on logtis and sigmoid
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 3072, 64) 64000
simple_rnn (SimpleRNN) (None, 200) 53000
dense (Dense) (None, 1) 201
=================================================================
Total params: 117,201
Trainable params: 117,201
Non-trainable params: 0
_________________________________________________________________
val_dir: F:\models\checkpoint\ale_highscores_3\validation
Epoch 1/1500
2/2 [==============================] - ETA: 0s - loss: -0.5579 - accuracy: 0.1000[<KerasTensor: shape=(None, 3072) dtype=float32 (created by layer 'embedding_input')>]
<keras.engine.functional.Functional object at 0x00000233003A8550>
Press AnyKey!
2/2 [==============================] - 14s 7s/step - loss: -0.5579 - accuracy: 0.1000 - val_loss: -0.6446 - val_accuracy: 0.1000
Epoch 2/1500
2/2 [==============================] - ETA: 0s - loss: -0.6588 - accuracy: 0.1000[<KerasTensor: shape=(None, 3072) dtype=float32 (created by layer 'embedding_input')>]
<keras.engine.functional.Functional object at 0x00000233003A8C40>
Press AnyKey!
2/2 [==============================] - 13s 7s/step - loss: -0.6588 - accuracy: 0.1000 - val_loss: -0.7242 - val_accuracy: 0.1000
Epoch 3/1500
1/2 [==============>...............] - ETA: 6s - loss: -0.1867 - accuracy: 0.1429

Keras Callback - Save the per-epoch outputs

I am trying to create a callback for extracting the per-epoch outputs in the 1st hidden layer of my model. With what I have written, self.model.layers[0].output outputs a Tensor object but I could not see the actual entries.
Ideally I would like to save these output tensors, and visualise using an epoch vs mean-output plot. This has been implemented in Glorot & Bengio (2010) but the source code is not available.
How shall I edit my code in order to make the model fitting process save the outputs in each epoch? Thanks in advance.
class PerEpochOutputCallback(tf.keras.callbacks.Callback):
def on_epoch_end(self, epoch, logs=None):
print('First layer output of epoch:', epoch+1, self.model.layers[0].output)
model_relu_3= Sequential()
# Use ReLU hidden layer
model_relu_3.add(Dense(3, input_dim= 8, activation= 'relu', kernel_initializer= 'uniform'))
model_relu_3.add(Dense(5, input_dim= 3, activation= 'relu', kernel_initializer= 'uniform'))
model_relu_3.add(Dense(5, input_dim= 5, activation= 'relu', kernel_initializer= 'uniform'))
model_relu_3.add(Dense(1, activation= 'sigmoid', kernel_initializer= 'uniform'))
model_relu_3.compile(loss='binary_crossentropy', optimizer='adam', metrics= ['accuracy'])
# Train model
tr_results = model_relu_3.fit(X, y, validation_split=0.2, epochs=10, batch_size=32,
verbose=2, callbacks=[PerEpochOutputCallback()])
====
Train on 614 samples, validate on 154 samples
Epoch 1/10
First layer output of epoch: 1 Tensor("dense_42/Relu:0", shape=(None, 3), dtype=float32)
614/614 - 0s - loss: 0.6915 - accuracy: 0.6531 - val_loss: 0.6897 - val_accuracy: 0.6429
Epoch 2/10
First layer output of epoch: 2 Tensor("dense_42/Relu:0", shape=(None, 3), dtype=float32)
614/614 - 0s - loss: 0.6874 - accuracy: 0.6531 - val_loss: 0.6853 - val_accuracy: 0.6429
Epoch 3/10
First layer output of epoch: 3 Tensor("dense_42/Relu:0", shape=(None, 3), dtype=float32)
614/614 - 0s - loss: 0.6824 - accuracy: 0.6531 - val_loss: 0.6783 - val_accuracy: 0.6429

Tensorflow 2 Metrics produce wrong results with 2 GPUs

I took this piece of code from tensorflow documentation about distributed training with custom loop https://www.tensorflow.org/tutorials/distribute/custom_training and I just fixed it to work with the tf.keras.metrics.AUC and run it with 2 GPUS (2 Nvidia V100 from a DGX machine).
# Import TensorFlow
import tensorflow as tf
# Helper libraries
import numpy as np
print(tf.__version__)
fashion_mnist = tf.keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
# Adding a dimension to the array -> new shape == (28, 28, 1)
# We are doing this because the first layer in our model is a convolutional
# layer and it requires a 4D input (batch_size, height, width, channels).
# batch_size dimension will be added later on.
train_images = train_images[..., None]
test_images = test_images[..., None]
# One hot
train_labels = tf.keras.utils.to_categorical(train_labels, 10)
test_labels = tf.keras.utils.to_categorical(test_labels, 10)
# Getting the images in [0, 1] range.
train_images = train_images / np.float32(255)
test_images = test_images / np.float32(255)
# If the list of devices is not specified in the
# `tf.distribute.MirroredStrategy` constructor, it will be auto-detected.
GPUS = [0, 1]
devices = ["/gpu:" + str(gpu_id) for gpu_id in GPUS]
strategy = tf.distribute.MirroredStrategy(devices=devices)
print ('Number of devices: {}'.format(strategy.num_replicas_in_sync))
BUFFER_SIZE = len(train_images)
BATCH_SIZE_PER_REPLICA = 64
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
EPOCHS = 10
train_dataset = tf.data.Dataset.from_tensor_slices((train_images, train_labels)).shuffle(BUFFER_SIZE).batch(GLOBAL_BATCH_SIZE)
test_dataset = tf.data.Dataset.from_tensor_slices((test_images, test_labels)).batch(GLOBAL_BATCH_SIZE)
train_dist_dataset = strategy.experimental_distribute_dataset(train_dataset)
test_dist_dataset = strategy.experimental_distribute_dataset(test_dataset)
def create_model():
model = tf.keras.Sequential([
tf.keras.layers.Conv2D(32, 3, activation='relu'),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Conv2D(64, 3, activation='relu'),
tf.keras.layers.MaxPooling2D(),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(64, activation='relu'),
tf.keras.layers.Dense(10, activation='softmax')
])
return model
with strategy.scope():
# Set reduction to `none` so we can do the reduction afterwards and divide by
# global batch size.
loss_object = tf.keras.losses.CategoricalCrossentropy(
from_logits=True,
reduction=tf.keras.losses.Reduction.NONE)
def compute_loss(labels, predictions):
per_example_loss = loss_object(labels, predictions)
return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
with strategy.scope():
test_loss = tf.keras.metrics.Mean(name='test_loss')
train_accuracy = tf.keras.metrics.CategoricalAccuracy(
name='train_accuracy')
test_accuracy = tf.keras.metrics.CategoricalAccuracy(
name='test_accuracy')
train_auc = tf.keras.metrics.AUC(name='train_auc')
test_auc = tf.keras.metrics.AUC(name='test_auc')
# model, optimizer, and checkpoint must be created under `strategy.scope`.
with strategy.scope():
model = create_model()
optimizer = tf.keras.optimizers.Adam()
def train_step(inputs):
images, labels = inputs
with tf.GradientTape() as tape:
predictions = model(images, training=True)
loss = compute_loss(labels, predictions)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
train_accuracy(labels, predictions)
train_auc(labels, predictions)
return loss
def test_step(inputs):
images, labels = inputs
predictions = model(images, training=False)
t_loss = loss_object(labels, predictions)
test_loss.update_state(t_loss)
test_accuracy(labels, predictions)
test_auc(labels, predictions)
# `run` replicates the provided computation and runs it
# with the distributed input.
#tf.function
def distributed_train_step(dataset_inputs):
per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
axis=None)
#tf.function
def distributed_test_step(dataset_inputs):
return strategy.run(test_step, args=(dataset_inputs,))
for epoch in range(EPOCHS):
# TRAIN LOOP
total_loss = 0.0
num_batches = 0
for x in train_dist_dataset:
total_loss += distributed_train_step(x)
num_batches += 1
train_loss = total_loss / num_batches
# TEST LOOP
for x in test_dist_dataset:
distributed_test_step(x)
template = ("Epoch {}, Loss: {}, Accuracy: {}, AUC: {},"
"Test Loss: {}, Test Accuracy: {}, Test AUC: {}")
print (template.format(epoch+1,
train_loss, train_accuracy.result()*100, train_auc.result()*100,
test_loss.result(), test_accuracy.result()*100, test_auc.result()*100))
test_loss.reset_states()
train_accuracy.reset_states()
test_accuracy.reset_states()
train_auc.reset_states()
test_auc.reset_states()
The problem is that AUC's evaluation is definitely wrong cause it exceeds its range (should be from 0-100) and i get theese results by running the above code for one time:
Epoch 1, Loss: 1.8061423301696777, Accuracy: 66.00833892822266, AUC: 321.8688659667969,Test Loss: 1.742477536201477, Test Accuracy: 72.0999984741211, Test AUC: 331.33709716796875
Epoch 2, Loss: 1.7129968404769897, Accuracy: 74.9816665649414, AUC: 337.37017822265625,Test Loss: 1.7084736824035645, Test Accuracy: 75.52999877929688, Test AUC: 337.1878967285156
Epoch 3, Loss: 1.643971562385559, Accuracy: 81.83333587646484, AUC: 355.96209716796875,Test Loss: 1.6072628498077393, Test Accuracy: 85.3499984741211, Test AUC: 370.603759765625
Epoch 4, Loss: 1.5887378454208374, Accuracy: 87.27833557128906, AUC: 373.6204528808594,Test Loss: 1.5906082391738892, Test Accuracy: 87.13999938964844, Test AUC: 371.9998474121094
Epoch 5, Loss: 1.581775426864624, Accuracy: 88.0, AUC: 373.9468994140625,Test Loss: 1.5964380502700806, Test Accuracy: 86.68000030517578, Test AUC: 371.0227355957031
Epoch 6, Loss: 1.5764907598495483, Accuracy: 88.49166870117188, AUC: 375.2404479980469,Test Loss: 1.5832056999206543, Test Accuracy: 87.94000244140625, Test AUC: 373.41998291015625
Epoch 7, Loss: 1.5698528289794922, Accuracy: 89.19166564941406, AUC: 376.473876953125,Test Loss: 1.5770654678344727, Test Accuracy: 88.58000183105469, Test AUC: 375.5516662597656
Epoch 8, Loss: 1.564456820487976, Accuracy: 89.71833801269531, AUC: 377.8564758300781,Test Loss: 1.5792100429534912, Test Accuracy: 88.27000427246094, Test AUC: 373.1791687011719
Epoch 9, Loss: 1.5612279176712036, Accuracy: 90.02000427246094, AUC: 377.9949645996094,Test Loss: 1.5729509592056274, Test Accuracy: 88.9800033569336, Test AUC: 375.5257263183594
Epoch 10, Loss: 1.5562015771865845, Accuracy: 90.54000091552734, AUC: 378.9789123535156,Test Loss: 1.56815767288208, Test Accuracy: 89.3499984741211, Test AUC: 375.8636474609375
Accuracy is ok but it seems that it's the only one metric that behaves nice. I tried other metrics too but they are not evaluated correctly. It seems that the problems come when using more than one GPU, cause when I run this code with one GPU it produce the right results.
When you use distributed strategy, the metric must be constructed and used inside the strategy.scope() block. So when you want to call the metric.result() method, remember to put it inside the with strategy.scope() block.
with strategy.scope():
print(metric.result())

Jacobian matrix of logits with respect to image using tf.GradientTape

I am trying to find the Jacobian of logits with respect to input but I do get None and I could not figure it why.
Let'say I have a model, I trained it and saved it.
import tensorflow as tf
print("TensorFlow version: ", tf.__version__)
tf.keras.backend.set_floatx('float64')
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
#Normalize the images, between 0-1
x_train, x_test = x_train / 255.0, x_test / 255.0
# Add a channels dimension
x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]
print(x_train.shape)
#(60000, 28, 28, 1)
print(y_train.shape)
(60000,)
print(x_test.shape)
#(10000, 28, 28, 1)
print(y_test.shape)
#(10000,)
num_class = 10
# Convert labels to one hot encoded vectors.
y_train_oh, y_test_oh = tf.keras.utils.to_categorical(y_train, num_classes= num_class, dtype='float32'), tf.keras.utils.to_categorical(y_test, num_classes= num_class, dtype='float32')
print(y_train_oh.shape)
#(60000, 10)
print(y_test_oh.shape)
#(10000, 10)
batch_size = 32
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train_oh)).shuffle(10000).batch(batch_size)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test_oh)).batch(batch_size)
IMG_SIZE = (28, 28, 1)
input_img = tf.keras.layers.Input(shape=IMG_SIZE)
hidden_layer_1 = tf.keras.layers.Conv2D(filters = 16, kernel_size = (3, 3), strides=(1, 1), padding='same', activation=tf.nn.relu)(input_img)
hidden_layer_2 = tf.keras.layers.Conv2D(filters = 32, kernel_size = (3, 3), strides=(2, 2), padding='same', activation=tf.nn.relu)(hidden_layer_1)
hidden_layer_3 = tf.keras.layers.Conv2D(filters = 64, kernel_size = (3, 3), strides=(2, 2), padding='same', activation=tf.nn.relu)(hidden_layer_2)
flatten_layer = tf.keras.layers.Flatten()(hidden_layer_3)
output_img = tf.keras.layers.Dense(num_class)(flatten_layer)
#NO SOFTMAX LAYER IN THE END, WE WILL DO IT LATER
#predictions = tf.nn.softmax(logits)
model = tf.keras.Model(input_img, output_img)
model.summary()
loss_object = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
# This function accepts one-hot encoded labels
optimizer = tf.keras.optimizers.Adam()
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.CategoricalAccuracy(name='train_accuracy')
test_loss = tf.keras.metrics.Mean(name='test_loss')
test_accuracy = tf.keras.metrics.CategoricalAccuracy(name='test_accuracy')
#tf.function
def train_step(images, labels):
with tf.GradientTape() as tape:
# training=True is only needed if there are layers with different
# behavior during training versus inference (e.g. Dropout).
predictions = model(images, training=True)
loss = loss_object(labels, predictions)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
train_loss(loss)
train_accuracy(labels, predictions)
#tf.function
def test_step(images, labels):
# training=False is only needed if there are layers with different
# behavior during training versus inference (e.g. Dropout).
predictions = model(images, training=False)
t_loss = loss_object(labels, predictions)
test_loss(t_loss)
test_accuracy(labels, predictions)
# Train the model for 15 epochs.
num_epochs = 15
train_loss_results = []
train_accuracy_results = []
test_loss_results = []
test_accuracy_results = []
for epoch in range(num_epochs):
# Reset the metrics at the start of the next epoch
train_loss.reset_states()
train_accuracy.reset_states()
test_loss.reset_states()
test_accuracy.reset_states()
for images, labels in train_ds:
train_step(images, labels)
for test_images, test_labels in test_ds:
test_step(test_images, test_labels)
train_loss_results.append(train_loss.result())
train_accuracy_results.append(train_accuracy.result())
test_loss_results.append(test_loss.result())
test_accuracy_results.append(test_accuracy.result())
template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'
print(template.format(epoch+1,
train_loss.result(),
train_accuracy.result()*100,
test_loss.result(),
test_accuracy.result()*100))
tf.keras.models.save_model(model = model, filepath = 'model.h5', overwrite=True, include_optimizer=True)
# Epoch 1, Loss: 0.1654163608489558, Accuracy: 95.22, Test Loss: 0.061988271648914496, Test Accuracy: 97.88
# Epoch 2, Loss: 0.060983153790452826, Accuracy: 98.15833333333333, Test Loss: 0.044874734015780696, Test Accuracy: 98.53
# Epoch 3, Loss: 0.042541984771347297, Accuracy: 98.69, Test Loss: 0.042536806688480366, Test Accuracy: 98.57000000000001
# Epoch 4, Loss: 0.03330485398344463, Accuracy: 98.98166666666667, Test Loss: 0.039308084282613225, Test Accuracy: 98.64
# Epoch 5, Loss: 0.024959077225852524, Accuracy: 99.205, Test Loss: 0.04370295960736327, Test Accuracy: 98.67
# Epoch 6, Loss: 0.020565333928674955, Accuracy: 99.33666666666666, Test Loss: 0.04245114839809372, Test Accuracy: 98.69
# Epoch 7, Loss: 0.01639637468442185, Accuracy: 99.47666666666667, Test Loss: 0.04561551753656099, Test Accuracy: 98.72999999999999
# Epoch 8, Loss: 0.013642370500962534, Accuracy: 99.56333333333333, Test Loss: 0.04333075060614142, Test Accuracy: 98.83
# Epoch 9, Loss: 0.010697861799085589, Accuracy: 99.655, Test Loss: 0.05918524164135248, Test Accuracy: 98.48
# Epoch 10, Loss: 0.011164671695055153, Accuracy: 99.61666666666666, Test Loss: 0.05492968221334442, Test Accuracy: 98.64
# Epoch 11, Loss: 0.008642793950046499, Accuracy: 99.69833333333334, Test Loss: 0.05367191278261649, Test Accuracy: 98.74000000000001
# Epoch 12, Loss: 0.00788155746288626, Accuracy: 99.74499999999999, Test Loss: 0.06254112380584512, Test Accuracy: 98.68
# Epoch 13, Loss: 0.006521700676742724, Accuracy: 99.77000000000001, Test Loss: 0.06381602274510409, Test Accuracy: 98.7
# Epoch 14, Loss: 0.007104389384812846, Accuracy: 99.75166666666667, Test Loss: 0.05241271737958395, Test Accuracy: 98.87
# Epoch 15, Loss: 0.006479600550850722, Accuracy: 99.77833333333334, Test Loss: 0.06816933916442823, Test Accuracy: 98.74000000000001
You can find the saved model in h5 format in this link, if you do not want to train it.
It works well so far, I can do predictions on some samples:
predictions = model(mnist_twos, training=False)
for i, logits in enumerate(predictions):
class_idx = tf.argmax(logits).numpy()
p = tf.nn.softmax(logits)[class_idx] #probabilities
print("Example {} prediction: {} ({:4.1f}%)".format(i, class_idx, 100*p))
Example 0 prediction: 2 (100.0%)
Example 1 prediction: 2 (100.0%)
Example 2 prediction: 2 (100.0%)
Example 3 prediction: 2 (100.0%)
Example 4 prediction: 2 (100.0%)
Example 5 prediction: 2 (100.0%)
Example 6 prediction: 2 (100.0%)
Example 7 prediction: 2 (100.0%)
Example 8 prediction: 2 (100.0%)
Example 9 prediction: 2 (100.0%)
What I want to do is now to find the jacobian matrix of logits with respect to input image. Since I have 10 selected images, I will have a Jacobian matrix of size (10, 28, 28, 1) since the shape of the MNIST sample is (28, 28, 1). I can do this with Tensorflow 1.0 like:
for i in range(n_class):
if i==0:
j = tf.gradients(tf.reshape(logits, (-1,))[i], X_p)
else:
j = tf.concat([j, tf.gradients(tf.reshape(logits, (-1,))[i], X_p)],axis=0)
where X_p is the placeholder for the image I am feeding in.
X_p = tf.placeholder(shape=[28, 28, 1], dtype=tf.float32)
However, I am currently using Tensorflow 2.0 and I cannot make it work using tf.GradientTape. It always ends up None. This seems to be a common problem for everyone and I followed the examples here but to no avail. Can someone help me about it?
Please check the batch_jacobian method of the GradinetTape. https://www.tensorflow.org/api_docs/python/tf/GradientTape#batch_jacobian
Convert your input to the tf variable if you are getting None gradients even after batch_jacobian.

model.fit_generator() fails with use_multiprocessing=True

In the code example below, I can train the model only when NOT using multiprocessing.
My generator is straight from the tensorflow.keras.utils.Sequence description https://www.tensorflow.org/api_docs/python/tf/keras/utils/Sequence
Any idea how to fix the generator to allow multiprocessing?
Running on Win 10, tensorflow 1.13.1, python 3.6.8
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.utils import Sequence
# Generator
class DataGenerator(Sequence):
def __init__(self, dim, batch_size, n_channels):
self.dim = dim
self.batch_size = batch_size
self.n_channels = n_channels
def __len__(self):
return 100
def __getitem__(self, idx):
X = np.random.randn(self.batch_size, self.dim, self.n_channels)
Y = np.random.randn(self.batch_size, self.dim, 1)
return X, Y
dim= 32
batch_size= 64
n_channels= 3
# Generators
training_generator = DataGenerator(dim, batch_size, n_channels)
validation_generator = DataGenerator(dim, batch_size, n_channels)
# Model
model = Sequential()
model.add(layers.GRU(128, return_sequences=True,
batch_input_shape=[None, training_generator.dim, training_generator.n_channels]))
model.add(layers.Dense(1))
model.compile(loss='mse', optimizer='adam')
# This training procedure runs
model.fit_generator(generator=training_generator,
epochs = 2,
steps_per_epoch = 100,
max_queue_size = 32,
validation_data=validation_generator,
validation_steps = 20,
verbose=1)
# This training procedure fails (Only change is that I added the multiprocessing options)
model.fit_generator(generator=training_generator,
epochs = 2,
steps_per_epoch = 100,
max_queue_size = 32,
validation_data=validation_generator,
validation_steps = 20,
verbose=1,
use_multiprocessing=True,
workers=4)
I expected the second fit_generator() call to train the model like the first one. Instead, I get no output, not even an error message.
I tried your code on Ubuntu 18.04.2 LTS machine with python 3.6.8 and tensorflow 1.13.1. It works in both cases as log shown below:
2019-07-13 12:56:17.003119: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
100/100 [==============================] - 3s 27ms/step - loss: 0.9987
100/100 [==============================] - 10s 103ms/step - loss: 0.9973 - val_loss: 0.9987
Epoch 2/2
100/100 [==============================] - 3s 26ms/step - loss: 0.9955
100/100 [==============================] - 8s 83ms/step - loss: 1.0028 - val_loss: 0.9955
Multiprocessing=True ......
Epoch 1/2
100/100 [==============================] - 3s 32ms/step - loss: 0.9952
100/100 [==============================] - 9s 89ms/step - loss: 0.9962 - val_loss: 0.9952
Epoch 2/2
100/100 [==============================] - 3s 28ms/step - loss: 0.9967
100/100 [==============================] - 9s 86ms/step - loss: 0.9968 - val_loss: 0.9967"
My suggestion is to first try with CPU only mode, by putting BOTH the model and the fit_generator code under "with tf.device('/cpu:0'):". If it works, it would be GPU related issue, such as proper driver, tensorflow with GPU support etc. Most likely, the issue was caused by GPU hanging.