Keras model.fit doesn't use XLA_GPU device - tensorflow

tensorflow-gpu==2.3.0 is properly installed.
The GPU device can be discovered; tf.config.get_visible_devices() shows the GPU device correctly:
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'),
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU')]
However, after I build my model with Keras and call model.fit, I find that the GPU is not utilized at all.
The execution log shows:
95/237 [===========>..................] - ETA: 3s - loss: 2883.6201 - mse: 2883.6201
Executing op __inference_train_function_56696 in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op __inference_train_function_56696 in device /job:localhost/replica:0/task:0/device:CPU:0
97/237 [===========>..................] - ETA: 3s - loss: 2878.0310 - mse: 2878.0310
Executing op __inference_train_function_56696 in device /job:localhost/replica:0/task:0/device:CPU:0
Executing op __inference_train_function_56696 in device /job:localhost/replica:0/task:0/device:CPU:0
99/237 [===========>..................] - ETA: 3s - loss: 2876.5935 - mse: 2876.5935
Executing op __inference_train_function_56696 in device /job:localhost/replica:0/task:0/device:CPU:0
What can I do to find out what's going on under the hood? How can I run model.fit on a specified GPU?
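A minimal diagnostic sketch (assuming TensorFlow 2.x; build_model, x_train and y_train are placeholders) to check whether a plain GPU device, not just an XLA_GPU, is visible, and to log and force device placement:

import tensorflow as tf

# Log the device every op actually runs on (call before building the model).
tf.debugging.set_log_device_placement(True)

# A usable CUDA GPU should appear here with device_type='GPU'.
# If this list is empty, training falls back to the CPU (often a
# CUDA/cuDNN version mismatch with the tensorflow-gpu build).
print(tf.config.list_physical_devices('GPU'))

# Pin model construction and training to the first GPU explicitly.
with tf.device('/GPU:0'):
    model = build_model()              # placeholder model-building function
    model.fit(x_train, y_train, epochs=10)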

Related

Shared Loss for Multitasking Model

I am currently training a multitask classification model.
While training the model, I can see there are two loss metrics for my two classification outputs, species_loss and diseases_loss, but what I am curious about is why there is another loss metric named val_loss. Is that a shared loss between species_loss and diseases_loss? Is it OK to set my EarlyStopping monitor directly on that val_loss metric?
Epoch 1/100
100/100 [==============================] - 134s 1s/step - loss: 0.5361 - species_loss: 0.1241 - diseases_loss: 0.4120 - species_accuracy: 0.9781 - diseases_accuracy: 0.9022 - val_loss: 2.9369 - val_species_loss: 0.1653 - val_diseases_loss: 2.7716 - val_species_accuracy: 0.9494 - val_diseases_accuracy: 0.5863
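For reference, in a multi-output Keras model loss (and val_loss) is the weighted sum of the per-output losses, so monitoring val_loss tracks both tasks at once. A minimal sketch, assuming model, train_ds and val_ds already exist:

from tensorflow.keras.callbacks import EarlyStopping

# val_loss combines val_species_loss and val_diseases_loss (via loss_weights),
# so stopping on it reacts to regressions in either task. Monitor a single
# output's loss (e.g. 'val_diseases_loss') instead if only that task matters.
early_stop = EarlyStopping(monitor='val_loss', patience=5,
                           restore_best_weights=True)

model.fit(train_ds, validation_data=val_ds, epochs=100,
          callbacks=[early_stop])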

Accuracy in history dictionary different from what is printed on screen

When training a model in Keras, the accuracies printed on screen at every epoch are different from what is saved in the history object. For example (minimal test, compacted output):
history = model.fit(...)
Epoch 1/5
156/156 [===] - loss: 0.6325 - accuracy: 0.7700 - val_loss: 0.4330 - val_accuracy: 0.8156
Epoch 2/5
156/156 [===] - loss: 0.3855 - accuracy: 0.8538 - val_loss: 0.4692 - val_accuracy: 0.8050
Epoch 3/5
156/156 [===] - loss: 0.3918 - accuracy: 0.8427 - val_loss: 0.4666 - val_accuracy: 0.7861
Epoch 4/5
156/156 [===] - loss: 0.3820 - accuracy: 0.8461 - val_loss: 0.4101 - val_accuracy: 0.8014
Epoch 5/5
156/156 [===] - loss: 0.3927 - accuracy: 0.8492 - val_loss: 0.4092 - val_accuracy: 0.7979
Then (rounded like the printed values for convenience):
>>> [round(x, 4) for x in history.history['accuracy']]
[0.8184, 0.8474, 0.8484, 0.8488, 0.8476]
>>> [round(x, 4) for x in history.history['val_accuracy']]
[0.8156, 0.805, 0.7861, 0.8014, 0.7979]
As you can see, while the validation accuracies match the printed values, the training accuracies do not (tested both in Colab with a GPU and on a local PC with a CPU, using Keras 2.4.0 and TensorFlow 2.4.1).
This is a problem if you want to save data from multiple tests to a file, for example. What am I getting wrong?
EDIT: here is an example to reproduce the problem, slightly modified from TF MNIST quickstart. See the block right after calling model.fit().
https://colab.research.google.com/drive/14Uogeq8wRlZlinaKLbkFr_Bl2aLzUJuy?usp=sharing
EDIT 2: as suggested by another user, I submitted a bug issue here: https://github.com/tensorflow/tensorflow/issues/48408
I used your Colab and was able to reproduce your issue. Yes, this looks like a serious bug. I tested the code in both CPU and GPU mode with tf 2.0, 2.1 and 2.3 without any issue, but the problem appears in tf 2.4 and tf-nightly.
I would suggest you raise a bug issue on the TensorFlow GitHub, and share a cross-link here and there so that others can follow the updates. In the meantime, you can roll back to tf 2.3. However, I didn't check whether callbacks.CSVLogger also has an issue in the latest release; you can check that too (a sketch follows below).
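A minimal sketch of that cross-check, assuming a compiled model and x_train/y_train are already in place: log every epoch with CSVLogger and compare the file against history.history afterwards.

import pandas as pd
from tensorflow.keras.callbacks import CSVLogger

csv_logger = CSVLogger('training_log.csv')
history = model.fit(x_train, y_train, validation_split=0.2,
                    epochs=5, callbacks=[csv_logger])

# Compare what the History object stored with what CSVLogger wrote per epoch.
logged = pd.read_csv('training_log.csv')
print([round(x, 4) for x in history.history['accuracy']])
print([round(x, 4) for x in logged['accuracy']])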

Why does Keras behave better than PyTorch under the same network configuration?

Recently, I compared a Keras implementation and a PyTorch implementation of UNet++ on the same dataset. With Keras, the loss decreases steadily and the accuracy is higher after 10 epochs, while with PyTorch the loss decreases unevenly and the accuracy is lower after 10 epochs. Has anyone met such a problem and found an answer?
The final PyTorch training output looks like this:
2019-12-15 18:14:20 Epoch:9 Iter: 1214/1219 loss:0.464673 acc:0.581713
2019-12-15 18:14:21 Epoch:9 Iter: 1215/1219 loss:0.450462 acc:0.584101
2019-12-15 18:14:21 Epoch:9 Iter: 1216/1219 loss:0.744811 acc:0.293406
2019-12-15 18:14:22 Epoch:9 Iter: 1217/1219 loss:0.387612 acc:0.735630
2019-12-15 18:14:23 Epoch:9 Iter: 1218/1219 loss:0.767146 acc:0.364759
The final Keras training output looks like this:
685/690 [============================>.] - ETA: 2s - loss: 0.4940 - acc: 0.7309
686/690 [============================>.] - ETA: 1s - loss: 0.4941 - acc: 0.7306
687/690 [============================>.] - ETA: 1s - loss: 0.4939 - acc: 0.7308
688/690 [============================>.] - ETA: 0s - loss: 0.4942 - acc: 0.7303
689/690 [============================>.] - ETA: 0s - loss: 0.4943 - acc: 0.7302
Well, it's pretty hard to say without any code snippets. That being said, in general, initialization is way more important than you might think. The default weight initialization in PyTorch differs from Keras, and I had similar issues in the past.
Another thing to check is the optimizer parameters: make sure that you are not only using the same optimizer (SGD, Adam, ...) but also the same parameters (learning rate, betas, momentum, ...), as sketched below.
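For example, pinning the Adam hyper-parameters explicitly on both sides (a sketch; the defaults noted in the comments are the documented ones for recent releases, so double-check them against your versions):

import tensorflow as tf
import torch

# Keras Adam defaults: learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7.
keras_opt = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9,
                                     beta_2=0.999, epsilon=1e-7)

# PyTorch Adam defaults: lr=1e-3, betas=(0.9, 0.999), eps=1e-8.
# Set eps (and weight_decay) explicitly so both sides really match.
torch_model = torch.nn.Linear(10, 1)   # placeholder module
torch_opt = torch.optim.Adam(torch_model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-7,
                             weight_decay=0.0)

# Initialization differs too: Keras Conv/Dense layers default to glorot_uniform
# kernels with zero biases, while PyTorch uses a Kaiming-uniform scheme, so
# consider setting the initializers explicitly as well.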

Keras BatchNormalization, differing results in training and evaluation on the training dataset

I am training a CNN; for the sake of debugging my problem I am working on a small subset of the actual training data.
During training, the loss and accuracy seem very reasonable and pretty good. (In the example I used the same small subset for validation; the problem already shows up here.)
Fit on x_train and validate on x_train, using batch_size=32
Epoch 10/10
1/10 [==>...........................] - ETA: 2s - loss: 0.5126 - acc: 0.7778
2/10 [=====>........................] - ETA: 1s - loss: 0.3873 - acc: 0.8576
3/10 [========>.....................] - ETA: 1s - loss: 0.3447 - acc: 0.8634
4/10 [===========>..................] - ETA: 1s - loss: 0.3320 - acc: 0.8741
5/10 [==============>...............] - ETA: 0s - loss: 0.3291 - acc: 0.8868
6/10 [=================>............] - ETA: 0s - loss: 0.3485 - acc: 0.8848
7/10 [====================>.........] - ETA: 0s - loss: 0.3358 - acc: 0.8879
8/10 [=======================>......] - ETA: 0s - loss: 0.3315 - acc: 0.8863
9/10 [==========================>...] - ETA: 0s - loss: 0.3215 - acc: 0.8885
10/10 [==============================] - 3s - loss: 0.3106 - acc: 0.8863 - val_loss: 1.5021 - val_acc: 0.2707
However, when I evaluate on the same training dataset, the accuracy is really far off from what I saw during training (I would expect it to be at least as good as during training on the same dataset).
When evaluating straightforwardly or using
K.set_learning_phase(0)
I get, similar to the validation (Evaluating on x_train using batch_size=32):
Evaluation Accuracy: 0.266318537392, Loss: 1.50756853772
If I set the backend to the learning phase, the results get pretty good again, so the per-batch normalization seems to work well. I suspect that the accumulated mean and variance are not being used properly.
So after
K.set_learning_phase(1)
I get (Evaluating on x_train using batch_size=32):
Evaluation Accuracy: 0.887728457507, Loss: 0.335956037511
I added the BatchNormalization layer after the first convolutional layer like this:
model = models.Sequential()
model.add(Conv2D(80, first_conv_size, strides=2, activation="relu", input_shape=input_shape, padding=padding_name))
model.add(BatchNormalization(axis=-1))
model.add(MaxPooling2D(first_max_pool_size, strides=4, padding=padding_name))
...
Further down the line I also have some dropout layers, which I removed to investigate the BatchNormalization behavior. My intent is to use the model in the non-training phase for normal prediction.
Shouldn't it work like that, or am I missing some additional configuration?
Thanks!
I'm using Keras 2.0.8 with TensorFlow 1.1.0 (Anaconda).
This is really annoying. When you set the learning_phase to True, a BatchNormalization layer gets its normalization statistics straight from the data, which might be a problem when you have a small batch_size. I came across a similar issue some time ago, and here is my solution:
When building the model, add an option for whether it will predict in the learning or the non-learning phase, and in the learning-phase variant use the following class instead of BatchNormalization:
class NonTrainableBatchNormalization(BatchNormalization):
    """
    This class makes it possible to freeze batch normalization while Keras
    is in the training phase.
    """
    def call(self, inputs, training=None):
        return super(
            NonTrainableBatchNormalization, self).call(inputs, training=False)
Once you have trained your model, copy its weights into the NonTrainable copy:
learning_phase_model.set_weights(learned_model.get_weights())
Now you can fully enjoy using BatchNormalization in the learning phase.
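A small usage sketch, assuming the NonTrainableBatchNormalization class above is in scope; the layer sizes and shapes below are placeholders, not the asker's actual configuration:

from keras import models
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, BatchNormalization

def build_model(batch_norm_cls, input_shape=(64, 64, 3), n_classes=10):
    # input_shape and n_classes are placeholders; substitute your own values.
    model = models.Sequential()
    model.add(Conv2D(80, 3, strides=2, activation="relu",
                     input_shape=input_shape, padding="same"))
    model.add(batch_norm_cls(axis=-1))
    model.add(MaxPooling2D(2, strides=4, padding="same"))
    model.add(Flatten())
    model.add(Dense(n_classes, activation="softmax"))
    return model

# Train with the regular layer, predict with the frozen variant.
learned_model = build_model(BatchNormalization)
learning_phase_model = build_model(NonTrainableBatchNormalization)

# ... after learned_model.fit(...), copy the learned weights across:
learning_phase_model.set_weights(learned_model.get_weights())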

Are scalar_summary calls missing in skflow?

I'm having an issue getting the loss (training or monitoring) summary to show up in TensorBoard when using skflow.
This is my code:
classifier = skflow.TensorFlowEstimator( model_fn=conv_model, n_classes=2, batch_size=BATCH_SIZE, steps=100000, learning_rate=0.001, config=RunConfig(gpu_memory_fraction=0.9))
val_monitor = monitors.ValidationMonitor(X_val, y_val, n_classes=2, print_steps=100)
classifier.fit(X_train, y_train, val_monitor, logdir='my_model_1/')
classifier.save('my_model_1/')
Everything runs well:
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/io/data_feeder.py:281: VisibleDeprecationWarning: converting an array with ndim > 0 to an index will result in an error in the future
out.itemset((i, self.y[sample]), 1.0)
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 980
major: 5 minor: 2 memoryClockRate (GHz) 1.253
pciBusID 0000:03:00.0
Total memory: 4.00GiB
Free memory: 3.91GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:755] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980, pci bus id: 0000:03:00.0)
/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/io/data_feeder.py:370: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
out.itemset((i, y), 1.0)
Step #99, avg. train loss: 2.22587, avg. val loss: 2.14521
Step #199, avg. train loss: 0.82641, avg. val loss: 0.89103
Step #299, avg. train loss: 0.78344, avg. val loss: 0.85636
Step #399, avg. train loss: 0.76420, avg. val loss: 0.85675
Step #499, avg. train loss: 0.75868, avg. val loss: 0.84104
Step #599, avg. train loss: 0.75467, avg. val loss: 0.84945
Step #699, avg. train loss: 0.73990, avg. val loss: 0.91238
Step #799, avg. train loss: 0.73400, avg. val loss: 0.92720
Step #899, avg. train loss: 0.72879, avg. val loss: 0.91054
Step #999, avg. train loss: 0.73448, avg. val loss: 0.89823
Step #1099, avg. train loss: 0.70125, avg. val loss: 0.91640
Step #1199, avg. train loss: 0.71879, avg. val loss: 0.90597
Step #1299, avg. train loss: 0.70713, avg. val loss: 0.90736
Step #1399, avg. train loss: 0.70023, avg. val loss: 0.91414
Step #1499, avg. train loss: 0.69566, avg. val loss: 0.91007
Step #1599, avg. train loss: 0.68030, avg. val loss: 0.92729
Step #1699, avg. train loss: 0.68919, avg. val loss: 0.91168
Step #1799, avg. train loss: 0.67088, avg. val loss: 0.91744
Step #1899, avg. train loss: 0.68732, avg. val loss: 0.88844
Step #1999, avg. train loss: 0.67585, avg. val loss: 0.88854
It generates a .tfevents file of about 4.8 MB (attached).
When I connect to the machine using Chrome as the browser, I have data under the graphs/histograms tabs but nothing under events ("No scalar data was found").
Did I miss something to get the loss logged?
NB: I added
logging_ops.scalar_summary("model_loss", self._model_loss)
in learn/python/learn/estimators/base.py and the model_loss summary now appears in TensorBoard.
PS: I'm running on a GPU machine using the latest TensorFlow build.
Attached tfevents: my_model_1.zip
It was an issue in skflow that was corrected here,
and also for monitoring validation here.