Abnormal loss when training a model with tensorflow-gpu on an RTX 4090 under WSL - tensorflow

I recently bought a 4090 for deep learning. After installing the NVIDIA driver, I started running the program on WSL2 to train a model that I had previously trained successfully without a GPU.
However, with GPU acceleration the loss is abnormal:
363/363 [==============================] - 191s 19ms/step - loss: nan - categorical_accuracy: 0.0964 - val_loss: nan - val_categorical_accuracy: 0.0983
Epoch 2/10
363/363 [==============================] - 2s 6ms/step - loss: nan - categorical_accuracy: 0.0850 - val_loss: nan - val_categorical_accuracy: 0.0983
Epoch 3/10
363/363 [==============================] - 2s 5ms/step - loss: nan - categorical_accuracy: 0.0865 - val_loss: nan - val_categorical_accuracy: 0.0983
Epoch 4/10
363/363 [==============================] - 2s 5ms/step - loss: nan - categorical_accuracy: 0.0890 - val_loss: nan - val_categorical_accuracy: 0.0983
Epoch 5/10
363/363 [==============================] - 2s 5ms/step - loss: nan - categorical_accuracy: 0.0931 - val_loss: nan - val_categorical_accuracy: 0.0983
Epoch 6/10
363/363 [==============================] - 2s 5ms/step - loss: nan - categorical_accuracy: 0.0889 - val_loss: nan - val_categorical_accuracy: 0.0983
Epoch 7/10
363/363 [==============================] - 2s 5ms/step - loss: nan - categorical_accuracy: 0.0827 - val_loss: nan - val_categorical_accuracy: 0.0983
Epoch 8/10
363/363 [==============================] - 2s 6ms/step - loss: nan - categorical_accuracy: 0.0888 - val_loss: nan - val_categorical_accuracy: 0.0983
Epoch 9/10
363/363 [==============================] - 2s 5ms/step - loss: nan - categorical_accuracy: 0.0880 - val_loss: nan - val_categorical_accuracy: 0.0983
Epoch 10/10
363/363 [==============================] - 2s 6ms/step - loss: nan - categorical_accuracy: 0.0845 - val_loss: nan - val_categorical_accuracy: 0.0983
I also found the following errors and warnings:
E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:927] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
W tensorflow/stream_executor/gpu/asm_compiler.cc:63] Running ptxas --version returned 256
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
What I have tried:
reinstalled the NVIDIA driver
installed the CUDA toolkit
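Before digging further, it may help to confirm what TensorFlow actually sees. A minimal diagnostic sketch (assuming a TF 2.x install; the `ptxas` check relates to the "Running ptxas --version returned 256" warning above):

```python
import shutil
import tensorflow as tf

# Does TensorFlow actually see the 4090?
print(tf.config.list_physical_devices("GPU"))

# "Running ptxas --version returned 256" usually means ptxas is missing or
# broken; None here means the CUDA toolkit's bin/ directory is not on PATH.
print(shutil.which("ptxas"))

# Make TF raise on the first op that yields NaN/Inf instead of training on.
tf.debugging.enable_check_numerics()
```

With `enable_check_numerics()` active, the run should fail at the first op producing a NaN, which narrows down whether the problem is in the data, a layer, or the GPU compilation path.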

Related

Using TensorFlow-metal plugin, training stops after some time without any errors?

I've followed the steps provided by Apple, which use conda, to install TensorFlow and get the best out of the M1 Pro MacBook Pro. As the title says, training stops after some time without any errors. Please see the Keras training log below. This has happened many times. What could be the reason? Has anyone experienced the same, and if so, how can I get around it?
...
Epoch 38/50
625/625 [==============================] - 18s 29ms/step - loss: 1.6704 - acc: 0.4178 - val_loss: 1.8169 - val_acc: 0.4044
Epoch 39/50
625/625 [==============================] - 18s 29ms/step - loss: 1.6788 - acc: 0.4157 - val_loss: 1.6830 - val_acc: 0.4029
Epoch 40/50
625/625 [==============================] - 18s 28ms/step - loss: 1.6921 - acc: 0.4089 - val_loss: 1.7088 - val_acc: 0.4049
Epoch 41/50
625/625 [==============================] - 18s 28ms/step - loss: 1.6705 - acc: 0.4170 - val_loss: 1.6650 - val_acc: 0.4182
Epoch 42/50
625/625 [==============================] - 18s 29ms/step - loss: 1.6659 - acc: 0.4177 - val_loss: 1.9102 - val_acc: 0.3443
Epoch 43/50
625/625 [==============================] - 18s 29ms/step - loss: 1.6760 - acc: 0.4166 - val_loss: 1.6647 - val_acc: 0.4222
Epoch 44/50
532/625 [========================>.....] - ETA: 2s - loss: 1.6639 - acc: 0.4217

Lower model accuracies with tensorflow2.3+keras2.4 than tensorflow1.15+keras2.3

I created two anaconda environments, for tensorflow2x and tensorflow1x respectively. In tensorflow2x, tensorflow 2.3.2 and keras 2.4.3 (the latest) are installed, while in tensorflow1x, tensorflow-gpu 1.15 and keras 2.3.1 are installed. I then ran the toy example mnist_cnn.py. The tensorflow 2 environment gives much lower accuracy than the tensorflow 1 one.
Here below are the results:
# tensorflow2.3.2 + keras 2.4.3:
Epoch 1/12
60000/60000 [==============================] - 3s 54us/step - loss: 2.2795 - accuracy: 0.1270 - val_loss: 2.2287 - val_accuracy: 0.2883
Epoch 2/12
60000/60000 [==============================] - 3s 52us/step - loss: 2.2046 - accuracy: 0.2435 - val_loss: 2.1394 - val_accuracy: 0.5457
Epoch 3/12
60000/60000 [==============================] - 3s 52us/step - loss: 2.1133 - accuracy: 0.3636 - val_loss: 2.0215 - val_accuracy: 0.6608
Epoch 4/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.9932 - accuracy: 0.4560 - val_loss: 1.8693 - val_accuracy: 0.7147
Epoch 5/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.8430 - accuracy: 0.5239 - val_loss: 1.6797 - val_accuracy: 0.7518
Epoch 6/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.6710 - accuracy: 0.5720 - val_loss: 1.4724 - val_accuracy: 0.7755
Epoch 7/12
60000/60000 [==============================] - 3s 53us/step - loss: 1.5003 - accuracy: 0.6071 - val_loss: 1.2725 - val_accuracy: 0.7928
Epoch 8/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.3414 - accuracy: 0.6363 - val_loss: 1.0991 - val_accuracy: 0.8077
Epoch 9/12
60000/60000 [==============================] - 3s 53us/step - loss: 1.2129 - accuracy: 0.6604 - val_loss: 0.9603 - val_accuracy: 0.8169
Epoch 10/12
60000/60000 [==============================] - 3s 53us/step - loss: 1.1103 - accuracy: 0.6814 - val_loss: 0.8530 - val_accuracy: 0.8281
Epoch 11/12
60000/60000 [==============================] - 3s 52us/step - loss: 1.0237 - accuracy: 0.7021 - val_loss: 0.7689 - val_accuracy: 0.8350
Epoch 12/12
60000/60000 [==============================] - 3s 52us/step - loss: 0.9576 - accuracy: 0.7168 - val_loss: 0.7030 - val_accuracy: 0.8429
Test loss: 0.7029915698051452
Test accuracy: 0.8428999781608582
# tensorflow1.15.5 + keras2.3.1
60000/60000 [==============================] - 5s 84us/step - loss: 0.2631 - accuracy: 0.9198 - val_loss: 0.0546 - val_accuracy: 0.9826
Epoch 2/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0898 - accuracy: 0.9731 - val_loss: 0.0394 - val_accuracy: 0.9866
Epoch 3/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0674 - accuracy: 0.9799 - val_loss: 0.0341 - val_accuracy: 0.9881
Epoch 4/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0563 - accuracy: 0.9835 - val_loss: 0.0320 - val_accuracy: 0.9895
Epoch 5/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0465 - accuracy: 0.9859 - val_loss: 0.0343 - val_accuracy: 0.9889
Epoch 6/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0423 - accuracy: 0.9872 - val_loss: 0.0327 - val_accuracy: 0.9892
Epoch 7/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0387 - accuracy: 0.9882 - val_loss: 0.0279 - val_accuracy: 0.9907
Epoch 8/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0351 - accuracy: 0.9893 - val_loss: 0.0269 - val_accuracy: 0.9909
Epoch 9/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0330 - accuracy: 0.9902 - val_loss: 0.0311 - val_accuracy: 0.9895
Epoch 10/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0292 - accuracy: 0.9915 - val_loss: 0.0256 - val_accuracy: 0.9919
Epoch 11/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0293 - accuracy: 0.9911 - val_loss: 0.0276 - val_accuracy: 0.9911
Epoch 12/12
60000/60000 [==============================] - 4s 63us/step - loss: 0.0269 - accuracy: 0.9917 - val_loss: 0.0264 - val_accuracy: 0.9915
Test loss: 0.026934823030711867
Test accuracy: 0.9918000102043152
What caused the poor results for tensorflow 2.3.2 + keras 2.4.3? Is there a compatibility issue between tensorflow and keras here?
According to the author of Keras, users should switch their Keras code to tf.keras in TensorFlow 2.x. In the toy example above, using from tensorflow import keras in place of import keras also leads to lower accuracy. Does tf.keras give poorer accuracy than keras, or am I running the wrong toy example for TensorFlow 2.x?
Update:
I also noticed that if I downgrade tensorflow to version 2.2.1 (along with keras 2.3.1), the two environments produce about the same result. It seems there were some major changes from keras 2.3.1 to keras 2.4.0 (https://newreleases.io/project/github/keras-team/keras/release/2.4.0).
What are the main differences between keras 2.3.1 and keras 2.4.x?
Which versions of tensorflow are compatible with keras 2.4.x?
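One concrete difference worth checking (an assumption about the cause, since keras 2.4.0 became a thin redirect to tf.keras): mnist_cnn.py compiles with a bare Adadelta(), and the optimizer's default learning rate differs between standalone Keras 2.3 (lr=1.0) and tf.keras (learning_rate=0.001). A 1000x smaller step size would produce exactly the kind of slow learning curve shown above. Pinning the rate explicitly removes the ambiguity:

```python
from tensorflow import keras

# Standalone Keras 2.3 used Adadelta(lr=1.0); tf.keras defaults to
# learning_rate=0.001. Set it explicitly so both stacks train the same way.
opt = keras.optimizers.Adadelta(learning_rate=1.0, rho=0.95)
print(opt.learning_rate)
```

If the two environments then match, the gap was the optimizer default rather than a genuine accuracy regression.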

Training & validation accuracy increasing & training loss is decreasing - Validation Loss is NaN

I am training a classifier on the cats-vs-dogs data. The model is a minor variant of ResNet18 and returns softmax probabilities over the classes. However, the validation loss is mostly NaN, whereas the training loss decreases steadily and behaves as expected, and training and validation accuracy both increase epoch by epoch.
Epoch 1/15
312/312 [==============================] - 1372s 4s/step - loss: 0.7849 - accuracy: 0.5131 - val_loss: nan - val_accuracy: 0.5343
Epoch 2/15
312/312 [==============================] - 1372s 4s/step - loss: 0.6966 - accuracy: 0.5539 - val_loss: 13989871201999266517090304.0000 - val_accuracy: 0.5619
Epoch 3/15
312/312 [==============================] - 1373s 4s/step - loss: 0.6570 - accuracy: 0.6077 - val_loss: 747123703808.0000 - val_accuracy: 0.5679
Epoch 4/15
312/312 [==============================] - 1372s 4s/step - loss: 0.6180 - accuracy: 0.6483 - val_loss: nan - val_accuracy: 0.6747
Epoch 5/15
312/312 [==============================] - 1373s 4s/step - loss: 0.5838 - accuracy: 0.6852 - val_loss: nan - val_accuracy: 0.6240
Epoch 6/15
312/312 [==============================] - 1372s 4s/step - loss: 0.5338 - accuracy: 0.7301 - val_loss: 31236203781405710523301888.0000 - val_accuracy: 0.7590
Epoch 7/15
312/312 [==============================] - 1373s 4s/step - loss: 0.4872 - accuracy: 0.7646 - val_loss: 52170.8672 - val_accuracy: 0.7378
Epoch 8/15
312/312 [==============================] - 1372s 4s/step - loss: 0.4385 - accuracy: 0.7928 - val_loss: 2130819335420217655296.0000 - val_accuracy: 0.8101
Epoch 9/15
312/312 [==============================] - 1373s 4s/step - loss: 0.3966 - accuracy: 0.8206 - val_loss: 116842888.0000 - val_accuracy: 0.7857
Epoch 10/15
312/312 [==============================] - 1372s 4s/step - loss: 0.3643 - accuracy: 0.8391 - val_loss: nan - val_accuracy: 0.8199
Epoch 11/15
312/312 [==============================] - 1373s 4s/step - loss: 0.3285 - accuracy: 0.8557 - val_loss: 788904.2500 - val_accuracy: 0.8438
Epoch 12/15
312/312 [==============================] - 1372s 4s/step - loss: 0.3029 - accuracy: 0.8670 - val_loss: nan - val_accuracy: 0.8245
Epoch 13/15
312/312 [==============================] - 1373s 4s/step - loss: 0.2857 - accuracy: 0.8781 - val_loss: 121907.8594 - val_accuracy: 0.8444
Epoch 14/15
312/312 [==============================] - 1373s 4s/step - loss: 0.2585 - accuracy: 0.8891 - val_loss: nan - val_accuracy: 0.8674
Epoch 15/15
312/312 [==============================] - 1374s 4s/step - loss: 0.2430 - accuracy: 0.8965 - val_loss: 822.7968 - val_accuracy: 0.8776
I checked for the following:
Infinity/NaN in validation data
Infinity/NaN caused when normalizing data (using tf.keras.applications.resnet.preprocess_input)
If the model is predicting only one class & hence causing loss function to behave oddly
Training code for reference:
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-3)
model = Resnet18(NUM_CLASSES=NUM_CLASSES)  # variant of the original model
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
history = model.fit(
    train_dataset,
    steps_per_epoch=len(X_train) // BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=valid_dataset,
    validation_steps=len(X_valid) // BATCH_SIZE,
    verbose=1,
)
The most relevant answer I found was the last paragraph of the accepted answer here. However, that doesn't seem to be the case here, since the validation loss diverges from the training loss by orders of magnitude and returns NaN. The loss function seems to be misbehaving.
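When a model emits softmax probabilities and the loss takes their logarithm, a single confident-but-wrong prediction in the validation set is enough to make the loss inf or NaN, even while accuracy keeps improving. A minimal numpy sketch of that mechanism and the usual epsilon-clipping guard (an illustration of the failure mode, not the asker's model; Keras' own backend loss clips probabilities with a small epsilon for this reason):

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=0.0):
    # eps=0 -> no protection, i.e. a raw log of the predicted probability
    p = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(p), axis=-1)

y_true = np.array([[0.0, 1.0]])
y_pred = np.array([[1.0, 0.0]])   # probability exactly 0 for the true class

print(categorical_crossentropy(y_true, y_pred))        # [inf]
print(categorical_crossentropy(y_true, y_pred, 1e-7))  # large but finite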

Two different forms of model.compile give different results

What I have already tried:
The data pipeline and model-building function are correct.
Code:
The code is https://gist.github.com/MaoXianXin/cd398521546d967560942e702c243ba7
I want to know why these two forms of model.compile give different results.
See the accuracy:
model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=0.01), loss='categorical_crossentropy', metrics=['acc'])
Epoch 1/50
2019-05-27 13:53:20.280605: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10.0 locally
468/468 [==============================] - 6s 12ms/step - loss: 14.4332 - acc: 0.1045 - val_loss: 14.4601 - val_acc: 0.1029
Epoch 2/50
468/468 [==============================] - 3s 6ms/step - loss: 14.4354 - acc: 0.1044 - val_loss: 14.4763 - val_acc: 0.1023
Epoch 3/50
468/468 [==============================] - 3s 6ms/step - loss: 14.4359 - acc: 0.1044 - val_loss: 14.4714 - val_acc: 0.1026
Epoch 4/50
468/468 [==============================] - 3s 6ms/step - loss: 14.4359 - acc: 0.1044 - val_loss: 14.4682 - val_acc: 0.1028
model.compile('adam', 'categorical_crossentropy', metrics=['acc'])
Epoch 1/50
2019-05-27 13:51:16.122054: I tensorflow/stream_executor/dso_loader.cc:153] successfully opened CUDA library libcublas.so.10.0 locally
468/468 [==============================] - 5s 12ms/step - loss: 3.6567 - acc: 0.7388 - val_loss: 0.0732 - val_acc: 0.9791
Epoch 2/50
468/468 [==============================] - 3s 6ms/step - loss: 0.0812 - acc: 0.9760 - val_loss: 0.0449 - val_acc: 0.9854
Epoch 3/50
468/468 [==============================] - 3s 6ms/step - loss: 0.0533 - acc: 0.9836 - val_loss: 0.0428 - val_acc: 0.9869
Epoch 4/50
468/468 [==============================] - 3s 6ms/step - loss: 0.0426 - acc: 0.9871 - val_loss: 0.0446 - val_acc: 0.9872
Epoch 5/50
468/468 [==============================] - 3s 6ms/step - loss: 0.0376 - acc: 0.9886 - val_loss: 0.0449 - val_acc: 0.9867
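A likely factor (a hypothesis, not a confirmed diagnosis): the string 'adam' means Adam with its default learning rate of 1e-3, while the first call sets RMSprop's rate to 0.01, ten times RMSprop's own default of 1e-3, which can push training into divergence. Learning-rate sensitivity is easy to see even in a toy gradient descent:

```python
# Plain gradient descent on f(w) = 10 * w**2: the update is w <- (1 - 20*lr) * w,
# so any lr with |1 - 20*lr| > 1 diverges instead of converging. The same kind
# of sensitivity applies to RMSprop at lr=0.01 vs its default of 1e-3.
def sgd(lr, steps=50, w=1.0):
    for _ in range(steps):
        w -= lr * 20.0 * w   # gradient of 10*w**2 is 20*w
    return w

print(abs(sgd(0.001)) < 1.0)  # True: the iterate shrinks toward the minimum
print(abs(sgd(0.11)) > 1.0)   # True: the iterate overshoots and blows up
```

Retrying the first compile call with RMSprop at its default rate (1e-3) would show whether the gap is the optimizer itself or just the step size.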

Keras training cats vs dogs gives constant validation accuracy

I am following this Keras tutorial to train a cats-vs-dogs model on little data. I ran the exact code as given on GitHub, but the accuracy stays at 0.5 and never changes.
My keras version is 2.0.9.
Found 2000 images belonging to 2 classes.
Found 800 images belonging to 2 classes.
Epoch 1/50
125/125 [==============================] - 150s 1s/step - loss: 0.7777 - acc: 0.4975 - val_loss: 0.6931 - val_acc: 0.5000
Epoch 2/50
125/125 [==============================] - 158s 1s/step - loss: 0.6932 - acc: 0.5000 - val_loss: 0.6931 - val_acc: 0.5000
Epoch 3/50
125/125 [==============================] - 184s 1s/step - loss: 0.6932 - acc: 0.5000 - val_loss: 0.6931 - val_acc: 0.5000
Epoch 4/50
125/125 [==============================] - 203s 2s/step - loss: 0.6932 - acc: 0.4940 - val_loss: 0.6931 - val_acc: 0.5000
Epoch 5/50
3/125 [..............................] - ETA: 2:30 - loss: 0.6931 - acc: 0.5417
Does anyone know the reason behind this?
My data directory looks like this:
data/
train/
cats/
cat846.jpg
cat828.jpg
cat926.jpg
cat382.jpg
cat792.jpg
...
dogs/
dog533.jpg
dog850.jpg
dog994.jpg
dog626.jpg
dog974.jpg
...
validation/
cats/
cat1172.jpg
cat1396.jpg
cat1336.jpg
cat1347.jpg
cat1211.jpg
...
dogs/
dog1014.jpg
dog1211.jpg
dog1088.jpg
dog1207.jpg
dog1186.jpg
...
It seems to have something to do with the OS. The result above was from an Ubuntu 16.04 virtual machine. When I copied the code to Windows, it worked correctly. But why?
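One clue in the log above: a loss pinned at 0.6931 is exactly ln(2), which with binary crossentropy means the model outputs p = 0.5 for every image; that usually points at saturated units or an optimization problem rather than bad data, though the OS dependence remains unexplained. A quick check of that number:

```python
import math

# With binary crossentropy, a model that always outputs p = 0.5 gives
# loss = -log(0.5) = ln(2) ~= 0.6931 regardless of the true label --
# exactly the constant val_loss in the log above.
loss_when_predicting_half = -math.log(0.5)
print(round(loss_when_predicting_half, 4))  # 0.6931
```

So the model is not learning at all on the Ubuntu VM: every prediction is the "I don't know" value 0.5 from the first epoch onward.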