CUDA error: device-side assert triggered in Colab - google-colaboratory

I am training an EfficientDet v2 model on a dataset in COCO JSON format on Colab. The model configuration is:
gtf.Train_Dataset(root_dir, coco_dir, img_dir, set_dir, batch_size=8, image_size=512, use_gpu=True,num_workers=2)
gtf.Model();
gtf.Set_Hyperparams(lr=0.0001, val_interval=1, es_min_delta=0.0, es_patience=0)
%%time
gtf.Train(num_epochs=10, model_output_dir="trained/");
I am facing the following issue while training:
I tried adding this code and restarting the runtime, but I am facing the same issue.
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
Can anyone help me solve this?
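A minimal sketch of where the environment variable has to go, assuming the training library is built on PyTorch (the wording of the error suggests this, but the question does not confirm it). CUDA reads CUDA_LAUNCH_BLOCKING when it is initialized, so the variable should be set in the first cell after a runtime restart. The helper name enable_blocking_cuda_launches is hypothetical, and checking sys.modules for torch is only a conservative heuristic.

```python
import os
import sys


def enable_blocking_cuda_launches():
    """Set CUDA_LAUNCH_BLOCKING=1 and report whether it was set early enough.

    Returns True if PyTorch had not been imported yet, False otherwise.
    (Heuristic: an imported torch may already have initialized CUDA.)
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return "torch" not in sys.modules


set_in_time = enable_blocking_cuda_launches()
if not set_in_time:
    print("torch is already imported; restart the runtime and run this cell first")
```

With blocking launches enabled, the traceback should point at the operation that actually failed rather than at a later, unrelated call.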

Related

Extremely slow when saving model on Colab TPU

My problem is that saving a model is extremely slow in the Colab TPU environment.
I first encountered this issue when using a checkpoint callback, which caused training to get stuck at the end of the first epoch.
Then I tried removing the callback and saving the model with model.save_weights() instead, but nothing changed. Using the Colab terminal, I found that the saving speed is roughly 100k in 5 minutes.
TensorFlow version: 2.3
My code of model fitting is here:
with tpu_strategy.scope():  # creating the model in the TPUStrategy scope means we will train the model on the TPU
    model = create_model()
checkpoint = keras.callbacks.ModelCheckpoint('baseline_{epoch:03d}.h5',
                                             save_weights_only=True, save_freq="epoch")
hist = model.fit(get_train_ds().repeat(),
                 steps_per_epoch=100,
                 epochs=5,
                 verbose=1,
                 callbacks=[checkpoint])
model.save_weights("epoch-test.h5", overwrite=True)
I found that the issue happened because I had explicitly switched to graph mode by writing
from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()
before
with tpu_strategy.scope():
    model.fit(...)
Although I still don't understand the cause, removing the disable_eager_execution() call solved the issue.
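A quick way to confirm which mode the session is in, sketched under the assumption that TensorFlow 2.x is installed. tf.executing_eagerly() is a real TensorFlow call; the wrapper name eager_mode_status is hypothetical and only exists so the check degrades gracefully where TensorFlow is missing.

```python
def eager_mode_status():
    """Return True or False for eager execution, or None if TensorFlow is not installed."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    return tf.executing_eagerly()


status = eager_mode_status()
if status is False:
    print("Eager execution is disabled; look for a disable_eager_execution() call")
```

Running this before entering tpu_strategy.scope() shows whether some earlier cell has switched the session to graph mode.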

Python 3.8.8 Jupyter notebook kernel dies when I call model.fit() when I try to use my GPU

TensorFlow recognizes my GPU.
However, when I call model.fit() on my data, it shows
epoch(1/2) and then the kernel dies immediately.
If I run this in a separate virtual environment with no GPU, it works fine.
As a quick test, I simplified the model architecture and reduced the number of training points to only ten, and it still fails.
Simple example:
from numpy import loadtxt
import keras  # needed for keras.optimizers below
from keras.models import Sequential
from keras.layers import Dense
# X_train and y_train are assumed to be loaded elsewhere
model = Sequential()
model.add(Dense(4, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
opt = keras.optimizers.Adam(learning_rate=.001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
info = model.fit(X_train, y_train, epochs=2, batch_size=2, shuffle=True, verbose=1)
versions:
Python 3.8.8
Num GPUs Available 1
2.5.0-dev20210227
2.4.3
cuda v11.2
I am going to answer my own question rather than delete it, because someone else may be making the same simple mistake I was.
The main mistake I made was installing an incompatible CUDA version. You can check which versions are compatible at this link:
https://www.tensorflow.org/install/source#gpu
TLDR: Just follow this video:
https://www.youtube.com/watch?v=hHWkvEcDBO0
This also highlighted the importance of a virtual environment in which you control the package versions to prevent incompatibilities.
I had the same problem. I moved the code into a Python file and found the root cause. In my case, the cuDNN DLL files had to be copied into C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.6\bin. See also the following question:
Could not load dynamic library 'cudnn64_8.dll'; dlerror: cudnn64_8.dll not found
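A rough way to check whether the cuDNN DLL is reachable, assuming the CUDA bin directory is expected to be on PATH as in the usual Windows setup. The helper name dirs_containing is hypothetical; cudnn64_8.dll is the file named in the linked error message.

```python
import os


def dirs_containing(filename, search_path=None):
    """Return the directories on search_path that contain filename."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    hits = []
    for directory in search_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


# An empty list means the DLL is not in any directory on PATH.
print(dirs_containing("cudnn64_8.dll"))
```

If the list is empty, the DLL either was never copied or sits in a directory that is not on PATH.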

Getting KeyError : 'callable_inputs' when trying to save a TF model in S3 bucket

I'm using sagemaker 2.5.1 and tensorflow 2.3.0.
The weird part is that the same code worked before; the only change I can think of is the new releases of the two libraries.
This appears to be a bug with SageMaker.
I'm assuming you are using a TensorFlow estimator to train the model. Something like this:
estimator = TensorFlow(
    entry_point='script.py',
    role=role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    framework_version='2.3.0',
    py_version='py37',
    script_mode=True,
    hyperparameters={
        'epochs': 100,
        'batch-size': 256,
        'learning-rate': 0.001
    }
)
If that's the case, either TensorFlow 2.2 or TensorFlow 2.3 is causing this error when debugger callbacks are enabled. To fix the issue, you can set debugger_hook_config to False:
estimator = TensorFlow(
    entry_point='script.py',
    role=role,
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    framework_version='2.3.0',
    py_version='py37',
    script_mode=True,
    debugger_hook_config=False,
    hyperparameters={
        'epochs': 100,
        'batch-size': 256,
        'learning-rate': 0.001
    }
)
The problem actually comes from smdebug version 0.9.1.
Downgrading to 0.8.1 solves the issue.
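To see which smdebug version is installed before downgrading, something like the following could be used. This is a sketch assuming Python 3.8 or newer, where importlib.metadata is in the standard library; on the py37 image used above, the importlib_metadata backport would be needed instead. The helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("smdebug")
if version == "0.9.1":
    print("smdebug 0.9.1 detected; consider: pip install smdebug==0.8.1")
else:
    print("smdebug version:", version)
```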

Train object detection model on Google Colab with TPU

I'm training an object detection model by following the guide here https://towardsdatascience.com/creating-your-own-object-detector-ad69dda69c85
On Google Colab I am able to execute the following, and it makes use of the GPU:
python train.py --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_0.75_depth_quantized_300x300_coco14_sync.config
I would now like to train using the TPU, but this obviously does not work out of the box. Running train.py is slow and appears to use the CPU only. How can I achieve this?
When using a TPU in Google Colab, the code below can be used to check that the TPU devices are properly recognized in the environment:
import os
import pprint
import tensorflow as tf
if 'COLAB_TPU_ADDR' not in os.environ:
    print('ERROR: Not connected to a TPU runtime; please see the first cell in this notebook for instructions!')
else:
    tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
    print('TPU address is', tpu_address)
    with tf.Session(tpu_address) as session:
        devices = session.list_devices()
    print('TPU devices:')
    pprint.pprint(devices)
This should output a list of the 8 TPU devices available in the Colab environment.
To run a tf.keras model on a TPU, we have to convert it to a TPU model using the tf.contrib.tpu.keras_to_tpu_model function (TensorFlow 1.x).
This can be done with the code below:
# This address identifies the TPU we'll use when configuring TensorFlow.
TPU_WORKER = 'grpc://' + os.environ['COLAB_TPU_ADDR']
tf.logging.set_verbosity(tf.logging.INFO)
resnet_model = tf.contrib.tpu.keras_to_tpu_model(
    resnet_model,
    strategy=tf.contrib.tpu.TPUDistributionStrategy(
        tf.contrib.cluster_resolver.TPUClusterResolver(TPU_WORKER)))
For more information, please refer to this Medium Link and this Link.
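The snippets above use TensorFlow 1.x APIs (tf.Session, tf.contrib), which are not available in TensorFlow 2.x. A rough TensorFlow 2.x counterpart might look like the sketch below. It assumes TensorFlow 2.3 or newer and a Colab runtime that still exposes COLAB_TPU_ADDR. The helper names colab_tpu_address and build_tpu_strategy are hypothetical, and this covers tf.keras models only, not the train.py script from the question.

```python
import os


def colab_tpu_address(environ=os.environ):
    """Return the gRPC address of the Colab TPU, or None if no TPU runtime is attached."""
    addr = environ.get("COLAB_TPU_ADDR")
    return "grpc://" + addr if addr else None


def build_tpu_strategy(tpu_address):
    """Connect to the TPU and return a tf.distribute.TPUStrategy (TensorFlow 2.3+ assumed)."""
    import tensorflow as tf  # imported lazily so the helper above works without TensorFlow

    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_address)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    return tf.distribute.TPUStrategy(resolver)


# Usage on a TPU runtime (not executed here):
#   strategy = build_tpu_strategy(colab_tpu_address())
#   with strategy.scope():
#       model = ...  # build and compile the tf.keras model inside the scope
```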

TensorFlow Lite: Init node doesn't exist

I was trying to convert a model in a Keras file (.h5) to a TensorFlow Lite file (.tflite) using the following code:
# Save model as .h5 keras file
keras_file = "eSleep.h5"
model_save = tf.keras.models.save_model(model,keras_file,overwrite=True,include_optimizer=True)
# Export keras file to TensorFlow Lite model
converter = tf.lite.TFLiteConverter.from_keras_model_file(keras_file)
tflite_model = converter.convert()
open("eSleep.tflite", "wb").write(tflite_model)
However, the following line:
tflite_model = converter.convert()
returned errors:
I tensorflow/core/grappler/devices.cc:53] Number of eligible GPUs (core count >= 8): 0 (Note: TensorFlow was not compiled with CUDA support)
I tensorflow/core/grappler/clusters/single_machine.cc:359] Starting new session
E tensorflow/core/grappler/grappler_item_builder.cc:636] Init node dense/kernel/Assign doesn't exist in graph
Can anybody help me understand what "Init node dense/kernel/Assign doesn't exist in graph" means and how to fix the error?
In my experience the converted model should work fine, even though this error is shown. You can ignore the error.
I solved the problem by using TensorFlow 1.12.