Google Colab - Your session crashed for an unknown reason - tensorflow

I receive the message "Your session crashed for an unknown reason" when I run the following cell in Google Colab:
from keras import backend as K

if 'tensorflow' == K.backend():
    import tensorflow as tf
    from keras.backend.tensorflow_backend import set_session
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = "0"
    set_session(tf.Session(config=config))
I have been receiving this message ever since I uploaded two data sets to Google Drive.
Does anyone recognize this message and can give me some advice?
Many thanks for every hint.
Update:
I have removed the data sets from Google Drive, but the session is still crashing and I always receive the same message.

Google Colab is crashing because you are trying to run GPU-related code while the runtime is set to CPU.
The execution succeeds if you change the runtime to GPU. The steps are:
Runtime -> Change runtime type -> GPU (select from the dropdown).
Please find the working code in the GitHub Gist.
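The linked Gist is not reproduced here. Note that the snippet in the question uses the TF 1.x session API; on newer Colab runtimes that ship TensorFlow 2.x, roughly equivalent settings can be applied with tf.config. A minimal sketch, assuming the runtime is already set to GPU:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[0], 'GPU')             # like visible_device_list = "0"
    tf.config.experimental.set_memory_growth(gpus[0], True)   # like allow_growth = True
print("GPUs visible:", gpus)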

Just a side note: sometimes you may want to reinstall a slightly older version of the related module (check the error log to see which one). That worked for me in one case.
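For example, in Colab a version can be pinned with pip before restarting the runtime; the version number below is purely illustrative, not a known-good one:

!pip install tensorflow==2.8.0   # illustrative version; pick the one suggested by the error log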

This error happens when the expected device and the actual device are different.
For example, if you run code written with torch_xla, which is meant for TPU training, on a GPU (CUDA) runtime, then Colab will return this error.
It is tricky because it does not give you an actual debugging message, which makes it hard to find the real problem.
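One way to avoid the mismatch is to select the device based on what the runtime actually provides. A minimal sketch, assuming torch is installed and torch_xla is only importable on TPU runtimes:

import torch

try:
    # torch_xla is typically only available on TPU runtimes
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()
except ImportError:
    # otherwise fall back to CUDA if present, or CPU
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("Using device:", device)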

Related

I'm having a minor problem running Tensorflow with Colab

I am a beginner who just learned about TensorFlow using Google Colab.
As shown in the attached image, lines 10 to 13 (the tensorflow.keras imports) are underlined. What is the problem?
It is probably just an indicator flagging a possible typo, because the code runs without any problem.
This underline warning was an ongoing issue in Google Colab earlier, which is resolved now. Please try replicating the same code from your screenshot in Google Colab again and let us know if the issue still persists at your end.
Please check the code below (I replicated the same in Google Colab with Python 3.8.10 and TensorFlow 2.9.2):
from keras.preprocessing import Sequence shows an error because it is not the proper path for importing the Sequence API; the correct import uses the tensorflow.keras.preprocessing prefix or tensorflow.keras.utils, as in the sketch below.
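The snippet referenced in the answer is not reproduced here; a minimal sketch of the corrected import, assuming the Sequence base class is what is being imported:

# Fails in recent TF/Keras versions:
# from keras.preprocessing import Sequence

# Works, since Sequence is exposed under tensorflow.keras.utils:
from tensorflow.keras.utils import Sequence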

How to disable GPUs in H2O AutoML

When I run an experiment with H2O AutoML, I get the error: "terminate called after throwing an instance of 'thrust::system::system_error' what(): parallel_for failed: invalid resource handle". This error message comes from XGBoost and is caused by exceeding the GPU limit.
When I use regular XGBoost, I set the CUDA visible devices variable to blank to disable GPUs. However, this setting seems to be ignored by the XGBoost implementation inside H2O AutoML.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide all GPUs from CUDA-aware libraries
Currently, XGBoost is the only algorithm that runs on GPU in H2O AutoML.
The question is: does anybody know how to disable GPUs in H2O AutoML?
As a workaround, I have excluded the XGBoost algorithm from my experiment for now. The trouble goes away when I exclude XGBoost, but I do not want to give up its power.
from h2o.automl import H2OAutoML

# workaround: leave XGBoost out entirely
model = H2OAutoML(max_runtime_secs=60 * 60 * 2, exclude_algos=["XGBoost"])
That's definitely an oversight and we will need to add the ability to turn on/off and/or specify the GPU. I opened a ticket for this. I wonder if there's a way to temporarily disable the GPU at the system level (outside of H2O/Python) in the meantime? Thanks for the report!
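One system-level idea worth trying, as an untested sketch: set CUDA_VISIBLE_DEVICES before h2o.init(), so the H2O backend JVM started as a child process inherits the empty GPU list. Whether H2O's native XGBoost honors this is not guaranteed.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # must be set before the H2O JVM starts

import h2o
from h2o.automl import H2OAutoML

h2o.init()  # the backend JVM launched here inherits the environment above
model = H2OAutoML(max_runtime_secs=60 * 60 * 2)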

Keras showing multiple processes for one training

Whenever I launch a training run with Keras, I end up with dozens of processes, as you can see in this htop screenshot. Is that normal?
Could that be the reason why I am experiencing memory issues? The cache fills up as the training goes on, then swap is activated, and after some hours the machine needs to be restarted.
The training is done on a single GPU, using fit_generator:
training_model.fit_generator(
    generator=train_generator,
    steps_per_epoch=config["steps"],
    epochs=config["epochs"],
    verbose=1,
    callbacks=callbacks
)
Keras 2.2.4
tensorflow-gpu 1.13.1
CUDA 10.0
Thanks for your help!
From the information you have provided, this seems to be a case of running out of memory.
Please confirm whether you are actually using the GPU. You can check it by running:
import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
Also, if you use a GPU, the GPU processes are monitored with nvtop, not with htop.
The best way to identify which operation in your project is consuming the most memory and time is to use the TensorFlow Profiler.
You can also consider upgrading your TensorFlow version to 1.15 or 2.x if possible, where model.fit does the job of model.fit_generator too.
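For reference, a minimal sketch of the equivalent call after such an upgrade, reusing the names from the question (training_model, train_generator, config, callbacks) and assuming TF 2.x, where fit accepts generators directly:

training_model.fit(
    train_generator,                  # generators / Sequence objects are accepted directly by fit
    steps_per_epoch=config["steps"],
    epochs=config["epochs"],
    verbose=1,
    callbacks=callbacks
)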
Code to use the Profiler is shown below:
from datetime import datetime
import tensorflow as tf

# Create a TensorBoard callback that profiles batches 500 to 520
logs = "logs/" + datetime.now().strftime("%Y%m%d-%H%M%S")
tboard_callback = tf.keras.callbacks.TensorBoard(log_dir=logs,
                                                 histogram_freq=1,
                                                 profile_batch='500,520')

model.fit(ds_train,
          epochs=2,
          validation_data=ds_test,
          callbacks=[tboard_callback])
High-level details of the execution can be viewed on the Overview page of the Profiler, which appears in the Profile tab of TensorBoard.
In the screenshot from my run, the Overview page shows that input processing takes most of the execution time (83%).
It also provides recommendations to reduce memory consumption and therefore optimize execution.
Other options in the Tools dropdown, such as the Trace Viewer, provide useful information such as execution details at the operation level.
For more information, please refer to the Profiler Tutorial, the Profiler Guide, and the YouTube video.
Hope this helps. Happy Learning!

Tensorflow object detection API not displaying global steps

I am new here. I recently started working with object detection and decided to use the TensorFlow object detection API. But when I start training the model, it does not display the global step like it should, although it is still training in the background.
Details:
I am training on a server and accessing it over OpenSSH from Windows. I trained on a custom dataset that I built by collecting pictures and labeling them, and I trained it using model_main.py. Until a couple of months ago the API was a little different, and only recently was it changed to the latest version; for instance, it used train.py for training instead of model_main.py. All the online tutorials I can find use train.py, so it might be a problem with the latest commit, but I can't find anyone else facing this problem.
Thanks in advance!
Add tf.logging.set_verbosity(tf.logging.INFO) after the import section of the model_main.py script. It will display a summary after every 100th step.
As Thommy257 suggested, adding tf.logging.set_verbosity(tf.logging.INFO) after the import section of model_main.py prints the summary after every 100 steps by default.
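For clarity, a sketch of where that line goes in model_main.py, using the TF 1.x logging API:

# model_main.py, right after the existing imports
import tensorflow as tf

tf.logging.set_verbosity(tf.logging.INFO)  # log loss and global step every 100 steps by default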
Further, to specify the frequency of the summary, change
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir)
to
config = tf.estimator.RunConfig(model_dir=FLAGS.model_dir, log_step_count_steps=k)
so that it prints after every k steps.
Regarding the recent change to model_main, the previous version is available in the "legacy" folder. I use train.py and eval.py from this legacy folder with the same functionality as before.

Distributed Tensorflow: check failed: size>=0

I'm using Keras 2.0.6. The version of TensorFlow is 1.3.0.
My code runs with the Theano backend, but fails with the TensorFlow backend:
F tensorflow/core/framework/tensor_shape.cc:241] Check failed: size >= 0 (-14428307456 vs. 0)
I was wondering if anyone can think of any possible reason that might cause this.
Thank you!
----UPDATE-----
I tested exactly the same code on my PC with TensorFlow. It runs perfectly.
However, it throws this error when I run it on a supercomputer.
Although this error looks like an overflow, there is no way it would not overflow on my PC but would overflow on a supercomputer.
I suspect it comes from a bug in TensorFlow's distributed computation.
I came across the same bug, but TensorFlow ran fine after I shrank the batch size.
I think the reason is the GPU running out of memory.
I had the same error; in my case it came from using different TF versions, and it is now solved.
The model was trained in TF 1.15 but frozen in TF 1.13. When I froze it in TF 1.15, everything was fine.
I think you should check the model version.
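A trivial check, run in both the training environment and the freezing environment, to confirm they use the same TF release:

import tensorflow as tf

# Print the version in the environment used for training and again in the
# environment used for freezing; the two should match (e.g. both '1.15.x').
print(tf.__version__)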