I'm stuck in a frustrating loop trying to use the TensorFlow Estimator API.
When I try to restart my GPU instance, my notebook hangs on initialization.
If I exit the notebook, or switch the runtime to CPU and back to GPU again, and then try to connect to my instance, it says the instance is busy.
If I switch my runtime to no GPU and restart, the runtime initializes fine, but if I then try to reset the runtime to GPU, the notebook again says it is busy running what I assume to be a hung GPU task.
So restarting the runtime, exiting the notebook, and switching the runtime to CPU and back to GPU do not seem to help with freeing/restarting the GPU backend.
Is there anything else I can try?
To reset your backend, select the command 'Reset all runtimes...' from the Runtime menu.
I thought Colab Pro+ would allow me to run on a GPU for longer than 24 hours, but the VM is getting killed right at the 24-hour mark, even though the (Chrome) browser is open and a Python program has been running the whole time using the GPU. I tried running in background mode, and in that case it killed the VM in about an hour. What am I missing?
When training a model on a GPU machine, the training gets interrupted by some system patch process. Since Google Cloud GPU machines do not have a live-migration option, it is a painful task to restart the training every time this happens. Google has clearly stated in this doc that there is no way around this other than restarting the machines.
Is there a clever way to detect whether the machine has been rebooted and resume the training automatically?
Sometimes it also happens that, due to a kernel update, the CUDA drivers stop working, the GPU is not visible, and the CUDA drivers need a re-installation. So writing a startup script to resume the training is not a bulletproof solution either.
Yes, there is. If you use TensorFlow, you can use its checkpointing feature to save your progress and pick up where you left off.
One great example of this is provided here: https://github.com/GoogleCloudPlatform/ml-on-gcp/blob/master/gce/survival-training/README-tf-estimator.md
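Roughly, the idea looks like this (a minimal sketch using tf.train.Checkpoint and tf.train.CheckpointManager; the model, optimizer, and checkpoint directory below are just placeholders, and the linked README shows the tf.estimator flavor, which checkpoints automatically via model_dir):

import tensorflow as tf

# Placeholder model/optimizer; replace with your own training objects.
model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()
step = tf.Variable(0, dtype=tf.int64)

# Bundle everything whose state must survive a reboot into one checkpoint.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, step=step)
manager = tf.train.CheckpointManager(ckpt, directory='/tmp/train_ckpts', max_to_keep=3)

# On startup, restore the latest checkpoint if one exists (no-op on the first run).
ckpt.restore(manager.latest_checkpoint)

for _ in range(1000):
    # ... run one training step on your data here ...
    step.assign_add(1)
    if int(step) % 100 == 0:
        manager.save(checkpoint_number=int(step))

If the VM reboots (or CUDA needs reinstalling), a startup script that simply relaunches this training script will resume from the last saved step instead of starting over.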
Our GPUs are in exclusive mode. Sometimes a user may manually log in to a machine and steal a GPU.
How can I raise an exception whenever GPU initialization fails in a TensorFlow script? I noticed that when TensorFlow is unable to initialize the GPU, it prints an error message but runs on the CPU anyway. I want it to stop instead of running on the CPU.
If you force any part of your graph to run on a GPU using:
with tf.device('/device:GPU:0'):
then your session's variable initializer will stop and throw an InvalidArgumentError when no GPU is available.
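For example, something like this (a TF1-style sketch using tf.compat.v1; with allow_soft_placement disabled, the hard GPU placement makes initialization fail instead of silently moving to the CPU):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Pin a variable to the GPU; with soft placement disabled, TensorFlow
# will not silently fall back to the CPU.
with tf.device('/device:GPU:0'):
    v = tf.get_variable('v', shape=[1], initializer=tf.zeros_initializer())

config = tf.ConfigProto(allow_soft_placement=False)
with tf.Session(config=config) as sess:
    try:
        sess.run(tf.global_variables_initializer())
    except tf.errors.InvalidArgumentError as e:
        # Raised when the requested GPU device does not exist.
        raise RuntimeError('GPU initialization failed: %s' % e)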
I am using Windows 7. After I tested my GPU in TensorFlow, which was awkwardly slow on a model I had already tested on the CPU, I switched to the CPU with:
tf.device("/cpu:0")
I was assuming that I could switch back to the GPU with:
tf.device("/gpu:0")
However, I got the following error message from Windows when I tried to rerun with this configuration:
The device "NVIDIA Quadro M2000M" is not exchange device and can not be removed.
With "nvida-smi" i looked for my GPU, but the system said the GPU is not there.
I restarted my laptop, tested if the GPU is there with "nvida-smi" and the GPU was recogniced.
I imported tensorflow again and started my model again, however the same error message pops up and my GPU vanished.
Is there something wrong with the configuration in one of the tensorflow configuration files? Or Keras files? What can i change to get this work again? Do you know why the GPU is so much slower that the 8 CPUs?
Solution: Reinstalling tensorflow-gpu worked for me.
However, there is still the question of why that happened and how I can switch between GPU and CPU. I don't want to use a second virtual environment.
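On switching between CPU and GPU in one environment: as far as I understand, a bare tf.device("/cpu:0") call does nothing by itself; it has to be used as a context manager around the ops you want to place. Roughly like this (a TF 2.x-style sketch, assuming a working tensorflow-gpu install):

import tensorflow as tf

# Place ops explicitly by wrapping them in a tf.device context.
with tf.device("/cpu:0"):
    a = tf.random.uniform([1000, 1000])
    b = tf.random.uniform([1000, 1000])
    c_cpu = tf.matmul(a, b)   # runs on the CPU

with tf.device("/gpu:0"):
    c_gpu = tf.matmul(a, b)   # runs on the GPU if one is visible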
I'm running a Python script using GPU-enabled Tensorflow. However, the program doesn't seem to recognize any GPU and starts using CPU straight away. What could be the cause of this?
Just want to add to the discussion that TensorFlow may stop seeing a GPU due to a CUDA initialization failure; in other words, TensorFlow detects the GPU but can't dispatch any op onto it, so it falls back to the CPU. In this case, you should see an error like this in the log:
E tensorflow/stream_executor/cuda/cuda_driver.cc:481] failed call to cuInit: CUDA_ERROR_UNKNOWN
The cause is likely a conflict between different processes using the GPU simultaneously. When that is the case, the most reliable way I have found to get TensorFlow working again is to restart the machine. In the worst case, reinstall TensorFlow and/or the NVIDIA driver.
See also one more case where the GPU suddenly stops working.
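As a quick sanity check, you can also make the script fail fast instead of silently falling back to the CPU (a TF 2.x sketch):

import tensorflow as tf

# Raise immediately if no GPU is visible to TensorFlow.
gpus = tf.config.list_physical_devices('GPU')
if not gpus:
    raise RuntimeError('No GPU visible to TensorFlow; check nvidia-smi, the CUDA driver, and whether another process is holding the device.')
print('Visible GPUs:', gpus)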