I'm running a Python script using GPU-enabled TensorFlow. However, the program doesn't seem to recognize any GPU and starts using the CPU straight away. What could be the cause of this?
I just want to add to the discussion that TensorFlow may stop seeing a GPU due to a CUDA initialization failure. In other words, TensorFlow detects the GPU but can't dispatch any op onto it, so it falls back to the CPU. In this case, you should see an error in the log like this:
E tensorflow/stream_executor/cuda/cuda_driver.cc:481] failed call to cuInit: CUDA_ERROR_UNKNOWN
The most likely cause is a conflict between different processes using the GPU simultaneously. When that happens, the most reliable fix I have found is to restart the machine; in the worst case, reinstall TensorFlow and/or the NVIDIA driver.
See also one more case when GPU suddenly stops working.
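If you run training under a wrapper script, one practical option is to scan the captured log for that cuInit failure and only then trigger a restart. This is a minimal sketch of my own (the helper name and log-capture mechanism are assumptions, not part of TensorFlow); the error string itself is the one shown above.

```python
# Hypothetical helper: detect the CUDA initialization failure shown above
# in captured TensorFlow stderr, e.g. to decide whether a restart is needed.
def gpu_init_failed(log_text):
    """Return True if the log contains a cuInit (CUDA initialization) failure."""
    return "failed call to cuInit" in log_text

# Usage: feed in stderr captured from the training process.
log = ("E tensorflow/stream_executor/cuda/cuda_driver.cc:481] "
       "failed call to cuInit: CUDA_ERROR_UNKNOWN")
print(gpu_init_failed(log))                        # True
print(gpu_init_failed("step 100, loss 0.31"))      # False
```

A wrapper could poll this on the log file and reboot (or page you) instead of letting the job silently crawl along on the CPU.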
Related
When training a model on a GPU machine, it gets interrupted due to some system patch process. Since Google Cloud GPU machines do not have a live-migration option, it is a painful task to restart the training every time this happens. Google has clearly stated that there is no way around this other than restarting the machine, in this doc.
Is there a clever way to detect whether the machine has been rebooted and resume the training automatically?
Sometimes it also happens that a kernel update breaks the CUDA drivers: the GPU is no longer visible and the drivers need to be reinstalled. So writing a startup script to resume the training is not a bulletproof solution either.
Yes, there is. If you use TensorFlow, you can use its checkpointing feature to save your progress and pick up where you left off.
One great example of this is provided here: https://github.com/GoogleCloudPlatform/ml-on-gcp/blob/master/gce/survival-training/README-tf-estimator.md
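The resume-on-restart idea can be sketched in plain Python (the file name and layout here are my own illustration, not from the linked guide): persist the training step to a checkpoint file, and on every start, resume from whatever was last saved. `tf.estimator` does the equivalent automatically when you point it at a fixed `model_dir`.

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint path

def load_step():
    """Resume from the last saved step, or start at 0 on a fresh run."""
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_step(step):
    """Persist progress so a reboot loses at most one checkpoint interval."""
    with open(CKPT, "w") as f:
        json.dump({"step": step}, f)

def train(total_steps):
    step = load_step()          # picks up wherever the last run left off
    while step < total_steps:
        step += 1               # stand-in for one real training step
        if step % 10 == 0:      # checkpoint periodically, not every step
            save_step(step)
    save_step(step)
    return step
```

Run `train()` from a startup script and a reboot costs you at most one checkpoint interval of work instead of the whole run.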
I'm stuck in a frustrating loop trying to use the TensorFlow Estimator API.
When I try to restart my GPU instance, my notebook hangs on initialization.
If I exit the notebook, or switch the runtime to CPU and back to GPU again, and try to connect to my instance, it says the instance is busy.
If I switch my runtime to no GPU and restart, the runtime initializes fine, but if I then try to reset the runtime to GPU, the notebook again says it is busy, running what I assume to be a hanging GPU task.
So restarting the runtime, exiting the notebook, and switching the runtime to CPU and back to GPU do not seem to help with freeing or restarting the GPU backend.
Is there anything else I can try?
To reset your backend, select the command 'Reset all runtimes...' from the Runtime menu.
Our GPUs are in exclusive mode. Sometimes a user manually logs in to a machine and steals a GPU.
How can I raise an exception whenever GPU initialization fails in a TensorFlow script? I noticed that when TensorFlow is unable to initialize the GPU, it prints an error message but runs on the CPU anyway. I want it to stop instead of running on the CPU.
If you force any part of your graph to run on a GPU using:
with tf.device('/device:GPU:0'):
then initializing your session variables will fail and throw an InvalidArgumentError when no GPU is available.
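A complementary fail-fast pattern is to query device visibility up front and raise yourself, instead of waiting for an op to fail. A minimal sketch (the device query is left to you as an assumption; in TF 2.x it could come from `tf.config.list_physical_devices('GPU')`, in TF 1.x from `tf.test.is_gpu_available()`):

```python
def require_gpu(visible_gpus):
    """Abort instead of silently falling back to the CPU.

    `visible_gpus` is whatever your device query returns, e.g. the list
    from tf.config.list_physical_devices('GPU') -- passing it in is this
    sketch's assumption, not a TensorFlow API.
    """
    if not visible_gpus:
        raise RuntimeError("No GPU visible to TensorFlow; refusing to run on CPU.")
    return visible_gpus
```

Calling this at the top of the script turns the silent CPU fallback into a hard error at startup, which is usually what you want on a paid GPU machine.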
I have previously asked whether it is possible to run TensorFlow with GPU support on a CPU-only machine. I was told it is possible, and was shown the basic code for switching devices, but not how to get the initial code working on a computer that doesn't have a GPU at all. For example, I would like to train on a computer that has an NVIDIA GPU but program on a laptop that only has a CPU. How would I go about doing this? I have tried writing the code as normal, but it crashes before I can even switch devices. I am using Python on Linux.
This thread might be helpful: Tensorflow: ImportError: libcusolver.so.8.0: cannot open shared object file: No such file or directory
I've tried importing tensorflow with tensorflow-gpu installed on my university's HPC login node, which does not have GPUs, and it works fine. I don't have an NVIDIA GPU in my laptop, so I never went through the installation process, but I think the cause is that it cannot find the relevant CUDA and cuDNN libraries.
But why don't you just use the CPU version? As #Finbarr Timbers mentioned, you can still run the model on a computer with a GPU.
What errors are you getting? It is very possible to train on a GPU but develop on a CPU; many people do it, including myself. In fact, TensorFlow will automatically place your code on a GPU if one is available.
If you add the following code to your model, you can see which devices are being used:
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
This should change when you run your model on a computer with a GPU.
I am using Windows 7. After I tested my GPU in TensorFlow, which was awkwardly slow on a model already tested on the CPU, I switched to the CPU with:
tf.device("/cpu:0")
I was assuming that I could switch back to the GPU with:
tf.device("/gpu:0")
However, I got the following error message from Windows when I tried to rerun with this configuration:
The device "NVIDIA Quadro M2000M" is not a removable device and cannot be removed.
With "nvidia-smi" I looked for my GPU, but the system said the GPU was not there.
I restarted my laptop, checked with "nvidia-smi" whether the GPU was there, and it was recognized.
I imported TensorFlow again and started my model again; however, the same error message popped up and my GPU vanished.
Is there something wrong with the configuration in one of the TensorFlow configuration files? Or the Keras files? What can I change to get this working again? Do you know why the GPU is so much slower than the 8 CPU cores?
Solution: Reinstalling tensorflow-gpu worked for me.
However, the question remains why this happened, and how I can switch between GPU and CPU. I don't want to use a second virtual environment.
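On the switching question: `tf.device` only affects ops created inside its scope, so it must be used as a context manager (`with tf.device("/gpu:0"): ...`); calling it bare, as in the question, creates the context object and discards it without doing anything. Here is a toy pure-Python analogue of that scoping behavior (my own illustration, not TensorFlow's actual implementation):

```python
from contextlib import contextmanager

_device_stack = ["/cpu:0"]  # default placement at the bottom of the stack

@contextmanager
def device(name):
    """Scoped device placement, loosely modeled on tf.device."""
    _device_stack.append(name)
    try:
        yield
    finally:
        _device_stack.pop()  # placement reverts when the block exits

def current_device():
    """Device that ops 'created' right now would be placed on."""
    return _device_stack[-1]

# Ops created inside the scope land on that device; outside, placement reverts.
with device("/gpu:0"):
    print(current_device())  # /gpu:0
print(current_device())      # /cpu:0
```

The practical consequence: you don't "switch" the process between GPU and CPU; you wrap each piece of graph construction in the scope you want, and you can mix both in one program.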