Google Cloud GPU machines reboot abruptly - TensorFlow

When training a model on a GPU machine, training gets interrupted by a system patch process. Since Google Cloud GPU machines do not support live migration, it is a painful task to restart the training every time this happens. Google has clearly stated in this Doc that there is no way around this other than restarting the machines.
Is there a clever way to detect whether the machine has been rebooted and resume the training automatically?
Sometimes it also happens that, after a kernel update, the CUDA drivers stop working, the GPU is no longer visible, and the CUDA drivers need to be reinstalled. So writing a startup script to resume the training is not a bulletproof solution either.

Yes, there is. If you use TensorFlow, you can use its checkpointing feature to save your progress and pick up where you left off.
One great example of this is provided here: https://github.com/GoogleCloudPlatform/ml-on-gcp/blob/master/gce/survival-training/README-tf-estimator.md
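As a minimal TF 2.x sketch of that pattern (the model, data, and checkpoint directory below are placeholders, not taken from the linked example): restore the latest checkpoint on startup, then save periodically, so a reboot only loses the work done since the last save:
import tensorflow as tf

# Toy model and data; substitute your own. ckpt_dir is a placeholder path.
ckpt_dir = "/tmp/train_ckpts"
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.Adam()
step = tf.Variable(0, dtype=tf.int64)

ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer, step=step)
manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=3)

# On (re)start, restore the latest checkpoint if one exists.
ckpt.restore(manager.latest_checkpoint)
if manager.latest_checkpoint:
    print("Resumed from", manager.latest_checkpoint)

x = tf.random.normal((32, 4))
y = tf.random.normal((32, 1))
for _ in range(100):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean((model(x) - y) ** 2)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    step.assign_add(1)
    if int(step) % 20 == 0:
        manager.save(checkpoint_number=int(step))
Run this from a startup script and an interrupted job simply continues from the last saved step instead of starting over.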

Related

Is it possible to run .ipynb notebooks locally using GPU acceleration? How?

Every time I need to train a 'large' deep learning model I do it from Google Colab, as it allows you to use GPU acceleration.
My PC has a dedicated GPU, and I was wondering if it is possible to use it to run my notebooks locally in a fast way. Is it possible to train models using my PC's GPU? If so, how?
I am open to working with DataSpell, VSCode, or any other IDE.
Nicholas Renotte has a great 'Getting Started' video that goes through the entire process of setting up GPU-accelerated notebooks on your PC. The part you're interested in starts around the 12-minute mark.
Yes, it is possible to run .ipynb notebooks locally using GPU acceleration. To do so, you will need to install the necessary libraries and frameworks, such as TensorFlow, PyTorch, or Keras. Depending on the IDE you choose, you may also need to install the relevant plugins and packages for GPU acceleration.
In terms of IDEs, DataSpell, VSCode, PyCharm, and Jupyter Notebook are all suitable for running notebooks locally with GPU acceleration.
Once the libraries and frameworks are installed, you will need to install the appropriate drivers for your GPU and configure the environment for GPU acceleration.
Finally, you may need to modify the notebook to enable GPU acceleration and specify the number of GPUs to use. Once all of these steps are done, you will be able to run the notebook locally with GPU acceleration.
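As a quick sanity check after installing the drivers and frameworks (assuming TensorFlow and/or PyTorch; only run the lines for the framework you actually installed), you can ask each framework which devices it sees:
# Confirm the local GPU is visible to the framework.
import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))

import torch
print(torch.cuda.is_available(), torch.cuda.device_count())
If the TensorFlow list is empty or torch.cuda.is_available() is False, the problem is in the driver/CUDA setup, not in the notebook.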

Running RAPIDS without GPU for development?

Is there a way to run RAPIDS without a GPU? I usually develop on a small local machine without a GPU, then push my code to a powerful remote server for real use. Things like TensorFlow allow switching between the CPU and GPU depending on whether they're available. Can an equivalent thing be done with RAPIDS? Even if it's slow, being able to test things on a machine without a GPU would be extremely helpful.
There isn't a way to use RAPIDS without a GPU, and part of the reason for that is that we're following the APIs the community has adopted in CPU packages across pandas, NumPy, scikit-learn, NetworkX, etc. This way it should be as easy as swapping an import statement to get something working on the CPU vs the GPU.
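As a minimal sketch of that import-swap pattern (the try/except fallback is one common way to keep a single code path; cuDF is the RAPIDS DataFrame library):
# Develop on a CPU-only machine with pandas; on the GPU server the same
# code runs on cuDF, because cuDF mirrors the pandas API.
try:
    import cudf as pd  # GPU path (RAPIDS installed)
except ImportError:
    import pandas as pd  # CPU fallback for local development

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(df["a"].sum() + df["b"].mean())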

How to develop for TensorFlow with GPU support without a GPU

I have previously asked whether it is possible to run TensorFlow with GPU support on a CPU. I was told that it is possible, and shown the basic code to switch which device I want to use, but not how to get the initial code working on a computer that doesn't have a GPU at all. For example, I would like to train on a computer that has an NVIDIA GPU but program on a laptop that only has a CPU. How would I go about doing this? I have tried just writing the code as normal, but it crashes before I can even switch which device I want to use. I am using Python on Linux.
This thread might be helpful: Tensorflow: ImportError: libcusolver.so.8.0: cannot open shared object file: No such file or directory
I've tried importing tensorflow with tensorflow-gpu installed on my university's HPC login node, which does not have GPUs. It works fine. I don't have an NVIDIA GPU in my laptop, so I never went through the installation process, but I think the cause is that it cannot find the relevant CUDA and cuDNN libraries.
But why don't you just use the CPU version? As @Finbarr Timbers mentioned, you can still run the model on a computer with a GPU.
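As a minimal sketch of explicit device placement (TF 1.x API, to match the rest of this thread; the constants are toy data), you can pin ops to the CPU so the same graph runs unchanged on a GPU machine:
import tensorflow as tf  # TF 1.x API

# Pin these ops to the CPU explicitly; remove the device block (or use
# '/gpu:0') on the machine that actually has a GPU.
with tf.device("/cpu:0"):
    a = tf.constant([1.0, 2.0])
    b = tf.constant([3.0, 4.0])
    c = a + b

with tf.Session() as sess:
    print(sess.run(c))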
What errors are you getting? It is very possible to train on a GPU but develop on a CPU; many people do it, including myself. In fact, TensorFlow will automatically place your code on a GPU if one is available.
If you add the following code to your model, you can see which devices are being used:
import tensorflow as tf  # TF 1.x API

# Creates a session with log_device_placement set to True, which logs
# each op's device assignment to stderr.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
The logged device placements should change when you run your model on a computer with a GPU.

TensorFlow doesn't see the GPU and uses the CPU instead, how come?

I'm running a Python script using GPU-enabled TensorFlow. However, the program doesn't seem to recognize any GPU and starts using the CPU straight away. What could be the cause of this?
Just want to add to the discussion that TensorFlow may stop seeing a GPU due to a CUDA initialization failure; in other words, TensorFlow detects the GPU but can't dispatch any op onto it, so it falls back to the CPU. In this case, you should see an error in the log like this:
E tensorflow/stream_executor/cuda/cuda_driver.cc:481] failed call to cuInit: CUDA_ERROR_UNKNOWN
The cause is likely a conflict between different processes using the GPU simultaneously. When that is the case, the most reliable way I have found to get TensorFlow working again is to restart the machine; in the worst case, reinstall TensorFlow and/or the NVIDIA driver.
See also one more case where the GPU suddenly stops working.
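To confirm what TensorFlow can actually see (TF 1.x API, matching the log message above), a quick check is to list the local devices; a healthy setup reports a GPU entry alongside the CPU:
from tensorflow.python.client import device_lib

# Prints every device TensorFlow can use; if only a CPU entry appears,
# TensorFlow never managed to initialize the GPU.
print(device_lib.list_local_devices())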

TensorFlow very slow on second run (Ubuntu)

I'm having a problem with TensorFlow (CPU) on Ubuntu 14.04 (VM, droplet), where running a script is fast the first time, but when running the same (or another) script directly after completion of the first run, things become very slow.
I'm talking minutes instead of seconds. Even simple test scripts (like those provided in the tutorial) take forever, with no visible CPU load.
For comparison, the first run of the test script from the tutorial gives:
real 0m0.790s, user 0m0.688s, sys 0m0.111s
A second run of the same script, directly after the first completes, gives:
real 2m46.628s, user 0m0.783s, sys 0m0.104s
Eventually, things seem to clear up and performance is back (only for one run though).
I narrowed the problem down to this:
sess = tf.Session()
takes very long. Apparently resources used by a previous Session are not properly released [?]. My scripts use the context manager, like
with tf.Session() as sess:
    sess.run(...)
My latest hypothesis is that this has to do with system properties (virtual machine settings, hypervisor issues interacting with TF's context manager). Using the TF Docker container makes no difference. Rebooting didn't help either. The same scripts run fine on OS X.
To make it obvious what happened and that this question is answered: this occurred because TensorFlow was reading from /dev/random instead of /dev/urandom. On some systems, /dev/random can exhaust its supply of entropy and block until more is available, causing the slowdown. This has now been fixed on GitHub; the fix is included in release 0.6.0 and later.
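As a quick way to observe the blocking behavior described above (Linux-specific; the path below is the kernel's standard procfs entry, not something from this thread), you can watch the entropy pool that /dev/random draws from:
# Print the kernel's available entropy (Linux). A value near zero means
# reads from /dev/random will block until more entropy accumulates.
with open("/proc/sys/kernel/random/entropy_avail") as f:
    print(f.read().strip())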