System reboots automatically when the TensorFlow model is too large - tensorflow

I'm using an NVIDIA GTX 1080 GPU (8 GB) to run the Inception model on TensorFlow. When I set batch_size = 16 and image_size = 400, my Ubuntu 14.04 machine reboots on its own shortly after I start the program.

Make sure it is not a power supply unit (PSU) problem. I was seeing strange occasional reboots on my development machine, and as I increased the size of the input (batch size, larger network), the rate of reboots increased as well. It turned out to be a PSU problem. A quick check is to limit GPU power consumption and see whether the behavior goes away. For instance, you can limit the power draw to about 150 watts with this command (you'll need sudo rights):
sudo nvidia-smi -pl 150
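To confirm that the new cap actually took effect, and to see the range of limits your card accepts, you can query the power readings with nvidia-smi. Below is a minimal Python sketch that just shells out to nvidia-smi (assuming it is on your PATH); running it before and after the -pl change should show the updated limit:

import subprocess

# Print the current draw plus the configured, minimum and maximum power limits per GPU.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,power.draw,power.limit,power.min_limit,power.max_limit",
     "--format=csv"],
    capture_output=True, text=True, check=True)
print(out.stdout)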

I tracked the issue down to a faulty power supply. It had enough capacity according to the spec, and limiting GPU power consumption by running "nvidia-smi -pl 150" didn't help at all; it probably couldn't handle bursts in power consumption.
Anyway, after I changed the power supply from a "Corsair CX750 Builder Series ATX 80 PLUS" to a "Cooler Master V1000", the issue was gone.
See details of my investigation in the TensorFlow GitHub issue.

Changing the GPU power settings will work if you have a PSU with enough power (watts).
I limited my GPU's (TITAN X) power draw to a maximum of 200 watts using:
sudo nvidia-smi -pl 200
NOTE: Each GPU has its own power limits; for example, the TITAN X's power limit range is 125 W to 300 W, so make sure to pass a value within those limits.

I was facing similar problems. Even with small batch sizes, in both TensorFlow and PyTorch, the PC kept restarting by itself. I removed one video card, but that did not solve it, and nvidia-smi -pl 150 on its own didn't work either.
In addition, I applied the following:
sudo nvidia-smi -pm 1      # enable persistence mode
sudo nvidia-smi -lgc 1400  # lock the graphics clock at 1400 MHz
sudo nvidia-smi -lmc 6500  # lock the memory clock at 6500 MHz
sudo nvidia-smi -gtt 65    # set the GPU target temperature to 65 °C
sudo nvidia-smi -cc 1      # override the default CUDA clocks
sudo nvidia-smi -pl 165    # cap the power draw at 165 W
I added these and now it works with 2 GPUs without any problem. These settings are for the RTX 2080 Ti; edit them according to your own video card.
My system:
HP Z800 Workstation
Intel(R) Xeon(R) CPU E5-2643 0 @ 3.30GHz
PSU 850W
Ubuntu 20.04
2x RTX 2080TI
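Before reusing the clock values above on a different card, it may help to check which graphics and memory clocks your GPU actually supports, so the values passed to -lgc and -lmc are valid. A minimal Python sketch that shells out to nvidia-smi (assumed to be on your PATH):

import subprocess

# Dump the graphics/memory clock combinations supported by each GPU.
out = subprocess.run(
    ["nvidia-smi", "-q", "-d", "SUPPORTED_CLOCKS"],
    capture_output=True, text=True, check=True)
print(out.stdout)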

I got exactly the same problem after installing an RTX 2070 in a Dell T3610. The answer provided by Sergey above solved my problem. Just to add a note for Windows users:
Run your command prompt as administrator.
Go to the nvidia-smi directory: typically it is under C:\Program Files\NVIDIA Corporation\NVSMI.
Run nvidia-smi -pl 150.
Then your problem should be solved, and you will see output confirming that the power limit of your GPU has been reduced to 150 W (in my case, from 185 W down to 150 W).

I had a very similar problem but tracked it down to a PATH problem: CUDA 11 had been inserted into my PATH and was somehow overriding my CUDA 10.1 libraries. I am not sure when or how, but it might be related to an NVIDIA driver upgrade I had done recently. At the very least, check that your PATH and versions are correct. CUDA 11 does not work with TensorFlow 2.3.1 or earlier, at least as of 11/2020 on Windows 10 (please let me know if there is a workaround I am unaware of), and this was definitely the problem. When I fixed the PATH to point only to the CUDA 10.1 directory, everything worked fine and I was able to max out the GPU for over 20 minutes with no restart.
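If you suspect this kind of CUDA mismatch, one quick check is to ask TensorFlow which CUDA/cuDNN versions your wheel was built against and whether it can actually see the GPU. A minimal sketch, assuming TensorFlow 2.3 or newer (tf.sysconfig.get_build_info() is not available in older releases):

import tensorflow as tf

# CUDA/cuDNN versions the installed wheel was compiled against (TF 2.3+).
info = tf.sysconfig.get_build_info()
print("built for CUDA:", info.get("cuda_version"))
print("built for cuDNN:", info.get("cudnn_version"))

# Whether TensorFlow can actually see the GPU with the libraries it found.
print("visible GPUs:", tf.config.list_physical_devices("GPU"))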

I had the same issue, and limiting power usage resolved it. I had to reduce the power limit to 150 W, though, as 200 W did not work.

Related

100% GPU utilization on a GCE without any processes

I've just started an instance on Google Compute Engine with 2 GPUs (NVIDIA Tesla K80), and straight away after the start I can see via nvidia-smi that one of them is already fully utilized.
I've checked the list of running processes and there is nothing running at all. Does this mean that Google has rented out that same GPU to someone else?
It's all running on this machine:
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.5 LTS
Release: 16.04
Codename: xenial
Enabling "persistence mode" with nvidia-smi -pm 1 might solve the problem.
ECC in combination with non persistence mode can lead to 100% GPU utilization.
Alternatively you can disable ECC with nvidia-smi -e 0.
Note: I'm not sure if the performance actually is worse. I can remember that I was able to train ML model despite the 100% GPU utilization but I don't know if it was slower.
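To see whether the phantom utilization goes away after changing these settings, you can query the relevant fields directly. A small Python sketch that shells out to nvidia-smi (assumed to be on your PATH):

import subprocess

# Report ECC mode, persistence mode and the utilization nvidia-smi currently sees.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,ecc.mode.current,persistence_mode,utilization.gpu",
     "--format=csv"],
    capture_output=True, text=True, check=True)
print(out.stdout)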
I would suggest reporting this issue on the Google Issue Tracker so it can be investigated. Please provide your project number and instance name there. You can follow this URL to file the issue privately in the Google Issue Tracker.

Tensorflow Performance Does Not Improve with additional GPU

I'm testing the standard TensorFlow benchmark with my desktop configuration shown below.
Intel i7-7700k
Asus B250 Mining Edition
16x Gigabyte P106
32 GB memory
Ubuntu 16.04, CUDA 9.0 and cuDNN 7.1
TensorFlow 1.10 installed
However, the results for 8 cards and 16 cards are the same.
Any idea why is this case happening?
This depends on your setup and the parameters you're using in the benchmark.
Verify that the NVIDIA drivers are working properly: nvidia-smi.
All your GPUs should be listed there (see also the sketch after this list).
Verify that tf-nightly-gpu is installed: pip list. This is a requirement according to the benchmark documentation.
While the model is training, run nvidia-smi again to check whether there is actual GPU utilization, and how many GPUs are being utilized.
Try changing the variable_update parameter values.
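In addition to nvidia-smi, you can check how many GPUs TensorFlow itself enumerates. A minimal sketch for the TF 1.x API used in the question (in TF 2.x, tf.config.list_physical_devices("GPU") gives the same information):

from tensorflow.python.client import device_lib

# Each GPU TensorFlow can use shows up as a device of type "GPU".
devices = device_lib.list_local_devices()
gpus = [d.name for d in devices if d.device_type == "GPU"]
print(len(gpus), "GPUs visible to TensorFlow:", gpus)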
I installed tf-nightly-gpu and set variable_update=independent.

Do I have to reinstall TensorFlow after changing the GPU?

I'm using TensorFlow with a GPU. My computer has an NVIDIA GeForce 750 Ti and I'm going to replace it with a 1080 Ti. Do I have to reinstall TensorFlow (or other drivers, etc.)? If so, what exactly do I have to reinstall?
One more question: can I speed up the training process by installing one more GPU in the computer?
As far as I know, the only things you need to reinstall are the GPU libraries (CUDA and/or cuDNN). If you install the exact same version with the exact same bindings, TensorFlow should not notice you changed the GPU and should keep working.
And yes, you can speed up the training process with multiple GPUs, but explaining how to install and manage that is a bit too broad for a Stack Overflow answer.
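For the multi-GPU part, in current TensorFlow (2.x) the usual approach is tf.distribute.MirroredStrategy, which replicates the model on every visible GPU and averages the gradients. This is only a minimal sketch of the idea, not a drop-in for any particular model:

import tensorflow as tf

# Replicate the model across all visible GPUs; gradients are averaged automatically.
strategy = tf.distribute.MirroredStrategy()
print("replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) will then shard each batch across the GPUs.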

After switching from GPU to CPU in TensorFlow with tf.device("/cpu:0"), the GPU undocks every time tf is imported

I am using Windows 7. After I tested my GPU in TensorFlow, which was awkwardly slow on a model already tested on the CPU, I switched to the CPU with:
tf.device("/cpu:0")
I was assuming that I could switch back to the GPU with:
tf.device("/gpu:0")
However, I got the following error message from Windows when I tried to rerun with this configuration:
The device "NVIDIA Quadro M2000M" is not an exchange device and cannot be removed.
With "nvidia-smi" I looked for my GPU, but the system said the GPU was not there.
I restarted my laptop, tested whether the GPU was there with "nvidia-smi", and it was recognized.
I imported TensorFlow again and started my model again; however, the same error message popped up and my GPU vanished.
Is there something wrong with the configuration in one of the TensorFlow configuration files? Or the Keras files? What can I change to get this working again? And do you know why the GPU is so much slower than the 8 CPUs?
Solution: Reinstalling tensorflow-gpu worked for me.
However, the question remains why this happened and how I can switch between GPU and CPU. I don't want to use a second virtual environment.
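On switching between devices: tf.device is meant to be used as a context manager; calling it on its own line, as in the question, does not actually pin anything, and it never changes the state of the physical GPU (the Windows message about the device being removed likely points to a driver issue rather than a TensorFlow setting). A minimal sketch of the intended usage:

import tensorflow as tf

# Ops created inside the block are placed on the named device.
with tf.device("/cpu:0"):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)   # placed on the CPU

with tf.device("/gpu:0"):
    d = tf.matmul(a, b)   # placed on the GPU, if one is available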

TensorFlow - which Docker image to use?

From TensorFlow Download and Setup under
Docker installation I see:
b.gcr.io/tensorflow/tensorflow latest 4ac133eed955 653.1 MB
b.gcr.io/tensorflow/tensorflow latest-devel 6a90f0a0e005 2.111 GB
b.gcr.io/tensorflow/tensorflow-full latest edc3d721078b 2.284 GB
I know 2. & 3. are with source code and I am using 2. for now.
What is the difference between 2. & 3. ?
Which one is recommended for "normal" use?
TL;DR:
First of all, thanks for the Docker images! They are the easiest and cleanest way to start with TF.
A few asides about the images:
there is no PIL
there is no nano (but there is vi) and apt-get cannot find it; yes, I could probably configure the repos for it, but why not have it out of the box?
There are four images:
b.gcr.io/tensorflow/tensorflow: TensorFlow CPU binary image.
b.gcr.io/tensorflow/tensorflow:latest-devel: CPU Binary image plus source code.
b.gcr.io/tensorflow/tensorflow:latest-gpu: TensorFlow GPU binary image.
b.gcr.io/tensorflow/tensorflow:latest-devel-gpu: GPU binary image plus source code.
And the two properties of concern are:
1. CPU or GPU
2. no source or plus source
CPU or GPU: CPU
For a first-time user it is highly recommended to avoid the GPU version, as it can be anywhere from difficult to impossible to get working. The reason is that not all machines have an NVIDIA graphics chip that meets the requirements. You should first get TensorFlow working and understand it, then move on to the GPU version if you want or need it.
From TensorFlow Build Instructions
Optional: Install CUDA (GPUs on Linux)
In order to build or run TensorFlow with GPU support, both the CUDA Toolkit 7.0 and cuDNN 6.5 v2 from NVIDIA need to be installed.
TensorFlow GPU support requires having a GPU card with NVIDIA Compute Capability >= 3.5. Supported cards include but are not limited to:
NVidia Titan
NVidia Titan X
NVidia K20
NVidia K40
no source or plus source: no source
The Docker images will work without needing the source. You should only want or need the source if you have to rebuild TensorFlow for some reason, such as adding a new op.
The standard recommendation for someone new to TensorFlow is to start with the CPU version without the source.
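Whichever image you pick, a quick way to confirm it works is to start the container (for example, docker run -it b.gcr.io/tensorflow/tensorflow for the CPU image listed above) and run a tiny script at its Python prompt. This uses the TF 1.x-era session API, which matches these older images:

import tensorflow as tf

print(tf.__version__)  # which TensorFlow build the image ships

# Classic hello-world from the docs of that era.
hello = tf.constant("Hello from the TensorFlow Docker image")
with tf.Session() as sess:
    print(sess.run(hello))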