Keras uses GPU for first 2 epochs, then stops using it - tensorflow

I prepare the dataset and save it as an HDF5 file. I have a custom data generator that subclasses Sequence from Keras and generates batches from the HDF5 file.
Now, when I call model.fit_generator with the train generator, the model uses the GPU and trains fast for the first 2 epochs (GPU memory is full and GPU volatile utilization fluctuates nicely around 50%). However, after the 3rd epoch, GPU volatile utilization is 0% and each epoch takes about 20x as long.
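For reference, the generator looks roughly like this (a minimal sketch; the class name, file path, and dataset keys are illustrative, not my actual code):
import h5py
import numpy as np
from keras.utils import Sequence

class HDF5Sequence(Sequence):
    """Yields batches of (x, y) read from an HDF5 file."""

    def __init__(self, h5_path, batch_size):
        self.h5_path = h5_path
        self.batch_size = batch_size
        with h5py.File(h5_path, 'r') as f:
            self.n_samples = f['x'].shape[0]

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(self.n_samples / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = start + self.batch_size
        # Open the file per batch so multiprocessing workers don't share a handle
        with h5py.File(self.h5_path, 'r') as f:
            x = f['x'][start:end]
            y = f['y'][start:end]
        return x, y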
What's going on here?

Can you try configuring the GPU as described in this guide: https://www.tensorflow.org/guide/gpu
Here is how I have done it in my program:
print("Runnning Jupyter Notebook using python version: {}".format(python_version()))
print("Running tensorflow version: {}".format(tf.keras.__version__))
print("Running tensorflow.keras version: {}".format(tf.__version__))
print("Running keras version: {}".format(keras.__version__))
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.experimental.list_physical_devices('GPU')
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
# Restrict TensorFlow to only allocate 2GB of memory on the first GPU
try:
tf.config.experimental.set_virtual_device_configuration(
gpus[0],
[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
except RuntimeError as e:
# Virtual devices must be set before GPUs have been initialized
print(e)
Here is the output of the above code:
Running Jupyter Notebook using python version: 3.7.7
Running tensorflow version: 2.1.0
Running tensorflow.keras version: 2.2.4-tf
Running keras version: 2.3.1
Num GPUs Available: 1
1 Physical GPUs, 1 Logical GPUs
The value might differ; memory_limit=2048 is the amount of memory (in MB) allocated to the GPU device.
To confirm the allocation, use nvidia-smi. If you run with this config, Keras won't increase memory usage beyond that limit. You said that after 2 epochs training becomes very slow; can you also tell whether the kernel dies, or the system hangs or restarts? The issue I have faced without this config is that the system just hangs. If you are running on Ubuntu 18.04 LTS, open the System Monitor tool before executing all cells in the notebook, then watch the Resources tab while it runs (it visually shows how many cores are being used; a steady, periodic increase usually means something is wrong).
Do:
A fresh run
Or Restart & Run All
Suspected Issue: How to prevent tensorflow from allocating the totality of a GPU memory?
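If limiting memory with a virtual device doesn't help, another option worth trying (a sketch using the standard TF 2.x API, not something taken from your setup) is to enable memory growth so TensorFlow only grabs GPU memory as it needs it:
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            # Allocate GPU memory on demand instead of pre-allocating all of it
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)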

Same Error Here!!
When you install tensorflow-gpu along with the NVIDIA toolkit, it provides only a limited amount of GPU memory (in my case 2GB). Due to a memory leak it eventually releases the GPU and falls back to using the CPU.
If you want to avoid this, use Google Colab, which provides about 36.7GB of GPU memory.

Related

CPU and GPU Tensorflow Installation

I'm very new to TensorFlow.
I want to run my code on my CUDA GPU, so I installed tensorflow-gpu after installing the normal TensorFlow.
How can I tell Python to use the GPU-based TensorFlow?
If you have tensorflow-gpu installed, there really isn't any reason to also have tensorflow. Without a GPU present it will just run on the CPU anyway.
To specify which GPU you use:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
in place of "0" you can either list GPUs (if you have multiple), or "" if you want it to run on cpu.
Alternatively, specify it in the session:
# device_count={'GPU': 0} tells TensorFlow not to use any GPU for this session
sess = tf.Session(config=tf.ConfigProto(device_count={'GPU': 0}))
Furthermore, you can check which installation your system prioritizes by opening a Python console and typing:
>>> import tensorflow
>>> tensorflow
<module 'tensorflow' from
'/home/.../python3.6/site-packages/tensorflow/__init__.py'>
^
|
here
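Another quick check (a suggestion on top of the above, using the standard TF 1.x utility) is to list the devices TensorFlow can actually see; a working GPU build will show at least one device of type GPU:
from tensorflow.python.client import device_lib

# Prints every device TensorFlow can use (CPU plus any visible GPUs)
print(device_lib.list_local_devices())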

How to allow soft device placement when deploying a TensorFlow model to GCP?

I am trying to deploy a TensorFlow model to GCP's Cloud Machine Learning Engine for prediction, but I get the following error:
$> gcloud ml-engine versions create v1 --model $MODEL_NAME --origin $MODEL_BINARIES --runtime-version 1.9
Creating version (this might take a few minutes)......failed.
ERROR: (gcloud.ml-engine.versions.create) Bad model detected with error: "Failed to load model: Loading servable: {name: default version: 1} failed: Invalid argument: Cannot assign a device for operation 'tartarus/dense_2/bias': Operation was explicitly assigned to /device:GPU:3 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0 ]. Make sure the device specification refers to a valid device.\n\t [[Node: tartarus/dense_2/bias = VariableV2[_class=[\"loc:#tartarus/dense_2/bias\"], _output_shapes=[[200]], container=\"\", dtype=DT_FLOAT, shape=[200], shared_name=\"\", _device=\"/device:GPU:3\"]()]]\n\n (Error code: 0)"
My model was trained on several GPUs, and it seems like the default machines on CMLE don't support GPU for prediction, hence the error I get. So, I am wondering if the following is possible:
Set the allow_soft_placement var to True, so that CMLE can use the CPU instead of the GPU for a given model.
Activate GPU prediction on CMLE for a given model.
If not, how can I deploy a TF model trained on GPUs to CMLE for prediction? It feels like this should be a straightforward feature to use, but I can't find any documentation about it.
Thanks!
I've never used gcloud ml-engine versions create, but when you deploy a training job with gcloud ml-engine jobs submit training, you can add a --config flag that points to a configuration file.
This file lets you specify the target machine type for training, and you can use multiple CPUs and GPUs. The documentation for the configuration file is here.
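Regarding the error itself, one workaround that is sometimes used is to re-export the model with the hard-coded device assignments stripped out, so CPU-only serving machines can load it. This is only a sketch, assuming a TF 1.x checkpoint such as model.ckpt is available; I haven't verified it on CMLE:
import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    # clear_devices=True removes the explicit /device:GPU:* pins from the graph
    saver = tf.train.import_meta_graph('model.ckpt.meta', clear_devices=True)
    saver.restore(sess, 'model.ckpt')

    # Re-export as a SavedModel; a real export would also pass a signature_def_map
    builder = tf.saved_model.builder.SavedModelBuilder('export_dir')
    builder.add_meta_graph_and_variables(
        sess, [tf.saved_model.tag_constants.SERVING], clear_devices=True)
    builder.save()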

How to run a project Caffe with CPU only

I'm using Ubuntu 14.04 without a GPU, and I want to run this code with the CPU only (not with the GPU): https://github.com/smallcorgi/Faster-RCNN_TF. What should I do?
The Github repository you are referring to is a TensorFlow implementation of Faster-RCNN, not Caffe.
If you want to use the Caffe implementation, you have to use this repository : https://github.com/rbgirshick/py-faster-rcnn
You have to edit the Python scripts that are used to train and test the model, e.g. train_faster_rcnn_alt_opt.py, so that the line caffe.set_mode_gpu() is replaced by caffe.set_mode_cpu(). You might also have to recompile Caffe after editing the Makefile.config file to remove CUDNN and CUDA support.
Note that Caffe on CPU will be very slow compared to GPU computing.
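For reference, the relevant change is just the standard Caffe Python API call:
import caffe

# Run all Caffe computation on the CPU instead of the GPU
caffe.set_mode_cpu()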

tensorflow inception for retraining / fine tuning with a pretrained model: inception_train

I tried to retrain (new images, new classes) on top of the pretrained Inception model, so I followed the instructions in the Inception README:
https://github.com/tensorflow/models/tree/master/inception#how-to-construct-a-new-dataset-for-retraining
I successfully built and ran build_image_data using bazel, as described in the tutorial. Afterwards I successfully built inception_train using bazel:
~/tensorflowmodels/models/inception# bazel build inception/inception_train
INFO: Found 1 target...
Target //inception:inception_train up-to-date (nothing to build)
INFO: Elapsed time: 0.073s, Critical Path: 0.00s
However, running bazel-bin/inception/inception_train I always get the following:
~/tensorflowmodels/models/inception# bazel-bin/inception/inception_train --train_dir="/" --validation_dir="/" --data_dir="/images_jpg/" --pretrained_model_checkpoint_path="/tensorflowmodels/models/inception/inception-v3/" --fine_tune=True --initial_learning_rate=0.001 --input_queue_memory_factor=1 --num_gpus=1
-bash: bazel-bin/inception/inception_train: No such file or directory
Naturally I would say there's a 99.9999% chance it's a typo. So then I tried to run inception_train.py with Python directly. I had to change some import locations, and it finally ran with the given parameters. However, the script stops without any error messages right after the CUDA drivers are initialized.
Any help on how to solve this (or perform fine tuning / retraining with inception) would be very much appreciated.
tensorflow version: 0.9rc0
CPU: Xeon 5, 24 cores
GPU: Grid K2 8 GB
OS: Ubuntu 14.04
BTW, I already posted this as a GitHub issue (which was closed, since it would be more of a case for Stack Overflow).

Extremely low accuracy in "Deep MNIST for Experts" using Pascal GPU

First of all, I'm a bit unsure whether I should have asked this on GitHub or here, but since I wasn't sure I opted to go with Stack Overflow.
I recently got an Nvidia GTX 1070 and wanted to try out TensorFlow with it. I'm using a fresh install of Ubuntu 16.04, the nvidia-367 driver from the "Graphics Drivers Team" PPA, nvidia-cuda-toolkit 7.5.18-0ubuntu1 and cuDNN v4 (Feb 10, 2016).
TensorFlow was installed according to https://www.tensorflow.org/versions/r0.9/get_started/os_setup.html following the "Virtualenv installation", using this TF_BINARY_URL:
# Ubuntu/Linux 64-bit, GPU enabled, Python 2.7
# Requires CUDA toolkit 7.5 and CuDNN v4. For other versions, see "Install from sources" below.
(tensorflow)$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.9.0-cp27-none-linux_x86_64.whl
The first tutorial seems to work fine, and I've run a few other example models that also seem to work fine, but for some reason I'm getting an accuracy of about 9.5% in the "Deep MNIST for Experts" tutorial.
At first I thought I had made some error copy-pasting the code and spent some time trying to debug it, to no avail. Then I found this issue on GitHub https://github.com/tensorflow/tensorflow/issues/2781, tried downloading that code, and didn't get anywhere close to 90% accuracy either. I also tried fixing the bug in the code, so that the train step runs every iteration, with no luck.
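For clarity, the corrected loop looked roughly like this (a sketch using the variable names from the tutorial, with train_step run on every iteration; it assumes the tutorial's InteractiveSession and graph are already set up):
for i in range(20000):
    batch = mnist.train.next_batch(50)
    if i % 100 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: batch[0], y_: batch[1], keep_prob: 1.0})
        print("step %d, training accuracy %g" % (i, train_accuracy))
    # The fix: actually run the training step on every iteration
    train_step.run(feed_dict={x: batch[0], y_: batch[1], keep_prob: 0.5})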
This is the output I get from running the tut.py from the above-mentioned GitHub issue, modified to run train_step on each iteration of the loop:
$ python -i tut.py
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcuda.so locally
I tensorflow/stream_executor/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> conv_net()
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:924] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GTX 1070
major: 6 minor: 1 memoryClockRate (GHz) 1.7715
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.46GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0: Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:806] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1070, pci bus id: 0000:01:00.0)
step 0, training accuracy 0.14
step 100, training accuracy 0.1
step 200, training accuracy 0.16
step 300, training accuracy 0.12
step 400, training accuracy 0.1
step 500, training accuracy 0.08
[....]
step 19500, training accuracy 0.18
step 19600, training accuracy 0.06
step 19700, training accuracy 0.1
step 19800, training accuracy 0.12
step 19900, training accuracy 0.08
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 5.84GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
test accuracy 0.0954
I might also add that I'm fairly sure I ran this tutorial a while back using an older GPU without any issues, so somehow I get the feeling that something with the Pascal architecture isn't supported properly. What's even stranger is that some of the more complex models like the CNN and RNN "tutorials"/examples seem to run fine.
Edit:
I installed the CPU version using
# Ubuntu/Linux 64-bit, CPU only, Python 2.7
(tensorflow)$ export TF_BINARY_URL=https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-0.9.0-cp27-none-linux_x86_64.whl
and running 1000 iterations (instead of 20000) gives this result:
$ python -i tut.py
>>> conv_net()
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
step 0, training accuracy 0.14
step 100, training accuracy 0.88
step 200, training accuracy 0.88
step 300, training accuracy 0.82
step 400, training accuracy 0.94
step 500, training accuracy 0.92
step 600, training accuracy 0.98
step 700, training accuracy 0.94
step 800, training accuracy 0.9
step 900, training accuracy 1
test accuracy 0.9648
Guess I'll try reinstalling from source with newer versions of "everything".
Installing newer versions of CUDA and CuDNN seems to have solved the issue. (I saw the download page for CuDNN explicitly states that version 4 doesn't work with GTX 1070/1080.)
What worked for me was:
use the "Graphics Drivers Team" Ubuntu PPA thingy for installing the nvidia-367 drivers.
install CUDA 8.0 RC using the runfile; I didn't install the bundled driver. I tried the deb file, but there were some issues with it wanting to install the bundled nvidia-361 driver. I never tried the third option (some tar.gz file, IIRC).
installed Bazel from source; again I had some issues with the custom apt repo due to a dependency on Java.
I used HEAD from TensorFlow's git repo, for no particular reason.
I ran into this issue (or something very similar). It was solved by switching to gcc-4.9 instead of the default (I only changed the path in the configure script for TensorFlow). I have no idea why this works; it was something of a lucky guess.
Think I needed to install the zlib1g-dev package due to missing header files, but if so the error message was very clear that this was the issue.
Sorry for my mistake... (I deleted my previous answer.)
But I've found the solution.
Check the following link.
You have to join the NVIDIA developer program and download CUDA 8.0 (after installing CUDA 8.0, it is necessary to reinstall the NVIDIA driver!):
https://developer.nvidia.com/cuda-release-candidate-download