I'm using Ubuntu 14.04 without a GPU, and I want to run this code with the CPU only (not with a GPU): https://github.com/smallcorgi/Faster-RCNN_TF. What should I do?
The GitHub repository you are referring to is a TensorFlow implementation of Faster-RCNN, not Caffe.
If you want to use the Caffe implementation, you have to use this repository: https://github.com/rbgirshick/py-faster-rcnn
You have to edit the Python scripts used to train and test the model, e.g. train_faster_rcnn_alt_opt.py, so that the line caffe.set_mode_gpu() is replaced by caffe.set_mode_cpu(). You might also have to recompile Caffe after editing the Makefile.config file to remove cuDNN and CUDA support.
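A minimal sketch of that change, using the alt-opt training script from py-faster-rcnn (the surrounding lines are illustrative; only the set_mode call matters):

import caffe

# was: caffe.set_mode_gpu() followed by caffe.set_device(...)
caffe.set_mode_cpu()  # run all solving and inference on the CPU

For the Caffe build itself, the usual approach is to uncomment CPU_ONLY := 1 (and comment out USE_CUDNN := 1) in Makefile.config before running make.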
Note that Caffe on CPU will be very slow compared to GPU computing.
StyleGAN2 uses network pickle files to store ML models. I transfer-trained one model, which I am able to open on cloud servers. I have been generating images from this model fine with the following setup:
Google Colab: Python 3.6.9, CUDA 10.1, tensorflow-gpu 1.15, CuDNN 7.6.5
However, I cannot open the network pickle file on my local machine, even though I've been trying to replicate that cloud setup the best I can. (I have the right GPU hardware/drivers/etc.)
Local (Windows 10): Python 3.6.9, CUDA 10.1, tensorflow-gpu 1.15, CuDNN 7.6.5
Opening the pickle requires the 'dnnlib' library to be on the PYTHONPATH and a tf.Session() to be initialized.
I get an assertion error about the pickle:
**Assertion error**: `assert state["version"] in [2,3]`
I find this error very odd because the network pickle works on the cloud, so it was saved properly. Additionally, my local setup can open other network pickles (i.e. ones downloaded from the internet through GET requests), which makes me think that I have properly set up my PYTHONPATH and initialized a tf.Session. These are the prerequisites listed in the StyleGAN repo:
"You can import the networks in your own Python code using pickle.load(). For this to work, you need to include the dnnlib source directory in PYTHONPATH and create a default TensorFlow session by calling dnnlib.tflib.init_tf()"
I'm not sure why I cannot open up this pickle in one environment, but can in another. Does anyone have any suggestions as to where I might start looking?
Actually, I figured it out by printing out the version that was throwing the error. The version printed was '4'. I realized that this matched pickle.HIGHEST_PROTOCOL and that what I needed was the newest pull of the StyleGAN2 repository, which includes pickle format_version 4 in its allowed versions.
I prepare the dataset and save it as an HDF5 file. I have a custom data generator that subclasses Sequence from Keras and generates batches from the HDF5 file.
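A minimal sketch of such a generator, assuming the HDF5 file holds 'x' and 'y' datasets; the dataset names, batch handling, and import path are placeholders rather than the setup from this question:

import h5py
import numpy as np
from keras.utils import Sequence

class HDF5Sequence(Sequence):
    def __init__(self, path, batch_size):
        self.h5 = h5py.File(path, 'r')
        self.x, self.y = self.h5['x'], self.h5['y']  # placeholder dataset names
        self.batch_size = batch_size

    def __len__(self):
        # number of batches per epoch
        return int(np.ceil(len(self.x) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        # h5py reads contiguous slices lazily, so the whole file never sits in RAM
        return self.x[sl], self.y[sl]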
Now, when I call model.fit_generator with the train generator, the model uses the GPU and trains fast for the first 2 epochs (GPU memory is full and volatile GPU utilization fluctuates nicely around 50%). However, after the 3rd epoch, volatile GPU utilization is 0% and the epoch takes 20x as long.
What's going on here?
Can you try configuring the GPU as described in this guide: https://www.tensorflow.org/guide/gpu
Here is how I have done it in my program:
print("Runnning Jupyter Notebook using python version: {}".format(python_version()))
print("Running tensorflow version: {}".format(tf.keras.__version__))
print("Running tensorflow.keras version: {}".format(tf.__version__))
print("Running keras version: {}".format(keras.__version__))
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
tf.config.experimental.list_physical_devices('GPU')
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
# Restrict TensorFlow to only allocate 2GB of memory on the first GPU
try:
tf.config.experimental.set_virtual_device_configuration(
gpus[0],
[tf.config.experimental.VirtualDeviceConfiguration(memory_limit=2048)])
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
except RuntimeError as e:
# Virtual devices must be set before GPUs have been initialized
print(e)
Here is the output of the above code:
Running Jupyter Notebook using python version: 3.7.7
Running tensorflow version: 2.1.0
Running tensorflow.keras version: 2.2.4-tf
Running keras version: 2.3.1
Num GPUs Available: 1
1 Physical GPUs, 1 Logical GPUs
Values might differ; memory_limit=2048 is the amount of memory (in MB) allocated to the GPU device.
To confirm the allocation, use nvidia-smi; if you run with this config, Keras won't increase memory usage. You said that after 2 epochs it becomes very slow; can you tell us more: does the kernel die, or does the system hang or restart? The issue I have faced without this config is that the system just hangs. If you are running on Ubuntu 18.04 LTS, use the System Monitor tool (it shows visually how many cores are being used; a periodic constant increase means something is wrong) before executing all cells in the notebook; once you start, check the Resources tab in System Monitor.
Do:
- A fresh run
- Or Restart & Run All
Suspected Issue: How to prevent tensorflow from allocating the totality of a GPU memory?
Same error here!
When you install tensorflow-gpu along with the NVIDIA toolkit, it provides a limited amount of GPU memory (in my case 2 GB). Due to a memory leak, it eventually releases the GPU and falls back to using the CPU.
If you want to avoid this condition, use Google Colab, which provides about 36.7 GB of GPU memory.
I'm very new to TensorFlow.
I want to run my code on my CUDA GPU, so I've installed tensorflow-gpu after installing the normal TensorFlow.
How can I tell Python to use the GPU-based TensorFlow?
If you have tensorflow-gpu installed, there really isn't any reason to also have tensorflow: without the presence of a GPU, the GPU build will just run on the CPU anyway.
To be specific about which GPU you use:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
in place of "0" you can either list GPUs (if you have multiple), or "" if you want it to run on cpu.
alternatively, specify in the session:
sess = tf.Session(config=tf.ConfigProto(device_count={'GPU': 0}))
Furthermore, you can check which version your computer prioritizes by opening a Python console and typing:
>>> import tensorflow
>>> tensorflow
<module 'tensorflow' from
'/home/.../python3.6/site-packages/tensorflow/__init__.py'>
^
|
here
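As an additional check, you can ask TensorFlow to list the devices it actually sees (this uses the same TF 1.x API era as the tf.Session example above; the printed names are illustrative):

from tensorflow.python.client import device_lib

# Prints one entry per visible device, e.g. '/device:CPU:0' and '/device:GPU:0'.
print([d.name for d in device_lib.list_local_devices()])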
I'm running a program to process some data, and I run inference with both a TensorFlow model and a PyTorch model.
When running inference with either of the models alone, everything works fine. However, when I add the PyTorch input, my program crashes with this error:
2018-05-14 12:55:05.525251: E tensorflow/stream_executor/cuda/cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2018-05-14 12:55:05.525280: F tensorflow/core/kernels/conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)
Note that this already happens before I do anything with PyTorch: no models are loaded, nothing is put on the GPU, no devices are checked.
Does anyone know what might be going wrong, how to fix it, and if there are some parameters I can change?
Something I already tried is disabling the PyTorch cuDNN backend using this code:
import torch.backends.cudnn as cudnn
cudnn.enabled = False
But unfortunately this does not help...
You'll find in the NVIDIA forums some references to cuBLAS not playing well with several Python processes interacting with it at the same time. This is referenced in this one-year-old issue for TensorFlow, but it should be the same for any application where multiple clients (PyTorch included) interface with the GPU through CUDA, and with cuBLAS more specifically. The cuBLAS handles weren't being properly initialized, apparently due to a mixture of issues related to on-disk caching and RAM utilization being too large.
The solution was both to delete the on-disk cache for cuBLAS,
sudo rm -rf ~/.nv
and to restrict the amount of memory each net is allowed to use.
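On the TensorFlow side, one way to apply that restriction with the TF 1.x API (which matches the error log above) is the following; the 0.5 fraction is an illustrative value, not a recommendation:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate incrementally instead of grabbing everything
config.gpu_options.per_process_gpu_memory_fraction = 0.5  # cap this process at ~50% of GPU RAM
sess = tf.Session(config=config)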
I tried to retrain (new images, new classes) on top of the pretrained Inception model, so I followed the instructions in the Inception README:
https://github.com/tensorflow/models/tree/master/inception#how-to-construct-a-new-dataset-for-retraining
I successfully built and ran build_image_data using bazel, as described in the tutorial. Afterwards I successfully built inception_train using bazel:
~/tensorflowmodels/models/inception# bazel build inception/inception_train
INFO: Found 1 target...
Target //inception:inception_train up-to-date (nothing to build)
INFO: Elapsed time: 0.073s, Critical Path: 0.00s
However, running bazel-bin/inception/inception_train I always get the following:
~/tensorflowmodels/models/inception# bazel-bin/inception/inception_train --train_dir="/" --validation_dir="/" --data_dir="/images_jpg/" --pretrained_model_checkpoint_path="/tensorflowmodels/models/inception/inception-v3/" --fine_tune=True --initial_learning_rate=0.001 --input_queue_memory_factor=1 --num_gpus=1
-bash: bazel-bin/inception/inception_train: No such file or directory
Naturally I would say there's a 99.9999% chance that it's a typo. So then I tried to run inception_train.py with Python directly. I had to change some import locations, and it finally ran with the given parameters. However, the script stops without any error message after the initialization of the CUDA drivers.
Any help on how to solve this (or on performing fine-tuning / retraining with Inception) would be very much appreciated.
tensorflow version: 0.9rc0
CPU: Xeon 5, 24 cores
GPU: Grid K2 8 GB
OS: Ubuntu 14.04
BTW, I already posted this as a GitHub issue (which was closed, since it would be more of a case for Stack Overflow).