Can I run the DeepLab image segmentation completely on CPU?
I have access to an HPC cluster with high-memory resources, but it is not GPU-enabled.
Yes, you can run it completely on the CPU. For that you only need to make one small change:
Open the file train.py and add the line
os.environ["CUDA_VISIBLE_DEVICES"] = ""
before TensorFlow is imported.
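A minimal sketch of the top of train.py after that change (the exact import line in your copy may look different):
import os
# Hide all GPUs from TensorFlow so every op is placed on the CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
import tensorflow as tf  # the import must come after the environment variable is set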
Yes, you can. In fact you might not need to change the code at all to run it on a CPU-only machine.
I'm trying to use aitextgen to fine-tune the 774M GPT-2 on a dataset. Unfortunately, no matter what I do, training fails because only 80 MB of VRAM are free. How can I clear the VRAM without restarting the runtime, and maybe prevent it from filling up in the first place?
Another solution is to use the following code snippets.
1.
!pip install numba
Then:
from numba import cuda

# ... all of your code and execution ...

cuda.select_device(0)  # select the GPU whose memory you want to release
cuda.close()           # destroy the CUDA context and free the VRAM it holds
Your problem is discussed in the official TensorFlow GitHub issue tracker: https://github.com/tensorflow/tensorflow/issues/36465
Update: @alchemy reported this to be unrecoverable, i.e. after cuda.close() the device cannot be turned back on in the same process.
You can try the code below.
from numba import cuda

device = cuda.get_current_device()  # handle to the currently active GPU
device.reset()                      # reset the device, releasing its allocated memory
Run the command !nvidia-smi inside a notebook cell.
Look for the process ID (PID) of the process that is holding GPU memory you no longer need, then run the command !kill process_id to clean up the VRAM.
It should help you.
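For example, in a notebook the sequence might look like this (1234 is a made-up PID; use whatever nvidia-smi actually reports):
!nvidia-smi
# note the PID of the process holding GPU memory, then kill it:
!kill 1234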
Solved this problem myself. It was because there were too many images in the CelebA dataset and my dataloader was inefficient; data loading took too much time and caused the low speed.
But this still does not explain why the code was running on the CPU while GPU memory was also taken up. In the end I just switched to PyTorch.
My environment: Windows 10, CUDA 9.0, cuDNN 7.0.5, tensorflow-gpu 1.8.0.
I am working on a CycleGAN model. At first it worked fine with my toy dataset and could run on the GPU without major problems (though the first 10 iterations took an extremely long time, which suggests it might have been running on the CPU).
I later tried the CelebA dataset and only changed the folder name used to load the data (I loaded all the data into memory at once, then used my own next_batch function and feed_dict to train the model). Then the problem arose: GPU memory was still allocated according to GPU-Z, but the GPU load was low (less than 10%) and training was very slow (more than 10 times slower than normal), which suggests the code was running on the CPU.
Would anyone please give me some advice? Any help is appreciated, thanks.
What batch size were you trying? If it's too small (something like 2-8) for a small model, the memory consumed will not be much. It all depends on your batch size, the number of parameters in your model, the model architecture, and how much of the model can run in parallel. Maybe try increasing your batch size and re-running it?
I have code written in TensorFlow that runs fine on CPUs.
I am moving to a new machine which has GPUs, but when I run the code on the new machine the training speed does not improve as expected (it takes almost the same time).
I understand that TensorFlow automatically detects GPUs and runs operations on them (https://www.quora.com/How-do-I-automatically-put-all-my-computation-in-a-GPU-in-TensorFlow and https://www.tensorflow.org/tutorials/using_gpu).
Do I have to change the code to manually run the operations on GPUs (for now I have a single GPU)? And what would be gained by doing that manually?
Thanks
If the GPU version of TensorFlow is installed and if you don't assign all your tensors to CPU, some of them should be assigned to GPU.
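If you do want to pin specific operations to the GPU yourself, here is a minimal sketch (assuming TensorFlow 2.x eager execution and a visible GPU; in TF 1.x the same tf.device context is used inside a graph and session):
import tensorflow as tf

# Explicitly place these ops on the first GPU.
with tf.device('/GPU:0'):
    a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    b = tf.constant([[1.0, 1.0], [0.0, 1.0]])
    c = tf.matmul(a, b)

print(c)
In practice you rarely gain anything by doing this manually for a single GPU, since TensorFlow already prefers the GPU for ops that have a GPU kernel.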
To find out which devices (CPU, GPU) are available to TensorFlow, you can use this:
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
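On TensorFlow 2.x the equivalent check is shorter (a side note; the question above predates 2.x):
import tensorflow as tf
# Lists the GPUs TensorFlow can see; an empty list means it will run on the CPU only.
print(tf.config.list_physical_devices('GPU'))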
Regarding the performance question, it's quite a broad subject and it really depends on your model, your data, and so on. Here are a few broad remarks on TensorFlow performance.
I modified the TensorFlow tutorial example for beginners by adding a hidden layer. The recognition rate obtained by running on the GPU is ~95%, but when running in CPU-only mode I get ~40%. Does anybody know why the same Python code behaves so differently? Thanks.
Weiguang
I'm using TensorFlow to train a model and run predictions, and I use htop on Ubuntu to monitor CPU usage. Prediction is very slow, I just can't bear it. htop shows the CPU bars almost entirely red, which means almost all CPU time is being spent in kernel/system threads, while CPU usage is 0% before TensorFlow starts.
I have not changed the thread_num; I'm using TensorFlow v0.11 on Ubuntu 14.04.
The problem is that the default glibc malloc is not efficient for small allocations. Also, because Google develops and tests TensorFlow with tcmalloc internally, bad interactions with the regular malloc don't get ironed out. The solution is to run TensorFlow with tcmalloc:
sudo apt-get install google-perftools          # provides tcmalloc
export LD_PRELOAD="/usr/lib/libtcmalloc.so.4"  # preload tcmalloc instead of glibc malloc
python ...
If you're looking for something to improve the inference performance, I could recommend trying OpenVINO. It improves your model's performance by converting it to Intermediate Representation (IR), performing graph pruning, and fusing certain operations into others. Then, at runtime, it uses vectorization. OpenVINO is optimized for Intel hardware, although it should work with any CPU.
It's rather straightforward to convert the TensorFlow model to OpenVINO unless you have fancy custom layers. The full tutorial on how to do it can be found here. Some snippets are below.
Install OpenVINO
The easiest way to do it is using PIP. Alternatively, you can use this tool to find the best way in your case.
pip install openvino-dev[tensorflow]
Use Model Optimizer to convert SavedModel model
The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, which is the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change data_type). Run in the command line:
mo --saved_model_dir "model" --data_type FP32 --output_dir "model_ir"
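For example, the FP16 variant mentioned above differs only in the data_type flag:
mo --saved_model_dir "model" --data_type FP16 --output_dir "model_ir"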
Run the inference
The converted model can be loaded by the runtime and compiled for a specific device, e.g., CPU or GPU (integrated into your CPU like Intel HD Graphics). If you don't know what the best choice for you is, use AUTO. If you care about latency, I suggest adding a performance hint (as shown below) to use the device that fulfills your requirement. If you care about throughput, change the value to THROUGHPUT or CUMULATIVE_THROUGHPUT.
from openvino.runtime import Core  # OpenVINO Runtime API (installed with openvino-dev)

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "LATENCY"})
# Get the output layer
output_layer_ir = compiled_model_ir.output(0)
# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
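And if throughput matters more to you than latency, only the hint in the compile call changes (a sketch based on the same snippet):
compiled_model_ir = ie.compile_model(model=model_ir, device_name="AUTO", config={"PERFORMANCE_HINT": "THROUGHPUT"})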
Disclaimer: I work on OpenVINO.