Using multiple GPUs on Windows with Theano and Keras

I am a beginner in deep learning/Theano/Keras. I'm trying to figure out how to use multiple GPUs on Windows 7. I've had success installing Theano and Keras (as described in this post: How do I install Keras and Theano in Anaconda Python on Windows?) and using one GPU. I want to use both of my GPUs.
Following are the details of my configuration and versions:
Python - 2.7 (Anaconda 4.3.14, Windows 64-bit)
CUDA - 7.5.17
Theano - 0.9.0rc3
Keras - 1.2.2
PyCUDA - 2016.1.2+cuda7518
GPU - GeForce GTX 480 (2 of them)
My Theano configuration is as below:
.theanorc.txt
[global]
floatX = float32
device = gpu
[nvcc]
flags=-LC:\ProgramData\Anaconda2\libs
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin
[lib]
cnmem=0.8
Currently I'm able to use only one GPU, and I get the memory error below when I try to fit the model:
MemoryError: ('Error allocating 411041792 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).', "you might consider using 'theano.shared(..., borrow=True)'")
Would using 2 GPUs solve the problem (and if yes, how do I enable the second one), or is my model too big?
Thank You
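A hedged note on the multi-GPU part (an assumption, not something stated above): Theano's old CUDA backend drives only one GPU per process, so it cannot split a single model across both GTX 480s. What people usually do is run a second training process pinned to the second card by setting THEANO_FLAGS before Theano is imported, for example:

import os

# Pin this process to the second GTX 480 before Theano initializes
# (old-backend device names are gpu0, gpu1, ...).
os.environ["THEANO_FLAGS"] = "device=gpu1,floatX=float32,lib.cnmem=0.8"

import theano  # now initializes on gpu1
print(theano.config.device)

The CNMEM out-of-memory error itself is usually addressed by reducing the batch size (or raising cnmem closer to 0.95) rather than by adding a second GPU, since a GTX 480 only has about 1.5 GB of memory.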

Related

How to Enable Mixed precision training

I'm trying to train a deep learning model in VS Code, so I would like to use the GPU for that. I have CUDA 11.6, an NVIDIA GeForce GTX 1650, tensorflow-gpu==2.5.0 and pip version 21.2.3 on Windows 10. The problem is that whenever I run this part of the code I get this error: Mixed precision training with AMP or APEX (--fp16 or --bf16) and half precision evaluation (--fp16_full_eval or --bf16_full_eval) can only be used on CUDA devices.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=new_output_models_dir,
    # output_dir="dev/",
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    # dataloader_num_workers=1,
    dataloader_num_workers=0,
    evaluation_strategy="steps",
    num_train_epochs=40,
    fp16=True,
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)
I've also tested whether TensorFlow can access a GPU and whether TensorFlow was built with CUDA GPU support using tf.config.list_physical_devices('GPU') and tf.test.is_built_with_cuda(), and both return True. How do I solve this issue, and why am I getting this error? Any ideas?
The above error means that fp16=True/bf16=True is not accepted when no CUDA device is detected. CUDA 11.6 may be the issue here: tensorflow-gpu 2.5.0 was tested against CUDA 11.2 and cuDNN 8.1, so test with those versions. If that does not work, you can fall back to the fp16=False parameter.
Ref - https://www.tensorflow.org/install/source#gpu
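If the goal is just to keep training while the CUDA setup is sorted out, a minimal sketch (reusing the GPU check from the question; output_dir is a placeholder here) would gate fp16 on whether a GPU is actually visible:

import tensorflow as tf
from transformers import TrainingArguments

# Request mixed precision only when a GPU is actually visible;
# otherwise fall back to fp16=False so training can still start.
gpu_available = len(tf.config.list_physical_devices("GPU")) > 0

training_args = TrainingArguments(
    output_dir="out/",            # placeholder path
    group_by_length=True,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",
    num_train_epochs=40,
    fp16=gpu_available,           # True only when a GPU is detected
    save_steps=400,
    eval_steps=400,
    logging_steps=400,
    learning_rate=1e-4,
    warmup_steps=500,
    save_total_limit=2,
)

Note that the Trainer does its own device detection, so this only avoids the hard fp16 error; it does not make the GPU usable if the CUDA/cuDNN mismatch is the underlying problem.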

An alternative to NCCL on Windows 10

So I am on Windows 10 and am using multiple GPUs to run the training of a machine learning model (a GAN); you can check the full code here:
Here I get to the point where the gradients from the different GPU devices need to be summed, as follows:
if len(devices) > 1:
    with tf.name_scope('SumAcrossGPUs'), tf.device(None):
        for var_idx, grad_shape in enumerate(self._grad_shapes):
            g = [dev_grads[dev][var_idx][0] for dev in devices]
            if np.prod(grad_shape):  # nccl does not support zero-sized tensors
                g = tf.contrib.nccl.all_sum(g)
            for dev, gg in zip(devices, g):
                dev_grads[dev][var_idx] = (gg, dev_grads[dev][var_idx][1])
Now in this part I get an error regarding NCCL, which I noticed is not supported on Windows (it needs Linux), so I am stuck here. What is the workaround? How can I manage to use NCCL on Windows, or what is an alternative to the code above? Is there any simple way to do that? Thanks in advance.
Note: I have already checked out some Stack Overflow questions. However, no existing answer solves my problem.
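One NCCL-free fallback that is often suggested on Windows (a sketch only, keeping the same devices / dev_grads / self._grad_shapes structure as the snippet above and assuming TF 1.x) is to stage the per-device gradients on one GPU and sum them with tf.add_n instead of nccl.all_sum:

if len(devices) > 1:
    with tf.name_scope('SumAcrossGPUs'):
        for var_idx, grad_shape in enumerate(self._grad_shapes):
            g = [dev_grads[dev][var_idx][0] for dev in devices]
            if np.prod(grad_shape):
                # Reduce on the first device instead of using NCCL,
                # then hand the same summed tensor back to every device.
                with tf.device(devices[0]):
                    g_sum = tf.add_n(g)
                g = [g_sum] * len(devices)
            for dev, gg in zip(devices, g):
                dev_grads[dev][var_idx] = (gg, dev_grads[dev][var_idx][1])

This is slower than a true all-reduce because every gradient passes through one GPU, but it removes the NCCL dependency entirely.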

keras with tensorflow: a CUDA runtime call was likely performed without using a StreamExecutor context

I am using Keras with the TensorFlow backend, and the following is the problem. Is there anything that can solve it? Thanks!
The error is caused by an illegal value of CNMEM. According to the Theano docs, CNMEM can only be assigned a float:
0: not enabled.
0 < N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory).
> 1: use this number in megabytes (MB) of memory.
You can also refer to here.
The warning is due to a change in Theano (Keras's backend): it is moving from the old CUDA backend to the GpuArray backend. You can refer here for a solution.
Actually, if you fix the warning, the error will disappear as well, according to:
This value allocates GPU memory ONLY when using (CUDA backend) and has no effect when the GPU backend is (GpuArray Backend). For the new backend, please see config.gpuarray.preallocate
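For reference, a minimal sketch of the equivalent .theanorc settings under the new GpuArray backend (replacing device=gpu and lib.cnmem from the old CUDA backend; the 0.8 fraction simply mirrors the old cnmem value):

[global]
floatX = float32
device = cuda

[gpuarray]
preallocate = 0.8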

tensorflow distributed training w/ estimator + experiment framework

Hi, I have a weird situation when trying to use the Estimator + Experiment classes for distributed training.
Here's an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829
This is a simple case that uses
DNNClassifier from TF official tutorial
Experiment framework
1 worker and 1 ps on the same host with different ports.
What happens is
1) when I start the ps job, it looks good:
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000
2) when I start the worker job, it silently exits, leaving no log at all.
Eagerly seeking help.
I had the same problem and finally found the solution.
The problem is in config._environment:
config = {"cluster": {'ps': ['127.0.0.1:9000'],
'worker': ['127.0.0.1:9001']}}
if args.type == "worker":
config["task"] = {'type': 'worker', 'index': 0}
else:
config["task"] = {'type': 'ps', 'index': 0}
os.environ['TF_CONFIG'] = json.dumps(config)
config = run_config.RunConfig()
config._environment = run_config.Environment.CLOUD
Set config._environment to Environment.CLOUD.
Then you can run distributed training.
I hope it makes you happy :)
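The snippet above assumes an args.type command-line flag; a minimal sketch of how it might be parsed (this part is hypothetical, not from the original answer):

import argparse

# "--type ps" starts the parameter server process, "--type worker" the worker.
parser = argparse.ArgumentParser()
parser.add_argument("--type", choices=["ps", "worker"], default="worker")
args = parser.parse_args()

You then launch the same script twice on the host, once with --type ps and once with --type worker, so each process writes its own TF_CONFIG before building the RunConfig.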
I have the same issue; I guess it's due to some internal TensorFlow code. I've already opened a question on SO for this: TensorFlow: minimalist program fails on distributed mode.
I also opened an issue: https://github.com/tensorflow/tensorflow/issues/8796.
There are two options to solve your issue. As this is due to your ClusterSpec having an implicit local environment, you could try setting another one (either google or cloud), but I cannot assure you that the rest of your work won't be impacted. So I preferred to have a glance at the code and try to fix it myself for local mode, which is what I explain below.
You'll find more precise explanations of why it fails in those posts; the fact is that Google has been pretty silent so far, so what I did was patch their source code (in tensorflow/contrib/learn/python/learn/experiment.py):
# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
    config.environment != run_config.Environment.GOOGLE and
    config.cluster_spec and config.master):
  self._start_server()
(This part prevents the server from starting in local mode, which is your case if you set no environment in your cluster spec, so you should simply comment out the config.environment != run_config.Environment.LOCAL and condition, and that should work.)
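After commenting out the LOCAL check as described, the patched condition would look roughly like this (a sketch of the edit, not the upstream code):

config = self._estimator.config
if (config.environment != run_config.Environment.GOOGLE and
    config.cluster_spec and config.master):
  self._start_server()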

Tensorflow: dynamically call GPUs with enough free memory

My desktop has two GPUs, which can run TensorFlow with the specification /gpu:0 or /gpu:1. However, if I don't specify which GPU to run the code on, TensorFlow will use /gpu:0 by default, as we all know.
Now I would like to set up the system so that it assigns GPUs dynamically according to the free memory of each GPU. For example, if a script doesn't specify which GPU to run the code on, the system first assigns /gpu:0 to it; then, if another script runs, it checks whether /gpu:0 has enough free memory. If yes, it continues to assign /gpu:0; otherwise it assigns /gpu:1. How can I achieve this?
Follow-ups:
I believe the question above may be related to the problem of GPU virtualization. That is, if I could virtualize the multiple GPUs in a desktop into one GPU, I would get what I want. So besides any setup method for TensorFlow, any ideas about virtualization are also welcome.
TensorFlow generally assumes it's not sharing the GPU with any other process, so I don't see a way of doing it from inside TensorFlow. However, you could do it from outside as follows: a shell script that calls nvidia-smi, parses out the GPU k with the most free memory, then sets CUDA_VISIBLE_DEVICES=k and calls your TensorFlow script.
Inspired by:
How to set specific gpu in tensorflow?
import os
import subprocess as sp


def leave_gpu_with_most_free_ram():
    try:
        command = "nvidia-smi --query-gpu=memory.free --format=csv"
        # Skip the csv header line; each remaining line is "<free> MiB" for one GPU.
        output = sp.check_output(command.split()).decode('ascii')
        memory_free_info = output.strip().split('\n')[1:]
        memory_free_values = [int(x.split()[0]) for x in memory_free_info]
        least_busy_idx = memory_free_values.index(max(memory_free_values))
        # Update the CUDA variable so only the least busy GPU stays visible.
        setting = str(least_busy_idx)
        os.environ["CUDA_VISIBLE_DEVICES"] = setting
        print('Left GPU [%s] unmasked (free memory per GPU: %s)'
              % (setting, memory_free_values))
    except FileNotFoundError as e:
        print('"nvidia-smi" is probably not installed. GPUs are not masked')
        print(e)
    except sp.CalledProcessError as e:
        print("Error on GPU masking:\n", e.output)
Add a call to this function before importing tensorflow
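A short usage sketch (assuming the function above is defined, or imported, in the same script):

# Mask the busier GPU before TensorFlow initializes and grabs a device.
leave_gpu_with_most_free_ram()

import tensorflow as tf  # now only sees the GPU with the most free memory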