the alternative for NCCL on window 10

the alternative for NCCL on window 10 - tensorflow

So I am on windows 10 and am using multiple GPUs now in order to run the training of some machine learning model and this model is about GAN algorithm you can check the full code over here :
Here, I get to the point where there is need to reduce the sum from different GPU devices as following:
if len(devices) > 1:
with tf.name_scope('SumAcrossGPUs'), tf.device(None):
for var_idx, grad_shape in enumerate(self._grad_shapes):
g = [dev_grads[dev][var_idx][0] for dev in devices]
if np.prod(grad_shape): # nccl does not support zero-sized tensors
g = tf.contrib.nccl.all_sum(g)
for dev, gg in zip(devices, g):
dev_grads[dev][var_idx] = (gg, dev_grads[dev][var_idx][1])
Now in this part I get an error regarding NCCL, which I noticed that is not supported on windows it needs linux, therefore I am stuck here...what is the "work around solution" here??..how can I manage to use NCCL on windows or an alternative to the code above..is there any simple way to do that?...thanks in advance.
Note: I have checked out some stackoverflow issues already. However, no answer exist which can solve my problem.

Related

np.float32 floating point differences between intel MacBook and M1

I have recently upgraded my Intel MacBook Pro 13" to a MacBook Pro 14" with M1 Pro. Been working hard on getting my software to compile and work again. No big issues fortunately, except for floating point problems in some obscure fortran code and in python. With regard to python/numpy I have the following question.
I have a large code base bur for simplicity will use this simple function that converts flight level to pressure to show the issue.
def fl2pres(FL):
P0=101325
T0=288.15
T1=216.65
g=9.80665
R=287.0528742
GAMMA=0.0065
P11=P0*np.exp(-g/GAMMA/R*np.log(T0/T1))
h=FL*30.48
return np.where(h<=11000, \
P0*np.exp(-g/GAMMA/R*np.log((T0/(T0-GAMMA*h) ))),\
P11*np.exp(-g/R/T1*(h-11000)) )
When I run the code on my M1 Pro, I get:
In [2]: fl2pres(np.float64([400, 200]))
Out[3]: array([18753.90334892, 46563.239766 ])
and;
In [3]: fl2pres(np.float32([400, 200]))
Out[3]: array([18753.90234375, 46563.25080916])
Doing the same on my older Intel MacBook Pro I get:
In [2]: fl2pres(np.float64([400, 200]))
Out[2]: array([18753.90334892, 46563.239766 ])
and;
In [3]: fl2pres(np.float32([400, 200]))
Out[3]: array([18753.904296888, 46563.24778944])
The float64 calculations match but the float32 do not. We use float32 quite a lot throughout our code for memory optimisation. I understand that due to architectural differences this sort of floating point errors can occur but was wondering whether a simple fix was possible as currently some unit-tests fail. I could include the architecture in these tests but am hoping for an easier solution?
Converting all inputs to float64 makes my unit-tests pass and hence fixes this issue but sine we have quite some large arrays and dataframes, the impact on memory is unwanted.
Both laptops run python 3.9.10 installed through homebrew, pandas 1.4.1 and numpy 1.22.3 (installed to map against accelerate and blas).
EDIT
I have changes the function to print intermediate values to see where changes occur:
def fl2pres(FL):
P0=101325
T0=288.15
T1=216.65
g=9.80665
R=287.0528742
GAMMA=0.0065
P11=P0*np.exp(-g/GAMMA/R*np.log(T0/T1))
h=FL*30.48
A = np.log((T0/(T0-GAMMA*h)))
B = np.exp(-g/GAMMA/R*A)
C = np.exp(-g/R/T1*(h-11000))
print(f"P11:{P11}, h:{h}, A:{A}, B:{B}, C:{C}")
return np.where(h<=11000, P0*B, P11*C)
Running this function with the same input as above for the float32 case, I get on M1 Pro:
P11:22632.040591374975, h:[12192. 6096.], A:[0.32161594 0.14793371], B:[0.1844504 0.45954345], C:[0.82864394 2.16691503]
array([18753.90334892, 46563.239766 ])
On Intel:
P11:22632.040591374975, h:[12192. 6096.], A:[0.32161596 0.14793368], B:[0.18445034 0.45954353], C:[0.828644 2.166915]
array([18753.90429688, 46563.24778944])

As per the issue I created at numpy's GitHub:
the differences you are experiencing seem to be all within a single
"ULP" (unit in the last place), maybe 2? For special math functions,
like exp, or sin, small errors are unfortunately expected and can be
system dependend (both hardware and OS/math libraries).
One thing that could be would might have a slightly larger effect
could be use of SVML that NumPy has on newer machines (i.e. only on
the intel one). That can be disabled at build time using
NPY_DISABLE_SVML=1 as an environment variable, but I don't think you
can disable its use without building NumPy. (However, right now, it
may well be that the M1 machine is the less precise one, or that they
are both roughly the same, just different)
I haven't tried compiling numpy using NPY_DISABLE_SVML=1 and my plan now is to use a docker container that can run on all my platforms and use a single "truth" for my tests.

GPU cannot be indentified to run module

I have 2 GPU machine available to use with ID 2 and 3, and would like to use them all to fit model. Here is my code,
os.environ['CUDA_VISIBLE_DEVICES'] = '2, 3'
with tf,device('/gpu:2'):
critic_model.fit(x,y,epochs =10)
with tf.device('/gpu:3'):
history = model.fit(x,y,epochs=19)
However, when I check nvidia-smi, I found only machine 2 is utilized, I wonder why ?
Any idea could be helpful !

Multiple problems here:
For one, Tensorflow has "its own" GPU numbering independent from the IDs on your machine. So when you pass CUDA_VISIBLE_DEVICES=2,3, Tensorflow will see those two GPUs, but they will be '/gpu:0' and '/gpu:1' in the program. Since neither '/gpu:2' nor '/gpu:3' exist, I suspect that all ops are simply put on '/gpu:0' or the CPU.
However, the main problem is that this is not how you use with tf.device at all. You need to wrap the model creation into the context manager. I.e. all the op calls such as tf.nn.conv2d, tf.matmul etc. need to be wrapped. At the point you call model.fit, the ops have already been created (and put on '/gpu:0' by default) and your with tf.device statement does nothing.

tensorflow distributed training w/ estimator + experiment framework

Hi I have a wield situation when trying to use estimator + experiment class for distributed training.
Here's an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829
This is a simple case that uses
DNNClassifier from TF official tutorial
Experiment framework
1 worker and 1 ps on the same host with different ports.
What happens is
1) when I start ps job, it looks good:
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000
2) when I start worker job, the job silently exits, leaving no log at all.
Eagerly seeking help.

I have same problem and I finally get the solution.
The problem is in config._environment
config = {"cluster": {'ps': ['127.0.0.1:9000'],
'worker': ['127.0.0.1:9001']}}
if args.type == "worker":
config["task"] = {'type': 'worker', 'index': 0}
else:
config["task"] = {'type': 'ps', 'index': 0}
os.environ['TF_CONFIG'] = json.dumps(config)
config = run_config.RunConfig()
config._environment = run_config.Environment.CLOUD
Set config._environment as Environment.CLOUD.
Then you can have distributed training system.
I hope it makes you happy :)

I have the same issue, it's due to some internal tensorflow code I guess, I've opened a question on SO already for this: TensorFlow: minimalist program fails on distributed mode.
I also opened a pull request: https://github.com/tensorflow/tensorflow/issues/8796.
There are two options to solve your issue. As this is due to your ClusterSpec having implicit local environment, you could try set another one (either google or cloud), but I cannot assure you that the rest of your work won't be impacted. So I prefered to have a glance at the code and try fix it myself for local mode, which is why I explain bellow.
You'll see explanations of why it fails in those posts more precisely, the fact is Google has been pretty silent so far so what I did is that I patched their source code (in tensorflow/contrib/learn/python/learn/experiment.py):
# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
config.environment != run_config.Environment.GOOGLE and
config.cluster_spec and config.master):
self._start_server()
(this part prevents server from starting in local mode, which is yours if you set none in your cluster spec, so you should simply comment config.environment != run_config.Environment.LOCAL and and that should work).

Using multiple gpus on windows using theano,keras

I am a beginner in deep learning/theano/keras.I'm trying to figure out how to use multiple gpus on windows 7. I've had success installing Theano,keras(as described in this post How do I install Keras and Theano in Anaconda Python on Windows?) and using one gpu. I want to use both my gpus
Following are the details of configs and versions
Python - 2.7(Anaconda-4.3.14,Windows-64bit)
,CUDA - 7.5.17
,Theano - 0.9.0rc3
,keras - 1.2.2
,pycuda - 2016.1.2+cuda7518
,gpu - Geforce GTX 480(2 of them)
Theano configuration is as below
.theanorc.txt
[global]
floatX = float32
device = gpu
[nvcc]
flags=-LC:\ProgramData\Anaconda2\libs
compiler_bindir=C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin
[lib]
cnmem=0.8
Currently I'm able to use only one GPU and I am getting memory error as below when I try to fit the model
MemoryError: ('Error allocating 411041792 bytes of device memory (CNMEM_STATUS_OUT_OF_MEMORY).', "you might consider using 'theano.shared(..., borrow=True)'")
Does using 2 gpus solve the problem(if yes, how do I enable the second one?)
or is my model too big ?
Thank You

Tensorflow: dynamically call GPUs with enough free memory

My desktop has two gpus which can run Tensorflow with specification /gpu:0 or /gpu:1. However, if I don't specify which gpu to run the code, Tensorflow will by default to call /gpu:0, as we all know.
Now I would like to setup the system such that it can assign gpu dynamically according to the free memory of each gpu. For example, if a script doesn't specify which gpu to run the code, the system first assigns /gpu:0 for it; then if another script runs now, it will check whether /gpu:0 has enough free memory. If yes, it will continue assign /gpu:0 to it, otherwise it will assign /gpu:1 to it. How can I achieve it?
Follow-ups:
I believe the question above may be related to the virtualization problem of GPU. That is to say, if I can virtualize multi-gpu in a desktop into one GPU, I can get what I want. So beside any setup methods for Tensorflow, any ideas about virtualization is also welcome.

TensorFlow generally assumes it's not sharing GPU with anyone, so I don't see a way of doing it from inside TensorFlow. However, you could do it from outside as follows -- shell script that calls nvidia-smi, parses out GPU k with more memory, then sets "CUDA_VISIBLE_DEVICES=k" and calls TensorFlow script

Inspired by:
How to set specific gpu in tensorflow?
def leave_gpu_with_most_free_ram():
try:
command = "nvidia-smi --query-gpu=memory.free --format=csv"
memory_free_info = _output_to_list(sp.check_output(command.split()))[1:]
memory_free_values = [int(x.split()[0]) for i, x in enumerate(memory_free_info)]
least_busy_idx = memory_free_values.index(max(memory_free_values))
# update CUDA variable
gpus =[least_busy_idx]
setting = ','.join(map(str, gpus))
os.environ["CUDA_VISIBLE_DEVICES"] = setting
print('Left next %d GPU(s) unmasked: [%s] (from %s available)'
% (leave_unmasked, setting, str(available_gpus)))
except FileNotFoundError as e:
print('"nvidia-smi" is probably not installed. GPUs are not masked')
print(e)
except sp.CalledProcessError as e:
print("Error on GPU masking:\n", e.output)
Add a call to this function before importing tensorflow

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas