tensorflow distributed training w/ estimator + experiment framework

Hi, I have a weird situation when trying to use the Estimator + Experiment classes for distributed training.
Here's an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829
This is a simple case that uses:
- DNNClassifier from the TF official tutorial
- the Experiment framework
- 1 worker and 1 ps on the same host, on different ports.
What happens is:
1) When I start the ps job, it looks good:
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000
2) When I start the worker job, it silently exits, leaving no log at all.
Eagerly seeking help.

I had the same problem and I finally found the solution.
The problem is in config._environment:
config = {"cluster": {'ps': ['127.0.0.1:9000'],
'worker': ['127.0.0.1:9001']}}
if args.type == "worker":
config["task"] = {'type': 'worker', 'index': 0}
else:
config["task"] = {'type': 'ps', 'index': 0}
os.environ['TF_CONFIG'] = json.dumps(config)
config = run_config.RunConfig()
config._environment = run_config.Environment.CLOUD
Set config._environment to Environment.CLOUD.
Then you can have a distributed training system.
I hope it makes you happy :)

I have the same issue; I guess it's due to some internal TensorFlow code. I've already opened a question on SO for this: TensorFlow: minimalist program fails on distributed mode.
I also opened a pull request: https://github.com/tensorflow/tensorflow/issues/8796.
There are two options to solve your issue. As this is due to your ClusterSpec having an implicit local environment, you could try setting another one (either google or cloud), but I cannot assure you that the rest of your work won't be impacted. So I preferred to have a glance at the code and try to fix it myself for local mode, which is what I explain below.
You'll find more precise explanations of why it fails in those posts; the fact is that Google has been pretty silent so far, so what I did is patch their source code (in tensorflow/contrib/learn/python/learn/experiment.py):
# Start the server, if needed. It's important to start the server before
# we (optionally) sleep for the case where no device_filters are set.
# Otherwise, the servers will wait to connect to each other before starting
# to train. We might as well start as soon as we can.
config = self._estimator.config
if (config.environment != run_config.Environment.LOCAL and
    config.environment != run_config.Environment.GOOGLE and
    config.cluster_spec and config.master):
  self._start_server()
(This part prevents the server from starting in local mode, which is your case if you set no environment in your cluster spec, so you should simply comment out the "config.environment != run_config.Environment.LOCAL and" part, and that should work.)
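For reference, the patched block would then look something like the following sketch (the same code as above, with only the LOCAL check commented out):
# Sketch of the patched condition in experiment.py: with the LOCAL check
# commented out, the server is also started in local mode.
config = self._estimator.config
if (  # config.environment != run_config.Environment.LOCAL and
    config.environment != run_config.Environment.GOOGLE and
    config.cluster_spec and config.master):
  self._start_server()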

Related

the alternative for NCCL on window 10

So I am on Windows 10 and am using multiple GPUs to run the training of a machine learning model (a GAN); you can check the full code over here:
Here, I get to the point where I need to sum the gradients across the different GPU devices, as follows:
if len(devices) > 1:
    with tf.name_scope('SumAcrossGPUs'), tf.device(None):
        for var_idx, grad_shape in enumerate(self._grad_shapes):
            g = [dev_grads[dev][var_idx][0] for dev in devices]
            if np.prod(grad_shape):  # nccl does not support zero-sized tensors
                g = tf.contrib.nccl.all_sum(g)
            for dev, gg in zip(devices, g):
                dev_grads[dev][var_idx] = (gg, dev_grads[dev][var_idx][1])
Now in this part I get an error regarding NCCL, which I noticed is not supported on Windows (it needs Linux), therefore I am stuck here. What is the workaround here? How can I manage to use NCCL on Windows, or an alternative to the code above? Is there any simple way to do that? Thanks in advance.
Note: I have already checked out some Stack Overflow questions. However, no existing answer solves my problem.
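For illustration, the kind of alternative I am looking for would replace nccl.all_sum with something like the sketch below (reusing the devices, dev_grads and self._grad_shapes structures from the snippet above), though I am not sure it is the right approach:
import numpy as np
import tensorflow as tf

# Sketch: sum gradients across GPUs without NCCL by gathering them on the
# first device, summing with tf.add_n, and broadcasting the result back.
if len(devices) > 1:
    with tf.name_scope('SumAcrossGPUs'):
        for var_idx, grad_shape in enumerate(self._grad_shapes):
            g = [dev_grads[dev][var_idx][0] for dev in devices]
            if np.prod(grad_shape):  # skip zero-sized tensors, as above
                with tf.device(devices[0]):
                    g_sum = tf.add_n(g)  # implicit device-to-device copies
                g = [g_sum] * len(devices)
            for dev, gg in zip(devices, g):
                dev_grads[dev][var_idx] = (gg, dev_grads[dev][var_idx][1])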

Big difference in execution time for first and subsequent run of cupy functions

When I run cupy functions on cupy arrays, the first call of a function takes significantly longer than the second run, even if I run it on a different array the second time.
Why is this?
import cupy as cp
cp.__version__
# 7.5.0
A = cp.random.random((1024, 1024))
B = cp.random.random((1024, 1024))
from time import time
def test(func, *args):
    t = time()
    func(*args)
    print("{}".format(round(time() - t, 4)))
test(cp.fft.fft2, A)
test(cp.fft.fft2, B)
# 0.129
# 0.001
test(cp.matmul, A, A.T)
test(cp.matmul, B, B.T)
# 0.171
# 0.0
test(cp.linalg.inv, A)
test(cp.linalg.inv, B)
# 0.259
# 0.002
CuPy is just-in-time compiling the kernel under the hood the first time you use a function in a Python process, which takes a bit of time.
From the CuPy documentation:
CuPy uses on-the-fly kernel synthesis: when a kernel call is required,
it compiles a kernel code optimized for the shapes and dtypes of given
arguments, sends it to the GPU device, and executes the kernel. The
compiled code is cached to $(HOME)/.cupy/kernel_cache directory (this
cache path can be overwritten by setting the CUPY_CACHE_DIR
environment variable). It may make things slower at the first kernel
call, though this slow down will be resolved at the second execution.
CuPy also caches the kernel code sent to GPU device within the
process, which reduces the kernel transfer time on further calls.
As per the CuPy user guide:
Context Initialization:
It may take several seconds when calling a
CuPy function for the first time in a process. This is because the
CUDA driver creates a CUDA context during the first CUDA API call in
CUDA applications.
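In practice this means a benchmark should include a warm-up call, and should synchronize the device before reading the clock, since kernel launches are asynchronous. A minimal sketch (assuming the same A array as above):
import cupy as cp
from time import time

A = cp.random.random((1024, 1024))

# Warm-up call: triggers context creation and kernel compilation/caching.
cp.fft.fft2(A)
cp.cuda.Device().synchronize()

t = time()
cp.fft.fft2(A)
cp.cuda.Device().synchronize()  # wait for the GPU before stopping the timer
print(round(time() - t, 4))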

Tensorflow: Setting allow_growth to true does still allocate memory of all my GPUs

I have several GPUs but I only want to use one GPU for my training. I am using following options:
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
config.gpu_options.allow_growth = True
with tf.Session(config=config) as sess:
Despite setting / using all these options, all of my GPUs allocate memory and
#processes = #GPUs
How can I prevent this from happening?
Note
I do not want to set the devices manually and I do not want to set CUDA_VISIBLE_DEVICES, since I want TensorFlow to automatically find the best (i.e. an idle) GPU available.
When I try to start another run, it uses the same GPU that is already used by another TensorFlow process, even though there are several other free GPUs (apart from the memory allocated on them).
I am running tensorflow in a docker container: tensorflow/tensorflow:latest-devel-gpu-py
I had this problem myself. Setting config.gpu_options.allow_growth = True
did not do the trick, and all of the GPU memory was still consumed by TensorFlow.
The way around it is the undocumented environment variable TF_FORCE_GPU_ALLOW_GROWTH (I found it in
https://github.com/tensorflow/tensorflow/blob/3e21fe5faedab3a8258d344c8ad1cec2612a8aa8/tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc#L25)
Setting TF_FORCE_GPU_ALLOW_GROWTH=true works perfectly.
In the Python code, you can set
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'
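A minimal sketch of how this can be wired up (assuming the variable is set at the top of the script, before any session or GPU initialization):
import os

# Set early, before any session/GPU initialization, so the allocator picks it up.
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

import tensorflow as tf

config = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(config=config) as sess:
    pass  # training code goes here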
I can offer you a method mask_busy_gpus defined here: https://github.com/yselivonchyk/TensorFlow_DCIGN/blob/master/utils.py
Simplified version of the function:
import subprocess as sp
import os

def mask_unused_gpus(leave_unmasked=1):
    ACCEPTABLE_AVAILABLE_MEMORY = 1024
    COMMAND = "nvidia-smi --query-gpu=memory.free --format=csv"
    try:
        _output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
        memory_free_info = _output_to_list(sp.check_output(COMMAND.split()))[1:]
        memory_free_values = [int(x.split()[0]) for i, x in enumerate(memory_free_info)]
        available_gpus = [i for i, x in enumerate(memory_free_values) if x > ACCEPTABLE_AVAILABLE_MEMORY]
        if len(available_gpus) < leave_unmasked:
            raise ValueError('Found only %d usable GPUs in the system' % len(available_gpus))
        os.environ["CUDA_VISIBLE_DEVICES"] = ','.join(map(str, available_gpus[:leave_unmasked]))
    except Exception as e:
        print('"nvidia-smi" is probably not installed. GPUs are not masked', e)
Usage:
mask_unused_gpus()
with tf.Session()...
Prerequisites: nvidia-smi
With this script I was solving the following problem: on a multi-GPU cluster, use only a single (or an arbitrary) number of GPUs and let them be allocated automatically.
Shortcoming of the script: if you start multiple scripts at once, they might be assigned the same GPU, because the script relies on memory allocation, and memory allocation takes some seconds to kick in.

keras with tensorflow : a CUDA runtime call was likely performed without using a StreamExecutor context

I am using Keras with the TensorFlow backend, and the following is the problem. Is there anything that can solve this problem? Thanks!
The error is caused by an illegal value of CNMEM. According to the Theano docs, CNMEM can only be assigned as a float:
0: not enabled.
0 < N <= 1: use this fraction of the total GPU memory (clipped to .95 for driver memory).
> 1: use this number in megabytes (MB) of memory.
You can also refer to here.
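As an illustration, the value is usually passed through THEANO_FLAGS before Theano is imported (a sketch, assuming the old CUDA backend where lib.cnmem applies; for the new GpuArray backend the equivalent setting is gpuarray.preallocate, as mentioned below):
import os

# 0.8 means "use 80% of GPU memory"; a value > 1 would be read as megabytes.
os.environ['THEANO_FLAGS'] = 'device=gpu,floatX=float32,lib.cnmem=0.8'

import theano  # imported after the flags are set so they take effect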
The warning is due to a change in Theano (Keras's backend): it is switching from the CUDA backend to the GpuArray backend. You can refer to here for a solution.
Actually, if you fix the warning, the error will disappear as well, according to:
This value allocates GPU memory ONLY when using (CUDA backend) and has no effect when the GPU backend is (GpuArray Backend). For the new backend, please see config.gpuarray.preallocate

Tensorflow: dynamically call GPUs with enough free memory

My desktop has two GPUs which can run TensorFlow with the specification /gpu:0 or /gpu:1. However, if I don't specify which GPU to run the code on, TensorFlow will by default use /gpu:0, as we all know.
Now I would like to set up the system such that it assigns GPUs dynamically according to the free memory of each GPU. For example, if a script doesn't specify which GPU to run the code on, the system first assigns /gpu:0 to it; then, if another script runs, it will check whether /gpu:0 has enough free memory. If yes, it will continue to assign /gpu:0 to it; otherwise, it will assign /gpu:1 to it. How can I achieve this?
Follow-up:
I believe the question above may be related to GPU virtualization. That is to say, if I could virtualize the multiple GPUs in a desktop into one GPU, I could get what I want. So besides any setup methods for TensorFlow, any ideas about virtualization are also welcome.
TensorFlow generally assumes it's not sharing the GPU with anyone, so I don't see a way of doing it from inside TensorFlow. However, you could do it from outside as follows: a shell script that calls nvidia-smi, parses out the GPU k with the most free memory, then sets "CUDA_VISIBLE_DEVICES=k" and calls the TensorFlow script.
Inspired by:
How to set specific gpu in tensorflow?
import os
import subprocess as sp

def leave_gpu_with_most_free_ram():
    try:
        command = "nvidia-smi --query-gpu=memory.free --format=csv"
        _output_to_list = lambda x: x.decode('ascii').split('\n')[:-1]
        memory_free_info = _output_to_list(sp.check_output(command.split()))[1:]
        memory_free_values = [int(x.split()[0]) for x in memory_free_info]
        least_busy_idx = memory_free_values.index(max(memory_free_values))
        # update CUDA variable so only the least busy GPU stays visible
        gpus = [least_busy_idx]
        setting = ','.join(map(str, gpus))
        os.environ["CUDA_VISIBLE_DEVICES"] = setting
        print('Left GPU(s) [%s] unmasked (out of %d available)'
              % (setting, len(memory_free_values)))
    except FileNotFoundError as e:
        print('"nvidia-smi" is probably not installed. GPUs are not masked')
        print(e)
    except sp.CalledProcessError as e:
        print("Error on GPU masking:\n", e.output)
Add a call to this function before importing tensorflow.
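A minimal usage sketch (assuming the function above is defined in the same script):
leave_gpu_with_most_free_ram()  # masks all but the GPU with the most free memory

import tensorflow as tf  # imported after masking so it only sees the chosen GPU

with tf.Session() as sess:
    pass  # training code goes here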