gRPC doesn't work in distributed TensorFlow

I'm running a distributed TensorFlow script. When I create the cluster server, messages like the following appear in the console:
E0805 20:51:03.294260965 3387 ev_epoll1_linux.c:1051] grpc epoll fd: 3
2017-08-05 20:51:03.299766: I tensorflow/core/distributed_runtime/rpc/] Initialize GrpcChannelCache for job ps -> {0 -> localhost:2222}
2017-08-05 20:51:03.299790: I tensorflow/core/distributed_runtime/rpc/] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2223}
2017-08-05 20:51:03.305220: I tensorflow/core/distributed_runtime/rpc/] Started server with target: grpc://localhost:2223
During training I see the same message and get no other response:
E0805 20:52:45.889979901 3387 ev_epoll1_linux.c:1051] grpc epoll fd: 3
The message is printed from the line with tf.Session("grpc://localhost:2223") as sess:
TensorFlow version: 1.3.0-rc0, compiled with Bazel; it works well on a single machine.
Linux version:
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
Active Internet connections:
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0* LISTEN 8321/python
tcp 0 0* LISTEN 8883/python
Here is the sample code that creates the cluster server:
def main(_):
server = tf.train.Server(cluster,
if __name__ == "__main__":
and the training code:
train_X = np.random.rand(100).astype(np.float32)
train_Y = train_X * 0.1 + 0.3
with tf.device("/job:worker/task:0"):
    X = tf.placeholder(tf.float32)
    Y = tf.placeholder(tf.float32)
    w = tf.Variable(0.0)
    b = tf.Variable(0.0)
    y = w * X + b
    loss = tf.reduce_mean(tf.square(y - Y))
    init_op = tf.global_variables_initializer()
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
with tf.Session("grpc://localhost:2223") as sess:
    for i in range(500):
        sess.run(train_op, feed_dict={X: train_Y, Y: train_Y})
        print("after train")
        if i % 50 == 0:
            print(i)
Does anyone know how to fix it? Thanks.
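Before digging into gRPC itself, it can help to confirm that both ports from the cluster spec accept TCP connections. The following is a minimal sketch using only the Python standard library; port_is_open is a hypothetical helper name, and the check says nothing about whether the epoll message is an actual error.

```python
import socket

def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, or timed out.
        return False

# Ports taken from the log lines in the question (ps on 2222, worker on 2223).
for port in (2222, 2223):
    print(port, port_is_open("localhost", port))
```

If either port reports False while the server processes are running, the problem is more likely in how the servers were started than in the training script.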


Don't see any transfers on NVLINK with NCCL all_sum test

With the following code (uses tensorflow.contrib.nccl.all_sum), I expected to see bytes being transferred over NVLINK.
In reality, I don't.
from tensorflow.contrib.nccl import all_sum
with tf.device('/gpu:0'):
    a = tf.get_variable(
        "a", initializer=tf.constant(1.0, shape=(args.dim, args.dim)))
with tf.device('/gpu:1'):
    b = tf.get_variable(
        "b", initializer=tf.constant(2.0, shape=(args.dim, args.dim)))
with tf.device('/gpu:0'):
    summed_node = all_sum([a, b])
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
init = tf.global_variables_initializer()
with tf.device('/gpu:0'):
summed =
My machine is an AWS p3.8xlarge instance, which as far as I understand supports NVLINK.
The code runs fine, but nvidia-smi nvlink -g 0 -i 0 reports link Tx/Rx counts of zero.
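For reference, this is what an all-reduce sum is expected to produce, independent of how the bytes travel between devices. The sketch below uses plain Python lists in place of GPU tensors; all_reduce_sum is a hypothetical stand-in, not the NCCL or TensorFlow implementation, and it assumes (as I understand the op) that every participant gets its own copy of the result.

```python
def all_reduce_sum(per_device_tensors):
    """Return one summed copy per input, mimicking all-reduce semantics."""
    if not per_device_tensors:
        return []
    length = len(per_device_tensors[0])
    if any(len(t) != length for t in per_device_tensors):
        raise ValueError("all inputs must have the same shape")
    total = [sum(values) for values in zip(*per_device_tensors)]
    # Every participant receives its own copy of the reduced result.
    return [list(total) for _ in per_device_tensors]

a = [1.0, 1.0, 1.0]  # stands in for the tensor on /gpu:0
b = [2.0, 2.0, 2.0]  # stands in for the tensor on /gpu:1
summed = all_reduce_sum([a, b])
print(summed)
```

Since the result exists once per device, evaluating only one of the outputs may be worth ruling out when the link counters stay at zero.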

Tensorflow can't detect GPU when invoked by Ray worker

When I try the following code sample for using TensorFlow with Ray, TensorFlow fails to detect the GPUs on my machine when invoked by the "remote" worker, but it does find them when invoked "locally". I put "remote" and "locally" in scare quotes because everything runs on my desktop, which has two GPUs and runs Ubuntu 16.04. I installed TensorFlow using the tensorflow-gpu Anaconda package.
The local_network seems to be responsible for these messages in the logs:
2018-01-26 17:24:33.149634: I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M5000, pci bus id: 0000:03:00.0)
2018-01-26 17:24:33.149642: I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Quadro M5000, pci bus id: 0000:04:00.0)
And the remote_network seems to be responsible for this message:
2018-01-26 17:24:34.309270: E tensorflow/stream_executor/cuda/] failed call to cuInit: CUDA_ERROR_NO_DEVICE
Why is Tensorflow able to detect the GPU in one case but not the other?
import tensorflow as tf
import numpy as np
import ray
class Network(object):
    def __init__(self, x, y):
        # Seed TensorFlow to make the script deterministic.
        # Define the inputs.
        x_data = tf.constant(x, dtype=tf.float32)
        y_data = tf.constant(y, dtype=tf.float32)
        # Define the weights and computation.
        w = tf.Variable(tf.random_uniform([1], -1.0, 1.0))
        b = tf.Variable(tf.zeros([1]))
        y = w * x_data + b
        # Define the loss.
        self.loss = tf.reduce_mean(tf.square(y - y_data))
        optimizer = tf.train.GradientDescentOptimizer(0.5)
        self.grads = optimizer.compute_gradients(self.loss)
        self.train = optimizer.apply_gradients(self.grads)
        # Define the weight initializer and session.
        init = tf.global_variables_initializer()
        self.sess = tf.Session()
        # Additional code for setting and getting the weights
        self.variables = ray.experimental.TensorFlowVariables(self.loss, self.sess)
        # Return all of the data needed to use the network.
    # Define a remote function that trains the network for one step and returns the
    # new weights.
    def step(self, weights):
        # Set the weights in the network.
        # Do one step of training. We only need the actual gradients so we filter over the list.
        actual_grads = self.sess.run([grad[0] for grad in self.grads])
        return actual_grads
    def get_weights(self):
        return self.variables.get_weights()
# Define a remote function for generating fake data.
def generate_fake_x_y_data(num_data, seed=0):
    # Seed numpy to make the script deterministic.
    x = np.random.rand(num_data)
    y = x * 0.1 + 0.3
    return x, y
# Generate some training data.
batch_ids = [generate_fake_x_y_data.remote(BATCH_SIZE, seed=i) for i in range(NUM_BATCHES)]
x_ids = [x_id for x_id, y_id in batch_ids]
y_ids = [y_id for x_id, y_id in batch_ids]
# Generate some test data.
x_test, y_test = ray.get(generate_fake_x_y_data.remote(BATCH_SIZE, seed=NUM_BATCHES))
# Create actors to store the networks.
remote_network = ray.remote(Network)
actor_list = [remote_network.remote(x_ids[i], y_ids[i]) for i in range(NUM_BATCHES)]
local_network = Network(x_test, y_test)
# Get initial weights of local network.
weights = local_network.get_weights()
# Do some steps of training.
for iteration in range(NUM_ITERS):
    # Put the weights in the object store. This is optional. We could instead pass
    # the variable weights directly into step.remote, in which case it would be
    # placed in the object store under the hood. However, in that case multiple
    # copies of the weights would be put in the object store, so this approach is
    # more efficient.
    weights_id = ray.put(weights)
    # Call the remote function multiple times in parallel.
    gradients_ids = [actor.step.remote(weights_id) for actor in actor_list]
    # Get all of the gradients.
    gradients_list = ray.get(gradients_ids)
    # Take the mean of the different gradients. Each element of gradients_list is a list
    # of gradients, and we want to take the mean of each one.
    mean_grads = [sum([gradients[i] for gradients in gradients_list]) / len(gradients_list) for i in range(len(gradients_list[0]))]
    feed_dict = {grad[0]: mean_grad for (grad, mean_grad) in zip(local_network.grads, mean_grads)}
    local_network.sess.run(local_network.train, feed_dict=feed_dict)
    weights = local_network.get_weights()
    # Print the current weights. They should converge roughly to the values 0.1
    # and 0.3 used in generate_fake_x_y_data.
    if iteration % 20 == 0:
        print("Iteration {}: weights are {}".format(iteration, weights))
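The averaging step in the loop above is easy to misread as a one-liner, so here is the same computation as a small standalone function. It is a sketch using plain floats in place of NumPy arrays; mean_gradients is a hypothetical name, and the sample values are made up for illustration.

```python
def mean_gradients(gradients_list):
    """Average gradients position by position across workers.

    gradients_list[w][i] is the gradient of variable i reported by worker w.
    """
    num_workers = len(gradients_list)
    num_variables = len(gradients_list[0])
    return [
        sum(worker_grads[i] for worker_grads in gradients_list) / num_workers
        for i in range(num_variables)
    ]

# Three workers, two variables each (illustrative values only).
gradients_list = [
    [1.0, -1.0],
    [2.0, -3.0],
    [3.0, -2.0],
]
print(mean_gradients(gradients_list))  # [2.0, -2.0]
```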
The GPUs are hidden from the worker by the ray.remote decorator itself. From its source code:
def remote(*args, **kwargs):
    num_cpus = kwargs["num_cpus"] if "num_cpus" in kwargs else 1
    num_gpus = kwargs["num_gpus"] if "num_gpus" in kwargs else 0 # !!!
So the following call effectively sets num_gpus=0:
remote_network = ray.remote(Network)
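The defaulting logic quoted above can be tried in isolation. The sketch below copies only those two lines into a hypothetical helper, resolve_resources; it is not part of Ray's API and exists only to show what happens when no keyword arguments are passed.

```python
def resolve_resources(**kwargs):
    """Mirror the defaulting logic quoted above (hypothetical helper, not Ray API)."""
    num_cpus = kwargs["num_cpus"] if "num_cpus" in kwargs else 1
    num_gpus = kwargs["num_gpus"] if "num_gpus" in kwargs else 0
    return {"num_cpus": num_cpus, "num_gpus": num_gpus}

print(resolve_resources())            # no GPUs requested
print(resolve_resources(num_gpus=2))  # GPUs requested explicitly
```

With no arguments the GPU count falls back to zero, which matches the CUDA_ERROR_NO_DEVICE message seen in the remote worker.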
The Ray API is a bit strange here: you can't simply say ray.remote(Network, num_gpus=2) (though that's exactly what you want). Here's what I did, and it seems to work on my machine:
@ray.remote(num_gpus=2)
class RemoteNetwork(Network):
    pass
actor_list = [RemoteNetwork.remote(x_ids[i], y_ids[i]) for i in range(NUM_BATCHES)]

Error when running a distributed TensorFlow example

I have three nodes for running distributed TensorFlow: two workers (one with a GPU, one without) and one ps (without a GPU). The code is below:
from __future__ import print_function
import tensorflow as tf
import sys
import time
# cluster specification
parameter_servers = [""]
workers = [ "",
cluster = tf.train.ClusterSpec({"ps":parameter_servers, "worker":workers})
# input flags
tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
FLAGS = tf.app.flags.FLAGS
# start a server for a specific task
server = tf.train.Server(cluster,
# config
batch_size = 100
learning_rate = 0.001
training_epochs = 20
logs_path = "/tmp/mnist/1"
# load mnist data set
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)
if FLAGS.job_name == "ps":
elif FLAGS.job_name == "worker":
    # Between-graph replication
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        # count the number of updates
        global_step = tf.get_variable('global_step', [],
                                      initializer = tf.constant_initializer(0),
                                      trainable = False)
        # input images
        with tf.name_scope('input'):
            # None -> batch size can be any size, 784 -> flattened mnist image
            x = tf.placeholder(tf.float32, shape=[None, 784], name="x-input")
            # target 10 output classes
            y_ = tf.placeholder(tf.float32, shape=[None, 10], name="y-input")
        # model parameters will change during training so we use tf.Variable
        with tf.name_scope("weights"):
            W1 = tf.Variable(tf.random_normal([784, 100]))
            W2 = tf.Variable(tf.random_normal([100, 10]))
        # bias
        with tf.name_scope("biases"):
            b1 = tf.Variable(tf.zeros([100]))
            b2 = tf.Variable(tf.zeros([10]))
        # implement model
        with tf.name_scope("softmax"):
            # y is our prediction
            z2 = tf.add(tf.matmul(x,W1),b1)
            a2 = tf.nn.sigmoid(z2)
            z3 = tf.add(tf.matmul(a2,W2),b2)
            y = tf.nn.softmax(z3)
        # specify cost function
        with tf.name_scope('cross_entropy'):
            # this is our cost
            cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
        # specify optimizer
        with tf.name_scope('train'):
            # optimizer is an "operation" which we can execute in a session
            grad_op = tf.train.GradientDescentOptimizer(learning_rate)
            train_op = grad_op.minimize(cross_entropy, global_step=global_step)
        with tf.name_scope('Accuracy'):
            # accuracy
            correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        # create a summary for our cost and accuracy
        tf.scalar_summary("cost", cross_entropy)
        tf.scalar_summary("accuracy", accuracy)
        # merge all summaries into a single "operation" which we can execute in a session
        summary_op = tf.merge_all_summaries()
        init_op = tf.initialize_all_variables()
        print("Variables initialized ...")
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
    begin_time = time.time()
    frequency = 100
    with sv.prepare_or_wait_for_session(server.target) as sess:
        # create log writer object (this will log on every machine)
        writer = tf.train.SummaryWriter(logs_path, graph=tf.get_default_graph())
        # perform training cycles
        start_time = time.time()
        for epoch in range(training_epochs):
            # number of batches in one epoch
            batch_count = int(mnist.train.num_examples/batch_size)
            count = 0
            for i in range(batch_count):
                batch_x, batch_y = mnist.train.next_batch(batch_size)
                # perform the operations we defined earlier on batch
                _, cost, summary, step = sess.run(
                    [train_op, cross_entropy, summary_op, global_step],
                    feed_dict={x: batch_x, y_: batch_y})
                writer.add_summary(summary, step)
                count += 1
                if count % frequency == 0 or i+1 == batch_count:
                    elapsed_time = time.time() - start_time
                    start_time = time.time()
                    print("Step: %d," % (step+1),
                          " Epoch: %2d," % (epoch+1),
                          " Batch: %3d of %3d," % (i+1, batch_count),
                          " Cost: %.4f," % cost,
                          " AvgTime: %3.2fms" % float(elapsed_time*1000/frequency))
                    count = 0
        print("Test-Accuracy: %2.2f" % sess.run(accuracy, feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
        print("Total Time: %3.2fs" % float(time.time() - begin_time))
        print("Final Cost: %.4f" % cost)
I run the above code on my three nodes with the following commands in the terminal:
pc-01$ python --job-name="ps" --task_index=0
pc-02$ python --job-name="worker" --task_index=0
pc-03$ python --job-name="worker" --task_index=1
However, after the variables are initialized, the worker terminal keeps printing:
I tensorflow/core/distributed_runtime/] CreateSession still waiting for response from worker:/job:worker/replica:0/task:0
and the ps terminal does not proceed.
The IP of the ps is, and the IPs of the workers are,, just as in the code above.
Can anyone help me?
I guess filtering devices should help here. Could you please try adding device_filters to your session config?
config = tf.ConfigProto(
    device_filters=["/job:ps", "/job:worker/task:%d" % FLAGS.task_index])
with sv.prepare_or_wait_for_session(server.target, config=config) as sess:
This should fix the issue.
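To make the effect of the filter list concrete, here is a small sketch that builds the same filters for a given task and checks which device names they would let through. Both function names are hypothetical, and is_visible is a simplified matching rule assumed for illustration; it is not TensorFlow's own matching code and it uses short device names rather than fully qualified ones.

```python
def device_filters_for(task_index):
    """Build the filter list used in the answer above for one worker task."""
    return ["/job:ps", "/job:worker/task:%d" % task_index]

def is_visible(device_name, filters):
    """Simplified check: a device matches a filter exactly or as a path prefix."""
    return any(
        device_name == f or device_name.startswith(f + "/")
        for f in filters
    )

filters = device_filters_for(1)
print(filters)
print(is_visible("/job:ps/task:0", filters))      # the parameter server
print(is_visible("/job:worker/task:1", filters))  # this worker
print(is_visible("/job:worker/task:0", filters))  # another worker
```

The point of the filter is that each worker's session only considers the ps and itself, so it no longer has to wait for the other worker to come up.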

Distributed Tensorflow: good example for synchronous training on CPUs

I am new to distributed TensorFlow and am looking for a good example of synchronous training on CPUs.
I have already tried the Distributed TensorFlow Example, and it performs asynchronous training successfully with 1 parameter server (1 machine with 1 CPU) and 3 workers (each worker = 1 machine with 1 CPU). However, I cannot get synchronous training to run correctly, although I have followed the tutorials for SyncReplicasOptimizer (V1.0 and V2.0).
I inserted the official SyncReplicasOptimizer code into the working asynchronous training example, but the training process is still asynchronous. My detailed code is as follows. All code related to synchronous training is inside the blocks delimited by ******.
import tensorflow as tf
import sys
import time
# cluster specification ----------------------------------------------------------------------
parameter_servers = [""]
workers = ["", "", ""]
cluster = tf.train.ClusterSpec({"ps":parameter_servers, "worker":workers})
# input flags
tf.app.flags.DEFINE_string("job_name", "", "Either 'ps' or 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
FLAGS = tf.app.flags.FLAGS
# start a server for a specific task
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
# Parameters ----------------------------------------------------------------------
N = 3 # number of replicas
learning_rate = 0.001
training_epochs = int(21/N)
batch_size = 100
# Network Parameters
n_input = 784 # MNIST data input (img shape: 28*28)
n_hidden_1 = 256 # 1st layer number of features
n_hidden_2 = 256 # 2nd layer number of features
n_classes = 10 # MNIST total classes (0-9 digits)
if FLAGS.job_name == "ps":
    print("--- Parameter Server Ready ---")
elif FLAGS.job_name == "worker":
    # Import MNIST data
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)
    # Between-graph replication
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        # count the number of updates
        global_step = tf.get_variable('global_step', [],
                                      initializer = tf.constant_initializer(0),
                                      trainable = False,
                                      dtype = tf.int32)
        # tf Graph input
        x = tf.placeholder("float", [None, n_input])
        y = tf.placeholder("float", [None, n_classes])
        # Create model
        def multilayer_perceptron(x, weights, biases):
            # Hidden layer with RELU activation
            layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
            layer_1 = tf.nn.relu(layer_1)
            # Hidden layer with RELU activation
            layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
            layer_2 = tf.nn.relu(layer_2)
            # Output layer with linear activation
            out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
            return out_layer
        # Store layers weight & bias
        weights = {
            'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),
            'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
            'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
        }
        biases = {
            'b1': tf.Variable(tf.random_normal([n_hidden_1])),
            'b2': tf.Variable(tf.random_normal([n_hidden_2])),
            'out': tf.Variable(tf.random_normal([n_classes]))
        }
        # Construct model
        pred = multilayer_perceptron(x, weights, biases)
        # Define loss and optimizer
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred, y))
        # ************************* SyncReplicasOpt Version 1.0 *****************************************************
        ''' This optimizer collects gradients from all replicas, "summing" them,
        then applying them to the variables in one shot, after which replicas can fetch the new variables and continue. '''
        # Create any optimizer to update the variables, say a simple SGD
        opt = tf.train.AdamOptimizer(learning_rate=learning_rate)
        # Wrap the optimizer with sync_replicas_optimizer with N replicas: at each step the optimizer collects N gradients before applying to variables.
        opt = tf.train.SyncReplicasOptimizer(opt, replicas_to_aggregate=N,
                                             replica_id=FLAGS.task_index, total_num_replicas=N)
        # Now you can call `minimize()` or `compute_gradients()` and `apply_gradients()` normally
        train = opt.minimize(cost, global_step=global_step)
        # You can now call get_init_tokens_op() and get_chief_queue_runner().
        # Note that get_init_tokens_op() must be called before creating session
        # because it modifies the graph.
        init_token_op = opt.get_init_tokens_op()
        chief_queue_runner = opt.get_chief_queue_runner()
        # **************************************************************************************
        # Test model
        correct = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
        accuracy = tf.reduce_mean(tf.cast(correct, "float"))
        # Initializing the variables
        init_op = tf.initialize_all_variables()
        print("---Variables initialized---")
    # **************************************************************************************
    is_chief = (FLAGS.task_index == 0)
    # Create a "supervisor", which oversees the training process.
    sv = tf.train.Supervisor(is_chief=is_chief,
    # **************************************************************************************
    with sv.prepare_or_wait_for_session(server.target) as sess:
        # **************************************************************************************
        # After the session is created by the Supervisor and before the main while loop:
        if is_chief:
            sv.start_queue_runners(sess, [chief_queue_runner])
            # Insert initial tokens to the queue.
        # **************************************************************************************
        # Statistics
        net_train_t = 0
        # Training
        for epoch in range(training_epochs):
            total_batch = int(mnist.train.num_examples/batch_size)
            # Loop over all batches
            for i in range(total_batch):
                batch_x, batch_y = mnist.train.next_batch(batch_size)
                # ======== net training time ========
                begin_t = time.time()
                sess.run(train, feed_dict={x: batch_x, y: batch_y})
                end_t = time.time()
                net_train_t += (end_t - begin_t)
                # ===================================
            # Calculate training accuracy
            # acc = sess.run(accuracy, feed_dict={x: mnist.train.images, y: mnist.train.labels})
            # print("Epoch:", '%04d' % (epoch+1), " Train Accuracy =", acc)
            print("Epoch:", '%04d' % (epoch+1))
        print("Training Finished!")
        print("Net Training Time: ", net_train_t, "second")
        # Testing
        print("Testing Accuracy = ", accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
Is anything wrong with my code? Or is there a good example I can follow?
I think your question is answered by the comments in TensorFlow issue #9596.
The problem is caused by bugs in the new version of tf.train.SyncReplicasOptimizer(). You can use the old version of this API to avoid it.
Another solution comes from the TensorFlow Distributed Benchmarks. Take a look at the source code and you will find that they synchronize workers manually through a TensorFlow queue. In my experiments, this benchmark runs exactly as you expect.
I hope these comments and resources help you solve your problem. Thanks!
I am not sure whether you would be interested in a user-transparent distributed TensorFlow that uses MPI in the backend. We have recently developed one such version with MaTEx:
With it, you would not need to write any SyncReplicasOptimizer code for distributed TensorFlow, since all the changes are abstracted from the user.
Hope this helps.
One issue is that you need to specify an aggregation_method in the minimize method for it to run synchronously:
train = opt.minimize(cost, global_step=global_step, aggregation_method=tf.AggregationMethod.ADD_N)
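To see what "synchronous" should mean in practice, it helps to look at the update rule without any TensorFlow machinery. The following is a conceptual sketch in plain Python, not SyncReplicasOptimizer's implementation: sync_sgd_step is a hypothetical name, plain SGD replaces Adam for brevity, and averaging the collected gradients is an assumption made for this example.

```python
def sync_sgd_step(weight, worker_gradients, learning_rate, replicas_to_aggregate):
    """Apply one update only once enough gradients have been collected.

    Returns (new_weight, applied). Until replicas_to_aggregate gradients are
    available, the weight stays unchanged and applied is False.
    """
    if len(worker_gradients) < replicas_to_aggregate:
        # Not enough replicas have reported yet, so nothing is applied.
        return weight, False
    used = worker_gradients[:replicas_to_aggregate]
    mean_grad = sum(used) / len(used)
    return weight - learning_rate * mean_grad, True

# Only two of three workers have reported: no update yet.
print(sync_sgd_step(1.0, [2.0, 4.0], 0.5, 3))       # (1.0, False)
# All three have reported: one update with the averaged gradient.
print(sync_sgd_step(1.0, [2.0, 4.0, 6.0], 0.5, 3))  # (-1.0, True)
```

In an asynchronous run, by contrast, each worker's gradient would be applied as soon as it arrives, so the global step would advance once per worker batch rather than once per N batches. Comparing the global step against the number of batches processed is one way to tell which mode a job is actually in.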

tensorflow Unable to find a suitable algorithm for doing forward convolution

When running the PDE example on the TensorFlow website
# Import libraries for simulation
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
def make_kernel(a):
    """Transform a 2D array into a convolution kernel"""
    a = np.asarray(a)
    a = a.reshape(list(a.shape) + [1,1])
    return tf.constant(a, dtype=1)
def simple_conv(x, k):
    """A simplified 2D convolution operation"""
    x = tf.expand_dims(tf.expand_dims(x, 0), -1)
    y = tf.nn.depthwise_conv2d(x, k, [1, 1, 1, 1], padding='SAME')
    return y[0, :, :, 0]
def laplace(x):
    """Compute the 2D laplacian of an array"""
    laplace_k = make_kernel([[0.5, 1.0, 0.5],
                             [1.0, -6., 1.0],
                             [0.5, 1.0, 0.5]])
    return simple_conv(x, laplace_k)
# Initial Conditions -- some rain drops hit a pond
N = 500
# Set everything to zero
u_init = np.zeros([N, N], dtype=np.float32)
ut_init = np.zeros([N, N], dtype=np.float32)
# Some rain drops hit a pond at random points
for n in range(40):
    a,b = np.random.randint(0, N, 2)
    u_init[a,b] = np.random.uniform()
# Parameters:
# eps -- time resolution
# damping -- wave damping
eps = tf.placeholder(tf.float32, shape=())
damping = tf.placeholder(tf.float32, shape=())
# Create variables for simulation state
U = tf.Variable(u_init)
Ut = tf.Variable(ut_init)
# Discretized PDE update rules
U_ = U + eps * Ut
Ut_ = Ut + eps * (laplace(U) - damping * Ut)
# Operation to update the state
step = tf.group(
    U.assign(U_),
    Ut.assign(Ut_))
# Initialize state to initial conditions
# Run 1000 steps of PDE
nsteps = 1000
for i in range(nsteps):
    # Step simulation
    step.run({eps: 0.03, damping: 0.04})
    # Visualize every 50 steps
    if i % 50 == 0:
        print("iter = %d, max(U) = %f, min(U) = %f" % \
              (i, np.max(U.eval()), np.min(U.eval())))
on the GPU of my local machine, I get the following error in the loop at step.run({eps: 0.03, damping: 0.04}):
I tensorflow/core/common_runtime/gpu/] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 750 Ti, pci bus id: 0000:01:00.0)
F tensorflow/stream_executor/cuda/] Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) Unable to find a suitable algorithm for doing forward convolution
Aborted (core dumped)
When I run the code on the CPU using with tf.device('/cpu:0'): it works fine. I have also run other examples on the GPU without problems.
Is this a feature they have yet to implement? Or did I make a mistake somewhere?
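For comparison, the stencil that laplace applies can be evaluated without cuDNN or TensorFlow at all, which gives a reference result to check against once the GPU path works. This is a pure-Python sketch on nested lists; the zero padding at the borders is my reading of padding='SAME', and the function is a stand-in, not the tutorial's code.

```python
LAPLACE_KERNEL = [[0.5, 1.0, 0.5],
                  [1.0, -6.0, 1.0],
                  [0.5, 1.0, 0.5]]

def laplace(grid):
    """Apply the 3x3 stencil to a 2D list, treating out-of-range cells as zero."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += LAPLACE_KERNEL[dr + 1][dc + 1] * grid[rr][cc]
            out[r][c] = total
    return out

# A flat surface has zero Laplacian away from the borders.
flat = [[1.0] * 5 for _ in range(5)]
print(laplace(flat)[2][2])  # 0.0

# A single "rain drop" in the middle reproduces the kernel itself.
drop = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
print(laplace(drop))
```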
System information:
Operating System: Ubuntu 14.04 LTS
Graphics card: GeForce GTX 750 Ti
Installed version of CUDA and cuDNN: CUDA 7.5, cuDNN v5
I installed the source by pulling from GitHub. More information on the GitHub issue tracker:
(1) TensorFlow requirements (please refer to the TensorFlow manual)
The TensorFlow Python API supports Python 2.7 and Python 3.3+.
The GPU version (Linux only) works best with CUDA Toolkit 7.5 and cuDNN v4. Other versions are supported (CUDA Toolkit >= 7.0 and cuDNN 6.5 (v2), 7.0 (v3), v5) only when installing from sources.
(2) What to do
(2-1) Remove cuDNN v5
(2-2) Install cuDNN v4 and set it up
(2-3-1) Uninstall TensorFlow
(2-3-2) Install (GPU) TensorFlow