Distributing graphs across several machines in Distributed Tensorflow - tensorflow

I am currently working on a project using Distributed Tensorflow. My goal is to run several independent graphs across several different machines.
As an example, I want to do something like this (assume that a server is already running on each machine):
import tensorflow as tf
a = tf.constant(3)
b = tf.constant(2)
x = tf.mul(a,b) # To be run on "grpc://www.example0.com:2222"
y = tf.mul(a,b) # To be run on "grpc://www.example1.com:2222"
z = tf.mul(a,b) # To be run on "grpc://www.example2.com:2222"
with tf.Session() as sess:
    sess.run([x, y, z]) # Ops x, y, z are run on different machines in parallel
My current attempt is shown in the following code; however, it runs the sessions serially, whereas I want them to execute in parallel across the machines.
import tensorflow as tf
a = tf.constant(3)
b = tf.constant(2)
x = tf.mul(a,b) # To be run on "grpc://www.example0.com:2222"
y = tf.mul(a,b) # To be run on "grpc://www.example1.com:2222"
z = tf.mul(a,b) # To be run on "grpc://www.example2.com:2222"
with tf.Session("grpc://www.example0.com:2222") as sess:
sess.run(x)
with tf.Session("grpc://www.example1.com:2222") as sess:
sess.run(y)
with tf.Session("grpc://www.example2.com:2222") as sess:
sess.run(z)
While reading the Distributed TensorFlow documentation, I found that tf.device lets me set which CPU or GPU an op runs on. Is there something similar that lets me set the session target to specify which machine will run which op, or is there another way of distributing TensorFlow ops?

I'm currently struggling with this myself. The following is mostly cribbed from the tensorflow distributed how-to guide.
You can pin ops to a job/task using tf.device:
clusterspec = \
    { "worker":
        [ "www.example0.com:2222"
        , "www.example1.com:2222"
        , "www.example2.com:2222"
        ]
    , "master":
        [ "localhost:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)

a = tf.constant(3)
b = tf.constant(2)

# pin 'x' to www.example0.com
with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)

# pin 'y' to www.example1.com
with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)

with tf.Session(server.target) as sess:
    # run the ops
    print(sess.run([x, y]))
However, at least for me, this only works if all the worker processes are on the same machine as the master. Otherwise, it hangs at sess.run.
This turned out to be a problem with the use of localhost in the cluster specification. If you share the same cluster specification between servers, don't use localhost; instead, use the IP address or hostname of the computer that you think localhost refers to. In the case of the above example, suppose that you're running the master script on www.master.com. You have two options:
1. One clusterspec per server using localhost
On each server, localhost refers to the machine running the server.
# on www.example0.com
clusterspec = \
    { "worker":
        [ "localhost:2222"
        , "www.example1.com:2222"
        , "www.example2.com:2222"
        ]
    , "master":
        [ "www.master.com:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()

# on www.example1.com
clusterspec = \
    { "worker":
        [ "www.example0.com:2222"
        , "localhost:2222"
        , "www.example2.com:2222"
        ]
    , "master":
        [ "www.master.com:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=1)
server.join()

# on www.example2.com
clusterspec = \
    { "worker":
        [ "www.example0.com:2222"
        , "www.example1.com:2222"
        , "localhost:2222"
        ]
    , "master":
        [ "www.master.com:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=2)
server.join()

# on www.master.com
clusterspec = \
    { "worker":
        [ "www.example0.com:2222"
        , "www.example1.com:2222"
        , "www.example2.com:2222"
        ]
    , "master":
        [ "localhost:2222" ]
    }

cluster = tf.train.ClusterSpec(clusterspec)

a = tf.constant(3)
b = tf.constant(2)

with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)
with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.Session(server.target) as sess:
    print(sess.run([x, y]))
2. Shared clusterspec
One cluster specification, using IP addresses / domain names that can all be seen from every node.
Saved in clusterspec.json:
{ "worker":
[ "www.example0.com:2222"
, "www.example1.com:2222"
, "www.example2.com:2222"
]
, "master":
[ "www.master.com:2222" ]
}
Then on each worker:
import json

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)
server = tf.train.Server(cluster, job_name="worker", task_index=<INDEX OF TASK>)
server.join()
Then on the master:
import json

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)

a = tf.constant(3)
b = tf.constant(2)

with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)
with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)

with tf.Session(server.target) as sess:
    print(sess.run([x, y]))
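For completeness, here is a sketch, under the same shared-clusterspec assumptions as above, that also pins the third op from the original question to the remaining worker, so that a single run call executes all three ops on three machines in parallel:
import json
import tensorflow as tf

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)
a = tf.constant(3)
b = tf.constant(2)

# pin one op to each of the three worker tasks
with tf.device("/job:worker/task:0"):
    x = tf.mul(a, b)
with tf.device("/job:worker/task:1"):
    y = tf.mul(a, b)
with tf.device("/job:worker/task:2"):
    z = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.Session(server.target) as sess:
    # a single run call lets the runtime execute the three ops in parallel
    print(sess.run([x, y, z]))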

Related

Op type not registered 'IO>BigQueryClient' with BigQuery connector on AI platform

I'm trying to parallelize the training step of my model with TensorFlow's ParameterServerStrategy. I use GCP AI Platform to create the cluster and launch the task.
As my dataset is huge, I use the BigQuery TensorFlow connector included in tensorflow-io.
My script is inspired by the documentation of the TensorFlow BigQuery reader and the documentation of TensorFlow's ParameterServerStrategy.
Locally my script works well, but when I launch it on AI Platform I get the following error:
{"created":"#1633444428.903993309","description":"Error received from peer ipv4:10.46.92.135:2222","file":"external/com_github_grpc_grpc/src/core/lib/surface/call.cc","file_line":1056,"grpc_message":"Op type not registered \'IO>BigQueryClient\' in binary running on gke-cml-1005-141531--n1-standard-16-2-644bc3f8-7h8p. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.","grpc_status":5}
The script works with fake data on AI Platform, and it works locally with the BigQuery connector.
I imagine that compiling the model with the BigQuery connector and calling it on other devices causes the bug, but I don't know how to fix it.
I read that this error can happen when devices don't have the same TensorFlow version, so I checked the tensorflow and tensorflow-io versions on each device:
tensorflow : 2.5.0
tensorflow-io : 0.19.1
I created a similar example which reproduces the bug on AI Platform:
import os
from tensorflow_io.bigquery import BigQueryClient
from tensorflow_io.bigquery import BigQueryReadSession
import tensorflow as tf
import multiprocessing
import portpicker
from tensorflow.keras.layers.experimental import preprocessing
from google.cloud import bigquery
from tensorflow.python.framework import dtypes
import numpy as np
import pandas as pd

client = bigquery.Client()

PROJECT_ID = <your_project>
DATASET_ID = 'tmp'
TABLE_ID = 'bq_tf_io'
BATCH_SIZE = 32

# Bigquery requirements
def init_bq_table():
    table = '%s.%s.%s' % (PROJECT_ID, DATASET_ID, TABLE_ID)

    # Create toy_data
    def create_toy_data(N):
        x = np.random.random(size=N)
        y = 0.2 + x + np.random.normal(loc=0, scale=0.3, size=N)
        return x, y

    x, y = create_toy_data(1000)
    df = pd.DataFrame(data={'x': x, 'y': y})
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_TRUNCATE",)
    job = client.load_table_from_dataframe(df, table, job_config=job_config)
    job.result()

# Create initial data
#init_bq_table()

CSV_SCHEMA = [
    bigquery.SchemaField("x", "FLOAT64"),
    bigquery.SchemaField("y", "FLOAT64"),
]

def transform_row(row_dict):
    # Trim all string tensors
    dataset_x = row_dict
    dataset_x['constant'] = tf.cast(1, tf.float64)
    # Extract feature column
    dataset_y = dataset_x.pop('y')
    # Export as tensor
    dataset_x = tf.stack([dataset_x[column] for column in dataset_x], axis=-1)
    return (dataset_x, dataset_y)

def read_bigquery(table_name):
    tensorflow_io_bigquery_client = BigQueryClient()
    read_session = tensorflow_io_bigquery_client.read_session(
        "projects/" + PROJECT_ID,
        PROJECT_ID, TABLE_ID, DATASET_ID,
        list(field.name for field in CSV_SCHEMA),
        list(dtypes.double if field.field_type == 'FLOAT64'
             else dtypes.string for field in CSV_SCHEMA),
        requested_streams=2)
    dataset = read_session.parallel_read_rows()
    return dataset

def get_data():
    dataset = read_bigquery(TABLE_ID)
    dataset = dataset.map(transform_row, num_parallel_calls=4)
    dataset = dataset.batch(BATCH_SIZE).prefetch(2)
    return dataset

cluster_resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()

# parameter server and worker just wait for jobs from the coordinator (chief)
if cluster_resolver.task_type in ("worker"):
    worker_config = tf.compat.v1.ConfigProto()
    server = tf.distribute.Server(
        cluster_resolver.cluster_spec(),
        job_name=cluster_resolver.task_type,
        task_index=cluster_resolver.task_id,
        config=worker_config,
        protocol="grpc")
    server.join()
elif cluster_resolver.task_type in ("ps"):
    server = tf.distribute.Server(
        cluster_resolver.cluster_spec(),
        job_name=cluster_resolver.task_type,
        task_index=cluster_resolver.task_id,
        protocol="grpc")
    server.join()
elif cluster_resolver.task_type == 'chief':
    strategy = tf.distribute.experimental.ParameterServerStrategy(cluster_resolver=cluster_resolver)

if cluster_resolver.task_type == 'chief':
    learning_rate = 0.01

    with strategy.scope():
        # model
        model_input = tf.keras.layers.Input(
            shape=(2,), dtype=tf.float64)
        layer_1 = tf.keras.layers.Dense(8, activation='relu')(model_input)
        dense_output = tf.keras.layers.Dense(1)(layer_1)
        model = tf.keras.Model(model_input, dense_output)

        # optimizer
        optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
        accuracy = tf.keras.metrics.MeanSquaredError()

    @tf.function
    def distributed_train_step(iterator):

        def train_step(x_batch_train, y_batch_train):
            with tf.GradientTape() as tape:
                y_predict = model(x_batch_train, training=True)
                loss_value = tf.keras.losses.MeanSquaredError(
                    reduction=tf.keras.losses.Reduction.NONE)(y_batch_train, y_predict)
            grads = tape.gradient(loss_value, model.trainable_weights)
            optimizer.apply_gradients(zip(grads, model.trainable_weights))
            accuracy.update_state(y_batch_train, y_predict)
            return loss_value

        x_batch_train, y_batch_train = next(iterator)
        return strategy.run(train_step, args=(x_batch_train, y_batch_train))

    coordinator = tf.distribute.experimental.coordinator.ClusterCoordinator(strategy)

    # test
    def dataset_fn(_):
        def create_toy_data(N):
            x = np.random.random(size=N)
            y = 0.2 + x + np.random.normal(loc=0, scale=0.3, size=N)
            return np.c_[x, y]

        def toy_transform_row(row):
            dataset_x = tf.stack([row[0], tf.cast(1, tf.float64)], axis=-1)
            dataset_y = row[1]
            return dataset_x, dataset_y

        N = 1000
        data = create_toy_data(N)
        dataset = tf.data.Dataset.from_tensor_slices(data)
        dataset = dataset.map(toy_transform_row, num_parallel_calls=4)
        dataset = dataset.batch(BATCH_SIZE)
        dataset = dataset.prefetch(2)
        return dataset

    @tf.function
    def per_worker_dataset_fn():
        return strategy.distribute_datasets_from_function(lambda x: get_data())  # <-- Not working with AI platform
        # return strategy.distribute_datasets_from_function(dataset_fn)  # <-- Working with AI platform

    per_worker_dataset = coordinator.create_per_worker_dataset(per_worker_dataset_fn)

    # Train model
    for epoch in range(5):
        per_worker_iterator = iter(per_worker_dataset)
        accuracy.reset_states()
        for step in range(5):
            coordinator.schedule(distributed_train_step, args=(per_worker_iterator,))
        coordinator.join()
        print("Finished epoch %d, accuracy is %f." % (epoch, accuracy.result().numpy()))
Inside per_worker_dataset_fn() I can either build the dataset with the BigQuery connector (which triggers the bug) or build it on the fly (which works).
AI Platform Cluster configuration :
runtimeVersion: "2.5"
pythonVersion: "3.7"
Has anyone run into this issue? The BigQuery connector worked fine with MirroredStrategy on AI Platform. Tell me if I should report the issue somewhere else.
I think this is due to lazy loading of libtensorflow_io.so.
https://github.com/tensorflow/io/commit/85d018ee59ceccfae06914ec2a2f6d6583775ff7
Can you try adding something like this to your code:
import tensorflow_io
tensorflow_io.experimental.oss()
As far as I understand, this happens because when you submit your training job to Cloud AI training, it uses a stock TensorFlow 2.5 environment that doesn't have the tensorflow-io package installed. Therefore it complains that it doesn't know about the 'IO>BigQueryClient' op defined in the tensorflow-io package.
Instead, you can submit your training job using a custom container:
https://cloud.google.com/ai-platform/training/docs/custom-containers-training
You don't need to write a new Dockerfile; you can use
gcr.io/deeplearning-platform-release/tf-cpu.2-5
or
gcr.io/deeplearning-platform-release/tf-gpu.2-5 (if your training job needs GPUs), both of which have the right version of tensorflow-io installed.
You can read more about these containers here:
https://cloud.google.com/tensorflow-enterprise/docs/use-with-deep-learning-containers
Here is my old example showing how to run distributed training on Cloud AI using BigQueryReader: https://github.com/vlasenkoalexey/criteo/blob/master/scripts/train-cloud.sh
It is no longer maintained, but it should give you a general idea of how this should look.

Distributed tensorflow : What is the job of chief worker?

I am using a version of the Distributed TensorFlow example: https://www.tensorflow.org/deploy/distributed
Here is my code in "mnist_trainer.py".
import math
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

tf.logging.set_verbosity(tf.logging.INFO)

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
tf.app.flags.DEFINE_integer("hidden_units", 100,
                            "Number of units in the hidden layer of the NN")
tf.app.flags.DEFINE_string("data_dir", "/home/anijsure/mnist_data",
                           "Directory for storing mnist data")
tf.app.flags.DEFINE_integer("batch_size", 100, "Training batch size")

FLAGS = tf.app.flags.FLAGS
IMAGE_PIXELS = 28

def main(_):
    print "Starting"
    ps_hosts = FLAGS.ps_hosts.split(",")
    worker_hosts = FLAGS.worker_hosts.split(",")

    # Create a cluster from the parameter server and worker hosts.
    print "Cluster starting"
    cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

    # Create and start a server for the local task.
    print "Server starting"
    server = tf.train.Server(cluster,
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        server.join()
    elif FLAGS.job_name == "worker":
        print "Job : WORKER"
        # Assigns ops to the local worker by default.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % FLAGS.task_index,
                cluster=cluster)):

            mytask = tf.constant(FLAGS.task_index, name="mytask")

            mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
            dataset = tf.data.Dataset.from_tensor_slices((mnist.train.images, mnist.train.labels))
            # Create batches of data
            dataset = dataset.batch(FLAGS.batch_size)
            # Create an iterator, to go over the dataset
            iterator = dataset.make_initializable_iterator()
            X, Y = iterator.get_next()

            # Variables of the hidden layer
            hid_w = tf.Variable(
                tf.truncated_normal([IMAGE_PIXELS * IMAGE_PIXELS, FLAGS.hidden_units],
                                    stddev=1.0 / IMAGE_PIXELS), name="hid_w")
            hid_b = tf.Variable(tf.zeros([FLAGS.hidden_units]), name="hid_b")

            # Variables of the softmax layer
            sm_w = tf.Variable(
                tf.truncated_normal([FLAGS.hidden_units, 10],
                                    stddev=1.0 / math.sqrt(FLAGS.hidden_units)),
                name="sm_w")
            sm_b = tf.Variable(tf.zeros([10]), name="sm_b")

            hid_lin = tf.nn.xw_plus_b(X, hid_w, hid_b)
            hid = tf.nn.relu(hid_lin)
            y = tf.nn.xw_plus_b(hid, sm_w, sm_b)
            loss = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=y), name="loss")

            global_step = tf.train.get_or_create_global_step()
            train_op = tf.train.AdagradOptimizer(0.01).minimize(
                loss, global_step=global_step)

        # The StopAtStepHook handles stopping after running given steps.
        chiefhooks = [tf.train.StopAtStepHook(num_steps=25)]
        allhooks = [tf.train.LoggingTensorHook(
            tensors={"Task": "mytask", "loss": "loss", "Step": "global_step"}, every_n_iter=1)]

        # The MonitoredTrainingSession takes care of session initialization,
        # restoring from a checkpoint, saving to a checkpoint, and closing when done
        # or an error occurs.
        with tf.train.MonitoredTrainingSession(master=server.target,
                                               is_chief=(FLAGS.task_index == 0),
                                               checkpoint_dir="/tmp/train_logs_%d" % FLAGS.task_index,
                                               hooks=allhooks, chief_only_hooks=chiefhooks) as mon_sess:
            mon_sess.run(iterator.initializer)
            while not mon_sess.should_stop():
                # Run a training step asynchronously.
                # See `tf.train.SyncReplicasOptimizer` for additional details on how to
                # perform *synchronous* training.
                # mon_sess.run handles AbortedError in case of preempted PS.
                _ = mon_sess.run([train_op])

if __name__ == "__main__":
    tf.app.run()
I run it like so:
HOSTS=<node0>:2222
WORKERS=<node1>:2222,<node1>:2223,<node1>:2224
python mnist_trainer.py --ps_hosts=$HOSTS --worker_hosts=$WORKERS --job_name=ps --task_index=0 &
python mnist_trainer.py --data_dir mnist_data --ps_hosts=$HOSTS --worker_hosts=$WORKERS --job_name=worker --task_index=0 2>&1 | tee worker0.log &
python mnist_trainer.py --data_dir mnist_data_1 --ps_hosts=$HOSTS --worker_hosts=$WORKERS --job_name=worker --task_index=1 2>&1 | tee worker1.log &
python mnist_trainer.py --data_dir mnist_data_2 --ps_hosts=$HOSTS --worker_hosts=$WORKERS --job_name=worker --task_index=2 2>&1 | tee worker2.log &
I have tried this with 1 PS and 2 or 3 workers; both nodes are CPU machines. The PS is on node0 and the workers are on different ports of node1. In both the 2-worker and 3-worker cases, the chief worker (the task-0 worker) does not seem to make any updates at all. I set the StopAtStepHook to 25 on the chief worker only, yet training stops at global_step=549 in the 2-worker case and global_step=1098 in the 3-worker case. I am printing the worker task number with the LoggingTensorHook, and only tasks 1 and 2 log anything; only on the last iteration does task 0 log its tensors.
Is this expected behaviour? Is the chief worker supposed to only keep track of the monitored session, checkpointing, etc.?
Considering that training does stop at this magic number of ~550 iterations, something on the chief worker is indeed triggering the stop.
What is the chief worker doing, and how does it keep track of the stopping step?
Usually the chief worker is responsible for initializing the graph and saving model checkpoints for the training cluster.
According to the TensorFlow documentation for tf.estimator.train_and_evaluate:
…[T]he chief worker also does the model training job, similar to other non-chief training workers (see next paragraph). In addition to the model training, it manages some extra work, e.g., checkpoint saving and restoring, writing summaries, etc.
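As a minimal sketch of that division of labour, assuming FLAGS, server, and train_op as defined in the question's code: the task started with is_chief=True performs initialization, checkpointing, and summary writing inside MonitoredTrainingSession, while still running training steps like any other worker.
# Minimal sketch, assuming FLAGS, server and train_op from the question's script.
# The chief (is_chief=True) additionally initializes variables, saves/restores
# checkpoints and writes summaries; it still runs training steps like any other
# worker. Non-chief workers wait for the chief's initialization, then just train.
with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=(FLAGS.task_index == 0),
        checkpoint_dir="/tmp/train_logs",  # only the chief writes checkpoints/summaries
        chief_only_hooks=[tf.train.StopAtStepHook(num_steps=25)]) as mon_sess:
    while not mon_sess.should_stop():
        mon_sess.run(train_op)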

Sharing of array list or variable between 2 distributed tensorflow processes

I am presently working on Distributed TensorFlow with 2 worker processes and facing the issue of sharing a variable's value between these two worker processes.
I found tf.get_collection / tf.add_to_collection, but I am still unable to get the variable value shared between the 2 processes.
Adding a few details about how I want to share the data among the worker processes in Distributed TensorFlow:
def create_variable(layer_shape):
    with tf.variable_scope("share_lay"):
        layers = tf.get_variable("layers", shape=layer_shape, trainable=True)
    with tf.variable_scope("share_lay", reuse=tf.AUTO_REUSE):
        layers = tf.get_variable("layers", shape=layer_shape, trainable=True)
    return layers

def set_layer(layers):
    tf.add_to_collection("layers", layers)

def get_layer(name):
    return tf.get_collection(name)[0]

# taskid == 0:
layers = create_variable(layer_shape)
layers = <some value>
set_layer(layers)

# taskid == 1:
layers = create_variable(layer_shape)
layers = get_layer("layers")
I am getting an error when performing get_layer():
return tf.get_collection(name)[0]
IndexError: list index out of range
It appears that the data cannot be shared between the workers.
Requesting some suggestions regarding this; any suggestions / pointers are appreciated.
Thanks,
Kapil
I finally solved the same problem by using tf.train.replica_device_setter() to place the variables on the parameter server and add them to a collection. Later, I can use tf.get_collection() in any worker to return that collection, which is actually a Python list. Note that tf.get_collection only returns a copy of the original collection; if you want to change the variables in the original collection, you should use tf.get_collection_ref, which actually returns the collection list itself.
Here is an example:
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('job_name', '',
                           """One of 'ps', 'worker' """)
tf.app.flags.DEFINE_integer('task_index', 0,
                            """Index of task within the job""")

cluster = tf.train.ClusterSpec(
    {'ps': ['localhost:22222'],
     'worker': ['localhost:22223', 'localhost:22227']})

config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)

if FLAGS.job_name == 'ps':
    server = tf.train.Server(cluster, job_name='ps', task_index=FLAGS.task_index, config=config)
    server.join()
else:
    server = tf.train.Server(cluster, job_name='worker', task_index=FLAGS.task_index, config=config)

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        # create a collection 'shared_list' and add two variables to the collection 'shared_list'
        # note that these two variables are placed on the parameter server
        a = tf.Variable(name='a', initial_value=tf.constant(1.0),
                        collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'shared_list'])
        b = tf.Variable(name='b', initial_value=tf.constant(2.0),
                        collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'shared_list'])

    # now let's print out the value of a+2.0 and b+2.0 using the collection 'shared_list' from different workers
    # note that tf.get_collection will return a copy of the existing collection, which is actually a python list
    with tf.device('/job:worker/task:%d' % FLAGS.task_index):
        c = tf.get_collection('shared_list')[0] + 2.0  # a+2.0
        d = tf.get_collection('shared_list')[1] + 2.0  # b+2.0

    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0),
                                           config=config) as sess:
        print('this is worker %d' % FLAGS.task_index)
        print(c.eval(session=sess))
        print(d.eval(session=sess))

    server.join()
worker 0 will print out:
this is worker 0
3.0
4.0
worker 1 will print out:
this is worker 1
3.0
4.0
Edit: worker 0 modifies the variable 'a' to 10, and then worker 1 prints out the new value of 'a', which becomes 10 immediately. Actually, variable 'a' is available to both worker 0 and worker 1 because they are in a distributed setting. Below is an example. Also refer to this blog post in Amid Fish by Matthew Rahtz on how to share variables in distributed TensorFlow. Actually, we don't need any parameter server to share variables: any two workers can share the same variable with each other as long as they create variables with exactly the same name.
Here is the example:
import tensorflow as tf
from time import sleep

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('job_name', '',
                           """One of 'ps', 'worker' """)
tf.app.flags.DEFINE_integer('task_index', 0,
                            """Index of task within the job""")

cluster = tf.train.ClusterSpec(
    {'ps': ['localhost:22222'],
     'worker': ['localhost:22223', 'localhost:22227']})

if FLAGS.job_name == 'ps':
    server = tf.train.Server(cluster, job_name='ps', task_index=FLAGS.task_index)
    server.join()
else:
    server = tf.train.Server(cluster, job_name='worker', task_index=FLAGS.task_index)

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        # create a collection 'shared_list' and add two variables to the collection 'shared_list'
        # note that these two variables are placed on the parameter server
        a = tf.Variable(name='a', initial_value=tf.constant(1.0),
                        collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'shared_list'])
        b = tf.Variable(name='b', initial_value=tf.constant(2.0),
                        collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'shared_list'])

    # change the value of 'a' in worker 0
    if FLAGS.task_index == 0:
        change_a = a.assign(10)

    # print out the new value of a in worker 1 using get_collection. Note that we may need to
    # use the read_value() method to force the op to read the current value of a
    if FLAGS.task_index == 1:
        with tf.device('/job:worker/task:1'):  # place read_a on worker 1
            read_a = tf.get_collection('shared_list')[0].read_value()  # a = 10

    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0)) as sess:
        if FLAGS.task_index == 0:
            sess.run(change_a)
        if FLAGS.task_index == 1:
            sleep(1)  # sleep a little bit to wait until change_a has been executed
            print(read_a.eval(session=sess))

    server.join()
worker 1 prints out
10

gcloud jobs submit prediction 'can't decode json' with --data-format=TF_RECORD

I pushed up some test data to gcloud for prediction as a binary TFRecord file. Running my script, I got the error ('No JSON object could be decoded', 162). What do you think I am doing wrong?
To push a prediction job to gcloud, I use this script:
REGION=us-east1
MODEL_NAME=mymodel
VERSION=v_hopt_22
INPUT_PATH=gs://mydb/test-data.tfr
OUTPUT_PATH=gs://mydb/prediction.tfr
JOB_NAME=pred_${MODEL_NAME}_${VERSION}_b
args=" --model "$MODEL_NAME
args+=" --version "$VERSION
args+=" --data-format=TF_RECORD"
args+=" --input-paths "$INPUT_PATH
args+=" --output-path "$OUTPUT_PATH
args+=" --region "$REGION
gcloud ml-engine jobs submit prediction $JOB_NAME $args
test-data.tfr has been generated from a numpy array, like so:
import numpy as np

filename = './Datasets/test-data.npz'
data = np.load(filename)
features = data['X']  # features[channel, example, feature]
np_features = np.swapaxes(features, 0, 1)  # features[example, channel, feature]

import tensorflow as tf
import nnscoring.data as D

def floats_feature(arr):
    return tf.train.Feature(float_list=tf.train.FloatList(value=arr.flatten().tolist()))

writer = tf.python_io.TFRecordWriter("./Datasets/test-data.tfr")
for i, np_example in enumerate(np_features):
    if i % 1000 == 0: print(i)
    tf_feature = {
        ch: floats_feature(x)
        for ch, x in zip(D.channels, np_example)
    }
    tf_features = tf.train.Features(feature=tf_feature)
    tf_example = tf.train.Example(features=tf_features)
    writer.write(tf_example.SerializeToString())
writer.close()
Update (following yxshi):
I define the following serving function
def tfrecord_serving_input_fn():
    import tensorflow as tf
    seq_length = int(dt*sr)
    examples = tf.placeholder(tf.string, shape=())
    feat_map = {
        channel: tf.FixedLenSequenceFeature(shape=(seq_length,),
                                            dtype=tf.float32, allow_missing=True)
        for channel in channels
    }
    parsed = tf.parse_single_example(examples, features=feat_map)
    features = {
        channel: tf.expand_dims(tensor, -1)
        for channel, tensor in parsed.iteritems()
    }
    from collections import namedtuple
    InputFnOps = namedtuple("InputFnOps", "features labels receiver_tensors")
    tf.contrib.learn.utils.input_fn_utils.InputFnOps = InputFnOps
    return InputFnOps(features=features, labels=None, receiver_tensors=examples)
    # InputFnOps = tf.contrib.learn.utils.input_fn_utils.InputFnOps
    # return InputFnOps(features, None, parsed)
    # Error: InputFnOps has no attribute receiver_tensors
..., which I pass to generate_experiment_fn like so:
export_strategies = [
    saved_model_export_utils.make_export_strategy(
        tfrecord_serving_input_fn,
        exports_to_keep=1,
        default_output_alternative_key=None,
    )]

gen_exp_fn = generate_experiment_fn(
    train_steps_per_iteration=args.train_steps_per_iteration,
    train_steps=args.train_steps,
    export_strategies=export_strategies
)
(aside: note the dirty patch of InputFnOps)
It looks like the input is not correctly specified in the inference graph. To use TF_RECORD as the input data format, your inference graph must accept strings as the input placeholder. In your case, you should have something like the following in your inference code:
examples = tf.placeholder(tf.string, name='input', shape=(None,))

with tf.name_scope('inputs'):
    # note: tf.parse_example expects parse specs (e.g. tf.FixedLenSequenceFeature),
    # not the tf.train.Feature protos used when writing the records
    feature_map = {
        ch: tf.FixedLenSequenceFeature(shape=(seq_length,),
                                       dtype=tf.float32, allow_missing=True)
        for ch in D.channels
    }
    parsed = tf.parse_example(examples, features=feature_map)
    f1 = parsed['feature_name_1']
    f2 = parsed['feature_name_2']
    ...
A close example is here:
https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/flowers/trainer/model.py#L253
Hope it helps.
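If you move from tf.contrib.learn to the tf.estimator export API, the same string-placeholder-plus-parse_example pattern is available as a built-in helper. A sketch, assuming the channels and seq_length variables from the question:
import tensorflow as tf

# Parse spec mirroring the TFRecord writer: one float list per channel.
# channels and seq_length are assumed to be defined as in the question.
feature_spec = {
    channel: tf.FixedLenSequenceFeature(shape=(seq_length,),
                                        dtype=tf.float32,
                                        allow_missing=True)
    for channel in channels
}

# Builds a serving_input_receiver_fn whose receiver tensor is a batch of
# serialized tf.Example strings, which is what --data-format=TF_RECORD feeds.
serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(
    feature_spec)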

Tensorflow: running code on multiple nodes

import json
import tensorflow as tf

num_devices = 3

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)
a = tf.constant(3)
b = tf.constant(2)

for i in range(num_devices):
    with tf.device("/job:worker/task:%d" % i):
        if i == 0:
            x = tf.mul(a, b)
        if i == 1:
            y = tf.mul(a, b)
        if i == 2:
            z = tf.mul(a, b)

server = tf.train.Server(cluster, job_name="master", task_index=0)

with tf.Session(server.target) as sess:
    print "here"
    print(sess.run([x, y, z]))
Could someone please suggest a better way to run this simple example on multiple machines? I don't want to have three different tensors x, y, and z; instead, I want only a single tensor (or list of results) and no branching within the graph-building loop at all.
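Not an authoritative answer, but one way to avoid both the separate tensor names and the branching is to build the per-device ops in a loop and collect them in a list; a minimal sketch under the same clusterspec assumptions as the code above:
import json
import tensorflow as tf

num_devices = 3

with open('clusterspec.json', 'r') as f:
    clusterspec = json.load(f)

cluster = tf.train.ClusterSpec(clusterspec)
a = tf.constant(3)
b = tf.constant(2)

# build one op per worker task, collected in a list instead of x, y, z
results = []
for i in range(num_devices):
    with tf.device("/job:worker/task:%d" % i):
        results.append(tf.mul(a, b))

server = tf.train.Server(cluster, job_name="master", task_index=0)
with tf.Session(server.target) as sess:
    print(sess.run(results))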