Distributed TensorFlow: What is the job of the chief worker?

I am using a version of the distributed TensorFlow example from https://www.tensorflow.org/deploy/distributed.
Here is my code in "mnist_trainer.py":
import math
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

tf.logging.set_verbosity(tf.logging.INFO)

# Flags for defining the tf.train.ClusterSpec
tf.app.flags.DEFINE_string("ps_hosts", "",
                           "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "",
                           "Comma-separated list of hostname:port pairs")

# Flags for defining the tf.train.Server
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
tf.app.flags.DEFINE_integer("hidden_units", 100,
                            "Number of units in the hidden layer of the NN")
tf.app.flags.DEFINE_string("data_dir", "/home/anijsure/mnist_data",
                           "Directory for storing mnist data")
tf.app.flags.DEFINE_integer("batch_size", 100, "Training batch size")

FLAGS = tf.app.flags.FLAGS

IMAGE_PIXELS = 28


def main(_):
  print "Starting"
  ps_hosts = FLAGS.ps_hosts.split(",")
  worker_hosts = FLAGS.worker_hosts.split(",")

  # Create a cluster from the parameter server and worker hosts.
  print "Cluster starting"
  cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})

  # Create and start a server for the local task.
  print "Server starting"
  server = tf.train.Server(cluster,
                           job_name=FLAGS.job_name,
                           task_index=FLAGS.task_index)

  if FLAGS.job_name == "ps":
    server.join()
  elif FLAGS.job_name == "worker":
    print "Job : WORKER"

    # Assigns ops to the local worker by default.
    with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:%d" % FLAGS.task_index,
        cluster=cluster)):

      mytask = tf.constant(FLAGS.task_index, name="mytask")

      mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
      dataset = tf.data.Dataset.from_tensor_slices(
          (mnist.train.images, mnist.train.labels))
      # Create batches of data
      dataset = dataset.batch(FLAGS.batch_size)
      # Create an iterator, to go over the dataset
      iterator = dataset.make_initializable_iterator()
      X, Y = iterator.get_next()

      # Variables of the hidden layer
      hid_w = tf.Variable(
          tf.truncated_normal([IMAGE_PIXELS * IMAGE_PIXELS, FLAGS.hidden_units],
                              stddev=1.0 / IMAGE_PIXELS), name="hid_w")
      hid_b = tf.Variable(tf.zeros([FLAGS.hidden_units]), name="hid_b")

      # Variables of the softmax layer
      sm_w = tf.Variable(
          tf.truncated_normal([FLAGS.hidden_units, 10],
                              stddev=1.0 / math.sqrt(FLAGS.hidden_units)),
          name="sm_w")
      sm_b = tf.Variable(tf.zeros([10]), name="sm_b")

      hid_lin = tf.nn.xw_plus_b(X, hid_w, hid_b)
      hid = tf.nn.relu(hid_lin)
      y = tf.nn.xw_plus_b(hid, sm_w, sm_b)
      loss = tf.reduce_mean(
          tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=y), name="loss")

      global_step = tf.train.get_or_create_global_step()

      train_op = tf.train.AdagradOptimizer(0.01).minimize(
          loss, global_step=global_step)

    # The StopAtStepHook handles stopping after running given steps.
    chiefhooks = [tf.train.StopAtStepHook(num_steps=25)]
    allhooks = [tf.train.LoggingTensorHook(
        tensors={"Task": "mytask", "loss": "loss", "Step": "global_step"},
        every_n_iter=1)]

    # The MonitoredTrainingSession takes care of session initialization,
    # restoring from a checkpoint, saving to a checkpoint, and closing when done
    # or an error occurs.
    with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=(FLAGS.task_index == 0),
        checkpoint_dir="/tmp/train_logs_%d" % FLAGS.task_index,
        hooks=allhooks, chief_only_hooks=chiefhooks) as mon_sess:
      mon_sess.run(iterator.initializer)
      while not mon_sess.should_stop():
        # Run a training step asynchronously.
        # See `tf.train.SyncReplicasOptimizer` for additional details on how to
        # perform *synchronous* training.
        # mon_sess.run handles AbortedError in case of preempted PS.
        _ = mon_sess.run([train_op])


if __name__ == "__main__":
  tf.app.run()
I run it like so:
HOSTS=<node0>:2222
WORKERS=<node1>:2222,<node1>:2223,<node1>:2224
python mnist_trainer.py --ps_hosts=$HOSTS --worker_hosts=$WORKERS --job_name=ps --task_index=0 &
python mnist_trainer.py --data_dir mnist_data --ps_hosts=$HOSTS --worker_hosts=$WORKERS --job_name=worker --task_index=0 2>&1 | tee worker0.log &
python mnist_trainer.py --data_dir mnist_data_1 --ps_hosts=$HOSTS --worker_hosts=$WORKERS --job_name=worker --task_index=1 2>&1 | tee worker1.log &
python mnist_trainer.py --data_dir mnist_data_2 --ps_hosts=$HOSTS --worker_hosts=$WORKERS --job_name=worker --task_index=2 2>&1 | tee worker2.log &
I have tried this with 1 PS and 2 or 3 workers; both nodes are CPU machines. The PS is on node0 and the workers are all on different ports of node1. In both the 2-worker and 3-worker cases, the chief worker (the task 0 worker) does not seem to be making any updates at all. I have set the StopAtStepHook to 25 steps on the chief worker only. However, training seems to stop at global_step=549 in the 2-worker case and global_step=1098 in the 3-worker case. I am printing the worker task number with the LoggingTensorHook and it only shows tasks 1 and 2 logging anything. Only on the last iteration does task 0 log the tensors.
Is this expected behaviour? Is the chief worker supposed to only keep track of the monitored session, checkpointing, etc.?
Considering that training does stop at this magic number of 550 iterations, something on the chief worker is indeed triggering the stop.
What is the chief worker doing and how is it keeping track of the stopping step?

Usually the chief worker is responsible for initializing the graph and for saving model checkpoints for the training cluster.

According to the TensorFlow documentation for tf.estimator.train_and_evaluate:
…[T]he chief worker also does the model training job, similar to other non-chief training workers (see next paragraph). In addition to the model training, it manages some extra work, e.g., checkpoint saving and restoring, writing summaries, etc.
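For illustration, here is a minimal sketch (not the asker's exact setup) of how that division of labour is typically expressed with tf.train.MonitoredTrainingSession. It assumes server, train_op, and FLAGS exist as in the question's code; the checkpoint directory and step count are placeholders:
# Sketch only: assumes `server`, `train_op`, and FLAGS as in the question's code;
# the checkpoint directory and step count are placeholders.
is_chief = (FLAGS.task_index == 0)

with tf.train.MonitoredTrainingSession(
        master=server.target,
        is_chief=is_chief,                 # the chief initializes/restores variables, saves checkpoints, writes summaries
        checkpoint_dir="/tmp/train_logs",  # usually a directory visible to every worker
        hooks=[tf.train.StopAtStepHook(last_step=10000)],  # hooks here run on every worker
        chief_only_hooks=[]) as mon_sess:  # hooks listed here run on the chief only
    while not mon_sess.should_stop():
        # The chief runs training steps like any other worker, on top of its bookkeeping duties.
        mon_sess.run(train_op)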

Related

How do I use all cores of my CPU in reinforcement learning with TF Agents?

I work with an RL algorithm. I'm using tensorflow and tf-agents and training a DQN. My problem is that only one core of the CPU is used when calculating the 10 episodes in the environment for data collection.
My training function looks like this:
def train_step(self, n_steps):
    env_steps = tf_metrics.EnvironmentSteps()
    #num_episodes = tf_metrics.NumberOfEpisodes()
    rew = TFSumOfRewards()
    action_hist = tf_metrics.ChosenActionHistogram(
        name='ChosenActionHistogram', dtype=tf.int32, buffer_size=1000
    )

    # add replay buffer and metrics to the observer
    replay_observer = [self.replay_buffer.add_batch]
    train_metrics = [env_steps, rew]

    self.replay_buffer.clear()

    driver = dynamic_episode_driver.DynamicEpisodeDriver(
        self.train_env, self.collect_policy,
        observers=replay_observer + train_metrics,
        num_episodes=self.collect_episodes)

    final_time_step, policy_state = driver.run()

    print('Number of Steps: ', env_steps.result().numpy())

    for train_metric in train_metrics:
        train_metric.tf_summaries(train_step=self.global_step, step_metrics=train_metrics)

    # Convert the replay buffer to a tf.data.Dataset
    # Dataset generates trajectories with shape [Bx2x...]
    AUTOTUNE = tf.data.experimental.AUTOTUNE
    dataset = self.replay_buffer.as_dataset(
        num_parallel_calls=AUTOTUNE,
        sample_batch_size=self.batch_size,
        num_steps=(self.train_sequence_length + 1)).prefetch(AUTOTUNE)

    iterator = iter(dataset)

    train_loss = None
    for _ in range(n_steps):
        # Sample a batch of data from the buffer and update the agent's network.
        experience, unused_info = next(iterator)
        train_loss = self.agent.train(experience)

def train_agent(self, n_epoch):
    for i in range(n_epoch):
        self.train_step(int(self.replay_buffer.num_frames().numpy() / self.batch_size))
        if self.IsAutoStoreCheckpoint:
            self.store_check_point()
    pass
As already written above, num_episodes = 10, so it would make sense to compute the 10 episodes in parallel before the network is trained.
If I set num_parallel_calls to e.g. 10, nothing changes. What do I have to do to use all cores of my CPU (a Ryzen 9 5950X with 16 cores)?
Thanks!

How to access the Q network output layers from a DqnAgent in Tensorflow agents

My Q-network for a DqnAgent is a Sequential set of layers (sequential.Sequential) - really similar to the tutorial here: https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial#agent:
q_net = sequential.Sequential(dense_layers + [q_values_layer])
You can normally access Keras layers by doing .layers[i].output etc.
But when the network is used as the q_network for a DqnAgent, the layer outputs are never available/initialised.
Is there some way I can access the layer outputs and values when the network is attached to an agent like this? I want this mainly for debugging.
Again my loop is very similar to the loop here: https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial#training_the_agent:
agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

...

collect_driver = py_driver.PyDriver(
    env,
    py_tf_eager_policy.PyTFEagerPolicy(
        agent.collect_policy, use_tf_function=True),
    [rb_observer],
    max_steps=collect_steps_per_iteration)

for _ in range(num_iterations):

    # Collect a few steps and save to the replay buffer.
    time_step, _ = collect_driver.run(time_step)

    # Sample a batch of data from the buffer and update the agent's network.
    experience, unused_info = next(iterator)
    train_loss = agent.train(experience).loss

    step = agent.train_step_counter.numpy()

    if step % log_interval == 0:
        print('step = {0}: loss = {1}'.format(step, train_loss))

    if step % eval_interval == 0:
        avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
        print('step = {0}: Average Return = {1}'.format(step, avg_return))
        returns.append(avg_return)

python multiprocessing pool.map hangs when calling tensorflow/keras model

I use pool.map from multiprocessing to parallelize my Python code. When I call my tensorflow/keras model with pool.map, the code hangs if my neural network is larger than a certain size. I still have plenty of RAM available, and calling the model outside of the pool works fine.
I use Python 3.7 and TensorFlow 2.3 on Linux.
An MWE is provided below; it is also on Colab:
import os
import multiprocessing

import numpy as np
import tensorflow as tf

def my_function(i):
    a = MODEL(np.array(i).reshape(1, 1))
    print('foo', i)
    return a

THREADS = os.cpu_count()
N = 4
NEURONS = 150000  # works for 100000, hangs for 150000

MODEL = tf.keras.Sequential([tf.keras.layers.Dense(NEURONS, input_shape=(1,))])

my_function(10)  # works fine

pool = multiprocessing.Pool(THREADS)
_ = pool.map(my_function, range(N))  # hangs
pool.close()
pool.join()
Any idea what the issue is? How can I call a large model in parallel?
Edit: the size of a is not the issue, and the code hangs only if tf.keras is called once outside of the pool; see the MWE below and on Colab. The critical number of neurons is lower than in the original example. Any idea?
def my_function(i):
    print('start', i)
    model = tf.keras.Sequential([tf.keras.layers.Dense(NEURONS, input_shape=(1,))])
    print('finish', i)
    return None

THREADS = os.cpu_count()
N = 4
NEURONS = 20000  # works with 10000, not with 20000

# works
pool = multiprocessing.Pool(THREADS)
_ = pool.map(my_function, range(N))
pool.close()
pool.join()

# works
my_function(10)

# doesn't work if many neurons
pool = multiprocessing.Pool(THREADS)
_ = pool.map(my_function, range(N))
pool.close()
pool.join()

Sharing of array list or variable between 2 distributed tensorflow processes

I am presently working on distributed TensorFlow with 2 worker processes and facing the issue of sharing a variable between these two worker processes.
I found tf.get_collection/tf.add_collection but am still unable to get the variable value shared between the 2 processes.
Adding a few details about how I want to share the data among the worker processes in distributed TensorFlow:
def create_variable(layer_shape):
    with tf.variable_scope("share_lay"):
        layers = tf.get_variable("layers", shape=layer_shape, trainable=True)
    with tf.variable_scope("share_lay", reuse=tf.AUTO_REUSE):
        layers = tf.get_variable("layers", shape=layer_shape, trainable=True)
    return layers

def set_layer(layers):
    tf.add_to_collection("layers", layers)

def get_layer(name):
    return tf.get_collection(name)[0]

taskid == 0:
    layers = create_variable(layer_shape)
    layers = <some value>
    set_layer(layers)

taskid == 1:
    layers = create_variable(layer_shape)
    layers = get_layer("layers")
I am getting an error when performing get_layer():
return tf.get_collection(name)[0]
IndexError: list index out of range
It appears that the data cannot be shared between the workers. Any suggestions / pointers are appreciated.
Thanks,
Kapil
I finally solved the same problem by using tf.train.replica_device_setter() to place the variables on the parameter server and adding them to a collection. Later, I can use tf.get_collection() in any worker to return that collection, which is actually a Python list. Note that tf.get_collection only returns a copy of the original collection. If you want to change the variables in the original collection, you should use tf.get_collection_ref, which actually returns the collection list itself.
Here is an example:
import tensorflow as tf

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('job_name', '',
                           """One of 'ps', 'worker' """)
tf.app.flags.DEFINE_integer('task_index', 0,
                            """Index of task within the job""")

cluster = tf.train.ClusterSpec(
    {'ps': ['localhost:22222'],
     'worker': ['localhost:22223', 'localhost:22227']})

config = tf.ConfigProto(
    intra_op_parallelism_threads=1,
    inter_op_parallelism_threads=1)

if FLAGS.job_name == 'ps':
    server = tf.train.Server(cluster, job_name='ps', task_index=FLAGS.task_index, config=config)
    server.join()
else:
    server = tf.train.Server(cluster, job_name='worker', task_index=FLAGS.task_index, config=config)

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        # create a collection 'shared_list' and add two variables to the collection 'shared_list'
        # note that these two variables are placed on the parameter server
        a = tf.Variable(name='a', initial_value=tf.constant(1.0),
                        collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'shared_list'])

        b = tf.Variable(name='b', initial_value=tf.constant(2.0),
                        collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'shared_list'])

    # now let's print out the value of a+2.0 and b+2.0 using the collection 'shared_list' from different workers
    # note that tf.get_collection will return a copy of the existing collection, which is actually a python list
    with tf.device('/job:worker/task:%d' % FLAGS.task_index):
        c = tf.get_collection('shared_list')[0] + 2.0  # a+2.0
        d = tf.get_collection('shared_list')[1] + 2.0  # b+2.0

    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0),
                                           config=config) as sess:
        print('this is worker %d' % FLAGS.task_index)
        print(c.eval(session=sess))
        print(d.eval(session=sess))

    server.join()
worker 0 will print out:
this is worker 0
3.0
4.0
worker 1 will print out:
this is worker 1
3.0
4.0
Edit: worker 0 modifies the variable 'a' to 10, and then worker 1 prints out the new value of 'a', which becomes 10 immediately. Variable 'a' is available to both worker 0 and worker 1 because they are in a distributed setting. Below is an example. Also refer to this blog post on Amid Fish by Matthew Rahtz for how to share variables in distributed TensorFlow. Actually, we don't need any parameter server to share variables: any two workers can share the same variable with each other as long as the two workers create two variables having exactly the same name.
Here is the example
import tensorflow as tf
from time import sleep

FLAGS = tf.app.flags.FLAGS
tf.app.flags.DEFINE_string('job_name', '',
                           """One of 'ps', 'worker' """)
tf.app.flags.DEFINE_integer('task_index', 0,
                            """Index of task within the job""")

cluster = tf.train.ClusterSpec(
    {'ps': ['localhost:22222'],
     'worker': ['localhost:22223', 'localhost:22227']})

if FLAGS.job_name == 'ps':
    server = tf.train.Server(cluster, job_name='ps', task_index=FLAGS.task_index)
    server.join()
else:
    server = tf.train.Server(cluster, job_name='worker', task_index=FLAGS.task_index)

    with tf.device(tf.train.replica_device_setter(cluster=cluster)):
        # create a collection 'shared_list' and add two variables to the collection 'shared_list'
        # note that these two variables are placed on the parameter server
        a = tf.Variable(name='a', initial_value=tf.constant(1.0),
                        collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'shared_list'])

        b = tf.Variable(name='b', initial_value=tf.constant(2.0),
                        collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'shared_list'])

    # change the value of 'a' in worker 0
    if FLAGS.task_index == 0:
        change_a = a.assign(10)

    # print out the new value of a in worker 1 using get_collection. Note that we may need to
    # use the read_value() method to force the op to read the current value of a
    if FLAGS.task_index == 1:
        with tf.device('/job:worker/task:1'):  # place read_a on worker 1
            read_a = tf.get_collection('shared_list')[0].read_value()  # a = 10

    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(FLAGS.task_index == 0)) as sess:
        if FLAGS.task_index == 0:
            sess.run(change_a)
        if FLAGS.task_index == 1:
            sleep(1)  # sleep a little bit to wait until change_a has been executed
            print(read_a.eval(session=sess))

    server.join()
worker 1 prints out
10
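As a side note on the tf.get_collection vs. tf.get_collection_ref distinction made above, here is a minimal, standalone sketch (graph mode; the collection name 'shared_list' simply mirrors the examples above):
import tensorflow as tf

# Put one variable into a custom collection, as in the examples above.
x = tf.Variable(1.0, name='x',
                collections=[tf.GraphKeys.GLOBAL_VARIABLES, 'shared_list'])

copied = tf.get_collection('shared_list')       # returns a *copy* of the collection list
copied.append('whatever')                       # modifies only the copy, not the graph's collection

ref = tf.get_collection_ref('shared_list')      # returns the collection list itself
ref.append(tf.Variable(2.0, name='y'))          # this does change the collection

print(len(tf.get_collection('shared_list')))    # prints 2 ('x' and 'y')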

Implement early stopping in tf.estimator.DNNRegressor using the available training hooks

I am new to TensorFlow and want to implement early stopping in tf.estimator.DNNRegressor with the available training hooks for the MNIST dataset. The early stopping hook should stop training if the loss does not improve for some specified number of steps. The TensorFlow documentation only provides an example for logging hooks. Can someone write a code snippet for implementing it?
Here is an EarlyStoppingHook sample implementation:
import numpy as np
import tensorflow as tf
import logging
from tensorflow.python.training import session_run_hook


class EarlyStoppingHook(session_run_hook.SessionRunHook):
    """Hook that requests stop when the monitored value stops improving."""

    def __init__(self, monitor='val_loss', min_delta=0, patience=0,
                 mode='auto'):
        self.monitor = monitor
        self.patience = patience
        self.min_delta = min_delta
        self.wait = 0
        if mode not in ['auto', 'min', 'max']:
            logging.warning('EarlyStopping mode %s is unknown, '
                            'fallback to auto mode.', mode)
            mode = 'auto'

        if mode == 'min':
            self.monitor_op = np.less
        elif mode == 'max':
            self.monitor_op = np.greater
        else:
            if 'acc' in self.monitor:
                self.monitor_op = np.greater
            else:
                self.monitor_op = np.less

        if self.monitor_op == np.greater:
            self.min_delta *= 1
        else:
            self.min_delta *= -1

        self.best = np.Inf if self.monitor_op == np.less else -np.Inf

    def begin(self):
        # Convert names to tensors if given
        graph = tf.get_default_graph()
        self.monitor = graph.as_graph_element(self.monitor)
        if isinstance(self.monitor, tf.Operation):
            self.monitor = self.monitor.outputs[0]

    def before_run(self, run_context):  # pylint: disable=unused-argument
        return session_run_hook.SessionRunArgs(self.monitor)

    def after_run(self, run_context, run_values):
        current = run_values.results

        if self.monitor_op(current - self.min_delta, self.best):
            self.best = current
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                run_context.request_stop()
This implementation is based on the Keras implementation.
To use it with the CNN MNIST example, create the hook and pass it to train:
early_stopping_hook = EarlyStoppingHook(
    monitor='sparse_softmax_cross_entropy_loss/value', patience=10)

mnist_classifier.train(
    input_fn=train_input_fn,
    steps=20000,
    hooks=[logging_hook, early_stopping_hook])
Here sparse_softmax_cross_entropy_loss/value is the name of the loss op in that example.
EDIT 1:
It looks like there is no "official" way of finding the loss node when using estimators (or at least I couldn't find one).
For the DNNRegressor this node has the name dnn/head/weighted_loss/Sum.
Here is how to find it in the graph:
1. Start TensorBoard in the model directory. In my case I didn't set any directory, so the estimator used a temporary directory and printed this line:
WARNING:tensorflow:Using temporary folder as model directory: /tmp/tmpInj8SC
2. Start TensorBoard:
tensorboard --logdir /tmp/tmpInj8SC
3. Open it in a browser and navigate to the GRAPHS tab.
4. Find the loss in the graph. Expand blocks in the sequence dnn → head → weighted_loss and click on the Sum node (note that there is a summary node named loss connected to it).
5. The name shown in the info "window" to the right is the name of the selected node, which needs to be passed to the monitor argument of EarlyStoppingHook.
The loss node of the DNNClassifier has the same name by default. Both DNNClassifier and DNNRegressor have an optional argument loss_reduction that influences the loss node name and behavior (it defaults to losses.Reduction.SUM).
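Putting the pieces together, a hedged usage sketch for the DNNRegressor case (feature_columns, train_input_fn, and the hidden-unit/step/patience numbers are placeholders; the monitored name is the default dnn/head/weighted_loss/Sum described above):
# Sketch only: `feature_columns` and `train_input_fn` are placeholders.
regressor = tf.estimator.DNNRegressor(
    hidden_units=[64, 32],
    feature_columns=feature_columns)

early_stopping_hook = EarlyStoppingHook(
    monitor='dnn/head/weighted_loss/Sum',  # default loss node name for DNNRegressor (see steps above)
    patience=100)

regressor.train(input_fn=train_input_fn,
                steps=20000,
                hooks=[early_stopping_hook])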
EDIT 2:
There is a way of finding the loss without looking at the graph.
You can use the GraphKeys.LOSSES collection to get the loss, but this will only work after training has started, so you can use it only in a hook.
For example, you can remove the monitor argument from the EarlyStoppingHook class and change its begin function to always use the first loss in the collection:
self.monitor = tf.get_default_graph().get_collection(tf.GraphKeys.LOSSES)[0]
You also probably need to check that there is a loss in the collection.
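A hedged sketch of what such a begin method could look like, including the non-emptiness check mentioned above (intended as a drop-in replacement inside the EarlyStoppingHook class from this answer):
    def begin(self):
        # Take the first loss registered in the GraphKeys.LOSSES collection instead of
        # resolving a tensor name passed through the `monitor` argument.
        losses = tf.get_default_graph().get_collection(tf.GraphKeys.LOSSES)
        if not losses:
            raise RuntimeError('No tensors found in the GraphKeys.LOSSES collection.')
        self.monitor = losses[0]
        if isinstance(self.monitor, tf.Operation):
            self.monitor = self.monitor.outputs[0]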