How do I use device_filters with tf.contrib.learn.Experiment? - tensorflow

By default, TensorFlow distributed training establishes all-to-all connections between workers and parameter servers, even though in asynchronous distributed training, the only necessary communication is between each individual worker and the parameter servers.
How do I limit communication when I'm using tf.contrib.learn.Experiment?

# The easiest way to parse the TF_CONFIG environment variable is to create a RunConfig.
# Unfortunately, it is an immutable object, so we're going to create a
# temporary one and only use it for `task_type` and `task_id`.
tmp = tf.contrib.learn.RunConfig()
task_type, task_id = tmp.task_type, tmp.task_id
# We use a device_filter to limit the communication between this job
# and the parameter servers, i.e., there is no need to directly
# communicate with the other workers; attempting to do so can result
# in reliability problems.
device_filters = [
    '/job:ps', '/job:%s/task:%d' % (task_type, task_id)
]
session_config = tf.ConfigProto(device_filters=device_filters)
run_config = tf.contrib.learn.RunConfig(
    model_dir=args.job_dir,
    session_config=session_config)
# Create the experiment_fn:
experiment_fn = ...
# Run the experiment
learn_runner.run(experiment_fn, run_config=run_config)
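For context, one possible shape of the elided experiment_fn is sketched below, assuming the (run_config, hparams) signature that learn_runner.run uses when a run_config is passed; my_model_fn and the input functions are hypothetical placeholders, not part of the original answer.

def experiment_fn(run_config, hparams):
    # The Estimator picks up the device-filtered session_config via run_config.
    estimator = tf.contrib.learn.Estimator(
        model_fn=my_model_fn,               # hypothetical model_fn
        config=run_config)
    return tf.contrib.learn.Experiment(
        estimator,
        train_input_fn=my_train_input_fn,   # hypothetical input functions
        eval_input_fn=my_eval_input_fn)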

Related

Tf-agent Actor/Learner: TFUniform ReplayBuffer dimensionality issue - invalid shape of Replay Buffer vs. Actor update

I am trying to adapt this tf-agents actor<->learner DQN Atari Pong example to my Windows machine, using a TFUniformReplayBuffer instead of the ReverbReplayBuffer (which only works on Linux machines), but I am facing a dimensionality issue.
[...]
---> 67 init_buffer_actor.run()
[...]
InvalidArgumentError: {{function_node __wrapped__ResourceScatterUpdate_device_/job:localhost/replica:0/task:0/device:CPU:0}} Must have updates.shape = indices.shape + params.shape[1:] or updates.shape = [], got updates.shape [84,84,4], indices.shape [1], params.shape [1000,84,84,4] [Op:ResourceScatterUpdate]
The problem is as follows: the tf-agents actor tries to access the replay buffer and initialize it with a certain number of random samples of shape (84,84,4), according to this DeepMind paper, but the replay buffer requires samples of shape (1,84,84,4).
My code is as follows:
def train_pong(
        env_name='ALE/Pong-v5',
        initial_collect_steps=50000,
        max_episode_frames_collect=50000,
        batch_size=32,
        learning_rate=0.00025,
        replay_capacity=1000):
    # load atari environment
    collect_env = suite_atari.load(
        env_name,
        max_episode_steps=max_episode_frames_collect,
        gym_env_wrappers=suite_atari.DEFAULT_ATARI_GYM_WRAPPERS_WITH_STACKING)
    # create tensor specs
    observation_tensor_spec, action_tensor_spec, time_step_tensor_spec = (
        spec_utils.get_tensor_specs(collect_env))
    # create training util
    train_step = train_utils.create_train_step()
    # calculate no. of actions
    num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1
    # create agent
    agent = dqn_agent.DqnAgent(
        time_step_tensor_spec,
        action_tensor_spec,
        q_network=create_DL_q_network(num_actions),
        optimizer=tf.compat.v1.train.RMSPropOptimizer(learning_rate=learning_rate))
    # create uniform replay buffer
    replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
        data_spec=agent.collect_data_spec,
        batch_size=1,
        max_length=replay_capacity)
    # observer of replay buffer
    rb_observer = replay_buffer.add_batch
    # create batch dataset
    dataset = replay_buffer.as_dataset(
        sample_batch_size=batch_size,
        num_steps=2,
        single_deterministic_pass=False).prefetch(3)
    # create callable function for actor
    experience_dataset_fn = lambda: dataset
    # create random policy for buffer init
    random_policy = random_py_policy.RandomPyPolicy(
        collect_env.time_step_spec(),
        collect_env.action_spec())
    # create initializer
    init_buffer_actor = actor.Actor(
        collect_env,
        random_policy,
        train_step,
        steps_per_run=initial_collect_steps,
        observers=[replay_buffer.add_batch])
    # initialize buffer with random samples
    init_buffer_actor.run()
(The approach uses the OpenAI Gym environment as well as the corresponding wrapper functions.)
I have worked with keras-rl2 and with tf-agents without the actor<->learner setup for other Atari games to create a DQN, and both worked quite well after some adaptations. I guess my current code would also work after a few adaptations inside the tf-agents library functions, but that would defeat the purpose of the library.
My current assumption: the actor<->learner methods are not able to work with the TFUniformReplayBuffer (the way I expect them to) due to the missing support for the TFPyEnvironment, or I still have some gaps in my knowledge of this tf-agents approach.
Previous (successful) attempt:
from tf_agents.environments.tf_py_environment import TFPyEnvironment
tf_collect_env = TFPyEnvironment(collect_env)
init_driver = DynamicStepDriver(
    tf_collect_env,
    random_policy,
    observers=[replay_buffer.add_batch],
    num_steps=200)
init_driver.run()
I would be very grateful if someone could explain to me what I'm overlooking here.
I fixed it... partly, but the next error is (in my opinion) an architectural problem.
The problem is that the Actor/Learner setup is built on a PyEnvironment, whereas the
TFUniformReplayBuffer expects a TFPyEnvironment, which ends up in the failure above...
Using the PyUniformReplayBuffer with a converted py-spec solved this problem.
from tf_agents.specs import tensor_spec
# convert agent spec to py-data-spec
py_collect_data_spec = tensor_spec.to_array_spec(agent.collect_data_spec)
# create replay buffer based on the py-data-spec
replay_buffer = py_uniform_replay_buffer.PyUniformReplayBuffer(
    data_spec=py_collect_data_spec,
    capacity=replay_capacity * batch_size
)
This snippet solved the issue of having an incompatible buffer in the background, but it leads to another issue
--> The add_batch function does not work
I found this approach, which advises using either a batched environment or making the following adaptations to the replay observer (add_batch method).
from tf_agents.utils.nest_utils import batch_nested_array
#********* Adaptations add_batch method - START *********#
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
#********* Adaptations add_batch method - END *********#
# create batch dataset
dataset = replay_buffer.as_dataset(
    sample_batch_size=32,
    single_deterministic_pass=False)
experience_dataset_fn = lambda: dataset
This helped me to solve the issue regarding this post, but now I ran into another problem where I need to ask someone from the tf-agents team...
--> It seems that the Learner/Actor structure is not able to work with any buffer other than the ReverbBuffer, because the data spec processed by the PyUniformReplayBuffer sets up a wrong buffer structure...
For anyone who has the same problem: I just created this GitHub issue report to get further answers and/or fix my lack of knowledge.
The full fix is shown below...
--> The dimensionality issue was valid and indicates that the (uploaded) batched samples are not in the correct shape
--> This issue happens because the "add_batch" method loads values with the wrong shape
rb_observer = replay_buffer.add_batch
Long story short, this line should be replaced by
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
--> Afterwards the (replay buffer) inputs are of the correct shape and the Learner/Actor setup starts training.
The full replay buffer is shown below:
# create buffer for storing experience
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    agent.collect_data_spec,
    1,
    max_length=1000000)
# create batch dataset
dataset = replay_buffer.as_dataset(
    sample_batch_size=32,
    num_steps=2,
    single_deterministic_pass=False).prefetch(4)
# create batched nested array input for rb_observer
rb_observer = lambda x: replay_buffer.add_batch(batch_nested_array(x))
# create batched readout of dataset
experience_dataset_fn = lambda: dataset
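For intuition, here is a minimal sketch of what batch_nested_array does to a single unbatched sample; the shapes are taken from the error message above, and a bare NumPy array stands in for the actual trajectory nest the actor emits.

import numpy as np
from tf_agents.utils.nest_utils import batch_nested_array

# the actor produces unbatched samples, e.g. an observation of shape (84, 84, 4)
sample = np.zeros((84, 84, 4), dtype=np.float32)
# batch_nested_array adds a leading batch dimension to every array in the nest
batched_sample = batch_nested_array(sample)
print(batched_sample.shape)  # (1, 84, 84, 4) -- the shape add_batch expects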

Getting Tensorflow To Run Faster

I have developed a machine learning Python script (let's call it classify_obj, written with Python 3.6) that imports TensorFlow. It was developed initially for bulk analysis, but now I find the need to run this script repeatedly on smaller datasets to cater to more real-time usage. I am doing this on Linux RHEL 7.
Process Flow:
1. Master tool (written in Java) calls classify_obj with the object input to categorize.
2. classify_obj generates the classification result as a csv (takes about 7-10s).
3. Master tool reads the result from #2.
4. Master tool proceeds to do other logic.
5. Repeat #1 with the next object input.
To break down the time taken, I switched off the main logic and just did the module imports without performing any other action. I found that the imports take about 4-5s out of the 7-10s run time on the small dataset. The classification takes about 2s. I am also looking at other ways to reduce the run time in other areas, but the bulk seems to come from the imports.
Import time: 4-6s
Classify time: 1s
Read, write and other logic time: 0.2s
I am wondering what options there are to reduce the import time.
One idea I had was to modify classify_obj into a "stay alive" process. The master tool, after completing all its activity, will stop this process/service. The intent (not sure if this would be the case) is that all the required libraries are already loaded during process start, and when the master tool calls that process/service, it will only incur the classification time instead of needing to import the libraries repeatedly.
What do you think about this? Also, how can I set this up on Linux RHEL 7.4? Some reference links would be greatly appreciated.
Other suggestions would be greatly appreciated.
Thanks and have a great day!
This is the solution I designed to achieve the above.
Reference: https://realpython.com/python-sockets/
I had to create 2 scripts:
1. Client Python script: used to pass the raw data to be classified to the server Python script using socket programming.
2. Server Python script: loads the Keras (TensorFlow) library and model at launch. It stays alive until a 'stop' request from the client (to exit the while loop). When the client script sends the data to the server script, the server script processes the incoming data and returns an ok/not-ok output back to the client script.
In the end, the classification time is reduced to 0.1 - 0.3s.
Client Script
import socket
from argparse import ArgumentParser

def main():
    parser = ArgumentParser(description='XXXXX')
    parser.add_argument('-i', '--input', default='NA', help='Input txt file path')
    parser.add_argument('-o', '--output', default='NA', help='Output csv path with class')
    parser.add_argument('-stop', '--stop', default='no', help='Stop the server script')
    args = parser.parse_args()
    # pack the arguments into a single comma-separated message
    message = args.input + ',' + args.output + ',' + args.stop
    HOST = '127.0.0.1'  # The server's hostname or IP address
    PORT = 65432        # The port used by the server
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((HOST, PORT))
    bytedata = message.encode()
    sock.send(bytedata)
    data = sock.recv(1024)
    print('Received', data)
    sock.close()

if __name__ == "__main__":
    main()
Server Script
import socket

def main():
    HOST = '127.0.0.1'  # Standard loopback interface address (localhost)
    PORT = 65432        # Port to listen on (non-privileged ports are > 1023)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((HOST, PORT))
    sock.listen(5)
    stop_process = 'no'
    while stop_process == 'no':
        # print('Waiting for connection')
        conn, addr = sock.accept()
        data = ''
        try:
            # print('Connected by', addr)
            while True:
                data = conn.recv(1024)
                if data:
                    # process_input processes the incoming data. If the client sends 'yes'
                    # for the stop argument, process_input sets stop_process to 'yes'.
                    stop_process = process_input(data)
                    byte_reply = stop_process.encode()
                    conn.sendall(byte_reply)  # send reply back to client
                else:
                    break
            # print('Closing connection', addr)
        finally:
            conn.close()

if __name__ == "__main__":
    main()

(Dask) How to distribute expensive resource needed for computation?

What is the best way to distribute a task across a dataset when the computation uses a relatively expensive-to-create resource or object?
# in pandas
df = pd.read_csv(...)
foo = Foo() # expensive initialization.
result = df.apply(lambda x: foo.do(x))
# in dask?
# is it possible to scatter the foo to the workers?
client.scatter(...
I plan on using this with dask_jobqueue with SGECluster.
foo = dask.delayed(Foo)()  # create your expensive thing on the workers instead of locally

def do(row, foo):
    return foo.do(row)

df.apply(do, foo=foo)  # include it as an explicit argument, not a closure within a lambda
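If you specifically want to scatter foo, as hinted in the question, a minimal sketch is below; it assumes Foo is serializable and that the work can be driven through client.map over a plain iterable (rows here is a hypothetical iterable of inputs, not from the original). Unlike the delayed approach above, this builds foo locally once and ships a copy to every worker.

from dask.distributed import Client

client = Client()                                   # or a Client backed by SGECluster
foo = Foo()                                         # expensive object built once, locally
foo_future = client.scatter(foo, broadcast=True)    # copy it to every worker

def do(row, foo):
    return foo.do(row)

# futures passed as arguments are resolved on the workers before do() runs
futures = client.map(do, rows, foo=foo_future)
results = client.gather(futures)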

How to assign tasks to specific worker within Dask.Distributed

I am interested in using Dask Distributed as a task executor.
In Celery it is possible to assign a task to a specific worker. How is this possible using Dask Distributed?
There are 2 options:
1. Specify workers by name, host, or IP (but only positive declarations):
dask-worker scheduler_address:8786 --name worker_1
and then use one of the following:
client.map(func, sequence, workers='worker_1')
client.map(func, sequence, workers=['192.168.1.100', '192.168.1.100:8989', 'alice', 'alice:8989'])
client.submit(f, x, workers='127.0.0.1')
client.submit(f, x, workers='127.0.0.1:55852')
client.submit(f, x, workers=['192.168.1.101', '192.168.1.100'])
future = client.compute(z, workers={z: '127.0.0.1',
                                    x: '192.168.0.1:9999'})
future = client.compute(z, workers={(x, y): ['192.168.1.100', '192.168.1.101:9999']})
2. Use the resources concept. You can specify the resources available to a worker like:
dask-worker scheduler:8786 --resources "CAN_PROCESS_QUEUE_ALICE=2"
and specify required resources like
client.submit(aggregate, processed, resources={'CAN_PROCESS_QUEUE_ALICE': 1})
or
z = some_dask_object.map_partitions(func)
z.compute(resources={tuple(y.__dask_keys__()): {'CAN_PROCESS_QUEUE_ALICE': 1}})
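For completeness, a minimal end-to-end sketch of the first option, assuming a worker has been started with --name worker_1 as shown above (the scheduler address and func are placeholders):

from dask.distributed import Client

# connect to the scheduler that worker_1 registered with
client = Client('scheduler_address:8786')

def func(x):
    return x * 2

# all of these tasks run only on the worker named worker_1
futures = client.map(func, range(10), workers='worker_1')
results = client.gather(futures)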

how to use more than one ps in distributed tensorflow?

I am trying to run distributed TensorFlow, but I have run into some problems.
Firstly, it can process 35 images/sec on a single GPU (GTX TITAN X) on a single host (Intel E5-2630 v3); however, running the distributed code it can only process 26 images/sec per process on 4 GPUs on a single host. Moreover, it can process 8.5 images/sec on 2 hosts, each with 4 GPUs. So the performance of this distributed version seems very poor. Could anybody give me some suggestions as to why I get such poor results?
Secondly, I wonder whether more ps servers can improve the performance. So I tried to use 2 ps servers, but the program blocked with this log message:
CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
I ran the program on a Slurm system, so I used the Python multiprocessing module to start the ps server.
def get_slurm_env():
    node_list = expand_hostlist(os.environ['SLURM_NODELIST'])
    node_id = int(os.environ['SLURM_NODEID'])
    tasks_per_node = int(os.environ['SLURM_NTASKS_PER_NODE'])
    # It is difficult to assign the port and GPU id in the Slurm env.
    # The GPU assigned on a different host is not always the same, and you never know
    # which GPU is assigned on another host.
    # Different Slurm jobs may run on the same machine, so the port numbers may conflict as well.
    task_id = int(os.environ['SLURM_PROCID'])
    task_num = int(os.environ['SLURM_NTASKS'])
    visible_gpu_ids = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    visible_gpu_ids = [int(gpu) for gpu in visible_gpu_ids]
    worker_port_list = [FLAGS.worker_port_start + incr for incr in range(len(visible_gpu_ids))]
    FLAGS.worker_hosts = ["%s:%d" % (name, port) for name in node_list for port in worker_port_list]
    assert len(FLAGS.worker_hosts) == task_num, 'Job count is not equal %d : %d' % (len(FLAGS.worker_hosts), task_num)
    FLAGS.worker_hosts = ','.join(FLAGS.worker_hosts)
    FLAGS.ps_hosts = ["%s:%d" % (name, FLAGS.ps_port_start) for name in node_list]
    FLAGS.ps_hosts = ','.join(FLAGS.ps_hosts)
    FLAGS.job_name = "worker"
    FLAGS.task_id = task_id
    os.environ['CUDA_VISIBLE_DEVICES'] = str(visible_gpu_ids[task_id % tasks_per_node])

def ps_runner(cluster, task_id):
    tf.logging.info('Setup ps process, id: %d' % FLAGS.task_id)
    os.environ['CUDA_VISIBLE_DEVICES'] = ""
    server = tf.train.Server(cluster, job_name="ps", task_index=task_id)
    server.join()
    tf.logging.info('Stop ps process, id: %d' % FLAGS.task_id)

def main(unused_args):
    get_slurm_env()
    # Extract all the hostnames for the ps and worker jobs to construct the
    # cluster spec.
    ps_hosts = FLAGS.ps_hosts.split(',')
    worker_hosts = FLAGS.worker_hosts.split(',')
    tf.logging.info('PS hosts are: %s' % ps_hosts)
    tf.logging.info('Worker hosts are: %s' % worker_hosts)
    cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts,
                                         'worker': worker_hosts})
    if FLAGS.task_id == 0:
        p = multiprocessing.Process(target=ps_runner,
                                    args=({'ps': ps_hosts, 'worker': worker_hosts}, 0))
        p.start()
    server = tf.train.Server({'ps': ps_hosts, 'worker': worker_hosts},
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_id)
    # `worker` jobs will actually do the work.
    dataset = ImagenetData(subset=FLAGS.subset)
    assert dataset.data_files()
    # Only the chief checks for or creates train_dir.
    if FLAGS.task_id == 0:
        if not tf.gfile.Exists(FLAGS.train_dir):
            tf.gfile.MakeDirs(FLAGS.train_dir)
    tf.logging.info('Setup worker process, id: %d' % FLAGS.task_id)
    inception_distributed_train.train(server.target, dataset, cluster_spec)
Are you willing to consider MPI-based solutions, which do not require distributed-memory-specific changes to your code for distributed TensorFlow? We have recently developed a version of user-transparent distributed TensorFlow using MaTEx: https://github.com/matex-org/matex
We will be able to help you, should you face any problems.