How to assign tasks to a specific worker within Dask.Distributed - dask-distributed

I am interested in using Dask Distributed as a task executor.
In Celery it is possible to assign a task to a specific worker. How can I do this with Dask Distributed?

There are 2 options:
Option 1: Specify workers by name, host, or IP (but only positive declarations):
dask-worker scheduler_address:8786 --name worker_1
and then use one of the following:
client.map(func, sequence, workers='worker_1')
client.map(func, sequence, workers=['192.168.1.100', '192.168.1.100:8989', 'alice', 'alice:8989'])
client.submit(f, x, workers='127.0.0.1')
client.submit(f, x, workers='127.0.0.1:55852')
client.submit(f, x, workers=['192.168.1.101', '192.168.1.100'])
future = client.compute(z, workers={z: '127.0.0.1',
                                    x: '192.168.0.1:9999'})
future = client.compute(z, workers={(x, y): ['192.168.1.100', '192.168.1.101:9999']})
Option 2: Use the Resources concept. You can declare the resources available on a worker like:
dask-worker scheduler:8786 --resources "CAN_PROCESS_QUEUE_ALICE=2"
and specify the resources a task requires like:
client.submit(aggregate, processed, resources={'CAN_PROCESS_QUEUE_ALICE': 1})
or
z = some_dask_object.map_partitions(func)
z.compute(resources={tuple(z.__dask_keys__()): {'CAN_PROCESS_QUEUE_ALICE': 1}})
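For a quick end-to-end check of both options, here is a minimal sketch; the worker name worker_1 and the resource CAN_PROCESS_QUEUE_ALICE are just the examples from above, and the scheduler address localhost:8786 is an assumption:
from dask.distributed import Client

# Assumes a scheduler at localhost:8786 and a worker started with:
#   dask-worker localhost:8786 --name worker_1 --resources "CAN_PROCESS_QUEUE_ALICE=2"
client = Client('localhost:8786')

def process(item):
    return item * 2

# Pin every task to the worker named "worker_1"
futures = client.map(process, range(10), workers='worker_1')

# Or let the scheduler place the task on any worker advertising the resource
future = client.submit(process, 42, resources={'CAN_PROCESS_QUEUE_ALICE': 1})

print(client.gather(futures), future.result())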

Related

How to parallelize groupby() in dask?

I tried:
df.groupby('name').agg('count').compute(num_workers=1)
df.groupby('name').agg('count').compute(num_workers=4)
They take the same time. Why doesn't num_workers have any effect?
Thanks
By default, Dask runs your tasks with the threaded scheduler, which means everything runs in a single process on your computer. (Note that using Dask is nevertheless worthwhile if you have data that can't fit in memory.)
If you want to use several processes to compute your operation, you have to use a different scheduler:
import time
from dask import dataframe as dd
from dask.distributed import LocalCluster, Client

df = dd.read_csv("data.csv")

def group(num_workers):
    start = time.time()
    res = df.groupby("name").agg("count").compute(num_workers=num_workers)
    end = time.time()
    return res, end - start

# Default threaded scheduler
print(group(4))

# Switch to a local multi-process cluster
clust = LocalCluster()
clt = Client(clust, set_as_default=True)
print(group(4))
Here, I create a local cluster with 4 parallel processes (because I have a quad-core machine) and then set a default client that will use this local cluster to perform the Dask operations. With a two-column CSV file of 1.5 GB, the standard groupby takes around 35 seconds on my laptop, whereas the multiprocess one only takes around 22 seconds.
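If you would rather control the process/thread split yourself than rely on the defaults, LocalCluster accepts explicit parameters; a small sketch (the numbers are only an illustration):
from dask.distributed import LocalCluster, Client

# One thread per process sidesteps GIL contention for pandas-heavy work;
# 4 processes matches a quad-core machine.
clust = LocalCluster(n_workers=4, threads_per_worker=1)
clt = Client(clust, set_as_default=True)
print(clt)  # shows how many workers and threads were actually started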

Code implementation of the Redis "Pattern: Reliable queue"

The excellent Redis documentation lists a Reliable queue pattern as a good candidate/example for the RPOPLPUSH command.
I understand "reliable queue" to mean something with delivery guarantees like Amazon SQS's FIFO exactly-once pattern.
Specifically, you have some N processes feeding into a queue, and some M workers working from the queue. What does this actually look like as an implementation?
I would venture something like:
Make a feeder process that populates the work queue.
# feeder1
import redis
import datetime
import time

r = redis.Redis(host='localhost', port=6379, db=0)

while True:
    now = datetime.datetime.now()
    value_to_work_on = "f1:{}".format(now.second)
    r.lpush('workqueue', value_to_work_on)
    time.sleep(1)
Make another feeder:
# f2
import redis
import datetime
import time

r = redis.Redis(host='localhost', port=6379, db=0)

while True:
    now = datetime.datetime.now()
    value_to_work_on = "f2:{}".format(now.second)
    r.lpush('workqueue', value_to_work_on)
    time.sleep(1)
Now make the workers
# worker1
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def do_work(x):
    print(x)
    return True

while True:
    # Atomically move the item from the work queue to the done queue
    todo = r.rpoplpush("workqueue", "donequeue")
    if do_work(todo):
        print("success")
    else:
        r.lpush("workqueue", todo)

# worker2 is exactly the same, just running elsewhere.
My questions are:
Is this generally what they mean in the documentation? If not, can you provide a fix as an answer?
This still seems incomplete and not really reliable. For example, should there be separate lists for error versus completed queues? One for every possible error state? What happens if your Redis goes down during processing?
As #rainhacker pointed out in the comments, it is now recommended to use Redis Streams for this instead of the recipe described in "Pattern: Reliable Queue".
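For completeness, here is a minimal sketch of the Streams approach with redis-py; the stream name workstream, group name workers, and consumer name worker1 are made up for illustration:
import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# Producer side: append work items to a stream
r.xadd('workstream', {'payload': 'f1:42'})

# One-time setup: create a consumer group (mkstream creates the stream if missing)
try:
    r.xgroup_create('workstream', 'workers', id='0', mkstream=True)
except redis.ResponseError:
    pass  # group already exists

# Consumer side: each worker reads new messages for the group and acks on success
while True:
    resp = r.xreadgroup('workers', 'worker1', {'workstream': '>'}, count=1, block=5000)
    if not resp:
        continue
    for stream, messages in resp:
        for msg_id, fields in messages:
            print(fields)  # do the actual work here
            r.xack('workstream', 'workers', msg_id)
    # Un-acked messages stay in the pending entries list and can be reclaimed
    # with XCLAIM/XAUTOCLAIM if this worker dies mid-task.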

(Dask) How to distribute expensive resource needed for computation?

What is the best way to distribute a task across a dataset when the computation uses a relatively expensive-to-create resource or object?
# in pandas
df = pd.read_csv(...)
foo = Foo() # expensive initialization.
result = df.apply(lambda x: foo.do(x))
# in dask?
# is it possible to scatter the foo to the workers?
client.scatter(...
I plan on using this with dask_jobqueue with SGECluster.
foo = dask.delayed(Foo)()  # create your expensive thing on the workers instead of locally

def do(row, foo):
    return foo.do(row)

df.apply(do, foo=foo)  # include it as an explicit argument, not a closure within a lambda
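If Foo can be built on the client and pickled, an alternative sketch (not necessarily what the answer above intends) is to create it once, scatter a copy to every worker, and pass the resulting future into per-partition tasks:
from dask.distributed import Client

client = Client()  # or the client attached to your SGECluster

# Build Foo once on the client and ship one copy to every worker
foo_future = client.scatter(Foo(), broadcast=True)

def process_partition(partition, foo):
    return partition.apply(lambda row: foo.do(row), axis=1)

# Turn the dask dataframe into per-partition futures, then pair each
# partition with the scattered Foo; futures passed as arguments are
# resolved on the worker before the task runs.
partition_futures = client.compute(df.to_delayed())
futures = [client.submit(process_partition, part, foo_future)
           for part in partition_futures]
results = client.gather(futures)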

How do I use device_filters with tf.contrib.learn.Experiment?

By default, TensorFlow distributed training establishes all-to-all connections between workers and parameter servers, even though in asynchronous distributed training, the only necessary communication is between each individual worker and the parameter servers.
How do I limit communication when I'm using tf.contrib.learn.Experiment?
# The easiest way to parse TF_CONFIG environment variable is to create a RunConfig.
# Unfortunately, it is an immutable object, so we're going to create a
# temporary one and only use it for `task_type` and `task_id`.
tmp = tf.contrib.learn.RunConfig()
task_type, task_id = tmp.task_type, tmp.task_id
# We use a device_filter to limit the communication between this job
# and the parameter servers, i.e., there is no need to directly
# communicate with the other workers; attempting to do so can result
# in reliability problems.
device_filters = [
    '/job:ps', '/job:%s/task:%d' % (task_type, task_id)
]
session_config = tf.ConfigProto(device_filters=device_filters)
run_config = tf.contrib.learn.RunConfig(
    model_dir=args.job_dir,
    session_config=session_config)
# Create the experiment_fn:
experiment_fn = ...
# Run the experiment
learn_runner.run(experiment_fn, run_config=run_config)
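For reference, the TF_CONFIG environment variable that RunConfig parses typically looks something like the following; the host names, ports, and job layout here are made up for illustration:
import json
import os

# Illustrative only: in a real deployment the cluster manager
# (e.g. Cloud ML Engine) sets TF_CONFIG for each process.
os.environ['TF_CONFIG'] = json.dumps({
    'cluster': {
        'ps': ['ps0.example.com:2222'],
        'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
        'master': ['master0.example.com:2222'],
    },
    'task': {'type': 'worker', 'index': 0},
})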

How to use more than one ps in distributed TensorFlow?

I am trying to run distributed TensorFlow, but I have run into some problems.
Firstly, it can process 35 images/sec on a single GPU (GTX TITAN X) on a single host (Intel E5-2630 v3), but the distributed code only processes 26 images/sec per process on 4 GPUs on a single host. Moreover, it only reaches 8.5 images/sec on 2 hosts, each with 4 GPUs. So the performance of this distributed version seems very poor. Could anybody suggest why I get such poor results?
Secondly, I wonder whether more ps servers could improve the performance. I tried to use 2 ps servers, but the program blocked with this log message:
CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
I ran the program on a Slurm system, so I used the Python multiprocessing module to start the ps server.
def get_slurm_env():
    node_list = expand_hostlist(os.environ['SLURM_NODELIST'])
    node_id = int(os.environ['SLURM_NODEID'])
    tasks_per_node = int(os.environ['SLURM_NTASKS_PER_NODE'])
    # It is difficult to assign the port and gpu id in a slurm env.
    # The gpu assigned on different hosts is not always the same, and you never know
    # which gpu is assigned on another host.
    # Different slurm jobs may run on the same machine, so the port numbers may conflict as well.
    task_id = int(os.environ['SLURM_PROCID'])
    task_num = int(os.environ['SLURM_NTASKS'])
    visible_gpu_ids = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    visible_gpu_ids = [int(gpu) for gpu in visible_gpu_ids]
    worker_port_list = [FLAGS.worker_port_start + incr for incr in range(len(visible_gpu_ids))]
    FLAGS.worker_hosts = ["%s:%d" % (name, port) for name in node_list for port in worker_port_list]
    assert len(FLAGS.worker_hosts) == task_num, 'Job count is not equal %d : %d' % (len(FLAGS.worker_hosts), task_num)
    FLAGS.worker_hosts = ','.join(FLAGS.worker_hosts)
    FLAGS.ps_hosts = ["%s:%d" % (name, FLAGS.ps_port_start) for name in node_list]
    FLAGS.ps_hosts = ','.join(FLAGS.ps_hosts)
    FLAGS.job_name = "worker"
    FLAGS.task_id = task_id
    os.environ['CUDA_VISIBLE_DEVICES'] = str(visible_gpu_ids[task_id % tasks_per_node])

def ps_runner(cluster, task_id):
    tf.logging.info('Setup ps process, id: %d' % FLAGS.task_id)
    os.environ['CUDA_VISIBLE_DEVICES'] = ""
    server = tf.train.Server(cluster, job_name="ps", task_index=task_id)
    server.join()
    tf.logging.info('Stop ps process, id: %d' % FLAGS.task_id)

def main(unused_args):
    get_slurm_env()
    # Extract all the hostnames for the ps and worker jobs to construct the
    # cluster spec.
    ps_hosts = FLAGS.ps_hosts.split(',')
    worker_hosts = FLAGS.worker_hosts.split(',')
    tf.logging.info('PS hosts are: %s' % ps_hosts)
    tf.logging.info('Worker hosts are: %s' % worker_hosts)
    cluster_spec = tf.train.ClusterSpec({'ps': ps_hosts,
                                         'worker': worker_hosts})
    if FLAGS.task_id == 0:
        p = multiprocessing.Process(target=ps_runner,
                                    args=({'ps': ps_hosts, 'worker': worker_hosts}, 0))
        p.start()
    server = tf.train.Server({'ps': ps_hosts,
                              'worker': worker_hosts},
                             job_name=FLAGS.job_name,
                             task_index=FLAGS.task_id)
    # `worker` jobs will actually do the work.
    dataset = ImagenetData(subset=FLAGS.subset)
    assert dataset.data_files()
    # Only the chief checks for or creates train_dir.
    if FLAGS.task_id == 0:
        if not tf.gfile.Exists(FLAGS.train_dir):
            tf.gfile.MakeDirs(FLAGS.train_dir)
    tf.logging.info('Setup worker process, id: %d' % FLAGS.task_id)
    inception_distributed_train.train(server.target, dataset, cluster_spec)
Are you willing to consider MPI-based solutions which do not require distributed-memory-specific changes to your code for distributed TensorFlow? We have recently developed a version of user-transparent distributed TensorFlow using MaTEx. https://github.com/matex-org/matex
We will be able to help you, should you face any problems.