In distributed TensorFlow, FIFOQueue.enqueue() sometimes doesn't work

I used the following lines to create a shared queue across different workers.
with tf.device("/job:ps/task:0"):
    with tf.variable_scope("global"):
        self.queue = tf.FIFOQueue(20, None)
In the ps worker:
self.queue.dequeue()
In the other workers:
self.queue.enqueue(something)
However, I found that sometimes the enqueue operation doesn't work: nothing is enqueued, and there is no error or exception. Does anyone have an idea?
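For reference, here is a minimal single-process sketch of the pattern described above (TF1 graph mode, with an illustrative float queue rather than the real one); in the distributed case the queue would simply be pinned to /job:ps/task:0, but enqueue() and dequeue() still only build graph ops, and nothing is actually enqueued until those ops are run in a session:
import tensorflow as tf

# Illustrative single-component queue; the real queue would live on the ps device.
queue = tf.FIFOQueue(20, tf.float32)
enqueue_op = queue.enqueue(1.0)
dequeue_op = queue.dequeue()

with tf.Session() as session:
    session.run(enqueue_op)              # nothing is enqueued until this op runs
    print(session.run(queue.size()))     # 1
    print(session.run(dequeue_op))       # 1.0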

Related

Ray Tune Stuck with multiple runs

Hi, I am trying hyperparameter optimization with Ray Tune.
Below is my code implementation.
However, I get stuck and can't get the result back, even though there aren't any error messages.
@ray.remote
def main(config, run):
    loss = do_something()
    return loss

def ray_pick_best_hypter(config):
    runs = 10
    loss_avg = np.mean(ray.get([main.remote(config, run=x) for x in range(runs)]))
    tune.report(loss_avg=loss_avg)

config = load_config()
analysis = ray.tune.run(ray_pick_best_hypter, config=config, progress_reporter=reporter)
The code below works fine, but I want to run multiple experiments and get the mean value.
def ray_pick_best_hypter(config):
    loss_avg = ray.get(main.remote(config, run=0))
    tune.report(loss_avg=loss_avg)
What is the problem with the code?
It seems you are starting multiple distributed training processes from within your trainable. Each call to main.remote() will start a new distributed task. Since you're starting 10 of them at the same time, they will try to run in parallel.
However, the default resource allocation for each trial is usually just 1 CPU - so the remote tasks cannot be scheduled.
What you can do to resolve this is to pass resources_per_trial={"cpu": 11} - that way each of your remote tasks will have its own CPU to run on.
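For example, a sketch of that change, reusing the names from the snippets above:
analysis = ray.tune.run(
    ray_pick_best_hypter,
    config=config,
    progress_reporter=reporter,
    resources_per_trial={"cpu": 11},  # 1 CPU for the trial itself + 10 for the remote tasks
)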

Problem when predicting via multiprocess with Tensorflow

I have 4 (or more) models (same structure but different training data). Now I want to ensemble them to make a prediction. I want to pre-load the models and then predict one input message (one message at a time) in parallel via multiprocessing. However, the program always stops at the "session.run" step, and I could not figure out why.
I tried passing all arguments to the function in each process, as shown in the code below. I also tried using a Queue object and putting all the data (except the model objects) in the queue. I also tried setting the number of processes to 1. It made no difference.
with Manager() as manager:
    first_level_test_features = manager.list()
    procs = []
    for id in range(4):
        p = Process(target=predict, args=(id, (message, models, configs, vocabs, emoji_dict, first_level_test_features)))
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
I did not get any error message since it just gets stuck there. I would expect the program to start multiple processes, with each process using the model passed to it to make the prediction.
I am unsure how session sharing across different Processes would work, and this is probably where your issue comes from. Given the way TensorFlow works, I would advise implementing the ensemble call as a graph operation, so that it can be run through a single session.run call, with TF handling the parallelization of computations wherever possible.
In practice, if you have symbolic tensors representing the models' predictions, you could use a TF operation to aggregate them (tf.concat, tf.reduce_mean, tf.add_n... whichever suits your design) and end up with a single symbolic tensor representing the ensemble prediction.
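For illustration, here is a minimal sketch of that idea in TF1 graph mode; build_model is just a stand-in dense layer for however your real models are constructed, and the shapes are made up:
import numpy as np
import tensorflow as tf

def build_model(inputs, model_id):
    # Stand-in for one of the pre-loaded models; each gets its own variable
    # scope so the four "models" keep separate weights.
    with tf.variable_scope("model_%d" % model_id):
        return tf.layers.dense(inputs, units=3, activation=tf.nn.softmax)

inputs = tf.placeholder(tf.float32, shape=[None, 8])
model_outputs = [build_model(inputs, i) for i in range(4)]

# Aggregate the symbolic predictions inside the graph (here: averaging).
ensemble_prediction = tf.reduce_mean(tf.stack(model_outputs, axis=0), axis=0)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    batch = np.random.rand(2, 8).astype(np.float32)
    print(session.run(ensemble_prediction, feed_dict={inputs: batch}))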
I hope this helps; if not, please provide some more details about your setting, notably what form your models take.

How to prevent dask client from dying on worker exception?

I'm not understanding the resiliency model in dask distributed.
Problem
An exception raised by a worker kills the embarrassingly parallel dask operation. All workers and clients die if any worker encounters an exception.
Expected Behavior
Reading here: http://distributed.dask.org/en/latest/resilience.html#user-code-failures
This suggests that exceptions should be contained to workers and that subsequent tasks will go on without interruption.
"When a function raises an error that error is kept and transmitted to the client on request. Any attempt to gather that result or any dependent result will raise that exception...This does not affect the smooth operation of the scheduler or worker in any way."
I was following the embarrassingly parallel use case here:
http://docs.dask.org/en/latest/use-cases.html
Reproducible example
import numpy as np
np.random.seed(0)

from dask import compute, delayed
from dask.distributed import Client, LocalCluster

def raise_exception(x):
    if x == 10:
        raise ValueError("I'm an error on a worker")
    elif x == 20:
        print("I've made it to 20")
    else:
        return x

if __name__ == "__main__":
    # Create cluster
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    values = [delayed(raise_exception)(x) for x in range(0, 100)]
    results = compute(*values, scheduler='distributed')
Task 20 is never accomplished. The exception on task 10 causes the scheduler and workers to die. What am I not understanding about the programming model? Why does this count as gathering? I just want to run each task and capture any exceptions for later inspection, not raise them on the client.
Use Case
Parallel image processing on a University SLURM cluster. My function has a side-effect that saves processed images to file. The processes are independent and never gathered by the scheduler. The exception causes all nodes to die on the cluster.
Cross-listed on issues, since I'm not sure if this is a bug or a feature!
https://github.com/dask/distributed/issues/2436
Answered in the repo: dask delayed's compute is all-or-nothing. Use Client.map from the concurrent.futures interface plus wait instead. This is by design, not a bug.
https://github.com/dask/distributed/issues/2436
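For anyone landing here, a rough sketch of the futures-based approach suggested in that issue, reusing the raise_exception function from the question:
from dask.distributed import Client, LocalCluster, wait

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)

    # Submit each task as its own future instead of one all-or-nothing compute.
    futures = client.map(raise_exception, range(100))
    wait(futures)  # block until every task has either finished or errored

    # Inspect failures on the client without re-raising them.
    for future in futures:
        if future.status == "error":
            print("task failed:", future.exception())
        else:
            print("result:", future.result())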

Multiple queues causing TF to lock up

I'm trying to use multiple queues for reading and batching, but this is causing TF to occasionally lock up. Here's some sample code:
import tensorflow as tf
coordinator = tf.train.Coordinator()
file_queue = tf.train.string_input_producer(tf.train.match_filenames_once(...))
reader = tf.TextLineReader()
key, serialized_example = reader.read(file_queue)
keys, serialized_examples = tf.train.batch([key, serialized_example], 10)
# repeat the code snippet below multiple times, in my case 4
file_queue_i = tf.train.string_input_producer(tf.train.match_filenames_once(...))
reader_i = tf.TextLineReader()
key_i, serialized_example_i = reader.read(file_queue_i)
initializer = tf.initialize_all_variables()
session = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1))
session.run(initializer)
threads = tf.train.start_queue_runners(sess=session, coord=coordinator)
session.run(keys)
TensorFlow occasionally locks up at the last line, when I actually try to run something. This behavior is rather hard to reproduce using the above code, however. In 1000+ runs, I could only get it to hang once. In my real code, the actual reader is more complicated, and it's using TFRecords, but otherwise everything is the same. There it hangs 2/3 of the time with 3 queues in total. With 5 queues it seemingly never runs, and with 1 queue it seemingly never hangs. This is on a Mac with TensorFlow 0.6. I have a different system running Ubuntu, also with 0.6, and I get the same problem (although the frequency of locking up is much higher on the Ubuntu system).
UPDATE: A more accurate estimate of how often the above code locks up is 1 in 5,000 trials.
This is probably caused by not having enough operation threads. If queue runner 1 depends on the work of queue runner 2, and you run them asynchronously, then you'll need at least two op threads, set through inter_op_parallelism_threads, to guarantee that progress is being made.
In your case, you have a queue runner filling the batch queue that depends on the string_input_producer queue being non-empty. If the queue runner associated with the string_input_producer queue runs first, then everything is fine. But if the batch queue runner is scheduled first, it will get stuck in the string_input_producer.dequeue op, waiting for the string_input_producer queue to get some filenames. Since there's only 1 thread in the TensorFlow op thread pool, the enqueue op of string_input_producer will never get allocated a thread to complete (i.e., to execute its Compute method).
The simplest solution is to have at least as many operation threads as you have simultaneous run calls (i.e., number of queues + 1). If you really want to restrict yourself to one thread, you can preload the filename queue with filenames synchronously from the main thread, as in the snippet below.
coordinator = tf.train.Coordinator()

import glob
files = glob.glob('/temp/pipeline/*')

if FLAGS.preload_filenames:
    file_queue = tf.FIFOQueue(capacity=len(files), dtypes=tf.string)
    enqueue_val = tf.placeholder(dtype=tf.string)
    enqueue_op = file_queue.enqueue(enqueue_val)
else:
    file_queue = tf.train.string_input_producer(files)

reader = tf.TextLineReader()
key, serialized_example = reader.read(file_queue)
keys, serialized_examples = tf.train.batch([key, serialized_example], 5,
                                           capacity=10)
initializer = tf.initialize_all_variables()
session = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
                                           intra_op_parallelism_threads=1))

print 'running initializer'
session.run(initializer)

if FLAGS.preload_filenames:
    print 'preloading filenames'
    for fn in files:
        session.run([enqueue_op], feed_dict={enqueue_val: fn})
    print 'size - ', session.run([file_queue.size()])
    session.run([file_queue.close()])

print 'starting queue runners'
threads = tf.train.start_queue_runners(sess=session, coord=coordinator)

print 'about to run session'
print session.run(keys)
The code above will need some encapsulation if you have more than one filenames queue. Alternatively, here's a hacky workaround which should work if there are exactly prebuffer_amount filenames for all input_producer queues:
queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS)
filename_queue_runners = [qr for qr in queue_runners if 'input_producer' in qr.name]
for qr in filename_queue_runners:
    for k in range(prebuffer_amount):
        sess.run(qr._enqueue_ops[0])
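For reference, if you go with the first option (more op threads) instead, one way to size the thread pool is off the queue-runner collection rather than counting queues by hand; a sketch:
num_queue_runners = len(tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS))
session = tf.Session(config=tf.ConfigProto(
    inter_op_parallelism_threads=num_queue_runners + 1,  # one thread per runner plus the main run call
    intra_op_parallelism_threads=1))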

Parallelism in (I)Python with large blocks of data

I've been toiling with threads and processes for a while now to try to speed up my very parallel job in IPython. I'm not sure how much detail about the function I'm calling is useful, so here's a quick summary, but ask if you need more.
My function's call signature looks like
def intersplit_array(ob,er,nl,m,mi,t,ti,dmax,n0=6,steps=50):
Basically, ob, er and nl are parameters for observed values, and m, mi, t, ti and dmax are parameters that represent models against which the observations will be compared. (n0 and steps are fixed numerical parameters for the function.) The function loops through all the models in m and, using associated information in mi, t, ti and dmax, calculates a probability that each model matches. Note that m is quite big: it's a list of about 700,000 22x3 NumPy arrays. mi and dmax are of similar sizes. If relevant, my normal IPython instance uses about 25% of system memory in top: 4GB of my 16GB of RAM.
I've tried to parallelize this in two ways. First, I tried to use the parallel_map function given over at the SciPy Cookbook. I made the call
P = parallel_map(lambda i: intersplit_array(ob, er, nl, m[i+1], mi[i:i+2], t[i+1], ti[i:i+2], dmax[i+1]), range(1, len(m)-1))
which runs, and provides the correct answer. Without the parallel_ part, this is just the result of applying the function one by one to each element. But this is slower than using a single core. I guess this is related to the Global Interpreter Lock?
Second, I tried to use a Pool from multiprocessing. I initialized a pool with
p = multiprocessing.Pool(6)
and then tried to call my function with
P = p.map(lambda i: intersplit_array(ob, er, nl, m[i+1], mi[i:i+2], t[i+1], ti[i:i+2], dmax[i+1]), range(1, len(m)-1))
First, I get an error.
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 551, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 504, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 319, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Having a look in top, I then see all the extra IPython processes, each of which is apparently taking up 25% of RAM (which can't be so, because I've still got 4GB free) and using 0% CPU. I presume they aren't doing anything. I can't use IPython, either. I tried Ctrl-C for a while, but gave up once I got past the 300th pool worker.
Does it work when you run it non-interactively?
multiprocessing doesn't play well with interactive use because of the way it spawns processes. This is also why you had trouble killing it after it spawned so many processes: you would have to track down the master process to cancel it.
From the documentation:
Note
Functionality within this package requires that the __main__ module be importable by the children. This is covered in Programming guidelines however it is worth pointing out here. This means that some examples, such as the multiprocessing.Pool examples will not work in the interactive interpreter.
...
If you try this it will actually output full tracebacks interleaved in a semi-random fashion, and then you may have to stop the master process somehow.
The best solution is probably to just run it as a script from the command line. Alternatively, IPython has its own system for parallel computing, but I've never used it.
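If you do go the script route, a rough sketch of what that might look like, reusing the arrays and intersplit_array from the question (which would need to be defined or imported at module level); note that a named module-level function also avoids the PicklingError you saw, since lambdas can't be pickled:
import multiprocessing

def worker(i):
    # Module-level function so Pool.map can pickle it (a lambda cannot be pickled).
    return intersplit_array(ob, er, nl, m[i+1], mi[i:i+2], t[i+1],
                            ti[i:i+2], dmax[i+1])

if __name__ == '__main__':
    pool = multiprocessing.Pool(6)
    P = pool.map(worker, range(1, len(m) - 1))
    pool.close()
    pool.join()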