Multiple queues causing TF to lock up - tensorflow

I'm trying to use multiple queues for reading and batching, but this is causing TF to occasionally lock up. Here's some sample code:
import tensorflow as tf
coordinator = tf.train.Coordinator()
file_queue = tf.train.string_input_producer(tf.train.match_filenames_once(...))
reader = tf.TextLineReader()
key, serialized_example = reader.read(file_queue)
keys, serialized_examples = tf.train.batch([key, serialized_example], 10)
# repeat the code snippet below multiple times, in my case 4
file_queue_i = tf.train.string_input_producer(tf.train.match_filenames_once(...))
reader_i = tf.TextLineReader()
key_i, serialized_example_i = reader.read(file_queue_i)
initializer = tf.initialize_all_variables()
session = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1))
session.run(initializer)
threads = tf.train.start_queue_runners(sess=session, coord=coordinator)
session.run(keys)
TensorFlow occasionally locks up at the last line, when I actually try to run something. This behavior is rather hard to reproduce using the above code however. In 1000+ runs, I could only get it to hang once. In my real code, the actual reader is more complicated, and it's using TFRecords, but otherwise everything is the same. There it hangs up 2/3 of the time with 3 queues in total. With 5 queues it seemingly never runs, and with 1 queue it seemingly never hangs. This is on a Mac with 0.6. I have a different system running Ubuntu, also with 0.6, and I get the same problem (although the frequency of locking up is much higher on the Ubuntu system).
UPDATE: A more accurate estimate of how often the above code locks up is 1 in 5,000 trials.

This is probably caused by not having enough operation threads. If queue runner 1 depends on the work of queue runner 2, and you run them asynchronously, then you need at least two op threads (set through inter_op_parallelism_threads) to guarantee that progress is made.
In your case, the queue runner that fills the batch queue depends on the string_input_producer queue being non-empty. If the queue runner associated with the string_input_producer queue runs first, everything is fine. But if the batch queue runner is scheduled first, it gets stuck in the string_input_producer.dequeue op, waiting for the string_input_producer queue to receive some filenames. Since there is only one thread in the TensorFlow op thread pool, the enqueue op of string_input_producer never gets allocated a thread to complete (i.e., to execute its Compute method).
The simplest solution is to have at least as many operation threads as you have simultaneous run calls (i.e., number of queues + 1). If you really want to restrict yourself to one thread, you could preload the filenames into the filename queue synchronously from the main thread:
coordinator = tf.train.Coordinator()
import glob
files = glob.glob('/temp/pipeline/*')
if FLAGS.preload_filenames:
    file_queue = tf.FIFOQueue(capacity=len(files), dtypes=tf.string)
    enqueue_val = tf.placeholder(dtype=tf.string)
    enqueue_op = file_queue.enqueue(enqueue_val)
else:
    file_queue = tf.train.string_input_producer(files)
reader = tf.TextLineReader()
key, serialized_example = reader.read(file_queue)
keys, serialized_examples = tf.train.batch([key, serialized_example], 5,
                                           capacity=10)
initializer = tf.initialize_all_variables()
session = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
                                           intra_op_parallelism_threads=1))
print 'running initializer'
session.run(initializer)
if FLAGS.preload_filenames:
    print 'preloading filenames'
    for fn in files:
        session.run([enqueue_op], feed_dict={enqueue_val: fn})
    print 'size - ', session.run([file_queue.size()])
    session.run([file_queue.close()])
print 'starting queue runners'
threads = tf.train.start_queue_runners(sess=session, coord=coordinator)
print 'about to run session'
print session.run(keys)
The code above will need some encapsulation if you have more than one filename queue. Alternatively, here's a hacky workaround which should work if there are exactly prebuffer_amount filenames for every input_producer queue:
queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS)
filename_queue_runners = [qr for qr in queue_runners if 'input_producer' in qr.name]
for qr in filename_queue_runners:
    for k in range(prebuffer_amount):
        sess.run(qr._enqueue_ops[0])
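For completeness, the simpler fix mentioned at the top (just give the op thread pool enough threads) could look like the following sketch, which sizes the pool from the queue runners registered in the graph rather than from a hard-coded number:
# One inter-op thread per registered queue runner, plus one for the main
# run call, so no runner can starve the others of a thread (a sketch of
# the "enough op threads" fix described above).
num_queue_runners = len(tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS))
session = tf.Session(config=tf.ConfigProto(
    inter_op_parallelism_threads=num_queue_runners + 1,
    intra_op_parallelism_threads=1))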

Related

Ray Tune Stuck with multiple runs

Hi, I am trying hyperparameter optimization with Ray Tune.
Below is my code implementation.
However, I get stuck and can't get the result back, even though there aren't any error messages.
@ray.remote
def main(config, run):
    # ... do something ...
    return loss

def ray_pick_best_hypter(config):
    runs = 10
    loss_avg = np.mean(ray.get([main.remote(config, run=x) for x in range(runs)]))
    tune.report(loss_avg=loss_avg)

config = load_config()
analysis = ray.tune.run(ray_pick_best_hypter, config=config, progress_reporter=reporter)
The below code works fine, but I want to run multiple experiments and get the mean value.
def ray_pick_best_hypter(config):
    loss_avg = ray.get(main.remote(config, run=x))
    tune.report(loss_avg=loss_avg)
What is the problem in the code?
It seems you are starting multiple distributed training processes from within your trainable. Each call to main.remote() will start a new distributed task. Since you're starting 10 of them at the same time, they will try to run in parallel.
However, the default resource allocation for each trial is usually just 1 CPU - so the remote tasks cannot be scheduled.
What you can do to resolve this is to pass resources_per_trial={"cpu": 11} (one CPU for the trial function itself plus ten for the parallel tasks) - that way each of your remote tasks will have its own CPU to run on.
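A minimal sketch of that change, reusing the names from the question (config, reporter and ray_pick_best_hypter are assumed to be defined as in the original post):
# Reserve 1 CPU for the trial function itself plus 10 for the parallel
# main.remote() calls it launches.
analysis = ray.tune.run(
    ray_pick_best_hypter,
    config=config,
    progress_reporter=reporter,
    resources_per_trial={"cpu": 11},
)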

How to prevent dask client from dying on worker exception?

I'm not understanding the resiliency model in dask distributed.
Problem
An exception raised by a worker kills the embarrassingly parallel dask operation. All workers and clients die if any worker encounters an exception.
Expected Behavior
Reading here: http://distributed.dask.org/en/latest/resilience.html#user-code-failures
This suggests that exceptions should be contained to workers and that subsequent tasks should go on without interruption.
"When a function raises an error that error is kept and transmitted to the client on request. Any attempt to gather that result or any dependent result will raise that exception...This does not affect the smooth operation of the scheduler or worker in any way."
I was following the embarrassingly parallel use case here:
http://docs.dask.org/en/latest/use-cases.html
Reproducible example
import numpy as np
np.random.seed(0)
from dask import compute, delayed
from dask.distributed import Client, LocalCluster

def raise_exception(x):
    if x == 10:
        raise ValueError("I'm an error on a worker")
    elif x == 20:
        print("I've made it to 20")
    else:
        return(x)

if __name__ == "__main__":
    # Create cluster
    cluster = LocalCluster(n_workers=2, threads_per_worker=1)
    client = Client(cluster)
    values = [delayed(raise_exception)(x) for x in range(0, 100)]
    results = compute(*values, scheduler='distributed')
Task 20 is never accomplished. The exception on task 10 causes the scheduler and workers to die. What am I not understanding about the programming model? Why does this count as gathering? I just want to run each task and capture any exceptions for later inspection, not raise them on the client.
Use Case
Parallel image processing on a University SLURM cluster. My function has a side-effect that saves processed images to file. The processes are independent and never gathered by the scheduler. The exception causes all nodes to die on the cluster.
Cross-listed on issues, since I'm not sure if this is a bug or a feature!
https://github.com/dask/distributed/issues/2436
Answered in the repo: dask delayed with compute is all-or-nothing; use map from the futures interface plus wait instead. This is by design, not a bug.
https://github.com/dask/distributed/issues/2436
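For reference, a minimal sketch of that approach (client.map plus wait from the futures interface), reusing raise_exception and the LocalCluster setup from the question:
from dask.distributed import Client, LocalCluster, wait

cluster = LocalCluster(n_workers=2, threads_per_worker=1)
client = Client(cluster)

# Submit every task; a failure stays attached to its own future instead of
# killing the whole computation.
futures = client.map(raise_exception, range(100))
wait(futures)  # block until each task has either finished or errored

# Inspect failures later without re-raising them on the client.
errors = {f.key: f.exception() for f in futures if f.status == 'error'}
results = [f.result() for f in futures if f.status == 'finished']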

feed data into a tf.contrib.data.Dataset like a queue

About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage:
The way it expects to get data doesn't really fit the way I usually get my data. In my case, I have a thread in which I receive data; I don't know in advance when it will end, but I can see when it ends. Then I wait until I have processed all the buffers, and then I have finished one epoch. How can I express this logic with the Dataset?
Note that I prefer the Dataset interface over the QueueBase interface because it gives me the iterator interface, which I can reinitialize and even reset to a different Dataset. This is more powerful than queues, which currently cannot be reopened after they are closed (see here and here).
Maybe a similar question, or the same question: How can I wrap a Dataset around a queue? I have some thread which reads data from somewhere and which can feed and queue it somehow. How do I get that data into the Dataset? I could repeat some dummy tensor an infinite number of times and then use map to just return my queue.dequeue(), but that really only gets me back to all the original problems with the queue, i.e. how to reopen the queue.
The new Dataset.from_generator() method allows you to define a Dataset that is fed by a Python generator. (To use this feature at present, you must download a nightly build of TensorFlow or build it yourself from source. It will be part of TensorFlow 1.4.)
The easiest way to implement your example would be to replace your receiving thread with a generator, with pseudocode as follows:
def receiver():
    while True:
        next_element = ...  # Receive next element from external source.
                            # Note that this method may block.
        end_of_epoch = ...  # Decide whether or not to stop based on next_element.
        if not end_of_epoch:
            yield next_element  # Note: you may need to convert this to an array.
        else:
            return  # Returning will signal OutOfRangeError on downstream iterators.

dataset = tf.contrib.data.Dataset.from_generator(receiver, output_types=...)

# You can chain other `Dataset` methods after the generator. For example:
dataset = dataset.prefetch(...)  # This will start a background thread
                                 # to prefetch elements from `receiver()`.
dataset = dataset.repeat(...)    # Note that each repetition will call
                                 # `receiver()` again, and start from
                                 # a fresh state.
dataset = dataset.batch(...)
More complicated topologies are possible. For example, you can use Dataset.interleave() to create many receivers in parallel.
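Not part of the original answer, but for completeness, one way to drive the resulting pipeline from a session might look like the sketch below (the ... placeholders above still need to be filled in):
iterator = dataset.make_initializable_iterator()
next_batch = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    try:
        while True:
            batch = sess.run(next_batch)  # pulls batches produced by receiver()
            # ... train on batch ...
    except tf.errors.OutOfRangeError:
        pass  # receiver() returned, so the epoch is over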

In distributed tensorflow, FIFOQueue.enqueue() sometimes doesn't work

I used the following lines to create a shared queue across different workers.
with tf.device("/job:ps/task:0"):
    with tf.variable_scope("global"):
        self.queue = tf.FIFOQueue(20, None)
In the ps worker:
self.queue.dequeue()
In the other workers:
self.queue.enqueue(somethings)
However, I found that sometimes the enqueue operation doesn't work: nothing is enqueued, and there is no error or exception. Does anyone have an idea?

How to implement a double queue structure in tensorflow

I am using TensorFlow to implement a neural network, and want to achieve the following architecture: there are 2 queues, namely Q1 and Q2. Q1 is initialised with some file names, and Q2 will be filled with examples later.
Every time the session runs a step, a file name is popped from Q1 and enters a processing part. In the processing part, data is read from the file, and some number of different examples, say 32, are generated from the data. Then, the generated 32 examples are enqueued into Q2. Once Q2 reaches some limit, a batch of examples is dequeued.
In particular, I will generate nearly 1M examples every time I read from a file, so this process must run in the background without blocking the main thread, and enqueueing into Q2 must happen asynchronously.
I failed to find a solution. I have tried something like the following:
import tensorflow as tf
q1 = tf.FIFOQueue(capacity=32, dtypes=tf.int32)
init_op = q1.enqueue_many(([0, 1, 2],))
q2 = tf.FIFOQueue(capacity=64, dtypes=tf.int32)
r = q1.dequeue()
# mimic generating examples from data read from the file
for i in range(10):
    enq_op = q2.enqueue(r * 10 + i)
s = q2.dequeue()
sess = tf.InteractiveSession()
sess.run(init_op)
# don't know what to do
sess.close()
Could anyone help!
One problem I see is that you are confusing graph construction and execution. Your for i in range(10) loop creates a bunch of enqueue ops; it doesn't actually add r * 10 + i to your queue.
I recommend going through the queues tutorial first to understand the basic concepts -- https://www.tensorflow.org/versions/r0.9/how_tos/threading_and_queues/index.html . Also this
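To make the distinction concrete, here is a rough sketch (not from the original answer, and assuming TF 1.x where tf.stack is available) of one way to wire the two queues so the example generation actually runs, using a QueueRunner to fill Q2 from a background thread:
import tensorflow as tf

q1 = tf.FIFOQueue(capacity=32, dtypes=tf.int32)
init_op = q1.enqueue_many(([0, 1, 2],))
# Give q2 a known element shape so dequeue_many works.
q2 = tf.FIFOQueue(capacity=64, dtypes=tf.int32, shapes=[[]])

r = q1.dequeue()
# A single op that pushes all generated examples for one file at once;
# the QueueRunner below re-runs it in a background thread.
examples = tf.stack([r * 10 + i for i in range(10)])
enqueue_op = q2.enqueue_many(examples)
qr = tf.train.QueueRunner(q2, [enqueue_op])
tf.train.add_queue_runner(qr)

batch = q2.dequeue_many(5)

with tf.Session() as sess:
    sess.run(init_op)
    sess.run(q1.close())  # no more file names; lets the runner stop cleanly
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    print(sess.run(batch))  # e.g. [0 1 2 3 4]
    coord.request_stop()
    coord.join(threads)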