How to implement a double queue structure in tensorflow - tensorflow

I am using TensorFlow to implement a neural network, and I want the following architecture: there are two queues, Q1 and Q2. Q1 is initialised with some file names, and Q2 will be filled with examples later.
Every time the session runs a step, a file name is popped from Q1 and passed to a processing stage. There, data is read from the file and some number of examples, say 32, are generated from it. The 32 generated examples are then enqueued into Q2. Once Q2 reaches some limit, a batch of examples is dequeued from it.
In particular, I will generate nearly 1M examples each time a file is read, so this process must run in the background without blocking the main thread, and enqueueing into Q2 must be asynchronous.
I failed to find a solution. I have tried something like the following:
import tensorflow as tf
q1 = tf.FIFOQueue(capacity=32, dtypes=tf.int32)
init_op = q1.enqueue_many(([0, 1, 2],))
q2 = tf.FIFOQueue(capacity=64, dtypes=tf.int32)
r = q1.dequeue()
# mimic generating examples from data read from the file
for i in range(10):
    enq_op = q2.enqueue(r * 10 + i)
s = q2.dequeue()
sess = tf.InteractiveSession()
sess.run(init_op)
# don't know what to do
sess.close()
Could anyone help?

One problem I see is that you are confusing graph construction with graph execution. Your for i in range(10) loop only creates a bunch of enqueue ops; it doesn't actually add r*10+i to your queue. Nothing is enqueued until one of those ops is run with sess.run.
I recommend going through the queue tutorial first to understand the basic concepts -- https://www.tensorflow.org/versions/r0.9/how_tos/threading_and_queues/index.html . Also this
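To make the target architecture concrete, here is a minimal sketch of the two-queue pipeline in plain Python, using the standard library's queue.Queue and threading in place of TensorFlow queues. It is only an analogue of the design described in the question, not TensorFlow code: make_examples, producer and batches are hypothetical helper names, and the file ids 0, 1, 2 and the r * 10 + i formula are taken from the question's snippet.

```python
import queue
import threading


def make_examples(file_id, per_file=10):
    # Hypothetical stand-in for "read the file and generate examples".
    return [file_id * 10 + i for i in range(per_file)]


def producer(q1, q2):
    # Pop file ids from Q1 until it is empty and push every example into Q2.
    while True:
        try:
            file_id = q1.get_nowait()
        except queue.Empty:
            break
        for example in make_examples(file_id):
            q2.put(example)  # blocks while Q2 is full, which gives back-pressure
    q2.put(None)  # sentinel: no more examples will arrive


def batches(q2, batch_size):
    # Collect examples from Q2 into fixed-size batches until the sentinel shows up.
    batch = []
    while True:
        item = q2.get()
        if item is None:
            break
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch


q1 = queue.Queue()
for file_id in [0, 1, 2]:
    q1.put(file_id)
q2 = queue.Queue(maxsize=64)

# The producer runs in the background so the main thread only consumes batches.
worker = threading.Thread(target=producer, args=(q1, q2), daemon=True)
worker.start()
all_batches = list(batches(q2, batch_size=8))
worker.join()
print(all_batches)
```

The point of the sketch is the division of labour: the enqueue side runs repeatedly in its own thread, and the main thread only ever dequeues. The tutorial linked above covers how TensorFlow itself runs enqueue ops in background threads.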

Related

Problem when predicting via multiprocess with Tensorflow

I have 4 (or more) models (same structure but different training data). Now I want to ensemble them to make a prediction. I want to pre-load the models and then predict one input message (one message at a time) in parallel via multiprocessing. However, the program always stops at the "session.run" step, and I could not figure out why.
I tried passing all arguments to the function in each process, as shown in the code below. I also tried using a Queue object and putting all the data (except the model object) in the queue. I also tried setting the number of processes to 1. It made no difference.
from multiprocessing import Manager, Process
with Manager() as manager:
    first_level_test_features = manager.list()
    procs = []
    for id in range(4):
        p = Process(target=predict, args=(id, (message, models, configs, vocabs, emoji_dict, first_level_test_features)))
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
I did not get any error message since it is just stuck there. I expected the program to start multiple processes, each using the model passed to it to make the prediction.
I am unsure how session sharing across different processes would work, and this is probably where your issue comes from. Given the way TensorFlow works, I would advise implementing the ensemble call as a graph operation, so that it can be run through a single session.run call, with TF handling the parallelization of computations wherever possible.
In practice, if you have symbolic tensors representing the models' predictions, you could use a TF operation to aggregate them (tf.concat, tf.reduce_mean, tf.add_n... whichever suits your design) and end up with a single symbolic tensor representing the ensemble prediction.
I hope this helps; if not, please provide some more details as to what your setting is, notably which form your models have.
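As a rough illustration of the aggregation step only, here is a plain-Python sketch that averages the outputs of several models in a single call. The models are hypothetical callables returning a list of class scores, and ensemble_mean is a made-up helper name. In TensorFlow the same averaging would be built from graph ops such as the ones named above, so that one session.run evaluates the whole ensemble.

```python
def ensemble_mean(models, message):
    # Run every model on the same input and average the scores per class.
    predictions = [model(message) for model in models]
    n = len(predictions)
    return [sum(scores) / n for scores in zip(*predictions)]


# Hypothetical stand-ins for four pre-loaded models with two output classes.
models = [
    lambda msg: [0.25, 0.75],
    lambda msg: [0.75, 0.25],
    lambda msg: [0.5, 0.5],
    lambda msg: [0.5, 0.5],
]

avg = ensemble_mean(models, "hello")
print(avg)  # [0.5, 0.5]
```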

Tensorflow Data API - prefetch

I am trying to use new features of TF, namely Data API, and I am not sure how prefetch works. In the code below
def dataset_input_fn(...):
    dataset = tf.data.TFRecordDataset(filenames, compression_type="ZLIB")
    dataset = dataset.map(lambda x: parser(...))
    dataset = dataset.map(lambda x, y: image_augmentation(...),
                          num_parallel_calls=num_threads)
    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.batch(batch_size)
    dataset = dataset.repeat(num_epochs)
    iterator = dataset.make_one_shot_iterator()
Does it matter between which of the lines above I put dataset = dataset.prefetch(batch_size)? Or should it go after every operation that would have used output_buffer_size if the dataset were coming from tf.contrib.data?
In discussion on github I found a comment by mrry:
Note that in TF 1.4 there will be a Dataset.prefetch() method that
makes it easier to add prefetching at any point in the pipeline, not
just after a map(). (You can try it by downloading the current nightly
build.)
and
For example, Dataset.prefetch() will start a background thread to
populate a ordered buffer that acts like a tf.FIFOQueue, so that
downstream pipeline stages need not block. However, the prefetch()
implementation is much simpler, because it doesn't need to support as
many different concurrent operations as a tf.FIFOQueue.
So prefetch can be placed after any transformation, and it buffers the output of the transformation that precedes it. So far I have noticed the biggest performance gains by putting it only at the very end.
There is one more discussion on Meaning of buffer_size in Dataset.map , Dataset.prefetch and Dataset.shuffle where mrry explains a bit more about the prefetch and buffer.
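To illustrate the mechanism described in the quote (a background thread filling an ordered, bounded buffer), here is a small plain-Python sketch. It is not how TensorFlow implements Dataset.prefetch(); the prefetch function below is a hypothetical helper built on queue.Queue, and it ignores error handling and early termination for brevity.

```python
import queue
import threading

_END = object()  # sentinel marking the end of the upstream iterable


def prefetch(iterable, buffer_size):
    # Bounded FIFO buffer: the producer blocks once buffer_size items are waiting.
    buf = queue.Queue(maxsize=buffer_size)

    def fill():
        for item in iterable:
            buf.put(item)
        buf.put(_END)

    # The background thread keeps the buffer topped up while the consumer works.
    # Simplification: if the consumer stops early, this daemon thread stays blocked.
    threading.Thread(target=fill, daemon=True).start()
    while True:
        item = buf.get()
        if item is _END:
            return
        yield item


# Upstream stage: a generator standing in for the earlier pipeline steps.
result = list(prefetch((x * x for x in range(5)), buffer_size=2))
print(result)  # [0, 1, 4, 9, 16]
```

Because the buffer sits between whatever comes before it and whatever comes after it, wrapping the last stage overlaps the whole upstream pipeline with the consumer, which is consistent with the observation that a single prefetch at the end gives the biggest gain.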
UPDATE 2018/10/01:
From version 1.7.0 Dataset API (in contrib) has an option to prefetch_to_device. Note that this transformation has to be the last in the pipeline and when TF 2.0 arrives contrib will be gone. To have prefetch work on multiple GPUs please use MultiDeviceIterator (example see #13610) multi_device_iterator_ops.py.
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/data/prefetch_to_device

feed data into a tf.contrib.data.Dataset like a queue

About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage:
The way a Dataset obtains its data doesn't really fit how I usually get my data. In my case, I have a thread in which I receive data. I don't know in advance when the data will end, but I can tell when it has ended. Then I wait until I have processed all the buffers, and at that point I have finished one epoch. How can I get this logic with the Dataset?
Note that I prefer the Dataset interface over the QueueBase interface because it gives me the iterator interface which I can reinitialize and even reset to a different Dataset. This is more powerful compared to queues which cannot be reopened currently after they are closed (see here and here).
Maybe a similar question, or the same question: How can I wrap a Dataset around a queue? I have some thread which reads data from somewhere and which can feed it and queue it somehow. How do I get the data into the Dataset? I could repeat some dummy tensor infinitely many times and then use map to just return my queue.dequeue(), but that really only gets me back to all the original problems with the queue, i.e. how to reopen the queue.
The new Dataset.from_generator() method allows you to define a Dataset that is fed by a Python generator. (To use this feature at present, you must download a nightly build of TensorFlow or build it yourself from source. It will be part of TensorFlow 1.4.)
The easiest way to implement your example would be to replace your receiving thread with a generator, with pseudocode as follows:
def receiver():
    while True:
        next_element = ...  # Receive next element from external source.
                            # Note that this method may block.
        end_of_epoch = ...  # Decide whether or not to stop based on next_element.
        if not end_of_epoch:
            yield next_element  # Note: you may need to convert this to an array.
        else:
            return  # Returning will signal OutOfRangeError on downstream iterators.
dataset = tf.contrib.data.Dataset.from_generator(receiver, output_types=...)
# You can chain other `Dataset` methods after the generator. For example:
dataset = dataset.prefetch(...)  # This will start a background thread
                                 # to prefetch elements from `receiver()`.
dataset = dataset.repeat(...)    # Note that each repetition will call
                                 # `receiver()` again, and start from
                                 # a fresh state.
dataset = dataset.batch(...)
More complicated topologies are possible. For example, you can use Dataset.interleave() to create many receivers in parallel.
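If the receiving thread has to stay as it is, one alternative (an assumption on my part, not something the answer above proposes) is to let the thread push into an ordinary queue.Queue and wrap that queue in a generator that stops at an end-of-epoch sentinel. The sketch below is plain Python; receiver_thread, epoch_generator and END_OF_EPOCH are hypothetical names, and the list of strings stands in for data arriving from an external source.

```python
import queue
import threading

END_OF_EPOCH = object()  # sentinel the receiving thread pushes when the data ends


def receiver_thread(buf, items):
    # Stand-in for the thread that receives data from an external source.
    for item in items:
        buf.put(item)
    buf.put(END_OF_EPOCH)


def epoch_generator(buf):
    # Yield elements until the sentinel arrives, then finish the epoch.
    while True:
        item = buf.get()  # may block while the receiver is still working
        if item is END_OF_EPOCH:
            return
        yield item


buf = queue.Queue(maxsize=4)
t = threading.Thread(target=receiver_thread, args=(buf, ["a", "b", "c"]), daemon=True)
t.start()
first_epoch = list(epoch_generator(buf))
t.join()
print(first_epoch)  # ['a', 'b', 'c']
```

Since Dataset.from_generator() expects a callable that returns the generator, something like lambda: epoch_generator(buf) would be the piece to hand over. A new epoch then only requires the receiving thread to start pushing again, with no queue to reopen.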

Force copy of tensor when enqueuing

First, I'm not sure if the title is very good, but it was the best I could come up with given my understanding of the situation.
The background is that I'm trying to understand how queues work in tensorflow and ran into the following issue which puzzled me.
I have a variable n, which I enqueue to a tf.FIFOQueue, and then I increment the variable. This is repeated several times, and one would expect a result similar to 0, 1, 2, ... However, when emptying the queue all values are the same.
More precisely, the code is as follows:
from __future__ import print_function
import tensorflow as tf
q = tf.FIFOQueue(10, tf.float32)
n = tf.Variable(0, trainable=False, dtype=tf.float32)
inc = n.assign(n+1)
enqueue = q.enqueue(n)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
sess.run(enqueue)
sess.run(inc)
sess.run(enqueue)
sess.run(inc)
sess.run(enqueue)
sess.run(inc)
print(sess.run(q.dequeue()))
print(sess.run(q.dequeue()))
print(sess.run(q.dequeue()))
Which I expect would print:
0.0
1.0
2.0
Instead I get the following result:
3.0
3.0
3.0
It seems like I'm pushing some pointer to n to the queue, instead of the actual value, which is what I want. However, I don't really have any actual understanding of tensorflow internals, so maybe something else is going on?
I tried changing
enqueue = q.enqueue(n)
to
enqueue = q.enqueue(tf.identity(n))
since the answers to How can I copy a variable in tensorflow and In TensorFlow, what is tf.identity used for? give me the impression that it might help, but it does not change the result. I also tried adding a tf.control_dependencies(), but again, all values are the same when dequeueing.
Edit: The output above is from running the code on a computer with a single CPU. When checking whether different versions of TensorFlow behaved differently, I noticed that on a computer with both a CPU and a GPU I get the "expected" result. Indeed, with CUDA_VISIBLE_DEVICES="" I get the result above, and with CUDA_VISIBLE_DEVICES="0" I get the "expected" result.
To force a non-caching read you can do
q.enqueue(tf.add(n, 0))
This is what's currently done by the batch-normalization layer to force a copy.
The semantics of how variables get read vs. referenced are in the process of being revamped, so they are temporarily non-intuitive. In particular, I expected q.enqueue(v.read_value()) to force a non-caching read, but it doesn't fix your example on TF 0.12rc1.
On a GPU machine the variable is placed on the GPU while the queue is CPU-only, so the enqueue op forces a GPU->CPU copy.
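The "pointer instead of value" intuition from the question can be reproduced in plain Python, which may help as a mental model. This is only an analogy and says nothing about TensorFlow internals: Counter and run are hypothetical names, and copy.copy plays the role that the forced copy plays above.

```python
import copy
import queue


class Counter:
    # Mutable holder, playing the role of the variable n.
    def __init__(self):
        self.value = 0.0


def run(snapshot):
    n = Counter()
    q = queue.Queue()
    for _ in range(3):
        # Either enqueue a reference to n or a copy taken at this moment.
        q.put(copy.copy(n) if snapshot else n)
        n.value += 1
    return [q.get().value for _ in range(3)]


aliased = run(snapshot=False)  # every entry is the same object
copied = run(snapshot=True)    # every entry is an independent snapshot
print(aliased)  # [3.0, 3.0, 3.0]
print(copied)   # [0.0, 1.0, 2.0]
```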
In case it helps, I've found that the other answers, although correct, do not work for all dtypes.
For example, this works fine with floats or ints but fails when n is a string tensor:
q.enqueue(tf.add(n, 0))
This one fails when the queue uses tuples with heterogeneous types (e.g., ints and floats):
q.enqueue_many([[n]])
So, if you find yourself in either of these situations, try this instead:
q.enqueue(tf.add(n, tf.zeros_like(n)))
Or, to enqueue a tuple t:
q.enqueue([tf.add(n, tf.zeros_like(n)) for n in t])
That works even for string tensors and heterogeneous tuple types.
Hope it helps!
--
Update: it looks like tf.bool types do not work with tf.zeros_like(). For those, an explicit cast to an integer type might be needed.

Multiple queues causing TF to lock up

I'm trying to use multiple queues for reading and batching, but this is causing TF to occasionally lock up. Here's some sample code:
import tensorflow as tf
coordinator = tf.train.Coordinator()
file_queue = tf.train.string_input_producer(tf.train.match_filenames_once(...))
reader = tf.TextLineReader()
key, serialized_example = reader.read(file_queue)
keys, serialized_examples = tf.train.batch([key, serialized_example], 10)
# repeat the code snippet below multiple times, in my case 4
file_queue_i = tf.train.string_input_producer(tf.train.match_filenames_once(...))
reader_i = tf.TextLineReader()
key_i, serialized_example_i = reader.read(file_queue_i)
initializer = tf.initialize_all_variables()
session = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1, intra_op_parallelism_threads=1))
session.run(initializer)
threads = tf.train.start_queue_runners(sess=session, coord=coordinator)
session.run(keys)
TensorFlow occasionally locks up at the last line, when I actually try to run something. This behavior is rather hard to reproduce using the above code however. In 1000+ runs, I could only get it to hang once. In my real code, the actual reader is more complicated, and it's using TFRecords, but otherwise everything is the same. There it hangs up 2/3 of the time with 3 queues in total. With 5 queues it seemingly never runs, and with 1 queue it seemingly never hangs. This is on a Mac with 0.6. I have a different system running Ubuntu, also with 0.6, and I get the same problem (although the frequency of locking up is much higher on the Ubuntu system).
UPDATE: A more accurate estimate of how often the above code locks up is 1 in 5,000 trials.
This is probably caused by not having enough operation threads. If queue runner 1 depends on the work of queue runner 2 and you run them asynchronously, then you need at least two op threads, set through inter_op_parallelism_threads, to guarantee that progress is made.
In your case, the queue runner that fills the batch queue depends on the string_input_producer queue being non-empty. If the queue runner associated with the string_input_producer queue runs first, then everything is fine. But if the batch queue runner is scheduled first, it will get stuck in the string_input_producer.dequeue op, waiting for the string_input_producer queue to get some filenames. Since there is only 1 thread in the TensorFlow op thread pool, the enqueue op of string_input_producer will never be allocated a thread to complete (i.e., to execute its Compute method).
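The same starvation pattern can be reproduced outside TensorFlow with a thread pool of size one. The sketch below uses Python's concurrent.futures.ThreadPoolExecutor as a stand-in for the op thread pool; run_pipeline, batch_runner and filename_runner are hypothetical names, and a timeout is added so the demonstration terminates instead of hanging.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(num_threads, timeout=0.5):
    filenames = queue.Queue()

    def batch_runner():
        # Like the batch queue runner: cannot proceed until a filename is available.
        try:
            return filenames.get(timeout=timeout)
        except queue.Empty:
            return None  # without the timeout this would block forever

    def filename_runner():
        # Like the string_input_producer queue runner: supplies the filename.
        filenames.put("file-0")

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        consumer = pool.submit(batch_runner)  # scheduled first
        pool.submit(filename_runner)          # needs a free worker to run
        return consumer.result()


starved = run_pipeline(num_threads=1)  # the only worker is stuck in batch_runner
ok = run_pipeline(num_threads=2)       # a second worker can run filename_runner
print(starved, ok)  # None file-0
```

With a single worker the producer task only gets to run after the consumer has given up, which mirrors the enqueue op never being allocated a thread.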
The simplest solution is to have at least as many operation threads as you have simultaneous run calls (i.e., number of queues + 1). If you really want to restrict yourself to one thread, you could preload the filename queue with filenames synchronously using the main thread.
import glob
coordinator = tf.train.Coordinator()
files = glob.glob('/temp/pipeline/*')
if FLAGS.preload_filenames:
    # Plain FIFO queue that the main thread fills itself, so no queue runner is needed.
    file_queue = tf.FIFOQueue(capacity=len(files), dtypes=tf.string)
    enqueue_val = tf.placeholder(dtype=tf.string)
    enqueue_op = file_queue.enqueue(enqueue_val)
else:
    file_queue = tf.train.string_input_producer(files)
reader = tf.TextLineReader()
key, serialized_example = reader.read(file_queue)
keys, serialized_examples = tf.train.batch([key, serialized_example], 5,
                                           capacity=10)
initializer = tf.initialize_all_variables()
session = tf.Session(config=tf.ConfigProto(inter_op_parallelism_threads=1,
                                           intra_op_parallelism_threads=1))
print('running initializer')
session.run(initializer)
if FLAGS.preload_filenames:
    print('preloading filenames')
    for fn in files:
        session.run([enqueue_op], feed_dict={enqueue_val: fn})
        print('size - ', session.run([file_queue.size()]))
    session.run([file_queue.close()])
print('starting queue runners')
threads = tf.train.start_queue_runners(sess=session, coord=coordinator)
print('about to run session')
print(session.run(keys))
The code above will need some encapsulation if you have more than one filename queue. Alternatively, here's a hacky workaround which should work if there are exactly prebuffer_amount filenames for all input_producer queues:
queue_runners = tf.get_collection(tf.GraphKeys.QUEUE_RUNNERS)
filename_queue_runners = [qr for qr in queue_runners if 'input_producer' in qr.name]
for qr in filename_queue_runners:
    for k in range(prebuffer_amount):
        sess.run(qr._enqueue_ops[0])