How to reuse one tensor when creating a tensorflow dataset iterator of a pair of tensors? - tensorflow

Imagine that I want to pair samples from one pool of data with samples from another pool of data to feed into the network, where many samples from the first pool should be paired with the same sample from the second pool. (Let's assume all samples are of the same shape.)
For example, if we denote the samples from the first pool as f_i, samples from the second pool as g_j, I might want to have a mini-batch of samples as below (each line is one sample in the mini-batch):
(f_0, g_0)
(f_1, g_0)
(f_2, g_0)
(f_3, g_0)
...
(f_10, g_0)
(f_11, g_1)
(f_12, g_1)
(f_13, g_1)
...
(f_19, g_1)
...
If the data from the second pool were small (like labels), I could simply store them together with the samples from the first pool in tfrecords. But in my case the data from the second pool are of the same size as the data from the first pool (for example, both are movie segments), so saving them as pairs in one tfrecords file would almost double the disk space used.
I wonder if there is any way to save all the samples from the second pool only once on disk, but still feed the data to my network the way I want? (Assume I have already specified the mapping between samples in the first pool and those in the second pool based on their file names.)
Thanks a lot!

You can use an iterator for each of the tfrecords (or pools of samples), so you get two iterators that can each advance at their own pace. When you run get_next() on an iterator, the next sample is returned, so you have to keep it in a tensor and manually feed it. Quoting from the documentation:
(Note that, like other stateful objects in TensorFlow, calling Iterator.get_next() does not immediately advance the iterator. Instead you must use the returned tf.Tensor objects in a TensorFlow expression, and pass the result of that expression to tf.Session.run() to get the next elements and advance the iterator.)
So all you need is a couple of loops that iterate and combine samples from each iterator as a pair, and then you can feed this when you run your desired operation. For example:
g_iterator = g_dataset.make_one_shot_iterator()
get_next_g = g_iterator.get_next()
f_iterator = f_dataset.make_one_shot_iterator()
get_next_f = f_iterator.get_next()

while True:  # loop g: fetch the next g sample (ends with OutOfRangeError)
    temp_g = session.run(get_next_g)
    for _ in range(pairs_per_g):  # loop f: pair the next pairs_per_g f samples with this g
        temp_f = session.run(get_next_f)
        session.run(train, feed_dict={f: temp_f, g: temp_g})
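Alternatively, if you would rather keep the pairing inside the tf.data pipeline (tf.contrib.data in older versions) and avoid feed_dict entirely, here is one possible sketch, assuming each g sample should simply be paired with the next pairs_per_g samples from f and that both datasets already yield elements in the order implied by your file-name mapping:

pairs_per_g = 10  # assumed: how many f samples share one g sample

# Repeat every g element pairs_per_g times, preserving order.
g_repeated = g_dataset.flat_map(
    lambda g: tf.data.Dataset.from_tensors(g).repeat(pairs_per_g))

# Pair each f sample with its (repeated) g sample, then batch as usual.
paired_dataset = tf.data.Dataset.zip((f_dataset, g_repeated)).batch(batch_size)
f_batch, g_batch = paired_dataset.make_one_shot_iterator().get_next()

This way each g sample is stored only once on disk and the repetition happens on the fly.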
Does this answer your question?

Related

Tensorflow data.Dataset.map and memory storage

I have a dataset of images that is too large to store in memory. What I plan to do is load pairs of image paths and corresponding labels as my dataset, then use a generator function during training to convert only the paths in my batch to images before feeding them to the network.
Is data.Dataset.map() a good way to do this? Does it return a mapping function that can be applied only to the current batch during training, or does it perform the mapping operation on the whole dataset at once, occupying lots of memory? In the latter case, what is an alternative?
A few tutorials I went through made me believe the mapping takes place per batch, but this quote from the documentation suggests a whole new dataset is returned: "This transformation applies map_func to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input."
The key thing to understand here is that tf.data.Dataset objects are generally "lazy" in that elements are only processed as needed (in a batched Dataset, elements == batches). When iterating over a dataset, this usually means that only the next requested element is prepared and then returned. So to answer your question: When using map to load data from disk, and applying this to a dataset of file names, only one batch of the loaded data should be stored in memory at the same time, and you should be able to process the dataset just fine. However, this can significantly slow down training if loading the files is a bottleneck in terms of speed.
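For instance, a minimal sketch (the file names, image size, and labels below are hypothetical) in which the map only reads and decodes images as elements are requested:

import tensorflow as tf

def load_image(path, label):
    # Runs lazily, element by element, as batches are requested downstream.
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label

# Hypothetical inputs: small lists of file paths and integer labels.
paths = ["img_0.jpg", "img_1.jpg", "img_2.jpg"]
labels = [0, 1, 0]

dataset = tf.data.Dataset.from_tensor_slices((paths, labels))
dataset = dataset.shuffle(len(paths))   # shuffle the cheap path/label pairs...
dataset = dataset.map(load_image)       # ...before the memory-heavy map
dataset = dataset.batch(32).prefetch(1)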
There are some exceptions though, for example:
When you use the shuffle method, you need to provide a buffer size, and AFAIK the entire buffer is preprocessed at once. This can lead to issues since you want a large buffer for good shuffling, but this requires more memory. Thus you probably want to use shuffle before applying map.
The prefetch method results in multiple elements being prepared in order to avoid the model having to wait for the next batch to be processed.
Note that this lazy behavior also has some disadvantages, e.g.
You can only iterate over datasets sequentially; there is no random access.
A dataset doesn't even know how many elements it contains (this would require iterating over the entire set).

how does tensorflow.data.Dataset repeat(count=None) method work

Tensorflow data.Dataset has a method repeat(count=None) (https://www.tensorflow.org/api_docs/python/tf/data/Dataset?version=stable#repeat), which repeats the dataset so that each original value is seen count times. If count is set to None (the default), the dataset is repeated indefinitely. My question is: in this extreme case, how is a dataset of infinite size handled and stored in memory? When I try checking its contents using as_numpy_iterator(), the system gets stuck.
The data.Dataset produced by repeat(count=X) is not a simple in-memory repetition of your data. It is a Python iterable that produces an iterator object.
An iterator is an object that implements next, which is expected to return the next element of the iterable object that returned it, and raise a StopIteration exception when no more elements are available.
Source
A Dataset with "infinite repetitions" will keep producing samples indefinitely, but only loads roughly one batch of samples at a time; so, generally speaking, what you store in memory is one batch. Moreover, shuffling with a buffer size greater than your dataset size will help each batch be representative of the distribution of elements in your actual dataset.
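A minimal sketch of this behavior: even though repeat() makes the dataset conceptually infinite, only the batches you actually request get materialized:

import tensorflow as tf

dataset = tf.data.Dataset.range(5)   # 5 elements: 0..4
dataset = dataset.repeat()           # count=None: repeat indefinitely
dataset = dataset.batch(3)           # elements are produced 3 at a time

# Only the requested batches are materialized; memory use stays bounded
# even though the dataset never ends.
for batch in dataset.take(4).as_numpy_iterator():
    print(batch)   # [0 1 2], [3 4 0], [1 2 3], [4 0 1]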

Tensorflow Shuffle Batch Non Deterministic

I am trying to get deterministic behaviour from tf.train.shuffle_batch(). I could, instead, use tf.train.batch() which works fine (always the same order of elements), but I need to get examples from multiple tf-records and so I am stuck with shuffle_batch().
I am using:
random.seed(0)
np.random.seed(0)
tf.set_random_seed(0)
data_entries = tf.train.shuffle_batch(
    [data], batch_size=batch_size, num_threads=1, capacity=512,
    seed=57, min_after_dequeue=32)
But every time I restart my script I get slightly different results (not completely different, but about 20% of the elements are in the wrong order).
Is there anything I am missing?
Edit: Solved it! See my answer below!
Maybe I misunderstood something, but you can collect multiple tf-records in a queue with tf.train.string_input_producer(), then read the examples into tensors and finally use tf.train.batch().
Take a look at CIFAR-10 input.
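A rough sketch of that approach (the file names and feature spec below are assumptions, not from the question):

# Queue over multiple tf-record files; shuffling the file-name queue (seeded)
# mixes which record is read next.
filename_queue = tf.train.string_input_producer(
    ["data_0.tfrecords", "data_1.tfrecords"], shuffle=True, seed=0)

reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)
features = tf.parse_single_example(
    serialized, features={"data": tf.FixedLenFeature([], tf.string)})

# Deterministic batching of the parsed examples (no example-level shuffling).
batch = tf.train.batch([features["data"]], batch_size=32, capacity=512)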
Answering my own question:
First, the reason shuffle_batch is non-deterministic:
The time until I request a batch is inherently random.
In that time, a random number of tensors are available.
TensorFlow calls a shuffle operation that is seeded, but depending on the number of items available, it returns a different order.
So no matter the seeding, the order is always different unless the number of elements is constant. The solution, then, is to keep the number of elements constant, but how do we do that?
By setting capacity=min_after_dequeue+batch_size. This forces TensorFlow to fill the queue up to full capacity before dequeuing an item. Therefore, at the time of the shuffle operation, we always have capacity items, which is a constant number.
So why are we doing this? Because one tf.record contains many examples but we want examples from multiple tf.records. With a normal batch we would first get all the examples of one record and then of the next one. This also means we should set min_after_dequeue to something larger than the number of items in one tf.record. In my example, I have 50 examples in one file so I set min_after_dequeue=2048.
Alternatively, we can also shuffle the examples before creating the tf.records, but this was not possible for me because I read tf.records from multiple directories (each with their own dataset).
Last note: you should also use a batch size of 1 to be extra safe.
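Putting the pieces of this answer together, the call might look like this (the constants simply mirror the numbers mentioned in the thread):

min_after_dequeue = 2048
batch_size = 1  # as noted above, a batch size of 1 is the safest choice

data_entries = tf.train.shuffle_batch(
    [data], batch_size=batch_size, num_threads=1, seed=57,
    min_after_dequeue=min_after_dequeue,
    # Force the queue to fill to full capacity before any dequeue, so the
    # seeded shuffle always operates on a constant number of elements.
    capacity=min_after_dequeue + batch_size)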

Shuffling the training dataset with Tensorflow object detection api

I'm working on a logo detection algorithm using the Faster-RCNN model with the Tensorflow object detection api.
My dataset is alphabetically ordered (so there are a hundred Adidas logos, then a hundred Apple logos, etc.), and I would like it to be shuffled during training.
I've put some values in the config file:
train_input_reader: {
  shuffle: true
  queue_capacity: some value
  min_after_dequeue: some other value
}
However, whatever values I put in, the algorithm at first trains on all of the 'a' logos (Adidas, Apple and so on) and only after a while starts to see the 'b' logos (BMW etc.), then the 'c' ones, and so on.
Of course I could just shuffle my input dataset directly, but I would like to understand the logic behind it.
PS: I've seen this post about shuffling and min_after_dequeue, but I still don't quite get it. My batch size is 1, so it shouldn't be using tf.train.shuffle_batch() but only tf.RandomShuffleQueue.
My training dataset size is 5000 and if I write min_after_dequeue: 4000 or 5000 it is still not shuffled right. Why though?
Update:
@AllenLavoie It's a bit hard for me, as there are a lot of dependencies and I'm new to TensorFlow.
But in the end the queue is constructed by
_, string_tensor = tf.contrib.slim.parallel_reader.parallel_read(
    config.input_path,
    reader_class=tf.TFRecordReader,
    num_epochs=(input_reader_config.num_epochs
                if input_reader_config.num_epochs else None),
    num_readers=input_reader_config.num_readers,
    shuffle=input_reader_config.shuffle,
    dtypes=[tf.string, tf.string],
    capacity=input_reader_config.queue_capacity,
    min_after_dequeue=input_reader_config.min_after_dequeue)
It seems that when I put num_readers = 1 in the config file, the dataset finally shuffles as I want (at least at the beginning), but with more readers the logos somehow still arrive in alphabetical order at the start.
I recommend shuffling the dataset prior to training. The way shuffling currently happens is imperfect and my guess at what is happening is that at the beginning the queue starts off empty and only gets examples that start with 'A' --- after a while it may be more shuffled, but there is no getting around the beginning part when the queue hasn't been filled yet.
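If you do shuffle offline, a minimal sketch (the variable and file names are hypothetical) is to shuffle the collected examples before serializing them into the record file:

import random
import tensorflow as tf

# Hypothetical: `examples` is a list of tf.train.Example protos collected
# from all source directories.
random.seed(0)
random.shuffle(examples)

with tf.python_io.TFRecordWriter("train_shuffled.record") as writer:
    for example in examples:
        writer.write(example.SerializeToString())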

feed data into a tf.contrib.data.Dataset like a queue

About the tf.contrib.data.Dataset (from TensorFlow 1.2, see here and here) usage:
The expected way of getting data into a Dataset doesn't really fit the way I usually get my data. In my case, I have a thread that receives data; I don't know in advance when the stream will end, but I can see when it ends. Then I wait until I have processed all the buffers, and at that point I have finished one epoch. How can I express this logic with the Dataset?
Note that I prefer the Dataset interface over the QueueBase interface because it gives me the iterator interface which I can reinitialize and even reset to a different Dataset. This is more powerful compared to queues which cannot be reopened currently after they are closed (see here and here).
Maybe a similar question, or the same question: how can I wrap a Dataset around a queue? I have some thread that reads data from somewhere and can feed and queue it somehow. How do I get that data into the Dataset? I could repeat some dummy tensor infinitely and then use map to just return my queue.dequeue(), but that really only gets me back to all the original problems with the queue, i.e. how to reopen the queue.
The new Dataset.from_generator() method allows you to define a Dataset that is fed by a Python generator. (To use this feature at present, you must download a nightly build of TensorFlow or build it yourself from source. It will be part of TensorFlow 1.4.)
The easiest way to implement your example would be to replace your receiving thread with a generator, with pseudocode as follows:
def receiver():
  while True:
    next_element = ...   # Receive next element from external source.
                         # Note that this method may block.
    end_of_epoch = ...   # Decide whether or not to stop based on next_element.
    if not end_of_epoch:
      yield next_element # Note: you may need to convert this to an array.
    else:
      return  # Returning will signal OutOfRangeError on downstream iterators.

dataset = tf.contrib.data.Dataset.from_generator(receiver, output_types=...)

# You can chain other `Dataset` methods after the generator. For example:
dataset = dataset.prefetch(...)  # This will start a background thread
                                 # to prefetch elements from `receiver()`.
dataset = dataset.repeat(...)    # Note that each repetition will call
                                 # `receiver()` again, and start from
                                 # a fresh state.
dataset = dataset.batch(...)
More complicated topologies are possible. For example, you can use Dataset.interleave() to create many receivers in parallel.
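For concreteness, a minimal self-contained sketch (the element shape and counts here are made up) written against the tf.data namespace that this API moved to in TensorFlow 1.4:

import numpy as np
import tensorflow as tf

def receiver():
  # Hypothetical stand-in for the real receiving thread: yields a fixed
  # number of float vectors, then returns to end the epoch.
  for _ in range(100):
    yield np.random.rand(128).astype(np.float32)

dataset = tf.data.Dataset.from_generator(
    receiver, output_types=tf.float32, output_shapes=tf.TensorShape([128]))
dataset = dataset.batch(32).prefetch(1)

next_batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
  while True:
    try:
      print(sess.run(next_batch).shape)  # (32, 128); the final batch is smaller
    except tf.errors.OutOfRangeError:
      break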