Is there a way to keep a Tensorflow record file in memory? - tensorflow

Here is the situation: I am working with a large Tensorflow record file. It's about 50 GB. However the machine I'm doing this training on has 128 GB of RAM. 50 is less than 128, so even though this is a large file you would think that it would be possible to keep it in memory and save on slow I/O operations. But I'm using the TFRecordDataset class and it seems like the whole TFRecord system is designed specifically to not do that, and I don't see any way to force it to keep the records in memory. And since it reloads them every epoch I am wasting an inordinate amount of time on slow I/O operations reading from that 50 GB file.
I suppose I could load the records into memory in python and then load them into my model one by one with a feed_dict, bypassing the whole Dataset class. But that seems like a less elegant way to handle things and would require some redesign. Everything would be much simpler if I could just force the TFRecordDataset to load everything into memory and keep it there between epochs...

You need the cache() operation. To achieve the desired effect (keeping the file in memory), put it right after the TFRecordDataset and don't provide any arguments to it:
dataset = tf.data.TFRecordDataset(...)
dataset = dataset.cache()
When the cache() operation is invoked without arguments, caching is done in memory.
Also, if you have some postprocessing of these records, such as a map() transformation, it could be even more beneficial to put the cache() operation at the end of the input pipeline, so that the processed records are what gets cached.
More information can be found in the "Input Pipeline Performance Guide" Map and Cache section.
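As a rough illustration of the advice above, here is a minimal sketch assuming TensorFlow 2.x with eager execution. The toy file, the feature name "x", and the parse function are hypothetical stand-ins for the real 50 GB file and its schema.

```python
import os
import tempfile

import tensorflow as tf

# Write a tiny TFRecord file so the sketch is self-contained (hypothetical data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    for i in range(5):
        example = tf.train.Example(features=tf.train.Features(feature={
            "x": tf.train.Feature(int64_list=tf.train.Int64List(value=[i])),
        }))
        writer.write(example.SerializeToString())


def parse(record):
    # Decode one serialized Example into a scalar int64 tensor.
    spec = {"x": tf.io.FixedLenFeature([], tf.int64)}
    return tf.io.parse_single_example(record, spec)["x"]


dataset = tf.data.TFRecordDataset(path)
dataset = dataset.map(parse)  # per-record postprocessing
dataset = dataset.cache()     # no filename argument: cache in memory
dataset = dataset.batch(2)

# The first complete pass reads the file and fills the cache;
# later passes are served from memory.
epochs = [[int(v) for batch in dataset for v in batch.numpy()] for _ in range(2)]
```

Placing cache() after map() means the parsed records are cached, so parsing is not repeated every epoch. Any random augmentation should go after cache(), otherwise the same augmented values would be replayed each epoch.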


tensorflow how to reduce high "device-to-device" load

I profiled a model that I am running and the vast majority of the time in each step (295 of 320ms) is being taken up by "device-to-device" operations (see image). I assume this means loading data from my cpu onto my gpu and back is the bottleneck.
I am running this on a single machine. The data is stored on an SSD and being fed into a GPU.
I am using tensorflow's API and doing all the recommended things like prefetching and
My questions are:
(1) Is my assumption correct?
(2) How do I reduce this huge burden on my model?
[Image: Tensorboard Profiling Overview]
Not a proper answer but it's something; by using tensorflow's mixed precision training I was able to reduce the "device-to-device" time to ~ 145ms. This is still an immense burden compared to everything else profiled and I'd love to be able to reduce it further.
I don't know why this helped either. I assume that mixed-precision training means fewer bytes are being passed around, so maybe that helps.

tf.data.experimental.save VS TFRecords

I have noticed that the tf.data.experimental.save method (added in r2.3) makes it possible to save a dataset to file in just one line of code, which seems extremely convenient. Are there still some benefits in serializing a dataset and writing it into a TFRecord ourselves, or is this save function supposed to replace this process?
TFRecord has several benefits, especially when using large datasets.
TFRecord - If you are working with large datasets, using a binary file format for storage of your data can have a significant impact on the performance of your import pipeline and as a consequence on the training time of your model. Binary data takes up less space on disk, takes less time to copy and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to the much lower read/write performance in comparison with SSDs.
tf.data.experimental.save and tf.data.experimental.load - These will be useful if you are not worried about the performance of your import pipeline. The saved dataset is saved in multiple file "shards". By default, the dataset output is divided into shards in a round-robin fashion. Datasets saved through tf.data.experimental.save should only be consumed through tf.data.experimental.load, which is guaranteed to be backwards compatible.
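For reference, a minimal sketch of the one-line save path discussed above, assuming the functions in question are tf.data.experimental.save and tf.data.experimental.load (TensorFlow 2.3 or later; newer releases also offer Dataset.save and tf.data.Dataset.load). The toy dataset and temporary directory are hypothetical.

```python
import tempfile

import tensorflow as tf

# A toy dataset standing in for real training data (assumption).
dataset = tf.data.Dataset.range(6)

# Save: a single call writes the dataset as one or more shard files.
save_dir = tempfile.mkdtemp()
tf.data.experimental.save(dataset, save_dir)

# Load: element_spec is passed explicitly because early releases require it.
restored = tf.data.experimental.load(save_dir, element_spec=dataset.element_spec)
values = [int(v) for v in restored]
```

The on-disk layout is an implementation detail of these functions, which is why the saved directory should only be read back through the matching load call rather than parsed as a TFRecord file.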

I/O performance difference for sequential vs random access with MXNet data iterators?

I would like to supply to a network many training images that are sampled from a dataset by following certain sampling rules. Now I have two choices:
1) Use the sampling logic to generate a list of images offline, then convert the .lst file to a .rec file and use a sequential DataIter to access it.
2) Write my own child class of DataIter that can sample the images online. As a result, the class needs to support random access, maybe by inheriting from MXIndexedRecordIO. I will need to create a .rec file for the original dataset.
My intuition tells me that sequential access will be faster than random access for a .rec file. But I don't know if the difference is big enough to be worth the additional time I would spend writing and testing my own iterator class. Could anyone give me a hint on this?
In your case you are better off prepacking images using MXRecordIO. It will give you a boost of performance and also introduce consistency in how you handle the dataset.
It will store the files in a .rec file as a list, where order matters.
You can then use mxnet.image.ImageIter to iterate over .rec in order.
Since this is a question about performance, I guess it depends on how fast your network can process images which in turn depends on what hardware you are running your training on.
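To make the two access patterns concrete, here is a small pure-Python sketch. It does not use the MXNet API: the length-prefixed file format, the helper names, and the toy payloads are all assumptions, chosen only to show why a sequential reader needs no index while a random-access reader needs a stored offset per record.

```python
import os
import random
import struct
import tempfile


def write_records(path, payloads):
    """Write length-prefixed records and return the byte offset of each one."""
    offsets = []
    with open(path, "wb") as f:
        for payload in payloads:
            offsets.append(f.tell())
            f.write(struct.pack("<I", len(payload)))
            f.write(payload)
    return offsets


def read_sequential(path):
    """Read every record in file order; no index is needed."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break
            (length,) = struct.unpack("<I", header)
            records.append(f.read(length))
    return records


def read_at(path, offsets, idx):
    """Random access: seek to the indexed offset, then read one record."""
    with open(path, "rb") as f:
        f.seek(offsets[idx])
        (length,) = struct.unpack("<I", f.read(4))
        return f.read(length)


payloads = [f"image-{i}".encode() for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "toy.rec")
offsets = write_records(path, payloads)

in_order = read_sequential(path)
sampled_ids = random.Random(0).sample(range(10), 3)
sampled = [read_at(path, offsets, i) for i in sampled_ids]
```

Each random read pays for a seek, which matters on spinning disks and much less on SSDs or when the file sits in the OS page cache. Timing both loops on the real dataset is the only reliable way to see whether the difference is large enough to matter.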

Scalable, Efficient Hierarchical Softmax in Tensorflow?

I'm interested in implementing a hierarchical softmax model that can handle large vocabularies, say on the order of 10M classes. What is the best way to do this to both be scalable to large class counts and efficient? For instance, at least one paper has shown that HS can achieve a ~25x speedup for large vocabs when using a 2-level tree where each node has sqrt(N) classes. I'm interested also in a more general version for an arbitrary depth tree with an arbitrary branching factor.
There are a few options that I see here:
1) Run tf.gather for every batch, where we gather the indices and splits. This creates problems with large batch sizes and fat trees where now the coefficients are being duplicated a lot, leading to OOM errors.
2) Similar to #1, we could use tf.embedding_lookup which would help with OOM errors but now keeps everything on the CPU and slows things down quite a bit.
3) Use tf.map_fn with parallel_iterations=1 to process each sample separately and go back to using gather. This is much more scalable but does not really get close to the 25x speedup due to the serialization.
Is there a better way to implement HS? Are there different ways for deep and narrow vs. short and wide trees?
You mention that you want GPU-class performance:
"but now keeps everything on the CPU and slows things down quite a bit"
and wish to use 300-unit hidden size and 10M-word dictionaries.
This means that (assuming float32), you'll need 4 * 300 * 10M * 2 bytes = 24 GB just to store the parameters and the gradient for the output layer.
Hierarchical Softmax (HSM) doesn't reduce the memory requirements - it just speeds up the training.
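The 24 GB estimate above can be checked with a few lines of arithmetic. The variable names are only illustrative, and GB here means 10^9 bytes.

```python
bytes_per_float32 = 4
hidden_size = 300          # units feeding the output layer
vocab_size = 10_000_000    # 10M words
copies = 2                 # one copy for parameters, one for gradients

total_bytes = bytes_per_float32 * hidden_size * vocab_size * copies
total_gb = total_bytes / 1e9
print(f"{total_gb:.0f} GB")
```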
Realistically, you'll need a lot more GPU memory, because you'll also need to store:
other parameters and their gradients
optimizer data, e.g. velocities in momentum training
activations and backpropagated temporary data
framework-specific overhead
Therefore, if you want to do all computation on GPUs, you'll have no choice but to distribute this layer across multiple high-memory GPUs.
However, you now have another problem:
To make this concrete, let's suppose you have a 2-level HSM with 3K classes, with 3K words per class (9M words in total). You distribute the 3K classes across 8 GPUs, so that each hosts 384 classes.
What if all target words in a batch are from the same 384 classes, i.e. they belong to the same GPU? One GPU will be doing all the work, while the other 7 wait for it.
The problem is that even if the target words in a batch belong to different GPUs, you'll still have the same performance as in the worst-case scenario if you want to do this computation in TensorFlow. (This is because TensorFlow is a "specify-and-run" framework: the computational graph is the same for the best case and the worst case.)
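The load imbalance described above can be sketched by counting how many batch items land on each GPU. This is a toy model under stated assumptions: contiguous sharding of classes, "3K" taken as 8 * 384 = 3072 classes, and a hypothetical batch size of 400. It counts only the useful work per GPU and does not model TensorFlow's static graph.

```python
from collections import Counter

NUM_GPUS = 8
CLASSES_PER_GPU = 384  # 8 * 384 = 3072 classes in total


def gpu_for_class(class_id):
    """Contiguous sharding: classes 0..383 on GPU 0, 384..767 on GPU 1, and so on."""
    return class_id // CLASSES_PER_GPU


def work_per_gpu(target_classes):
    """Number of batch items whose target class lives on each GPU."""
    counts = Counter(gpu_for_class(c) for c in target_classes)
    return [counts.get(g, 0) for g in range(NUM_GPUS)]


batch_size = 400

# Worst case: every target class belongs to the first 384 classes (GPU 0).
worst_case = work_per_gpu([i % CLASSES_PER_GPU for i in range(batch_size)])

# Best case: target classes are spread evenly over all 8 shards.
total_classes = NUM_GPUS * CLASSES_PER_GPU
spread_out = work_per_gpu([(i * CLASSES_PER_GPU) % total_classes for i in range(batch_size)])
```

In the worst case one GPU holds all 400 items while the other seven hold none. In the spread-out case each GPU holds 50, but a fixed graph that sends the whole batch to every GPU cannot take advantage of that.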
"What is the best way to do this to both be scalable to large class counts and efficient?"
The above inefficiency of model parallelism (each GPU must process the whole batch) suggests that one should try to keep everything in one place.
Let us suppose that you are either implementing everything on the host, or on 1 humongous GPU.
If you are not modeling sequences, or if you are, but there is only one output for the whole sequence, then the memory overhead from copying the parameters, to which you referred, is negligible compared to the memory requirements described above:
400 == batch size << number of classes == 3K
In this case, you could simply use gather or embedding_lookup (although the copying is inefficient).
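A rough sketch of the gather-based two-level approach, written in NumPy rather than TensorFlow to keep it short. All sizes, weight names, and the word-to-class mapping (integer division by the class size) are assumptions for illustration. The indexing W_word[cls] plays the role of tf.gather and makes the per-sample parameter copy visible.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, n_classes, words_per_class, batch = 8, 4, 5, 3  # toy sizes

W_class = rng.normal(size=(hidden, n_classes))                  # level 1: class scores
W_word = rng.normal(size=(n_classes, hidden, words_per_class))  # level 2: one matrix per class


def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def hsm_log_prob(h, target_word):
    """log P(word | h) = log P(class | h) + log P(word | class, h)."""
    cls = target_word // words_per_class
    within = target_word % words_per_class
    class_lp = log_softmax(h @ W_class)  # (batch, n_classes)
    W_sel = W_word[cls]                  # gather: (batch, hidden, words_per_class)
    word_lp = log_softmax(np.einsum("bh,bhw->bw", h, W_sel))
    rows = np.arange(len(target_word))
    return class_lp[rows, cls] + word_lp[rows, within]


h = rng.normal(size=(batch, hidden))
targets = np.array([0, 7, 19])
log_p = hsm_log_prob(h, targets)
```

Each sample only touches n_classes + words_per_class output columns instead of the full vocabulary, which is where the speedup comes from. The gathered W_sel tensor grows with the batch size (and with sequence length if there is an output at every time step), which is the copying cost mentioned above.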
However, if you do model sequences of length, say, 100, with output at every time step, then the parameter copying becomes a big issue.
In this case, I think you'll need to drop down to C++ / CUDA C and implement this whole layer and its gradient as a custom op.

Store a checkpoint in a variable (or in memory)

I am using Tensorflow and storing the current "best" model on the hard drive for persistence, using tf.train.Saver:
saver = tf.train.Saver(max_to_keep=1)
My network is rather small and very fast to run; a single epoch on the GPU runs in less than 10 seconds. However, saving the model to the hard drive takes between one and two minutes, taking up a lot of time.
Is it possible to store the model in memory, to avoid taking up such a big chunk of the overall run time? If I somehow could store the "best" model in memory for a while, and dump it once I tell the model to, I could cut down the overall run time by a big factor.
I've looked at the tf.train.Saver documentation and implementation, and I cannot see any way to achieve just what I want. Is there some other implementation or tool that can do what I want?
I don't think tf.train.Saver supports this. You can, however, mount an in-memory filesystem (like tmpfs in Linux) and save to that directory, which should not touch any disks.
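A minimal sketch of that workflow, assuming /dev/shm is available as a tmpfs mount (common on Linux, but not guaranteed). The plain file write below is only a stand-in for a call such as saver.save(sess, ckpt_path), and all directory names are hypothetical.

```python
import os
import shutil
import tempfile

# Prefer /dev/shm (tmpfs on most Linux systems); fall back to the default temp dir.
shm = "/dev/shm"
ram_root = shm if os.path.isdir(shm) and os.access(shm, os.W_OK) else tempfile.gettempdir()
ram_dir = tempfile.mkdtemp(prefix="best_model_", dir=ram_root)
ckpt_path = os.path.join(ram_dir, "model.ckpt")

# Stand-in for saver.save(sess, ckpt_path): write the "best" model into RAM.
with open(ckpt_path, "wb") as f:
    f.write(b"fake checkpoint bytes")

# Later, when the model should be persisted, copy the directory to real storage.
persist_root = tempfile.mkdtemp(prefix="persisted_")  # hypothetical durable location
final_dir = shutil.copytree(ram_dir, os.path.join(persist_root, "best_model"))
```

Anything kept only on tmpfs is lost on reboot or crash, so the copy to durable storage is the step that actually provides persistence.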