Store a tf.Saver.save checkpoint in a variable (or in memory) - tensorflow

I am using Tensorflow and storing the current "best" model on the hard drive for persistence, using tf.Saver:
saver = tf.train.Saver(max_to_keep=1)
[...]
saver.save(
    sess,
    path_to_file,
    global_step=epoch
)
My network is rather small and very fast to run; a single epoch on the GPU takes less than 10 seconds. However, saving the model to the hard drive takes between one and two minutes, which takes up a lot of time.
Is it possible to store the model in memory, to avoid spending such a big chunk of the overall run time on saving? If I could somehow keep the "best" model in memory for a while and only dump it to disk when I tell it to, I could cut down the overall run time by a big factor.
I've looked at the tf.Saver documentation and implementation, and I cannot see any way to achieve just what I want. Is there some other implementation or tool that can do what I want?

I don't think tf.Saver supports this. You can, however, mount an in-memory filesystem (like tmpfs on Linux) and save to that directory, which should not touch any disks.
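For example (a minimal sketch, assuming a Linux machine where /dev/shm is a tmpfs mount; the paths are only illustrative):
import shutil
import tensorflow as tf

# /dev/shm is usually a RAM-backed tmpfs mount on Linux, so writes here
# should not touch the physical disk (check with `df -T /dev/shm`).
ram_checkpoint_dir = "/dev/shm/best_model"

saver = tf.train.Saver(max_to_keep=1)
# sess and epoch come from your training loop, as in the question's snippet.
saver.save(sess, ram_checkpoint_dir + "/model", global_step=epoch)

# Only when training is done, copy the checkpoint files to persistent storage.
shutil.copytree(ram_checkpoint_dir, "/path/on/disk/best_model")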

Related

tensorflow how to reduce high "device-to-device" load

I profiled a model that I am running, and the vast majority of the time in each step (295 ms of 320 ms) is being taken up by "device-to-device" operations (see image below). I assume this means that loading data from my CPU onto my GPU and back is the bottleneck.
I am running this on a single machine. The data is stored on an SSD and being fed into a GPU.
I am using tensorflow's tf.data.Dataset API and doing all the recommended things, like prefetching and num_parallel_calls=tf.data.experimental.AUTOTUNE.
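The input pipeline is roughly of this shape (a minimal sketch with dummy in-memory data and a stand-in map function, not the actual code from the question):
import tensorflow as tf

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Dummy data standing in for the real dataset source.
features = tf.random.uniform((1024, 32))
labels = tf.random.uniform((1024,), maxval=10, dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .map(lambda x, y: (x * 2.0, y), num_parallel_calls=AUTOTUNE)  # stand-in preprocessing
    .batch(64)
    .prefetch(AUTOTUNE)  # overlap input preparation with training on the GPU
)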
My questions are:
(1) Is my assumption correct?
(2) How do I reduce this huge burden on my model?
[Image: TensorBoard profiling overview]
Not a proper answer, but it's something: by using tensorflow's mixed precision training I was able to reduce the "device-to-device" time to ~145 ms. This is still an immense burden compared to everything else profiled, and I'd love to be able to reduce it further.
I don't know why this helped either. I assume that mixed-precision training means fewer bytes are being passed around, so maybe that helps.
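For reference, mixed precision can be enabled through the Keras mixed-precision API (a minimal sketch; assumes TF 2.4+ and a GPU with float16 support, and the model itself is just a placeholder):
import tensorflow as tf
from tensorflow.keras import mixed_precision

# Compute in float16 while keeping variables in float32.
mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
    # Keep the final layer in float32 so the softmax stays numerically stable.
    tf.keras.layers.Dense(10, activation="softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")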

tf.data.experimental.save VS TFRecords

I have noticed that the method tf.data.experimental.save (added in r2.3) lets you save a tf.data.Dataset to file in just one line of code, which seems extremely convenient. Are there still some benefits in serializing a tf.data.Dataset and writing it into a TFRecord ourselves, or is this save function supposed to replace this process?
TFRecords have several benefits, especially with large datasets: using a binary file format for storing your data can have a significant impact on the performance of your import pipeline and, as a consequence, on the training time of your model. Binary data takes up less space on disk, takes less time to copy, and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to their much lower read/write performance compared with SSDs.
tf.data.experimental.save and tf.data.experimental.load will be useful if you are not worried about the performance of your import pipeline.
tf.data.experimental.save - The saved dataset is stored in multiple file "shards". By default, the dataset output is divided into shards in a round-robin fashion. Datasets saved through tf.data.experimental.save should only be consumed through tf.data.experimental.load, which is guaranteed to be backwards compatible.
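For example (a minimal sketch; note that in TF 2.3/2.4 tf.data.experimental.load also requires the dataset's element_spec, while newer releases can infer it):
import tensorflow as tf

dataset = tf.data.Dataset.range(10).map(lambda x: x * 2)

# Save the dataset as a directory of shard files.
tf.data.experimental.save(dataset, "/tmp/my_dataset")

# Reload it later; the element_spec must match the saved dataset.
restored = tf.data.experimental.load(
    "/tmp/my_dataset", element_spec=dataset.element_spec
)
for item in restored:
    print(item.numpy())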

Is there a way to keep a Tensorflow record file in memory?

Here is the situation: I am working with a large Tensorflow record file. It's about 50 GB. However, the machine I'm doing this training on has 128 GB of RAM. 50 is less than 128, so even though this is a large file you would think that it would be possible to keep it in memory and save on slow I/O operations. But I'm using the TFRecordDataset class, and it seems like the whole TFRecord system is designed specifically not to do that; I don't see any way to force it to keep the records in memory. And since it reloads them every epoch, I am wasting an inordinate amount of time on slow I/O operations reading from that 50 GB file.
I suppose I could load the records into memory in python and then load them into my model one by one with a feed_dict, bypassing the whole Dataset class. But that seems like a less elegant way to handle things and would require some redesign. Everything would be much simpler if I could just force the TFRecordDataset to load everything into memory and keep it there between epochs...
You need the tf.data.Dataset.cache() operation. To achieve the desired effect (keeping the file in memory), put it right after the TFRecordDataset and don't pass it any arguments:
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.cache()
When the cache() operation is invoked without arguments, caching is done in memory.
Also, if you have some postprocessing of these records, for example with dataset.map(...), it could be even more beneficial to put the cache() operation at the end of the input pipeline.
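For example (a sketch, assuming the parsed records still fit in RAM; the file path and parse function are placeholders):
import tensorflow as tf

filenames = ["/path/to/data.tfrecord"]  # illustrative path

def parse_fn(serialized):
    # Stand-in for the real example parsing / decoding logic.
    return tf.io.parse_single_example(
        serialized, {"feature": tf.io.FixedLenFeature([], tf.string)}
    )

dataset = (
    tf.data.TFRecordDataset(filenames)
    .map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
    .cache()  # parsed records are cached in memory after the first epoch
    .shuffle(10000)
    .batch(32)
    .prefetch(tf.data.experimental.AUTOTUNE)
)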
More information can be found in the "Input Pipeline Performance Guide" Map and Cache section.

Does `config.gpu_options.allow_growth=True` reduce performance in the long run?

I am interested in the costs of using config.gpu_options.allow_growth=True, which I read about here.
I understand that there are some performance losses initially, as tensorflow allocates memory in multiple steps, but are there long-run consequences?
E.g. if I have a computer that only runs tensorflow with config.gpu_options.allow_growth=True, will it, after say an hour of training, run slower (in batches per second) than if I didn't use the option?
When you use allow_growth = True, the GPU memory is not pre-allocated and will be able to grow as you need it. This leads to smaller memory usage (otherwise the default behaviour is to reserve the whole of GPU memory up front), but it can decrease performance if not used properly, as it requires more complex handling of the memory.
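For reference, the option can be set like this (a minimal sketch; the ConfigProto form matches the question, and the TF2-style per-GPU memory growth is shown as an alternative, not to be combined in the same program):
import tensorflow as tf

# TF1-style session configuration, as in the question.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)

# TF2 alternative: enable memory growth per physical GPU,
# before any GPU has been initialized.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)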

RRDtool: what use are multiple RRAs?

I'm trying to implement rrdtool. I've read the various tutorials and got my first database up and running. However, there is something that I don't understand.
What eludes me is why so many of the examples I come across instruct me to create multiple RRAs?
Allow me to explain: Let's say I have a sensor that I wish to monitor. I will want to ultimately see graphs of the sensor data on an hourly, daily, weekly and monthly basis and one that spans (I'm still on the fence on this one) about 1.5 yrs (for visualising seasonal influences).
Now, why would I want to create an RRA for each of these views? Why not just create a database like this (stepsize=300 seconds):
DS:sensor:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:160000
If I understand correctly, I can then create any graph I desire, for any given period with whatever resolution I need.
What would be the use of all the other RRAs people tell me I need to define?
BTW: I can imagine that in the past this would have been helpful when computing power was scarcer. Nowadays, with fast disks, high-speed interfaces and powerful CPUs, I guess you don't need the kind of pre-processing that RRAs seem to be designed for.
EDIT:
I'm aware of this page. Although it explains consolidation very clearly, my understanding is that rrdtool graph can do this consolidation as well at the moment the data is graphed. There still appears (to me) to be no added value in "harvest-time consolidation".
Each RRA is a pre-consolidated set of data points at a specific resolution. This performs two important functions.
Firstly, it saves on disk space. If you are interested in high-detail graphs for the last 24h, but only low-detail graphs for the last year, then you do not need to keep the high-detail data for a whole year -- consolidated data will be sufficient. In this way, you can minimise the amount of storage required to hold the data for graph generation (although of course you lose the detail, so you can't access it if you should want to). Yes, disk is cheap, but if you have a lot of metrics and are keeping low-resolution data for a long time, this can add up to a surprisingly large amount of space (in our case, it would be in the hundreds of GB).
Secondly, it means that the consolidation work is moved from graphing time to update time. RRDTool generates graphs very quickly because most of the calculation work is already done in the RRAs at update time, if there is an RRA of the required configuration. If there is no RRA available at the correct resolution, then RRDtool will perform the consolidation on the fly from a high-granularity RRA, but this takes time and CPU. RRDTool graphs are usually generated on the fly by CGI scripts, so this is important, particularly if you expect to have a large number of queries coming in. In your example, using a single 5-min RRA to make a 1.5-yr graph (where 1 pixel would cover about 1 day), you would need to read and process 288 times more data to generate the graph than if you had a 1-day-granularity RRA available!
In short, yes, you could have a single RRA and let the graphing work harder. If your particular implementation needs faster updates and doesn't care about slower graph generation, and you need to keep the detailed data for the entire time, then maybe this is a solution for you, and RRDTool can be used in this way. However, usually, people will optimise for graph generation and disk space, meaning they use tiered sets of RRAs with decreasing granularity.
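For illustration, a tiered definition covering the hourly/daily/weekly/monthly and 1.5-year views from the question might look like this (a sketch only; the row counts are assumptions, not tuned values):
# 5-min raw data for 1 week (2016 rows), 1-hour averages for 60 days (1440 rows),
# 1-day averages for roughly 1.5 years (550 rows).
rrdtool create sensor.rrd --step 300 \
DS:sensor:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:2016 \
RRA:AVERAGE:0.5:12:1440 \
RRA:AVERAGE:0.5:288:550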