tf.data.experimental.save VS TFRecords - tensorflow

I have noticed that the method tf.data.experimental.save (added in r2.3) allows you to save a tf.data.Dataset to file in just one line of code, which seems extremely convenient. Are there still benefits to serializing a tf.data.Dataset and writing it into a TFRecord ourselves, or is this save function meant to replace that process?

TFRecords have several benefits, especially with large datasets. If you are working with large datasets, using a binary file format for storage can have a significant impact on the performance of your import pipeline and, as a consequence, on the training time of your model. Binary data takes up less space on disk, takes less time to copy, and can be read much more efficiently from disk. This is especially true if your data is stored on spinning disks, due to their much lower read/write performance compared with SSDs.
tf.data.experimental.save and tf.data.experimental.load are useful if you are not worried about the performance of your import pipeline.
tf.data.experimental.save - The dataset is saved in multiple file "shards". By default, the dataset output is divided into shards in a round-robin fashion. Datasets saved through tf.data.experimental.save should only be consumed through tf.data.experimental.load, which is guaranteed to be backwards compatible.
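For reference, a minimal sketch of the save/load round trip (TF 2.3+; the path and the toy dataset are placeholders chosen for illustration):
import tensorflow as tf

dataset = tf.data.Dataset.range(10)
path = "/tmp/saved_dataset"

# One-line save; shard files and metadata are written under `path`.
tf.data.experimental.save(dataset, path)

# In TF 2.3/2.4, load() needs the element_spec of the original dataset.
restored = tf.data.experimental.load(path, element_spec=dataset.element_spec)
for x in restored.take(3):
    print(x.numpy())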

Related

optimal size of a tfrecord file

From your experience, what would be an ideal size of a .tfrecord file that works best across a wide variety of devices (hard disk, SSD, NVMe) and storage locations (local machine, HPC cluster with network mounts)?
In case I get slower performance on a technically more powerful computer in the cloud than on my local PC, could the size of the TFRecord dataset be the root cause of the bottleneck?
Thanks
The official TensorFlow performance guide recommends ~100MB (https://docs.w3cub.com/tensorflow~guide/performance/performance_guide/):
Reading large numbers of small files significantly impacts I/O performance. One approach to get maximum I/O throughput is to preprocess input data into larger (~100MB) TFRecord files. For smaller data sets (200MB-1GB), the best approach is often to load the entire data set into memory.
Currently (19-09-2020) Google recommends the following rule of thumb:
"In general, you should shard your data across multiple files so that you can parallelize I/O (within a single host or across multiple hosts). The rule of thumb is to have at least 10 times as many files as there will be hosts reading data. At the same time, each file should be large enough (at least 10+MB and ideally 100MB+) so that you benefit from I/O prefetching. For example, say you have X GBs of data and you plan to train on up to N hosts. Ideally, you should shard the data to ~10N files, as long as ~X/(10N) is 10+ MBs (and ideally 100+ MBs). If it is less than that, you might need to create fewer shards to trade off parallelism benefits and I/O prefetching benefits."
Source: https://www.tensorflow.org/tutorials/load_data/tfrecord
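As an illustration of that rule of thumb, here is a rough sketch of writing a dataset of serialized tf.train.Example strings into multiple shard files and reading them back with parallel I/O (the shard count, file names, and toy feature are assumptions for illustration):
import tensorflow as tf

# Hypothetical example: each element is a serialized tf.train.Example string.
def make_example(i):
    feature = {"value": tf.train.Feature(int64_list=tf.train.Int64List(value=[i]))}
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

serialized_examples = tf.data.Dataset.from_tensor_slices(
    [make_example(i) for i in range(1000)])

NUM_SHARDS = 10  # rule of thumb: ~10x the number of reader hosts

for i in range(NUM_SHARDS):
    shard = serialized_examples.shard(NUM_SHARDS, i)
    writer = tf.data.experimental.TFRecordWriter(
        "train-{:05d}-of-{:05d}.tfrecord".format(i, NUM_SHARDS))
    writer.write(shard)

# Reading back with parallel I/O across the shards:
files = tf.data.Dataset.list_files("train-*.tfrecord")
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,  # number of shard files read concurrently
    num_parallel_calls=tf.data.experimental.AUTOTUNE)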

Is there a way to keep a Tensorflow record file in memory?

Here is the situation: I am working with a large TensorFlow record file. It's about 50 GB. However, the machine I'm doing this training on has 128 GB of RAM. 50 is less than 128, so even though this is a large file you would think it would be possible to keep it in memory and save on slow I/O operations. But I'm using the TFRecordDataset class, and it seems like the whole TFRecord system is designed specifically not to do that, and I don't see any way to force it to keep the records in memory. And since it reloads them every epoch, I am wasting an inordinate amount of time on slow I/O operations reading from that 50 GB file.
I suppose I could load the records into memory in python and then load them into my model one by one with a feed_dict, bypassing the whole Dataset class. But that seems like a less elegant way to handle things and would require some redesign. Everything would be much simpler if I could just force the TFRecordDataset to load everything into memory and keep it there between epochs...
You need the tf.data.Dataset.cache() operation. To achieve the desired effect (keeping the file in memory), put it right after the TFRecordDataset and don't pass it any arguments:
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.cache()
When the cache() operation is invoked without arguments, caching is done in memory.
Also, if you do some post-processing of these records, for example with dataset.map(...), it can be even more beneficial to put the cache() operation at the end of the input pipeline.
More information can be found in the "Input Pipeline Performance Guide" Map and Cache section.
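For example, a minimal sketch with a parsing step before the cache (the feature spec and file name are placeholders; adapt them to your records):
import tensorflow as tf

def parse_fn(serialized):
    # Placeholder parser - adjust the feature spec to your records.
    features = {"image": tf.io.FixedLenFeature([], tf.string),
                "label": tf.io.FixedLenFeature([], tf.int64)}
    return tf.io.parse_single_example(serialized, features)

dataset = tf.data.TFRecordDataset(["data.tfrecord"])
dataset = dataset.map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.cache()  # keeps the *parsed* records in RAM after the first epoch
dataset = dataset.shuffle(10000).batch(32).prefetch(tf.data.experimental.AUTOTUNE)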

Optimal data streaming and processing solution for enormous datasets into tf.data.Dataset

Context:
My text input pipeline currently consists of two main parts:
I. A complex text preprocessing and exporting of tf.SequenceExamples to tfrecords (custom tokenization, vocabulary creation, statistics calculation, normalization and many more over the full dataset as well as per each individual example). That is done once for each data configuration.
II. A tf.Dataset (TFRecords) pipeline that does quite a bit of processing during training, too (string_split into characters, table lookups, bucketing, conditional filtering, etc.).
Original Dataset is present across multiple locations (BigQuery, GCS, RDS, ...).
Problem:
The problem is that as the production dataset increases rapidly (several terabytes), it is not feasible to recreate TFRecord files for each possible data configuration (part I has a lot of hyperparameters), as each would reach an enormous size of hundreds of terabytes. Not to mention that tf.data.Dataset reading speed surprisingly slows down as tf.SequenceExamples or TFRecords grow in size.
There are quite a few possible solutions:
Apache Beam + Cloud DataFlow + feed_dict;
tf.Transform;
Apache Beam + Cloud DataFlow + tf.Dataset.from_generator;
tensorflow/ecosystem + Hadoop or Spark
tf.contrib.cloud.BigQueryReader
, but none of them seems to fully fulfill my requirements:
Streaming and processing on the fly data from BigQuery, GCS, RDS, ... as in part I.
Sending data (protos?) directly to tf.Dataset in one way or another to be used in part II.
Fast and reliable for both training and inference.
(optional) Being able to pre-calculate some full pass statistics over the selected part of the data.
EDIT: Python 3 support would be just wonderful.
What is the most suitable choice for the tf.data.Dataset pipeline? What are the best practices in this case?
Thanks in advance!
I recommend orchestrating the whole use case with Cloud Composer (the GCP-managed Airflow service).
Airflow provides operators that let you orchestrate a pipeline with a script.
In your case you can use the dataflow_operator to spin up a Dataflow job once you have enough data to process.
To get the data from BigQuery you can use the bigquery_operator.
Furthermore, you can use the PythonOperator or the BashOperator to monitor and pre-calculate statistics.
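A very rough sketch of such a DAG, assuming Airflow 1.x contrib import paths (which may differ by version); the query, table, bucket, and Beam script names are all hypothetical placeholders:
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

with DAG("text_preprocessing", start_date=datetime(2020, 1, 1),
         schedule_interval="@daily") as dag:

    # Stage the relevant slice of the raw data into a BigQuery table.
    extract = BigQueryOperator(
        task_id="extract_from_bq",
        sql="SELECT * FROM `project.dataset.raw_text`",   # hypothetical query
        destination_dataset_table="project.dataset.staged_text",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
    )

    # Run the Beam preprocessing job (part I) on Dataflow.
    preprocess = DataFlowPythonOperator(
        task_id="preprocess_on_dataflow",
        py_file="gs://my-bucket/beam/preprocess.py",       # hypothetical Beam pipeline
        options={"output": "gs://my-bucket/tfrecords/"},
    )

    extract >> preprocess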

I/O performance difference for sequential vs random access with MxNet data iterators?

I would like to supply to a network many training images that are sampled from a dataset by following certain sampling rules. Now I have two choices:
Use the sampling logic to generate a list of images offline, then convert the .lst file to a .rec file and use a sequential DataIter to access it.
Write my own child class of DataIter that can sample the images online. As a result, the class needs to support random access, perhaps by inheriting from MXIndexedRecordIO. I would need to create a .rec file for the original dataset.
My intuition tells me that sequential access will be faster than random access for a .rec file. But I don't know if the difference is big enough to be worth the additional time I would spend writing and testing my own iterator class. Could anyone give me a hint on this?
In your case you are better off prepacking the images using MXRecordIO. It will give you a performance boost and also introduce consistency in how you handle the dataset.
It stores the records in a .rec file as a list, where order matters.
You can then use mxnet.image.ImageIter to iterate over the .rec file in order.
http://mxnet.io/api/python/io.html#mxnet.image.ImageIter
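A minimal usage sketch (the .rec/.idx paths and the data shape are hypothetical; the record file is assumed to have been packed beforehand, e.g. with im2rec):
import mxnet as mx

data_iter = mx.image.ImageIter(
    batch_size=32,
    data_shape=(3, 224, 224),   # (channels, height, width)
    path_imgrec="train.rec",
    path_imgidx="train.idx",
    shuffle=False,              # keep the sequential order of the .rec file
)

for batch in data_iter:
    pass                        # batch.data[0] is an NDArray of shape (32, 3, 224, 224)
data_iter.reset()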
Since this is a question about performance, I guess it also depends on how fast your network can process images, which in turn depends on what hardware you are running your training on.

Store a tf.Saver.save checkpoint in a variable (or in memory)

I am using Tensorflow and storing the current "best" model on the hard drive for persistence, using tf.Saver:
saver = tf.train.Saver(max_to_keep=1)
[...]
saver.save(
    sess,
    path_to_file,
    global_step=epoch
)
My network is rather small and very fast to run; a single epoch on the GPU runs in less than 10 seconds. However, saving the model to the hard drive takes between one and two minutes, taking up a lot of time.
Is it possible to store the model in memory, to avoid taking up such a big chunk of the overall run time? If I somehow could store the "best" model in memory for a while, and dump it once I tell the model to, I could cut down the overall run time by a big factor.
I've looked at the tf.Saver documentation and implementation, and I cannot see any way to achieve just what I want. Is there some other implementation or tool that can do what I want?
I don't think tf.Saver supports this. You can, however, mount an in-memory filesystem (like tmpfs on Linux) and save to that directory, which should not touch any disks.
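A rough sketch of that approach, assuming /dev/shm (a tmpfs mount present on most Linux systems) and placeholder paths; the training loop details are elided:
import shutil
import tensorflow as tf

RAM_DIR = "/dev/shm/best_model"   # RAM-backed directory; writes here never hit the disk
saver = tf.train.Saver(max_to_keep=1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(num_epochs):            # num_epochs / training loop defined elsewhere
        # ... run one training epoch, track the best metric ...
        saver.save(sess, RAM_DIR + "/model.ckpt", global_step=epoch)

# Once training finishes, persist the best checkpoint to disk in a single copy.
shutil.copytree(RAM_DIR, "/path/to/persistent/checkpoints")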