What is the difference between TFRecordDataset and FixedLengthRecordDataset? - tensorflow

It would be great to get a use case, possibly from a project, and an explanation of when to use each. Thanks in advance.

TFRecordDataset, FixedLengthRecordDataset and TextLineDataset are subclasses of Dataset.
Dataset is the base class containing methods to create and transform datasets. It also allows you to initialize a dataset from data in memory or from a Python generator.
Since release 1.4, the Dataset API is the recommended way to create input pipelines for TensorFlow models. It is much more performant than using feed_dict or the queue-based pipelines, and it's cleaner and easier to use.
As a use case, you can think of the pre-processing of data to feed it into a model for training (Examples in the links below are pretty self-explanatory).
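For instance, a dataset can be built directly from in-memory arrays. A minimal sketch (the arrays below are placeholders, and the eager iteration assumes TF 2.x):
#Python
import tensorflow as tf

features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # placeholder feature rows
labels = [0, 1, 0]                               # placeholder labels

# Build a Dataset from in-memory data, then shuffle and batch it
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.shuffle(3).batch(2)

for batch_features, batch_labels in dataset:
    print(batch_features.numpy(), batch_labels.numpy())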
TFRecordDataset: Reads records from TFRecord files (Example 1, Example 2).
#Python
dataset = tf.data.TFRecordDataset("/path/to/file.tfrecord")
FixedLengthRecordDataset: Reads fixed-size records from binary files (Example).
#Python
images = tf.data.FixedLengthRecordDataset(
    images_file, 28 * 28, header_bytes=16).map(decode_image)
TextLineDataset: Reads lines from text files.
See this documentation (TextLineDataset example included)
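A minimal sketch, with a placeholder path rather than one taken from the linked documentation:
#Python
dataset = tf.data.TextLineDataset("/path/to/file.txt")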

Related

How to load in a downloaded tfrecord dataset into TensorFlow?

I am quite new to TensorFlow, and have never worked with TFRecords before.
I have downloaded a dataset of images from online and the download format was TFRecord.
This is the file structure in the downloaded dataset (shown in the original post as two screenshots: the top-level folders, and an example of the contents of the "test" folder).
What I want to do is load in the training, validation and testing data into TensorFlow in a similar way to what happens when you load a built-in dataset, e.g. you might load in the MNIST dataset like this, and get arrays containing pixel data and arrays containing the corresponding image labels.
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
However, I have no idea how to do so.
I know that I can use dataset = tf.data.TFRecordDataset(filename) somehow to open the dataset, but would this act on the entire dataset folder, one of the subfolders, or the actual files? If it is the actual files, would it be on the .TFRecord file? How do I use/what do I do with the .PBTXT file which contains a label map?
And even after opening the dataset, how can I extract the data and create the necessary arrays which I can then feed into a TensorFlow model?
It's mostly archaeology, plus a few tricks.
First, I'd read the README.dataset and README.roboflow files. Can you show us what's in them?
Second, .pbtxt files are text-formatted, so we may be able to tell what that file is if you just open it with a text editor. Can you show us what's in that?
The thing to remember about a TFRecord file is that it's nothing but a sequence of binary records. tf.data.TFRecordDataset('balls.tfrecord') will give you a dataset that yields those records in order.
The third step, decoding those records, is the hard part, because here you'll have binary blobs of data, but we don't have any clues yet about how they're encoded.
It's common for TFRecord files to contain serialized tf.train.Example protos.
So it would be worth a shot to try and decode it as a tf.train.Example to see if that tells us what's inside.
# Grab a single raw record from the file
for record in tf.data.TFRecordDataset('balls.tfrecord'):
    break

# Try to parse it as a tf.train.Example and print the result
example = tf.train.Example()
example.ParseFromString(record.numpy())
print(example)
The Example object is just a representation of a dict. If you get something other than an error there, look for the dict keys and see if you can make sense of them.
Then, to make a dataset that decodes them, you'll want something like the following (note that tf.io.parse_example works on a batch of serialized records, so apply it after batching, or swap in tf.io.parse_single_example for single records):
def decode(record):
    # key_dtypes: assumed map of feature key -> dtype discovered above
    return tf.io.parse_example(
        record,
        {key: tf.io.RaggedFeature(dtype) for key, dtype in key_dtypes.items()})
ds = ds.map(decode)

Create round-robin sharding while generating sharded tfrecords

I am new to tensorflow and I am working on an image segmentation problem in tensorflow 1.14. I have a huge dataset, and generating tfrecords is very slow when I try to generate one big tfrecord file, so I would like to create 'n' shards of tfrecords. I could not find a way to do it online. Say I have 600 images and 600 masks. I want to generate 6 shards of tfrecords, with 100 images and 100 masks each, in round-robin fashion. A high-level / pseudo-code of what I want is as follows -
sharded_tf_record_writer:
    create n TFRecordWriters
    for each item in the dataset:
        write_example to a TFRecordWriter in round-robin fashion
I did search online and could not find a relevant answer. I do not want to use Apache Beam for sharding. I appreciate any idea/help/guidance to achieve this.
I had asked the same question in one of the issues of tensorflow datasets, and the user Conchylicultor responded with this:
Writing is done by _TFRecordWriter. Tfds will automatically compute the required number of shards and distribute examples across shards; however, each shard is written sequentially.
You do not have control over the number of shards; it is also automatically computed.
However, the fact that examples are distributed between shards does not make the writing faster, as examples are not pre-processed in parallel. If you want parallelism, then you'll have to use Apache Beam, which allows scaling even to huge datasets.
The link to the tensorflow/datasets issue is - https://github.com/tensorflow/datasets/issues/676
This might help.
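If you only need plain round-robin sharding without tfds or Apache Beam, a minimal sketch might look like the following (make_example is a hypothetical helper that builds a tf.train.Example from one image/mask pair, and the shard naming is just a convention):
import tensorflow as tf

num_shards = 6
writers = [
    tf.io.TFRecordWriter("data-%05d-of-%05d.tfrecord" % (i, num_shards))
    for i in range(num_shards)
]

# samples is assumed to be a list of (image_path, mask_path) pairs
for idx, (image_path, mask_path) in enumerate(samples):
    example = make_example(image_path, mask_path)   # hypothetical helper
    writers[idx % num_shards].write(example.SerializeToString())

for writer in writers:
    writer.close()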
Since you are working with object detection in tensorflow, there is some nice code in the official Tensorflow models repository that will do what you want. Note this code is for Tensorflow 2 (not sure if it'll work in TF1).
See this example of writing sharded tfrecords from coco annotations. The idea is that you open up a list of TFRecordWriters in an exit stack (using contextlib2.ExitStack()), which will automatically close the TFRecords when each thread finishes writing to it.
The utility function open_sharded_output_tfrecords creates this list of TFRecordWriters.
import contextlib2
import tensorflow as tf

# open_sharded_output_tfrecords comes from the object detection utilities
# (tf_record_creation_util) in the Tensorflow models repository
with contextlib2.ExitStack() as tf_record_close_stack, tf.gfile.GFile(
    annotations_file, 'r'
) as fid:
    output_tfrecords = tf_record_creation_util.open_sharded_output_tfrecords(
        tf_record_close_stack, output_path, num_shards
    )
Next you can use a ProcessPoolExecutor to write tfrecords into each shard in a round-robin fashion, in parallel (4 workers in this example):
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(4) as executor:
    futures = []
    for idx, image in enumerate(images):
        future = executor.submit(
            _write_tf_record,
            image,
            idx,
            num_shards,
            output_tfrecords,
        )
        futures.append(future)
    for future in futures:
        future.result()
where _write_tf_record may look something like this:
def _write_tf_record(image, idx, num_shards, output_tfrecords):
    tf_example = create_tf_example(image)   # build the Example proto
    shard_idx = idx % num_shards            # pick the shard round-robin
    output_tfrecords[shard_idx].write(tf_example.SerializeToString())
Just make sure you have more shards than multiprocess workers, otherwise the same writer may be accessed by two different processes.

Using a subset of tfrecord

Is it possible to use an existing tfrecord for only one, or a subset, of the labels that were used to generate it?
I'm training several models with the same data; each would require only one or a subset of the labels used to originally create the tfrecord. The tfrecord is quite large, so I want to avoid creating one for each model's subset of labels.
tf.data.Datasets have filter, skip and take methods which you may find useful. Alternatively you could split your original dataset across multiple tfrecord files and create a Dataset based on a subset of those files.
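For example, here is a hedged sketch of using filter to keep only examples whose label is in a chosen subset. It assumes each record is a serialized tf.train.Example with an int64 "label" feature; adjust the feature spec and filenames to match your data:
import tensorflow as tf

wanted_labels = tf.constant([3, 7], dtype=tf.int64)  # assumed subset of labels

def has_wanted_label(serialized):
    parsed = tf.io.parse_single_example(
        serialized, {"label": tf.io.FixedLenFeature([], tf.int64)})
    return tf.reduce_any(tf.equal(parsed["label"], wanted_labels))

ds = tf.data.TFRecordDataset("data.tfrecord")  # placeholder filename
ds = ds.filter(has_wanted_label)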
If you are happy to recreate the data using tensorflow_datasets, splits may also give you what you want.

What is the best way to feed the image+vector dataset to Tensorflow

I am trying to do a Deep Learning project by using Tensorflow.
Each of my data samples consists of 2 files (a PNG image file + a TXT vector file), which are put in different folders as follows:
./data/image/  #Folder contains images of different sizes
./data/vector/ #Folder contains the vector for each corresponding image
#For example: apple.png + apple.txt
The example content of vector shows as follow:
10.0,2.5,5,13
And since the image sizes are different, resizing and some transformations applied to the vectors are required. It is important that I can do this processing while Tensorflow is running. Is there any good way to manage this kind of dataset?
I have referred to a lot of basic tutorials, however most of them do not give many details about arranging customized data input and output. Please give me some advice!
I recommend you take a look at TFRecords and queues. Basically the idea is the following: you resize all your images to the same format and store them together with your txt vectors in one TFRecord file. This is done separately, before you run your model.
When you create your model, you create a queue which reads data from the TFRecord file and feeds it to your model.
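A rough sketch of the "write everything into one TFRecord" step (the paths, target size, and Pillow-based resizing are assumptions, not part of the original answer):
import numpy as np
import tensorflow as tf
from PIL import Image  # assumption: Pillow is used for resizing

def make_example(image_path, vector_path, size=(128, 128)):
    # Resize the image and store its raw bytes plus the float vector
    image = np.asarray(Image.open(image_path).resize(size), dtype=np.uint8)
    vector = [float(x) for x in open(vector_path).read().strip().split(",")]
    return tf.train.Example(features=tf.train.Features(feature={
        "image": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image.tobytes()])),
        "vector": tf.train.Feature(
            float_list=tf.train.FloatList(value=vector)),
    }))

with tf.io.TFRecordWriter("data.tfrecord") as writer:
    example = make_example("./data/image/apple.png", "./data/vector/apple.txt")
    writer.write(example.SerializeToString())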

Caching a dataset with examples of varied length

My dataset is comprised of audio segments of between 5-180 seconds. The number of examples is small enough to allow caching it in memory, instead of reading from the disk over and over. Storing the data in a constant tensor / variable and using tf.train.slice_input_producer will allow me to cache the dataset in memory, but it requires storing all the data in one matrix. Since some examples are much longer than others, this matrix might be unnecessarily large and perhaps too large for the RAM.
I can simply have a list of numpy arrays for my data and do the whole input reading-randomizing-preprocessing in a non-tensorflow way with a feed_dict, but I wonder if there is a way to do it without completely giving up on tensorflow for the input reading-randomizing-preprocessing part.
Thanks!
The more recent tf.data library provides a tf.data.Dataset.cache method to cache an entire dataset into memory or into a file.
For instance:
dataset = ...
dataset = dataset.map(preprocessing_fn) # apply preprocessing
dataset = dataset.cache() # cache entire dataset in memory after preprocessing
I've provided more details on how to use cache() in this answer.
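As a hedged sketch of how cache() might fit into a pipeline for variable-length audio (the generator, output signature, and batch size below are assumptions, and padded_batch with a default padding shape needs a reasonably recent TF 2.x):
import tensorflow as tf

# audio_generator is assumed to yield (audio, label) pairs, where audio is a
# 1-D float32 array of varying length
ds = tf.data.Dataset.from_generator(
    audio_generator,
    output_signature=(tf.TensorSpec([None], tf.float32),
                      tf.TensorSpec([], tf.int32)))

ds = ds.map(preprocessing_fn)   # per-example preprocessing (assumed function)
ds = ds.cache()                 # keep the preprocessed examples in memory
ds = ds.shuffle(1000)
ds = ds.padded_batch(32)        # pad variable-length examples within each batch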