Tensorflow: Count number of examples in a TFRecord file -- without using deprecated `tf.python_io.tf_record_iterator` - tensorflow

Please read post before marking Duplicate:
I was looking for an efficient way to count the number of examples in a TFRecord file of images. Since a TFRecord file does not save any metadata about the file itself, the user has to loop through the file in order to calculate this information.
There are a few different questions on StackOverflow that answer this question. The problem is that all of them seem to use the DEPRECATED tf.python_io.tf_record_iterator command, so this is not a stable solution. Here is the sample of existing posts:
Obtaining total number of records from .tfrecords file in Tensorflow
Number of examples in each tfrecord
Number of examples in each tfrecord
So I was wondering if there was a way to count the number of records using the new Dataset API.

There is a reduce method listed under the Dataset class. They give an example of counting records using the method:
# generate the dataset (batch size and repeat must be 1, maybe avoid dataset manipulation like map and shard)
ds = tf.data.Dataset.range(5)
# count the examples by reduce
cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)
## produces 5
Don't know whether this method is faster than the #krishnab's for loop.

I got the following code to work without the deprecated command. Hopefully this will help others.
Using the Dataset API I setup and iterator and then loop over it. Not sure if this is the fastest, but it works. MAKE SURE THE BATCH SIZE AND REPEAT ARE SET TO 1, otherwise the code will return the number of batches and not the number of examples in the dataset.
count_test = tf.data.TFRecordDataset('testing.tfrecord')
count_test = count_test.map(_parse_image_function)
count_test = count_test.repeat(1)
count_test = count_test.batch(1)
test_counter = count_test.make_one_shot_iterator()
c = 0
for ex in test_counter:
c += 1
f"There are {c} testing records"
This seemed to work reasonably well even on a relatively large file.

The following works for me using TensorFlow version 2.1 (using the code found in this answer):
def count_tfrecord_examples(
tfrecords_dir: str,
) -> int:
"""
Counts the total number of examples in a collection of TFRecord files.
:param tfrecords_dir: directory that is assumed to contain only TFRecord files
:return: the total number of examples in the collection of TFRecord files
found in the specified directory
"""
count = 0
for file_name in os.listdir(tfrecords_dir):
tfrecord_path = os.path.join(tfrecords_dir, file_name)
count += sum(1 for _ in tf.data.TFRecordDataset(tfrecord_path))
return count

Related

How to split the dataset into mutiple folds while keeping the ratio of an attribute fixed

Let's say that I have a dataset with multiple input features and one single output. For the sake of simplicity, let's say the output is binary. Either zero or one.
I want to split this dataset into k parts and use a k-fold cross-validation model to learn the mapping from the input features to the output one. If the dataset is imbalanced, the ratio between the number of records with output 0 and 1 is not going to be one. To make it concrete, let's say that 90% of the records are 0 and only 10% are 1.
I think it's important that within each part of k-folds we should see the same ratio of 0s and 1s in order for successful training (the same 9 to 1 ratio). I know how to do this in Pandas but my question is how to do it in TFX.
Reading the TFX documentation, I know that I can split a dataset by specifying an output_config to the class loading the examples:
output = tfx.proto.Output(
split_config=tfx.proto.SplitConfig(splits=[
tfx.proto.SplitConfig.Split(name='fold_1', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_2', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_3', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_4', hash_buckets=1),
tfx.proto.SplitConfig.Split(name='fold_5', hash_buckets=1)
]))
example_gen = CsvExampleGen(input_base=input_dir, output_config=output)
But then, the aforementioned ratio of the examples in each fold will be random at best. My question is: Is there any way I can specify what goes into each split? Can I somehow enforce the ratio of a feature?
BTW, I have seen and experimented with the partition_feature_name argument of the SplitConfig class. It's not useful here unless there's a feature with the ID of the fold for each example which I think is not practical since I might want to change the number of folds as part of the experiment without changing the dataset.
I'm going to answer my own question but only as a workaround. I'll be happy to see someone develop a real solution to this question.
What I could come up with at this point was to split the dataset into a number of tfrecord files. I've chosen a "composite" number of files so I can split them into (almost) any number I want. For this, I've settled down on 60 since it can be divided by 2, 3, 4, 5, 6, 10, and 12 (I don't think anyone would want KFold with k higher than 12). Then at the time of loading them, I have to somehow select which files will go into each split. There are two things to consider here.
First, the ImportExampleGen class from TFX supports glob file patterns. This means we can have multiple files loaded for each split:
input = tfx.proto.Input(splits=[
tfx.proto.Input.Split(name="fold_1", pattern="fold_1*"),
tfx.proto.Input.Split(name="fold_2", pattern="fold_2*")
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
input_config=input)
Next, we need some ingenuity to enable splitting the files into any number we like at the time of loading them. And this is my approach to it:
fold_3.0_4.0_5.0_6.0_10.0/part-###.tfrecords.gz
fold_3.0_4.0_5.1_6.0_10.6/part-###.tfrecords.gz
fold_3.0_4.0_5.2_6.0_10.2/part-###.tfrecords.gz
fold_3.0_4.0_5.3_6.0_10.8/part-###.tfrecords.gz
...
The file pattern is like this. Between each two _ I include the divisor, a ., and then the remainder. And I'll have as many of these as I want to have the "split possibility" later, at the time of loading the dataset.
In the example above, I'll have the option to load them into 3, 4, 5, 6, and 10 folds. The first file will be loaded as part of the 0th split if I want to split the dataset into any number of folds while the second file will be in the 1st split of 5-fold and 6th split of 10-fold.
And this is how I'll load them:
NUM_FOLDS = 5
input = tfx.proto.Input(splits=[
tfx.proto.Input.Split(name=f'fold_{index + 1}',
pattern=f"fold_*{str(NUM_FOLDS)+'.'+str(index)}*/*")
for index in range(NUM_FOLDS)
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
input_config=input)
I could change the NUM_FOLDS to any of the options 3, 4, 5, 6, or 10 and the loaded dataset will consist of pre-curated k-fold splits. It is worth mentioning that I have made sure of the ratio of the samples within each file at the time of creating them. So any combination of them will also have the same ratio.
Again, this is only a trick in the absence of an actual solution. The main drawback of this approach is the fact that you have to split the dataset manually yourself. I've done so, in this case, using pandas. That meant that I had to load the whole dataset into memory. Which might not be possible for all the datasets.

Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset
def parse_data(input_data):
frame = open_dataset.Frame()
frame.ParseFromString(bytearray(input_data.numpy()))
av_speed = (frame.images[0].velocity.v_x, frame.images[0].velocity.v_y, frame.images[0].velocity.v_z)
return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought about is to give the map function dataset and dataset.skip(1) as inputs but I'm not sure about it yet.
I am not sure but it might be unnecessary to make your mapped function a tf.py_function. How parse_data is supposed to look like depends on your dataset dataset_ext. If it has for example two file paths (1 instace of input data and 1 instance of output data), the mapping function should have 2 arguments and should return 2 arguments.
For example: if your dataset contains images and you want them to be randomly cropped each time an example of your dataset is drawn the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
merged = tf.stack([img_in, img_out])
mergedCrop = tf.image.random_crop(merged, size=(2,) + output_shape)
img_in_cropped, img_out_cropped = tf.unstack(mergedCrop, 2, 0)
return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, got what you meant with you the two frames. So the map function is applied to each entry of your dataset separatly. If you need cross-entry information, a single entry of your dataset needs to contain two frames. With this more complicated set-up, I would suggest you to use a tensorflow Sequence: The explanation from the tensorflow team is pretty straigth forward. Hope this help!

How to find the size of tensorflow dataset object?

I have created tensorflow dataset object and I would like to know the size of this dataset.
Sadly tf.data.Dataset doesn't have a fully defined length.
One workaround approach would be to iterate over it once to get the number of elements
def get_ds_length(dataset):
len = 0
for _ in dataset:
len += 1
return len
Obviously this would be slow for large datasets and ones that use heavy preprocessing.

How to create a fixed length tf.Dataset from generator?

I have a generator which yields infinite amount of data (Random image crops). I would like to create a tf.Dataset based on let's say 10,000 first data points and cache it to use them to train models?
Currently, I have a generator which takes 1-2 seconds to create each datapoint and this is the main performance blocker. I have to wait a minute to generate a batch of 64 images (the preprocessing() function is very expensive, so I would like to reuse the results).
ds = tf.Dataset.from_generator() method allows us to create such infinite dataset. Instead, I would like to create a finite dataset using N first outputs from the generator and cache it like:
ds = ds.cache().
Alternative solution is to keep generating new data, and using cached datapoints while rendering the generator.
You can use the Dataset.cache function with the Dataset.take function to accomplish this.
If everything fits in memory its as simple as doing something like this:
def generate_example():
i = 0
while(True):
print ('yielding value {}'.format(i))
yield tf.random.uniform((64,64,3))
i +=1
ds = tf.data.Dataset.from_generator(generate_example, tf.float32)
first_n_datapoints = ds.take(n).cache()
Now note, that if I set n to 3 say then do something trivial like:
for i in first_n_datapoints.repeat():
print ('')
print (i.shape)
then I see output confirming that the first 3 values are cached (I only see the yielding value {i} output once for each of the first 3 values generated:
yielding value 0
(64,64,3)
yielding value 1
(64,64,3)
yielding value 2
(64,64,3)
(64,64,3)
(64,64,3)
(64,64,3)
...
If everything does not fit in memory then we can pass a filepath to the cache function where it will cache the generated tensors to disk.
More info here: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#cache

Reading sequential data from TFRecords files within the TensorFlow graph?

I'm working with video data, but I believe this question should apply to any sequential data. I want to pass my RNN 10 sequential examples (video frames) from a TFRecords file. When I first start reading the file, I need to grab 10 examples, and use this to create a sequence-example which is then pushed onto the queue for the RNN to take when it's ready. However, now that I have the 10 frames, next time I read from the TFRecords file, I only need to take 1 example and just shift the other 9 over. But when I hit the end of the first TFRecords file, I need to restart the process on the second TFRecords file. It's my understanding that the cond op will process the ops required under each condition even if that condition is not the one that is to be used. This would be a problem when using a condition to check whether to read 10 examples or only 1. Is there anyway to resolve this problem to still have the desired result outlined above?
You can use the recently added Dataset.window() transformation in TensorFlow 1.12 to do this:
filenames = tf.data.Dataset.list_files(...)
# Define a function that will be applied to each filename, and return the sequences in that
# file.
def get_examples_from_file(filename):
# Read and parse the examples from the file using the appropriate logic.
examples = tf.data.TFRecordDataset(filename).map(...)
# Selects a sliding window of 10 examples, shifting along 1 example at a time.
sequences = examples.window(size=10, shift=1, drop_remainder=True)
# Each element of `sequences` is a nested dataset containing 10 consecutive examples.
# Use `Dataset.batch()` and get the resulting tensor to convert it to a tensor value
# (or values, if there are multiple features in an example).
return sequences.map(
lambda d: tf.data.experimental.get_single_element(d.batch(10)))
# Alternatively, you can use `filenames.interleave()` to mix together sequences from
# different files.
sequences = filenames.flat_map(get_examples_from_file)