Reading sequential data from TFRecords files within the TensorFlow graph?

I'm working with video data, but I believe this question applies to any sequential data. I want to pass my RNN 10 sequential examples (video frames) from a TFRecords file. When I first start reading the file, I need to grab 10 examples and use them to create a sequence example, which is then pushed onto the queue for the RNN to take when it's ready. However, once I have those 10 frames, the next time I read from the TFRecords file I only need to take 1 example and shift the other 9 over. And when I hit the end of the first TFRecords file, I need to restart the process on the second TFRecords file. It's my understanding that the cond op will process the ops required under each condition even if that condition is not the one that will be used, which is a problem when using a condition to check whether to read 10 examples or only 1. Is there any way to resolve this and still get the behavior outlined above?

You can use the recently added Dataset.window() transformation in TensorFlow 1.12 to do this:
filenames = tf.data.Dataset.list_files(...)

# Define a function that will be applied to each filename, and return the
# sequences in that file.
def get_examples_from_file(filename):
    # Read and parse the examples from the file using the appropriate logic.
    examples = tf.data.TFRecordDataset(filename).map(...)

    # Select a sliding window of 10 examples, shifting along 1 example at a time.
    sequences = examples.window(size=10, shift=1, drop_remainder=True)

    # Each element of `sequences` is a nested dataset containing 10 consecutive
    # examples. Use `Dataset.batch()` and get the resulting tensor to convert it
    # to a tensor value (or values, if there are multiple features in an example).
    return sequences.map(
        lambda d: tf.data.experimental.get_single_element(d.batch(10)))

# Alternatively, you can use `filenames.interleave()` to mix together sequences
# from different files.
sequences = filenames.flat_map(get_examples_from_file)
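For a more concrete picture, here is a minimal end-to-end sketch of that pipeline under a few assumptions of mine: the feature key "frame", the frame shape, the file pattern, and the batch size are hypothetical placeholders, and the parsing logic would need to match your actual TFRecord schema.

import tensorflow as tf

def parse_frame(serialized):
    # Hypothetical schema: one raw frame stored under the feature key "frame".
    features = tf.parse_single_example(
        serialized, {"frame": tf.FixedLenFeature([], tf.string)})
    frame = tf.decode_raw(features["frame"], tf.uint8)
    return tf.reshape(frame, [64, 64, 3])  # assumed frame shape

def get_sequences_from_file(filename):
    examples = tf.data.TFRecordDataset(filename).map(parse_frame)
    # Sliding window: 10 consecutive frames, advancing one frame at a time.
    windows = examples.window(size=10, shift=1, drop_remainder=True)
    # Convert each nested window dataset into a single [10, 64, 64, 3] tensor.
    return windows.map(
        lambda w: tf.data.experimental.get_single_element(w.batch(10)))

filenames = tf.data.Dataset.list_files("videos-*.tfrecord")  # assumed pattern
sequences = filenames.flat_map(get_sequences_from_file)
batched = sequences.batch(4)  # mini-batches of 4 frame sequences for the RNN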

Related

How to split the dataset into multiple folds while keeping the ratio of an attribute fixed

Let's say that I have a dataset with multiple input features and a single output. For the sake of simplicity, let's say the output is binary: either zero or one.
I want to split this dataset into k parts and use k-fold cross-validation to learn the mapping from the input features to the output. If the dataset is imbalanced, the ratio between the number of records with output 0 and output 1 will not be one to one. To make it concrete, let's say that 90% of the records are 0 and only 10% are 1.
I think it's important that each of the k folds preserves the same ratio of 0s and 1s (the same 9-to-1 ratio) for training to succeed. I know how to do this in Pandas, but my question is how to do it in TFX.
Reading the TFX documentation, I know that I can split a dataset by specifying an output_config to the class loading the examples:
output = tfx.proto.Output(
    split_config=tfx.proto.SplitConfig(splits=[
        tfx.proto.SplitConfig.Split(name='fold_1', hash_buckets=1),
        tfx.proto.SplitConfig.Split(name='fold_2', hash_buckets=1),
        tfx.proto.SplitConfig.Split(name='fold_3', hash_buckets=1),
        tfx.proto.SplitConfig.Split(name='fold_4', hash_buckets=1),
        tfx.proto.SplitConfig.Split(name='fold_5', hash_buckets=1)
    ]))
example_gen = CsvExampleGen(input_base=input_dir, output_config=output)
But then, the aforementioned ratio of the examples in each fold will be random at best. My question is: Is there any way I can specify what goes into each split? Can I somehow enforce the ratio of a feature?
BTW, I have seen and experimented with the partition_feature_name argument of the SplitConfig class. It's not useful here unless there is a feature holding the fold ID of each example, which I don't think is practical, since I might want to change the number of folds as part of the experiment without changing the dataset.
I'm going to answer my own question, but only with a workaround. I'd be happy to see someone post a real solution to this question.
What I could come up with at this point was to split the dataset into a number of TFRecord files. I chose a "composite" number of files so I can combine them into (almost) any number of folds I want. For this, I settled on 60, since it is divisible by 2, 3, 4, 5, 6, 10, and 12 (I don't think anyone would want KFold with k higher than 12). Then, at load time, I have to somehow select which files go into each split. There are two things to consider here.
First, the ImportExampleGen class from TFX supports glob file patterns. This means we can have multiple files loaded for each split:
input = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name="fold_1", pattern="fold_1*"),
    tfx.proto.Input.Split(name="fold_2", pattern="fold_2*")
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
                                              input_config=input)
Next, we need some ingenuity to enable splitting the files into any number we like at the time of loading them. And this is my approach to it:
fold_3.0_4.0_5.0_6.0_10.0/part-###.tfrecords.gz
fold_3.0_4.0_5.1_6.0_10.6/part-###.tfrecords.gz
fold_3.0_4.0_5.2_6.0_10.2/part-###.tfrecords.gz
fold_3.0_4.0_5.3_6.0_10.8/part-###.tfrecords.gz
...
The naming pattern works like this: between each pair of underscores I include a divisor, a ., and then the file's remainder for that divisor. I include one such divisor.remainder pair for every fold count I want to keep as a "split possibility" later, at load time.
In the example above, I have the option of loading the data as 3, 4, 5, 6, or 10 folds. The first file will land in the 0th split for any of those fold counts, while the second file will be in the 1st split of a 5-fold setup and the 6th split of a 10-fold setup. A sketch of how such names can be generated follows.
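As a side note, here is a small sketch of how such directory names could be generated when writing the 60 files; the divisor list and the helper function are my own illustration, not part of the original workaround.

# Hypothetical helper: build the directory name for file number `file_index`
# (out of 60), encoding its remainder for every fold count we may want later.
DIVISORS = [3, 4, 5, 6, 10]

def fold_dir_name(file_index: int) -> str:
    parts = [f"{d}.{file_index % d}" for d in DIVISORS]
    return "fold_" + "_".join(parts)

print(fold_dir_name(0))   # fold_3.0_4.0_5.0_6.0_10.0
print(fold_dir_name(36))  # fold_3.0_4.0_5.1_6.0_10.6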
And this is how I'll load them:
NUM_FOLDS = 5
input = tfx.proto.Input(splits=[
    tfx.proto.Input.Split(name=f'fold_{index + 1}',
                          pattern=f"fold_*{str(NUM_FOLDS)+'.'+str(index)}*/*")
    for index in range(NUM_FOLDS)
])
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder,
                                              input_config=input)
I can change NUM_FOLDS to any of 3, 4, 5, 6, or 10, and the loaded dataset will consist of pre-curated k-fold splits. It is worth mentioning that I enforced the ratio of the samples within each file at the time of creating them, so any combination of files also has the same ratio.
Again, this is only a trick in the absence of an actual solution. The main drawback of this approach is that you have to split the dataset manually yourself. I did so, in this case, using pandas, which meant loading the whole dataset into memory; that might not be possible for every dataset.
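For reference, the kind of manual pandas split described above might look roughly like the sketch below; the source file, the label column name, and the output format are hypothetical, not taken from the original post (the actual workaround wrote one gzipped TFRecord file per directory).

import pandas as pd

NUM_FILES = 60

# Hypothetical input: a table with a binary `label` column (90% zeros, 10% ones).
df = pd.read_csv("dataset.csv")

# Shuffle, then deal the rows of each label class round-robin across the 60 files,
# so every file preserves (approximately) the global 9-to-1 label ratio.
df = df.sample(frac=1.0, random_state=42).reset_index(drop=True)
df["file_index"] = df.groupby("label").cumcount() % NUM_FILES

for file_index, part in df.groupby("file_index"):
    part.drop(columns="file_index").to_csv(f"part-{file_index:03d}.csv", index=False)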

Tensorflow: Count number of examples in a TFRecord file -- without using deprecated `tf.python_io.tf_record_iterator`

Please read the post before marking it as a duplicate:
I was looking for an efficient way to count the number of examples in a TFRecord file of images. Since a TFRecord file does not save any metadata about the file itself, the user has to loop through the file in order to calculate this information.
There are a few different questions on StackOverflow that answer this question. The problem is that all of them seem to use the DEPRECATED tf.python_io.tf_record_iterator command, so this is not a stable solution. Here is a sample of the existing posts:
Obtaining total number of records from .tfrecords file in Tensorflow
Number of examples in each tfrecord
So I was wondering if there was a way to count the number of records using the new Dataset API.
There is a reduce method listed under the Dataset class. They give an example of counting records using the method:
import numpy as np
import tensorflow as tf

# Generate the dataset (batch size and repeat must be 1; it may be best to avoid
# dataset transformations such as map and shard before counting).
ds = tf.data.Dataset.range(5)

# Count the examples with reduce.
cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)
## produces 5
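Applied to an actual TFRecord file, the same pattern would look like the line below (the filename is just a placeholder; in TF 1.x the result would still need to be evaluated in a session, while in TF 2.x eager mode it is a concrete tensor).

num_records = tf.data.TFRecordDataset("testing.tfrecord").reduce(
    np.int64(0), lambda x, _: x + 1)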
I don't know whether this method is faster than @krishnab's for loop.
I got the following code to work without the deprecated command. Hopefully this will help others.
Using the Dataset API, I set up an iterator and then loop over it. I'm not sure if this is the fastest approach, but it works. MAKE SURE THE BATCH SIZE AND REPEAT ARE SET TO 1, otherwise the code will return the number of batches and not the number of examples in the dataset.
count_test = tf.data.TFRecordDataset('testing.tfrecord')
count_test = count_test.map(_parse_image_function)
count_test = count_test.repeat(1)
count_test = count_test.batch(1)
test_counter = count_test.make_one_shot_iterator()

c = 0
for ex in test_counter:
    c += 1
print(f"There are {c} testing records")
This seemed to work reasonably well even on a relatively large file.
The following works for me using TensorFlow version 2.1 (using the code found in this answer):
import os
import tensorflow as tf

def count_tfrecord_examples(
        tfrecords_dir: str,
) -> int:
    """
    Counts the total number of examples in a collection of TFRecord files.

    :param tfrecords_dir: directory that is assumed to contain only TFRecord files
    :return: the total number of examples in the collection of TFRecord files
        found in the specified directory
    """
    count = 0
    for file_name in os.listdir(tfrecords_dir):
        tfrecord_path = os.path.join(tfrecords_dir, file_name)
        count += sum(1 for _ in tf.data.TFRecordDataset(tfrecord_path))
    return count
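A quick usage example (the directory path is hypothetical):

num_examples = count_tfrecord_examples("/data/tfrecords/train")
print(f"Found {num_examples} examples")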

How can we benefit from sharding the data to speed up training time?

My main issue is: I have 204 GB of training TFRecords for 2 million images, and 28 GB of validation TFRecord files for 302,900 images. It takes 8 hours to train one epoch, so training will take 33 days. I want to speed that up by using multiple threads and shards, but I am a little bit confused about a couple of things.
In the tf.data.Dataset API there is a shard function. The documentation says the following about it:
Creates a Dataset that includes only 1/num_shards of this dataset.
This dataset operator is very useful when running distributed training, as it allows each worker to read a unique subset.
When reading a single input file, you can skip elements as follows:
d = tf.data.TFRecordDataset(FLAGS.input_file)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
Important caveats:
Be sure to shard before you use any randomizing operator (such as shuffle).
Generally it is best if the shard operator is used early in the dataset pipeline. For example, when reading from a set of TFRecord files, shard before converting the dataset to input samples. This avoids reading every file on every worker. The following is an example of an efficient sharding strategy within a complete pipeline:
d = Dataset.list_files(FLAGS.pattern)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.repeat()
d = d.interleave(tf.data.TFRecordDataset,
                 cycle_length=FLAGS.num_readers, block_length=1)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
So my question regarding the code above: when I shard my data with the shard function and set the number of shards (num_workers) to 10, I will have 10 splits of my data. Should I then set num_readers in the d.interleave call to 10 to guarantee that each reader takes one of the 10 splits?
And how can I control which split interleave will take? If I set the shard index (worker_index) in the shard function to 1, it will give me the first split. Can anyone give me an idea of how to perform this distributed training using the above functions?
And what about num_parallel_calls? Should I set it to 10 as well?
Note that I have a single TFRecord file for training and another one for validation; I don't split the TFRecord files into multiple files.
First of all, how come the dataset is 204 GB for only 2 million images? I think your images are way too large. Try resizing them; you would probably need to resize them to 224 x 224 in the end anyway.
Second, try to reduce the size of your model. It could be either too deep or not efficient enough.
Third, try to parallelize your input reading process. It could be the bottleneck; see the sketch below.
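As a rough illustration of that third point, a single-file input pipeline could be parallelized along these lines; parser_fn and the FLAGS values are carried over from the snippets above, and the batch size of 32 is just an assumption.

# Sketch of a parallelized input pipeline for a single TFRecord file (TF 1.x style).
d = tf.data.TFRecordDataset(FLAGS.input_file)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.repeat(FLAGS.num_epochs)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)  # parse in parallel
d = d.batch(32)  # assumed batch size
d = d.prefetch(1)  # overlap input preprocessing with training on the previous batch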

How to reuse one tensor when creating a tensorflow dataset iterator of a pair of tensors?

Imagine that I want to pair samples from one pool of data with samples from another pool to feed into the network, but many samples from the first pool should be paired with the same sample from the second pool (let's assume all samples have the same shape).
For example, if we denote the samples from the first pool as f_i, samples from the second pool as g_j, I might want to have a mini-batch of samples as below (each line is one sample in the mini-batch):
(f_0, g_0)
(f_1, g_0)
(f_2, g_0)
(f_3, g_0)
...
(f_10, g_0)
(f_11, g_1)
(f_12, g_1)
(f_13, g_1)
...
(f_19, g_1)
...
If the data from the second pool are small (like labels), then I could simply store them together with the samples from the first pool in the TFRecords. But in my case the data from the second pool are the same size as the data from the first pool (for example, both are movie segments), so saving them as pairs in one TFRecords file would almost double the disk space used.
I wonder if there is any way I can save all the samples from the second pool only once on disk, but still feed the data to my network the way I want. (Assume I have already specified the mapping between samples in the first pool and those in the second pool based on their file names.)
Thanks a lot!
You can use an iterator for each of the TFRecords (or pools of samples), so you get two iterators, each of which can advance at its own pace. When you run get_next() on an iterator, the next sample is returned, so you have to keep it in a tensor and feed it manually. Quoting from the documentation:
(Note that, like other stateful objects in TensorFlow, calling Iterator.get_next() does not immediately advance the iterator. Instead you must use the returned tf.Tensor objects in a TensorFlow expression, and pass the result of that expression to tf.Session.run() to get the next elements and advance the iterator.)
So all you need is a couple of loops that iterate and combine samples from each iterator as a pair, and then you can feed this when you run your desired operation. For example:
g_iterator = g_dataset.make_one_shot_iterator()
get_next_g = g_iterator.get_next()
f_iterator = f_dataset.make_one_shot_iterator()
get_next_f = f_iterator.get_next()

# Outer loop: advance the g iterator once per group of f samples.
while True:
    try:
        temp_g = session.run(get_next_g)
        # Inner loop: pair the next 10 f samples (as in the question's example)
        # with the current g sample.
        for _ in range(10):
            temp_f = session.run(get_next_f)
            session.run(train, feed_dict={f: temp_f, g: temp_g})
    except tf.errors.OutOfRangeError:
        break
Does this answer your question?

Incorporating very large constants in Tensorflow

For example, the comments for the Tensorflow image captioning example model state:
NOTE: This script will consume around 100GB of disk space because each image
in the MSCOCO dataset is replicated ~5 times (once per caption) in the output.
This is done for two reasons:
1. In order to better shuffle the training data.
2. It makes it easier to perform asynchronous preprocessing of each image in
TensorFlow.
The primary goal of this question is to see if there is an alternative to this type of duplication. In my use case, storing the data in this way would require each image to be duplicated in the TFRecord files many more times, on the order of 20 - 50 times.
I should note first that I have already fed the images through VGGNet to extract 4096-dimensional features, and I have these stored as a mapping between filenames and vectors.
Before switching over to TensorFlow, I had been feeding batches containing filename strings and then looking up the corresponding vectors on a per-batch basis. This allowed me to store all of the image data in ~15 GB without needing to duplicate the data on disk.
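A rough sketch of that pre-TensorFlow lookup, just to make the setup concrete; the helper function and the example filenames are my own assumptions, not the original code.

import numpy as np
import pandas as pd

# Mapping from image filename to its precomputed 4096-dim feature vector.
img_feat = pd.read_pickle("img_feats.pkl")

def lookup_batch(filenames):
    """Replace a batch of filename strings with their precomputed feature vectors."""
    return np.stack([img_feat[name] for name in filenames])  # shape: (batch, 4096)

# e.g. features = lookup_batch(["img_001.jpg", "img_002.jpg"])  # hypothetical names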
My first attempt to do this in TensorFlow involved storing indices in the TFExample buffers and then doing a "preprocessing" step to slice into the corresponding matrix:
img_feat = pd.read_pickle("img_feats.pkl")
img_matrix = np.stack(img_feat)
preloaded_images = tf.Variable(img_matrix)
first_image = tf.slice(preloaded_images, [0,0], [1,4096])
However, in this case, Tensorflow disallows a variable larger than 2GB. So my next thought was to partition this across several variables:
img_tensors = []
for i in range(NUM_SPLITS):
    with tf.Graph().as_default():
        img_tensors.append(tf.Variable(img_matrices[i], name="preloaded_images_%i" % i))
first_image = tf.concat(1, [tf.slice(t, [0, 0], [1, 4096 // NUM_SPLITS]) for t in img_tensors])
In this case, I'm forced to store each partition on a separate graph, because it seems any one graph cannot be this large either. However, now the concat fails because each tensor I am concatenating is on a separate graph.
Any advice on incorporating a large amount (~15 GB) of preloaded data into the TensorFlow graph?
Potentially related is this question; however, in this case I'd like to override the decoding of the actual JPEG file with the preprocessed value in a tensor op.
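One workaround sometimes used for the 2 GB GraphDef limit, offered here only as a sketch and an assumption rather than an answer from the original thread, is to keep the array out of the graph entirely by initializing a variable from a placeholder when the session starts; the data is fed once at initialization time and never serialized into the graph.

import numpy as np
import tensorflow as tf

img_matrix = np.stack(img_feat)  # the ~15 GB (N, 4096) feature matrix from above

# The placeholder keeps the big array out of the serialized GraphDef; the data
# is fed only once, when the variable is initialized.
images_initializer = tf.placeholder(dtype=img_matrix.dtype, shape=img_matrix.shape)
preloaded_images = tf.Variable(images_initializer, trainable=False, collections=[])

first_image = tf.slice(preloaded_images, [0, 0], [1, 4096])

with tf.Session() as sess:
    sess.run(preloaded_images.initializer,
             feed_dict={images_initializer: img_matrix})
    print(sess.run(first_image).shape)  # (1, 4096)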