How to create a fixed length tf.Dataset from generator? - tensorflow

I have a generator which yields an infinite amount of data (random image crops). I would like to create a tf.Dataset from, say, the first 10,000 data points and cache it so I can reuse them to train models.
Currently, I have a generator which takes 1-2 seconds to create each data point, and this is the main performance blocker: I have to wait about a minute to generate a batch of 64 images (the preprocessing() function is very expensive, so I would like to reuse its results).
The tf.data.Dataset.from_generator() method lets us create such an infinite dataset. Instead, I would like to create a finite dataset from the first N outputs of the generator and cache it like:
ds = ds.cache()
An alternative solution would be to keep generating new data and use the cached data points while the generator keeps running.

You can use the Dataset.cache function with the Dataset.take function to accomplish this.
If everything fits in memory, it's as simple as doing something like this:
def generate_example():
    i = 0
    while True:
        print('yielding value {}'.format(i))
        yield tf.random.uniform((64, 64, 3))
        i += 1

ds = tf.data.Dataset.from_generator(generate_example, tf.float32)
first_n_datapoints = ds.take(n).cache()
Now note that if I set n to 3, say, and then do something trivial like:
for i in first_n_datapoints.repeat():
    print('')
    print(i.shape)
then I see output confirming that the first 3 values are cached (I only see the yielding value {i} line once for each of the first 3 values generated):
yielding value 0
(64,64,3)
yielding value 1
(64,64,3)
yielding value 2
(64,64,3)
(64,64,3)
(64,64,3)
(64,64,3)
...
If everything does not fit in memory, we can pass a file path to the cache function, and it will cache the generated tensors to disk.
More info here: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#cache
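For illustration, a minimal sketch of the disk-backed variant, assuming the same generate_example generator as above (the cache path is just an example):
n = 10000  # number of generator outputs to keep

ds = tf.data.Dataset.from_generator(generate_example, tf.float32)
# Passing a file name makes cache() write to disk instead of RAM;
# '/tmp/crop_cache' is only an example location.
first_n_datapoints = ds.take(n).cache('/tmp/crop_cache')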

Related

Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset

def parse_data(input_data):
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(input_data.numpy()))
    av_speed = (frame.images[0].velocity.v_x,
                frame.images[0].velocity.v_y,
                frame.images[0].velocity.v_z)
    return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought about is to give the map function dataset and dataset.skip(1) as inputs, but I'm not sure about it yet; a rough sketch is below.
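A rough sketch of that idea (untested; it assumes parse_data is adjusted to return its three speed components as a single array, so tf.py_function can wrap it as one float32 output):
paired = tf.data.Dataset.zip((dataset, dataset.skip(1)))

def parse_pair(prev_raw, curr_raw):
    prev_speed = tf.py_function(parse_data, [prev_raw], tf.float32)
    curr_speed = tf.py_function(parse_data, [curr_raw], tf.float32)
    # acceleration as the speed variation between subsequent frames
    return curr_speed, curr_speed - prev_speed

dataset_ext = paired.map(parse_pair)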
I am not sure, but it might be unnecessary to make your mapped function a tf.py_function. What parse_data should look like depends on your dataset dataset_ext. If it contains, for example, two file paths (one instance of input data and one instance of output data), the mapping function should take 2 arguments and return 2 arguments.
For example: if your dataset contains images and you want them to be randomly cropped each time an example of your dataset is drawn the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
    merged = tf.stack([img_in, img_out])
    mergedCrop = tf.image.random_crop(merged, size=(2,) + output_shape)
    img_in_cropped, img_out_cropped = tf.unstack(mergedCrop, 2, 0)
    return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, got what you meant with the two frames. The map function is applied to each entry of your dataset separately, so if you need cross-entry information, a single entry of your dataset needs to contain two frames. With this more complicated set-up, I would suggest you use a tensorflow Sequence: the explanation from the tensorflow team is pretty straightforward. A minimal sketch is below. Hope this helps!
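A minimal sketch of such a Sequence, assuming the frames have already been parsed into arrays (all names here are made up for illustration):
import numpy as np
from tensorflow.keras.utils import Sequence

class FramePairSequence(Sequence):
    # each item is one batch of (previous frame, current frame) pairs
    def __init__(self, frames, batch_size=32):
        self.frames = frames
        self.batch_size = batch_size

    def __len__(self):
        n_pairs = len(self.frames) - 1
        return (n_pairs + self.batch_size - 1) // self.batch_size

    def __getitem__(self, idx):
        start = idx * self.batch_size
        end = min(start + self.batch_size, len(self.frames) - 1)
        prev = np.array(self.frames[start:end])
        curr = np.array(self.frames[start + 1:end + 1])
        return prev, curr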

How to find the size of tensorflow dataset object?

I have created a tensorflow dataset object and I would like to know the size of this dataset.
Sadly tf.data.Dataset doesn't have a fully defined length.
One workaround approach would be to iterate over it once to get the number of elements:
def get_ds_length(dataset):
    length = 0  # avoid shadowing the built-in len
    for _ in dataset:
        length += 1
    return length
Obviously this would be slow for large datasets and ones that use heavy preprocessing.
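As an aside, newer TensorFlow versions can sometimes report the size without iterating; a minimal sketch (cardinality returns a sentinel rather than a length when the size cannot be determined statically, e.g. for generator-backed datasets):
n = tf.data.experimental.cardinality(dataset)
if n == tf.data.experimental.UNKNOWN_CARDINALITY:
    n = get_ds_length(dataset)  # fall back to counting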

Tensorflow: Count number of examples in a TFRecord file -- without using deprecated `tf.python_io.tf_record_iterator`

Please read the post before marking it as a duplicate:
I was looking for an efficient way to count the number of examples in a TFRecord file of images. Since a TFRecord file does not save any metadata about the file itself, the user has to loop through the file in order to calculate this information.
There are a few different questions on StackOverflow that answer this question. The problem is that all of them seem to use the DEPRECATED tf.python_io.tf_record_iterator command, so this is not a stable solution. Here is a sample of existing posts:
Obtaining total number of records from .tfrecords file in Tensorflow
Number of examples in each tfrecord
So I was wondering if there was a way to count the number of records using the new Dataset API.
There is a reduce method listed under the Dataset class. They give an example of counting records using the method:
import numpy as np

# generate the dataset (batch size and repeat must be 1; maybe avoid dataset
# manipulations like map and shard)
ds = tf.data.Dataset.range(5)
# count the examples by reduce
cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)
## produces 5
I don't know whether this method is faster than @krishnab's for loop.
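For what it's worth, the same reduce trick applied to an actual TFRecord file would look like this in TF 2.x eager mode (the file name is just an example):
ds = tf.data.TFRecordDataset('testing.tfrecord')
num_records = ds.reduce(np.int64(0), lambda x, _: x + 1).numpy()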
I got the following code to work without the deprecated command. Hopefully this will help others.
Using the Dataset API, I set up an iterator and then loop over it. I'm not sure if this is the fastest way, but it works. MAKE SURE THE BATCH SIZE AND REPEAT ARE SET TO 1, otherwise the code will return the number of batches and not the number of examples in the dataset.
count_test = tf.data.TFRecordDataset('testing.tfrecord')
count_test = count_test.map(_parse_image_function)
count_test = count_test.repeat(1)
count_test = count_test.batch(1)
test_counter = count_test.make_one_shot_iterator()

c = 0
for ex in test_counter:
    c += 1
print(f"There are {c} testing records")
This seemed to work reasonably well even on a relatively large file.
The following works for me using TensorFlow version 2.1 (using the code found in this answer):
import os
import tensorflow as tf

def count_tfrecord_examples(
        tfrecords_dir: str,
) -> int:
    """
    Counts the total number of examples in a collection of TFRecord files.

    :param tfrecords_dir: directory that is assumed to contain only TFRecord files
    :return: the total number of examples in the collection of TFRecord files
        found in the specified directory
    """
    count = 0
    for file_name in os.listdir(tfrecords_dir):
        tfrecord_path = os.path.join(tfrecords_dir, file_name)
        count += sum(1 for _ in tf.data.TFRecordDataset(tfrecord_path))
    return count
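Example usage (the directory path is hypothetical):
num_examples = count_tfrecord_examples('/data/tfrecords/test')
print(f"There are {num_examples} testing records")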

How to reuse one tensor when creating a tensorflow dataset iterator of a pair of tensors?

Imagine the case where I want to pair samples from one pool of data with samples from another pool to feed into the network, but many samples from the first pool should be paired with the same sample from the second pool (let's assume all samples are of the same shape).
For example, if we denote the samples from the first pool as f_i, samples from the second pool as g_j, I might want to have a mini-batch of samples as below (each line is one sample in the mini-batch):
(f_0, g_0)
(f_1, g_0)
(f_2, g_0)
(f_3, g_0)
...
(f_10, g_0)
(f_11, g_1)
(f_12, g_1)
(f_13, g_1)
...
(f_19, g_1)
...
If the data from the second pool were small (like labels), I could simply store them together with the samples from the first pool in tfrecords. But in my case the data from the second pool are the same size as the data from the first pool (for example, both are movie segments), so saving them in pairs in one tfrecords file would almost double the disk space used.
I wonder if there is any way to save all the samples from the second pool only once on disk, but still feed the data to my network the way I want? (Assume I have already specified the mapping between samples in the first pool and those in the second pool based on their file names.)
Thanks a lot!
You can use an iterator for each one of the tfrecords (or pools of samples), so you get two iterators, each of which can iterate at its own pace. When you run get_next() on an iterator, the next sample is returned, so you have to keep it in a tensor and manually feed it. Quoting from the documentation:
(Note that, like other stateful objects in TensorFlow, calling Iterator.get_next() does not immediately advance the iterator. Instead you must use the returned tf.Tensor objects in a TensorFlow expression, and pass the result of that expression to tf.Session.run() to get the next elements and advance the iterator.)
So all you need is a couple of loops that iterate and combine samples from each iterator as a pair, and then you can feed this when you run your desired operation. For example:
g_iterator = g_dataset.make_one_shot_iterator()
get_next_g = g_iterator.get_next()
f_iterator = f_dataset.make_one_shot_iterator()
get_next_f = f_iterator.get_next()

# outer loop over g samples:
temp_g = session.run(get_next_g)
# inner loop over the f samples paired with the current g sample:
temp_f = session.run(get_next_f)
session.run(train, feed_dict={f: temp_f, g: temp_g})
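As an aside, if each g sample pairs with a fixed number of f samples, a pure tf.data sketch of the same pairing could look like this (K = 10 is just an assumption):
K = 10  # assumed number of f samples per g sample
g_repeated = g_dataset.flat_map(
    lambda g: tf.data.Dataset.from_tensors(g).repeat(K))
paired_ds = tf.data.Dataset.zip((f_dataset, g_repeated))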
Does this answer your question?

Setting up the input on an RNN in Keras

So I have a specific question about setting up the input in Keras.
I understand that the sequence length refers to the window length of the longest sequence that you are looking to model, with the rest being padded by 0's.
However, how do I set up something that is already in a time series array?
For example, right now I have an array that is 550k x 28: 550k rows, each with 28 columns (27 features and 1 target). Do I have to manually split the array into (550k - sequence length) different arrays and feed all of those to the network?
Assuming that I want the first layer to be equivalent to the number of features per row, and looking at the past 50 rows, how do I size the input layer?
Is that simply input_size = (50, 27)? And again, do I have to manually split the dataset up, or would Keras automatically do that for me?
RNN inputs are like: (NumberOfSequences, TimeSteps, ElementsPerStep)
Each sequence is a row in your input array; the number of sequences is also called "batch size", number of examples, samples, etc.
Time steps are the number of steps in each sequence.
Elements per step is how much info you have in each step of a sequence.
I'm assuming the 27 features are inputs and correspond to ElementsPerStep, while the 1 target is the expected output, with 1 output per step.
So I'm also assuming that your output is a sequence, also with 550k steps.
Shaping the array:
Since you have only one sequence in the array, and this sequence has 550k steps, then you must reshape your array like this:
(1, 550000, 28)
# 1 sequence
# 550000 steps per sequence
# 28 data elements per step
PS: this sequence is very long; if it creates memory problems for you, it may be a good idea to use a stateful=True RNN, but I'm explaining the non-stateful method first.
Now you must split this array for inputs and targets:
X_train = thisArray[:, :, :27] #inputs
Y_train = thisArray[:, :, 27] #targets
Shaping the keras layers:
Keras layers ignore the batch size (number of sequences) when you define them, so you will use input_shape=(550000, 27).
Since your desired result is a sequence of the same length, we will use return_sequences=True. (Otherwise, you'd get only one result.)
LSTM(numberOfCells, input_shape=(550000,27), return_sequences=True)
This will output a shape of (BatchSize, 550000, numberOfCells).
You may use a single layer with 1 cell to achieve your output, or you could stack more layers, considering that the last one should have 1 cell to match the shape of your output (if you're using only recurrent layers, of course); see the sketch below.
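A sketch of that stacked set-up (the cell count 32 is only an example):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

model = Sequential([
    LSTM(32, input_shape=(550000, 27), return_sequences=True),
    LSTM(1, return_sequences=True),  # 1 cell to match the single target per step
])
model.compile(optimizer='adam', loss='mse')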
stateful = True:
When you have sequences so long that your memory can't handle them well, you must define the layer with stateful=True.
In that case, you will have to divide X_train into smaller sequences*. The system will understand that every new batch is a sequel to the previous batches.
Then you will need to define batch_input_shape=(BatchSize, ReducedTimeSteps, Elements). In this case, the batch size should not be ignored as in the other case; a minimal layer-definition sketch follows the footnote below.
* Unfortunately I have no experience with stateful=True. I'm not sure about whether you must manually divide your array (less likely, I guess), or if the system automatically divides it internally (more likely).
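For what it's worth, a minimal sketch of just the layer definitions for that case (the chunk length 1000 and the cell count are made-up examples; the training loop and manual state resets are omitted):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

model = Sequential([
    LSTM(32, batch_input_shape=(1, 1000, 27), stateful=True,
         return_sequences=True),
    LSTM(1, stateful=True, return_sequences=True),
])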
The sliding window case:
In this case, what I often see is people dividing the input data like this:
From the 550k steps, get smaller arrays with 50 steps each:
X = []
for i in range(550000 - 49):
    X.append(originalX[i:i+50])  # then take care of the 28th element
Y = originalX[49:, 27]  # it seems you just exclude the first 49 ones from the original
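For reference, with 550k rows and windows of 50, this produces 550000 - 49 = 549951 overlapping windows, so X ends up with shape (549951, 50, 28) before the feature/target split, and the RNN's input_shape becomes (50, 27).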