How to find the size of a tensorflow dataset object?

I have created a tensorflow dataset object and I would like to know the size of this dataset.

Sadly, tf.data.Dataset doesn't have a fully defined length.
One workaround is to iterate over it once and count the elements:
def get_ds_length(dataset):
    # walk the dataset once and count the elements it yields
    count = 0
    for _ in dataset:
        count += 1
    return count
Obviously this will be slow for large datasets and for pipelines with heavy preprocessing.
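A minimal usage sketch, assuming TF 2.x eager execution and a toy dataset standing in for the real one:
import tensorflow as tf

ds = tf.data.Dataset.range(100).map(lambda x: x * 2)
print(get_ds_length(ds))  # 100
For what it's worth, newer TensorFlow versions also expose tf.data.experimental.cardinality(dataset), which returns the number of elements when it can be determined statically and UNKNOWN_CARDINALITY otherwise.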

Related

PyTorch alternative for tf.data.experimental.sample_from_datasets

Suppose I have two datasets, dataset one with 100 items and dataset two with 5000 items.
Now I want my model to see as many items from dataset one as from dataset two during training.
In Tensorflow I can do:
dataset = tf.data.experimental.sample_from_datasets(
    [dataset_one, dataset_two], weights=[50, 1], seed=None
)
Is there an alternative in PyTorch that does the same?
I think this is not too difficult to implement by creating a custom dataset (rough sketch, with placeholder sampling logic):
import random

from torch.utils.data import Dataset

class SampleDataset(Dataset):
    def __init__(self, datasets, weights):
        self.datasets = datasets
        self.weights = weights

    def __len__(self):
        return sum(len(dataset) for dataset in self.datasets)

    def __getitem__(self, idx):
        # placeholder: sample a dataset according to the weights,
        # then a random item from it (idx is ignored here)
        dataset_idx = random.choices(range(len(self.datasets)), weights=self.weights, k=1)[0]
        sample_idx = random.randrange(len(self.datasets[dataset_idx]))
        return self.datasets[dataset_idx][sample_idx]
However, this seems quite common. Is there already something like this available?
I don't think there is a direct equivalent in PyTorch.
However, there is torch.utils.data.WeightedRandomSampler, which samples indices based on a list of weights (they do not need to sum to one). You can use it in combination with torch.utils.data.ConcatDataset and the sampler option of torch.utils.data.DataLoader.
I'll give an example with two datasets: SetA, which has 500 elements, and SetB, which only has 10.
First, create a concatenation of all your datasets with ConcatDataset:
ds = ConcatDataset([SetA(), SetB()])
Then, we need to sample from it. The problem is that you can't just give WeightedRandomSampler [50, 1] the way you did in Tensorflow: it expects one weight per element of the dataset. As a workaround, you can create a list of per-element weights with the same length as the total dataset.
The corresponding weight list for this example would be:
dist = np.array([1/51]*500 + [50/51]*10)
Essentially, the first 500 indices (i.e. the indices 'pointing' to SetA) each get a weight of 1/51, while the following 10 indices (i.e. the indices in SetB) each get a weight of 50/51, so elements of SetB are much more likely to be sampled since there are fewer of them, which is the desired result. In aggregate, the SetA weights and the SetB weights both sum to 500/51, so in expectation half of the draws come from each dataset (WeightedRandomSampler only needs relative weights, not a normalized distribution).
We can create a sampler from that distribution:
sampler = WeightedRandomSampler(dist, 10)
Here 10 is the number of sampled elements. I would use the size of the smallest dataset; otherwise you would likely go over the same datapoints multiple times during the same epoch...
Finally, we just have to instantiate the dataloader with our dataset and sampler:
dl = DataLoader(ds, sampler=sampler)
To summarize:
import numpy as np
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

ds = ConcatDataset([SetA(), SetB()])
dist = np.array([1/51]*500 + [50/51]*10)
sampler = WeightedRandomSampler(dist, 10)
dl = DataLoader(ds, sampler=sampler)
Edit, for any number of datasets:
sets = [SetA(), SetB(), SetC()]
ds = ConcatDataset(sets)
dist = np.concatenate([[(len(ds) - len(s))/len(ds)]*len(s) for s in sets])
sampler = WeightedRandomSampler(weights=dist, num_samples=min(len(s) for s in sets))
dl = DataLoader(ds, sampler=sampler)
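A minimal end-to-end sketch of the two-dataset case; the TensorDataset stand-ins for SetA and SetB (and their sizes) are assumptions for illustration only:
import numpy as np
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# stand-ins for SetA (500 elements) and SetB (10 elements)
set_a = TensorDataset(torch.zeros(500), torch.zeros(500))
set_b = TensorDataset(torch.ones(10), torch.ones(10))

ds = ConcatDataset([set_a, set_b])
dist = np.array([1/51] * 500 + [50/51] * 10)
sampler = WeightedRandomSampler(dist, num_samples=10)
dl = DataLoader(ds, sampler=sampler, batch_size=2)

for x, y in dl:
    print(x, y)  # on average, half of the draws come from each dataset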

Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset

def parse_data(input_data):
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(input_data.numpy()))
    av_speed = (frame.images[0].velocity.v_x,
                frame.images[0].velocity.v_y,
                frame.images[0].velocity.v_z)
    return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought of is to give the map function dataset and dataset.skip(1) as inputs, but I'm not sure about it yet.
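Something like the following is what I have in mind; note that I widened the Tout list to three floats so all components returned by parse_data come through, and that the very first frame is simply dropped instead of getting a zero acceleration (both are assumptions on my part):
import tensorflow as tf

speeds = dataset.map(
    lambda x: tf.py_function(parse_data, [x], [tf.float32, tf.float32, tf.float32]))

# pair each frame's speed with the next frame's speed
pairs = tf.data.Dataset.zip((speeds, speeds.skip(1)))

def speed_and_accel(prev, curr):
    prev = tf.stack(prev)      # (3,) speed of frame i
    curr = tf.stack(curr)      # (3,) speed of frame i+1
    return curr, curr - prev   # speed, speed variation between subsequent frames

dataset_ext = pairs.map(speed_and_accel)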
I am not sure, but it might be unnecessary to wrap your mapped function in a tf.py_function. What parse_data should look like depends on your dataset dataset_ext. If it has, for example, two file paths (one instance of input data and one instance of output data), the mapping function should take two arguments and return two values.
For example: if your dataset contains images and you want them to be randomly cropped each time an example of your dataset is drawn, the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
    merged = tf.stack([img_in, img_out])
    mergedCrop = tf.image.random_crop(merged, size=(2,) + output_shape)
    img_in_cropped, img_out_cropped = tf.unstack(mergedCrop, 2, 0)
    return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, got what you meant with the two frames. The map function is applied to each entry of your dataset separately. If you need cross-entry information, a single entry of your dataset needs to contain two frames. For this more complicated set-up, I would suggest using a tensorflow Sequence (tf.keras.utils.Sequence): the explanation from the tensorflow team is pretty straightforward. Hope this helps!
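A minimal sketch of what such a Sequence could look like for the consecutive-frame case; the class name is mine, and I assume the frame speeds have already been decoded into an indexable collection (e.g. a list of (v_x, v_y, v_z) tuples):
import numpy as np
import tensorflow as tf

class FramePairSequence(tf.keras.utils.Sequence):
    """Yields batches of (speed, speed variation) built from consecutive frames."""

    def __init__(self, frame_speeds, batch_size=32):
        self.speeds = np.asarray(frame_speeds, dtype=np.float32)
        self.batch_size = batch_size

    def __len__(self):
        # number of batches of consecutive-frame pairs
        return int(np.ceil((len(self.speeds) - 1) / self.batch_size))

    def __getitem__(self, idx):
        start = idx * self.batch_size + 1
        end = min(start + self.batch_size, len(self.speeds))
        curr = self.speeds[start:end]
        prev = self.speeds[start - 1:end - 1]
        return curr, curr - prev  # speed, speed variation vs. the previous frame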

Tensorflow: Count number of examples in a TFRecord file -- without using deprecated `tf.python_io.tf_record_iterator`

Please read post before marking Duplicate:
I was looking for an efficient way to count the number of examples in a TFRecord file of images. Since a TFRecord file does not save any metadata about the file itself, the user has to loop through the file in order to calculate this information.
There are a few different questions on StackOverflow that answer this question. The problem is that all of them seem to use the DEPRECATED tf.python_io.tf_record_iterator command, so this is not a stable solution. Here is a sample of existing posts:
Obtaining total number of records from .tfrecords file in Tensorflow
Number of examples in each tfrecord
So I was wondering if there was a way to count the number of records using the new Dataset API.
There is a reduce method listed under the Dataset class. The docs give an example of counting records using this method:
import numpy as np
import tensorflow as tf

# generate the dataset (batch size and repeat must be 1; maybe avoid dataset manipulation like map and shard)
ds = tf.data.Dataset.range(5)
# count the examples by reduce
cnt = ds.reduce(np.int64(0), lambda x, _: x + 1)
## produces 5
I don't know whether this method is faster than @krishnab's for loop.
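Applied to a TFRecord file, the same pattern could look like this (the file name is a placeholder, and .numpy() assumes TF 2.x eager execution):
import numpy as np
import tensorflow as tf

ds = tf.data.TFRecordDataset('testing.tfrecord')
num_records = ds.reduce(np.int64(0), lambda x, _: x + 1).numpy()
print(f"There are {num_records} records")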
I got the following code to work without the deprecated command. Hopefully this will help others.
Using the Dataset API, I set up an iterator and then loop over it. Not sure if this is the fastest, but it works. MAKE SURE THE BATCH SIZE AND REPEAT ARE SET TO 1; otherwise the code will return the number of batches and not the number of examples in the dataset.
count_test = tf.data.TFRecordDataset('testing.tfrecord')
count_test = count_test.map(_parse_image_function)
count_test = count_test.repeat(1)
count_test = count_test.batch(1)
test_counter = count_test.make_one_shot_iterator()
c = 0
for ex in test_counter:
    c += 1
print(f"There are {c} testing records")
This seemed to work reasonably well even on a relatively large file.
The following works for me using TensorFlow version 2.1 (using the code found in this answer):
import os
import tensorflow as tf

def count_tfrecord_examples(
        tfrecords_dir: str,
) -> int:
    """
    Counts the total number of examples in a collection of TFRecord files.

    :param tfrecords_dir: directory that is assumed to contain only TFRecord files
    :return: the total number of examples in the collection of TFRecord files
        found in the specified directory
    """
    count = 0
    for file_name in os.listdir(tfrecords_dir):
        tfrecord_path = os.path.join(tfrecords_dir, file_name)
        count += sum(1 for _ in tf.data.TFRecordDataset(tfrecord_path))
    return count
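A usage sketch, with a placeholder directory path:
num_examples = count_tfrecord_examples("/path/to/tfrecords")
print(f"found {num_examples} examples")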

What is an effective way to pad a variable length dataset for batching in Tensorflow that does not have exact

I am trying to integrate the Dataset API into my input pipeline. Before this integration, the program used tf.train.batch_join(), which had dynamic padding enabled. Hence, this would batch elements and pad them according to the largest one in the mini-batch.
image, width, label, length, text, filename = tf.train.batch_join(
    data_tuples,
    batch_size=batch_size,
    capacity=queue_capacity,
    allow_smaller_final_batch=final_batch,
    dynamic_pad=True)
For the Dataset API, however, I was unable to find an exact alternative to this. I cannot use padded_batch, since the dimensions of the images do not have a set threshold. The image width could be anything. My partner and I were able to come up with a workaround for this using tf.contrib.data.bucket_by_sequence_length(). Here is an excerpt:
dataset = dataset.apply(tf.contrib.data.bucket_by_sequence_length(
    element_length_func=_element_length_fn,
    bucket_batch_sizes=np.full(len([0]) + 1, batch_size),
    bucket_boundaries=[0]))
What this does is basically dump all the elements into the overflow bucket, since the boundary is set to 0. Then it batches from that bucket, which works because bucketing pads the elements according to the largest one in the batch.
Is there a better way to achieve this functionality?
I met exactly the same problem. Now I know how to solve it. If your input_data only has one dimension of variable length, try applying tf.contrib.data.bucket_by_sequence_length via dataset.apply(), with bucket_batch_sizes = [batch_size] * (len(buckets) + 1). And there is another way to do it, just as @mrry has said in the comments:
iterator = dataset.make_one_shot_iterator()
item = iterator.get_next()
padded_shapes = []
for i in item:
    padded_shapes.append(i.get_shape())
padded_shapes = tf.contrib.framework.nest.pack_sequence_as(item, padded_shapes)
dataset = dataset.padded_batch(batch_size, padded_shapes)
If one dimension in the shape of a tensor is None or -1, then padded_batch will pad the tensor on that dimension to the max length within the batch.
My training data has two features of variable length, and this method works fine.
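To make the padded_batch behaviour concrete, here is a small self-contained sketch (TF 2.x eager execution assumed, with a toy dataset of variable-length vectors in place of real data):
import tensorflow as tf

# elements of length 1, 2, 3, 4
ds = tf.data.Dataset.range(1, 5).map(lambda n: tf.ones([n], dtype=tf.int64))

# each batch is padded to the longest element it contains
ds = ds.padded_batch(2, padded_shapes=[None])

for batch in ds:
    print(batch.shape)  # (2, 2) then (2, 4)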

tensorflow tfrecord storage for large datasets

I'm trying to understand the "proper" method of storage for large datasets for tensorflow ingestion. The documentation seems relatively clear that no matter what, tfrecord files are preferred. Large is a subjective measure, but the examples below are randomly generated regression datasets from sklearn.datasets.make_regression() of 10,000 rows and between 1 and 5,000 features, all float64.
I've experimented with two different methods of writing tfrecord files with dramatically different performance.
For numpy arrays X, y (X.shape=(10000, n_features), y.shape=(10000,)):
tf.train.Example with per-feature tf.train.Features
I construct a tf.train.Example in the way that tensorflow developers seem to prefer, at least judging by tensorflow example code at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/convert_to_records.py.
For each observation or row in X, I create a dictionary keyed with feature names (f_0, f_1, ...) whose values are tf.train.Feature objects with the feature's observation data as a single element of its float_list.
def _feature_dict_from_row(row):
    """
    Take row of length n+1 from 2-D ndarray and convert it to a dictionary of
    float features:
        {
            'f_0': row[0],
            'f_1': row[1],
            ...
            'f_n': row[n]
        }
    """
    def _float64_feature(feature):
        return tf.train.Feature(float_list=tf.train.FloatList(value=[feature]))
    features = {"f_{:d}".format(i): _float64_feature(value) for i, value in enumerate(row)}
    return features
def write_regression_data_to_tfrecord(X, y, filename):
    with tf.python_io.TFRecordWriter('{:s}'.format(filename)) as tfwriter:
        for row_index in range(X.shape[0]):
            features = _feature_dict_from_row(X[row_index])
            # the label must also be wrapped in a tf.train.Feature
            features['label'] = tf.train.Feature(float_list=tf.train.FloatList(value=[y[row_index]]))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            tfwriter.write(example.SerializeToString())
tf.train.Example with one large tf.train.Feature containing all features
I construct a dictionary with one feature (really two, counting the label) whose value is a tf.train.Feature with the entire feature row as its float_list:
def write_regression_data_to_tfrecord(X, y, filename, store_by_rows=True):
    with tf.python_io.TFRecordWriter('{:s}'.format(filename)) as tfwriter:
        for row_index in range(X.shape[0]):
            features = {'f_0': tf.train.Feature(float_list=tf.train.FloatList(value=X[row_index]))}
            # the label must also be wrapped in a tf.train.Feature
            features['label'] = tf.train.Feature(float_list=tf.train.FloatList(value=[y[row_index]]))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            tfwriter.write(example.SerializeToString())
As the number of features in the dataset grows, the second option gets considerably faster than the first, as shown in the following graph (note the log scale).
[graph: write times for 10,000 rows, 1 to 5,000 features]
It makes intuitive sense to me that creating 5,000 tf.train.Feature objects is significantly slower than creating one object with a float_list of 5,000 elements, but it's not clear that this is the "intended" method for feeding large numbers of features into a tensorflow model.
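For reference, a sketch of how such single-FloatList records could be read back; the parse spec and file name are illustrative assumptions (n_features has to be known up front):
def parse_regression_example(serialized, n_features):
    # assumes the single-FloatList layout from the second writer above
    feature_spec = {
        'f_0': tf.FixedLenFeature([n_features], tf.float32),
        'label': tf.FixedLenFeature([1], tf.float32),
    }
    parsed = tf.parse_single_example(serialized, feature_spec)
    return parsed['f_0'], parsed['label']

dataset = tf.data.TFRecordDataset('regression.tfrecord')  # placeholder path
dataset = dataset.map(lambda s: parse_regression_example(s, n_features=5000))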
Is there something inherently wrong with doing this the faster way?