TF Data API: How to produce TensorFlow input for object set recognition

Consider this problem: select a random number of samples from a random subject in an image dataset (like ImageNet) as one input element for a TensorFlow graph that functions as an object set recognizer. Within each batch, every class has the same number of samples to simplify computation, but different batches may use a different number of images per class, e.g. batch_0: num_imgs_per_cls=2; batch_1000: num_imgs_per_cls=3.
If there is existing functionality in TensorFlow for this, an explanation of the whole process from scratch (e.g. starting from directories of images) would be really appreciated.

There is a very similar answer by @mrry here.
Sampling balanced batches
In face recognition we often use triplet loss (or similar losses) to train the model. The usual way to sample triplets to compute the loss is to create a balanced batch of images where we have for instance 10 different classes (i.e. 10 different people) with 5 images each. This gives a total batch size of 50 in this example.
More generally the problem is to sample num_classes_per_batch (10 in the example) classes, and then sample num_images_per_class (5 in the example) images for each class. The total batch size is:
batch_size = num_classes_per_batch * num_images_per_class
Have one dataset for each class
The easiest way to deal with a lot of different classes (100,000 in MS-Celeb) is to create one dataset for each class.
For instance you can have one tfrecord for each class and create the datasets like this:
# Build one dataset per class.
filenames = ["class_0.tfrecords", "class_1.tfrecords"...]
per_class_datasets = [tf.data.TFRecordDataset(f).repeat(None) for f in filenames]
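The question also asks how to get there from directories of images. Below is a minimal sketch of writing one tfrecords file per class, assuming a hypothetical layout data/<class_name>/*.jpg and the TF 1.x API:
import os
import tensorflow as tf

data_dir = "data"  # hypothetical layout: data/<class_name>/*.jpg
for label, class_name in enumerate(sorted(os.listdir(data_dir))):
    class_dir = os.path.join(data_dir, class_name)
    with tf.python_io.TFRecordWriter("class_%d.tfrecords" % label) as writer:
        for fname in os.listdir(class_dir):
            # Store the raw encoded image bytes together with the integer label.
            with open(os.path.join(class_dir, fname), "rb") as f:
                image_bytes = f.read()
            example = tf.train.Example(features=tf.train.Features(feature={
                "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
                "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())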
Sample from the datasets
Now we would like to be able to sample from these datasets. For instance we want the following labels in our batch:
1 1 1 3 3 3 9 9 9 4 4 4
This corresponds to num_classes_per_batch=4 and num_images_per_class=3.
To do this we will need to use features that will be released in r1.9. The function should be called tf.contrib.data.choose_from_datasets (see here for a discussion on this).
It should look like:
def choose_from_datasets(datasets, selector):
    """Chooses elements with indices from selector among the datasets in `datasets`."""
So we create this selector, which will output 1 1 1 3 3 3 9 9 9 4 4 4, and combine it with the per-class datasets to obtain our final dataset that will output balanced batches:
def generator(_):
    # Sample `num_classes_per_batch` classes for the batch
    sampled = tf.random_shuffle(tf.range(num_classes))[:num_classes_per_batch]
    # Repeat each element `num_images_per_class` times
    batch_labels = tf.tile(tf.expand_dims(sampled, -1), [1, num_images_per_class])
    return tf.to_int64(tf.reshape(batch_labels, [-1]))

selector = tf.contrib.data.Counter().map(generator)
selector = selector.apply(tf.contrib.data.unbatch())
dataset = tf.contrib.data.choose_from_datasets(per_class_datasets, selector)

# Batch
batch_size = num_classes_per_batch * num_images_per_class
dataset = dataset.batch(batch_size)
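Note that the elements of this dataset are still the serialized records read from the tfrecords files, so in practice you would map a parsing function before the batch call. A minimal sketch, assuming a hypothetical record schema with "image" (encoded bytes) and "label" (int64) features:
def parse_example(serialized):
    # Hypothetical schema: adapt feature names and types to your tfrecords.
    features = tf.parse_single_example(serialized, {
        "image": tf.FixedLenFeature([], tf.string),
        "label": tf.FixedLenFeature([], tf.int64),
    })
    image = tf.image.decode_jpeg(features["image"], channels=3)
    return image, features["label"]

dataset = tf.contrib.data.choose_from_datasets(per_class_datasets, selector)
dataset = dataset.map(parse_example, num_parallel_calls=4)
dataset = dataset.batch(batch_size)
images, labels = dataset.make_one_shot_iterator().get_next()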
In the meantime, you can test this with the nightly TensorFlow build by using DirectedInterleaveDataset as a workaround:
# The working option right now is
from tensorflow.contrib.data.python.ops.interleave_ops import DirectedInterleaveDataset
dataset = DirectedInterleaveDataset(selector, per_class_datasets)
I also wrote about this workaround here.

Related

How to see the indices of the split on the data that GridSearchCV used when it made the split?

When using GridSearchCV() to perform a k-fold cross-validation analysis on some data, is there a way to know which data was used for each split?
For example, assume the goal is to build a binary classifier of your choosing, named 'model'. There are 100 data points (rows) with 5 features each and an associated 1 or 0 target. 20 of the 100 data points are held out for testing after training and hyperparameter tuning; GridSearchCV never sees those 20 points. The other 80 rows are passed to the estimator as X and Y, so GridSearchCV only sees 80 rows of data. Various hyperparameters are tuned and laid out in the param_grid variable. In this case the cross-validation parameter cv is set to 3, as shown:
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X, Y)
Is there a way to see which data was used as the training data and as the cross-validation data for each fold? Maybe by seeing which indices were used for each split?
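One approach (a minimal sketch, relying on scikit-learn's documented default that an integer cv with a classifier means an unshuffled StratifiedKFold): build the splitter yourself, inspect its indices, and pass the same object to GridSearchCV so the folds are guaranteed to match.
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# The same folds GridSearchCV would derive from cv=3 for a classifier.
cv = StratifiedKFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(cv.split(X, Y)):
    print("fold %d: train=%s validation=%s" % (fold, train_idx, val_idx))

# Passing the splitter explicitly guarantees the printed folds are the ones used.
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv)
grid_result = grid.fit(X, Y)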

How to load huge time series windows dataset without memory errors?

I want to convert a typical time-series dataset of about 1 million lines into 100-item windows with 50% overlap. Note that it's a multivariate one, so for example given 8 features and 1000 windows with 100 items, the final shape would be (1000, 100, 8), i.e. (n_samples, n_timesteps, n_features). The goal is to use it for training machine learning algorithms, including deep neural networks.
So far, I've enjoyed using numpy's sliding_window_view as shown below:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(100).reshape(20, 5)
v = sliding_window_view(x, (3, 5))
v
Unfortunately, I get crashes as I run out of RAM on large datasets with millions of lines. Do you have any suggestions?
Additionally, one serious restriction is that there's a consecutive label for every timestep (an integer) by which the dataset needs to be grouped (using pandas), so this limits some options for reading it in portions.
I think you are looking for tf.data.Dataset. I'm working on a million-row dataset, and the following code runs well for me:
convert = tf.data.TextLineDataset("path_to_file.txt")
# Dataset.zip is typically used to combine several datasets (e.g. features and labels)
dataset = tf.data.Dataset.zip(convert)
Now you have initialized your dataset. To avoid running into memory issues, cache, batch and prefetch it:
def dataset_batches(ds, batch_size):
    # you can chain more operations here (map, shuffle, ...)
    return (
        ds
        .cache()  # caches in memory by default; pass a filename to cache on disk instead
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )

train_batches = dataset_batches(dataset, 64)
And to run it, you'll have to loop:
for (batch, row) in enumerate(train_batches):
    # do stuff
    # batch is the current batch index (0, 1, 2, ...), so if your dataset has
    # 1600 rows and you've used batch_size=16 you'll have 100 batches
    # row is the actual data (a tensor)
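For the windowing itself, if you are on TF 2.x, tf.keras.utils.timeseries_dataset_from_array can build the 100-item windows with 50% overlap lazily, so the full windowed array never materializes in RAM. A minimal sketch with hypothetical data:
import numpy as np
import tensorflow as tf

data = np.random.rand(1_000_000, 8).astype(np.float32)  # (n_timesteps, n_features)
windows = tf.keras.utils.timeseries_dataset_from_array(
    data,
    targets=None,
    sequence_length=100,  # window size
    sequence_stride=50,   # 50% overlap
    batch_size=64,
)
for batch in windows.take(1):
    print(batch.shape)  # (64, 100, 8)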

PyTorch alternative for tf.data.experimental.sample_from_datasets

Suppose I have two datasets, dataset one with 100 items and dataset two with 5000 items.
Now I want my model to see as many items from dataset one as from dataset two during training.
In Tensorflow I can do:
dataset = tf.data.experimental.sample_from_datasets(
    [dataset_one, dataset_two], weights=[50, 1], seed=None
)
Is there an alternative in PyTorch that does the same?
I think this is not too difficult to implement by creating a custom dataset, for example:
import random
from torch.utils.data import Dataset

class SampleDataset(Dataset):
    def __init__(self, datasets, weights):
        self.datasets = datasets
        self.weights = weights

    def __len__(self):
        return sum(len(dataset) for dataset in self.datasets)

    def __getitem__(self, idx):
        # idx is ignored: every access samples a dataset according to the
        # weights, then a random item from that dataset
        dataset_idx = random.choices(range(len(self.datasets)), weights=self.weights)[0]
        sample_idx = random.randrange(len(self.datasets[dataset_idx]))
        return self.datasets[dataset_idx][sample_idx]
However, this seems quite common. Is there already something like this available?
I don't think there is a direct equivalent in PyTorch.
However, there's a class called torch.utils.data.WeightedRandomSampler which samples indices based on a list of weights. You can use it in combination with torch.utils.data.ConcatDataset and torch.utils.data.DataLoader's sampler option.
I'll give an example with two datasets: SetA with 500 elements and SetB with only 10.
First, you can create a concatenation of all your datasets with ConcatDataset:
ds = ConcatDataset([SetA(), SetB()])
Then, we need to sample from it. The problem is that you can't just give WeightedRandomSampler [50, 1] as you did in TensorFlow. As a workaround, you can create a list of per-sample weights with the same length as the total dataset.
The corresponding weight list for this example would be:
dist = np.array([1/51]*500 + [50/51]*10)
Essentially, the first 500 indices (i.e. the indices 'pointing' to SetA) each get a weight of 1/51, while the following 10 indices (i.e. the indices in SetB) each get a weight of 50/51. Every single element of SetB is thus much more likely to be sampled than any single element of SetA, but the total weight of each set is equal (500 * 1/51 = 10 * 50/51), so on average we draw equally from both sets; this is the desired result!
We can create a sampler from that distribution:
sampler = WeightedRandomSampler(dist, 10)
where 10 is the number of elements to sample. I would use the size of the smallest dataset; otherwise you would likely go over the same data points multiple times during the same epoch (WeightedRandomSampler samples with replacement by default).
Finally, we just have to instantiate the dataloader with our dataset and sampler:
dl = DataLoader(ds, sampler=sampler)
To summarize:
import numpy as np
from torch.utils.data import ConcatDataset, DataLoader, WeightedRandomSampler

ds = ConcatDataset([SetA(), SetB()])
dist = np.array([1/51]*500 + [50/51]*10)
sampler = WeightedRandomSampler(dist, 10)
dl = DataLoader(ds, sampler=sampler)
Edit, for any number of datasets:
sets = [SetA(), SetB(), SetC()]
ds = ConcatDataset(sets)
dist = np.concatenate([[(len(ds) - len(s))/len(ds)]*len(s) for s in sets])
sampler = WeightedRandomSampler(weights=dist, num_samples=min([len(s) for s in sets]))
dl = DataLoader(ds, sampler=sampler)
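A quick sanity check of the weighting (with the same hypothetical datasets): count which source dataset each sampled index falls into; the counts should come out roughly equal across datasets.
from collections import Counter

boundaries = np.cumsum([len(s) for s in sets])  # end offset of each dataset in ds
counts = Counter(int(np.searchsorted(boundaries, idx, side="right")) for idx in sampler)
print(counts)  # maps dataset index -> number of draws, roughly equal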

tensorflow profile explanation

I use the TensorFlow profiler to test the inference of my model, and here are the profile details. I find that there are four numbers, 0, 1, 2 and 3, where rows 1 and 2 are blank. So what is the meaning of 0-3, and why are rows 1 and 2 blank?
The machine has 80 cores; does this mean that the inference process only occupies 4 of them?
Thanks.
I suppose that each row corresponds to a worker thread that runs operators.
So yes, your inference process only occupies 4 cores, as you say.
TensorFlow uses multiple threads when:
There are some independent graph parts.
There is an operator that uses multiple threads.
So you can use multiple cores effectively if your graph has many independent parts.
In the following code, the graph has many independent parts, so the number of rows in the profiler matches inter_op_parallelism_threads:
config = tf.ConfigProto(inter_op_parallelism_threads=5,
                        intra_op_parallelism_threads=1)

with tf.device("/cpu:0"):
    list_r = []
    for i in range(80):
        r = tf.random_normal(shape=[100, 100])
        list_r.append(r)
    v = tf.add_n(list_r)

global_step = tf.train.create_global_step()
hook = tf.train.ProfilerHook(save_steps=1)
increment_global = global_step.assign_add(1)

with tf.train.SingularMonitoredSession(hooks=[hook], config=config) as sess:
    sess.run([v, increment_global])
If you want to know the details of ConfigProto, you can find them at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/protobuf/config.proto

How can we benefit from sharding the data to speed up training time?

My main issue is: I have 204 GB of training tfrecords for 2 million images, and 28 GB of validation tfrecords for 302,900 images. It takes 8 hours to train one epoch, so training will take 33 days. I want to speed this up by using multiple threads and shards, but I am a little confused about a couple of things.
In the tf.data.Dataset API there is a shard function, and the documentation says the following about it:
Creates a Dataset that includes only 1/num_shards of this dataset.
This dataset operator is very useful when running distributed training, as it allows each worker to read a unique subset.
When reading a single input file, you can skip elements as follows:
d = tf.data.TFRecordDataset(FLAGS.input_file)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
Important caveats:
Be sure to shard before you use any randomizing operator (such as shuffle).
Generally it is best if the shard operator is used early in the dataset pipeline. For example, when reading from a set of TFRecord files, shard before converting the dataset to input samples. This avoids reading every file on every worker. The following is an example of an efficient sharding strategy within a complete pipeline:
d = Dataset.list_files(FLAGS.pattern)
d = d.shard(FLAGS.num_workers, FLAGS.worker_index)
d = d.repeat(FLAGS.num_epochs)
d = d.shuffle(FLAGS.shuffle_buffer_size)
d = d.interleave(tf.data.TFRecordDataset,
                 cycle_length=FLAGS.num_readers, block_length=1)
d = d.map(parser_fn, num_parallel_calls=FLAGS.num_map_threads)
So my question regarding the code above: when I make shards of my data using the shard function and set the number of shards (num_workers) to 10, I will have 10 splits of my data. Should I then set num_readers in the interleave function to 10, to guarantee that each reader takes one of the 10 splits?
And how can I control which split interleave will take? If I set the shard index (worker_index) in the shard function to 1, it will give me the first split. Can anyone give me an idea how to perform distributed training using the above functions?
Then what about num_parallel_calls? Should I set it to 10 as well?
Note that I have a single tfrecords file for training and another for validation; I haven't split the tfrecords files into multiple files.
First of all, how come the dataset is 204 GB for only 2 million images? I think your images are way too large. Try resizing them; after all, you will probably need to resize to 224 x 224 in the end.
Second, try to reduce the size of your model. Your model could be either too deep or not efficient enough.
Third, try to parallelize your input reading process. It could be the bottleneck.
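Finally, since you currently have a single tfrecords file, file-level sharding (the Dataset.list_files pattern above) has nothing to distribute. A minimal sketch of re-splitting one tfrecords file into multiple shards (TF 1.x API, hypothetical filenames):
import tensorflow as tf

NUM_SHARDS = 10
writers = [tf.python_io.TFRecordWriter("train-%05d-of-%05d.tfrecords" % (i, NUM_SHARDS))
           for i in range(NUM_SHARDS)]
# Distribute records round-robin across the shard files.
for i, record in enumerate(tf.python_io.tf_record_iterator("train.tfrecords")):
    writers[i % NUM_SHARDS].write(record)
for w in writers:
    w.close()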