PyTorch alternative for tf.data.experimental.sample_from_datasets - tensorflow

Suppose I have two datasets, dataset one with 100 items and dataset two with 5000 items.
Now I want my model to see as many items from dataset one as from dataset two during training.
In Tensorflow I can do:
dataset = tf.data.experimental.sample_from_datasets(
    [dataset_one, dataset_two], weights=[50, 1], seed=None
)
Is there an alternative in PyTorch that does the same?
I think this is not too difficult to implement by creating a custom dataset (not a working example):
from torch.utils.data import Dataset

class SampleDataset(Dataset):
    def __init__(self, datasets, weights):
        self.datasets = datasets
        self.weights = weights

    def __len__(self):
        return sum(len(dataset) for dataset in self.datasets)

    def __getitem__(self, idx):
        # sample a random number and based on that sample an item
        return self.datasets[dataset_idx][sample_idx]
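For reference, one runnable way to complete that sketch (a minimal version, assuming sampling with replacement is acceptable; dataset_idx and sample_idx are the two indices from the comment above):
import random
from torch.utils.data import Dataset

class SampleDataset(Dataset):
    def __init__(self, datasets, weights):
        self.datasets = datasets
        self.weights = weights

    def __len__(self):
        return sum(len(dataset) for dataset in self.datasets)

    def __getitem__(self, idx):
        # Pick a dataset according to the weights, then a random item from it.
        dataset_idx = random.choices(range(len(self.datasets)), weights=self.weights)[0]
        sample_idx = random.randrange(len(self.datasets[dataset_idx]))
        return self.datasets[dataset_idx][sample_idx]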
However, this seems quite common. Is there already something like this available?

I don't think there is a direct equivalent in PyTorch.
However, there's a function called torch.utils.data.WeightedRandomSampler which samples indices based on a list of probabilities. You can use this in combination with torch.utils.data.ConcatDataset and the sampler option of torch.utils.data.DataLoader.
I'll give an example with two datasets: SetA, which has 500 elements, and SetB, which has only 10.
First, you can create a concatenation of all your datasets with ConcatDataset:
ds = ConcatDataset([SetA(), SetB()])
Then, we need to sample from it. The problem is that you can't just give WeightedRandomSampler [50, 1], as you did in Tensorflow. As a workaround, you can create a list of per-element probabilities with the same length as the total dataset.
The corresponding probability list for this example would be:
dist = np.array([1/51]*500 + [50/51]*10)
Essentially, the first 500 indices (i.e. indices 'pointing' into SetA) will each have a probability of 1/51 of being chosen, while the following 10 indices (i.e. indices into SetB) will each have a probability of 50/51 (i.e. much more likely to be sampled, since SetB has fewer elements). Both sets then get the same total probability mass (500 × 1/51 = 10 × 50/51), which is the desired result!
We can create a sampler from that distribution:
sampler = WeightedRandomSampler(dist, 10)
Where 10 is the number of sampled elements. I would use the size of the smallest dataset; otherwise you would likely go over the same data points multiple times during the same epoch...
Finally, we just have to instantiate the DataLoader with our dataset and sampler:
dl = DataLoader(ds, sampler=sampler)
To summarize:
ds = ConcatDataset([SetA(), SetB()])
dist = np.array([1/51]*500 + [50/51]*10)
sampler = WeightedRandomSampler(dist, 10)
dl = DataLoader(ds, sampler=sampler)
Edit, for any number of datasets: since WeightedRandomSampler does not require the weights to sum to one, you can simply give every element the weight 1/len(its dataset), which assigns each dataset the same total weight (this reproduces the 1/51 and 50/51 ratio above for two datasets, and unlike the formula (len(ds) - len(s))/len(ds) it stays balanced for three or more sets):
sets = [SetA(), SetB(), SetC()]
ds = ConcatDataset(sets)
dist = np.concatenate([[1 / len(s)] * len(s) for s in sets])
sampler = WeightedRandomSampler(weights=dist, num_samples=min(len(s) for s in sets))
dl = DataLoader(ds, sampler=sampler)
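For a fully self-contained version, here is a hedged sketch with toy TensorDatasets standing in for SetA/SetB/SetC (the sizes and the 4-dimensional features are made up):
import numpy as np
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

# Toy stand-ins for SetA/SetB/SetC; only the sizes matter here.
sets = [TensorDataset(torch.randn(500, 4)),
        TensorDataset(torch.randn(100, 4)),
        TensorDataset(torch.randn(10, 4))]
ds = ConcatDataset(sets)

# One weight per element: 1/len(dataset) gives every dataset equal total mass.
dist = np.concatenate([[1 / len(s)] * len(s) for s in sets])
sampler = WeightedRandomSampler(weights=dist, num_samples=min(len(s) for s in sets))

dl = DataLoader(ds, sampler=sampler, batch_size=4)
for (batch,) in dl:
    print(batch.shape)  # draws are spread roughly evenly across the three sets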

Related

How to load huge time series windows dataset without memory errors?

I want to convert a typical time series dataset of about 1 million lines into 100-item windows with 50% overlap. Note that it's a multivariate one, so for example given 8 features and 1000 windows with 100 items, the final shape would be (1000, 100, 8), i.e. (n_samples, n_timesteps, n_features). The goal is to use it for training machine learning algorithms, including deep neural networks.
So far, I've enjoyed using numpy's sliding_window_view, as shown below:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(100).reshape(20, 5)
v = sliding_window_view(x, (3, 5))
v
Unfortunately, I get crashes as I run out of RAM on large datasets with millions of lines. Do you have any suggestions?
Additionally, one serious restriction is that there's a consecutive integer label for every timestep, by which the dataset needs to be grouped (using pandas), so this limits some options for reading it in portions.
I think you are looking for tf.data.Dataset. I'm working on a dataset with a million rows, and the following code runs well for me:
convert = tf.data.TextLineDataset("path_to_file.txt")
dataset = tf.data.Dataset.zip(convert)
Now you have initialized your dataset. To avoid running into memory issues, batch and prefetch it:
def dataset_batches(ds, batch_size):
    return (
        ds
        .cache()
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)  # you can do more operations in this chain
    )

train_batches = dataset_batches(dataset, 64)
And to run it, you'll have to loop:
for (batch, row) in enumerate(train_batches):
    # do stuff
    # batch = current batch index (0, 1, 2, ...): if your dataset has 1600 rows
    #   and you used batch_size=16, you'll have 100 batches
    # row is the actual data (a tensor)
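Since the original goal was 100-item windows with 50% overlap, note that tf.data can also build those windows lazily, so they never all have to exist in RAM at once. A sketch, assuming the text lines have already been parsed into one float vector per timestep (a dataset called ds here):
import tensorflow as tf

# 100-step windows with 50% overlap (shift=50); drop_remainder keeps every
# window exactly 100 steps, so each element becomes a (100, n_features) tensor.
windows = (
    ds
    .window(size=100, shift=50, drop_remainder=True)
    .flat_map(lambda w: w.batch(100))
)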

Applying Tensorflow Dataset .map() to subsequent dataset elements

I've got a TFRecordDataset and I'm trying to preprocess the features of two subsequent elements by means of the map() API.
dataset_ext = dataset.map(lambda x: tf.py_function(parse_data, [x], [tf.float32]))
As map applies the function parse_data to every dataset element, I don't know what parse_data should look like in order to keep track of the feature extracted from the previous dataset element.
Can anyone help? Thank you
EDIT: I'm working on the Waymo dataset, so each element is a frame. You can refer to https://github.com/Jossome/Waymo-open-dataset-document for its structure.
This is my parse function parse_data:
from waymo_open_dataset import dataset_pb2 as open_dataset

def parse_data(input_data):
    frame = open_dataset.Frame()
    frame.ParseFromString(bytearray(input_data.numpy()))
    av_speed = (frame.images[0].velocity.v_x,
                frame.images[0].velocity.v_y,
                frame.images[0].velocity.v_z)
    return av_speed
I'd like to build a dataset whose features are the car speed and acceleration, defined as the speed variation between subsequent frames (the first value can be 0).
One way I thought about is to give the map function dataset and dataset.skip(1) as inputs, but I'm not sure about it yet.
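For illustration, a minimal sketch of that zip/skip idea (hedged: it assumes speeds is a dataset yielding one speed tensor per frame, e.g. obtained by mapping a variant of parse_data that returns a single array):
import tensorflow as tf

# Pair each frame's speed with the next frame's, then derive acceleration
# as the difference; the very first frame has no predecessor and is dropped.
pairs = tf.data.Dataset.zip((speeds, speeds.skip(1)))
features = pairs.map(lambda prev, curr: (curr, curr - prev))  # (speed, acceleration)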
I am not sure, but it might be unnecessary to make your mapped function a tf.py_function. What parse_data should look like depends on your dataset dataset_ext. If it has, for example, two file paths (1 instance of input data and 1 instance of output data), the mapping function should take 2 arguments and return 2 arguments.
For example: if your dataset contains images and you want them to be randomly cropped each time an example of your dataset is drawn the mapping function looks like this:
def process_img_random_crop(img_in, img_out, output_shape):
    # Stack input and output so the same random crop is applied to both.
    merged = tf.stack([img_in, img_out])
    mergedCrop = tf.image.random_crop(merged, size=(2,) + output_shape)
    img_in_cropped, img_out_cropped = tf.unstack(mergedCrop, 2, 0)
    return img_in_cropped, img_out_cropped
I call it as follows:
image_ds_test = image_ds_test.map(lambda i, o: process_img_random_crop(i, o, output_shape=(64, 64, 1)), num_parallel_calls=tf.data.experimental.AUTOTUNE)
What exactly is your plan with dataset_ext and what does it contain?
Edit:
Okay, got what you meant with the two frames. The map function is applied to each entry of your dataset separately. If you need cross-entry information, a single entry of your dataset needs to contain two frames. For this more complicated set-up, I would suggest using a tensorflow Sequence; the explanation from the tensorflow team is pretty straightforward. Hope this helps!
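For illustration, here is a minimal, hypothetical Sequence along those lines; it assumes the per-frame speeds have already been extracted into a (n_frames, 3) numpy array:
import numpy as np
import tensorflow as tf

class FramePairSequence(tf.keras.utils.Sequence):
    """Yields (speed, acceleration) batches built from consecutive frames."""

    def __init__(self, speeds, batch_size=32):
        self.speeds = speeds          # (n_frames, 3) array of per-frame speeds
        self.batch_size = batch_size

    def __len__(self):
        # One pair per frame except the first, which has no predecessor.
        return int(np.ceil((len(self.speeds) - 1) / self.batch_size))

    def __getitem__(self, idx):
        lo = 1 + idx * self.batch_size
        hi = min(lo + self.batch_size, len(self.speeds))
        speed = self.speeds[lo:hi]
        accel = self.speeds[lo:hi] - self.speeds[lo - 1:hi - 1]
        return speed, accel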

Importance of seed and num_runs in the KMeans clustering

New to ML, so I'm trying to make sense of the following code. Specifically:
In for run in np.arange(1, num_runs+1), what is the need for this loop? Why didn't the author use the setMaxIter method of KMeans?
What is the importance of seeding in clustering?
Why did the author choose to set the seed explicitly rather than using the default one?
import time
import numpy as np
import pandas as pd
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

def optimal_k(df_in, index_col, k_min, k_max, num_runs):
    '''
    Determine the optimal number of clusters by using Silhouette Score Analysis.
    :param df_in: the input dataframe
    :param index_col: the name of the index column
    :param k_min: the minimum number of clusters
    :param k_max: the maximum number of clusters
    :param num_runs: the number of runs for each fixed number of clusters
    :return k: optimal number of the clusters
    :return silh_lst: Silhouette score
    :return r_table: the running results table
    :author: Wenqiang Feng
    :email: von198#gmail.com.com
    '''
    start = time.time()
    silh_lst = []
    k_lst = np.arange(k_min, k_max + 1)
    r_table = df_in.select(index_col).toPandas()
    r_table = r_table.set_index(index_col)
    centers = pd.DataFrame()
    for k in k_lst:
        silh_val = []
        for run in np.arange(1, num_runs + 1):
            # Train a k-means model with a fresh random seed for this run.
            kmeans = KMeans()\
                .setK(k)\
                .setSeed(int(np.random.randint(100, size=1)))
            model = kmeans.fit(df_in)
            # Make predictions
            predictions = model.transform(df_in)
            r_table['cluster_{k}_{run}'.format(k=k, run=run)] = predictions.select('prediction').toPandas()
            # Evaluate the clustering by computing the Silhouette score
            evaluator = ClusteringEvaluator()
            silhouette = evaluator.evaluate(predictions)
            silh_val.append(silhouette)
        silh_array = np.asanyarray(silh_val)
        silh_lst.append(silh_array.mean())
    elapsed = time.time() - start
    silhouette = pd.DataFrame(list(zip(k_lst, silh_lst)), columns=['k', 'silhouette'])
    print('+------------------------------------------------------------+')
    print("| The finding optimal k phase took %8.0f s. |" % (elapsed))
    print('+------------------------------------------------------------+')
    return k_lst[np.argmax(silh_lst, axis=0)], silhouette, r_table
I'll try to answer your questions based on my reading of the material.
The reason for this loop is that the author sets a new seed for every iteration using int(np.random.randint(100, size=1)). If the feature variables exhibit patterns that automatically group them into visible clusters, then the starting seed should not have an impact on the final cluster memberships. However, if the data is evenly distributed, then we might end up with different cluster members based on the initial random variable. I believe the author is changing these seeds for each run to test different initial distributions. Using setMaxIter would only set the maximum number of iterations for the same seed (initial distribution).
Similar to the above: the seed defines the initial placement of the k points around which you're going to cluster. Depending on your underlying data distribution, the clusters can converge to different final distributions.
The author keeps control over the seed, as discussed in points 1 and 2. You can check for which seeds your code converges to the desired clusters and for which it does not. Also, if you iterate over, say, 100 different seeds and your code still converges to the same final clusters, the seed likely doesn't matter and you can fall back to the default one. Another use is from a software engineering perspective: setting an explicit seed is super important if you want to, for example, write tests for your code and don't want them to fail randomly.
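As a concrete illustration of points 1 and 2, here is a hedged sketch (reusing df_in from the question): two fits that differ only in their seed can land in different local optima, which is exactly the variation the inner loop averages over:
from pyspark.ml.clustering import KMeans

# Identical settings apart from the seed; on weakly separated data the
# resulting cluster assignments (and silhouette scores) can differ.
model_a = KMeans().setK(3).setSeed(1).fit(df_in)
model_b = KMeans().setK(3).setSeed(42).fit(df_in)
print(model_a.transform(df_in).select('prediction').take(5))
print(model_b.transform(df_in).select('prediction').take(5))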

TF DATA API: How to produce tensorflow input to object set recognition

Consider this problem: select a random number of samples from a random subject in an image dataset (like ImageNet) as an input element for a Tensorflow graph that functions as an object-set recognizer. For each batch, each class has the same number of samples to facilitate computation, but a different batch would have a different number of images per class, e.g. batch_0: num_imgs_per_cls=2; batch_1000: num_imgs_per_cls=3.
If there is existing functionality in Tensorflow, explanation for the whole process from scratch (like from directories of images) will be really appreciated.
There is a very similar answer by @mrry here.
Sampling balanced batches
In face recognition we often use triplet loss (or similar losses) to train the model. The usual way to sample triplets to compute the loss is to create a balanced batch of images where we have for instance 10 different classes (i.e. 10 different people) with 5 images each. This gives a total batch size of 50 in this example.
More generally the problem is to sample num_classes_per_batch (10 in the example) classes, and then sample num_images_per_class (5 in the example) images for each class. The total batch size is:
batch_size = num_classes_per_batch * num_images_per_class
Have one dataset for each class
The easiest way to deal with a lot of different classes (100,000 in MS-Celeb) is to create one dataset for each class.
For instance you can have one tfrecord for each class and create the datasets like this:
# Build one dataset per class.
filenames = ["class_0.tfrecords", "class_1.tfrecords"...]
per_class_datasets = [tf.data.TFRecordDataset(f).repeat(None) for f in filenames]
Sample from the datasets
Now we would like to be able to sample from these datasets. For instance we want the following labels in our batch:
1 1 1 3 3 3 9 9 9 4 4 4
This corresponds to num_classes_per_batch=4 and num_images_per_class=3.
To do this we will need to use features that will be released in r1.9. The function should be called tf.contrib.data.choose_from_datasets (see here for a discussion on this).
It should look like:
def choose_from_datasets(datasets, selector):
    """Chooses elements with indices from `selector` among the datasets in `datasets`."""
So we create this selector, which will output 1 1 1 3 3 3 9 9 9 4 4 4, and combine it with the per-class datasets to obtain our final dataset that will output balanced batches:
def generator(_):
    # Sample `num_classes_per_batch` classes for the batch
    sampled = tf.random_shuffle(tf.range(num_classes))[:num_classes_per_batch]
    # Repeat each element `num_images_per_class` times
    batch_labels = tf.tile(tf.expand_dims(sampled, -1), [1, num_images_per_class])
    return tf.to_int64(tf.reshape(batch_labels, [-1]))

selector = tf.contrib.data.Counter().map(generator)
selector = selector.apply(tf.contrib.data.unbatch())
dataset = tf.contrib.data.choose_from_datasets(per_class_datasets, selector)

# Batch
batch_size = num_classes_per_batch * num_images_per_class
dataset = dataset.batch(batch_size)
You can test this with the nightly TensorFlow build, using DirectedInterleaveDataset as a workaround:
# The working option right now is
from tensorflow.contrib.data.python.ops.interleave_ops import DirectedInterleaveDataset
dataset = DirectedInterleaveDataset(selector, per_class_datasets)
I also wrote about this workaround here.
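(Update for readers on modern TensorFlow: the contrib APIs above are gone, but as far as I know the same functionality is now a stable method on tf.data.Dataset:)
# TF 2.x spelling of the same idea:
dataset = tf.data.Dataset.choose_from_datasets(per_class_datasets, selector)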

tensorflow tfrecord storage for large datasets

I'm trying to understand the "proper" method of storage for large datasets for tensorflow ingestion. The documentation seems relatively clear that no matter what, tfrecord files are preferred. Large is a subjective measure, but the examples below are randomly generated regression datasets from sklearn.datasets.make_regression() of 10,000 rows and between 1 and 5,000 features, all float64.
I've experimented with two different methods of writing tfrecord files with dramatically different performance.
For numpy arrays X, y (X.shape=(10000, n_features), y.shape=(10000,)):
tf.train.Example with per-feature tf.train.Features
I construct a tf.train.Example in the way that tensorflow developers seem to prefer, at least judging by tensorflow example code at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/how_tos/reading_data/convert_to_records.py.
For each observation or row in X, I create a dictionary keyed with feature names (f_0, f_1, ...) whose values are tf.train.Feature objects with the feature's observation data as a single element of its float_list.
def _feature_dict_from_row(row):
    """
    Take a row of length n+1 from a 2-D ndarray and convert it to a dictionary of
    float features:
        {
            'f_0': row[0],
            'f_1': row[1],
            ...
            'f_n': row[n]
        }
    """
    def _float64_feature(feature):
        return tf.train.Feature(float_list=tf.train.FloatList(value=[feature]))
    features = {"f_{:d}".format(i): _float64_feature(value) for i, value in enumerate(row)}
    return features

def write_regression_data_to_tfrecord(X, y, filename):
    with tf.python_io.TFRecordWriter('{:s}'.format(filename)) as tfwriter:
        for row_index in range(X.shape[0]):
            features = _feature_dict_from_row(X[row_index])
            # The label must be wrapped in a tf.train.Feature as well.
            features['label'] = tf.train.Feature(float_list=tf.train.FloatList(value=[y[row_index]]))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            tfwriter.write(example.SerializeToString())
tf.train.Example with one large tf.train.Feature containing all features
I construct a dictionary with one feature (really two, counting the label) whose value is a tf.train.Feature with the entire feature row as its float_list:
def write_regression_data_to_tfrecord(X, y, filename, store_by_rows=True):
    with tf.python_io.TFRecordWriter('{:s}'.format(filename)) as tfwriter:
        for row_index in range(X.shape[0]):
            features = {'f_0': tf.train.Feature(float_list=tf.train.FloatList(value=X[row_index]))}
            # As above, the label is wrapped in a tf.train.Feature.
            features['label'] = tf.train.Feature(float_list=tf.train.FloatList(value=[y[row_index]]))
            example = tf.train.Example(features=tf.train.Features(feature=features))
            tfwriter.write(example.SerializeToString())
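As an aside, here is a sketch of a matching read path for this second layout (hedged: it assumes n_features is known at parse time, uses the newer tf.io spelling of the parsing API, and 'regression.tfrecords' is a placeholder filename):
import tensorflow as tf

def parse_example(serialized, n_features):
    # The whole row was written as one float_list, so it parses back as a
    # fixed-length vector; the label was written as a single float.
    spec = {
        'f_0': tf.io.FixedLenFeature([n_features], tf.float32),
        'label': tf.io.FixedLenFeature([], tf.float32),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed['f_0'], parsed['label']

ds = tf.data.TFRecordDataset('regression.tfrecords')
ds = ds.map(lambda s: parse_example(s, n_features=5000))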
As the number of features in the dataset grows, the second option gets considerably faster than the first, as shown in the following graph (note the log scale).
[Graph: serialization time vs. number of features for the two methods, 10,000 rows]
It makes intuitive sense to me that creating 5,000 tf.train.Feature objects is significantly slower than creating one object with a float_list of 5,000 elements, but it's not clear that this is the "intended" method for feeding large numbers of features into a tensorflow model.
Is there something inherently wrong with doing this the faster way?