How to load a huge time-series window dataset without memory errors? - numpy

I want to convert a typical time series dataset of about 1 million lines into 100-item windows with 50% overlap. The data is multivariate, so for example, given 8 features and 1000 windows of 100 items each, the final shape would be (1000, 100, 8), i.e. (n_samples, n_timesteps, n_features). The goal is to use it for training machine learning algorithms, including deep neural networks.
So far, I've enjoyed using numpy's sliding_window_view, as shown below:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

x = np.arange(100).reshape(20, 5)
v = sliding_window_view(x, (3, 5))
v
Unfortunately, on large datasets with millions of lines I run out of RAM and the process crashes. Do you have any suggestions?
Additionally, one serious restriction is that every timestep has an integer label, and the dataset needs to be grouped by that label (using pandas), which limits the options for reading it in portions.
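To make the constraint concrete, here is a rough sketch of what I'm doing per label group (the column names are made up; df is a pandas DataFrame with one row per timestep), and it's the final concatenation that runs out of memory:

def make_windows(df, feature_cols, label_col="label", window=100, step=50):
    # build (n_windows, window, n_features) per label group; step = window // 2 gives 50% overlap
    windows, labels = [], []
    for label, group in df.groupby(label_col, sort=False):
        values = group[feature_cols].to_numpy()  # (n_timesteps, n_features)
        if len(values) < window:
            continue
        v = sliding_window_view(values, (window, values.shape[1]))[::step, 0]
        windows.append(v)                         # (n_windows, window, n_features)
        labels.extend([label] * len(v))
    return np.concatenate(windows), np.array(labels)  # this concatenation is what blows up RAM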

I think you are looking for tf.data.Dataset. I'm working on a dataset with a million rows, and the following code runs well for me:
import tensorflow as tf

convert = tf.data.TextLineDataset("path_to_file.txt")
dataset = tf.data.Dataset.zip(convert)
Now you have initialized your dataset; to avoid running into memory issues, batch and prefetch it:
def dataset_batches(ds, batch_size):
    # you can chain more operations here
    return (
        ds
        .cache()
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )

train_batches = dataset_batches(dataset, 64)
And to run it, you'll have to loop:
for (batch, row) in enumerate(train_batches):
    # do stuff
    # batch = the current batch index (0, 1, 2, ...); if your dataset has 1600 rows
    # and you've used batch_size=16, you'll have 100 batches
    # row is the actual data (a tensor)
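If you also need the 100-step windows with 50% overlap from your question, tf.data can build those lazily as well. A rough sketch (random placeholder data stands in for your parsed timesteps):

import tensorflow as tf

# placeholder: a stream of per-timestep feature vectors with 8 features each
timesteps = tf.data.Dataset.from_tensor_slices(tf.random.normal([1000, 8]))

windows = (
    timesteps
    .window(100, shift=50, drop_remainder=True)  # 100-step windows, 50% overlap
    .flat_map(lambda w: w.batch(100))            # each element: (100, 8)
    .batch(32)                                   # each element: (32, 100, 8)
    .prefetch(tf.data.AUTOTUNE)
)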

Related

How to see the indices of the split on the data that GridSearchCV used when it made the split?

When using GridSearchCV() to perform a k-fold cross validation analysis on some data is there a way to know which data was used for each split?
For example, assume the goal is to build a binary classifier of your choosing, named 'model'. There are 100 data points (rows) with 5 features each and an associated 1 or 0 target. 20 of the 100 data points are held out for testing after training and hyperparameter tuning; GridSearchCV will never see those 20 data points. The other 80 rows are passed to the estimator as X and Y, so GridSearchCV only sees 80 rows of data. Various hyperparameters are tuned and laid out in the param_grid variable. For this case the cross validation parameter cv is set to 3, as shown:
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X, Y)
Is there a way to see which data was used as the training data and as the cross validation data for each fold? Maybe seeing which indices were used for the split?
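One way to inspect them (a sketch, not part of the original question; it assumes model is a classifier, for which an integer cv defaults to stratified, unshuffled folds) is to pass an explicit splitter and call its split method on the same data:

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# same X, Y, model and param_grid as described above
cv = StratifiedKFold(n_splits=3, shuffle=False)
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv)
grid_result = grid.fit(X, Y)

# the splitter is deterministic, so this reproduces the folds GridSearchCV used
for fold, (train_idx, val_idx) in enumerate(cv.split(X, Y)):
    print(f"fold {fold}: train indices {train_idx}, validation indices {val_idx}")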

PyTorch alternative for tf.data.experimental.sample_from_datasets

Suppose I have two datasets, dataset one with 100 items and dataset two with 5000 items.
Now I want my model to see as many items from dataset one as from dataset two during training.
In Tensorflow I can do:
dataset = tf.data.experimental.sample_from_datasets(
    [dataset_one, dataset_two], weights=[50, 1], seed=None
)
Is there an alternative in PyTorch that does the same?
I think this is not too difficult to implement by creating a custom dataset; a rough sketch:
import random

from torch.utils.data import Dataset

class SampleDataset(Dataset):
    def __init__(self, datasets, weights):
        self.datasets = datasets
        self.weights = weights

    def __len__(self):
        return sum(len(dataset) for dataset in self.datasets)

    def __getitem__(self, idx):
        # pick a dataset according to the weights, then a random item from it
        dataset_idx = random.choices(range(len(self.datasets)), weights=self.weights)[0]
        sample_idx = random.randrange(len(self.datasets[dataset_idx]))
        return self.datasets[dataset_idx][sample_idx]
However, this seems quite common. Is there already something like this available?
I don't think there is a direct equivalent in PyTorch.
However, there's a class called torch.utils.data.WeightedRandomSampler which samples indices based on a list of weights. You can use this in combination with torch.utils.data.ConcatDataset and torch.utils.data.DataLoader's sampler option.
I'll give an example with two datasets: SetA, which has 500 elements, and SetB, which only has 10.
First, you can create a concatenation of all your datasets with ConcatDataset:
ds = ConcatDataset([SetA(), SetB()])
Then, we need to sample from it. The problem is, you can't just give WeightedRandomSampler [50, 1] as you did in Tensorflow; it expects one weight per sample of the concatenated dataset. As a workaround, you can create a list of weights of the same length as the total dataset.
The corresponding weight list for this example would be:
dist = np.array([1/51]*500 + [50/51]*10)
Essentially, each of the first 500 indices (i.e. indices 'pointing' to SetA) gets a weight of 1/51, while each of the following 10 indices (i.e. indices in SetB) gets a weight of 50/51. Individual elements of SetB are therefore much more likely to be sampled, since there are fewer of them, and both sets end up with the same total weight (500 x 1/51 = 10 x 50/51), which is the desired result!
We can create a sampler from that distribution:
sampler = WeightedRandomSampler(dist, 10)
Where 10 is the number of sampled elements. I would put the size of the smallest dataset, otherwise you would likely be going over the same datapoints multiple times during the same epoch...
Finally, we just have to instantiate the dataloader with our dataset and sampler:
dl = DataLoader(ds, sampler=sampler)
To summarize:
ds = ConcatDataset([SetA(), SetB()])
dist = np.array([1/51]*500 + [50/51]*10)
sampler = WeightedRandomSampler(dist, 10)
dl = DataLoader(ds, sampler=sampler)
Edit, for any number of datasets:
sets = [SetA(), SetB(), SetC()]
ds = ConcatDataset(sets)
dist = np.concatenate([[(len(ds) - len(s))/len(ds)]*len(s) for s in sets])
sampler = WeightedRandomSampler(weights=dist, num_samples=min([len(s) for s in sets]))
dl = DataLoader(ds, sampler=sampler)
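For completeness, here is a small self-contained version of the two-dataset case that you can run directly (dummy TensorDatasets stand in for SetA and SetB):

import numpy as np
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

set_a = TensorDataset(torch.zeros(500, 3))  # stand-in for SetA (500 elements)
set_b = TensorDataset(torch.ones(10, 3))    # stand-in for SetB (10 elements)

ds = ConcatDataset([set_a, set_b])
dist = np.array([1 / 51] * 500 + [50 / 51] * 10)
sampler = WeightedRandomSampler(weights=dist, num_samples=10)
dl = DataLoader(ds, sampler=sampler, batch_size=2)

for (batch,) in dl:
    # roughly half of the drawn samples should be zeros (SetA) and half ones (SetB)
    print(batch)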

Why is TensorFlow's tf.data.Dataset.shuffle so slow?

The shuffle step in the following code runs very slowly for a moderate buffer_size (say 1000):
filenames = tf.constant(filenames)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
dataset = dataset.shuffle(buffer_size)
If we use numpy to shuffle the data, the code looks as follows:
idx = np.arange(len(filenames))
np.random.shuffle(idx)
new_filenames = [filenames[i] for i in idx]
next_batch_filenames = new_filenames[:batch_size]
# get the corresponding files in batch
This is much faster. I wonder whether TF does something beyond simply shuffling the data.
As Anton Codes wrote, your first snippet shuffles batches of whatever _parse_function parses from your files (probably feature data), while your second snippet only shuffles filenames.
If shuffling on file level is sufficient, you can actually achieve (roughly) the same performance via the tf.data.Dataset API:
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames)) # shuffle file names
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
This practice of shuffling "pointers" to your training samples instead of the samples themselves can often improve performance.
NumPy might still be a little bit more efficient though, due to the overhead of shuffling inside the computational graph (which tf.data.Dataset.shuffle does, there is actually a C++ kernel specifically for this operation).
The advantage of the tf.data.Dataset approach is that it can automatically reshuffle the Dataset after each epoch.
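Concretely, that per-epoch reshuffling is controlled by a flag that defaults to True (assuming a reasonably recent TensorFlow version):

dataset = dataset.shuffle(len(filenames), reshuffle_each_iteration=True)  # reshuffles every epoch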
The comparison is of two quite different operations.
Your pipeline ends up reading from disk: the filenames produced by tf.data.Dataset.from_tensor_slices((filenames, labels)) are opened and parsed by _parse_function. Disk is physical long-term storage, possibly a magnetic spinning hard drive, and that is slow. If you have the ability to store all of this in RAM instead, or on an ultra-fast RAID-style flash drive, then you'll address your largest bottleneck.
You also have a _parse_function that is fired off for each data point, every time there is a data read. The computation of that parse takes time, and depending on what it does, it could be significant.
The comparison to numpy isn't really fair, in that your numpy example doesn't involve reading from disk or parsing data.
That should be the bulk of the difference. If you've addressed the above, the next place to look for more speedup is with these lines
3) dataset = dataset.map(_parse_function)
4) dataset = dataset.batch(batch_size)
5) dataset = dataset.shuffle(buffer_size)
These are your code lines. Line 4 makes batches of data of size batch_size (say 32). Then line 5 kicks in and tries to shuffle whole batches of 32 in a buffer of length 1000. That happens every time the training loop requests a new training batch: the shuffle step keeps all those big batches in its buffer, picks one out at random and adds a new one ... every ... single ... time.
We can reverse the order of batch and shuffle like so
3) dataset = dataset.map(_parse_function)
4) dataset = dataset.shuffle(buffer_size)
5) dataset = dataset.batch(batch_size)
This is better anyway, because before, the contents of the batches were always the same and only their order was mixed; this way the contents of the batches are randomized as well. Next, the shuffle only has to shuffle 1000 items, not 32x1000 items. Last, we can question whether we really need a buffer size of 1000. Say our dataset has 2000 items. A buffer size of 320 and a batch size of 32 will certainly randomize our data well, effectively giving any item in the buffer a 10% chance of going into the next batch and a 90% chance of being pushed back to mix with other data. That's pretty good. A buffer size of 64 with a batch size of 64 seems almost useless, other than that items are pulled out of the buffer randomly one at a time, and so actually have a chance of not being drawn and mixing with later data. Just not much of one.

Numpy- Deep Learning, Training Examples

Silly question: I am going through the third week of Andrew Ng's newest deep learning course, and I'm getting stuck on a fairly simple NumPy function (I think?).
The exercise is to find how many training examples, m, we have.
Any idea what the NumPy function is to find the number of examples in a preloaded training set?
Thanks!
shape_X = X.shape
shape_Y = Y.shape
m = ?
print ('The shape of X is: ' + str(shape_X))
print ('The shape of Y is: ' + str(shape_Y))
print ('I have m = %d training examples!' % (m))
It depends on what kind of storage-approach you use.
Most python-based tools use the [n_samples, n_features] approach where the first dimension is the sample-dimension, the second dimension is the feature-dimension (like in scikit-learn and co.). Alternatively expressed: samples are rows and features are columns.
So:
#              feature 1  2  3  4
x = np.array([[1, 2, 3, 4],   # first sample
              [2, 3, 4, 5],   # second sample
              [3, 4, 5, 6]])  # third sample
is a training-set of 3 samples with 4 features each.
You can get the sizes M, N (again: the interpretation might differ for other conventions) with:
M, N = x.shape
because numpy's first dimension is rows and its second dimension is columns, as in matrix algebra.
For the above example, the target array would be of shape (M,), i.e. one target per sample.
Anytime you want to know the size of an array, you can use
m = X.size
This gives the total number of elements in X, i.e. n_features x n_examples, so it only equals the number of training examples when there is a single feature; it is not the right quantity in general.
Since the course stores X with shape (n_features, m), one example per column, the better way is
m = X.shape[1]
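A tiny illustration of the difference, assuming the course's (n_features, m) layout with one example per column:

import numpy as np

X = np.zeros((2, 400))  # hypothetical: 2 features, 400 training examples

print(X.size)      # 800 -> total number of elements (n_features * m)
print(X.shape[1])  # 400 -> number of training examples m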

Why shuffling data gives significantly higher accuracy?

In Tensorflow, I've written a big model for a two-class image classification problem. My question concerns the following code snippet:
X, y, X_val, y_val = prepare_data()
probs = calc_probs(model, session, X)
accuracy = float(np.equal(np.argmax(probs, 1), np.argmax(y, 1)).sum()) / probs.shape[0]
loss = log_loss(y, probs)
X is an np.array of shape (25000, 244, 244, 3). That code results in accuracy=0.5834 (close to random accuracy) and loss=2.7106. But
when I shuffle the data, by adding these 3 lines after the first line:
sample_idx = random.sample(range(0, X.shape[0]), 25000)
X = X[sample_idx]
y = y[sample_idx]
then the results become much better: accuracy=0.9933 and loss=0.0208.
Why can shuffling the data give significantly higher accuracy? What could be the reason for that?
The function calc_probs is mainly a run call:
probs = session.run(model.probs, feed_dict={model.X: X})
Update:
After hours of debugging, I figured out that evaluating a single image gives a different result each time. For example, if you run the following line of code multiple times, you get a different result on every run:
session.run(model.probs, feed_dict={model.X: [X[20]]})
My data is normally sorted: X contains class 1 samples first, then class 2. And in the calc_probs function, I run each batch of the data sequentially. So, without shuffling, each run contains data from a single class.
I've also noticed that with shuffling, if the batch size is very small, I get random accuracy again.
There is some mathematical justification for this in the context of the randomized Kaczmarz algorithm. The regular Kaczmarz algorithm is an old method which can be seen as non-shuffling SGD on a least squares problem, and there are guaranteed faster convergence rates if you use randomization; follow the references in http://www.cs.ubc.ca/~nickhar/W15/Lecture21Notes.pdf
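To make that connection concrete, here is a toy sketch (not from the linked notes) of the Kaczmarz update for a least-squares system Ax = b; each step is exactly an SGD step on one row's squared error, and the randomized variant simply picks the row at random instead of cycling through the rows in order:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 20))
x_true = rng.normal(size=20)
b = A @ x_true

def kaczmarz(A, b, n_steps, randomized):
    x = np.zeros(A.shape[1])
    for k in range(n_steps):
        i = rng.integers(len(b)) if randomized else k % len(b)  # random vs. cyclic row order
        a = A[i]
        # SGD step on 0.5 * (a @ x - b[i])**2 with step size 1 / ||a||^2
        x += (b[i] - a @ x) / (a @ a) * a
    return np.linalg.norm(x - x_true)

print("cyclic order, final error:    ", kaczmarz(A, b, 2000, randomized=False))
print("randomized order, final error:", kaczmarz(A, b, 2000, randomized=True))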