TensorFlow sliding window transformation over large data

I would like to feed my model with stride-1 windows from a very long data sequence (tens of millions of entries). This is similar to the aim presented in this thread, except that my data sequence may contain several features to begin with, so the final number of features is n_features * window_size. I.e., with two original features and a window size of 3, this would mean transforming this:
[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
to:
[[1, 2, 3, 6, 7, 8], [2, 3, 4, 7, 8, 9], [3, 4, 5, 8, 9, 10]]
I was trying to use slicing with map_fn or Dataset.map, applied to a sequence of indices (per the answer in the above-mentioned thread), as in:
ti = tf.range(data.shape[0] - window_size)
train_dataset = tf.data.Dataset.from_tensor_slices((ti, labels))
def get_window(l, label):
    wnd = tf.reshape(data_tensor[l:(l + window_size), :], (-1, window_size * n_features))
    wnd = tf.squeeze(wnd)
    return (wnd, label)
train_dataset = train_dataset.map(get_window)
train_dataset = train_dataset.batch(batch_size)
...
This works in principle, but training is extremely slow, with minimal GPU utilization (1-5%), probably in part because the mapping is done on the CPU.
When trying to do the same with tf.map_fn, the graph building becomes very lengthy, with tremendous memory utilization.
Another option I tried is to transform all of the data in advance, before loading it into TensorFlow. This works much faster (even when the pre-processing time is taken into account, which makes me wonder why; shouldn't it be the same operation as the mapping during training?), but it is very inefficient in terms of memory and storage, as the data becomes window_size-fold larger. That is a deal-breaker for my large datasets.
I thought about splitting these transformed, bloated datasets into several files ("hyper-batches") and going through them in sequence each epoch, but this seems very inefficient, and I was wondering if there is a better way to achieve this simple transformation.
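For reference, a minimal sketch of the ahead-of-time transformation using NumPy's sliding_window_view (assuming NumPy >= 1.20; the toy data is the example above, with features as rows). The strided view itself costs no extra memory; it is the final reshape that materializes the window_size-fold copy:
import numpy as np
data = np.array([[1, 2, 3, 4, 5],
                 [6, 7, 8, 9, 10]])   # shape (n_features, n_samples)
window_size = 3
# Strided view over the time axis: shape (n_features, n_windows, window_size), no copy yet
views = np.lib.stride_tricks.sliding_window_view(data, window_size, axis=1)
# Reorder to (n_windows, n_features, window_size) and flatten each window; this copies
windows = views.transpose(1, 0, 2).reshape(views.shape[1], -1)
print(windows)
# [[ 1  2  3  6  7  8]
#  [ 2  3  4  7  8  9]
#  [ 3  4  5  8  9 10]]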

Related

How to reliably create a multi-dimensional array and a one-dimensional view of it in numpy, so that the memory layout is contiguous?

According to the documentation of numpy.ravel,
Return a contiguous flattened array.
A 1-D array, containing the elements of the input, is returned. A copy is made only if needed.
For convenience and efficiency of indexing, I would like to have a one-dimensional view of a 2-dimensional array. I am using ravel for creating the view, and so far so good.
However, it is not clear to me what is meant by "A copy is made only if needed." If some day a copy is created while my code is executed, the code will stop working.
I know that there is numpy.reshape, but its documentation says:
It is not always possible to change the shape of an array without copying the data.
In any case, I would like the data to be contiguous.
How can I reliably create a 2-dimensional array and a 1-dimensional view into it? I would like the data to be contiguous in memory (for efficiency). Are there any attributes to specify when creating the 2-dimensional array to ensure that it is contiguous and ravel will not need to copy it?
Related question: What is the difference between flatten and ravel functions in numpy?
The warnings for ravel and reshape are the same. ravel is just reshape(-1), i.e. a reshape to 1d. Conversely, the reshape docs tell us that we can think of it as first doing a ravel.
Normal array construction produces a contiguous array, and reshape with the same order will produce a view. You can visually test that by looking at the ravel and checking if the values appear in the expected order.
In [348]: x = np.arange(6).reshape(2,3)
In [349]: x
Out[349]:
array([[0, 1, 2],
       [3, 4, 5]])
In [350]: x.ravel()
Out[350]: array([0, 1, 2, 3, 4, 5])
I started with the arange, reshaped it to 2d, and back to 1d. No change in order.
But if I make a sliced view:
In [351]: x[:,:2]
Out[351]:
array([[0, 1],
       [3, 4]])
In [352]: x[:,:2].ravel()
Out[352]: array([0, 1, 3, 4])
This sliced view has gaps in memory, so its ravel has to be a copy.
A transpose is also a view, but one that cannot be reshaped to 1d as a view:
In [353]: x.T
Out[353]:
array([[0, 3],
       [1, 4],
       [2, 5]])
In [354]: x.T.ravel()
Out[354]: array([0, 3, 1, 4, 2, 5])
Except, if we specify the right order, the ravel is a view.
In [355]: x.T.ravel(order='F')
Out[355]: array([0, 1, 2, 3, 4, 5])
The reshape docs have an extensive discussion of order. And transpose actually works by returning a view with different shape and strides. For a 2d array, a transpose produces an order-F array.
So as long as you are aware of manipulations like this, you can safely assume that the reshape/ravel is contiguous.
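If you prefer to verify this at runtime rather than reason about every manipulation, a minimal sketch using standard NumPy checks (the contiguity flag and np.shares_memory):
import numpy as np
x = np.arange(6).reshape(2, 3)    # freshly constructed, C-contiguous
print(x.flags['C_CONTIGUOUS'])    # True
r = x.ravel()
print(np.shares_memory(x, r))     # True: this ravel is a view
s = x[:, :2].ravel()              # the sliced view has gaps in memory
print(np.shares_memory(x, s))     # False: this ravel is a copy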
Note that even though Out[354] is a copy, assignment through .flat changes the original:
In [361]: x[:,:2].flat[:] = [3,4,2,1]
In [362]: x
Out[362]:
array([[3, 4, 2],
       [2, 1, 5]])
x[:,:2].ravel()[:] = [10,11,2,3] does not change x. In cases like this y = x[:,:2].flat may be more useful than the ravel equivalent.
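A small sketch of that difference (the assigned values are arbitrary):
import numpy as np
x = np.arange(6).reshape(2, 3)
x[:, :2].flat[:] = [10, 11, 12, 13]   # .flat writes through the view, so x changes
print(x)                              # the first two columns are updated
x[:, :2].ravel()[:] = [0, 0, 0, 0]    # ravel() of this view is a copy, so the assignment is lost
print(x)                              # unchanged by the second assignment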

numpy array of array adding up another array

I have the following arrays:
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[1, 5, 10]])
and want to add the values in b to each row of a, like:
np.array([[2,7,13],[5,10,16]])
What is the best approach, performance-wise, to achieve this?
Thanks
Broadcasting does that for you, so:
>>> a+b
just works:
array([[ 2,  7, 13],
       [ 5, 10, 16]])
And it can also be done with
>>> a + np.tile(b,(2,1))
which gives the result
array([[ 2,  7, 13],
       [ 5, 10, 16]])
Depending on the size of the inputs and your time constraints, both of the following methods are worth considering.
Method 1: Numpy Broadcasting
- Operations on two arrays are possible only if the arrays are compatible.
- The operation is generally done together with broadcasting.
- Broadcasting, in layman's terms, means repeating elements along a specified axis.
Conditions for broadcasting:
- The arrays need to be compatible.
- Compatibility is decided based on their shapes.
- Shapes are compared from right to left.
- While comparing from right to left, the dimensions should either be equal or one of them should be 1.
- The smaller array is broadcast (repeated) over the bigger array (see the sketch after this list).
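A quick way to check these compatibility rules is np.broadcast_shapes (available in NumPy >= 1.20); a minimal sketch:
import numpy as np
print(np.broadcast_shapes((2, 3), (1, 3)))   # (2, 3): compatible
print(np.broadcast_shapes((2, 3), (3,)))     # (2, 3): compatible
try:
    np.broadcast_shapes((2, 3), (2,))        # trailing 3 vs 2: incompatible
except ValueError as err:
    print(err)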
a.shape, b.shape
((2, 3), (1, 3))
From the rules they are compatible, so they can be added. b is smaller, so b is repeated along the first dimension and can be treated as [[1, 5, 10], [1, 5, 10]]. But note that numpy does not allocate new memory for this; it is just a view.
a + b
array([[ 2,  7, 13],
       [ 5, 10, 16]])
Method 2: Numba
- Numba gives you parallelism.
- It compiles the function to optimized machine code.
- The reason this can help is that numpy broadcasting is sometimes not good enough: ufuncs (np.add, np.matmul, etc.) allocate temporary memory during operations, which can be time consuming if you are already at the memory limit.
- Easy parallelization.
Using numba, depending on your requirements, you can avoid the temporary memory allocation and the various checks which numpy does, which can speed up the code for huge inputs; see, for example, Why are np.hypot and np.subtract.outer very fast?
import numpy as np
import numba as nb
@nb.njit(parallel=True)
def sum(a, b):
    s = np.empty(a.shape, dtype=a.dtype)
    # nb.prange gives numba a hint about what to parallelize
    for i in nb.prange(a.shape[0]):
        s[i] = a[i] + b   # with a 2-d b, b.ravel() may be needed here
    return s
sum(a, b)

Using GridSearchCV of xgboost with DMatrix

I ran into some problems while practicing how to use xgboost.
As I understand it, the DMatrix is a special internal data structure that makes the model run faster.
Here's the problem:
To tune the model, (I guess) GridSearchCV or RandomizedSearchCV are worth considering.
With the code below:
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}
random_search = RandomizedSearchCV(xgb, param_distributions=params, n_iter=param_comb,
                                   scoring='roc_auc', n_jobs=4, cv=skf.split(X, Y),
                                   verbose=3, random_state=1001)
I can also do the cross validation by passing cv. That was great.
However, it really takes time (almost 40 minutes with big data on a Colab GPU) and I really want to improve it.
After I transform my train data to DMatrix:
xgbtrain = xgb.DMatrix(train_x, train_y)
I don't know what to do next, because .fit requires X and y.
How can I do that? Or is there any way to make it faster?
Thanks
This question is pretty old, so I suspect you may have already found an answer. The different options in XGBoost can be tricky to navigate when incorporating CV or parameter tuning.
Instead of using xgb.fit() you can use xgb.train() to utilize the DMatrix object. Additionally, XGBoost has xgb.cv() for performing cross validation. I myself am hoping to find an alternative to GridSearchCV, but I don't think there is one. The best method may be to create a loop over xgb.cv() calls to compare evaluation results and identify the best performing parameters (see the sketch below).
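A minimal sketch of such a loop, assuming a binary classification problem scored with AUC (the parameter values are placeholders, and train_x/train_y are the arrays from the question):
import itertools
import xgboost as xgb
dtrain = xgb.DMatrix(train_x, label=train_y)
grid = {'max_depth': [3, 4, 5], 'min_child_weight': [1, 5, 10]}  # illustrative grid
best_score, best_params = None, None
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    params.update({'objective': 'binary:logistic', 'eval_metric': 'auc'})
    # xgb.cv works directly on the DMatrix, no .fit(X, y) needed
    cv_results = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                        early_stopping_rounds=20, seed=1001)
    score = cv_results['test-auc-mean'].max()
    if best_score is None or score > best_score:
        best_score, best_params = score, params
print(best_params, best_score)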
XGBoost has really helpful documentation; you may want to check out XGB Python Intro: Training and Cross Validation Demo.
Try Optuna for hyperparameter tuning of XGBoost; it is much, much faster, and use the GPU (tree_method='gpu_hist'). Kaggle offers a free GPU quota every week.
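For reference, a minimal Optuna sketch in the same spirit (the hyperparameter ranges are placeholders; train_x/train_y are from the question, and the cross validation reuses xgb.cv on the DMatrix):
import optuna
import xgboost as xgb
dtrain = xgb.DMatrix(train_x, label=train_y)
def objective(trial):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'tree_method': 'gpu_hist',   # use the GPU, as suggested above
        'max_depth': trial.suggest_int('max_depth', 3, 5),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
    }
    cv = xgb.cv(params, dtrain, num_boost_round=200, nfold=5,
                early_stopping_rounds=20, seed=1001)
    return cv['test-auc-mean'].max()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params)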

How do I resolve one hot encoding if my test data has missing values in a col?

For example, if my training data has the categorical values (1,2,3,4,5) in a column, then one hot encoding will give me 5 columns. But in the test data I have, say, only 4 out of the 5 values, i.e. (1,3,4,5), so one hot encoding will give me only 4 columns. Therefore, if I apply my trained weights to the test data, I will get an error, as the dimensions of the columns do not match between the train and test data, dim(4) != dim(5). Any suggestions on what to do with the missing column values?
Guys, please don't make this mistake!
Yes, you can do this hack with concatenation of the train and test sets and fool yourself, but the real problem is in production: there your model will someday face an unknown level of your categorical variable and then break.
In reality, some of the more viable options could be:
Retrain your model periodically to account for new data.
Do not use one hot. Seriously, there are many better options, like leave-one-out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154), conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), and target encoding, to name a few. Some classifiers like CatBoost even have a built-in mechanism for encoding, and there are mature libraries like category_encoders in Python where you will find lots of other options.
Embed the categorical features; this could save you from a complete retrain (http://flovv.github.io/Embeddings_with_keras/).
You can first combine the two dataframes, then call get_dummies, then split them again so that they have the exact same number of columns, i.e.:
#Example Dataframes
Xtrain = pd.DataFrame({'x':np.array([4,2,3,5,3,1])})
Xtest = pd.DataFrame({'x':np.array([4,5,1,3])})
# Concat with keys then get dummies
temp = pd.get_dummies(pd.concat([Xtrain,Xtest],keys=[0,1]), columns=['x'])
# Selecting data from multi index and assigning them i.e
Xtrain,Xtest = temp.xs(0),temp.xs(1)
# Xtrain.as_matrix()
# array([[0, 0, 0, 1, 0],
# [0, 1, 0, 0, 0],
# [0, 0, 1, 0, 0],
# [0, 0, 0, 0, 1],
# [0, 0, 1, 0, 0],
# [1, 0, 0, 0, 0]], dtype=uint8)
# Xtest.as_matrix()
# array([[0, 0, 0, 1, 0],
# [0, 0, 0, 0, 1],
# [1, 0, 0, 0, 0],
# [0, 0, 1, 0, 0]], dtype=uint8)
Do not follow this approach. It's a simple trick with a lot of disadvantages. @Vast Academician's answer explains it better.
Use dummy (binary) encoding instead of one hot encoding. Pandas pd.get_dummies() with drop_first=True creates a dummy encoding, producing k-1 dummies out of k categorical levels by removing the first level. The default option drop_first=False creates a one hot encoding.
See the pandas official documentation.
Also, dummy (binary) encoding creates fewer columns.
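A minimal sketch of the difference on a toy column (the values are the five levels from the question):
import pandas as pd
s = pd.Series([1, 2, 3, 4, 5], name='x')
print(pd.get_dummies(s, prefix='x'))                    # one hot: k columns for k levels
print(pd.get_dummies(s, prefix='x', drop_first=True))   # dummy: k-1 columns, first level is the baseline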

Parallel h5py/hdf5 writing to large dataset skips data chunks

I am using MPI and h5py/HDF5 (HDF5 and h5py were compiled with parallel capabilities, and everything runs on Python 3.4) on a cluster to stitch a dataset of overlapping tiles (200 or more images/numpy arrays of size 2048x2048).
Each tile has an assigned index number that corresponds to the position where it should be written, and all the indices are stored in an array:
Example tile_array:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]], dtype=int32)
The issues I have are:
When running the code on the cluster:
- When I use multiple cores to write the stitched image in parallel, I get randomly missing tiles.
- When I run the stitching serially on rank==0, I still have missing tiles.
- If the dataset is small, there are no issues.
- If I do not activate atomic mode in the parallel writing, the number of errors increases. The number of errors also increases if atomic mode is activated when I run the writing on rank==0.
When running the code on a laptop:
- When I run the same code on a single core on a laptop, stripped of the MPI code, no tiles are missing.
Any comment will be greatly appreciated.
Highlight of the procedure:
0) Before stitching, each image of the set is processed and the overlapping edges are blended in order to get a better result in the overlapping regions
1) I open the hdf file where I want to save the data: stitching_file=h5py.File(fpath,'a',driver='mpio',comm=MPI.COMM_WORLD)
2) I create a big dataset that will contain the stitched images
3) I fill the dataset with zeros
4) The processed images will be written (added) to the big dataset by different cores. However, adjacent images have overlapping regions. To avoid different cores writing overlapping tiles at the same time and breaking the write, I run four rounds of writing that cover the whole dataset; with the example tile_array above, the pattern is:
Writing rounds:
writing round one: array([ 0, 2, 8, 10], dtype=int32)
writing round two: array([ 1, 3, 9, 11], dtype=int32)
writing round three: array([ 4, 6, 12, 14], dtype=int32)
writing round four: array([ 5, 7, 13, 15], dtype=int32)
If I write on a single rank, I don't use this scheme and just write each position sequentially.
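For clarity, the four rounds above are the four disjoint checkerboard sub-grids of the tile index array; a minimal sketch of how they can be generated (matching the slicing used in the pseudocode below):
import numpy as np
tile_array = np.arange(16, dtype='int32').reshape(4, 4)
# No two tiles within a group are adjacent, so their overlap regions never collide
round_one   = tile_array[::2, ::2].ravel()    # [ 0  2  8 10]
round_two   = tile_array[::2, 1::2].ravel()   # [ 1  3  9 11]
round_three = tile_array[1::2, ::2].ravel()   # [ 4  6 12 14]
round_four  = tile_array[1::2, 1::2].ravel()  # [ 5  7 13 15]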
Pseudocode
# Preprocessing of the images
------------------------------
------------------------------
# MPI data chunking function
def tasks_chunking(tasks, size):
    # This function scatters any kind of list, not only ordered ones.
    # If the number of cores is bigger than the length of the list,
    # the size of the chunk will be zero and the corresponding core gets no tasks.
    # Chunk the list of tasks
    Chunked_list = np.array_split(tasks, size)
    NumberChunks = np.zeros(len(Chunked_list), dtype='int32')
    for idx, chunk in enumerate(Chunked_list):
        NumberChunks[idx] = len(chunk)
    # Calculate the displacement
    Displacement = np.zeros(len(NumberChunks), dtype='int32')
    for idx, count in enumerate(NumberChunks[0:-1]):
        Displacement[idx + 1] = Displacement[idx] + count
    return Chunked_list, NumberChunks, Displacement
# Flush the file after the preprocessing
stitching_file.flush()
# Writing dictionary. Each group will contain a subset of the tiles to be written.
# Data has the size of the final array where the data will be saved.
fake_tile_array = np.arange(Data.size, dtype='int32')
fake_tile_array = fake_tile_array.reshape(Data.shape)
writing_dict = {}
writing_dict['even_group1'] = fake_tile_array[::2, ::2].flat[:]
writing_dict['even_group2'] = fake_tile_array[::2, 1::2].flat[:]
writing_dict['odd_group1'] = fake_tile_array[1::2, ::2].flat[:]
writing_dict['odd_group2'] = fake_tile_array[1::2, 1::2].flat[:]
# Without activating the atomic mode, the number of errors in parallel mode is higher.
stitching_file.atomic = True
# Loop through the dictionary items to start the writing
for key, item in writing_dict.items():
    # Chunk the positions that need to be written
    if rank == 0:
        Chunked_list, NumberChunks, Displacement = tasks_chunking(item, size)
    else:
        NumberChunks = None
        Displacement = None
        Chunked_list = None
    # Make the cores aware of the number of jobs that will need to run.
    # The variable cnt is filled by the Scatter call with the number of tasks
    # for this core and is different on every core.
    cnt = np.zeros(1, dtype='int32')
    comm.Scatter(NumberChunks, cnt, root=0)
    # Define the local variable that will be filled up with the scattered data.
    # cnt determines the size of xlocal on the different cores.
    xlocal = np.zeros(cnt, dtype='int32')
    # Scatter the task values to the different cores
    comm.Scatterv([item, NumberChunks, Displacement, MPI.INT], xlocal, root=0)
    for tile_ind in xlocal:
        # This writing function is the same one called when I run the writing
        # on rank 0 or on a single core on a laptop.
        paste_in_final_image(joining, temp_file, stitched_group, tile_ind, nr_pixels)
    comm.Barrier()
    stitching_file.flush()
stitching_file.close()