Improve performance broadcast multiplication and 1 - broadcast

How can I improve the run time of the last two "operations" of this code (minus1_power_x_t*x and 1-minus1_power_x)? They take 1.73 and 1.47 seconds respectively. I am asking only about the last two operations because the other ones will be constants.
import time
from multiprocessing import Pool
import multiprocessing
import numpy as np
def create_ext_component_function(dim):
    x = np.array([i for i in range(dim)], dtype=float)
    t3 = time.time()
    y_shift = np.fromfunction(lambda i,j: (i)>>j, ((2**dim), dim), dtype=np.uint32)
    elapsed3 = time.time() - t3
    print('y_shift Elapsed: %s' % elapsed3)
    t3 = time.time()
    um = np.ones(((2**dim), dim), dtype=np.uint8)
    elapsed3 = time.time() - t3
    print('um Elapsed: %s' % elapsed3)
    t3 = time.time()
    and_list = np.bitwise_and(y_shift,um)
    elapsed3 = time.time() - t3
    print('and_list Elapsed: %s' % elapsed3)
    t3 = time.time()
    minus1_power_x_t = np.power(-1,and_list)
    elapsed3 = time.time() - t3
    print('minus1_power_x_t Elapsed: %s' % elapsed3)
    # I need to improve the last two operations
    t3 = time.time()
    minus1_power_x = minus1_power_x_t*x
    elapsed3 = time.time() - t3
    print('minus1_power*x Elapsed: %s' % elapsed3)
    t3 = time.time()
    um_minus_minus1_power = 1-minus1_power_x
    elapsed3 = time.time() - t3
    print('um_minus_minus1_power Elapsed: %s' % elapsed3)
    return um_minus_minus1_power
if __name__ == '__main__':
    dim = 24
    print(create_ext_component_function(dim))
EDIT: Take into account that minus1_power_x_t only contains the values -1 and 1.

The problem with this code is that it creates many big temporary arrays for basic memory-bound operations. Big temporary arrays are slow to fill, both because of the limited speed of the RAM (unsurprisingly) and because of page faults.
Indeed, the operating system (OS) maps virtual memory pages to physical memory pages when a big temporary array is filled for the first time. This process is slow for three reasons: the Numpy code is sequential (and so is the handling of the page faults); pages are generally small (typically 4 KB on an x86 system), so there are a lot of pages to map; and most OSes pre-fill pages with zeros for security reasons (so that a page does not still contain your bank account details coming from a recently closed browser tab). Note that there are (transparent) solutions to reduce the number of pages (using huge pages), but they are costly in this case too because of the pre-fill.
The best thing to do to solve this problem is to minimize the number of temporary buffers. This can be done thanks to the out argument of many Numpy functions (e.g. subtract). You can also compute the arrays in-place for better performance. This solution also reduces the memory footprint so that the memory is not swapped out to your slow storage device (swap memory). An alternative solution is to use a parallel implementation of Numpy or to write parallel Numba/Cython code (Numba is probably the best option here). Another solution is to use Numexpr.
Note that using smaller data types also helps to improve the performance of the code (as the raw buffer in memory will be smaller and so faster to read/write/pre-fill). Using float32 is faster than float64, although it may not fit your needs. The same applies to integers (e.g. int8 vs int32 vs int64 for and_list and minus1_power_x_t).
Here is an example:
# [...]
# The dtype parameter is important to reduce the size in memory
minus1_power_x_t = np.power(-1,and_list, dtype=np.int8)
# Pre-fill a buffer in memory with 0.0 (can be reused multiple times)
buffer = np.full(minus1_power_x_t.shape, 0.0)
# Note that minus1_power_x is overwritten once um_minus_minus1_power has been computed.
# If you need both, you can use 2 pre-filled buffers (only useful if they are reused multiple times, like in a loop)
minus1_power_x = np.multiply(minus1_power_x_t, x, out=buffer)
um_minus_minus1_power = np.subtract(1.0, minus1_power_x, out=buffer)
With this method, the multiply is about 2.5 times faster on my (Intel Xeon) machine and the subtract is about 4 times faster.
Numexpr can be used to fuse the multiply and the subtract. It also supports user-defined output buffers. Moreover, it can parallelize the computation. Here is an example:
um_minus_minus1_power = numexpr.evaluate('1.0 - minus1_power_x_t * x', out=buffer)
The Numexpr code is about 12.3 times faster on my machine. Note that using float32 arrays instead of float64 ones should be about 2 times faster.
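To make the buffer-reuse idea easy to try in isolation, here is a small self-contained sketch. The helper name one_minus_signed_x and the reduced dim are assumptions for illustration, and the signs are built as 1 - 2*bit rather than with np.power (this yields the same -1/1 values):

```python
import numpy as np

def one_minus_signed_x(signs, x, out=None):
    """Compute 1 - signs * x, reusing `out` for both steps (hypothetical helper)."""
    if out is None:
        out = np.empty(signs.shape, dtype=x.dtype)
    np.multiply(signs, x, out=out)   # out = signs * x, no temporary array
    np.subtract(1, out, out=out)     # out = 1 - out, computed in place
    return out

dim = 10  # kept small here so the check runs instantly
# bit j of row index i, equivalent to (i >> j) & 1 in the question
bits = (np.arange(2**dim, dtype=np.uint32)[:, None]
        >> np.arange(dim, dtype=np.uint32)) & 1
signs = 1 - 2 * bits.astype(np.int8)      # (-1)**bit stored as int8
x = np.arange(dim, dtype=np.float32)      # float32 halves the memory traffic
buffer = np.empty(signs.shape, dtype=np.float32)  # allocate once, reuse
result = one_minus_signed_x(signs, x, out=buffer)
```

The same buffer can be passed again on later calls, so the allocation and page-fault cost is paid only once.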

Related

How to load huge time series windows dataset without memory errors?

I want to convert a typical time series dataset of about 1 million lines into 100-item windows with 50% overlap. Note that it's a multivariate one, so for example given 8 features and 1000 windows with 100 items the final shape would be (1000, 100, 8), corresponding to (n_samples, n_timesteps, n_features). The goal is to use it for training machine learning algorithms including deep neural networks.
So far, I've enjoyed using numpy's sliding_window_view as shown below:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
x = np.arange(100).reshape(20, 5)
v = sliding_window_view(x, (3, 5))
v
Unfortunately, I get crashes as I run out of RAM on large datasets with millions of lines. Do you have any suggestions?
Additionally, one serious restriction is that every timestep carries a consecutive (integer) label by which the dataset needs to be grouped (using pandas), so this limits some options for reading it in portions.

I think you are looking for tf.data.Dataset. I'm working on a million rows dataset, and the following code runs well for me:
import tensorflow as tf
convert = tf.data.TextLineDataset("path_to_file.txt")
dataset = tf.data.Dataset.zip(convert)
Now you have initialized your dataset, but to avoid running into memory issues, batch and prefetch it:
def dataset_batches(ds, batch_size):
    return (
        ds
        .cache()
        .batch(batch_size)
        .prefetch(tf.data.AUTOTUNE)
    )
# you can do more operations here
train_batches = dataset_batches(dataset, 64)
And to run it, you'll have to loop:
for (batch, row) in enumerate(train_batches):
    # do stuff
    # batch = current batch index (0, 1, 2, ...), so if your dataset has 1600 rows and you've used batch_size=16 you'll have 100 batches
    # row is the actual data (tensor)
    pass
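As a NumPy-only alternative for the windowing step itself, the sketch below builds the 50%-overlap windows as a strided view, so no window data is copied until a batch is actually materialized. The function name overlapping_windows and the synthetic data are assumptions for illustration, and the sketch does not address the per-label grouping restriction from the question:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def overlapping_windows(data, window=100, step=50):
    """Return a read-only view of shape (n_windows, window, n_features)."""
    # sliding_window_view appends the window axis last:
    # shape (n_rows - window + 1, n_features, window)
    views = sliding_window_view(data, window_shape=window, axis=0)
    # keep every `step`-th window (50% overlap when step == window // 2),
    # then move the window axis in front of the feature axis
    return views[::step].transpose(0, 2, 1)

# synthetic stand-in for the real dataset: 1000 rows, 8 features
data = np.arange(1000 * 8, dtype=np.float32).reshape(1000, 8)
windows = overlapping_windows(data)

# copy only one batch at a time into contiguous memory when feeding a model
first_batch = np.ascontiguousarray(windows[:4])
```

Because windows is only a view, its memory cost is that of the original array; only the batches that are explicitly copied take extra RAM.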

Why is the execution time for numpy faster than cupy?

I am playing around with the differences between numpy and cupy and have noticed that, of these two similar programs I have created, the cupy version is much slower despite the fact that it runs on a GPU.
Here is the numpy version:
import time
import numpy as np
size = 5000
upperBound = 20
dataSet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
dataLength = np.random.randint(0, high=upperBound, size=size, dtype='l')
randomNumber = np.random.randint(0, high=62, size=size * upperBound, dtype='l')
count = 0
dataCount = 0
start_time = time.time()
for i in range(size):
    lineData = ""
    for j in range(dataLength[i]):
        lineData = lineData + dataSet[randomNumber[count]]
        count = count + 1
    print(lineData)
    dataCount = dataCount + 1
time = str(time.time() - start_time)
print("------------------------\n" + "It took this many seconds: " + time)
print("There were " + str(dataCount) + " many data generations.")
Here is the cupy version:
import time
import cupy as cp
size = 5000
upperBound = 20
dataSet = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
dataLength = cp.random.randint(0, high=upperBound, size=size, dtype='l')
randomNumber = cp.random.randint(0, high=62, size=upperBound * size, dtype='l')
count = 0
dataCount = 0
start_time = time.time()
for i in range(size):
    lineData = ""
    for j in range(int(dataLength[i])):
        lineData = lineData + str(dataSet[int(randomNumber[count])])
        count = count + 1
    print(lineData)
    dataCount = dataCount + 1
time = str(time.time() - start_time)
print("-------------------\n" + "It took this many seconds: " + time)
print("There were " + str(dataCount) + " many data generations.")
They are essentially the same code except for the fact that one is using numpy and the other is using cupy. I was expecting cupy to execute faster due to the GPU usage, but that was not the case. The run time for numpy was 0.032 seconds, while the run time for cupy was 0.484 seconds.

This is a pitfall that catches many people new to GPUs. It is very common for a naive GPU version of a program to be slower than the CPU version. Making code go fast with a GPU is not trivial, mostly because of the extra latencies for copying data to and from the GPU. Whatever speedup you get from using the GPU has to overcome this overhead first. You are not doing nearly enough work on the GPU to make the overhead worth it. You're spending far more time in that cp.random.randint() call waiting for data to move than you are actually calculating anything. Do more work on the GPU and you will see the GPU take charge, like maybe a reduction operation on a large data set.
Numpy is much faster than you might expect because it is written in well-optimized C under the covers. It is not pure Python. So the benchmark you're trying to beat is actually quite fast.
If you really want to explore the depths of GPU performance tuning, try writing some CUDA and use the NVIDIA Visual Profiler to check out what the GPU is actually doing. Supposedly cupy has hooks for this but I've never used it: https://docs-cupy.chainer.org/en/stable/reference/cuda.html#profiler

I cannot see a user-defined kernel in this code, so it is not using the GPU for any weighty matrix calculation. So the delay moving data to/from GPU and type conversions probably predominate.
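To illustrate the point about transfer overhead, here is a hedged sketch of the same string generation written so that the arrays are touched in bulk instead of one element at a time. It runs on plain NumPy arrays; the function name build_lines is an assumption for illustration, and with CuPy inputs the idea would be to copy each array to the host once (for example with cp.asnumpy) before the loop:

```python
import numpy as np

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def build_lines(lengths, indices, alphabet=ALPHABET):
    """Build one string per entry of `lengths` from host-side arrays."""
    # With CuPy arrays, do a single device-to-host copy up front, e.g.
    #   lengths = cp.asnumpy(lengths); indices = cp.asnumpy(indices)
    # instead of calling int(array[i]) inside the loop.
    chars = np.array(list(alphabet))[indices]   # one vectorized lookup
    ends = np.cumsum(lengths)                   # end offset of each line
    starts = ends - lengths                     # start offset of each line
    return ["".join(chars[s:e])
            for s, e in zip(starts.tolist(), ends.tolist())]

rng = np.random.default_rng(0)
size, upper_bound = 5000, 20
lengths = rng.integers(0, upper_bound, size=size)
indices = rng.integers(0, len(ALPHABET), size=size * upper_bound)
lines = build_lines(lengths, indices)
```

String building itself stays on the CPU in this sketch, since it is not work a GPU is suited for.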

GPU Array multiplications using Pycuda on Numpy arrays

I have tried to implement element-wise multiplication of two numpy arrays by making similar GPU arrays and performing the operations on them. However, the resulting execution time is much slower than the original numpy pointwise multiplication. I was hoping to get a good speedup using the GPU. zz0 is a complex128 numpy array of shape (64,256,16) and xx0 is a float64 numpy array of shape (16,151). Can someone please help me figure out what I am doing wrong in the implementation:
import sys
import numpy as np
import matplotlib.pyplot as plt
import pdb
import time
import pycuda.driver as drv
import pycuda.autoinit
from pycuda.compiler import SourceModule
from pycuda.elementwise import ElementwiseKernel
import pycuda.gpuarray as gpuarray
import pycuda.cumath
import skcuda.linalg as linalg
linalg.init()
# Function for doing a point-wise multiplication using GPU
def calc_Hyp(zz,xx):
    zz_stretch = np.tile(zz, (1,1,1,xx.shape[3]))
    xx_stretch = np.tile(xx, (zz.shape[0],zz.shape[1],1,1))
    zzg = gpuarray.to_gpu(zz_stretch)
    xxg = gpuarray.to_gpu(xx_stretch)
    zz_Hypg = linalg.multiply(zzg,xxg)
    zz_Hyp = zz_Hypg.get()
    return zz_Hyp
zz0 = np.random.uniform(10.0/5000, 20000.0/5000, (64,256,16)).astype('complex128')
xx0 = np.random.uniform(10.0/5000, 20000.0/5000, (16,151)).astype('float64')
xx0_exp = np.exp(-1j*xx0)
t1 = time.time()
#Using GPU for the calculation
zz0_Hyp = calc_Hyp(zz0[:,:,:,None],xx0_exp[None,None,:,:])
#np.save('zz0_Hyp',zz0_Hyp)
t2 = time.time()
print('Time taken with GPU:{}'.format(t2-t1))
#Original calculation
zz0_Hyp_actual = zz0[:,:,:,None]*xx0_exp[None,None,:,:]
#np.save('zz0_Hyp_actual',zz0_Hyp_actual)
t3 = time.time()
print('Time taken without GPU:{}'.format(t3-t2))

The first issue is that your timing metrics are not accurate.
Linalg compiles CUDA modules on the fly, so the first run includes the time spent compiling the code. I made some slight modifications to your code to reduce the size of the arrays being multiplied, but regardless, after two runs with no other improvements I saw massive gains in performance, e.g.:
Time taken with GPU:2.5476348400115967
Time taken without GPU:0.16627931594848633
vs
Time taken with GPU:0.8741757869720459
Time taken without GPU:0.15836167335510254
However that is still much slower than the CPU version. The next thing I did was give a more accurate timing based upon where the actual computation is happening. You aren't tiling in your numpy version, so don't time it in your cuda version:
REAL Time taken with GPU:0.6461708545684814
You also copy to the GPU and include that in the timing, but the copy itself takes a non-trivial amount of time, so let's remove that:
t1 = time.time()
zz_Hypg = linalg.multiply(zzg,xxg)
t2 = time.time()
...
REAL Time taken with GPU:0.3689603805541992
Wow, that contributed a lot. But we are still slower than the numpy version. Why?
Remember when I said that numpy doesn't tile? It doesn't copy memory at all for broadcasting. To get the real speed, you would have to:
not tile
broadcast dimensions
implement this in a kernel.
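The difference between tiling and broadcasting can be checked directly in NumPy. The sketch below uses np.broadcast_to to expose the strides that broadcasting works with; the array shapes are reduced from the question's (an assumption, to keep the example light):

```python
import numpy as np

# reduced shapes: the question uses (64, 256, 16) and (16, 151)
zz = np.ones((8, 32, 16), dtype=np.complex128)
xx = np.ones((16, 151), dtype=np.complex128)
full_shape = (8, 32, 16, 151)

# Broadcast views: no data is copied, the repeated axes simply get stride 0.
zz_view = np.broadcast_to(zz[:, :, :, None], full_shape)
xx_view = np.broadcast_to(xx[None, None, :, :], full_shape)

# Tiling materializes every repetition in memory.
xx_tiled = np.tile(xx[None, None, :, :], (8, 32, 1, 1))

print(zz_view.strides, xx_view.strides)   # zero strides on broadcast axes
print(xx.nbytes, xx_tiled.nbytes)         # tiled copy is 8 * 32 times larger

# The broadcast product matches the plain NumPy expression from the question.
product = zz[:, :, :, None] * xx[None, None, :, :]
```

A GPU kernel that mimics this only needs the two small input arrays on the device and computes the input indices from the output index, which is what the kernel sketch in this answer attempts.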
Pycuda provides the utilities for kernel implementation, but its GPU array does not provide broadcasting. Essentially what you would have to do is this (DISCLAIMER: I haven't tested this, there are probably bugs, this is just to demonstrate approximately what the kernel should look like):
#include <pycuda-complex.hpp>
//KERNEL CODE
constexpr unsigned work_tile_dim = 32;
//instruction level parallelism factor, how much extra work to do per thread, may be changed but affects the launch dimensions. thread group size should be (work_tile_dim, work_tile_dim/ilp_factor)
constexpr unsigned ilp_factor = 4;
//assuming c order:
// x axis contiguous out,
// y axis contiguous in zz,
// x axis contiguous in xx
// using restrict because we know that all pointers will refer to different parts of memory.
__global__
void element_wise_multiplication(
    pycuda::complex<double>* __restrict__ array_zz,
    pycuda::complex<double>* __restrict__ array_xx,
    pycuda::complex<double>* __restrict__ out_array,
    unsigned array_zz_w, /*size of w,z,y, dimensions used in zz*/
    unsigned array_zz_z,
    unsigned array_zz_xx_y,/*size of y,x, dimensions used in xx, but both have same y*/
    unsigned array_xx_x){
    // z dimensions in blocks often have restrictions on size that can be fairly small, and sometimes can cause performance issues on older cards, we are going to derive x,y,z,w index from just the x and y indices instead.
    unsigned x_idx = blockIdx.x * (work_tile_dim) + threadIdx.x;
    unsigned y_idx = blockIdx.y * (work_tile_dim) + threadIdx.y;
    //blockIdx.z stores both z and w and should not overshoot, and these aren't used
    //shown for the sake of how to get these dimensions.
    unsigned z_idx = blockIdx.z % array_zz_z;
    unsigned w_idx = blockIdx.z / array_zz_z;
    //we already know this part of the indexing calculation.
    unsigned out_idx_zw = blockIdx.z * (array_zz_xx_y * array_xx_x);
    // since our input array is actually 3D, this is a different calculation
    unsigned array_zz_zw = blockIdx.z * (array_zz_xx_y);
    //ensures if our launch dimensions don't exactly match our input size, we don't
    //accidentally access out of bound memory, while branching can be bad, this isn't
    // because 99.999% of the time no branch will occur and our instruction pointer
    //will be the same per warp, meaning virtually zero cost.
    if(x_idx < array_xx_x){
        //moving over y axis to coalesce memory accesses in the x dimension per warp.
        for(unsigned i = 0; i < ilp_factor; ++i){
            //each thread handles ilp_factor rows, spaced by the thread block height
            unsigned y = y_idx + i * (work_tile_dim / ilp_factor);
            //need to also check y, these checks are virtually cost-less
            // because memory access dominates time in such simple calculations,
            // and arithmetic will be hidden by overlapping execution
            if(y < array_zz_xx_y){
                //splitting up calculation for simplicity's sake
                unsigned out_array_idx = out_idx_zw + y * array_xx_x + x_idx;
                unsigned array_zz_idx = array_zz_zw + y;
                unsigned array_xx_idx = (y * array_xx_x) + x_idx;
                //actual final output.
                out_array[out_array_idx] = array_zz[array_zz_idx] * array_xx[array_xx_idx];
            }
        }
    }
}
You will have to make the launch dimensions something like:
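work_tile_dim, ilp_factor = 32, 4  # must match the constants in the kernel
thread_dim = (work_tile_dim, work_tile_dim // ilp_factor, 1) # (32, 8, 1)
y_dim = xx0.shape[0]
x_dim = xx0.shape[1]
wz_dim = zz0.shape[0] * zz0.shape[1]
# round up so that partial tiles at the edges are still covered
block_dim = (-(-x_dim // work_tile_dim), -(-y_dim // work_tile_dim), wz_dim)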
And there are several further optimizations you may be able to take advantage of:
store the global memory accesses of a work tile in shared memory inside the kernel. This ensures that accesses to zz0's "y" dimension (really its x dimension) are coalesced when loaded into shared memory, which increases performance; the values are then read from shared memory (where coalescing doesn't matter, but bank conflicts do). See here on how to deal with that kind of bank conflict.
instead of calculating Euler's formula on the host and expanding a double into a complex double, expand it inside the kernel itself: use sincos(-x, &out_sin, &out_cos) to achieve the same result while using far less memory bandwidth (see here).
But note, even doing this will likely not give you the performance you want (though it will still likely be faster) unless you are on a higher-end GPU with full double precision units, which most GPUs don't have (most of the time it is emulated). Double precision floating point units take up a lot of space, and since GPUs are used for graphics, they don't have much use for double precision. If you want higher precision than single-precision floating point, but want to take advantage of the floating point hardware without the 1/8 to 1/32 throughput hit of double, you can use the techniques used in this answer to achieve this on the GPU, getting you closer to 1/2 to 1/3 throughput.

How to calculate a very large correlation matrix

I have an np.array of observations z where z.shape is (100000, 60). I want to efficiently calculate the 100000x100000 correlation matrix and then write to disk the coordinates and values of just those elements > 0.95 (this is a very small fraction of the total).
My brute-force version of this looks like the following but is, not surprisingly, very slow:
for i1 in range(z.shape[0]):
for i2 in range(i1+1):
r = np.corrcoef(z[i1,:],z[i2,:])[0,1]
if r > 0.95:
file.write("%6d %6d %.3f\n" % (i1,i2,r))
I realize that the correlation matrix itself could be calculated much more efficiently in one operation using np.corrcoef(z), but the memory requirement is then huge. I'm also aware that one could break up the data set into blocks and calculate bite-size subportions of the correlation matrix at one time, but programming that and keeping track of the indices seems unnecessarily complicated.
Is there another way (e.g., using memmap or pytables) that is both simple to code and doesn't put excessive demands on physical memory?

After experimenting with the memmap solution proposed by others, I found that while it was faster than my original approach (which took about 4 days on my Macbook), it still took a very long time (at least a day), presumably due to inefficient element-by-element writes to the output file. That wasn't acceptable given my need to run the calculation numerous times.
In the end, the best solution (for me) was to sign in to Amazon Web Services EC2 portal, create a virtual machine instance (starting with an Anaconda Python-equipped image) with 120+ GiB of RAM, upload the input data file, and do the calculation (using the matrix multiplication method) entirely in core memory. It completed in about two minutes!
For reference, the code I used was basically this:
import numpy as np
import pickle
import h5py
# read nparray, dimensions (102000, 60)
infile = open(r'file.dat', 'rb')
x = pickle.load(infile)
infile.close()
# z-normalize the data -- first compute means and standard deviations
xave = np.average(x,axis=1)
xstd = np.std(x,axis=1)
# transpose for the sake of broadcasting (doesn't seem to work otherwise!)
ztrans = x.T - xave
ztrans /= xstd
# transpose back
z = ztrans.T
# compute correlation matrix - shape = (102000, 102000)
arr = np.matmul(z, z.T)
# divide by the number of observations per row (60), not by the number of rows
arr /= z.shape[1]
# output to HDF5 file
with h5py.File('correlation_matrix.h5', 'w') as hf:
    hf.create_dataset("correlation", data=arr)

From my rough calculations, you want a correlation matrix that has 100,000^2 elements. That takes up around 40 GB of memory, assuming 4-byte (float32) values; with numpy's default float64 it would be about 80 GB.
That probably won't fit in computer memory, otherwise you could just use corrcoef.
There's a fancy approach based on eigenvectors that I can't find right now, and that gets into the (necessarily) complicated category...
Instead, rely on the fact that for zero mean data the covariance can be found using a dot product.
z0 = z - np.mean(z, 1)[:, None]
cov = np.dot(z0, z0.T)
cov /= z.shape[-1]
And this can be turned into the correlation by normalizing by the variances
sigma = np.std(z, 1)
corr = cov
corr /= sigma
corr /= sigma[:, None]
Of course memory usage is still an issue.
You can work around this with memory mapped arrays (make sure it's opened for reading and writing) and the out parameter of dot (For another example see Optimizing my large data code with little RAM)
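N = z.shape[0]
arr = np.memmap('corr_memmap.dat', dtype='float32', mode='w+', shape=(N,N))
# np.dot requires out to have the dtype of the result, so cast the input to float32 first
z0_32 = z0.astype(np.float32)
np.dot(z0_32, z0_32.T, out=arr)
arr /= z.shape[-1]
arr /= sigma
arr /= sigma[:, None]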
Then you can loop through the resulting array and find the indices with a large correlation coefficient. (You may be able to find them directly with where(arr > 0.95), but the comparison will create a very large boolean array which may or may not fit in memory).
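One way to do that loop without ever building the full boolean array is to scan the matrix a band of rows at a time. The sketch below is only an illustration under that assumption; the function name iter_large_entries and the chunk size are made up, and it works the same on a memmap or an in-memory array:

```python
import numpy as np

def iter_large_entries(arr, threshold=0.95, chunk_rows=1024):
    """Yield (i, j, value) for entries below the diagonal that exceed threshold."""
    n = arr.shape[0]
    for start in range(0, n, chunk_rows):
        # only this band of rows is loaded and compared at once
        block = np.asarray(arr[start:start + chunk_rows])
        rows, cols = np.nonzero(block > threshold)
        rows = rows + start           # back to global row indices
        keep = cols < rows            # skip the diagonal and the mirrored half
        for i, j in zip(rows[keep], cols[keep]):
            yield int(i), int(j), float(arr[i, j])

# small synthetic check: 5 nearly identical rows among 50
rng = np.random.default_rng(0)
base = rng.normal(size=(1, 60))
z_demo = np.vstack([base + 0.05 * rng.normal(size=(5, 60)),
                    rng.normal(size=(45, 60))])
corr_demo = np.corrcoef(z_demo)
hits = list(iter_large_entries(corr_demo, 0.95, chunk_rows=7))
```

The temporary boolean array then has only chunk_rows * N entries, so chunk_rows can be tuned to the available memory.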

You can use scipy.spatial.distance.pdist with metric='correlation' to get all the correlations without the symmetric terms. Unfortunately this will still leave you with about 5e9 terms that will probably overflow your memory.
You could try reformulating a KDTree (which can theoretically handle cosine distance, and therefore correlation distance) to filter for higher correlations, but with 60 dimensions it's unlikely that would give you much speedup. The curse of dimensionality sucks.
Your best bet is probably brute-forcing blocks of data using scipy.spatial.distance.cdist(..., metric='correlation'), and then keeping only the high correlations in each block (the correlation distance is 1 - r, so r > 0.95 corresponds to a distance below 0.05). Once you know how big a block your memory can handle without slowing down due to your computer's memory architecture, it should be much faster than doing one pair at a time.
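A minimal sketch of the block approach follows. Note that it swaps cdist for a plain matrix product on rows normalized to zero mean and unit length, which gives the Pearson correlation directly; the function name high_correlations and the block size are assumptions, and rows with zero variance are not handled:

```python
import numpy as np

def high_correlations(z, threshold=0.95, block=1000):
    """Yield (i, j, r) with j < i and Pearson r > threshold, block by block."""
    zn = np.asarray(z, dtype=np.float64)
    zn = zn - zn.mean(axis=1, keepdims=True)
    # unit-length rows, so that r(i, j) is just the dot product zn[i] . zn[j]
    zn /= np.linalg.norm(zn, axis=1, keepdims=True)
    n = zn.shape[0]
    for i0 in range(0, n, block):
        for j0 in range(0, i0 + 1, block):   # lower triangle of blocks only
            r = zn[i0:i0 + block] @ zn[j0:j0 + block].T
            ii, jj = np.nonzero(r > threshold)
            keep = (jj + j0) < (ii + i0)     # drop diagonal and duplicates
            for a, b in zip(ii[keep], jj[keep]):
                yield int(a + i0), int(b + j0), float(r[a, b])

# small synthetic check: 4 nearly identical rows hidden among 40
rng = np.random.default_rng(1)
base = rng.normal(size=(1, 60))
z_small = np.vstack([rng.normal(size=(20, 60)),
                     base + 0.05 * rng.normal(size=(4, 60)),
                     rng.normal(size=(16, 60))])
pairs = list(high_correlations(z_small, 0.95, block=7))
```

Each yielded triple can be written straight to the output file, so only one block of the correlation matrix is ever held in memory.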

Please check out the deepgraph package.
https://deepgraph.readthedocs.io/en/latest/tutorials/pairwise_correlations.html
I tried it on z.shape = (2500, 60) with pearsonr for 2500 * 2500 pairs. It was extremely fast.
I'm not sure about 100000 x 100000, but it is worth trying.

Why is TensorFlow's tf.data.Dataset.shuffle so slow?

The shuffle step in the following code is very slow for a moderate buffer_size (say 1000):
filenames = tf.constant(filenames)
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
dataset = dataset.shuffle(buffer_size)
If we use numpy to shuffle the data, the code looks as follows:
idx = np.arange(len(filenames))
np.random.shuffle(idx)
new_filenames = [filenames[i] for i in idx]
next_batch_filenames = new_filenames[:batch_size]
# get the corresponding files in batch
This is much faster. I wonder if TF does something beyond simply shuffling the data.

As Anton Codes wrote, your first snippet shuffles batches of whatever _parse_function parses from your files (probably feature data), while your second snippet only shuffles filenames.
If shuffling on file level is sufficient, you can actually achieve (roughly) the same performance via the tf.data.Dataset API:
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames)) # shuffle file names
dataset = dataset.map(_parse_function)
dataset = dataset.batch(batch_size)
This practice of shuffling "pointers" to your training samples instead of the samples themselves can often improve performance.
NumPy might still be a little bit more efficient though, due to the overhead of shuffling inside the computational graph (which tf.data.Dataset.shuffle does, there is actually a C++ kernel specifically for this operation).
The advantage of the tf.data.Dataset approach is that it can automatically reshuffle the Dataset after each epoch.

The comparison is of two quite different operations.
Your tf.data pipeline, which starts with dataset = tf.data.Dataset.from_tensor_slices((filenames, labels)), reads the files from disk, i.e. from physical long-term storage, possibly a spinning magnetic hard drive. This is slow. If you are able to store all of this in RAM instead, or on an ultra-fast RAID-style flash drive, then you'll address your largest bottleneck.
You also have a _parse_function that is fired off for each data point, every time there is a data read. The computation of that parse will take time and depending on what is in there it could be significant.
The comparison to numpy isn't really fair, in that your numpy example doesn't involve reading from disk or parsing data.
That should be the bulk of the difference. If you've addressed the above, the next place to look for more speedup is with these lines
3) dataset = dataset.map(_parse_function)
4) dataset = dataset.batch(batch_size)
5) dataset = dataset.shuffle(buffer_size)
These are your code lines. Line 4 makes batches of data, possibly 32 (batch_size for sure). Then line 5 kicks in and tries to shuffle your batches of 32 in a buffer of length 1000. That happens every time the training loop requests a new training batch. The shuffle step shuffles all those big batches, picks out the first one and adds a new one ... every ... single ... time.
We can reverse the order of batch and shuffle like so
3) dataset = dataset.map(_parse_function)
4) dataset = dataset.shuffle(buffer_size)
5) dataset = dataset.batch(batch_size)
This is better anyway, because before, the contents of the batches were always the same and only their order was mixed. This way the contents of the batches will be randomized as well. Next, the shuffle only has to shuffle 1000 items, not 32x1000 items. Last, we can question whether we really need a buffer size of 1000. Let's say our data set is 2000 items. A buffer size of 320 and a batch size of 32 will certainly randomize our data well, effectively giving any item in the buffer a 10% chance of going into the next batch and a 90% chance of being pushed back to mix with other data. That's pretty good. A buffer size of 64 and a batch size of 64 seems almost useless, except that the items are pulled out of the buffer randomly one at a time, and so actually have a chance of not getting drawn and of mixing with later data. Just not so much.
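The interplay of buffer size and batch size can be explored with a simplified pure-Python model of a shuffle buffer. This is only a sketch of the documented behavior (fill a buffer, emit a random element, replace it with the next incoming one), not TensorFlow's implementation; the names buffered_shuffle and batched are made up for illustration:

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill phase
            continue
        k = rng.randrange(buffer_size)
        yield buffer[k]                  # emit a random buffered element...
        buffer[k] = item                 # ...and replace it with the new one
    rng.shuffle(buffer)                  # drain whatever is left at the end
    yield from buffer

def batched(iterable, batch_size):
    """Group consecutive items into lists of at most batch_size elements."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# shuffle first, then batch: 2000 items, buffer of 320, batches of 32
data = list(range(2000))
batches = list(batched(buffered_shuffle(data, buffer_size=320, seed=0), 32))
```

In this model an element can only appear in a batch once it has entered the buffer, which is why a buffer much smaller than the dataset gives only local mixing.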
