Slicing a 300MB CuPy array is ~5x slower than NumPy

My code involves slicing into 432x432x400 arrays a total of ~10 million times to generate batches of data for neural network training. As these are fairly large arrays (about 75 million data points/300MB), I was hoping to speed this up using CuPy (and maybe even speed training up by generating data on the same GPU as training), but found it actually made the code about 5x slower.
Is this expected behaviour due to CuPy overheads or am I missing something?
Code to reproduce:
import cupy as cp
import numpy as np
import timeit
cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)
np_arr = np.zeros((432, 432, 400), dtype=np.float32)
# numbers below are representative of my code
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120]'
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120]'
timeit.timeit(cp_code, number=8192*4, globals=globals()) # prints 0.122
timeit.timeit(np_code, number=8192*4, globals=globals()) # prints 0.027
GPU: NVIDIA Quadro P4000
CuPy Version: 7.3.0
OS: CentOS Linux 7
CUDA Version: 10.1
cuDNN Version: 7.6.5

I also confirmed that the slicing is about 5x slower in CuPy, although there's a more precise way to measure the time (see e.g.
The size of the array does not matter because slice operations do not copy the data but create views. The result with the following is similar.
cp_arr = cp.zeros((4, 4, 4), dtype=cp.float32)
cp_code = 'arr2 = cp_arr[1:3, 1:3, 1:3]'
It is natural that "take the slice, then send it to the GPU" is faster, because it reduces the number of bytes to be transferred. Consider doing so if slicing is the first preprocessing step.

Slicing in NumPy and CuPy does not actually copy the data anywhere. It simply returns a new array that shares the same data, with its pointer offset to the first element of the slice and an adjusted shape. Note below how both the original array and the slice have the same strides:
In [1]: import cupy as cp
In [2]: a = cp.zeros((432, 432, 400), dtype=cp.float32)
In [3]: b = a[100:120, 100:120, 100:120]
In [4]: a.strides
Out[4]: (691200, 1600, 4)
In [5]: b.strides
Out[5]: (691200, 1600, 4)
The same can be verified by replacing CuPy with NumPy.
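As a minimal NumPy sketch of the point above (the variable names are only illustrative), the slice can be checked to be a view: it shares memory with the original array, keeps the same strides, and its data pointer is offset by exactly the byte position of element [100, 100, 100].

```python
import numpy as np

a = np.zeros((432, 432, 400), dtype=np.float32)
b = a[100:120, 100:120, 100:120]

# A basic slice is a view: same strides, same underlying buffer.
same_strides = a.strides == b.strides
shares = np.shares_memory(a, b)
is_view = b.base is a

# The view's data pointer is shifted by the byte offset of a[100, 100, 100].
offset = b.__array_interface__["data"][0] - a.__array_interface__["data"][0]
expected_offset = sum(100 * s for s in a.strides)

# Writing through the view is visible in the original array.
b[0, 0, 0] = 1.0

print(same_strides, shares, is_view, offset == expected_offset, a[100, 100, 100])
```

Because no data is moved, the cost being timed in the question is only the Python-side creation of a new array object, not any memory traffic.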
If you want to time the actual slicing operation, the most reliable way of doing this would be to add a .copy() to each operation, thus enforcing the memory accessing/copying:
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120].copy()' # 0.771 seconds
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120].copy()' # 0.154 seconds
Unfortunately, for the case above the memory pattern is bad for GPUs as the small chunks won't be able to saturate memory channels, thus it's still slower than NumPy. However, CuPy can be much faster if the chunks are able to get close to memory channel saturation, for example:
cp_code = 'arr2 = cp_arr[:, 100:120, 100:120].copy()' # 0.786 seconds
np_code = 'arr2 = np_arr[:, 100:120, 100:120].copy()' # 2.911 seconds


Why does pytorch matmul get different results when executed on cpu and gpu?

I am trying to figure out the rounding difference between numpy/pytorch, gpu/cpu, float16/float32 numbers and what I'm finding confuses me.
The basic version is:
a = torch.rand(3, 4, dtype=torch.float32)
b = torch.rand(4, 5, dtype=torch.float32)
print(a.numpy()@b.numpy() - a@b)
The result is all zeros as expected, however
print((a.cuda()@b.cuda()).cpu() - a@b)
gets non-zero results. Why is Pytorch float32 matmul executed differently on gpu and cpu?
An even more confusing experiment involves float16, as follows:
a = torch.rand(3, 4, dtype=torch.float16)
b = torch.rand(4, 5, dtype=torch.float16)
print(a.numpy()@b.numpy() - a@b)
print((a.cuda()@b.cuda()).cpu() - a@b)
These two results are both non-zero. Why are float16 numbers handled differently by numpy and torch? I know the CPU can only do float32 operations and that numpy converts float16 to float32 before computing; however, the torch calculation is also executed on the CPU.
And guess what, print((a.cuda()@b.cuda()).cpu() - a.numpy()@b.numpy()) gets an all-zero result! This is pure fantasy for me...
The environment is as follows:
python: 3.8.5
torch: 1.7.0
numpy: 1.21.2
cuda: 11.1
gpu: GeForce RTX 3090
On the advice of some of the commenters, I added the following equality tests
(a.numpy()@b.numpy() - (a@b).numpy()).any()
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()
(a.numpy()@b.numpy() - (a@b).numpy()).any()
((a.cuda()@b.cuda()).cpu() - a@b).numpy().any()
((a.cuda()@b.cuda()).cpu().numpy() - a.numpy()@b.numpy()).any()
respectively directly following the above five print functions, and the results are:
And for the last one, I've tried several times and I think I can rule out luck.
The differences are mostly numerical, as mentioned by @talonmies. The CPU and GPU and their respective BLAS libraries are implemented differently and use different operations and orders of operation, hence the numerical difference.
One possible cause is sequential accumulation vs. reduction: e.g. (((a+b)+c)+d) will have different numerical properties compared with ((a+b)+(c+d)).
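As a small illustration of why the order of operations matters, here is a sketch in plain Python floats (a three-term simplification of the four-term example above, not what any BLAS library actually does): the same sum, grouped differently, rounds to two different values.

```python
# Floating-point addition is not associative: grouping changes the rounding.
a, b, c = 0.1, 0.2, 0.3

sequential = (a + b) + c   # left-to-right accumulation
regrouped = a + (b + c)    # a different association of the same sum

print(sequential == regrouped)      # False
print(abs(sequential - regrouped))  # on the order of 1e-16
```

A matmul sums many such products per output element, so two libraries that accumulate in a different order can legitimately disagree in the last bits.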
This question also talks about fused operations (multiply-add) which can cause numerical differences.
I did a little bit of testing, and find that the GPU's output in float16 mode can be matched if we promote the datatype to float32 before computation and demote it afterward. This can be caused by internal intermediate casting or the better numerical stability of fused operations (torch.backends.cudnn.enabled does not matter). This does not solve the case in float32 though.
import torch

def test(L, M, N):
    # test (L*M) @ (M*N)
    for _ in range(5000):
        a = torch.rand(L, M, dtype=torch.float16)
        b = torch.rand(M, N, dtype=torch.float16)
        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()
        if (cpu_result-gpu_result).any():
            print(f'({L}x{M}) @ ({M}x{N}) failed')
            return  # stop at the first mismatch
    print(f'({L}x{M}) @ ({M}x{N}) passed')

test(1, 1, 1)
test(1, 2, 1)
test(4, 1, 4)
test(4, 4, 4)

def test2():
    for _ in range(5000):
        a = torch.rand(1, 2, dtype=torch.float16)
        b = torch.rand(2, 1, dtype=torch.float16)
        cpu_result = a@b
        gpu_result = (a.cuda()@b.cuda()).cpu()
        # dot product computed entirely in float16
        half_result = a[0,0]*b[0,0] + a[0,1]*b[1,0]
        # dot product computed in float32, then rounded to float16 once
        convert_result = (a[0,0].float()*b[0,0].float() + a[0,1].float()*b[1,0].float()).half()
        if ((cpu_result-half_result).any()):
            print('CPU != half')
            return
        if (gpu_result-convert_result).any():
            print('GPU != convert')
            return
    print('All passed')

test2()
(1x1) @ (1x1) passed
(1x2) @ (2x1) failed
(4x1) @ (1x4) passed
(4x4) @ (4x4) failed
All passed
You can tell that when the inner dimension is 1, it passes the check (no multiply-add/reduction needed).
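To make the "promote to float32, compute, then demote" observation concrete, here is a NumPy-only sketch with hand-picked values (an illustration of accumulation precision, not a claim about what the CPU or GPU kernels do internally). Accumulating in float16 rounds after every addition, while accumulating in float32 and rounding once at the end keeps the small terms.

```python
import numpy as np

a = np.array([2048.0, 1.0, 1.0], dtype=np.float16)
b = np.array([1.0, 1.0, 1.0], dtype=np.float16)

# Accumulate in float16: every partial sum is rounded to half precision.
half_acc = np.float16(0.0)
for x, y in zip(a, b):
    half_acc = np.float16(half_acc + x * y)

# Promote to float32, accumulate, then demote the final result once.
single_acc = np.float32(0.0)
for x, y in zip(a, b):
    single_acc += np.float32(x) * np.float32(y)
promoted = np.float16(single_acc)

print(half_acc, promoted)
```

Here the float16 accumulation stays at 2048.0, because float16 values between 2048 and 4096 are spaced 2 apart and 2049 rounds back down, whereas the promoted version yields 2050.0.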

numpy.random.multinomial in version 1.16.6 is 10x faster than in later versions

Here is the code and the results:
python -c "import numpy as np; from timeit import timeit; print('numpy version {}: {:.1f} seconds'.format(np.__version__, timeit('np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4])', number=1000000, globals=globals())))"
numpy version 1.16.6: 1.5 seconds # 10x faster
numpy version 1.18.1: 15.5 seconds
numpy version 1.19.0: 17.4 seconds
numpy version 1.21.4: 15.1 seconds
Note that with a fixed random seed, the output is the same across the different numpy versions:
python -c "import numpy as np; np.random.seed(0); print(np.__version__); print(np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4], size=10000))" /tmp/tt
Any idea why numpy versions after 1.16.6 are 10x slower?
We have upgraded pandas to the latest version, 1.3.4, which needs a numpy version later than 1.16.6.
TL;DR: this is a local performance regression caused by the overhead of additional checks in the numpy.random.multinomial function. Very small arrays are strongly impacted due to the relative execution time of the required checks.
Under the hood
A binary search on the Git commits of the NumPy code shows that the performance regression first appears in mid-April 2019. It can be reproduced in commit dd77ce3cb but not in 7e8e19f9a. There are some build issues for the commits in between, but with some quick fixes we can show that commit 0f3dd0650 is the first to cause the issue. The commit message says that it:
Extend multinomial to allow broadcasting
Fix zipf changes missed in NumPy
Enable 0 as valid input for hypergeometric
A deeper analysis of this commit shows that it modifies the multinomial function defined in the Cython file mtrand.pyx to perform the following two additional checks:
def multinomial(self, np.npy_intp n, object pvals, size=None):
    cdef np.npy_intp d, i, sz, offset
    cdef np.ndarray parr, mnarr
    cdef double *pix
    cdef int64_t *mnix
    cdef int64_t ni

    d = len(pvals)
    parr = <np.ndarray>np.PyArray_FROM_OTF(pvals, np.NPY_DOUBLE, np.NPY_ALIGNED)
    pix = <double*>np.PyArray_DATA(parr)
    check_array_constraint(parr, 'pvals', CONS_BOUNDED_0_1) # <==========[HERE]
    if kahan_sum(pix, d-1) > (1.0 + 1e-12):
        raise ValueError("sum(pvals[:-1]) > 1.0")

    if size is None:
        shape = (d,)
    else:
        try:
            shape = (operator.index(size), d)
        except:
            shape = tuple(size) + (d,)

    multin = np.zeros(shape, dtype=np.int64)
    mnarr = <np.ndarray>multin
    mnix = <int64_t*>np.PyArray_DATA(mnarr)
    sz = np.PyArray_SIZE(mnarr)
    ni = n
    check_constraint(ni, 'n', CONS_NON_NEGATIVE) # <==========[HERE]
    offset = 0
    with self.lock, nogil:
        for i in range(sz // d):
            random_multinomial(self._brng, ni, &mnix[offset], pix, d, self._binomial)
            offset += d

    return multin
These two checks are required for the code to be robust. However, they are currently pretty expensive considering their purpose.
Indeed, on my machine, the first check is responsible for ~75% of the overhead and the second for ~20%. The checks take only a few microseconds, but since your input is very small, the overhead is huge compared to the computation time.
One workaround to fix this issue is to write a specific Numba function for this since your input array is very small. On my machine, np.random.multinomial in a trivial Numba function results in good performance.
I checked some of the generators that are used under the hood and saw little change in the timings.
I guessed the difference may be due to some overhead, because you are sampling only a single value, and that seems to be a good hypothesis. When I increased the size of the generated random samples to 1000, the difference between 1.16.6 and 1.19.2 (my current NumPy version) diminished to ~20%.
python -c "import numpy as np; from timeit import timeit; print('numpy version {}: {:.1f} seconds'.format(np.__version__, timeit('np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4], size=1000)', number=10000, globals=globals())))"
numpy version 1.16.6: 1.1 seconds
numpy version 1.19.2: 1.3 seconds
Note that both versions have this overhead; it is just much larger in the newer version. In both versions it is much faster to sample 1000 values once than to sample 1 value 1000 times.
The code changed a lot between 1.16.6 and 1.17.0 (see for example this commit), so it's hard to analyse. Sorry that I can't help you more; I suggest opening an issue on NumPy's GitHub.
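The batching observation above can be sketched as follows (n_draws is just an illustrative size). A single call with size= pays the argument-validation cost once, whereas a Python loop pays it on every draw; both produce an array of the same shape.

```python
import numpy as np

pvals = [0.1, 0.2, 0.3, 0.4]
n_draws = 1000

# One call: argument validation runs once for all draws.
batched = np.random.multinomial(1, pvals, size=n_draws)

# Many calls: the same validation cost is paid on every iteration.
looped = np.array([np.random.multinomial(1, pvals) for _ in range(n_draws)])

print(batched.shape, looped.shape)
```

If the surrounding code can be restructured to request many samples at once and consume them row by row, the per-call overhead discussed above is largely amortised.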

Numpy speed efficiency using broadcasting, transpose and reshape in large size array

Is there a way to speed up the following line of code:
fast_idx = np.broadcast_to(np.arange(desired_channel)[:, None], (desired_channel, len_indices)).T.reshape(-1)
Thank you.
That line of code is simply equivalent to np.tile(np.arange(desired_channel), len_indices).
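A quick check of that equivalence, using small hypothetical sizes for desired_channel and len_indices so the result is easy to read:

```python
import numpy as np

desired_channel, len_indices = 4, 3  # hypothetical small sizes

# Expression from the question: broadcast, transpose, flatten.
original = np.broadcast_to(
    np.arange(desired_channel)[:, None], (desired_channel, len_indices)
).T.reshape(-1)

# Proposed equivalent: repeat the whole range len_indices times.
tiled = np.tile(np.arange(desired_channel), len_indices)

print(tiled)  # [0 1 2 3 0 1 2 3 0 1 2 3]
print(np.array_equal(original, tiled))  # True
```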
On my machine, the performance of np.tile, like that of many NumPy calls, is bounded by the operating system (page faults), the memory allocator and the memory throughput. There are two ways to overcome this limitation: avoid allocating/filling temporary buffers, and produce smaller arrays in memory by using shorter types such as np.uint8 or np.uint16, depending on your needs.
Since there is no out parameter for the np.tile function, Numba can be used to generate a fast alternative function. Here is an example:
import numpy as np
import numba as nb

@nb.njit('int32[::1](int32, int32, int32[::1])', parallel=True)
def generate(desired_channel, len_indices, out):
    # Fill each block of desired_channel entries with 0..desired_channel-1
    for i in nb.prange(len_indices):
        for j in range(desired_channel):
            out[i*desired_channel+j] = j
    return out

# Preallocate the output once so repeated calls do not pay for allocation
buffer = np.full(desired_channel * len_indices, 0, dtype=np.int32)
%timeit -n 200 generate(desired_channel, len_indices, buffer)
Here are the performance results:
Original code: 1.25 ms
np.tile: 1.24 ms
Numba: 0.20 ms
I am new to the JAX library. I compared your code with a JAX version using the following code on a Colab TPU:
import numpy as np
from jax import jit
import jax.numpy as jnp
import timeit

def ex_():
    return np.broadcast_to(np.arange(desired_channel)[:, None], (desired_channel, len_indices)).T.reshape(-1)

%timeit -n1000 -r10 ex_()

def exj_():
    return jnp.broadcast_to(jnp.arange(desired_channel)[:, None], (desired_channel, len_indices)).T.reshape(-1)

%timeit -n1000 -r10 exj_()
In one of my runs, the results were:
1000 loops, best of 10: 901 µs per loop
1000 loops, best of 10: 317 µs per loop
So in this case JAX could speed up your code about two to three times.

Fastest way to compute many 3x3 matrix-matrix multiplications

I need to compute the combination of many 3x3 rotation matrices.
Here is a comparison of applying functools.reduce on matmul with numpy and cupy:
import timeit
from functools import reduce
import numpy as np
import cupy as cp
from pyrr.matrix33 import create_from_axis_rotation
# generate random rotation matrices
axes = np.random.rand(10000, 3)
angles = np.pi * np.random.rand(10000)
rotations = [create_from_axis_rotation(*params) for params in zip(axes, angles)]
# then reduce with matmul
xp = np # numpy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")
xp = cp # cupy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")
On a good machine with a Titan GPU, this gives:
numpy: 1.63e+02ms
cupy: 8.78e+02ms
For some reason the GPU is much slower.
In any case, is there a way to calculate this significantly faster ?
I found a rather simple solution, that works for all chains of small linear transformations (and can be extended to affine transformations easily).
def reduce_loop(matrices):
    """ non-optimized reduce """
    mat = matrices[0]
    for _mat in matrices[1:]:
        mat = mat @ _mat
    return mat

def reduce_split(matrices):
    """ reduce by multiplying pairs of matrices recursively """
    if len(matrices) == 1:
        return matrices[0]
    neven = (len(matrices) // 2) * 2
    # batched matmul of all (even, odd) pairs; needs an (N, 3, 3) array
    reduced = matrices[:neven:2] @ matrices[1:neven:2]
    if len(matrices) > neven:  # len(matrices) is odd
        reduced[-1] = reduced[-1] @ matrices[-1]
    return reduce_split(reduced)

# strided slicing requires an array of shape (N, 3, 3), not a Python list
rotations = np.asarray(rotations)
time = timeit.timeit("reduce_loop(rotations)", number=10, globals=globals())
print(f"reduce_loop: {time * 1000:0.3}ms")
time = timeit.timeit("reduce_split(rotations)", number=10, globals=globals())
print(f"reduce_split: {time * 1000:0.3}ms")
reduce_loop: 2.14e+02ms
reduce_split: 24.5ms
I'm sure it's not optimal, but it uses numpy's (and probably cupy's) optimization.
reduce() was moved out of the Python built-ins into functools in Python 3 because it was considered inefficient and unpythonic. There is no CuPy equivalent, only the host version in the functools library.
Your CuPy code is spending most of its time fruitlessly copying data from host to device and back again, thousands of times, because reduce() runs only on the host, not on the GPU. You are straining your PCI bus, not the GPU.
Consider turning the list "rotations" into a CuPy matrix, and then use striding (not a Python list).
Use a CuPy reduction kernel to do the matmul.
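The "stack the list into one array and use striding" suggestion can be sketched in NumPy as below. This is only an illustration: chain_product is a hypothetical helper name, the inputs are random orthogonal matrices standing in for rotations, and the assumption that the same code carries over to CuPy by swapping the module has not been verified here.

```python
from functools import reduce
import numpy as np

def chain_product(stack):
    """Multiply an (N, 3, 3) stack in order using batched pairwise matmul."""
    stack = np.asarray(stack)
    while len(stack) > 1:
        n_even = (len(stack) // 2) * 2
        # One batched matmul handles every (even, odd) pair at once.
        paired = stack[0:n_even:2] @ stack[1:n_even:2]
        if len(stack) > n_even:
            # Odd count: fold the leftover matrix into the last pair.
            paired[-1] = paired[-1] @ stack[-1]
        stack = paired
    return stack[0]

rng = np.random.default_rng(0)
# Random orthogonal 3x3 matrices (via QR) as stand-ins for rotations.
stack = np.stack([np.linalg.qr(rng.normal(size=(3, 3)))[0] for _ in range(101)])

expected = reduce(np.matmul, list(stack))  # sequential reference result
result = chain_product(stack)

print(np.allclose(result, expected))
```

Each pass halves the number of matrices, so about log2(N) batched calls replace N - 1 individual matmul calls, which is where the per-call overhead is saved.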

How to use torch to speed up some common computations?

I am trying to do some common computations, like matrix multiplication, but without gradient computation. An example of my computation is:
import numpy as np
from scipy.special import logsumexp
var = 1e-8
a = np.random.randint(0,10,(128,20))
result = logsumexp(a, axis=1) / 2. + np.log(np.pi * var)
I want to use torch (gpu) to speed up the computation. Here is the code
import numpy as np
import torch
var = 1e-8
a = np.random.randint(0,10,(128,20))
a = torch.from_numpy(a).cuda()
result = torch.logsumexp(a, dim=1)/ 2. + np.log(np.pi*var)
But I have some questions:
Could the above code speed up the computation? I don't know if it works.
Do I need to convert all values into torch.tensor, like from var to torch.tensor(var).cuda() and from np.log(np.pi*var) to a torch.tensor?
Do I need to convert all tensors into gpu by myself, especially for some intermediate variable?
If the above code doesn't work, how can I speed up the computation with gpu?
You could use torch only to do the computations.
import torch
# optimization by passing device argument, tensor is created on gpu and hence move operation is saved
# convert to float to use with logsumexp
a = torch.randint(0,10, (128,20), device="cuda").float()
result = torch.logsumexp(a, dim=1)/ 2.
Answers to some of your questions:
Could the above code speed up the computation?
It depends. If you have many matrix multiplications, using the GPU can give a speedup.
Do I need to convert all values into torch.tensor, like from var to torch.tensor(var).cuda() and from np.log(np.pi*var) to a torch.tensor?
Do I need to convert all tensors into gpu by myself, especially for some intermediate variable?
Only leaf variables need to be converted; intermediate variables will be placed on the device on which the operations are done. For example, if a and b are on the GPU, then the result of the operation c=a+b will also be on the GPU.
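A minimal sketch of that device rule, assuming PyTorch is installed; it falls back to the CPU when no GPU is available. Python scalars such as the log term can be combined directly with a tensor without being wrapped in torch.tensor, and the result stays on the tensor's device.

```python
import math
import torch

# Use the GPU when present, otherwise run the same code on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

var = 1e-8
a = torch.randint(0, 10, (128, 20), device=device).float()

# The Python float is broadcast onto the tensor's device automatically.
result = torch.logsumexp(a, dim=1) / 2.0 + math.log(math.pi * var)

print(result.shape, result.device)
```

No gradient is tracked here because none of the inputs was created with requires_grad=True.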