numpy.random.multinomial at version 1.16.6 is 10x faster than later version - numpy

Here are codes and result:
python -c "import numpy as np; from timeit import timeit; print('numpy version {}: {:.1f} seconds'.format(np.__version__, timeit('np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4])', number=1000000, globals=globals())))"
numpy version 1.16.6: 1.5 seconds # 10x faster
numpy version 1.18.1: 15.5 seconds
numpy version 1.19.0: 17.4 seconds
numpy version 1.21.4: 15.1 seconds
It is noted that with fixed random seed, the output are the same with different numpy version
python -c "import numpy as np; np.random.seed(0); print(np.__version__); print(np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4], size=10000))" /tmp/tt
Any advice on why numpy version after 1.16.6 is 10x slower?
We have upgraded pandas to latest version 1.3.4, which needs numpy version after 1.16.6

TL;DR: this is a local performance regression caused by the overhead of additional checks in the numpy.random.multinomial function. Very small arrays are strongly impacted due to the relative execution time of the required checks.
Under the hood
A binary search on the Git commits of the Numpy code shows that the performance regression appear the first time in mid April 2019. It can be reproduced in the commit dd77ce3cb but not 7e8e19f9a. There are some build issues for the commit in-between, but with some quick fix we can show that the commit 0f3dd0650 is the first to cause the issue. The commit says that it:
Extend multinomial to allow broadcasting
Fix zipf changes missed in NumPy
Enable 0 as valid input for hypergeometric
A deeper analysis of this commit shows that it modifies the multinomial function defined in Cython file mtrand.pyx to perform two additional following checks:
def multinomial(self, np.npy_intp n, object pvals, size=None):
cdef np.npy_intp d, i, sz, offset
cdef np.ndarray parr, mnarr
cdef double *pix
cdef int64_t *mnix
cdef int64_t ni
d = len(pvals)
parr = <np.ndarray>np.PyArray_FROM_OTF(pvals, np.NPY_DOUBLE, np.NPY_ALIGNED)
pix = <double*>np.PyArray_DATA(parr)
check_array_constraint(parr, 'pvals', CONS_BOUNDED_0_1) # <==========[HERE]
if kahan_sum(pix, d-1) > (1.0 + 1e-12):
raise ValueError("sum(pvals[:-1]) > 1.0")
if size is None:
shape = (d,)
shape = (operator.index(size), d)
shape = tuple(size) + (d,)
multin = np.zeros(shape, dtype=np.int64)
mnarr = <np.ndarray>multin
mnix = <int64_t*>np.PyArray_DATA(mnarr)
sz = np.PyArray_SIZE(mnarr)
ni = n
check_constraint(ni, 'n', CONS_NON_NEGATIVE) # <==========[HERE]
offset = 0
with self.lock, nogil:
for i in range(sz // d):
random_multinomial(self._brng, ni, &mnix[offset], pix, d, self._binomial)
offset += d
return multin
These two checks are required for the code to be robust. However, they are currently pretty expensive considering their purpose.
Indeed, on my machine, the first check is responsible for ~75% of the overhead and the second for ~20%. The checks takes few micro-seconds but since your input is very small, the overhead is huge compared to the computation time.
One workaround to fix this issue is to write a specific Numba function for this since your input array is very small. On my machine, np.random.multinomial in a trivial Numba function results in good performance.

I checked some generators that are under the hood and saw no much change in the timings.
I guessed difference may be due to some overhead, because you are sampling only single value. And it seems to be good hypothesis. When I increased size of the generated random samples to 1000, difference between 1.16.6 and 1.19.2 (my current Numpy version) diminished to ~20%.
python -c "import numpy as np; from timeit import timeit; print('numpy version {}: {:.1f} seconds'.format(np.__version__, timeit('np.random.
multinomial(1, [0.1, 0.2, 0.3, 0.4], size=1000)', number=10000, globals=globals())))"
numpy version 1.16.6: 1.1 seconds
numpy version 1.19.2: 1.3 seconds
Note that both versions have this overhead, just newer version has it much larger. In both versions it is much faster to sample 1000 values once than sample 1 value 1000 times.
They changed by much the code between 1.16.6 and 1.17.0, see for example this commit, it's hard to analyse. Sorry that can't help you better - I propose to make an issue on Numpy's github.


Fastest way to compute many 3x3 matrix-matrix multiplications

I need to compute the combination of many 3x3 rotation matrices.
Here is a comparison of applying functools.reduce on matmul with numpy and cupy:
import timeit
from functools import reduce
import numpy as np
import cupy as cp
from pyrr.matrix33 import create_from_axis_rotation
# generate random rotation matrices
axes = np.random.rand(10000, 3)
angles = np.pi * np.random.rand(10000)
rotations = [create_from_axis_rotation(*params) for params in zip(axes, angles)]
# then reduce with matmul
xp = np # numpy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")
xp = cp # cupy
xp_rotations = [xp.asarray(rotation) for rotation in rotations]
timexp = timeit.timeit("reduce(xp.matmul, xp_rotations)", number=10, globals=globals())
print(f"{xp.__name__}: {timexp * 1000:0.3}ms")
On a good machine with a Titan GPU, this gives :
numpy: 1.63e+02ms
cupy: 8.78e+02ms
For some reason the GPU is much slower.
In any case, is there a way to calculate this significantly faster ?
I found a rather simple solution, that works for all chains of small linear transformations (and can be extended to affine transformations easily).
def reduce_loop(matrices):
""" non-optimized reduce """
mat = matrices[0]
for _mat in matrices[1:]:
mat = mat # _mat
return mat
def reduce_split(matrices):
""" reduce by multiplying pairs of matrices recursively """
if len(matrices) == 1:
return matrices[0]
neven = (len(matrices) // 2) * 2
reduced = matrices[:neven:2] # matrices[1:neven:2]
if len(matrices) > neven: # len(matrices) is odd
reduced[-1] = reduced[-1] # matrices[-1]
return reduce_split(reduced)
time = timeit.timeit("reduce_loop(rotations)", number=10, globals=globals())
print(f"reduce_loop: {time * 1000:0.3}ms")
time = timeit.timeit("reduce_split(rotations)", number=10, globals=globals())
print(f"reduce_split: {time * 1000:0.3}ms")
reduce_loop: 2.14e+02ms
reduce_split: 24.5ms
I'm sure it's not optimal, but it uses numpy's (and probably cupy's) optimization.
functools.reduce() was removed from core python because it is inefficient and not pythonic. There is no cuPy equivalent, only the host version in the functools library
your cuPy code is spending most of its time fruitlessly copying data from host to device and back again... thousands of times - because reduce() runs only on the host not on the GPU. You are straining your PCI bus, not the GPU
consider making the list “rotations” into a cuPy matrix, and then use striding (not a python list)
use a cuPy reduction kernel to do the matmul

Slicing a 300MB CuPy array is ~5x slower than NumPy

My code involves slicing into 432x432x400 arrays a total of ~10 million times to generate batches of data for neural network training. As these are fairly large arrays (92 million data points/300MB), I was hoping to speed this up using CuPy (and maybe even speed training up by generating data on the same GPU as training), but found it actually made the code about 5x slower.
Is this expected behaviour due to CuPy overheads or am I missing something?
Code to reproduce:
import cupy as cp
import numpy as np
import timeit
cp_arr = cp.zeros((432, 432, 400), dtype=cp.float32)
np_arr = np.zeros((432, 432, 400), dtype=np.float32)
# numbers below are representative of my code
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120]'
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120]'
timeit.timeit(cp_code, number=8192*4, globals=globals()) # prints 0.122
timeit.timeit(np_code, number=8192*4, globals=globals()) # prints 0.027
GPU: NVIDIA Quadro P4000
CuPy Version: 7.3.0
OS: CentOS Linux 7
CUDA Version: 10.1
cuDNN Version: 7.6.5
I also confirmed that the slicing is about 5x times slower in cupy, while there's a more precise way to measure the time (see e.g.
The size of the array does not matter because slice operations do not copy the data but create views. The result with the following is similar.
cp_arr = cp.zeros((4, 4, 4), dtype=cp.float32)
cp_code = 'arr2 = cp_arr[1:3, 1:3, 1:3]'
It is natural that "take slice then send it to GPU" is faster because it reduces the bytes to be transferred. Consider doing so if the first preprocess is the slicing.
Slicing in NumPy and CuPy is not actually copying the data anywhere, but simply returning a new array where the data is the same but with the its pointer being offset to the first element of the new slice and an adjusted shape. Note below how both the original array and the slice have the same strides:
In [1]: import cupy as cp
In [2]: a = cp.zeros((432, 432, 400), dtype=cp.float32)
In [3]: b = a[100:120, 100:120, 100:120]
In [4]: a.strides
Out[4]: (691200, 1600, 4)
In [5]: b.strides
Out[5]: (691200, 1600, 4)
The same above could be verified by replacing CuPy with NumPy.
If you want to time the actual slicing operation, the most reliable way of doing this would be to add a .copy() to each operation, thus enforcing the memory accessing/copying:
cp_code = 'arr2 = cp_arr[100:120, 100:120, 100:120].copy()' # 0.771 seconds
np_code = 'arr2 = np_arr[100:120, 100:120, 100:120].copy()' # 0.154 seconds
Unfortunately, for the case above the memory pattern is bad for GPUs as the small chunks won't be able to saturate memory channels, thus it's still slower than NumPy. However, CuPy can be much faster if the chunks are able to get close to memory channel saturation, for example:
cp_code = 'arr2 = cp_arr[:, 100:120, 100:120].copy()' # 0.786 seconds
np_code = 'arr2 = np_arr[:, 100:120, 100:120].copy()' # 2.911 seconds

predicting p of binomial with beta prior in edward2 & tensorflow2

The following code predicts the p of the binomial distribution by using beta as prior. Somehow, sometimes, I get meaningless results (acceptance rate = 0). When I write the same logic with pymc3, I have no issue.
I couldn't see what I am missing here.
import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
import edward2 as ed
from pymc3.stats import hpd
import numpy as np
import seaborn
import matplotlib.pyplot as plt
p_true = .15
N = [10, 100, 1000]
successN = np.random.binomial(p=p_true, n=N)
def beta_binomial(N):
p = ed.Beta(
concentration1=tf.ones( len(N) ),
concentration0=tf.ones( len(N) ),
return ed.Binomial(total_count=N, probs=p, name='obs')
log_joint = ed.make_log_joint_fn(beta_binomial)
def target_log_prob_fn(p):
return log_joint(N=N, p=p, obs=successN)
#kernel = tfp.mcmc.HamiltonianMonteCarlo(
# target_log_prob_fn=target_log_prob_fn,
# step_size=0.01,
# num_leapfrog_steps=5)
kernel = tfp.mcmc.NoUTurnSampler(
trace, kernel_results = tfp.mcmc.sample_chain(
tf.random.uniform(( len(N) ,))
trace_fn=(lambda current_state, kernel_results: kernel_results),
p, = trace
p = p.numpy()
print('acceptance rate ', np.mean(kernel_results.is_accepted))
def printSummary(name, v):
print(name, v.shape)
print(np.mean(v, axis=0))
printSummary('p', p)
for data in p.T:
seaborn.distplot(data, kde=False)
pip install -U pip
pip install -e git+
pip install
pip install tensorflow-probability
Sometimes I see the following (when acceptance rate=0):
And, sometimes I see the following (when acceptance rate>.9):
When I get unstable results in Bayesian inference (I use mc-stan, but it's also using NUTS), it's usually because either the priors and likelihood are mis-specified, or the hyperparameters are not good for the problem.
That first graph shows that the sampler never moved away from the initial guess at the answers (hence the 0 acceptance rate). It also worries me that the green distribution seems to be right on 0. The beta(1,1) has positive probability at 0 but a p=0 might be an unstable solution here? (as in, the sampler may not be able to calculate the derivative at that point and returns a NaN, so doesn't know where to sample next?? Complete guess there).
Can you force the initial condition to be 0 and see if that always creates a failed sampling?
Other than that, I would try tweaking the hyperparameters, such as step size, number of iterations, etc...
Also, you may want to simplify the example by only using one N. Might help you diagnose. Good luck!
random.uniform's maxval default value is None. I changed it to 1, the result became stable.
random.uniform(( len(N) ,), minval=0, maxval=1)

pandas "isin" is much slower than numpy "in1d"

There is a huge difference between pandas "isin" and numpy "in1d" from the efficiency aspect. After some research I've noticed that the type of the data and the values that passed as parameter to the "in" method has huge impact on the run time. Anyway it looks that numpy implementation suffer much less from this problem.
What's going on here?
import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(10**6),dtype='int8'),columns=['A'])
vals = np.array([5,7],dtype='int64')
f = lambda: df['A'].isin(vals)
g = lambda:['A'],vals)
print 'pandas:', timeit.timeit(stmt='f()',setup='from __main__ import f',number=10)/10
print 'numpy :', timeit.timeit(stmt='g()',setup='from __main__ import g',number=10)/10
**pandas: 0.0541711091995
numpy : 0.000645089149475**
Numpy and Pandas use different algorithms for isin. For some cases numpy's version is faster and for some pandas'. For your test case numpy seems to be faster.
Pandas' version has however a better asymptotic running time, in will win for bigger datasets.
Let's assume that there are n elements in the data-series (df in your example) and m elements in the query (vals in your example).
Usually, Numpy's algorithm does the following:
Use np.unique(..) to find all unique elements in the series. Thus is done via sorting, i.e. O(n*log(n)), there might be N<=n unique elements.
For every element use binary search to look up whether element is in the series, i.e. O(m*log(N)) in overall.
Which leads to overall running time of O(n*log(n) + m*log(N)).
There are some hard-coded optimizations in place for the cases, when vals only few elements and for this cases numpy really shines.
Pandas does something different:
Populates a hash-map (wrapped khash-functionality) in order to find all unique elements, which takes O(n).
Looks-up in the hash map in O(1) for every query, i.e. O(m) in overall.
I overall, running time is O(n)+O(m), which is much better than Numpy's.
However, for smaller inputs, constant factors and not the asymptotic behavior is that what counts and it is just way better for Numpy. There are also other consideration, like memory consumption (which is higher for Pandas) which might play a role.
But if we take a bigger query set, the situation is completely different:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(10**6),dtype='int8'),columns=['A'])
vals = np.array([5,7],dtype='int64')
vals2 = np.random.randint(0,10,(10**6),dtype='int64')
And now:
%timeit df['A'].isin(vals) # 17.0 ms
%timeit df['A'].isin(vals2) # 16.8 ms
%timeit['A'],vals) # 1.36
%timeit['A'],vals2) # 82.9 ms
Numpy is really losing ground as long as there are more queries. It can also be seen, that building of the hash-map is the bottleneck for Pandas and not the queries.
In the end it doesn't make much sense (even if I just did!) to evaluate the performance for only one input size - it should be done for a range of input sizes - there are some surprises to be discovered!
E.g. fun fact: if you would take
df = pd.DataFrame(np.random.randint(0,10,(10**6+1), dtype='int8'),columns=['A'])
i.e. 10^6+1 instead of 10^6, pandas would fall back to numpy's algorithm (which is not clever in my opinion) and would become better for small inputs but worse for big:
%timeit df['A'].isin(vals) # 6ms was 17.0 ms
%timeit df['A'].isin(vals2) # 100ms was 16.8 ms

Why the difference between octave's prctile and numpy's percentile?

I've been rewriting a matlab/octave program into numpy and ran across a difference in some resultant values.
This occurs with both the percentile/prctile and the stdard-deviation functions.
In Numpy:
import matplotlib.mlab as ml
import numpy
>>> t = numpy.linspace(0,100, 100)
>>> numpy.percentile(t,95)
>>> numpy.std(t)
>>> ml.prctile(t,95)
In Octave:
octave:1> t = linspace(0,100,100)';
octave:2> prctile(t,95)
ans = 95.454545
octave:3> std(t)
ans = 29.304537
Although the array values of 't' are the same, the results are more different than I would suspect.
In the numpy help(numpy.std) they specifically mention that the algorithm is:
std = sqrt(mean(abs(x - x.mean())**2))
So I implemented that in octave and got the exact answer numpy gives. So it seems the std-deviation function differs.
But why/how? And which is correct? (if there is such a thing)
And even prctile/percentile?
Just in case since I'm in Linux aptosid...
GNU Octave, version 3.6.2
numpy.version '1.6.2rc1'
Numpy simply uses a different algorithm when the percentile lies between two data points. Octave, Matlab and R always center it exactly between two points when needed (I believe), numpy does a bit more then that... if you check you will see there are a couple of ways to calculate percentiles.
It seems like Octave assumes ddof=1, at least by default, and numpy uses 0 by default:
>>> numpy.std(t, ddof=0)
>>> numpy.std(t, ddof=1)