Numpy speed efficiency using broadcasting, transpose and reshape in a large array

Is there a way to speed up the following line of code:
import numpy as np

desired_channel = 32
len_indices = 50000
fast_idx = np.broadcast_to(np.arange(desired_channel)[:, None], (desired_channel, len_indices)).T.reshape(-1)
Thank you.

The last line of code is simply equal to np.tile(np.arange(desired_channel), len_indices).
On my machine, the performance of np.tile, like that of many Numpy calls, is bound by the operating system (page faults), the memory allocator and the memory throughput. There are two ways to work around this limitation: avoid allocating/filling temporary buffers, and produce smaller arrays in memory by using shorter types such as np.uint8 or np.uint16, depending on your needs.
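For instance, the second option is a one-liner (a sketch assuming the channel indices fit into a byte, i.e. desired_channel <= 256):

import numpy as np

desired_channel = 32
len_indices = 50000
# 8x less memory traffic than the default int64 result of np.tile(np.arange(...), ...)
fast_idx = np.tile(np.arange(desired_channel, dtype=np.uint8), len_indices)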
Since there is no out parameter for the np.tile function, Numba can be used to generate a fast alternative function. Here is an example:
import numba as nb
import numpy as np

@nb.njit('int32[::1](int32, int32, int32[::1])', parallel=True)
def generate(desired_channel, len_indices, out):
    for i in nb.prange(len_indices):
        for j in range(desired_channel):
            out[i * desired_channel + j] = j
    return out

desired_channel = 32
len_indices = 50000
buffer = np.full(desired_channel * len_indices, 0, dtype=np.int32)
%timeit -n 200 generate(desired_channel, len_indices, buffer)
Here are the performance results:
Original code: 1.25 ms
np.tile: 1.24 ms
Numba: 0.20 ms

I am new to the JAX library. I compared your code against a JAX version using the following code on a Colab TPU:
import numpy as np
from jax import jit
import jax.numpy as jnp
import timeit
desired_channel = 32
len_indices = 50000

def ex_():
    return np.broadcast_to(np.arange(desired_channel)[:, None], (desired_channel, len_indices)).T.reshape(-1)

%timeit -n1000 -r10 ex_()

@jit
def exj_():
    return jnp.broadcast_to(jnp.arange(desired_channel)[:, None], (desired_channel, len_indices)).T.reshape(-1)

%timeit -n1000 -r10 exj_()
In one of my runs, the results were:
1000 loops, best of 10: 901 µs per loop
1000 loops, best of 10: 317 µs per loop
So in this case, JAX speeds up the code by roughly a factor of two to three.
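One caveat worth noting (not part of the measurement above): JAX dispatches work asynchronously, so a timing that does not block on the result mostly measures dispatch. A fairer comparison would block on the device result, e.g.:

# Block on the device result so %timeit measures the computation, not just the dispatch.
%timeit -n1000 -r10 exj_().block_until_ready()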

numpy.random.multinomial at version 1.16.6 is 10x faster than later version

Here are codes and result:
python -c "import numpy as np; from timeit import timeit; print('numpy version {}: {:.1f} seconds'.format(np.__version__, timeit('np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4])', number=1000000, globals=globals())))"
numpy version 1.16.6: 1.5 seconds # 10x faster
numpy version 1.18.1: 15.5 seconds
numpy version 1.19.0: 17.4 seconds
numpy version 1.21.4: 15.1 seconds
Note that with a fixed random seed, the output is the same across the different numpy versions:
python -c "import numpy as np; np.random.seed(0); print(np.__version__); print(np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4], size=10000))" /tmp/tt
Any advice on why numpy versions after 1.16.6 are 10x slower?
We have upgraded pandas to the latest version 1.3.4, which requires a numpy version newer than 1.16.6.
TL;DR: this is a local performance regression caused by the overhead of additional checks in the numpy.random.multinomial function. Very small arrays are strongly impacted due to the relative execution time of the required checks.
Under the hood
A binary search on the Git commits of the Numpy code shows that the performance regression appeared for the first time in mid-April 2019. It can be reproduced in commit dd77ce3cb but not in 7e8e19f9a. There are some build issues for the commits in between, but with a quick fix we can show that commit 0f3dd0650 is the first to cause the issue. The commit message says that it:
Extend multinomial to allow broadcasting
Fix zipf changes missed in NumPy
Enable 0 as valid input for hypergeometric
A deeper analysis of this commit shows that it modifies the multinomial function defined in the Cython file mtrand.pyx to perform the following two additional checks:
def multinomial(self, np.npy_intp n, object pvals, size=None):
    cdef np.npy_intp d, i, sz, offset
    cdef np.ndarray parr, mnarr
    cdef double *pix
    cdef int64_t *mnix
    cdef int64_t ni

    d = len(pvals)
    parr = <np.ndarray>np.PyArray_FROM_OTF(pvals, np.NPY_DOUBLE, np.NPY_ALIGNED)
    pix = <double*>np.PyArray_DATA(parr)
    check_array_constraint(parr, 'pvals', CONS_BOUNDED_0_1)  # <==========[HERE]
    if kahan_sum(pix, d-1) > (1.0 + 1e-12):
        raise ValueError("sum(pvals[:-1]) > 1.0")

    if size is None:
        shape = (d,)
    else:
        try:
            shape = (operator.index(size), d)
        except:
            shape = tuple(size) + (d,)

    multin = np.zeros(shape, dtype=np.int64)
    mnarr = <np.ndarray>multin
    mnix = <int64_t*>np.PyArray_DATA(mnarr)
    sz = np.PyArray_SIZE(mnarr)
    ni = n
    check_constraint(ni, 'n', CONS_NON_NEGATIVE)  # <==========[HERE]
    offset = 0
    with self.lock, nogil:
        for i in range(sz // d):
            random_multinomial(self._brng, ni, &mnix[offset], pix, d, self._binomial)
            offset += d

    return multin
These two checks are required for the code to be robust. However, they are currently pretty expensive considering their purpose.
Indeed, on my machine, the first check is responsible for ~75% of the overhead and the second for ~20%. The checks take only a few microseconds each, but since your input is very small, the overhead is huge compared to the computation time.
One workaround to fix this issue is to write a specific Numba function for this since your input array is very small. On my machine, np.random.multinomial in a trivial Numba function results in good performance.
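As a rough sketch of that idea (not my exact code; multinomial_small is a hypothetical helper that reproduces the sequential-binomial scheme NumPy uses internally, skipping the Python-level checks discussed above):

import numba as nb
import numpy as np

@nb.njit
def multinomial_small(n, pvals):
    # Draw each category from a binomial conditioned on what is left,
    # which is the same scheme random_multinomial uses internally.
    d = pvals.shape[0]
    out = np.zeros(d, dtype=np.int64)
    remaining_p = 1.0
    remaining_n = n
    for i in range(d - 1):
        if remaining_n <= 0:
            break
        p = min(pvals[i] / remaining_p, 1.0)
        count = np.random.binomial(remaining_n, p)
        out[i] = count
        remaining_n -= count
        remaining_p -= pvals[i]
    out[d - 1] = remaining_n
    return out

# Same call shape as np.random.multinomial(1, [0.1, 0.2, 0.3, 0.4])
sample = multinomial_small(1, np.array([0.1, 0.2, 0.3, 0.4]))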
I checked some of the generators that are used under the hood and saw little change in the timings.
I guessed the difference might be due to some overhead, because you are sampling only a single value, and that seems to be a good hypothesis. When I increased the size of the generated random samples to 1000, the difference between 1.16.6 and 1.19.2 (my current Numpy version) shrank to ~20%.
python -c "import numpy as np; from timeit import timeit; print('numpy version {}: {:.1f} seconds'.format(np.__version__, timeit('np.random.
multinomial(1, [0.1, 0.2, 0.3, 0.4], size=1000)', number=10000, globals=globals())))"
numpy version 1.16.6: 1.1 seconds
numpy version 1.19.2: 1.3 seconds
Note that both versions have this overhead; the newer version just has much more of it. In both versions it is much faster to sample 1000 values once than to sample 1 value 1000 times.
The code changed a lot between 1.16.6 and 1.17.0 (see for example this commit), so it's hard to analyse. Sorry I can't help you more - I suggest opening an issue on Numpy's GitHub.

pandas "isin" is much slower than numpy "in1d"

There is a huge difference between pandas "isin" and numpy "in1d" in terms of efficiency. After some research I've noticed that the type of the data and of the values passed as a parameter to the "in" method has a huge impact on the run time. In any case, the numpy implementation seems to suffer much less from this problem.
What's going on here?
import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(10**6),dtype='int8'),columns=['A'])
vals = np.array([5,7],dtype='int64')
f = lambda: df['A'].isin(vals)
g = lambda: pd.np.in1d(df['A'],vals)
print 'pandas:', timeit.timeit(stmt='f()',setup='from __main__ import f',number=10)/10
print 'numpy :', timeit.timeit(stmt='g()',setup='from __main__ import g',number=10)/10
>>
pandas: 0.0541711091995
numpy : 0.000645089149475
Numpy and Pandas use different algorithms for isin. For some cases numpy's version is faster and for some pandas'. For your test case numpy seems to be faster.
Pandas' version, however, has a better asymptotic running time: it will win for bigger datasets.
Let's assume that there are n elements in the data-series (df in your example) and m elements in the query (vals in your example).
Usually, Numpy's algorithm does the following:
Use np.unique(..) to find all unique elements in the series. This is done via sorting, i.e. O(n*log(n)); there may be N <= n unique elements.
For every query element, use binary search to look up whether it is in the series, i.e. O(m*log(N)) overall.
Which leads to overall running time of O(n*log(n) + m*log(N)).
There are some hard-coded optimizations in place for the case when vals has only a few elements, and in these cases numpy really shines.
Pandas does something different:
Populates a hash-map (wrapped khash-functionality) in order to find all unique elements, which takes O(n).
Looks up each query in the hash map in O(1), i.e. O(m) overall.
In total, the running time is O(n) + O(m), which is much better than Numpy's.
However, for smaller inputs, constant factors, and not the asymptotic behavior, are what counts, and Numpy is just way better there. There are also other considerations, like memory consumption (which is higher for Pandas), which might play a role.
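To make these asymptotics concrete, here is a toy sketch of both strategies (neither is the actual NumPy or pandas implementation; membership_scan mimics NumPy's small-query fast path, membership_hash the hash-based idea):

import numpy as np

series = np.random.randint(0, 10, 10**6, dtype='int8')
vals = np.array([5, 7], dtype='int64')

def membership_scan(series, vals):
    # OR of elementwise comparisons, one pass per query value: O(n*m)
    mask = np.zeros(len(series), dtype=bool)
    for v in vals:
        mask |= (series == v)
    return mask

def membership_hash(series, vals):
    # Build a hash table once, then one O(1) lookup per element: O(n + m)
    table = set(vals.tolist())
    return np.fromiter((x in table for x in series), dtype=bool, count=len(series))

assert (membership_scan(series, vals) == membership_hash(series, vals)).all()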
But if we take a bigger query set, the situation is completely different:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(10**6),dtype='int8'),columns=['A'])
vals = np.array([5,7],dtype='int64')
vals2 = np.random.randint(0,10,(10**6),dtype='int64')
And now:
%timeit df['A'].isin(vals) # 17.0 ms
%timeit df['A'].isin(vals2) # 16.8 ms
%timeit pd.np.in1d(df['A'],vals) # 1.36 ms
%timeit pd.np.in1d(df['A'],vals2) # 82.9 ms
Numpy really loses ground as soon as there are more queries. It can also be seen that building the hash-map is the bottleneck for Pandas, not the queries themselves.
In the end it doesn't make much sense (even if I just did!) to evaluate the performance for only one input size - it should be done for a range of input sizes - there are some surprises to be discovered!
E.g. fun fact: if you take
df = pd.DataFrame(np.random.randint(0,10,(10**6+1), dtype='int8'),columns=['A'])
i.e. 10^6 + 1 instead of 10^6, pandas falls back to numpy's algorithm (which is not clever in my opinion) and becomes better for small inputs but worse for big ones:
%timeit df['A'].isin(vals) # 6ms was 17.0 ms
%timeit df['A'].isin(vals2) # 100ms was 16.8 ms

Evaluating a tensor partially

Suppose we have a convolution operation such as this:
y = tf.nn.conv2d( ... )
Tensorflow allows you to evaluate a part of a tensor, e.g.:
print(sess.run(y[0]))
When we evaluate a tensor partially like above, which one of the following is correct?
TF runs the whole operation, i.e. it computes y completely, and then returns the value of y[0]
TF runs only the necessary operations to compute y[0].
I set up a small sample program:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'  # forcing to run on the CPU

import tensorflow as tf

def full_array(sess, arr):
    sess.run(arr)[0]

def partial_array(sess, arr):
    sess.run(arr[0])

sess = tf.Session()
arr = tf.random_uniform([100])
arr = arr + tf.random_uniform([100])
These are my results:
%timeit partial_array(sess, arr)
100 loops, best of 3: 15.8 ms per loop
%timeit full_array(sess, arr)
10000 loops, best of 3: 85.9 µs per loop
It seems from the timings that the partial run is actually much slower than the full run (which is confusing to me to be honest...)
With these timings, I'd exclude alternative 1), since I would expect the timing to be approximately the same in the two functions if that were true.
Given my simplified test code, I'd lean towards the idea that the logic to figure out what part of the graph needs to run to satisfy the tensor slice is the cause of the performance difference, but I don't currently have a proof for that.
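For what it's worth, one way to get such a proof would be to inspect which ops actually executed using TF1 run metadata (a sketch I have not timed, reusing sess and arr from above):

# Collect per-op execution stats for a single run of the sliced tensor.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
sess.run(arr[0], options=run_options, run_metadata=run_metadata)

# Every op that actually ran shows up in step_stats, grouped by device.
for dev_stats in run_metadata.step_stats.dev_stats:
    for node_stats in dev_stats.node_stats:
        print(dev_stats.device, node_stats.node_name)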
Update:
I also ran a similar test with a convolution op instead of an add (which I thought might be too simple an example):
def full_array(sess, arr):
    return sess.run(arr)[0]

def partial_array(sess, arr):
    return sess.run(arr[0])

sess = tf.Session()
arr = tf.random_uniform([1,100,100,3])
conv = tf.nn.conv2d(arr, tf.constant(1/9, shape=[3,3,3,6]), [1,1,1,1], 'SAME')
The results, however, are consistent with the previous ones:
%timeit full_array(sess, conv)
1000 loops, best of 3: 949 µs per loop
%timeit partial_array(sess, conv)
100 loops, best of 3: 20 ms per loop

Make Pandas DataFrame apply() use all cores?

As of August 2017, Pandas DataFrame.apply() is unfortunately still limited to working with a single core, meaning that a multi-core machine will waste the majority of its compute time when you run df.apply(myfunc, axis=1).
How can you use all your cores to run apply on a dataframe in parallel?
You may use the swifter package:
pip install swifter
(Note that you may want to use this in a virtualenv to avoid version conflicts with installed dependencies.)
Swifter works as a plugin for pandas, allowing you to reuse the apply function:
import swifter

def some_function(data):
    return data * 10

data['out'] = data['in'].swifter.apply(some_function)
It will automatically figure out the most efficient way to parallelize the function, whether it's vectorized (as in the above example) or not.
More examples and a performance comparison are available on GitHub. Note that the package is under active development, so the API may change.
Also note that this will not work automatically for string columns. When using strings, Swifter will fall back to a “simple” Pandas apply, which will not be parallel. In this case, even forcing it to use dask will not create performance improvements, and you would be better off just splitting your dataset manually and parallelizing using multiprocessing.
The simplest way is to use Dask's map_partitions. You need these imports (you will need to pip install dask):
import pandas as pd
import dask.dataframe as dd
from dask.multiprocessing import get
and the syntax is
data = <your_pandas_dataframe>
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y,z, ...): return <whatever>
res = ddata.map_partitions(lambda df: df.apply((lambda row: myfunc(*row)), axis=1)).compute(get=get)
(I believe that 30 is a suitable number of partitions if you have 16 cores). Just for completeness, I timed the difference on my machine (16 cores):
data = pd.DataFrame()
data['col1'] = np.random.normal(size = 1500000)
data['col2'] = np.random.normal(size = 1500000)
ddata = dd.from_pandas(data, npartitions=30)
def myfunc(x,y): return y*(x**2+1)
def apply_myfunc_to_DF(df): return df.apply((lambda row: myfunc(*row)), axis=1)
def pandas_apply(): return apply_myfunc_to_DF(data)
def dask_apply(): return ddata.map_partitions(apply_myfunc_to_DF).compute(get=get)
def vectorized(): return myfunc(data['col1'], data['col2'] )
t_pds = timeit.Timer(lambda: pandas_apply())
print(t_pds.timeit(number=1))
28.16970546543598
t_dsk = timeit.Timer(lambda: dask_apply())
print(t_dsk.timeit(number=1))
2.708152851089835
t_vec = timeit.Timer(lambda: vectorized())
print(t_vec.timeit(number=1))
0.010668013244867325
Giving a factor of 10 speedup going from pandas apply to dask apply on partitions. Of course, if you have a function you can vectorize, you should - in this case the function (y*(x**2+1)) is trivially vectorized, but there are plenty of things that are impossible to vectorize.
You can try pandarallel instead: a simple and efficient tool to parallelize your pandas operations on all your CPUs (on Linux & macOS).
Parallelization has a cost (instantiating new processes, sending data via shared memory, etc.), so parallelization is efficient only if the amount of computation to parallelize is high enough. For very small amounts of data, parallelization is not always worth it.
Functions applied should NOT be lambda functions.
from pandarallel import pandarallel
from math import sin

pandarallel.initialize()

# FORBIDDEN
df.parallel_apply(lambda x: sin(x**2), axis=1)

# ALLOWED
def func(x):
    return sin(x**2)

df.parallel_apply(func, axis=1)
see https://github.com/nalepae/pandarallel
If you want to stay in native python:
import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    df['newcol'] = pool.map(f, df['col'])
This will apply the function f in parallel to column col of dataframe df.
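A self-contained sketch of the same idea (f and df here are hypothetical stand-ins for your own function and data; the __main__ guard matters on platforms that spawn rather than fork worker processes):

import multiprocessing as mp
import pandas as pd

def f(x):
    return x * 10

if __name__ == "__main__":
    df = pd.DataFrame({"col": range(1_000_000)})
    with mp.Pool(mp.cpu_count()) as pool:
        # pool.map pickles f, chunks df['col'] across worker processes
        # and returns the results in order.
        df["newcol"] = pool.map(f, df["col"])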
Just want to give an updated answer for Dask:
import dask.dataframe as dd

def your_func(row):
    # do something
    return row

ddf = dd.from_pandas(df, npartitions=30)  # find your own number of partitions
ddf_update = ddf.apply(your_func, axis=1).compute()
On my 100,000 records, without Dask:
CPU times: user 6min 32s, sys: 100 ms, total: 6min 32s
Wall time: 6min 32s
With Dask:
CPU times: user 5.19 s, sys: 784 ms, total: 5.98 s
Wall time: 1min 3s
To use all (physical or logical) cores, you could try mapply as an alternative to swifter and pandarallel.
You can set the amount of cores (and the chunking behaviour) upon init:
import pandas as pd
import mapply
mapply.init(n_workers=-1)
...
df.mapply(myfunc, axis=1)
By default (n_workers=-1), the package uses all physical CPUs available on the system. If your system uses hyper-threading (usually twice the amount of physical CPUs would show up as logical cores), mapply will spawn one extra worker to prioritise the multiprocessing pool over other processes on the system.
Depending on your definition of all your cores, you could also use all logical cores instead (beware that in this case the CPU-bound processes will be fighting for physical CPUs, which might slow down your operation):
import multiprocessing
n_workers = multiprocessing.cpu_count()
# or more explicit
import psutil
n_workers = psutil.cpu_count(logical=True)
Here is an example of an sklearn base transformer in which the pandas apply is parallelized:
import multiprocessing as mp

import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator

class ParallelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, n_jobs=1):
        """
        n_jobs - number of parallel jobs to run
        """
        self.n_jobs = n_jobs

    def fit(self, X, y=None):
        return self

    def transform(self, X, *_):
        X_copy = X.copy()
        cores = mp.cpu_count()
        if self.n_jobs <= -1:
            partitions = cores
        elif self.n_jobs <= 0:
            partitions = 1
        else:
            partitions = min(self.n_jobs, cores)
        if partitions == 1:
            # transform sequentially
            return X_copy.apply(self._transform_one)
        # split the data into batches
        data_split = np.array_split(X_copy, partitions)
        pool = mp.Pool(cores)
        # reduce step: concatenation of the transformed batches
        data = pd.concat(
            pool.map(self._transform_part, data_split)
        )
        pool.close()
        pool.join()
        return data

    def _transform_part(self, df_part):
        return df_part.apply(self._transform_one)

    def _transform_one(self, line):
        # some kind of transformation here
        return line
for more info see https://towardsdatascience.com/4-easy-steps-to-improve-your-machine-learning-code-performance-88a0b0eeffa8
Here is a native Python solution (with numpy) that can be applied to the whole DataFrame, as the original question asks (not only to a single column):
import numpy as np
import pandas as pd
import multiprocessing as mp

dfs = np.array_split(df, 8000)  # divide the dataframe as desired

def f_app(df):
    return df.apply(myfunc, axis=1)

with mp.Pool(mp.cpu_count()) as pool:
    res = pd.concat(pool.map(f_app, dfs))
Here is another one using Joblib and some helper code from scikit-learn. It is lightweight (if you already have scikit-learn) and good if you prefer more control over what it is doing, since joblib is easily hackable.
import pandas as pd
from joblib import parallel_backend, Parallel, delayed, effective_n_jobs
from sklearn.utils import gen_even_slices
from sklearn.utils.validation import _num_samples

def parallel_apply(df, func, n_jobs=-1, **kwargs):
    """Pandas apply in parallel using joblib.
    Uses sklearn.utils to partition the input evenly.

    Args:
        df: Pandas DataFrame, Series, or any other object that supports slicing and apply.
        func: Callable to apply
        n_jobs: Desired number of workers. Default value -1 means use all available cores.
        **kwargs: Any additional parameters will be supplied to the apply function

    Returns:
        Same as for normal Pandas DataFrame.apply()
    """
    if effective_n_jobs(n_jobs) == 1:
        return df.apply(func, **kwargs)
    else:
        ret = Parallel(n_jobs=n_jobs)(
            delayed(type(df).apply)(df[s], func, **kwargs)
            for s in gen_even_slices(_num_samples(df), effective_n_jobs(n_jobs)))
        return pd.concat(ret)
Usage: result = parallel_apply(my_dataframe, my_func)
Instead of
df["new"] = df["old"].map(fun)
do
from joblib import Parallel, delayed
df["new"] = Parallel(n_jobs=-1, verbose=10)(delayed(fun)(i) for i in df["old"])
To me this is a slight improvement over
import multiprocessing as mp

with mp.Pool(mp.cpu_count()) as pool:
    df["new"] = pool.map(fun, df["old"])
as you get a progress indication and automatic batching if the jobs are very small.
Since the question was "How can you use all your cores to run apply on a dataframe in parallel?", the answer can also be Modin. You can run all cores in parallel, though the wall-clock time may be worse.
See https://github.com/modin-project/modin. It runs on top of dask or ray. They say "Modin is a DataFrame designed for datasets from 1MB to 1TB+." I tried pip3 install "modin[ray]". In my test, Modin took 12 seconds on six cores vs. 6 seconds for pandas.
In case you need to do something based on the column name inside the function, beware that the .apply function may give you some trouble. In my case I needed to change the column type with the astype() function based on the column name. This is probably not the most efficient way of doing it, but it serves the purpose and keeps the original column names.
import multiprocessing as mp
import pandas as pd

def f(df):
    """ the function that you want to apply to each column """
    column_name = df.columns[0]  # this is the same as the original column name
    # do whatever you need to do to that column
    return df

# Here I just make a list of all the columns. If you don't use .to_frame()
# it will pass a Series instead of a DataFrame
dfs = [df[column].to_frame() for column in df.columns]

with mp.Pool(mp.cpu_count()) as pool:
    processed_df = pd.concat(pool.map(f, dfs), axis=1)

vectorising linalg.eig() in numpy

I have an m*m*n numpy array (call it A) and I would like to find the eigenvalues of every submatrix A[:,:,n] in this array. I could do it with linalg.eig() in a loop with relative ease, but there really ought to be a way to vectorise this. Something like a ufunc, but that can process subvectors instead of individual elements. Is this possible?
The computation of the eigenvalues and eigenvectors cannot be vectorised in the sense that there is no way, in general, to share work between different matrices. np.linalg.eig (for real input) is just a wrapper for dgeev, which according to the docs only accepts a single matrix per call, and the computation is fairly expensive, so for matrices that are not small the overhead of a Python loop will be negligible.
Though, if you're doing this for many very small matrices it can become too slow. There are several questions related to this, and the solution usually ends up being a compiled extension. As enigmaticPhysicist says in a comment, the idea of processing subvectors and submatrices in the same way as ufuncs would be useful in general. These are called generalised ufuncs and are already in numpy's development version. I find it around 8 times faster for an array of shape (1000, 3, 3):
In [2]: np.__version__
Out[2]: '1.8.0.dev-dcf7cac'
In [3]: A = np.random.rand(1000, 3, 3)
In [4]: timeit np.linalg.eig(A)
100 loops, best of 3: 9.65 ms per loop
In [5]: timeit [np.linalg.eig(Ai) for Ai in A]
10 loops, best of 3: 74.6 ms per loop
In [6]: a1 = np.linalg.eig(A)
In [7]: a2 = [np.linalg.eig(Ai) for Ai in A]
In [8]: all(np.allclose(a1[i][j], a2[j][i]) for j in xrange(1000) for i in xrange(2))
Out[8]: True
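For reference, a minimal sketch of both approaches for the question's (m, m, n) layout (the transpose only moves the stacking axis to the front, which is the layout the gufunc-based np.linalg.eig/eigvals expects; eigvals is used here since only the eigenvalues are asked for):

import numpy as np

A = np.random.rand(3, 3, 1000)  # (m, m, n): 1000 stacked 3x3 matrices

# Plain Python loop over the stacking axis
eig_loop = np.array([np.linalg.eigvals(A[:, :, k]) for k in range(A.shape[2])])

# Generalised-ufunc path: a single call on an (n, m, m) stack
eig_stacked = np.linalg.eigvals(A.transpose(2, 0, 1))

assert np.allclose(np.sort(eig_loop, axis=1), np.sort(eig_stacked, axis=1))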