Why is random.choices faster than NumPy’s random.choice?

I am trying to do random sampling in the most efficient way in Python. However, I am puzzled because NumPy's np.random.choice() turned out to be slower than the standard library's random.choices().
import numpy as np
import random
np.random.seed(12345)
# use gamma distribution
shape, scale = 2.0, 2.0
s = np.random.gamma(shape, scale, 1000000)
meansample = []
samplesize = 500
%timeit meansample = [ np.mean( np.random.choice( s, samplesize, replace=False)) for _ in range(500)]
23.3 s ± 229 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit meansample = [np.mean(random.choices(s, k=samplesize)) for x in range(0,500)]
152 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23 seconds vs. 152 ms is a big difference.
What am I doing wrong?

Two issues here. First, for the pure-Python random library, you probably mean to use sample instead of choices to sample without replacement. That alters the benchmark somewhat. Second, np.random.choice has better-performing alternatives for sampling without replacement. This is a known issue with the legacy random generator API. You can use np.random.Generator to get better performance. My timings:
%timeit meansample = [ np.mean( np.random.choice( s, samplesize, replace=False)) for _ in range(500)]
# 1 loop, best of 3: 12.4 s per loop
%timeit meansample = [np.mean(random.choices(s, k=samplesize)) for x in range(0,500)]
# 10 loops, best of 3: 118 ms per loop
sl = s.tolist()
%timeit meansample = [np.mean(random.sample(sl, k=samplesize)) for x in range(0,500)]
# 1 loop, best of 3: 219 ms per loop
g = np.random.Generator(np.random.PCG64())
%timeit meansample = [ np.mean( g.choice( s, samplesize, replace=False)) for _ in range(500)]
# 10 loops, best of 3: 25 ms per loop
So, without replacement, random.sample outperforms np.random.choice but is slower than np.random.Generator.choice.
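As a minimal sketch of the two without-replacement options compared above, the following draws one sample with each API. The array size, the seed, and the variable names are illustrative assumptions, not the original benchmark. np.random.default_rng is used here as NumPy's shorthand for constructing a Generator backed by PCG64.

```python
import random
import numpy as np

rng = np.random.default_rng(12345)   # a Generator backed by PCG64
s = rng.gamma(2.0, 2.0, 100_000)     # assumption: smaller than the question's 1,000,000
samplesize = 500

# NumPy: sample without replacement through the Generator API
sample_np = rng.choice(s, samplesize, replace=False)

# Standard library: random.sample (not random.choices) samples without replacement
random.seed(12345)
sl = s.tolist()
sample_py = random.sample(sl, k=samplesize)

print(sample_np.mean(), np.mean(sample_py))
```

Timing either call with %timeit on your own machine is the only reliable way to compare them; the relative numbers depend on the NumPy version and the array size.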

Related

Understanding Numba Performance Differences

I'm trying to understand the performance differences I am seeing between various numba implementations of an algorithm. In particular, I would expect func1d from below to be the fastest implementation since it is the only algorithm that is not copying data. However, according to my timings func1b appears to be the fastest.
import numpy
import numba
def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))
@numba.njit(fastmath=True)
def func1b(data, a, b, c):
    new_data = a * (1 + numpy.tanh((data / b) - c))
    return new_data
@numba.njit(fastmath=True)
def func1c(data, a, b, c):
    new_data = numpy.empty(data.shape)
    for i in range(new_data.shape[0]):
        for j in range(new_data.shape[1]):
            new_data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return new_data
@numba.njit(fastmath=True)
def func1d(data, a, b, c):
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return data
Helper functions for testing memory copying
def get_data_base(arr):
    """For a given NumPy array, find the base array
    that owns the actual data.
    https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/
    """
    base = arr
    while isinstance(base.base, numpy.ndarray):
        base = base.base
    return base
def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)
def test_share(func):
    data = numpy.random.randn(100, 3)
    print(arrays_share_data(data, func(data, 0.5, 2.5, 2.5)))
Timings
# force compiling
data = numpy.random.randn(10_000, 300)
_ = func1a(data, 0.5, 2.5, 2.5)
_ = func1b(data, 0.5, 2.5, 2.5)
_ = func1c(data, 0.5, 2.5, 2.5)
_ = func1d(data, 0.5, 2.5, 2.5)
data = numpy.random.randn(10_000, 300)
%timeit func1a(data, 0.5, 2.5, 2.5)
%timeit func1b(data, 0.5, 2.5, 2.5)
%timeit func1c(data, 0.5, 2.5, 2.5)
%timeit func1d(data, 0.5, 2.5, 2.5)
67.2 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
69.8 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.8 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Test which implementations copy memory
test_share(func1a)
test_share(func1b)
test_share(func1c)
test_share(func1d)
False
False
False
True
Here, copying of data doesn't play a big role: the bottleneck is how fast the tanh function is evaluated. There are many algorithms: some are faster, some are slower, some are more precise, some less.
Different numpy distributions use different implementations of the tanh function, e.g. it could be the one from mkl/vml or the one from the gnu-math-library.
Depending on the numba version, either the mkl/svml implementation or the gnu-math-library is used.
The easiest way to look inside is to use a profiler, for example perf.
For the numpy-version on my machine I get:
>>> perf record python run.py
>>> perf report
Overhead Command Shared Object Symbol
46,73% python libm-2.23.so [.] __expm1
24,24% python libm-2.23.so [.] __tanh
4,89% python _multiarray_umath.cpython-37m-x86_64-linux-gnu.so [.] sse2_binary_scalar2_divide_DOUBLE
3,59% python [unknown] [k] 0xffffffff8140290c
As one can see, numpy uses the slow gnu-math-library (libm) functionality.
For the numba-function I get:
53,98% python libsvml.so [.] __svml_tanh4_e9
3,60% python [unknown] [k] 0xffffffff81831c57
2,79% python python3.7 [.] _PyEval_EvalFrameDefault
which means that fast mkl/svml functionality is used.
That is (almost) all there is to it.
As @user2640045 has rightly pointed out, the numpy performance will be hurt by additional cache misses due to the creation of temporary arrays.
However, cache misses don't play such a big role as the calculation of tanh:
%timeit func1a(data, 0.5, 2.5, 2.5) # 91.5 ms ± 2.88 ms per loop
%timeit numpy.tanh(data) # 76.1 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
i.e. creation of temporary objects is responsible for around 20% of the running time.
FWIW, also for the version with the handwritten loops, my numba version (0.50.1) is able to vectorize and call the mkl/svml functionality. If this does not happen for some other version, numba will fall back to the gnu-math-library functionality, which seems to be what is happening on your machine.
Listing of run.py:
import numpy
# TODO: define func1b for checking numba
def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))
data = numpy.random.randn(10_000, 300)
for _ in range(100):
    func1a(data, 0.5, 2.5, 2.5)
The performance difference is NOT in the evaluation of the tanh-function
I must disagree with @ead. Let's assume for the moment that
the main performance difference is in the evaluation of the tanh-function
Then one would expect that running just tanh from numpy and numba with fast math would show that speed difference.
import numpy as np
import numba as nb
def func_a(data):
    return np.tanh(data)
@nb.njit(fastmath=True)
def func_b(data):
    new_data = np.tanh(data)
    return new_data
data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)
Yet on my machine the above code shows almost no difference in performance.
15.7 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.8 ms ± 82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Short detour on NumExpr
I tried a NumExpr version of your code. But before being amazed that it runs almost 7 times faster, keep in mind that it uses all 10 cores available on my machine. After allowing numba to run in parallel too and optimising that a little, the performance benefit is small but still there: 2.56 ms vs 3.87 ms. See the code below.
import numpy as np
import numba as nb
import numexpr as ne
# assumption: same constants as in the question's timing calls
a, b, c = 0.5, 2.5, 2.5
@nb.njit(fastmath=True)
def func_a(data):
    new_data = a * (1 + np.tanh((data / b) - c))
    return new_data
@nb.njit(fastmath=True, parallel=True)
def func_b(data):
    new_data = a * (1 + np.tanh((data / b) - c))
    return new_data
@nb.njit(fastmath=True, parallel=True)
def func_c(data):
    for i in nb.prange(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + np.tanh((data[i, j] / b) - c))
    return data
def func_d(data):
    return ne.evaluate('a * (1 + tanh((data / b) - c))')
data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)
%timeit func_c(data)
%timeit func_d(data)
17.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.31 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.87 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.56 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The actual explanation
The ~34% of time that NumExpr saves compared to numba is nice, but even nicer is that the NumExpr authors give a concise explanation of why it is faster than numpy. I am pretty sure that this applies to numba too.
From the NumExpr github page:
The main reason why NumExpr achieves better performance than NumPy is
that it avoids allocating memory for intermediate results. This
results in better cache utilization and reduces memory access in
general.
So
a * (1 + numpy.tanh((data / b) - c))
is slower because it does a lot of steps producing intermediate results.
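To make the point about intermediate results concrete, here is a minimal pure-NumPy sketch that evaluates the same expression while reusing a single buffer through the out= argument of the ufuncs. The function name func1a_fewer_temps is made up for this illustration, and this is neither what numba nor what NumExpr does internally; it only shows what "avoiding temporaries" means.

```python
import numpy as np

def func1a(data, a, b, c):
    # Original expression: each operator allocates a new temporary array
    return a * (1 + np.tanh((data / b) - c))

def func1a_fewer_temps(data, a, b, c):
    # One allocation for the result, then every step writes back into it
    out = np.divide(data, b)
    np.subtract(out, c, out=out)
    np.tanh(out, out=out)
    np.add(out, 1.0, out=out)
    np.multiply(out, a, out=out)
    return out

data = np.random.randn(1_000, 300)
print(np.allclose(func1a(data, 0.5, 2.5, 2.5),
                  func1a_fewer_temps(data, 0.5, 2.5, 2.5)))
```

Whether this is measurably faster depends on the array size relative to the CPU caches, so it is worth timing on your own data.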

difference of complexity in ordering and sorting?

I'm trying to understand the complexity of numpy array indexing here.
Given a 1-d numpy array A and b = numpy.argsort(A),
what's the difference in time complexity between np.sort(A) and A[b]?
For np.sort(A) it would be O(n log(n)), while A[b] should be O(n)?
Under the hood, argsort does a sort, which again gives complexity O(n log(n)).
You can actually specify the sorting algorithm through the kind parameter of argsort.
To conclude, while A[b] alone is linear, you cannot use this to beat the general complexity of sorting, because you still have to determine b (by sorting).
Do a simple timing:
In [233]: x = np.random.random(100000)
In [234]: timeit np.sort(x)
6.79 ms ± 21 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [235]: timeit x[np.argsort(x)]
8.42 ms ± 220 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [236]: %%timeit b = np.argsort(x)
...: x[b]
...:
235 µs ± 694 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [237]: timeit np.argsort(x)
8.08 ms ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Timing only one size doesn't give O complexity, but it reveals the relative significance of the different steps.
If you don't need the argsort, then sort directly. If you already have b use it rather than sorting again.
Here is a visual comparison to see it better:
# sort
def m1(A, b):
    return np.sort(A)
# compute argsort and then index
def m2(A, b):
    return A[np.argsort(A)]
# index with precomputed argsort
def m3(A, b):
    return A[b]
A = [np.random.rand(n) for n in [10,100,1000,10000]]
Runtime on a log-log scale (the plot is not reproduced here).
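The three variants can be timed across sizes with the standard timeit module. This is only a sketch: the repeat counts, the seed, and the results dictionary are arbitrary choices for illustration, and the absolute numbers will differ from machine to machine.

```python
import timeit
import numpy as np

def m1(A, b):
    return np.sort(A)            # sort directly

def m2(A, b):
    return A[np.argsort(A)]      # compute argsort, then index

def m3(A, b):
    return A[b]                  # index with a precomputed argsort

rng = np.random.default_rng(0)
results = {}
for n in [10, 100, 1000, 10000]:
    A = rng.random(n)
    b = np.argsort(A)            # computed once, outside the timed region
    for f in (m1, m2, m3):
        # best of 3 repeats, 20 calls each, reported per call
        best = min(timeit.repeat(lambda: f(A, b), number=20, repeat=3)) / 20
        results[(f.__name__, n)] = best
        print(f"{f.__name__} n={n}: {best * 1e6:.2f} µs")
```

Plotting results on log-log axes should show m1 and m2 growing at a similar rate, with m3 well below them, since m3 skips the sort entirely.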

Matrix multiplication in Numpy takes too much time

I am trying to simply implement a loss function (MSE) in Python using numpy and this is my code:
import numpy as np
def loss(X, y, w):
    N = (X.shape)[0]
    X_new = np.concatenate((np.ones((N, 1)), X), axis=1)
    E = y-np.matmul(X_new, w)
    E_t = np.transpose(E)
    loss_value = (1/N)*(np.matmul(E_t, E))
    return loss_value
The dimension of E is (15000, 1) and E_t is obviously (1,15000). However, when debugging, I realized that np.matmul(E_t,E) takes too much time. I have a laptop with 16GB of RAM and Core i7, so it's weird for me that np.matmul is failing here. Is this normal if the matrices I am dealing with have these dimensions?
On a rather basic 4GB machine:
In [477]: E=np.ones((15000, 1))
In [478]: E.T@E
Out[478]: array([[15000.]])
In [479]: timeit E.T@E
10.5 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
You don't tell us anything about X, but assuming the worst case:
In [480]: E=np.ones((15000, 1),object)
In [481]: E.T@E
Out[481]: array([[15000]], dtype=object)
In [482]: timeit E.T@E
577 µs ± 492 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
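If the residual really does end up with object dtype, one thing worth checking is the dtype itself, and converting to a numeric dtype before multiplying. The sketch below uses a stand-in array named E_obj (a made-up name); whether the questioner's arrays actually have object dtype is an assumption, since nothing about X is known.

```python
import numpy as np

# Stand-in for a residual vector that accidentally has object dtype
E_obj = np.ones((15000, 1), dtype=object)
print(E_obj.dtype)

# Convert once to float64 so that matmul can use the fast numeric path
E = E_obj.astype(np.float64)
sq_err = (E.T @ E).item()        # (1, 1) result reduced to a Python float
mse = sq_err / E.shape[0]
print(E.dtype, mse)
```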

Fastest way to find all unique elements in an array with Cython

I am attempting to find the most performant method to find unique values in a NumPy array. NumPy's unique function is slow because it sorts the values before finding the unique ones. Pandas hashes the values using the klib C library, which is much faster. I am looking for a Cython solution.
The simplest solution seems to just iterate through the array and use a Python set to add each element like this:
from numpy cimport ndarray
from cpython cimport set
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cython_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef set s = set()
    for i in range(n):
        s.add(a[i])
    return s
I also tried an unordered_set from c++
from libcpp.unordered_set cimport unordered_set
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[int] s
    for i in range(n):
        s.insert(a[i])
    return s
Performance
# create array of 1,000,000
a = np.random.randint(0, 50, 1000000)
# Pure Python
%timeit set(a)
86.4 ms ± 2.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Convert to list first
a_list = a.tolist()
%timeit set(a_list)
10.2 ms ± 74.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# NumPy
%timeit np.unique(a)
32 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Pandas
%timeit pd.unique(a)
5.3 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Cython
%timeit unique_cython_int(a)
13.4 ms ± 1.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Cython - c++ unordered_set
%timeit unique_cpp_int(a)
17.8 ms ± 158 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Discussion
So pandas is about 2.5x faster than a cythonized set. Its lead increases when there are more distinct elements. Surprisingly, a pure python set (on a list) beats out a cythonized set.
My question here - is there a faster way to do this in Cython than just use the add method repeatedly? And could the c++ unordered_set be improved?
Using Unicode strings
The story changes when we use unicode strings. I believe I have to convert the numpy array to an object data type to properly add its type for Cython.
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cython_str(ndarray[object] a):
    cdef int i
    cdef int n = len(a)
    cdef set s = set()
    for i in range(n):
        s.add(a[i])
    return s
And again I tried an unordered_set from c++
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_str(ndarray[object] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[string] s
    for i in range(n):
        s.insert(a[i])
    return s
Performance
Create an array of 1 million strings with 1,000 distinct values
s_1000 = []
for i in range(1000):
    s = np.random.choice(list('abcdef'), np.random.randint(5, 50))
    s_1000.append(''.join(s))
s_all = np.random.choice(s_1000, 1000000)
# s_all has numpy unicode as its data type. Must convert to object
s_unicode_obj = s_all.astype('O')
# c++ does not easily handle unicode. Convert to bytes and then to object
s_bytes_obj = s_all.astype('S').astype('O')
# Pure Python
%timeit set(s_all)
451 ms ± 5.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set(s_unicode_obj)
71.9 ms ± 5.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# using set on a list
s_list = s_all.tolist()
%timeit set(s_list)
63.1 ms ± 7.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# NumPy
%timeit np.unique(s_unicode_obj)
1.69 s ± 97.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(s_all)
633 ms ± 3.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# Pandas
%timeit pd.unique(s_unicode_obj)
97.6 ms ± 6.61 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Cython
%timeit unique_cython_str(s_unicode_obj)
60 ms ± 5.81 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Cython - c++ unordered_set
%timeit unique_cpp_str2(s_bytes_obj)
247 ms ± 8.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Discussion
So, it appears that Python's set outperforms pandas for unicode strings but not on integers. And again, iterating through the array in Cython doesn't really help us at all.
Cheating with integers
It's possible to circumvent sets if you know the range of your integers isn't too crazy. You can simply create a second array of all zeros/False and turn their position True when you encounter each one and append that number to a list. This is extremely fast since no hashing is done.
The following works for positive integer arrays. If you had negative integers, you would have to add a constant to shift the numbers up to 0.
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_bounded(ndarray[np.int64_t] a):
    cdef int i, n = len(a)
    cdef ndarray[np.uint8_t, cast=True] unique = np.zeros(n, dtype=bool)
    cdef list result = []
    for i in range(n):
        if not unique[a[i]]:
            unique[a[i]] = True
            result.append(a[i])
    return result
%timeit unique_bounded(a)
1.18 ms ± 21.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The downside is of course memory usage since your largest integer could force an extremely large array. But this method could work for floats too if you knew precisely how many significant digits each number had.
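The same flag-array idea can be sketched in plain NumPy without Cython. The function name unique_bounded_np is made up for this illustration. It assumes a non-empty array of non-negative integers whose maximum is small enough that a boolean array of that length fits in memory. Unlike the Cython version above, it returns the values in sorted order rather than in order of first appearance.

```python
import numpy as np

def unique_bounded_np(a):
    # One boolean flag per possible value in 0..a.max()
    seen = np.zeros(int(a.max()) + 1, dtype=bool)
    seen[a] = True                   # mark every value that occurs
    return np.flatnonzero(seen)      # indices of set flags are the unique values

a = np.random.randint(0, 50, 1_000_000)
print(unique_bounded_np(a))
```

Sizing the flag array by the maximum value rather than by len(a) also keeps the lookup in bounds when the values are larger than the array is long.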
Summary
Integers 50 unique of 1,000,000 total
Pandas - 5 ms
Python set of list - 10 ms
Cython set - 13 ms
'Cheating' with integers - 1.2 ms
Strings 1,000 unique of 1,000,000 total
Cython set - 60 ms
Python set of list - 63 ms
Pandas - 98 ms
Appreciate all the help making these faster.
I think the answer to your question "what is the fastest way to find unique elements" is "it depends". It depends on your data set and on your hardware.
For your scenarios (I mostly looked at the integer case), pandas (with the khash it uses) does a pretty decent job. I was not able to match this performance using std::unordered_map.
However, google::dense_hash_set was slightly faster in my experiments than the pandas solution.
Please read on for a more detailed explanation.
I would like to start out by explaining the results you are observing and use these insights later on.
I start with your int-example: there are only 50 unique elements but 1,000,000 in the array:
import numpy as np
import pandas as pd
a=np.random.randint(0,50, 10**6, dtype=np.int64)
As baseline the timings of np.unique() and pd.unique() for my machine:
%timeit np.unique(a)
>>>82.3 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(a)
>>>9.4 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The pandas approach with a hash set (O(n)) is about 10 times faster than numpy's approach with sorting (O(n log n)). log2(n) is about 20 for n=10**6, so a factor of 10 is about the expected difference.
Another difference is that np.unique returns a sorted array, so one could use binary search to look up the elements. pd.unique returns an unsorted array, so we need either to sort it (which might be O(n log n) if there are not many duplicates in the original data) or to transform it into a set-like structure.
Let's take a look at the simple Python-Set:
%timeit set(a)
>>> 257 ms ± 21.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The first thing we must be aware of here: we are comparing apples and oranges. The previous unique functions return numpy arrays, which consist of lowly C integers. This one returns a set of full-fledged Python integers. Quite a different thing!
That means that for every element in the numpy array we must first create a Python object, which is quite an overhead, and only then can we add it to the set.
The conversion to Python integers can be done in a preprocessing step, as in your version with the list:
A=list(a)
%timeit set(A)
>>> 104 ms ± 952 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit set(list(a))
>>> 270 ms ± 23.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
More than 100 ms are needed for the creation of the Python integers. However, Python integers are more complex than the lowly C ints and thus handling them costs more. Using pd.unique on C ints and then promoting the result to a Python set is much faster.
And now your Cython version:
%timeit unique_cython_int(a)
31.3 ms ± 630 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That I don't really understand. I would expect it to perform similarly to set(a); Cython would cut out the interpreter, but that would not explain the factor of 10. However, we have only 50 different integers (which are even in the small-integer pool because they are smaller than 256), so there is probably some optimization that makes the difference.
Let's try another data-set (there are now 10**5 different numbers):
b=np.random.randint(0, 10**5,10**6, dtype=np.int64)
%timeit unique_cython_int(b)
>>> 236 ms ± 31.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set(b)
>>> 388 ms ± 15.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A speed-up of less than 2 is what I would expect.
Let's take a look at cpp-version:
%timeit unique_cpp_int(a)
>>> 25.4 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit unique_cpp_int(b)
>>> 100 ms ± 4.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
There is some overhead in copying the data from the cpp-set to the Python set (as DavidW has pointed out), but otherwise the behavior is what I would expect given my experience with it: std::unordered_map is somewhat faster than Python, but not the greatest implementation around. pandas seems to beat it:
%timeit set(pd.unique(b))
>>> 45.8 ms ± 3.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So it looks like, in the situation where there are many duplicates and the hash function is cheap, the pandas solution is hard to beat.
One could maybe try out the google data structures.
However, when the data has only very few duplicates, the numpy's sorting solution may become the faster one. The main reason is, that numpy's unique needs only twice the memory - the original data and the output, while pandas hash-set-solution needs much more memory: the original data, the set and the output. For huge datasets it might become the difference between having enough RAM and not having enough RAM.
It depends on the set-implementation how much memory-overhead is needed and it is always about the trade-off between memory and speed. For example std::unordered_set needs at least 32 byte to save a 8-byte integer. Some google's data structures can do better.
Running /usr/bin/time -fpeak_used_memory:%M python check_mem.py with pandas/numpy unique:
#check_mem.py
import numpy as np
import pandas as pd
c=np.random.randint(0, 2**63,5*10**7, dtype=np.int64)
#pd.unique(c)
np.unique(c)
shows 1.2 GB for numpy and 2.0GB for pandas.
Actually, on my Windows machine np.unique is faster than pd.unique if there are (next to) only unique elements in the array, even for "only" 10^6 elements (probably because of the needed rehashes as the used set grows). This is however not the case for my Linux machine.
Another scenario in which pandas doesn't shine is when the calculation of the hash function is not cheap: Consider long strings (let's say of 1000 characters) as objects.
To calculate the hash-value one needs to consider all 1000 characters (which means a lot of data-> a lot of hash misses), the comparison of two strings is mostly done after one or two characters - the probability is then already very high, that we know that the strings are different. So the log n factor of the numpy's unique doesn't look that bad anymore.
It could be better to use a tree-set instead of a hash-set in this case.
Improving on cpp-unordered set:
The method using cpp's unordered set could be improved by using its reserve() method, which would eliminate the need for rehashing. But it is not imported to cython, so using it from Cython is quite cumbersome.
The reserving however would not have any impact on the runtimes for data with only 50 unique elements and at most factor 2 (amortized costs due to the used resize-strategy) for the data with almost all elements unique.
The hash-function for ints is identity (at least for gcc), so not much to gain here (I don't think using a more fancy hash-function would help here).
I see no way how cpp's unordered-set could be tweaked to beat the khash-implementation used by pandas, which seems to be quite good for this type of tasks.
Here are for example these pretty old benchmarks, which show that khash is somewhat faster than std::unordered_map with only google_dense being even faster.
Using google dense map:
In my experiments, google dense map (from here) was able to beat khash - benchmark code can be found at the end of the answer.
It was faster if there were only 50 unique elements:
#50 unique elements:
%timeit google_unique(a,r)
1.85 ms ± 8.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.unique(a)
3.52 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
but also faster if there were only unique elements:
%timeit google_unique(c,r)
54.4 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [3]: %timeit pd.unique(c)
75.4 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
My few experiments have also shown, that google_hash_set uses maybe more memory (up to 20%) than khash, but more tests are needed to see whether this is really the case.
I'm not sure my answer helped you at all. My take-aways are:
If we need a set of Python-integers, set(pd.unique(...)) seems to be a good starting point.
There are some cases for which numpy's sorting solution might be better (less memory, sometimes hash-calculation is too expensive)
Knowing more about data can be used to tweak the solution, by making a better trade-off (e.g. using less/more memory/preallocating so we don't need to rehash or to use a bitset for look-up).
The pandas solution seems to be tuned pretty well for some usual cases, but for other cases another trade-off might be better, google_dense being the most promising candidate.
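The first take-away, deduplicating on C integers with pd.unique and only then promoting the few survivors to Python objects, might look like the sketch below. The array mirrors the 50-distinct-values example from above, and the names uniq_arr and uniq_set are made up for the illustration.

```python
import numpy as np
import pandas as pd

a = np.random.randint(0, 50, 10**6, dtype=np.int64)

# Hash-based deduplication on the raw int64 values (order of first appearance)
uniq_arr = pd.unique(a)

# Only the unique values are converted to Python ints
uniq_set = set(uniq_arr.tolist())
print(len(uniq_set))
```

Calling .tolist() before set() yields plain Python int objects; set(uniq_arr) would instead hold NumPy scalar objects.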
Listings for google-tests:
#google_hash.cpp
#include <cstdint>
#include <functional>
#include <sparsehash/dense_hash_set>
typedef int64_t lli;
void cpp_unique(lli *input, int n, lli *output){
    google::dense_hash_set<lli, std::hash<lli> > set;
    set.set_empty_key(-1);
    for (int i=0;i<n;i++){
        set.insert(input[i]);
    }
    int cnt=0;
    for(auto x : set)
        output[cnt++]=x;
}
the corresponding pyx-file:
#google.pyx
cimport numpy as np
cdef extern from "google_hash.cpp":
void cpp_unique(np.int64_t *inp, int n, np.int64_t *output)
#out should have enough memory:
def google_unique(np.ndarray[np.int64_t,ndim=1] inp, np.ndarray[np.int64_t,ndim=1] out):
    cpp_unique(&inp[0], len(inp), &out[0])
the setup.py-file:
from distutils.core import setup, Extension
from Cython.Build import cythonize
import numpy as np
setup(ext_modules=cythonize(Extension(
name='google',
language='c++',
extra_compile_args=['-std=c++11'],
sources = ["google.pyx"],
include_dirs=[np.get_include()]
)))
Ipython-benchmark script, after calling python setup.py build_ext --inplace:
import numpy as np
import pandas as pd
from google import google_unique
a=np.random.randint(0,50,10**6,dtype=np.int64)
b=np.random.randint(0, 10**5,10**6, dtype=np.int64)
c=np.random.randint(0, 2**63,10**6, dtype=np.int64)
r=np.zeros((10**6,), dtype=np.int64)
%timeit google_unique(a,r)
%timeit pd.unique(a)
Other listings
Cython version after fixes:
%%cython
cimport cython
from numpy cimport ndarray
from cpython cimport set
cimport numpy as np
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cython_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef set s = set()
    for i in range(n):
        s.add(a[i])
    return s
C++ version after fixes:
%%cython -+ -c=-std=c++11
cimport cython
cimport numpy as np
from numpy cimport ndarray
from libcpp.unordered_set cimport unordered_set
@cython.wraparound(False)
@cython.boundscheck(False)
def unique_cpp_int(ndarray[np.int64_t] a):
    cdef int i
    cdef int n = len(a)
    cdef unordered_set[int] s
    for i in range(n):
        s.insert(a[i])
    return s

What is maybe_convert_objects good for?

I'm profiling the timing of one of my functions and I see that I spend a lot of time on pandas DataFrame creation: about 2.5 seconds to construct a DataFrame with 1000 columns and 10k rows:
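import numpy as np
from pandas import DataFrame
def test(size):
    samples = []
    for r in range(10000):
        # beta parameters must be positive, so draw from 1..99 rather than 0..99
        a, b = np.random.randint(1, 100, size=2)
        data = np.random.beta(a, b, size=size)
        samples.append(data)
    return DataFrame(samples, dtype=np.float64)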
Running %prun -l 4 test(1000) returns:
Is there any way I can avoid this check? I tried to find out about this method and ways to bypass it, but didn't find anything online.
pandas must introspect each row because you are passing it a list of arrays. Here are some more efficient methods in this case.
In [27]: size=1000
In [28]: samples = []
...: for r in range(10000):
...: data = np.random.beta(1,1 ,size = size)
...: samples.append(data)
...:
In [29]: np.asarray(samples).shape
Out[29]: (10000, 1000)
# original
In [30]: %timeit DataFrame(samples)
2.29 s ± 91.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# numpy is less flexible on the conversion, but in this case
# it is fine
In [31]: %timeit DataFrame(np.asarray(samples))
30.9 ms ± 426 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# you should probably just do this
In [32]: samples = np.random.beta(1,1, size=(10000, 1000))
In [33]: %timeit DataFrame(samples)
74.4 µs ± 381 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
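As a small self-contained sketch of the first suggestion, the list of row arrays can be stacked into one 2-D array before it is handed to the DataFrame constructor. The sizes are reduced from the 10000 x 1000 case above so that it runs quickly, and the names rows, stacked, df_slow and df_fast are made up for the illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A list of 1-D arrays, as in the question (200 rows of 50 values here)
rows = [rng.beta(1, 1, size=50) for _ in range(200)]

# Slow path: pandas has to inspect the list row by row
df_slow = pd.DataFrame(rows)

# Faster path: build one contiguous 2-D float array first
stacked = np.asarray(rows)
df_fast = pd.DataFrame(stacked)

print(df_fast.shape, df_fast.dtypes.iloc[0])
```

Both frames hold the same values; only the construction cost differs, which is what the %timeit numbers above measure.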