Direct access to bitarray from Cython - indexing

I can access bitarray bits with indexing syntax:
b = bitarray(10)
b[5]
How would I access an element directly?
Similar to the way I can directly access array elements:
ary.data.as_ints[5]
instead of:
ary[5]
I'm asking because when I tried this with array, in some scenarios I got a 20-30 fold speedup.
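For reference, here is a minimal sketch of the kind of direct array access meant above, using Cython's cpython.array interface; the function name and the 'i' typecode are only illustrative:
# sketch.pyx - illustrative only; assumes ary is an array.array('i', ...)
from cpython cimport array
import array

def fill(array.array ary, int value):
    cdef Py_ssize_t i
    for i in range(len(ary)):
        # direct C-level access to the buffer, bypassing Python-level indexing
        ary.data.as_ints[i] = value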
I found what I need to get access to, but I don't know how:
bitarray.h
Look at getbit() and setbit().
How can I access them from Cython?
current speeds
Shape: (10000, 10000)
VSize: 100.00Mil
Mem: 12207.03kb, 11.92mb
----------------------
sa[5,5]=1
108 ns +- 0.451 ns per loop (mean +- std. dev. of 7 runs, 10000000 loops each)
sa[5,5]
146 ns +- 37.1 ns per loop (mean +- std. dev. of 7 runs, 10000000 loops each)
sa[100:120,100:120]
34.8 µs +- 7.39 µs per loop (mean +- std. dev. of 7 runs, 10000 loops each)
sa[:100,:100]
614 µs +- 135 µs per loop (mean +- std. dev. of 7 runs, 1000 loops each)
sa[[0,1,2],[0,1,2]]
1.11 µs +- 301 ns per loop (mean +- std. dev. of 7 runs, 1000000 loops each)
sa.sum()
6.74 ms +- 1.82 ms per loop (mean +- std. dev. of 7 runs, 100 loops each)
sa.sum(axis=0)
9.92 ms +- 2.49 ms per loop (mean +- std. dev. of 7 runs, 100 loops each)
sa.sum(axis=1)
646 ms +- 42.4 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
sa.mean()
5.17 ms +- 160 µs per loop (mean +- std. dev. of 7 runs, 100 loops each)
sa.mean(axis=0)
12.8 ms +- 2.5 ms per loop (mean +- std. dev. of 7 runs, 100 loops each)
sa.mean(axis=1)
730 ms +- 25.1 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)
sa[[9269, 5484, 2001, 8881, 30, 9567, 7654, 3034, 4901, 552],:],
6.87 ms +- 1.2 ms per loop (mean +- std. dev. of 7 runs, 100 loops each)
sa[:,[1417, 157, 9793, 1300, 2339, 2439, 2925, 3980, 4550, 5100]],
9.88 ms +- 1.56 ms per loop (mean +- std. dev. of 7 runs, 100 loops each)
sa[[9269, 5484, 2001, 8881, 30, 9567, 7654, 3034, 4901, 552],[1417, 157, 9793, 1300, 2339, 2439, 2925, 3980, 4550, 5100]],
6.59 µs +- 1.78 µs per loop (mean +- std. dev. of 7 runs, 100000 loops each)
sa[[9269, 5484, 2001, 8881, 30, 9567, 7654, 3034, 4901, 552],:].sum(axis=1),
466 ms +- 121 ms per loop (mean +- std. dev. of 7 runs, 1 loop each)

I'd recommend using typed memoryviews (which let you access chunks of 8 bits) and then using bitwise-and operations to access those bits. That's definitely the easiest and most "native" way to do it in Cython.
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def sum_bits1(ba):
    cdef unsigned char[::1] ba_view = ba
    cdef int count = 0
    cdef Py_ssize_t n, idx
    cdef unsigned char subidx, val
    for n in range(len(ba)):
        idx = n // 8              # byte that holds bit n
        subidx = 1 << (n % 8)     # mask for bit n within that byte
        val = ba_view[idx] & subidx
        if val:
            count += 1
    return count
If you want to use the getbit and setbit functions defined in "bitarray.h" then you just declare them as cdef extern functions. You need to find the path to "bitarray.h"; it's probably somewhere in your local pip install directory. I've put the full path in the file, but a better solution would be to specify an include path in setup.py (a sketch of that is given at the end of this answer).
cdef extern from "<path to home>/.local/lib/python3.8/site-packages/bitarray/bitarray.h":
    ctypedef struct bitarrayobject:
        pass  # we don't need to know the details
    ctypedef class bitarray.bitarray [object bitarrayobject]:
        pass
    int getbit(bitarray, int)

def sum_bits2(bitarray ba):
    cdef int count = 0
    cdef Py_ssize_t n
    for n in range(len(ba)):
        if getbit(ba, n):
            count += 1
    return count
To test it (and compare against a simple Python only version):
def sum_bits_naive(ba):
    count = 0
    for n in range(len(ba)):
        if ba[n]:
            count += 1
    return count

def test_funcs():
    from bitarray import bitarray
    ba = bitarray("110010"*10000)
    print(sum_bits1(ba), sum_bits2(ba), sum_bits_naive(ba))
    from timeit import timeit
    globs = dict(globals())
    globs.update(locals())
    print(timeit("sum_bits1(ba)", globals=globs, number=1000))
    print(timeit("sum_bits2(ba)", globals=globs, number=1000))
    print(timeit("sum_bits_naive(ba)", globals=globs, number=1000))
gives
(30000, 30000, 30000)
0.069798200041987
0.09307677199831232
1.3518586970167235
i.e. the memoryview version is the best.
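For completeness, here is a rough sketch of the setup.py include-path approach mentioned above; the extension and file names are placeholders, and the include directory is simply taken from wherever the installed bitarray package lives:
# setup.py - illustrative sketch; "bit_funcs" / "bit_funcs.pyx" are placeholder names
import os
import bitarray
from setuptools import setup, Extension
from Cython.Build import cythonize

ext = Extension(
    "bit_funcs",
    sources=["bit_funcs.pyx"],
    # point the C compiler at the directory containing bitarray.h
    include_dirs=[os.path.dirname(bitarray.__file__)],
)

setup(ext_modules=cythonize(ext))
With that in place, the extern block can simply say cdef extern from "bitarray.h": instead of hard-coding the site-packages path.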

Related

How to speed up df.query

df.query uses numexpr under the hood, but it is much slower than pure numexpr.
Let's say I have a big DataFrame:
from random import shuffle
import pandas as pd
l1=list(range(1000000))
l2=list(range(1000000))
shuffle(l1)
shuffle(l2)
df = pd.DataFrame([l1, l2]).T
df=df.sample(frac=1)
df=df.rename(columns={0: 'A', 1:'B'})
And I want to compare 2 columns:
%timeit (df.A == df.B) | (df.A / df.B < 1) | (df.A * df.B > 3000)
10.8 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It takes 10.8 ms in this example.
Now I import numexpr and do the same thing:
import numexpr
a = df.A.__array__()
b = df.B.__array__()
%timeit numexpr.evaluate('((a == b) | (a / b < 1) | (a * b > 3000))')
1.95 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
numexpr is about 6 times faster than Pandas here.
Now let's use df.loc:
%timeit df.loc[numexpr.evaluate('((a == b) | (a / b < 1) | (a * b > 3000))')]
20.5 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.loc[(df.A == df.B) | (df.A / df.B < 1) | (df.A * df.B > 3000)]
27 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.query('((A == B) | (A / B < 1) | (A * B > 3000))')
32.5 ms ± 80.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
numexpr is still significantly faster than pure Pandas. But why is df.query so slow? It uses numexpr under the hood. Is there a way to fix that? Or any other way to use numexpr in pandas without doing a lot of tweaking?
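(For reference, a minimal sketch of how the numexpr call above could be wrapped into a reusable filter; ne_filter is an illustrative name, not an existing pandas or numexpr API:)
# illustrative helper only - wraps the numexpr-based masking shown above
import numexpr
import pandas as pd

def ne_filter(df, expr, **cols):
    # map the short names used in `expr` to the columns' NumPy arrays
    local_dict = {name: df[col].to_numpy() for name, col in cols.items()}
    mask = numexpr.evaluate(expr, local_dict=local_dict)
    return df.loc[mask]

# usage, matching the expression in the question:
# ne_filter(df, '((a == b) | (a / b < 1) | (a * b > 3000))', a='A', b='B')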

Fastest way to set a single value in a dataframe?

I have a big dataframe where I need to iteratively perform some calculations and set a subset of the dataframe, as illustrated in the example below. The example below has a 1000 x 100 MultiIndex, but in my real dataset the [100] isn't fixed; sometimes there are more rows, sometimes fewer. The other complication is that on my real dataset, df.loc[0]._is_view returns False (not sure why).
So even though the first option below df.loc[0, 'C'] is faster, I couldn't really use it. I've been using the second option df.loc[df.index.get_level_values('A') == 0, 'C'], which takes twice as long.
Does anyone know of a faster way to edit a subset of the dataframe?
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.normal(0.0, 1.0, size=(100000, 2)),
    index=pd.MultiIndex.from_product(
        [list(range(1000)), list(range(100))], names=["A", "B"]
    ),
    columns=["C", "D"],
)
%%timeit
df.loc[0, 'C'] = 1.
870 µs ± 91.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.loc[df.index.get_level_values('A') == 0, 'C'] = 1.
1.41 ms ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
An alternative to loc which, in most cases, should be faster is the Pandas .at property:
%%timeit
df.loc[0, 'C'] = 1
171 µs ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Whereas:
%%timeit
df.at[0, 'C'] = 1
153 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
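As a hedged aside (not part of the original answer): .at targets a single scalar cell, so on the MultiIndexed frame above a full (A, B) key can be given when only one cell needs updating, while loc remains the tool for setting the whole A == 0 cross-section:
# illustration only, reusing the MultiIndexed df built in the question
df.at[(0, 5), 'C'] = 1.0    # one cell: row (A=0, B=5), column 'C'
df.loc[0, 'C'] = 1.0        # every row with A == 0, column 'C'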

Pandas: is there a difference of speed between the deprecated df.sum(level=...) and df.groupby(level=...).sum()?

I'm using pandas and noticed a HUGE difference in performance between these two statements:
df.sum(level=['My', 'Index', 'Levels']) # uses numpy sum which is vectorized
and
df.groupby(level=['My', 'Index', 'Levels']).sum() # Slow...
The first example uses numpy.sum, which is vectorized, as stated in the documentation.
Unfortunately, using sum(level=...) is deprecated in the API and produces an ugly warning:
FutureWarning: Using the level keyword in DataFrame and Series
aggregations is deprecated and will be removed in a future version.
Use groupby instead. df.sum(level=1) should use
df.groupby(level=1).sum()
I don't want to use the non-vectorized version and get poor processing performance. How can I use numpy.sum along with groupby?
Edit: following the comments, here is a basic test I have done with Pandas 1.4.4, 10k random lines, and 10 index levels:
import pandas as pd
import numpy as np
print('pandas:', pd.__version__)
nb_lines = int(1e4)
nb_levels = 10
# 10k rows of random integer index keys (one column per level)
ix = np.random.randint(0, nb_levels-1, size=(nb_lines, nb_levels))
cols = [chr(65+i) for i in range(nb_levels)] # A, B, C, ...
df = pd.DataFrame(ix, columns=cols)
df = df.set_index(cols)
df['VALUE'] = np.random.rand(nb_lines) # random values to aggregate
print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)
And the result is:
pandas: 1.4.4
with groupby:
5.51 ms ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
4.93 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
This is just an example, but the version without groupby is always faster. Changing the parameters (number of levels, step size for the columns to group on, etc.) does not change the outcome.
In the end, for a big data set, you can see the difference between the two methods (numpy.sum vs other).
@mozway, your results indicate similar performance; however, if you increase the number of levels, you should see the groupby version getting worse, at least on my computer. See the edited code so you can change the number of levels (example with 10 levels and 100k lines):
import numpy as np
import pandas as pd
from string import ascii_uppercase as UP
np.random.seed(0)
N = 100_000
nb_levels = 10
cols = [chr(65+i) for i in range(nb_levels)]
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_levels)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)
print(pd.__version__)
print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)
... and the result:
1.4.4
with groupby:
50.8 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
42 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
Thanks
This doesn't seem to be true; both approaches have a similar speed.
Setup (3 levels, 26 groups each, ~18k combinations of groups, 1M rows):
import numpy as np
import pandas as pd
from string import ascii_uppercase as UP
np.random.seed(0)
N = 1_000_000
cols = ['A', 'B', 'C']
df = pd.DataFrame({'A': np.random.choice(list(UP), size=N),
                   'B': np.random.choice(list(UP), size=N),
                   'C': np.random.choice(list(UP), size=N),
                   'num': np.random.random(size=N),
                   }).set_index(cols)
Test:
pd.__version__
1.4.4
%%timeit # 3 times
df.sum(level=cols)
316 ms ± 85.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
287 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
297 ms ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit # 3 times
df.groupby(level=cols).sum()
284 ms ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
286 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
311 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Updated example from the OP:
import numpy as np
import pandas as pd
from string import ascii_uppercase as UP
np.random.seed(0)
N = 1_000_000
nb_cols = 10
cols = [chr(65+i) for i in range(nb_cols)]
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_cols)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)
print(pd.__version__)
1.5.0
%%timeit
df.sum(level=cols)
3.36 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df.groupby(level=cols).sum()
2.94 s ± 444 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Extremely slow on np.recarray assignment

I'm storing ticks with an ndarray; each tick has a utc_timestamp (str) as the index and tick prices/volumes as values. Thus I have an array with 2 different dtypes (str and float). This is the way I store it as a np.recarray:
import numpy as np

data = np.recarray((100,), dtype=[('time', 'U23'), ('ask1', 'f'), ('bid1', 'f')])
tick = ['2021-04-28T09:38:30.928', 14.21, 14.2]
# assigning this tick to the end of data, weirdly slow
%%timeit
...: data[-1] = np.rec.array(tick)
...:
1.38 ms ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That took 1.38 ms per loop! Plus, I can't set the last row using data[-1] = tick, which raises:
ValueError: setting an array element with a sequence
Let's try a simple ndarray instead; say I have 2 separate arrays, one for str and one for float:
%%timeit
...: data[:,-1]=tick[1:]
...:
15.2 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
See? That's 90x faster! Why is that?
My times are quite a bit better:
In [503]: timeit data[-1] = np.rec.array(tick)
64.4 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.rec.array(tick) creates a recarray with dtype=[('f0', '<U23'), ('f1', '<f8'), ('f2', '<f8')]. I get better speed if I use the final dtype.
In [504]: timeit data[-1] = np.rec.array(tick, data.dtype)
31.1 µs ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The bulk of that time is creating the 1-term recarray:
In [516]: %timeit x = np.rec.array(tick, data.dtype)
29.9 µs ± 41.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Making a structured array instead:
In [517]: %timeit x = np.array(tuple(tick), data.dtype)
2.71 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [519]: timeit data[-1] = np.array(tuple(tick), data.dtype)
3.58 µs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So skipping recarray entirely:
In [521]: data = np.zeros((100,), dtype=[('time','U23'),('ask1','f'),('bid1','f')])
...: tick = ('2021-04-28T09:38:30.928',14.21,14.2)
In [522]: data[-1] = np.array(tick, data.dtype)
In [523]: data[-2:]
Out[523]:
array([('', 0. , 0. ), ('2021-04-28T09:38:30.928', 14.21, 14.2)],
dtype=[('time', '<U23'), ('ask1', '<f4'), ('bid1', '<f4')])
I think recarray has largely been replaced by structured arrays. The main thing recarray adds is the ability to address fields as attributes:
data.time, data.ask1
data['time'], data['ask1']
Your example shows that recarray slows things down.
edit
The tuple tick can be assigned directly without extra conversion:
In [526]: timeit data[-1] = tick
365 ns ± 0.247 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
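(A small follow-up illustration, assuming incoming ticks are still plain lists: converting each tick to a tuple before assignment keeps this fast path, since structured-array rows are assigned from tuples:)
data[-1] = tuple(tick)   # tuple() makes a list assignable to the structured row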

Efficiently extract rows from a pandas DataFrame ignoring missing index labels

I am looking for a more efficient equivalent of
df.reindex(labels).dropna(subset=[0])
that avoids including the NaN rows for missing labels in the result, rather than having to delete them after reindex puts them in.
Equivalently, I am looking for an efficient version of
df.loc[labels]
that silently ignores labels that are not in df.index, i.e. the result may have fewer rows than labels has elements.
I need something that is efficient when the numbers of rows, columns and labels are all large and there is a significant miss rate. Specifically, I'm looking for something sublinear in the length of the dataset.
Update 1
Here is a concrete demonstration of the issue following on from @MaxU's answer:
In [2]: L = 10**7
...: M = 10**4
...: N = 10**9
...: np.random.seed([3, 1415])
...: df = pd.DataFrame(np.random.rand(L, 2))
...: labels = np.random.randint(N, size=M)
...: M-len(set(labels))
...:
...:
Out[2]: 0
In [3]: %timeit df[df.index.isin(set(labels))]
904 ms ± 59.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit df.loc[df.index.intersection(set(labels))]
207 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit df.loc[np.intersect1d(df.index, labels)]
427 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit df.loc[labels[labels<L]]
329 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit df.iloc[labels[labels<L]]
161 µs ± 8.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The last 2 examples are ~1000 times faster than those iterating over df.index. This demonstrates that df.loc[labels] does not iterate over the index and that dataframes have an efficient index structure, i.e. df.index does indeed index.
So the question is: how do I get something as efficient as df.loc[labels[labels<L]] when df.index is not a contiguous sequence of numbers? A partial solution is the original
In [8]: %timeit df.reindex(labels).dropna(subset=[0])
1.81 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That is still ~100 times faster than the suggested solutions, but still loses an order of magnitude to what may be possible.
Update 2
To further demonstrate that it is possible to get sublinear performance even without assumptions on the index, repeat the above with a string index:
In [16]: df.index=df.index.map(str)
...: labels = np.array(list(map(str, labels)))
...:
...:
In [17]: %timeit df[df.index.isin(set(labels))]
657 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %timeit df.loc[df.index.intersection(set(labels))]
974 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit df.reindex(labels).dropna()
8.7 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So, to be clear, I am after something that is more efficient than df.reindex(labels).dropna(). This is already sublinear in df.shape[0] and makes no assumptions about the index, and the solution should be too.
The issue I want to address is that df.reindex(labels) will include NaN rows for missing labels that then need removing with dropna. I am after an equivalent of df.reindex(labels) that does not put them there in the first place, without scanning the entire df.index to figure out the missing labels. This must be possible at least in principle: If reindex can efficiently handle missing labels on the fly by inserting dummy rows, it should be possible to handle them even more efficiently on the fly by doing nothing.
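One possible sketch of that "do nothing for missing labels" behaviour, using Index.get_indexer, which returns -1 for labels that are not present; this is only an illustration (not an answer from the thread) and assumes a unique index:
# illustrative sketch: keep only the labels that exist, via the index's lookup
indexer = df.index.get_indexer(labels)   # -1 marks labels missing from df.index
subset = df.iloc[indexer[indexer >= 0]]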
Here is a small comparison of different approaches.
Sample DF (shape: 10,000,000 x 2):
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(10**7, 2))
labels = np.random.randint(10**9, size=10**4)
In [88]: df.shape
Out[88]: (10000000, 2)
valid (existing labels):
In [89]: (labels <= 10**7).sum()
Out[89]: 1008
invalid (non-existing labels):
In [90]: (labels > 10**7).sum()
Out[90]: 98992
Timings:
In [103]: %timeit df[df.index.isin(set(labels))]
943 ms ± 7.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [104]: %timeit df.loc[df.index.intersection(set(labels))]
360 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [105]: %timeit df.loc[np.intersect1d(df.index, labels)]
513 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)