Pandas: is there a speed difference between the deprecated df.sum(level=...) and df.groupby(level=...).sum()?

I'm using pandas and noticed a HUGE difference in performance between these two statements:
df.sum(level=['My', 'Index', 'Levels']) # uses numpy sum which is vectorized
and
df.groupby(level=['My', 'Index', 'Levels']).sum() # Slow...
The first example uses numpy.sum, which is vectorized, as stated in the documentation.
Unfortunately, using sum(level=...) is deprecated in the API and produces an ugly warning:
FutureWarning: Using the level keyword in DataFrame and Series
aggregations is deprecated and will be removed in a future version.
Use groupby instead. df.sum(level=1) should use
df.groupby(level=1).sum()
I don't want to use the non-vectorized version and get poor processing performance. How can I use numpy.sum together with groupby?
Edit: following the comments, here is a basic test I did: pandas 1.4.4, 10k random rows, 10 index levels.
import pandas as pd
import numpy as np
print('pandas:', pd.__version__)
nb_lines = int(1e4)
nb_levels = 10
# 10k rows x 10 columns of random integers in [0, 8] (randint's upper bound is exclusive)
ix = np.random.randint(0, nb_levels-1, size=(nb_lines, nb_levels))
cols = [chr(65+i) for i in range(nb_levels)] # A, B, C, ...
df = pd.DataFrame(ix, columns=cols)
df = df.set_index(cols)
df['VALUE'] = np.random.rand(nb_lines) # random values to aggregate
print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)
And the result is:
pandas: 1.4.4
with groupby:
5.51 ms ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
4.93 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
This is just an example, but the result is always faster without groupby. Changing the parameters (number of levels, step size for the columns to group on, etc.) does not change the outcome.
In the end, on a big data set you can see the difference between the two methods (numpy.sum vs. the other).
@mozway, your results indicate similar performance; however, if you increase the number of levels you should see the groupby version getting worse, at least those are the results on my computer. See the edited code below, which lets you change the number of levels (example with 10 levels and 100k rows):
import pandas as pd
import numpy as np
from string import ascii_uppercase as UP
np.random.seed(0)
N = 100_000
nb_levels = 10
cols = [chr(65+i) for i in range(nb_levels)]
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_levels)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)
print(pd.__version__)
print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)
... and the result:
1.4.4
with groupby:
50.8 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
42 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
Thanks

This doesn't seem to be true; both approaches have similar speed.
Setup (3 levels, 26 groups each, ~18k combinations of groups, 1M rows):
import pandas as pd
import numpy as np
from string import ascii_uppercase as UP
np.random.seed(0)
N = 1_000_000
cols = ['A', 'B', 'C']
df = pd.DataFrame({'A': np.random.choice(list(UP), size=N),
                   'B': np.random.choice(list(UP), size=N),
                   'C': np.random.choice(list(UP), size=N),
                   'num': np.random.random(size=N),
                   }).set_index(cols)
Test:
pd.__version__
1.4.4
%%timeit # 3 times
df.sum(level=cols)
316 ms ± 85.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
287 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
297 ms ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit # 3 times
df.groupby(level=cols).sum()
284 ms ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
286 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
311 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Updated example from OP:
import pandas as pd
import numpy as np
from string import ascii_uppercase as UP
np.random.seed(0)
N = 1_000_000
nb_cols = 10
cols = [chr(65+i) for i in range(nb_cols)]
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_cols)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)
print(pd.__version__)
1.5.0
%%timeit
df.sum(level=cols)
3.36 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df.groupby(level=cols).sum()
2.94 s ± 444 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
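A quick sanity check that dropping the deprecated keyword does not change the output (a minimal sketch reusing df and cols from the setup above; sort_index() only makes the comparison independent of group ordering). If the order of the result does not matter, groupby(..., sort=False) is also worth timing, since sorting the group keys is extra work:
import pandas as pd
from pandas.testing import assert_frame_equal

res_old = df.sum(level=cols)             # deprecated path, emits the FutureWarning
res_new = df.groupby(level=cols).sum()   # recommended replacement
assert_frame_equal(res_old.sort_index(), res_new.sort_index())

# If the result order is irrelevant, skipping the key sort may shave some time:
%timeit df.groupby(level=cols, sort=False).sum()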

Related

What is the difference between using filter vs loc to select only certain columns?

df.filter(items=["A","D"])
vs
df.loc[:, ["A","D"]]
What is the difference between the above options to select all rows from 2 columns?
When should I use filter and when should I use loc?
Leaving aside the extra filtering capabilities that the filter method provides (see the documentation), the loc method is much faster.
import pandas as pd
import numpy as np
df = pd.DataFrame({"X":np.random.random((1000,)),
})
%%timeit
df.filter(items=['X'])
Out:
87.5 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df.loc[:,'X']
Out:
8.52 µs ± 69.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
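One caveat about the comparison above: df.filter(items=['X']) returns a DataFrame while df.loc[:, 'X'] returns a Series, so the two calls don't do quite the same work. Passing a list to loc makes it like-for-like (a minimal sketch):
import pandas as pd
import numpy as np

df = pd.DataFrame({"X": np.random.random(1000), "Y": np.random.random(1000)})

a = df.filter(items=["X"])   # DataFrame containing only column X
b = df.loc[:, ["X"]]         # a list selector keeps the DataFrame shape as well
print(a.equals(b))           # True: same data, same labels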

Fastest way to set a single value in a dataframe?

I have a big dataframe where I need to iteratively perform some calculations and set a subset of the dataframe, as illustrated in the example below. The example below has a 1000 x 100 index, but in my real dataset the [100] isn't fixed: sometimes there are more, sometimes fewer. The other complication is that on my real dataset df.loc[0]._is_view returns False (not sure why).
So even though the first option below df.loc[0, 'C'] is faster, I couldn't really use it. I've been using the second option df.loc[df.index.get_level_values('A') == 0, 'C'], which takes twice as long.
Does anyone know of a faster way to edit a subset of the dataframe?
import numpy as np
import pandas as pd
df = pd.DataFrame(
    np.random.normal(0.0, 1.0, size=(100000, 2)),
    index=pd.MultiIndex.from_product(
        [list(range(1000)), list(range(100))], names=["A", "B"]
    ),
    columns=["C", "D"],
)
%%timeit
df.loc[0, 'C'] = 1.
870 µs ± 91.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df.loc[df.index.get_level_values('A') == 0, 'C'] = 1.
1.41 ms ± 4.71 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
An alternative to loc which, in most cases, should be faster is the pandas at property:
%%timeit
df.loc[0, 'C'] = 1
171 µs ± 31.9 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Whereas:
%%timeit
df.at[0, 'C'] = 1
153 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
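For completeness: at does scalar access by an exact label, so with a MultiIndex it is normally given the full key tuple rather than a partial key (a minimal sketch, independent of the timings above):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.normal(0.0, 1.0, size=(6, 2)),
    index=pd.MultiIndex.from_product([[0, 1, 2], [0, 1]], names=["A", "B"]),
    columns=["C", "D"],
)

df.at[(0, 1), "C"] = 1.0   # full (A, B) key plus column label -> one cell
print(df.loc[0, "C"])      # partial key with loc still returns the whole A == 0 slice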

Extremely slow on np.recarray assignment

I'm storing ticks in an ndarray; each tick has a utc_timestamp (str) as index and tick prices/volumes as values, so I have an array with 2 different dtypes (str and float). This is the way I store it as a np.recarray:
data = np.recarray((100,), dtype=[('time', 'U23'), ('ask1', 'f'), ('bid1', 'f')])
tick = ['2021-04-28T09:38:30.928', 14.21, 14.2]
# assigning this tick to the end of data, weird
%%timeit
data[-1] = np.rec.array(tick)
1.38 ms ± 13.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That takes 1.38 ms per loop! Plus, I can't set the last row using data[-1] = tick, which raises:
ValueError: setting an array element with a sequence
Let's try a plain ndarray instead: say I have 2 separate arrays, one for str and one for float.
%%timeit
data[:, -1] = tick[1:]
15.2 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
See? That's ~90x faster! Why is that?
My times are quite a bit better:
In [503]: timeit data[-1] = np.rec.array(tick)
64.4 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
np.rec.array(tick) creates an array with dtype=[('f0', '<U23'), ('f1', '<f8'), ('f2', '<f8')]. I get better speed if I use the final dtype.
In [504]: timeit data[-1] = np.rec.array(tick, data.dtype)
31.1 µs ± 22.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The bulk of that time goes into creating the one-element recarray:
In [516]: %timeit x = np.rec.array(tick, data.dtype)
29.9 µs ± 41.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Making a structured array instead:
In [517]: %timeit x = np.array(tuple(tick), data.dtype)
2.71 µs ± 15.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [519]: timeit data[-1] = np.array(tuple(tick), data.dtype)
3.58 µs ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
So skipping recarray entirely:
In [521]: data = np.zeros((100,), dtype=[('time','U23'),('ask1','f'),('bid1','f')])
...: tick = ('2021-04-28T09:38:30.928',14.21,14.2)
In [522]: data[-1] = np.array(tick, data.dtype)
In [523]: data[-2:]
Out[523]:
array([('', 0. , 0. ), ('2021-04-28T09:38:30.928', 14.21, 14.2)],
dtype=[('time', '<U23'), ('ask1', '<f4'), ('bid1', '<f4')])
I think recarray has largely been replaced by structured arrays. The main thing recarray adds is the ability to address fields as attributes:
data.time, data.ask1
data['time'], data['ask1']
Your example shows that recarray slows things down.
Edit:
The tuple tick can be assigned directly without extra conversion:
In [526]: timeit data[-1] = tick
365 ns ± 0.247 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
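Putting the pieces together, a minimal sketch of the plain structured-array approach; the recarray view at the end is only needed if attribute-style access is wanted and does not affect assignment speed:
import numpy as np

data = np.zeros((100,), dtype=[('time', 'U23'), ('ask1', 'f'), ('bid1', 'f')])
tick = ('2021-04-28T09:38:30.928', 14.21, 14.2)

data[-1] = tick                            # direct tuple assignment, no np.rec.array needed
print(data['time'][-1], data['ask1'][-1])  # field access by name on the structured array

rec = data.view(np.recarray)               # optional recarray view for attribute access
print(rec.ask1[-1])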

Is there a way to optimize cumprod in Python?

I have a pandas DataFrame df and would like to perform the following calculation in a function. The line that takes by far the longest is a cumprod. Is there a way to speed this up? In numpy there are often different ways to achieve the same result, e.g. np.inner vs np.einsum, and I was wondering if one can do something similar here.
import pandas as pd
In [122]: import numpy as np
In [123]: df = pd.DataFrame(np.random.randn(100000, 1000))
In [124]: %time ((1+df).cumprod(axis=0)-1)
CPU times: user 5.22 s, sys: 884 ms, total: 6.1 s
Wall time: 6.12 s
You could do the computation in NumPy instead of Pandas.
For your input sizes the gain is on the order of ~5%, not exciting but better than nothing. For smaller inputs, the gains are much more significant.
import pandas as pd
import numpy as np
arr = np.random.randn(100000, 1000)
df = pd.DataFrame(arr)
x = ((1 + df).cumprod(axis=0) - 1)
y = np.cumprod(1 + arr, axis=0) - 1
print(np.allclose(x, y))
Given that this is the same result, the timings are:
arr = np.random.randn(100000, 1000)
df = pd.DataFrame(arr)
%timeit ((1 + df).cumprod(axis=0) - 1)
# 3.64 s ± 76.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.cumprod(1 + arr, axis=0) - 1
# 3.42 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
showing the aforementioned speed gains for your inputs.
For smaller inputs, the difference is larger, e.g.:
arr = np.random.randn(1000, 10)
df = pd.DataFrame(arr)
%timeit ((1 + df).cumprod(axis=0) - 1)
# 469 µs ± 4.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.cumprod(1 + arr, axis=0) - 1
# 36.6 µs ± 427 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
showing that in this case doing the computation in NumPy is ~13x faster than in Pandas.
EDIT:
As suggested by @hpaulj, np.multiply.accumulate() can be slightly faster.
# for shape = (100000, 1000)
%timeit np.multiply.accumulate(1 + arr, axis=0) - 1
# 3.38 s ± 79.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and, for smaller inputs:
# for shape = (1000, 10)
%timeit np.multiply.accumulate(1 + arr, axis=0) - 1
# 35.8 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
But, as always, these kinds of micro-benchmarks should be taken with a grain of salt, especially when such small differences are observed.
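If memory pressure matters as much as speed, the temporaries created by 1 + arr and the trailing - 1 can be folded into a single buffer via cumprod's out= argument (a sketch, not benchmarked here):
import numpy as np

arr = np.random.randn(100000, 1000)

tmp = arr + 1.0                    # one explicit temporary
np.cumprod(tmp, axis=0, out=tmp)   # cumulative product written back in place
tmp -= 1.0                         # final result, no further allocations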
If you are willing to use other modules to speed up your calculations, I can recommend numba. Numba compiles Python code via LLVM and specifically aims to speed up numeric calculations that use numpy.
Since numba does not yet support using kwargs like axis=0 with np.cumprod, your code will look like this:
import numpy as np
import pandas as pd
import numba as nb
@nb.njit(parallel=True)
def nb_cumprod(arr):
    y = np.empty_like(arr)
    for i in range(arr.shape[1]):
        y[:, i] = np.cumprod(1 + arr[:, i]) - 1
    return y
arr = np.random.randn(100000, 1000)
df = pd.DataFrame(arr)
x = ((1 + df).cumprod(axis=0) - 1)
y = np.cumprod(1 + arr, axis=0) - 1
z = nb_cumprod(arr)
print(np.allclose(x, z))
And some timings show that numba is about 4 times faster than using cumprod on a DataFrame and about 3.7 times faster than using numpy:
%timeit ((1 + df).cumprod(axis=0) - 1)
# 6.83 s ± 482 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.cumprod(1 + arr, axis=0) - 1
# 6.38 s ± 509 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nb_cumprod(arr)
# 1.71 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use additional options like fastmath=True to increase the performance even further, but this will yield slightly different results.
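For reference, the fastmath=True variant mentioned above is a one-line change to the decorator; the sketch below also switches the column loop to nb.prange so parallel=True can actually spread the columns across threads (results may differ in the last bits because fastmath reorders floating-point operations):
import numpy as np
import numba as nb

@nb.njit(parallel=True, fastmath=True)
def nb_cumprod_fast(arr):
    y = np.empty_like(arr)
    for i in nb.prange(arr.shape[1]):   # explicit parallel loop over columns
        y[:, i] = np.cumprod(1 + arr[:, i]) - 1
    return y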

Efficiently extract rows from a pandas DataFrame, ignoring missing index labels

I am looking for a more efficient equivalent of
df.reindex(labels).dropna(subset=[0])
that avoids including the NaN rows for missing labels in the result, rather than having to delete them after reindex puts them in.
Equivalently, I am looking for an efficient version of
df.loc[labels]
that silently ignores labels that are not in df.index, i.e. the result may have fewer rows than labels has elements.
I need something that is efficient when the numbers of rows, columns and labels are all large and there is a significant miss rate. Specifically, I'm looking for something sublinear in the length of the dataset.
Update 1
Here is a concrete demonstration of the issue, following on from @MaxU's answer:
In [2]: L = 10**7
...: M = 10**4
...: N = 10**9
...: np.random.seed([3, 1415])
...: df = pd.DataFrame(np.random.rand(L, 2))
...: labels = np.random.randint(N, size=M)
...: M-len(set(labels))
...:
...:
Out[2]: 0
In [3]: %timeit df[df.index.isin(set(labels))]
904 ms ± 59.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit df.loc[df.index.intersection(set(labels))]
207 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit df.loc[np.intersect1d(df.index, labels)]
427 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit df.loc[labels[labels<L]]
329 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit df.iloc[labels[labels<L]]
161 µs ± 8.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The last 2 examples are ~1000 times faster than those iterating over df.index. This demonstrates that df.loc[labels] does not iterate over the index and that dataframes have an efficient index structure, i.e. df.index does indeed index.
So the question is: how do I get something as efficient as df.loc[labels[labels<L]] when df.index is not a contiguous sequence of numbers? A partial solution is the original
In [8]: %timeit df.reindex(labels).dropna(subset=[0])
1.81 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That is still ~100 times faster than the suggested solutions, but still loses an order of magnitude to what may be possible.
Update 2
To further demonstrate that it is possible to get sublinear performance even without assumptions on the index, repeat the above with a string index:
In [16]: df.index=df.index.map(str)
...: labels = np.array(list(map(str, labels)))
...:
...:
In [17]: %timeit df[df.index.isin(set(labels))]
657 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %timeit df.loc[df.index.intersection(set(labels))]
974 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit df.reindex(labels).dropna()
8.7 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So, to be clear, I am after something that is more efficient than df.reindex(labels).dropna(). This is already sublinear in df.shape[0] and makes no assumptions about the index, so the solution should be too.
The issue I want to address is that df.reindex(labels) will include NaN rows for missing labels that then need removing with dropna. I am after an equivalent of df.reindex(labels) that does not put them there in the first place, without scanning the entire df.index to figure out the missing labels. This must be possible at least in principle: If reindex can efficiently handle missing labels on the fly by inserting dummy rows, it should be possible to handle them even more efficiently on the fly by doing nothing.
Here is a small comparison of different approaches.
Sample DF (shape: 10,000,000 x 2):
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(10**7, 2))
labels = np.random.randint(10**9, size=10**4)
In [88]: df.shape
Out[88]: (10000000, 2)
valid (existing labels):
In [89]: (labels <= 10**7).sum()
Out[89]: 1008
invalid (non-existing labels):
In [90]: (labels > 10**7).sum()
Out[90]: 98992
Timings:
In [103]: %timeit df[df.index.isin(set(labels))]
943 ms ± 7.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [104]: %timeit df.loc[df.index.intersection(set(labels))]
360 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [105]: %timeit df.loc[np.intersect1d(df.index, labels)]
513 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
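One more approach worth timing (a sketch, assuming df.index is unique; not benchmarked here): Index.get_indexer does one hash lookup per label and returns -1 for misses, so missing labels can be dropped before any rows are materialized, without scanning df.index:
pos = df.index.get_indexer(labels)   # positional index per label, -1 where the label is absent
result = df.iloc[pos[pos != -1]]     # keep only the hits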