How to speed up df.query - pandas

df.query uses numexpr under the hood, but is much slower than pure numexpr
Let's say I have a big DataFrame:
from random import shuffle
import pandas as pd
l1=list(range(1000000))
l2=list(range(1000000))
shuffle(l1)
shuffle(l2)
df = pd.DataFrame([l1, l2]).T
df=df.sample(frac=1)
df=df.rename(columns={0: 'A', 1:'B'})
And I want to compare 2 columns:
%timeit (df.A == df.B) | (df.A / df.B < 1) | (df.A * df.B > 3000)
10.8 ms ± 309 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It takes 10.8 ms in this example.
Now I import numexpr and do the same thing:
import numexpr
a = df.A.__array__()
b = df.B.__array__()
%timeit numexpr.evaluate('((a == b) | (a / b < 1) | (a * b > 3000))')
1.95 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
numexpr is roughly 5.5 times faster than pandas here (10.8 ms vs 1.95 ms)
Now let's use df.loc:
%timeit df.loc[numexpr.evaluate('((a == b) | (a / b < 1) | (a * b > 3000))')]
20.5 ms ± 155 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.loc[(df.A == df.B) | (df.A / df.B < 1) | (df.A * df.B > 3000)]
27 ms ± 296 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.query('((A == B) | (A / B < 1) | (A * B > 3000))')
32.5 ms ± 80.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
numexpr is still significantly faster than pure pandas. But why is df.query so slow when it uses numexpr under the hood? Is there a way to fix that, or any other way to use numexpr in pandas without a lot of tweaking?
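One low-tweak workaround, shown only as a sketch (ne_query is a hypothetical helper name, not a pandas or numexpr API, and it assumes the column names are valid Python identifiers): hand numexpr the columns' NumPy arrays directly and index the frame with the resulting boolean mask, which keeps the fast evaluation path from the timings above.
import numexpr
def ne_query(df, expr):
    # map each column name to its underlying NumPy array for numexpr
    local_dict = {col: df[col].to_numpy() for col in df.columns}
    mask = numexpr.evaluate(expr, local_dict=local_dict)
    return df[mask]  # boolean ndarray indexing keeps only the matching rows
# usage with the df built above:
# ne_query(df, '(A == B) | (A / B < 1) | (A * B > 3000)')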

Related

What is the difference between using filter vs loc to select only certain columns?

df.filter(items=["A","D"])
vs
df.loc[:, ["A","D"]]
What is the difference between the above options to select all rows from 2 columns?
When should I use filter and when should I use loc?
In addition to the filtering capabilities provided by the filter method (see the documentation), the loc method is much faster.
import pandas as pd
import numpy as np
df = pd.DataFrame({"X":np.random.random((1000,)),
})
%%timeit
df.filter(items=['X'])
Out:
87.5 µs ± 2.55 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df.loc[:,'X']
Out:
8.52 µs ± 69.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
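For completeness, the filtering capabilities mentioned above are filter's like and regex label matching, which loc does not offer directly. A small illustrative sketch (made-up column names):
df = pd.DataFrame(np.random.random((5, 3)), columns=['X1', 'X2', 'Y'])
df.filter(items=['X1', 'X2'])   # exact labels, equivalent to df.loc[:, ['X1', 'X2']]
df.filter(like='X')             # columns whose name contains the substring 'X'
df.filter(regex=r'^X\d$')       # columns matching a regular expression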

Pandas: is there a difference of speed between the deprecated df.sum(level=...) and df.groupby(level=...).sum()?

I'm using pandas and noticed a HUGE difference in performance between these two statements:
df.sum(level=['My', 'Index', 'Levels']) # uses numpy sum which is vectorized
and
df.groupby(level=['My', 'Index', 'Levels']).sum() # Slow...
The first example uses numpy.sum, which is vectorized, as stated in the documentation.
Unfortunately, using sum(level=...) is deprecated in the API and produces an ugly warning:
FutureWarning: Using the level keyword in DataFrame and Series
aggregations is deprecated and will be removed in a future version.
Use groupby instead. df.sum(level=1) should use
df.groupby(level=1).sum()
I don't want to use the non-vectorized version and suffer poor processing performance. How can I use numpy.sum together with groupby?
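A possible NumPy-based workaround, shown as a self-contained sketch under the assumption of a single grouping level and a plain sum (names like KEY and VALUE are illustrative): factorize the index and feed the integer codes to np.bincount.
import numpy as np
import pandas as pd
n = 1_000
key = np.random.randint(0, 10, size=n)
s = pd.Series(np.random.rand(n), index=pd.Index(key, name='KEY'), name='VALUE')
codes, uniques = pd.factorize(s.index)           # integer code per distinct key
sums = np.bincount(codes, weights=s.to_numpy())  # vectorized grouped sum in NumPy
result = pd.Series(sums, index=uniques, name='VALUE')
The result comes back in order of first appearance rather than sorted, so it matches s.groupby(level='KEY').sum() only up to ordering.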
Edit: following the comments, here is a basic test I have done: pandas 1.4.4, 10k random rows, 10 levels (index).
import pandas as pd
import numpy as np
print('pandas:', pd.__version__)
nb_lines = int(1e4)
nb_levels = 10
# 10k rows x 10 columns of random integers in [0, nb_levels - 2] (randint's upper bound is exclusive)
ix = np.random.randint(0, nb_levels-1, size=(nb_lines, nb_levels))
cols = [chr(65+i) for i in range(nb_levels)] # A, B, C, ...
df = pd.DataFrame(ix, columns=cols)
df = df.set_index(cols)
df['VALUE'] = np.random.rand(nb_lines) # random values to aggregate
print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)
And the result is:
pandas: 1.4.4
with groupby:
5.51 ms ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
4.93 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
This is just an example, but the result is always faster without groupby. Changing parameters (levels, step size for the columns to group on, etc) does not change the result.
In the end, for a big data set, you can see the difference between the two methods (numpy.sum vs other).
@mozway, your results indicate similar performance; however, if you increase the number of levels, you should see the groupby version get worse, at least on my computer. See the edited code so you can change the number of levels (example with 10 levels and 100k rows):
import pandas as pd
import numpy as np
from string import ascii_uppercase as UP
np.random.seed(0)
N = 100_000
nb_levels = 10
cols = [chr(65+i) for i in range(nb_levels)]
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_levels)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)
print(pd.__version__)
print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)
... and the result:
1.4.4
with groupby:
50.8 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
42 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
Thanks
This doesn't seem to be true; both approaches have similar speed.
Setup (3 levels, 26 groups each, ~18k combinations of groups, 1M rows):
import pandas as pd
import numpy as np
from string import ascii_uppercase as UP
np.random.seed(0)
N = 1_000_000
cols = ['A', 'B', 'C']
df = pd.DataFrame({'A': np.random.choice(list(UP), size=N),
                   'B': np.random.choice(list(UP), size=N),
                   'C': np.random.choice(list(UP), size=N),
                   'num': np.random.random(size=N)}).set_index(cols)
Test:
pd.__version__
1.4.4
%%timeit # 3 times
df.sum(level=cols)
316 ms ± 85.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
287 ms ± 21.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
297 ms ± 54.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit # 3 times
df.groupby(level=cols).sum()
284 ms ± 41.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
286 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
311 ms ± 31.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Updated example from OP:
import pandas as pd
import numpy as np
from string import ascii_uppercase as UP
np.random.seed(0)
N = 1_000_000
nb_cols = 10
cols = [chr(65+i) for i in range(nb_cols)]
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_cols)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)
print(pd.__version__)
1.5.0
%%timeit
df.sum(level=cols)
3.36 s ± 478 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df.groupby(level=cols).sum()
2.94 s ± 444 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
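If sorting of the group keys matters on a given dataset, one knob worth benchmarking (a suggestion to test on your own data, not a measured result or a claim about what sum(level=...) does internally) is groupby's sort parameter, reusing the df and cols built just above:
df.groupby(level=cols, sort=False).sum()   # groupby sorts the group keys by default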

Contains and append to the column cell

If df.ColumnA.str.contains("ABC"), I want to append " apple" to ColumnA.
e.g. ColumnA -> "ABC Company"; after running the code, ColumnA -> "ABC Company apple".
May I know what is the fastest way to achieve this, if I don't want to use a for loop?
You can use np.where:
df['ColumnA'] = np.where(df.ColumnA.str.contains("ABC"),
                         df.ColumnA + 'myvalue', df.ColumnA)
Use DataFrame.loc, which is about as fast as np.where because it processes only the matched rows:
df.loc[df.ColumnA.str.contains("ABC"), 'ColumnA'] += ' apple'
Performance: For 400k rows, 50% matched
df = pd.DataFrame({'ColumnA':['ABC Company','ABC','temp', 'DDD']* 100000})
print (df)
In [82]: %timeit df['ColumnA'] = np.where(df.ColumnA.str.contains("ABC"), df.ColumnA + 'myvalue', df.ColumnA)
329 ms ± 10.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [83]: %timeit df.loc[df.ColumnA.str.contains("ABC"), 'ColumnA'] += ' apple'
323 ms ± 6.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If there are fewer matched values (here 25%), loc is faster:
df = pd.DataFrame({'ColumnA':['ABC Company','ee','temp', 'DDD']* 100000})
#print (df)
In [89]: %timeit df['ColumnA'] = np.where(df.ColumnA.str.contains("ABC"), df.ColumnA + 'myvalue', df.ColumnA)
306 ms ± 5.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [90]: %timeit df.loc[df.ColumnA.str.contains("ABC"), 'ColumnA'] += ' apple'
269 ms ± 4.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
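Another option, shown only as a sketch and not benchmarked here: build the boolean mask once and use Series.mask to rewrite just the matching rows.
import pandas as pd
df = pd.DataFrame({'ColumnA': ['ABC Company', 'ee', 'temp', 'DDD'] * 100000})
mask = df.ColumnA.str.contains('ABC')
df['ColumnA'] = df['ColumnA'].mask(mask, df['ColumnA'] + ' apple')  # rewrite only where mask is True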

is there a way to optimize cumprod in python?

I have a pandas DataFrame df and would like to perform the following calculation in a function. The line that takes by far the longest is a cumprod, and I was wondering if there is a way to speed it up. In numpy there are often different ways to achieve the same result, e.g. np.inner vs np.einsum, and I was wondering if something similar applies here.
import pandas as pd
In [122]: import numpy as np
In [123]: df = pd.DataFrame(np.random.randn(100000, 1000))
In [124]: %time ((1+df).cumprod(axis=0)-1)
CPU times: user 5.22 s, sys: 884 ms, total: 6.1 s
Wall time: 6.12 s
You could do the computation in NumPy instead of Pandas.
For your input sizes the speed-up will be on the order of ~5%, not exciting but better than nothing. For smaller inputs, the gains are much more significant.
import pandas as pd
import numpy as np
arr = np.random.randn(100000, 1000)
df = pd.DataFrame(arr)
x = ((1 + df).cumprod(axis=0) - 1)
y = np.cumprod(1 + arr, axis=0) - 1
print(np.allclose(x, y))
Given that this is the same result, the timings are:
arr = np.random.randn(100000, 1000)
df = pd.DataFrame(arr)
%timeit ((1 + df).cumprod(axis=0) - 1)
# 3.64 s ± 76.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.cumprod(1 + arr, axis=0) - 1
# 3.42 s ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
showing the aforementioned speed gains for your inputs.
For smaller inputs, the difference is larger, e.g.:
arr = np.random.randn(1000, 10)
df = pd.DataFrame(arr)
%timeit ((1 + df).cumprod(axis=0) - 1)
# 469 µs ± 4.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit np.cumprod(1 + arr, axis=0) - 1
# 36.6 µs ± 427 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
showing that in this case doing the computation in NumPy is ~13x faster than in Pandas.
EDIT:
As suggested by @hpaulj, np.multiply.accumulate() can be slightly faster.
# for shape = (100000, 1000)
%timeit np.multiply.accumulate(1 + arr, axis=0) - 1
# 3.38 s ± 79.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
and, for smaller inputs:
# for shape = (1000, 10)
%timeit np.multiply.accumulate(1 + arr, axis=0) - 1
# 35.8 µs ± 423 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
But, as always, this kind of micro-benchmark should be taken with a grain of salt, especially when such small differences are observed.
If you are willing to use other modules to speed up your calculations, I can recommend numba. Numba compiles Python code via LLVM and specifically aims at speeding up numeric calculations that use numpy.
Since numba does not yet support using kwargs like axis=0 with np.cumprod, your code will look like this:
import numpy as np
import pandas as pd
import numba as nb
@nb.njit(parallel=True)
def nb_cumprod(arr):
    y = np.empty_like(arr)
    for i in range(arr.shape[1]):
        y[:, i] = np.cumprod(1 + arr[:, i]) - 1
    return y
arr = np.random.randn(100000, 1000)
df = pd.DataFrame(arr)
x = ((1 + df).cumprod(axis=0) - 1)
y = np.cumprod(1 + arr, axis=0) - 1
z = nb_cumprod(arr)
print(np.allclose(x, z))
And some timings show that numba is about 4 times faster than using cumprod on a DataFrame and about 3.7 times faster than using numpy:
%timeit ((1 + df).cumprod(axis=0) - 1)
# 6.83 s ± 482 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.cumprod(1 + arr, axis=0) - 1
# 6.38 s ± 509 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit nb_cumprod(arr)
# 1.71 s ± 158 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use additional options like fastmath=True to increase the performance even further, but this will yield slightly different results.
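For reference, a sketch of what that might look like; nb_cumprod_fast is a hypothetical name, and the nb.prange is my substitution for the plain range used above. fastmath relaxes strict IEEE floating-point rules, which is why results can differ slightly.
import numpy as np
import numba as nb
@nb.njit(parallel=True, fastmath=True)
def nb_cumprod_fast(arr):
    y = np.empty_like(arr)
    # explicit parallel loop over columns; each column gets its own cumprod
    for i in nb.prange(arr.shape[1]):
        y[:, i] = np.cumprod(1.0 + arr[:, i]) - 1.0
    return y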

efficiently extract rows from a pandas DataFrame ignoring missing index labels

I am looking for a more efficient equivalent of
df.reindex(labels).dropna(subset=[0])
that avoids including NaN rows for missing labels in the result, rather than having to delete them after reindex puts them in.
Equivalently, I am looking for an efficient version of
df.loc[labels]
that silently ignores labels that are not in df.index, i.e. the result may have fewer rows than labels has elements.
I need something that is efficient when the numbers of rows, columns and labels are all large and there is a significant miss rate. Specifically, I'm looking for something sublinear in the length of the dataset.
Update 1
Here is a concrete demonstration of the issue, following on from @MaxU's answer:
In [2]: L = 10**7
...: M = 10**4
...: N = 10**9
...: np.random.seed([3, 1415])
...: df = pd.DataFrame(np.random.rand(L, 2))
...: labels = np.random.randint(N, size=M)
...: M-len(set(labels))
Out[2]: 0
In [3]: %timeit df[df.index.isin(set(labels))]
904 ms ± 59.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [4]: %timeit df.loc[df.index.intersection(set(labels))]
207 ms ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [5]: %timeit df.loc[np.intersect1d(df.index, labels)]
427 ms ± 37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [6]: %timeit df.loc[labels[labels<L]]
329 µs ± 23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [7]: %timeit df.iloc[labels[labels<L]]
161 µs ± 8.35 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
The last 2 examples are ~1000 times faster than those iterating over df.index. This demonstrates that df.loc[labels] does not iterate over the index and that dataframes have an efficient index structure, i.e. df.index does indeed index.
So the question is: how do I get something as efficient as df.loc[labels[labels<L]] when df.index is not a contiguous sequence of numbers? A partial solution is the original
In [8]: %timeit df.reindex(labels).dropna(subset=[0])
1.81 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That is still ~100 times faster than the suggested solutions, but still loses an order of magnitude to what may be possible.
Update 2
To further demonstrate that it is possible to get sublinear performance even without assumptions on the index, repeat the above with a string index:
In [16]: df.index=df.index.map(str)
...: labels = np.array(list(map(str, labels)))
In [17]: %timeit df[df.index.isin(set(labels))]
657 ms ± 48.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %timeit df.loc[df.index.intersection(set(labels))]
974 ms ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %timeit df.reindex(labels).dropna()
8.7 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
So to be clear, I am after something that is more efficient than df.reindex(labels).dropna(). This is already sublinear in df.shape[0] and makes no assumptions about the index, so the solution should be too.
The issue I want to address is that df.reindex(labels) will include NaN rows for missing labels that then need removing with dropna. I am after an equivalent of df.reindex(labels) that does not put them there in the first place, without scanning the entire df.index to figure out the missing labels. This must be possible at least in principle: If reindex can efficiently handle missing labels on the fly by inserting dummy rows, it should be possible to handle them even more efficiently on the fly by doing nothing.
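One sketch of such an approach, assuming df.index is unique (hypothetical code, with smaller sizes than Update 1 for brevity): resolve the labels with Index.get_indexer, which returns -1 for labels that are absent, and keep only the positions that were found.
import numpy as np
import pandas as pd
np.random.seed([3, 1415])
L, M, N = 10**6, 10**4, 10**8
df = pd.DataFrame(np.random.rand(L, 2))
labels = np.random.randint(N, size=M)
indexer = df.index.get_indexer(labels)      # position of each requested label, -1 if missing
result = df.iloc[indexer[indexer != -1]]    # drop the misses, no full scan of df.index
Like reindex, this preserves the order (and duplicates) of labels minus the missing ones, and it works just as well for the string index of Update 2 as long as the index is unique.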
Here is a small comparison of different approaches.
Sample DF (shape: 10,000,000 x 2):
np.random.seed([3, 1415])
df = pd.DataFrame(np.random.rand(10**7, 2))
labels = np.random.randint(10**9, size=10**4)
In [88]: df.shape
Out[88]: (10000000, 2)
valid (existing labels):
In [89]: (labels <= 10**7).sum()
Out[89]: 1008
invalid (non-existing labels):
In [90]: (labels > 10**7).sum()
Out[90]: 98992
Timings:
In [103]: %timeit df[df.index.isin(set(labels))]
943 ms ± 7.86 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [104]: %timeit df.loc[df.index.intersection(set(labels))]
360 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [105]: %timeit df.loc[np.intersect1d(df.index, labels)]
513 ms ± 655 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)