Select a multiple-key cross section from a DataFrame - pandas

I have a DataFrame "df" with (time,ticker) Multiindex and bid/ask/etc data columns:
tod last bid ask volume
time ticker
2013-02-01 SPY 1600 149.70 150.14 150.17 1300
SLV 1600 30.44 30.38 30.43 3892
GLD 1600 161.20 161.19 161.21 3860
I would like to select a second-level (level=1) cross section using multiple keys. Right now, I can do it using one key, i.e.
df.xs('SPY', level=1)
which gives me a timeseries of SPY. What is the best way to select a multi-key cross section, i.e. a combined cross-section of both SPY and GLD, something like:
df.xs(['SPY', 'GLD'], level=1)
?

There are better ways of doing this with more recent versions of Pandas (see Multi-indexing using slicers in the changelog for version 0.14):
regression_df.loc[(slice(None), ['SPY', 'GLD']), :]
This can be made more readable with the use of pd.IndexSlice:
df.loc[pd.IndexSlice[:, ['SPY', 'GLD']], :]
With the convention idx = pd.IndexSlice, this becomes
df.loc[idx[:, ['SPY', 'GLD']], :]

I couldn't find a more direct way other than using select:
>>> df
last tod
A SPY 1 1600
SLV 2 1600
GLD 3 1600
>>> df.select(lambda x: x[1] in ['SPY','GLD'])
last tod
A SPY 1 1600
GLD 3 1600

For what it is worth, I did the following:
foo = pd.DataFrame(np.random.rand(12,3),
index=pd.MultiIndex.from_product([['A','B','C','D'],['Green','Red','Blue']],
names=['Letter','Color']),
columns=['X','Y','Z']).sort_index()
foo.reset_index()\
.loc[foo.reset_index().Color.isin({'Green','Red'})]\
.set_index(foo.index.names)
This approach is similar to select, but avoids iterating over all rows with a lambda.
However, I compared this to the Panel approach, and it appears the Panel solution is faster (2.91 ms for index/loc vs 1.48 ms for to_panel/to_frame:
foo.to_panel()[:,:,['Green','Red']].to_frame()
Times:
In [56]:
%%timeit
foo.reset_index().loc[foo.reset_index().Color.isin({'Green','Red'})].set_index(foo.index.names)
100 loops, best of 3: 2.91 ms per loop
In [57]:
%%timeit
foo2 = foo.reset_index()
foo2.loc[foo2.Color.eq('Green') | foo2.Color.eq('Red')].set_index(foo.index.names)
100 loops, best of 3: 2.85 ms per loop
In [58]:
%%timeit
foo2 = foo.reset_index()
foo2.loc[foo2.Color.ne('Blue')].set_index(foo.index.names)
100 loops, best of 3: 2.37 ms per loop
In [54]:
%%timeit
foo.to_panel()[:,:,['Green','Red']].to_frame()
1000 loops, best of 3: 1.18 ms per loop
UPDATE
After revisiting this topic (again), I observed the following:
In [100]:
%%timeit
foo2 = pd.DataFrame({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}).transpose()
foo2.index.names = foo.index.names
foo2.columns.names = foo2.columns.names
100 loops, best of 3: 1.97 ms per loop
In [101]:
%%timeit
foo2 = pd.DataFrame.from_dict({k: foo.loc[k] for k in foo.index if k[1] in ['Green','Red']}, orient='index')
foo2.index.names = foo.index.names
foo2.columns.names = foo2.columns.names
100 loops, best of 3: 1.82 ms per loop
If you don't care about preserving the original order and naming of the levels, you can use:
%%timeit
pd.concat({key: foo.xs(key, axis=0, level=1) for key in ['Green','Red']}, axis=0)
1000 loops, best of 3: 1.31 ms per loop
And if you are just selecting on the first level:
%%timeit
pd.concat({key: foo.loc[key] for key in ['A','B']}, axis=0, names=foo.index.names)
1000 loops, best of 3: 1.12 ms per loop
versus:
%%timeit
foo.to_panel()[:,['A','B'],:].to_frame()
1000 loops, best of 3: 1.16 ms per loop
Another Update
If you sort the index of the example foo, many of the times above improve (times have been updated to reflect a pre-sorted index). However, when the index is sorted, you can use the solution described by user674155:
%%timeit
foo.loc[(slice(None), ['Blue','Red']),:]
1000 loops, best of 3: 582 µs per loop
This is the most efficient and intuitive in my opinion (the user doesn't need to understand panels and how they are created from frames).
Note: even if the index has not yet been sorted, sorting the index of foo on the fly is comparable in performance to the to_panel option.

Convert to a panel, then indexing is direct
In [20]: df = pd.DataFrame(dict(time = pd.Timestamp('20130102'),
A = np.random.rand(3),
ticker=['SPY','SLV','GLD'])).set_index(['time','ticker'])
In [21]: df
Out[21]:
A
time ticker
2013-01-02 SPY 0.347209
SLV 0.034832
GLD 0.280951
In [22]: p = df.to_panel()
In [23]: p
Out[23]:
<class 'pandas.core.panel.Panel'>
Dimensions: 1 (items) x 1 (major_axis) x 3 (minor_axis)
Items axis: A to A
Major_axis axis: 2013-01-02 00:00:00 to 2013-01-02 00:00:00
Minor_axis axis: GLD to SPY
In [24]: p.ix[:,:,['SPY','GLD']]
Out[24]:
<class 'pandas.core.panel.Panel'>
Dimensions: 1 (items) x 1 (major_axis) x 2 (minor_axis)
Items axis: A to A
Major_axis axis: 2013-01-02 00:00:00 to 2013-01-02 00:00:00
Minor_axis axis: SPY to GLD

Related

Absurdly slow simple pandas apply function [duplicate]

I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".
I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:
If apply is so bad, then why is it in the API?
How and when should I make my code apply-free?
Are there ever any situations where apply is good (better than other possible solutions)?
apply, the Convenience Function you Never Needed
We start by addressing the questions in the OP, one by one.
"If apply is so bad, then why is it in the API?"
DataFrame.apply and Series.apply are convenience functions defined on DataFrame and Series object respectively. apply accepts any user defined function that applies a transformation/aggregation on a DataFrame. apply is effectively a silver bullet that does whatever any existing pandas function cannot do.
Some of the things apply can do:
Run any user-defined function on a DataFrame or Series
Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
Perform index alignment while applying the function
Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
Perform element-wise transformations
Broadcast aggregated results to original rows (see the result_type argument).
Accept positional/keyword arguments to pass to the user-defined functions.
...Among others. For more information, see Row or Column-wise Function Application in the documentation.
So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies your function to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory bounded applications.
There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.
Let's address the next question.
"How and when should I make my code apply-free?"
To rephrase, here are some common situations where you will want to get rid of any calls to apply.
Numeric Data
If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).
Contrast the performance of apply for a simple addition operation.
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
<!- ->
df.apply(np.sum)
A 16
B 28
dtype: int64
df.sum()
A 16
B 28
dtype: int64
Performance wise, there's no comparison, the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.
%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even if you enable passing raw arrays with the raw argument, it's still twice as slow.
%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another example:
df.apply(lambda x: x.max() - x.min())
A 8
B 8
dtype: int64
df.max() - df.min()
A 8
B 8
dtype: int64
%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()
2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, seek out vectorized alternatives if possible.
String/Regex
Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.
A common problem is to check whether a value in a column is present in another column of the same row.
df = pd.DataFrame({
'Name': ['mickey', 'donald', 'minnie'],
'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
'Value': [20, 10, 86]})
df
Name Value Title
0 mickey 20 wonderland
1 donald 10 welcome to donald's castle
2 minnie 86 Minnie mouse clubhouse
This should return the row second and third row, since "donald" and "minnie" are present in their respective "Title" columns.
Using apply, this would be done using
df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)
0 False
1 True
2 True
dtype: bool
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
However, a better solution exists using list comprehensions.
df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
<!- ->
%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The thing to note here is that iterative routines happen to be faster than apply, because of the lower overhead. If you need to handle NaNs and invalid dtypes, you can build on this using a custom function you can then call with arguments inside the list comprehension.
For more information on when list comprehensions should be considered a good option, see my writeup: Are for-loops in pandas really bad? When should I care?.
Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']), over,
say, df['date'].apply(pd.to_datetime).
Read more at the
docs.
A Common Pitfall: Exploding Columns of Lists
s = pd.Series([[1, 2]] * 3)
s
0 [1, 2]
1 [1, 2]
2 [1, 2]
dtype: object
People are tempted to use apply(pd.Series). This is horrible in terms of performance.
s.apply(pd.Series)
0 1
0 1 2
1 1 2
2 1 2
A better option is to listify the column and pass it to pd.DataFrame.
pd.DataFrame(s.tolist())
0 1
0 1 2
1 1 2
2 1 2
<!- ->
%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())
2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Lastly,
"Are there any situations where apply is good?"
Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.
Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.
df = pd.DataFrame(
pd.date_range('2018-12-31','2019-01-31', freq='2D').date.astype(str).reshape(-1, 2),
columns=['date1', 'date2'])
df
date1 date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30
df.dtypes
date1 object
date2 object
dtype: object
This is an admissible case for apply:
df.apply(pd.to_datetime, errors='coerce').dtypes
date1 datetime64[ns]
date2 datetime64[ns]
dtype: object
Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.
%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')
5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can make a similar case for other operations such as string operations, or conversion to category.
u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype(category))
v/s
u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
v[c] = df[c].astype(category)
And so on...
Converting Series to str: astype versus apply
This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable (and sometimes faster) than using astype.
The graph was plotted using the perfplot library.
import perfplot
perfplot.show(
setup=lambda n: pd.Series(np.random.randint(0, n, n)),
kernels=[
lambda s: s.astype(str),
lambda s: s.apply(str)
],
labels=['astype', 'apply'],
n_range=[2**k for k in range(1, 20)],
xlabel='N',
logx=True,
logy=True,
equality_check=lambda x, y: (x == y).all())
With floats, I see the astype is consistently as fast as, or slightly faster than apply. So this has to do with the fact that the data in the test is integer type.
GroupBy operations with chained transformations
GroupBy.apply has not been discussed until now, but GroupBy.apply is also an iterative convenience function to handle anything that the existing GroupBy functions do not.
One common requirement is to perform a GroupBy and then two prime operations such as a "lagged cumsum":
df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df
A B
0 a 12
1 a 7
2 b 5
3 c 4
4 c 5
5 c 4
6 d 3
7 d 2
8 e 1
9 e 10
<!- ->
You'd need two successive groupby calls here:
df.groupby('A').B.cumsum().groupby(df.A).shift()
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
Using apply, you can shorten this to a a single call.
df.groupby('A').B.apply(lambda x: x.cumsum().shift())
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
It is very hard to quantify the performance because it depends on the data. But in general, apply is an acceptable solution if the goal is to reduce a groupby call (because groupby is also quite expensive).
Other Caveats
Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If not, apply may be able to use a fast-path for evaluating the result, else it falls back to a slow implementation.
df = pd.DataFrame({
'A': [1, 2],
'B': ['x', 'y']
})
def func(x):
print(x['A'])
return x
df.apply(func, axis=1)
# 1
# 1
# 2
A B
0 1 x
1 2 y
This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed for 0.25, see here for more information.)
Not all applys are alike
The below chart suggests when to consider apply1. Green means possibly efficient; red avoid.
Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).
If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.
GroupBy.apply: generally favoured
Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.
pd.DataFrame.apply column-wise: a mixed bag
pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it's almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:
# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3))) # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns
# Scenario_1 | Scenario_2
%timeit df.sum() # 800 ms | 109 ms
%timeit df.apply(pd.Series.sum) # 568 ms | 325 ms
%timeit df.max() - df.min() # 1.63 s | 314 ms
%timeit df.apply(lambda x: x.max() - x.min()) # 838 ms | 473 ms
%timeit df.mean() # 108 ms | 94.4 ms
%timeit df.apply(pd.Series.mean) # 276 ms | 233 ms
1 There are exceptions, but these are usually marginal or uncommon. A couple of examples:
df['col'].apply(str) may slightly outperform df['col'].astype(str).
df.apply(pd.to_datetime) working on strings doesn't scale well with rows versus a regular for loop.
For axis=1 (i.e. row-wise functions) then you can just use the following function in lieu of apply. I wonder why this isn't the pandas behavior. (Untested with compound indexes, but it does appear to be much faster than apply)
def faster_df_apply(df, func):
cols = list(df.columns)
data, index = [], []
for row in df.itertuples(index=True):
row_dict = {f:v for f,v in zip(cols, row[1:])}
data.append(func(row_dict))
index.append(row[0])
return pd.Series(data, index=index)
Are there ever any situations where apply is good?
Yes, sometimes.
Task: decode Unicode strings.
import numpy as np
import pandas as pd
import unidecode
s = pd.Series(['mañana','Ceñía'])
s.head()
0 mañana
1 Ceñía
s.apply(unidecode.unidecode)
0 manana
1 Cenia
Update
I was by no means advocating for the use of apply, just thinking since the NumPy cannot deal with the above situation, it could have been a good candidate for pandas apply. But I was forgetting the plain ol list comprehension thanks to the reminder by #jpp.

quick pandas groupby calculations with cumprod

This question is linked to Speedup of pandas groupby. It is about speeding up a groubby cumproduct calculation. The DataFrame is 2D and has a multi index consisting of 3 integers.
The HDF5 file for the dataframe can be found here: http://filebin.ca/2Csy0E2QuF2w/phi.h5
The actual calculation that I'm performing is similar to this:
>>> phi = pd.read_hdf('phi.h5', 'phi')
>>> %timeit phi.groupby(level='atomic_number').cumprod()
100 loops, best of 3: 5.45 ms per loop
The other speedup that might be possible is that I do this calculation about 100 times using the same index structure but with different numbers. I wonder if it can somehow cache the index.
Any help will be appreciated.
Numba appears to work pretty well here. In fact, these results seem almost too good to be true with the numba function below being about 4,000x faster than the original method and 5x faster than plain cumprod without a groupby. Hopefully these are correct, let me know if there is an error.
np.random.seed(1234)
df=pd.DataFrame({ 'x':np.repeat(range(200),4), 'y':np.random.randn(800) })
df = df.sort('x')
df['cp_groupby'] = df.groupby('x').cumprod()
from numba import jit
#jit
def group_cumprod(x,y):
z = np.ones(len(x))
for i in range(len(x)):
if x[i] == x[i-1]:
z[i] = y[i] * z[i-1]
else:
z[i] = y[i]
return z
df['cp_numba'] = group_cumprod(df.x.values,df.y.values)
df['dif'] = df.cp_groupby - df.cp_numba
Test that both ways give the same answer:
all(df.cp_groupby==df.cp_numba)
Out[1447]: True
Timings:
%timeit df.groupby('x').cumprod()
10 loops, best of 3: 102 ms per loop
%timeit df['y'].cumprod()
10000 loops, best of 3: 133 µs per loop
%timeit group_cumprod(df.x.values,df.y.values)
10000 loops, best of 3: 24.4 µs per loop
pure numpy solution, assuming the data is sorted by the index, though no handling of NaN:
res = np.empty_like(phi.values)
l = 0
r = phi.index.levels[0]
for i in r:
phi.values[l:l+i,:].cumprod(axis=0, out=res[l:l+i])
l += i
about 40 times faster on the multiindex data from the question.
Though a problem is that this does rely on how pandas stores the data in its backend array. So it may stop working when pandas changes.
>>> phi = pd.read_hdf('phi.h5', 'phi')
>>> %timeit phi.groupby(level='atomic_number').cumprod()
100 loops, best of 3: 4.33 ms per loop
>>> %timeit np_cumprod(phi)
10000 loops, best of 3: 111 µs per loop
If you want a fast but not very pretty workaround, you could do something like the following. Here's some sample data and your default approach.
df=pd.DataFrame({ 'x':np.repeat(range(200),4), 'y':np.random.randn(800) })
df = df.sort('x')
df['cp_group'] = df.groupby('x').cumprod()
And here's the workaround. It's looks rather long (it is) but each individual step is simple and fast. (The timings are at the bottom.) The key is simply to avoid using groupby at all in this case by replacing with shift and such -- but because of that you also need to make sure your data is sorted by the groupby column.
df['cp_nogroup'] = df.y.cumprod()
df['last'] = np.where( df.x == df.x.shift(-1), 0, df.y.cumprod() )
df['last'] = np.where( df['last'] == 0., np.nan, df['last'] )
df['last'] = df['last'].shift().ffill().fillna(1)
df['cp_fast'] = df['cp_nogroup'] / df['last']
df['dif'] = df.cp_group - df.cp_fast
Here's what it looks like. 'cp_group' is your default and 'cp_fast' is the above workaround. If you look at the 'dif' column you'll see that several of these are off by very small amounts. This is just a precision issue and not anything to worry about.
x y cp_group cp_nogroup last cp_fast dif
0 0 1.364826 1.364826 1.364826 1.000000 1.364826 0.000000e+00
1 0 0.410126 0.559751 0.559751 1.000000 0.559751 0.000000e+00
2 0 0.894037 0.500438 0.500438 1.000000 0.500438 0.000000e+00
3 0 0.092296 0.046189 0.046189 1.000000 0.046189 0.000000e+00
4 1 1.262172 1.262172 0.058298 0.046189 1.262172 0.000000e+00
5 1 0.832328 1.050541 0.048523 0.046189 1.050541 2.220446e-16
6 1 -0.337245 -0.354289 -0.016364 0.046189 -0.354289 -5.551115e-17
7 1 0.758163 -0.268609 -0.012407 0.046189 -0.268609 -5.551115e-17
8 2 -1.025820 -1.025820 0.012727 -0.012407 -1.025820 0.000000e+00
9 2 1.175903 -1.206265 0.014966 -0.012407 -1.206265 0.000000e+00
Timings
Default method:
In [86]: %timeit df.groupby('x').cumprod()
10 loops, best of 3: 100 ms per loop
Standard cumprod but without the groupby. This should be a good approximation of the maximum possible speed you could achieve.
In [87]: %timeit df.cumprod()
1000 loops, best of 3: 536 µs per loop
And here's the workaround:
In [88]: %%timeit
...: df['cp_nogroup'] = df.y.cumprod()
...: df['last'] = np.where( df.x == df.x.shift(-1), 0, df.y.cumprod() )
...: df['last'] = np.where( df['last'] == 0., np.nan, df['last'] )
...: df['last'] = df['last'].shift().ffill().fillna(1)
...: df['cp_fast'] = df['cp_nogroup'] / df['last']
...: df['dif'] = df.cp_group - df.cp_fast
100 loops, best of 3: 2.3 ms per loop
So the workaround is about 40x faster for this sample dataframe but the speedup will depend on the dataframe (in particular on the number of groups).

Drawing a bootstrap sample from a pandas.DataFrame

I would like to draw a bootstrap sample of a pandas.DataFrame as efficiently as possible. Using the builtin iloc together with a list of integers seems to be slow:
import pandas
import numpy as np
# Generate some data
n = 5000
values = np.random.uniform(size=(n, 5))
# Construct a pandas.DataFrame
columns = ['a', 'b', 'c', 'd', 'e']
df = pandas.DataFrame(values, columns=columns)
# Bootstrap
%timeit df.iloc[np.random.randint(n, size=n)]
# Out: 1000 loops, best of 3: 1.46 ms per loop
Indexing the numpy array is of course much faster:
%timeit values[np.random.randint(n, size=n)]
# Out: 10000 loops, best of 3: 159 µs per loop
But even extracting the values, sampling the numpy array, and constructing a new pandas.DataFrame is faster:
%timeit pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# Out: 1000 loops, best of 3: 302 µs per loop
#JohnE suggested sample which is unfortunately even slower:
%timeit df.sample(n, replace=True)
# Out: 100 loops, best of 3: 5.14 ms per loop
#firelynx suggested merge:
%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# Out: 1000 loops, best of 3: 1.23 ms per loop
Does anyone have an idea why iloc is so slow and/or whether there are better alternatives than extracting the values, sampling and then constructing a new pandas.DataFrame?
The merge method in pandas is fairly optimized, so I tried my luck with it and it gave me a significant speed increase. Given my machine is a bit slower than yours, I'm also using pandas 0.15.2 Things may be a bit different.
%timeit df.iloc[np.random.randint(n, size=n)]
# 100 loops, best of 3: 2.41 ms per loop
randlist = pandas.DataFrame(index=np.random.randint(n, size=n))
%timeit df.merge(randlist, left_index=True, right_index=True, how='right')
# 1000 loops, best of 3: 1.87 ms per loop
%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# 100 loops, best of 3: 2.29 ms per loop
Indexing Speeds
Boolean Indexing tested to be slightly faster for me:
Boolean Indexing
%timeit -n10000 df[np.random.randint(2, size=n).astype(bool)]
# 10000 loops, best of 3: 307 µs per loop
numpy sampling & reDataFrameing
%timeit -n10000 pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# 10000 loops, best of 3: 380 µs per loop

split data frame based on integer index

In pandas how do I split Series/dataframe into two Series/DataFrames where odd rows in one Series, even rows in different? Right now I am using
rng = range(0, n, 2)
odd_rows = df.iloc[rng]
This is pretty slow.
Use slice:
In [11]: s = pd.Series([1,2,3,4])
In [12]: s.iloc[::2] # even
Out[12]:
0 1
2 3
dtype: int64
In [13]: s.iloc[1::2] # odd
Out[13]:
1 2
3 4
dtype: int64
Here's some comparisions
In [100]: df = DataFrame(randn(100000,10))
simple method (but I think range makes this slow), but will work regardless of the index
(e.g. does not have to be a numeric index)
In [96]: %timeit df.iloc[range(0,len(df),2)]
10 loops, best of 3: 21.2 ms per loop
The following require an Int64Index that is range based (which is easy to get, just reset_index()).
In [107]: %timeit df.iloc[(df.index % 2).astype(bool)]
100 loops, best of 3: 5.67 ms per loop
In [108]: %timeit df.loc[(df.index % 2).astype(bool)]
100 loops, best of 3: 5.48 ms per loop
make sure to give it index positions
In [98]: %timeit df.take(df.index % 2)
100 loops, best of 3: 3.06 ms per loop
same as above but no conversions on negative indicies
In [99]: %timeit df.take(df.index % 2,convert=False)
100 loops, best of 3: 2.44 ms per loop
This winner is #AndyHayden soln; this only works on a single dtype
In [118]: %timeit DataFrame(df.values[::2],index=df.index[::2])
10000 loops, best of 3: 63.5 us per loop

Vectorized way of calculating row-wise dot product two matrices with Scipy

I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:
import numpy as np
a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
result = np.array([])
for row1, row2 in a, b:
result = np.append(result, np.dot(row1, row2))
print result
and of course the output is:
[ 26. 14.]
Straightforward way to do that is:
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)
which avoids the python loop and is faster in cases like:
def npsumdot(x, y):
return np.sum(x*y, axis=1)
def loopdot(x, y):
result = np.empty((x.shape[0]))
for i in range(x.shape[0]):
result[i] = np.dot(x[i], y[i])
return result
timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop
Check out numpy.einsum for another method:
In [52]: a
Out[52]:
array([[1, 2, 3],
[3, 4, 5]])
In [53]: b
Out[53]:
array([[1, 2, 3],
[1, 2, 3]])
In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])
Looks like einsum is a bit faster than inner1d:
In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop
In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop
In [96]: a = random.randn(10, 100)
In [97]: b = random.randn(10, 100)
In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop
In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop
Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.
The plot was created with perfplot (a small project of mine)
import numpy
from numpy.core.umath_tests import inner1d
import perfplot
def setup(n):
a = numpy.random.rand(n, 3)
b = numpy.random.rand(n, 3)
aT = numpy.ascontiguousarray(a.T)
bT = numpy.ascontiguousarray(b.T)
return (a, b), (aT, bT)
b = perfplot.bench(
setup=setup,
n_range=[2 ** k for k in range(1, 25)],
kernels=[
lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
lambda data: inner1d(data[0][0], data[0][1]),
],
labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)
You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
result=np.empty((2,))
for i in range(2):
result[i] = np.dot(a[i],b[i]))
print result
EDIT
Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.
from numpy.core.umath_tests import inner1d
inner1d(a,b) # array([14, 26])
I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.
For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.
My large matrix example:
import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000) # 800 MB each
b = np.random.randn(100, 1000000) # pretty big.
def loop_dot(a, b):
result = np.empty((a.shape[1],))
for i, (row1, row2) in enumerate(zip(a, b)):
result[i] = np.dot(row1, row2)
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b) # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!