Concat the same series a bunch of times? - pandas

Let us say I have a Series:
import pandas as pd
a = pd.Series([1, 2, 3])
Is there a more efficient way of creating a million-row DataFrame than the following?
chamillion_row_df = pd.concat([a] * 1000000)

You can use np.tile for this:
import numpy as np
pd.Series(np.tile(a.values, 1000000))
This will be much quicker than building a temporary list as a parameter to pd.concat.
Timings:
In [42]:
%timeit pd.Series(np.tile(a.values, 1000000))
%timeit pd.concat([a] * 1000000)
100 loops, best of 3: 17.7 ms per loop
1 loops, best of 3: 16.9 s per loop
So only ~1000x quicker using np.tile
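One difference from pd.concat is that the tiled Series gets a fresh RangeIndex rather than the repeated 0, 1, 2 index. If you also need that repeated index, a minimal sketch (my addition, not from the answer) is to tile the original index labels as well:
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])
n = 1000000  # repeat count
# tile both the values and the original index labels so the result matches pd.concat([a] * n)
big = pd.Series(np.tile(a.values, n), index=np.tile(a.index.values, n))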

Related

Pandas replace .apply(lambda x: with fast solution e.g. using numpy arrays

I am trying to speed up a ranking function that I use to process millions of rows with hundreds of factors. I have provided a sample MCVE below:
import random
import numpy as np
import pandas as pd
to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1': np.random.randn(1000), 'var_2': np.random.randn(1000), 'var_3': np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1, df.shape[0] + 1)).split(',')
My ranking code is as follows:
import pandas as pd
import numpy as np
import bottleneck as bn
%timeit ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category']).apply(lambda x: x[to_rank].apply(lambda x: bn.nanrankdata(x) * 100 / len(x) - 1))
10 loops, best of 3: 106 ms per loop
With my data, this takes about 30-40 seconds. I gather that .apply(lambda x: ...) has big overheads, including a loop, dtype detection, and shape analysis, and I am using it twice to loop over a multi-index, which is probably doubly inefficient. I have read that one can vectorize this using Series/numpy arrays (e.g. https://tomaugspurger.github.io/modern-4-performance.html), but I am struggling to implement this myself; indeed, most similar questions about applying a function over a multi-index seem to use .apply(lambda x: ...), so I suspect others could also benefit from speeding up their code.
You can define a function and use transform, although the time taken is not that much better (only about twice as fast):
def nanrankdata_len(x):
    return bn.nanrankdata(x) * 100 / len(x) - 1

%timeit ranked = df.groupby(['date_id','category']).transform(nanrankdata_len)
#-> 10 loops, best of 3: 55.5 ms per loop
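Not part of the original answers, but pandas' built-in GroupBy.rank can express the same percentile-style ranking without any Python-level lambda; a sketch, assuming the slight differences in tie and NaN handling versus bottleneck's nanrankdata are acceptable:
# vectorized alternative: percentile rank within each (date_id, category) group,
# scaled the same way as bn.nanrankdata(x) * 100 / len(x) - 1
ranked = df.groupby(['date_id', 'category'])[to_rank].rank(pct=True) * 100 - 1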

quick pandas groupby calculations with cumprod

This question is linked to Speedup of pandas groupby. It is about speeding up a groupby cumproduct calculation. The DataFrame is 2D and has a multi-index consisting of 3 integers.
The HDF5 file for the dataframe can be found here: http://filebin.ca/2Csy0E2QuF2w/phi.h5
The actual calculation that I'm performing is similar to this:
>>> phi = pd.read_hdf('phi.h5', 'phi')
>>> %timeit phi.groupby(level='atomic_number').cumprod()
100 loops, best of 3: 5.45 ms per loop
The other speedup that might be possible is that I do this calculation about 100 times using the same index structure but with different numbers. I wonder whether the index can somehow be cached.
Any help will be appreciated.
Numba appears to work pretty well here. In fact, these results seem almost too good to be true, with the numba function below being about 4,000x faster than the original method and 5x faster than a plain cumprod without a groupby. Hopefully these are correct; let me know if there is an error.
import numpy as np
import pandas as pd
from numba import jit

np.random.seed(1234)
df = pd.DataFrame({'x': np.repeat(range(200), 4), 'y': np.random.randn(800)})
df = df.sort_values('x')
df['cp_groupby'] = df.groupby('x')['y'].cumprod()

@jit
def group_cumprod(x, y):
    z = np.ones(len(x))
    for i in range(len(x)):
        if x[i] == x[i-1]:
            z[i] = y[i] * z[i-1]
        else:
            z[i] = y[i]
    return z

df['cp_numba'] = group_cumprod(df.x.values, df.y.values)
df['dif'] = df.cp_groupby - df.cp_numba
Test that both ways give the same answer:
all(df.cp_groupby==df.cp_numba)
Out[1447]: True
Timings:
%timeit df.groupby('x').cumprod()
10 loops, best of 3: 102 ms per loop
%timeit df['y'].cumprod()
10000 loops, best of 3: 133 µs per loop
%timeit group_cumprod(df.x.values,df.y.values)
10000 loops, best of 3: 24.4 µs per loop
A pure numpy solution, assuming the data is sorted by the index, though with no handling of NaN:
def np_cumprod(phi):
    res = np.empty_like(phi.values)
    l = 0
    r = phi.index.levels[0]
    for i in r:
        phi.values[l:l+i, :].cumprod(axis=0, out=res[l:l+i])
        l += i
    return res
This is about 40 times faster on the multi-index data from the question.
One problem, though, is that it relies on how pandas stores the data in its backing array, so it may stop working when pandas changes.
>>> phi = pd.read_hdf('phi.h5', 'phi')
>>> %timeit phi.groupby(level='atomic_number').cumprod()
100 loops, best of 3: 4.33 ms per loop
>>> %timeit np_cumprod(phi)
10000 loops, best of 3: 111 µs per loop
If you want a fast but not very pretty workaround, you could do something like the following. Here's some sample data and your default approach.
df = pd.DataFrame({'x': np.repeat(range(200), 4), 'y': np.random.randn(800)})
df = df.sort_values('x')
df['cp_group'] = df.groupby('x')['y'].cumprod()
And here's the workaround. It looks rather long (it is), but each individual step is simple and fast. (The timings are at the bottom.) The key is simply to avoid using groupby at all in this case by replacing it with shift and such -- but because of that you also need to make sure your data is sorted by the groupby column.
df['cp_nogroup'] = df.y.cumprod()
df['last'] = np.where( df.x == df.x.shift(-1), 0, df.y.cumprod() )
df['last'] = np.where( df['last'] == 0., np.nan, df['last'] )
df['last'] = df['last'].shift().ffill().fillna(1)
df['cp_fast'] = df['cp_nogroup'] / df['last']
df['dif'] = df.cp_group - df.cp_fast
Here's what it looks like. 'cp_group' is your default and 'cp_fast' is the above workaround. If you look at the 'dif' column you'll see that several of these are off by very small amounts. This is just a floating-point precision issue and nothing to worry about.
x y cp_group cp_nogroup last cp_fast dif
0 0 1.364826 1.364826 1.364826 1.000000 1.364826 0.000000e+00
1 0 0.410126 0.559751 0.559751 1.000000 0.559751 0.000000e+00
2 0 0.894037 0.500438 0.500438 1.000000 0.500438 0.000000e+00
3 0 0.092296 0.046189 0.046189 1.000000 0.046189 0.000000e+00
4 1 1.262172 1.262172 0.058298 0.046189 1.262172 0.000000e+00
5 1 0.832328 1.050541 0.048523 0.046189 1.050541 2.220446e-16
6 1 -0.337245 -0.354289 -0.016364 0.046189 -0.354289 -5.551115e-17
7 1 0.758163 -0.268609 -0.012407 0.046189 -0.268609 -5.551115e-17
8 2 -1.025820 -1.025820 0.012727 -0.012407 -1.025820 0.000000e+00
9 2 1.175903 -1.206265 0.014966 -0.012407 -1.206265 0.000000e+00
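If you want to confirm that the dif column really is just floating-point noise, a quick check (not in the original answer, assuming numpy is imported as np) is:
# True if every pair of values agrees within the default floating-point tolerances
np.allclose(df['cp_group'], df['cp_fast'])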
Timings
Default method:
In [86]: %timeit df.groupby('x').cumprod()
10 loops, best of 3: 100 ms per loop
Standard cumprod but without the groupby. This should be a good approximation of the maximum possible speed you could achieve.
In [87]: %timeit df.cumprod()
1000 loops, best of 3: 536 µs per loop
And here's the workaround:
In [88]: %%timeit
...: df['cp_nogroup'] = df.y.cumprod()
...: df['last'] = np.where( df.x == df.x.shift(-1), 0, df.y.cumprod() )
...: df['last'] = np.where( df['last'] == 0., np.nan, df['last'] )
...: df['last'] = df['last'].shift().ffill().fillna(1)
...: df['cp_fast'] = df['cp_nogroup'] / df['last']
...: df['dif'] = df.cp_group - df.cp_fast
100 loops, best of 3: 2.3 ms per loop
So the workaround is about 40x faster for this sample dataframe but the speedup will depend on the dataframe (in particular on the number of groups).
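For reuse, the same shift-based steps can be wrapped in a small helper so the intermediate columns never touch your frame; this is only a repackaging sketch of the workaround above, and the name fast_group_cumprod is mine:
import numpy as np
import pandas as pd

def fast_group_cumprod(group, values):
    # both arguments are Series already sorted by `group`
    cp = values.cumprod()                                    # cumprod ignoring groups
    last = np.where(group == group.shift(-1), np.nan, cp)    # cumprod at each group's last row, NaN elsewhere
    last = pd.Series(last, index=values.index).shift().ffill().fillna(1)
    return cp / last                                         # divide out the product accumulated by earlier groups

df['cp_fast'] = fast_group_cumprod(df['x'], df['y'])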

Drawing a bootstrap sample from a pandas.DataFrame

I would like to draw a bootstrap sample of a pandas.DataFrame as efficiently as possible. Using the builtin iloc together with a list of integers seems to be slow:
import pandas
import numpy as np
# Generate some data
n = 5000
values = np.random.uniform(size=(n, 5))
# Construct a pandas.DataFrame
columns = ['a', 'b', 'c', 'd', 'e']
df = pandas.DataFrame(values, columns=columns)
# Bootstrap
%timeit df.iloc[np.random.randint(n, size=n)]
# Out: 1000 loops, best of 3: 1.46 ms per loop
Indexing the numpy array is of course much faster:
%timeit values[np.random.randint(n, size=n)]
# Out: 10000 loops, best of 3: 159 µs per loop
But even extracting the values, sampling the numpy array, and constructing a new pandas.DataFrame is faster:
%timeit pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# Out: 1000 loops, best of 3: 302 µs per loop
@JohnE suggested sample, which is unfortunately even slower:
%timeit df.sample(n, replace=True)
# Out: 100 loops, best of 3: 5.14 ms per loop
@firelynx suggested merge:
%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# Out: 1000 loops, best of 3: 1.23 ms per loop
Does anyone have an idea why iloc is so slow and/or whether there are better alternatives than extracting the values, sampling and then constructing a new pandas.DataFrame?
The merge method in pandas is fairly well optimized, so I tried my luck with it and it gave me a significant speed increase. My machine is a bit slower than yours and I'm also using pandas 0.15.2, so things may be a bit different.
%timeit df.iloc[np.random.randint(n, size=n)]
# 100 loops, best of 3: 2.41 ms per loop
randlist = pandas.DataFrame(index=np.random.randint(n, size=n))
%timeit df.merge(randlist, left_index=True, right_index=True, how='right')
# 1000 loops, best of 3: 1.87 ms per loop
%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# 100 loops, best of 3: 2.29 ms per loop
Indexing Speeds
Boolean Indexing tested to be slightly faster for me:
Boolean Indexing
%timeit -n10000 df[np.random.randint(2, size=n).astype(bool)]
# 10000 loops, best of 3: 307 µs per loop
numpy sampling & reDataFrameing
%timeit -n10000 pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# 10000 loops, best of 3: 380 µs per loop
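For repeated resampling, the fastest approach from the question (sample the underlying array, then rebuild the frame) can be wrapped in a small helper; a sketch, with the hypothetical name bootstrap_sample (note that df.values collapses mixed dtypes into a single dtype):
def bootstrap_sample(df):
    # draw len(df) row positions with replacement, index into the ndarray,
    # then rebuild a DataFrame with the original column labels
    idx = np.random.randint(len(df), size=len(df))
    return pandas.DataFrame(df.values[idx], columns=df.columns)

boot = bootstrap_sample(df)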

New column to pandas dataframe according to values from other column

I have a column in my data frame with string data. I need to create a new column of integers, one for each unique string. I will use this column as the second level of a multiindex. The code below does the trick, but I was wondering if there could be a more efficient solution in Pandas for it?
import pandas as pd
df = pd.DataFrame({'c1': [1, 2, 3, 4],
                   'c2': ['a', 'a', 'b', 'b']})
for i, e in enumerate(df.c2.unique()):
    df.loc[df.c2 == e, 'c3'] = i
for i, e in enumerate(df.c2.unique()):
    df.loc[df.c2 == e, 'c3'] = i
can be replaced with
df['c3'] = pd.Categorical(df['c2']).codes
Even for this small DataFrame, using Categorical is (about 4x) quicker:
In [33]: %%timeit
    ...: for i, e in enumerate(df.c2.unique()):
    ...:     df.loc[df.c2 == e, 'c3'] = i
1000 loops, best of 3: 1.07 ms per loop
In [35]: %timeit pd.Categorical(df['c2']).codes
1000 loops, best of 3: 264 µs per loop
The improvement in speed will increase with the number of unique elements in df['c2'], since the Python for-loop's relative inefficiency becomes more apparent with more iterations.
For example, if
import string
import numpy as np
import pandas as pd
N = 10000
df = pd.DataFrame({'c1': np.arange(N),
                   'c2': np.random.choice(list(string.ascii_letters), size=N)})
then using Categorical is (about 56x) quicker:
In [53]: %%timeit
    ...: for i, e in enumerate(df.c2.unique()):
    ...:     df.loc[df.c2 == e, 'c3'] = i
10 loops, best of 3: 58.2 ms per loop
In [54]: %timeit df['c3'] = pd.Categorical(df['c2']).codes
1000 loops, best of 3: 1.04 ms per loop
The benchmarks above were done with IPython's %timeit "magic function".
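Another vectorized option, not mentioned in the answer, is pd.factorize. Unlike Categorical, whose codes follow the sorted categories, factorize assigns codes in order of first appearance (which is what the original for-loop does), so pick whichever labelling you need:
# codes in order of first appearance; uniques holds the corresponding strings
codes, uniques = pd.factorize(df['c2'])
df['c3'] = codes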

split data frame based on integer index

In pandas, how do I split a Series/DataFrame into two Series/DataFrames, with the odd rows in one and the even rows in the other? Right now I am using
rng = range(0, n, 2)
odd_rows = df.iloc[rng]
This is pretty slow.
Use slice:
In [11]: s = pd.Series([1,2,3,4])
In [12]: s.iloc[::2] # even
Out[12]:
0 1
2 3
dtype: int64
In [13]: s.iloc[1::2] # odd
Out[13]:
1 2
3 4
dtype: int64
Here are some comparisons
In [100]: df = DataFrame(randn(100000,10))
A simple method (though I think range makes it slow) that works regardless of the index (e.g. it does not have to be a numeric index):
In [96]: %timeit df.iloc[range(0,len(df),2)]
10 loops, best of 3: 21.2 ms per loop
The following require an Int64Index that is range based (which is easy to get, just reset_index()).
In [107]: %timeit df.iloc[(df.index % 2).astype(bool)]
100 loops, best of 3: 5.67 ms per loop
In [108]: %timeit df.loc[(df.index % 2).astype(bool)]
100 loops, best of 3: 5.48 ms per loop
make sure to give it index positions
In [98]: %timeit df.take(df.index % 2)
100 loops, best of 3: 3.06 ms per loop
same as above, but with no conversion of negative indices
In [99]: %timeit df.take(df.index % 2,convert=False)
100 loops, best of 3: 2.44 ms per loop
The winner is @AndyHayden's solution; it only works on a single dtype
In [118]: %timeit DataFrame(df.values[::2],index=df.index[::2])
10000 loops, best of 3: 63.5 us per loop
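For completeness, the positional slice from the accepted answer carries over directly to a DataFrame and, unlike the values-based winner, works with mixed dtypes and any index; a minimal sketch:
even_rows = df.iloc[::2]     # rows at positions 0, 2, 4, ...
odd_rows = df.iloc[1::2]     # rows at positions 1, 3, 5, ...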