Drawing a bootstrap sample from a pandas.DataFrame - numpy

I would like to draw a bootstrap sample of a pandas.DataFrame as efficiently as possible. Using the builtin iloc together with a list of integers seems to be slow:
import pandas
import numpy as np
# Generate some data
n = 5000
values = np.random.uniform(size=(n, 5))
# Construct a pandas.DataFrame
columns = ['a', 'b', 'c', 'd', 'e']
df = pandas.DataFrame(values, columns=columns)
# Bootstrap
%timeit df.iloc[np.random.randint(n, size=n)]
# Out: 1000 loops, best of 3: 1.46 ms per loop
Indexing the numpy array is of course much faster:
%timeit values[np.random.randint(n, size=n)]
# Out: 10000 loops, best of 3: 159 µs per loop
But even extracting the values, sampling the numpy array, and constructing a new pandas.DataFrame is faster:
%timeit pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# Out: 1000 loops, best of 3: 302 µs per loop
@JohnE suggested sample, which is unfortunately even slower:
%timeit df.sample(n, replace=True)
# Out: 100 loops, best of 3: 5.14 ms per loop
@firelynx suggested merge:
%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# Out: 1000 loops, best of 3: 1.23 ms per loop
Does anyone have an idea why iloc is so slow and/or whether there are better alternatives than extracting the values, sampling and then constructing a new pandas.DataFrame?

The merge method in pandas is fairly well optimized, so I tried my luck with it, and it gave me a significant speed increase. My machine is a bit slower than yours and I'm using pandas 0.15.2, so things may be a bit different.
%timeit df.iloc[np.random.randint(n, size=n)]
# 100 loops, best of 3: 2.41 ms per loop
randlist = pandas.DataFrame(index=np.random.randint(n, size=n))
%timeit df.merge(randlist, left_index=True, right_index=True, how='right')
# 1000 loops, best of 3: 1.87 ms per loop
%timeit df.merge(pandas.DataFrame(index=np.random.randint(n, size=n)), left_index=True, right_index=True, how='right')
# 100 loops, best of 3: 2.29 ms per loop

Indexing Speeds
Boolean Indexing tested to be slightly faster for me:
Boolean Indexing
%timeit -n10000 df[np.random.randint(2, size=n).astype(bool)]
# 10000 loops, best of 3: 307 µs per loop
numpy sampling & reDataFrameing
%timeit -n10000 pandas.DataFrame(df.values[np.random.randint(n, size=n)], columns=columns)
# 10000 loops, best of 3: 380 µs per loop
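One caveat on the comparison above: a random boolean mask keeps each row at most once (roughly half of the rows), so it is a subsample rather than a bootstrap sample, which draws n rows with replacement. A minimal sketch of the "extract values, sample, rebuild" approach from the question, assuming a single-dtype frame; the helper name bootstrap_frame and the use of np.random.default_rng are my own choices, not from the posts above:

```python
import numpy as np
import pandas as pd


def bootstrap_frame(df, rng):
    """Return a bootstrap sample: len(df) rows drawn with replacement.

    Hypothetical helper; assumes all columns share one dtype, since
    df.values would otherwise fall back to an object array.
    """
    n = len(df)
    # Row positions drawn uniformly with replacement
    positions = rng.integers(0, n, size=n)
    # Index the underlying array, then rebuild the frame
    return pd.DataFrame(df.values[positions], columns=df.columns)


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(size=(5000, 5)), columns=list("abcde"))

boot = bootstrap_frame(df, rng)

# For contrast: a boolean mask selects a subset without replacement
mask = rng.integers(0, 2, size=len(df)).astype(bool)
subset = df[mask]
```

The bootstrap sample has the same shape as the original and contains repeated rows, while the masked subset is shorter and never repeats a row.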

Related

My pandas df is in ascending order but there are gaps in the index, how do I reset it without adding an extra column? [duplicate]

I have a dataframe from which I remove some rows. As a result, I get a dataframe in which index is something like that: [1,5,6,10,11] and I would like to reset it to [0,1,2,3,4]. How can I do it?
The following seems to work:
df = df.reset_index()
del df['index']
The following does not work:
df = df.reindex()
DataFrame.reset_index is what you're looking for. If you don't want it saved as a column, then do:
df = df.reset_index(drop=True)
If you don't want to reassign:
df.reset_index(drop=True, inplace=True)
Other solutions are to assign a RangeIndex or a range:
df.index = pd.RangeIndex(len(df.index))
df.index = range(len(df.index))
It is faster:
df = pd.DataFrame({'a':[8,7], 'c':[2,4]}, index=[7,8])
df = pd.concat([df]*10000)
print (df.head())
In [298]: %timeit df1 = df.reset_index(drop=True)
The slowest run took 7.26 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 105 µs per loop
In [299]: %timeit df.index = pd.RangeIndex(len(df.index))
The slowest run took 15.05 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.84 µs per loop
In [300]: %timeit df.index = range(len(df.index))
The slowest run took 7.10 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 14.2 µs per loop
data1.reset_index(inplace=True)
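A small sketch of how the approaches above differ, using a made-up three-row frame with a gappy index (the frame and variable names are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({"a": [8, 7, 6]}, index=[1, 5, 6])

# reset_index(drop=True) returns a new frame and discards the old index
via_reset = df.reset_index(drop=True)

# Assigning a RangeIndex modifies a frame in place (done on a copy here)
via_range = df.copy()
via_range.index = pd.RangeIndex(len(via_range))

# Without drop=True the old index is kept as a column named "index"
kept = df.reset_index()
```

Both via_reset and via_range end up with the index 0, 1, 2, while kept carries the old labels in an extra column, which is what the question wanted to avoid.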

Concat the same series a bunch of times?

Let us say I have a series of
import pandas as pd
a = pd.Series([1, 2, 3])
Is there a more efficient way of creating a million row df than
chamillion_row_df = pd.concat([a] * 1000000)
?
You can use np.tile for this:
import numpy as np
pd.Series(np.tile(a.values, 1000000))
This will be much quicker than building a temporary list as a param to pd.concat
timings
In [42]:
%timeit pd.Series(np.tile(a.values, 1000000))
%timeit pd.concat([a] * 1000000)
100 loops, best of 3: 17.7 ms per loop
1 loops, best of 3: 16.9 s per loop
So only ~1000x quicker using np.tile
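One difference worth checking before swapping one for the other: the values are the same, but the resulting index is not. A small sketch with a repeat count of 4 (chosen only to keep the example short):

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3])
reps = 4  # small repeat count for illustration

tiled = pd.Series(np.tile(a.values, reps))
concatenated = pd.concat([a] * reps)

# Same values either way
same_values = np.array_equal(tiled.values, concatenated.values)

# The indexes differ: the tiled Series gets 0..n-1, concat repeats 0, 1, 2
tiled_index = list(tiled.index[:5])
concat_index = list(concatenated.index[:5])
```

If the repeated index from concat matters downstream, it can be rebuilt separately; otherwise the fresh 0..n-1 index is usually what you want.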

New column to pandas dataframe according to values from other column

I have a column in my data frame with string data. I need to create a new column of integers, one for each unique string. I will use this column as the second level of a multiindex. The code below does the trick, but I was wondering if there could be a more efficient solution in Pandas for it?
import pandas as pd
df = pd.DataFrame({'c1':[1,2,3,4],
'c2':['a','a','b','b']})
for i,e in enumerate(df.c2.unique()):
    df.loc[df.c2 == e,'c3'] = i

for i,e in enumerate(df.c2.unique()):
    df.loc[df.c2 == e,'c3'] = i
can be replaced with
df['c3'] = pd.Categorical(df['c2']).codes
Even for this small DataFrame, using Categorical is (about 4x) quicker:
In [33]: %%timeit
   ....: for i,e in enumerate(df.c2.unique()):
   ....:     df.loc[df.c2 == e,'c3'] = i
1000 loops, best of 3: 1.07 ms per loop
In [35]: %timeit pd.Categorical(df['c2']).codes
1000 loops, best of 3: 264 µs per loop
The improvement in speed will increase with the number of unique elements in df['c2'], since the Python for-loop's relative inefficiency becomes more apparent with more iterations.
For example, if
import string
import numpy as np
import pandas as pd
N = 10000
df = pd.DataFrame({'c1':np.arange(N),
'c2':np.random.choice(list(string.ascii_letters), size=N)})
then using Categorical is (about 56x) quicker:
In [53]: %%timeit
   ....: for i,e in enumerate(df.c2.unique()):
   ....:     df.loc[df.c2 == e,'c3'] = i
10 loops, best of 3: 58.2 ms per loop
In [54]: %timeit df['c3'] = pd.Categorical(df['c2']).codes
1000 loops, best of 3: 1.04 ms per loop
The benchmarks above were done with IPython's %timeit "magic function".
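One thing to be aware of when replacing the loop: Categorical numbers the labels by sorted category, whereas the enumerate(unique()) loop numbers them by order of first appearance. The two agree on the example above only because 'a' happens to come before 'b'. If the appearance order matters, pd.factorize is an alternative (my suggestion, not part of the answer above). A small sketch with the labels deliberately reversed:

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3, 4], "c2": ["b", "b", "a", "a"]})

# Categorical codes follow the sorted categories: "a" -> 0, "b" -> 1
cat_codes = pd.Categorical(df["c2"]).codes

# factorize numbers labels in order of first appearance: "b" -> 0, "a" -> 1,
# which matches what the enumerate(unique()) loop would assign
fact_codes, uniques = pd.factorize(df["c2"])
```

For use as a MultiIndex level, either numbering works, as long as it is applied consistently.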

split data frame based on integer index

In pandas, how do I split a Series/DataFrame into two Series/DataFrames, with the odd rows in one and the even rows in the other? Right now I am using
rng = range(0, n, 2)
odd_rows = df.iloc[rng]
This is pretty slow.
Use slice:
In [11]: s = pd.Series([1,2,3,4])
In [12]: s.iloc[::2] # even
Out[12]:
0 1
2 3
dtype: int64
In [13]: s.iloc[1::2] # odd
Out[13]:
1 2
3 4
dtype: int64
Here are some comparisons:
In [100]: df = DataFrame(randn(100000,10))
The simple method (I think range makes this slow) works regardless of the index, e.g. the index does not have to be numeric:
In [96]: %timeit df.iloc[range(0,len(df),2)]
10 loops, best of 3: 21.2 ms per loop
The following require an Int64Index that is range based (which is easy to get, just reset_index()).
In [107]: %timeit df.iloc[(df.index % 2).astype(bool)]
100 loops, best of 3: 5.67 ms per loop
In [108]: %timeit df.loc[(df.index % 2).astype(bool)]
100 loops, best of 3: 5.48 ms per loop
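take requires integer positions. Note that df.index % 2 supplies only the positions 0 and 1, so the two timings below measure the cost of take rather than a correct odd/even split: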
In [98]: %timeit df.take(df.index % 2)
100 loops, best of 3: 3.06 ms per loop
Same as above, but with no conversion of negative indices:
In [99]: %timeit df.take(df.index % 2,convert=False)
100 loops, best of 3: 2.44 ms per loop
The winner is @AndyHayden's solution; it only works on a single dtype:
In [118]: %timeit DataFrame(df.values[::2],index=df.index[::2])
10000 loops, best of 3: 63.5 us per loop
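For reference, a minimal sketch of the slice-based split on a frame with a non-numeric index (the frame is made up for illustration). Because iloc slices work on positions, no reset_index is needed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(6, 2), columns=["x", "y"],
                  index=list("abcdef"))

even_rows = df.iloc[::2]    # positions 0, 2, 4
odd_rows = df.iloc[1::2]    # positions 1, 3, 5
```

Together the two pieces cover every row exactly once, and each keeps its original index labels.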

Vectorized way of calculating row-wise dot product two matrices with Scipy

I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:
import numpy as np
a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
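result = np.array([])
# zip pairs up the rows of a and b; iterating over "a, b" directly would
# instead unpack the two rows of a, then the two rows of b
for row1, row2 in zip(a, b):
    result = np.append(result, np.dot(row1, row2))
print(result)
and of course the output is:
[ 14. 26.]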
A straightforward way to do that is:
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)
which avoids the python loop and is faster in cases like:
def npsumdot(x, y):
    return np.sum(x*y, axis=1)
def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result
timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop
Check out numpy.einsum for another method:
In [52]: a
Out[52]:
array([[1, 2, 3],
[3, 4, 5]])
In [53]: b
Out[53]:
array([[1, 2, 3],
[1, 2, 3]])
In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])
Looks like einsum is a bit faster than inner1d:
In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop
In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop
In [96]: a = random.randn(10, 100)
In [97]: b = random.randn(10, 100)
In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop
In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop
Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.
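If you want to confirm that the public-API variants agree before timing them, here is a small sketch on the arrays from the question. The batched matmul form is an extra variant I added for comparison; it is not from the answers above:

```python
import numpy as np

a = np.array([[1, 2, 3], [3, 4, 5]])
b = np.array([[1, 2, 3], [1, 2, 3]])

# Elementwise product, then sum along each row
by_sum = np.sum(a * b, axis=1)

# Same contraction expressed with einsum
by_einsum = np.einsum("ij,ij->i", a, b)

# Batched matrix product: (n, 1, k) @ (n, k, 1) -> (n, 1, 1)
by_matmul = (a[:, None, :] @ b[:, :, None]).ravel()
```

All three give one dot product per row, here 14 for the first row and 26 for the second.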
Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use
numpy.einsum("ij,ij->i", a, b)
Even better is to align your memory such that the summation happens in the first dimension, e.g.,
a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)
numpy.einsum("ij,ij->j", a, b)
For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.
Note also that the transposed summation is much faster than its untransposed equivalent.
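A quick sketch to check that the two layouts produce the same numbers, assuming contiguous transposed copies as in the benchmark setup; the size n = 1000 and the seeded generator are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
a = rng.random((n, 3))
b = rng.random((n, 3))

# Transposed copies stored contiguously, so each of the 3 components
# occupies one long row in memory
aT = np.ascontiguousarray(a.T)
bT = np.ascontiguousarray(b.T)

row_layout = np.einsum("ij,ij->i", a, b)      # sum over axis 1
col_layout = np.einsum("ij,ij->j", aT, bT)    # sum over axis 0
```

Only the memory layout changes; whether the transposed form pays off on your machine is something to measure rather than assume.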
The plot was created with perfplot (a small project of mine)
import numpy
from numpy.core.umath_tests import inner1d
import perfplot
def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)
b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)
b.save("out1.png")
b.save("out2.png", relative_to=3)
You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.
import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
result=np.empty((2,))
for i in range(2):
    result[i] = np.dot(a[i], b[i])
print(result)
EDIT
Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.
from numpy.core.umath_tests import inner1d
inner1d(a,b) # array([14, 26])
I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.
For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.
My large matrix example:
import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000) # 800 MB each
b = np.random.randn(100, 1000000) # pretty big.
def loop_dot(a, b):
    result = np.empty((a.shape[0],))  # one entry per row
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)
    return result
%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b) # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!
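A scaled-down sketch for sanity-checking the loop against einsum before timing anything, assuming the loop is meant to return one value per row. The 100 x 1000 size and the seeded generator are my own choices to keep memory use small:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((100, 1000))
b = rng.standard_normal((100, 1000))


def loop_dot(a, b):
    # One dot product per row, so the result has a.shape[0] entries
    result = np.empty(a.shape[0])
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)
    return result


looped = loop_dot(a, b)
vectorized = np.einsum("ij,ij->i", a, b)
```

With only 100 long rows, the Python loop runs just 100 iterations, which would explain why it keeps up with the vectorized versions in the timings above.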