Why the difference between octave's prctile and numpy's percentile? - numpy

I've been rewriting a matlab/octave program into numpy and ran across a difference in some resultant values.
This occurs with both the percentile/prctile and the stdard-deviation functions.
In Numpy:
import matplotlib.mlab as ml
import numpy
>>> t = numpy.linspace(0,100, 100)
>>> numpy.percentile(t,95)
95.0
>>> numpy.std(t)
29.157646512850626
>>> ml.prctile(t,95)
95.000000000000014
In Octave:
octave:1> t = linspace(0,100,100)';
octave:2> prctile(t,95)
ans = 95.454545
octave:3> std(t)
ans = 29.304537
Although the array values of 't' are the same, the results are more different than I would suspect.
In the numpy help(numpy.std) they specifically mention that the algorithm is:
std = sqrt(mean(abs(x - x.mean())**2))
So I implemented that in octave and got the exact answer numpy gives. So it seems the std-deviation function differs.
But why/how? And which is correct? (if there is such a thing)
And even prctile/percentile?
Just in case since I'm in Linux aptosid...
GNU Octave, version 3.6.2
numpy.version '1.6.2rc1'

Numpy simply uses a different algorithm when the percentile lies between two data points. Octave, Matlab and R always center it exactly between two points when needed (I believe), numpy does a bit more then that... if you check http://en.wikipedia.org/wiki/Percentile you will see there are a couple of ways to calculate percentiles.

It seems like Octave assumes ddof=1, at least by default, and numpy uses 0 by default:
>>> numpy.std(t, ddof=0)
29.157646512850633
>>> numpy.std(t, ddof=1)
29.304537349375785

Related

pandas "isin" is much slower than numpy "in1d"

There is a huge difference between pandas "isin" and numpy "in1d" from the efficiency aspect. After some research I've noticed that the type of the data and the values that passed as parameter to the "in" method has huge impact on the run time. Anyway it looks that numpy implementation suffer much less from this problem.
What's going on here?
import timeit
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(10**6),dtype='int8'),columns=['A'])
vals = np.array([5,7],dtype='int64')
f = lambda: df['A'].isin(vals)
g = lambda: pd.np.in1d(df['A'],vals)
print 'pandas:', timeit.timeit(stmt='f()',setup='from __main__ import f',number=10)/10
print 'numpy :', timeit.timeit(stmt='g()',setup='from __main__ import g',number=10)/10
>>
**pandas: 0.0541711091995
numpy : 0.000645089149475**
Numpy and Pandas use different algorithms for isin. For some cases numpy's version is faster and for some pandas'. For your test case numpy seems to be faster.
Pandas' version has however a better asymptotic running time, in will win for bigger datasets.
Let's assume that there are n elements in the data-series (df in your example) and m elements in the query (vals in your example).
Usually, Numpy's algorithm does the following:
Use np.unique(..) to find all unique elements in the series. Thus is done via sorting, i.e. O(n*log(n)), there might be N<=n unique elements.
For every element use binary search to look up whether element is in the series, i.e. O(m*log(N)) in overall.
Which leads to overall running time of O(n*log(n) + m*log(N)).
There are some hard-coded optimizations in place for the cases, when vals only few elements and for this cases numpy really shines.
Pandas does something different:
Populates a hash-map (wrapped khash-functionality) in order to find all unique elements, which takes O(n).
Looks-up in the hash map in O(1) for every query, i.e. O(m) in overall.
I overall, running time is O(n)+O(m), which is much better than Numpy's.
However, for smaller inputs, constant factors and not the asymptotic behavior is that what counts and it is just way better for Numpy. There are also other consideration, like memory consumption (which is higher for Pandas) which might play a role.
But if we take a bigger query set, the situation is completely different:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,10,(10**6),dtype='int8'),columns=['A'])
vals = np.array([5,7],dtype='int64')
vals2 = np.random.randint(0,10,(10**6),dtype='int64')
And now:
%timeit df['A'].isin(vals) # 17.0 ms
%timeit df['A'].isin(vals2) # 16.8 ms
%timeit pd.np.in1d(df['A'],vals) # 1.36
%timeit pd.np.in1d(df['A'],vals2) # 82.9 ms
Numpy is really losing ground as long as there are more queries. It can also be seen, that building of the hash-map is the bottleneck for Pandas and not the queries.
In the end it doesn't make much sense (even if I just did!) to evaluate the performance for only one input size - it should be done for a range of input sizes - there are some surprises to be discovered!
E.g. fun fact: if you would take
df = pd.DataFrame(np.random.randint(0,10,(10**6+1), dtype='int8'),columns=['A'])
i.e. 10^6+1 instead of 10^6, pandas would fall back to numpy's algorithm (which is not clever in my opinion) and would become better for small inputs but worse for big:
%timeit df['A'].isin(vals) # 6ms was 17.0 ms
%timeit df['A'].isin(vals2) # 100ms was 16.8 ms

Numpy gradient with non uniform spacing

I got something wrong going on :
import numpy as np
import matplotlib.pyplot as plt
x = np.concatenate((np.linspace(0,1,100),np.linspace(1,2,50)));
f = np.power(x,2);
df = 2*x;
Df = np.gradient(f,x);
plt.plot(x,df,'r', x,Df,'b');plt.show()
This is what I get :
Otherwise things work ok if using linearly spaced array and not using argument x.
Any suggestions?
I think this is because numpy versions before 1.13 expect the "x" argument to be the constant grid spacing (see https://docs.scipy.org/doc/numpy-1.11.0/reference/generated/numpy.gradient.html#numpy.gradient). Even though the earlier versions expect a scalar dx, they do not check for this, and the result is np.gradient(f) / x, which is a valid division. This is pretty annoying since code written for numpy 1.13 may run on earlier versions with incorrect output and no errors.

How to simulate random returns with numpy

What is a quick way to simulate random returns. I'm aware of numpy.random. However, that doesn't guide me towards how to model asset returns.
I've tried:
import numpy as np
r = np.random.rand(100)
But this doesn't feel accurate. How are others dealing doing this?
I'd suggest one of two approaches:
One: Assume returns are normally distributed with mean equal to 0.1% and stadard deviation about 1%. This looks like:
import numpy as np
np.random.seed(314)
r = np.random.randn(100) / 100 + 0.001
seed(314) sets the random number generator at a specific point so that if we both use the same seed, we should see the same results.
randn pulls from the normal distribution.
I'd also recommend using pandas. It's a library that implements a DataFrame object similar to R
import pandas as pd
df = pd.DataFrame(r)
You can then plot the cumulative returns like this:
df.add(1).cumprod().plot()
Two:
The second way is to assume returns are log normally distributed. That means the log(r) is normal. In this scenario, we pull normally distributed random numbers and then use those values as the exponent of e. It looks like this:
r = np.exp(np.random.randn(100) / 100 + 0.001) - 1
If you plot it, it looks like this:
pd.DataFrame(r).add(1).cumprod().plot()

SciPy UnivariateSpline Specifying Axis?

Using scipy.interpolate.interp1d it is possible to pass in a (1080, 4) nd.array and compute an interpolation function for each 'row' in a single command:
spline = interp1d(np.arange(1,5), np.random.random(1080,4), kind='cubic')
I am getting slightly different interpolation results (off the knots) than some existing Fortran code. I believe this is because the SciPy source is using a b-spline and the Fortran code is using splines derived from numerical recipes.
I am attempting to perform the same interpolation using UnivariateSpline with s=0, so InterpolatedUnivariateSpline.
I am able to get this working if I pass the data row by row, i.e. using an iterator to step over all 1080 rows - this is highly inefficient.
Using:
spline = UnivariateSpline(np.arange(1,5).reshape(-1,1), np.random.random(1080,4), s=0, k=3)
I am seeing:
failed in converting 2nd argument `y' of dfitpack.fpcurf0 to C/Fortran array
I believe this is an issue getting the multi-dimensional array into Fitpack? Any insight in how to avoid an iterator? Additionally, any insight into a SciPy interpolation function that matches the one described in numerical recipes (section 3.3, p.120) - You have to type the page number, I can not direct link, it is a Flash viewer...
In older version of SciPy (I observed it in 0.14) the splines returned by interp1d were of relatively poor quality. In versions 0.19 and later, interp1d is consistent with other spline routines, and since it accepts vector inputs, I think that answers the question. Here is the comparison of three spline constructors: the latter two only take one row as input.
from scipy.interpolate import interp1d, UnivariateSpline, splrep, splev
x = np.arange(1, 5)
y = np.random.normal(size=(1080, 4))
spl1 = interp1d(x, y, kind='cubic')
spl2 = UnivariateSpline(x, y[123, :], s=0, k=3)
spl3 = splrep(x, y[123, :], s=0, k=3)
t = 2.345
print(spl1(t)[123], spl2(t), splev(t, spl3))
This prints (with my random numbers)
-0.333973049011 -0.333973049011 -0.333973049011

Different result of same numpy mean calculation on two computers

I have two computers with python 2.7.2 (MSC v.1500 32 bit (Intel)] on win32) and numpy 1.6.1.
But
numpy.mean(data)
returns
1.13595094681 on my old computer
and
1.13595104218 on my new computer
where
Data = [ 0.20227873 -0.02738848 0.59413314 0.88547146 1.26513398 1.21090782
1.62445402 1.80423951 1.58545554 1.26801944 1.22551131 1.16882968
1.19972098 1.41940248 1.75620842 1.28139281 0.91190684 0.83705413
1.19861531 1.30767155]
In both cases
s=0
for n in data[:20]:
s+=n
print s/20
gives
1.1359509334
Can anyone explain why and how to avoid?
Mads
If you want to avoid any differences between the two, then make them explicitly 32-bit or 64-bit float arrays. NumPy uses several other libraries that may be 32 or 64 bit. Note that rounding can occur in your print statements as well:
>>> import numpy as np
>>> a = [0.20227873, -0.02738848, 0.59413314, 0.88547146, 1.26513398,
1.21090782, 1.62445402, 1.80423951, 1.58545554, 1.26801944,
1.22551131, 1.16882968, 1.19972098, 1.41940248, 1.75620842,
1.28139281, 0.91190684, 0.83705413, 1.19861531, 1.30767155]
>>> x32 = np.array(a, np.float32)
>>> x64 = np.array(a, np.float64)
>>> x32.mean()
1.135951042175293
>>> x64.mean()
1.1359509335
>>> print x32.mean()
1.13595104218
>>> print x64.mean()
1.1359509335
Another point to note is that if you have lower level libraries (e.g., atlas, lapack) that are multi-threaded, then for large arrays, you may have differences in your result regardless, due to possible variable order of operations and floating point precision.
Also, you are at the limit of precision for 32 bit numbers:
>>> x32.sum()
22.719021
>>> np.array(sorted(x32)).sum()
22.719019
This is happening because you have Float32 arrays (single precision). With single precision, the operations are only accurate to 6 decimal place. Hence your results are the same up to the 6th decimal place (after the decimal point, rounding the last digit), but they are not accurate after that. Different architectures/machines/compilers will yield the different results after that. If you want the same results you should use higher precision arrays (e.g. Float64).