I am not sure if I used the correct technical words in the title. What I want is something like the following.
I have the following code
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, None, None, 4, None, None, None, 10]])
df = df.fillna(np.nan)
df = df.transpose().interpolate()
which performs a linear interpolation and gives me something like
1.0 2.0 3.0 4.0 5.5 7.0 8.5 10.0
What I want is an exponentially decaying interpolation, i.e. something like below (not the exact values, but you get the idea).
1.0 2.5 3.0 4.0 6.5 8.0 9.2 10.0
That is, I want the closer values to change more drastically than the far values. Is there an interpolation method available in pandas that can do this?
You need to apply some transformations to the data. Try this:
df = pd.DataFrame([[1, None, None, 4, None, None, None, 10]])
df = df.fillna(np.nan)
df = 10**df                         # move the data into exponential space
df = df.transpose().interpolate()   # interpolate linearly there
df = np.log10(df)                   # map back to the original scale
You can play with the powers to get something that matches what you need.
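More generally, you can wrap this pattern in a small helper and tune the base directly. A minimal sketch (interpolate_in_power_space is a hypothetical name, not a pandas API):

import numpy as np
import pandas as pd

def interpolate_in_power_space(s, base=10.0):
    # Map into base**x space, interpolate linearly there, then map back
    # with the matching logarithm. As base -> 1 this approaches plain
    # linear interpolation; larger bases bend the curve harder.
    transformed = np.power(base, s.astype(float))
    filled = transformed.interpolate()
    return np.log(filled) / np.log(base)

s = pd.Series([1, None, None, 4, None, None, None, 10], dtype=float)
print(interpolate_in_power_space(s, base=10.0))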
Consider the following dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame([np.nan, np.nan, 1, 5, np.nan, 6, 6.1, np.nan, np.nan])
I would like to use the pandas.DataFrame.interpolate method to linearly extrapolate the dataframe entries at the starting and ending rows, similar to what I get if I do the following:
from scipy import interpolate
df_num = df.dropna()
xi = df_num.index.values
yi = df_num.values[:,0]
f = interpolate.interp1d(xi, yi, kind='linear', fill_value='extrapolate')
x = [0, 1, 7, 8]
print(f(x))
[-7. -3. 6.2 6.3]
It seems that the 'linear' option in pandas' interpolate calls NumPy's interpolation routine, which doesn't do linear extrapolation. Is there a way to call the built-in interpolate method to achieve this?
You can use SciPy interpolation methods directly in pandas. See the pandas.DataFrame.interpolate documentation: the method option accepts techniques from scipy.interpolate.interp1d, as noted there.
A solution for your example could look like:
df.interpolate(method="slinear", fill_value="extrapolate", limit_direction="both")
# Out:
#      0
# 0 -7.0
# 1 -3.0
# 2  1.0
# 3  5.0
# 4  5.5
# 5  6.0
# 6  6.1
# 7  6.2
# 8  6.3
You can then easily select any values you are interested in, e.g. df_interpolated.loc[x] (where df_interpolated is the output of the previous code block), using the indexes defined by the x variable in your question.
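For example, using the indexes from the question:

df_interpolated = df.interpolate(method="slinear", fill_value="extrapolate", limit_direction="both")
print(df_interpolated.loc[[0, 1, 7, 8]])
# Out:
#      0
# 0 -7.0
# 1 -3.0
# 7  6.2
# 8  6.3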
Explanation:
method="slinear" - one of the method listed in pandas doc above that is passed to scipy interp1d (see e.g. this link)
fill_value="extrapolate" - pass any option allowed by scipy (here extrapolate which is exactly what you want)
limit_direction="both" - to get extrapolation in both direction (otherwise default would be set to "forward" in that case and you would see np.nan for the first two values)
I'm writing a teaching document that uses lots of examples of Python code and includes the resulting numeric output. I'm working from inside IPython and a lot of the examples use NumPy.
I want to avoid print statements, explicit formatting or type conversions. They clutter the examples and detract from the principles I'm trying to explain.
What I know:
From IPython I can use %precision to control the displayed precision of any float results.
I can use np.set_printoptions() to control the displayed precision of elements within a NumPy array.
What I'm looking for is a way to control the displayed precision of a NumPy float64 scalar which doesn't respond to either of the above. These get returned by a lot of NumPy functions.
>>> x = some_function()
Out[2]: 0.123456789
>>> type(x)
Out[3]: numpy.float64
>>> %precision 2
Out[4]: '%.2f'
>>> x
Out[5]: 0.123456789
>>> float(x) # that precision works for regular floats
Out[6]: 0.12
>>> np.set_printoptions(precision=2)
>>> x # but doesn't work for the float64
Out[8]: 0.123456789
>>> np.r_[x] # does work if it's in an array
Out[9]: array([0.12])
What I want is
>>> # some formatting command
>>> x = some_function() # that returns a float64 = 0.123456789
Out[2]: 0.12
but I'd settle for:
a way of telling NumPy to give me float scalars by default, rather than float64.
a way of telling IPython how to handle a float64, kind of like what I can do with _repr_pretty_ for my own classes.
IPython has formatters (core/formatters.py) which contain a dict that maps a type to a format method. There seems to be some knowledge of NumPy in the formatters but not for the np.float64 type.
There are a bunch of formatters, for HTML, LaTeX etc. but text/plain is the one for consoles.
We first get the IPython formatter for console text output:
plain = get_ipython().display_formatter.formatters['text/plain']
and then set a formatter for the float64 type. We use the same formatter that already exists for float, since it already knows about %precision:
plain.for_type(np.float64, plain.lookup_by_type(float))
Now
In [26]: a = float(1.23456789)
In [28]: b = np.float64(1.23456789)
In [29]: %precision 3
Out[29]: '%.3f'
In [30]: a
Out[30]: 1.235
In [31]: b
Out[31]: 1.235
While digging through the implementation I also found that %precision calls np.set_printoptions() with a suitable format string. I didn't know it did this, and it is potentially problematic if the user has already set print options themselves. Following the example above:
In [32]: c = np.r_[a, a, a]
In [33]: c
Out[33]: array([1.235, 1.235, 1.235])
we see it is doing the right thing for array elements.
I can do this formatter initialisation explicitly in my own code, but a better fix might be to modify IPython's core/formatters.py around line 677
@default('type_printers')
def _type_printers_default(self):
    d = pretty._type_pprinters.copy()
    d[float] = lambda obj,p,cycle: p.text(self.float_format%obj)
    # suggested "fix"
    if 'numpy' in sys.modules:
        d[numpy.float64] = lambda obj,p,cycle: p.text(self.float_format%obj)
    # end suggested fix
    return d
to also handle np.float64 when NumPy has been imported. I'm happy for feedback on this; if I feel brave I might submit a PR.
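In the meantime, a way to apply the two-line fix automatically is an IPython startup file. A minimal sketch (the filename is arbitrary, and this assumes you always want the override):

# save as e.g. ~/.ipython/profile_default/startup/50-float64-format.py
import numpy as np

# Reuse the existing float formatter (which respects %precision)
# for NumPy's float64 scalars.
plain = get_ipython().display_formatter.formatters['text/plain']
plain.for_type(np.float64, plain.lookup_by_type(float))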
NumPy's eigenvector solution differs from Wolfram Alpha and my personal calculation by hand.
>>> import numpy.linalg
>>> import numpy as np
>>> numpy.linalg.eig(np.array([[-2, 1], [2, -1]]))
(array([-3., 0.]), array([[-0.70710678, -0.4472136 ],
[ 0.70710678, -0.89442719]]))
Wolfram Alpha https://www.wolframalpha.com/input/?i=eigenvectors+%7B%7B-2,1%7D,%7B%2B2,-1%7D%7D and my personal calculation give the eigenvectors (-1, 1) and (2, 1). The NumPy solution however differs.
NumPy's calculated eigenvalues, however, are confirmed by Wolfram Alpha and my personal calculation.
So, is this a bug in NumPy or is my understanding of the math too simple? A similar thread, Numpy seems to produce incorrect eigenvectors, sees the main difference in rounding/scaling of the eigenvectors, but here the deviation between the solutions would be massive.
numpy.linalg.eig normalizes the eigenvectors and returns them as the columns of the second output:
eig_vectors = np.linalg.eig(np.array([[-2, 1], [2, -1]]))[1]
vec_1 = eig_vectors[:, 0]  # eigenvector for eigenvalue -3
vec_2 = eig_vectors[:, 1]  # eigenvector for eigenvalue 0
Now these two vectors are just normalized versions of the vectors you calculated, i.e.
print(vec_1 * np.sqrt(2))  # root 2 is the magnitude of [-1, 1]
print(vec_2 * np.sqrt(5))  # root 5 is the magnitude of [2, 1]
So, bottom line: both sets of calculations are equivalent up to scale and sign (any nonzero multiple of an eigenvector is still an eigenvector); NumPy just likes to normalize the results to unit length.
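A quick sanity check along these lines, as a minimal sketch: any rescaling (including a sign flip) of an eigenvector still satisfies A v = lambda v.

import numpy as np

A = np.array([[-2, 1], [2, -1]])
eigvals, eigvecs = np.linalg.eig(A)

# Each column of eigvecs is an eigenvector; any nonzero rescaling of it
# is still an eigenvector for the same eigenvalue.
for lam, v in zip(eigvals, eigvecs.T):
    for scale in (1.0, np.sqrt(2), -3.0):
        assert np.allclose(A @ (scale * v), lam * (scale * v))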
I used to run this code with no issue:
data_0 = data_0.replace([-1, 'NULL'], [None, None])
Now, after updating to pandas 0.21.1, the very same line of code gives me:
RecursionError: maximum recursion depth exceeded
Does anybody experience the same issue, and know how to solve it?
Note: rolling back to pandas 0.20.3 does the trick, but I think it's important to solve this with the latest version.
I think whether you get this error depends on what your input data is. Here's an example of input data where this works in the expected way:
import pandas as pd

data_0 = pd.DataFrame({'x': [-1, 1], 'y': ['NULL', 'foo']})
data_0.replace([-1, 'NULL'], [None, None])
replaces values of -1 and 'NULL' with None:
     x     y
0  NaN  None
1  1.0   foo
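If you do hit the recursion error on your data, a common workaround (an assumption here, since the failing input isn't shown) is to use np.nan as the replacement value instead of None:

import numpy as np
import pandas as pd

data_0 = pd.DataFrame({'x': [-1, 1], 'y': ['NULL', 'foo']})
# np.nan is pandas' native missing-value marker, so no None-to-NaN
# conversion is involved in the replacement:
data_0 = data_0.replace([-1, 'NULL'], np.nan)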
In the NumPy manual it is said:
Instead of specifying the full covariance matrix, popular approximations include:
Spherical covariance (cov is a multiple of the identity matrix)
Has anybody ever specified a spherical covariance? I am trying to make it work to avoid building the full covariance matrix, which consumes too much memory.
If you just have a diagonal covariance matrix, it is usually easier (and more efficient) to scale standard normal variates yourself instead of using multivariate_normal().
>>> import numpy as np
>>> stdevs = np.array([3.0, 4.0, 5.0])
>>> x = np.random.standard_normal([100, 3])
>>> x.shape
(100, 3)
>>> x *= stdevs  # scale each column by its standard deviation
>>> x.std(axis=0)
array([ 3.23973255, 3.40988788, 4.4843039 ])
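For a nonzero mean you can shift the scaled variates the same way; a small extension of the snippet above with an assumed example mean:

>>> mean = np.array([10.0, 20.0, 30.0])
>>> y = mean + stdevs * np.random.standard_normal([100, 3])
>>> y.mean(axis=0)  # approximately [10, 20, 30]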
While @RobertKern's approach is correct, you can let NumPy handle all of that for you, as np.random.normal will broadcast over multiple means and standard deviations:
>>> np.random.normal(0, [1,2,3])
array([ 0.83227999, 3.40954682, -0.01883329])
To get more than a single random sample, you have to give it an appropriate size:
>>> x = np.random.normal(0, [1, 2, 3], size=(1000, 3))
>>> np.std(x, axis=0)
array([ 1.00034817, 2.07868385, 3.05475583])
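For the strictly spherical case from the question (cov is a multiple of the identity matrix), this collapses to a single scalar scale. A minimal sketch with assumed example values:

>>> mean = np.array([1.0, -2.0, 0.5])  # assumed example mean
>>> sigma = 2.0                        # cov = sigma**2 * I
>>> samples = mean + sigma * np.random.standard_normal((1000, 3))
>>> samples.std(axis=0)  # approximately [2.0, 2.0, 2.0]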