I have some data frame in pandas, where the columns can be viewed as smooth functions of the index:
        f        g
x    -----------------
0.1   f(0.1)   g(0.1)
0.2   f(0.2)   g(0.2)
...
And I want to know the x value where f(x) = y for a given y, even though I don't necessarily have a sample point at the x I am looking for.
Essentially I want to find the intersection of a line and a data series in pandas. Is there a best way to do this?
Suppose your DataFrame looks something like this:
import numpy as np
import pandas as pd
def unknown_func(x):
    return -x ** 3 + 1
x = np.linspace(-10, 10, 100)
df = pd.DataFrame({'f': unknown_func(x)}, index=x)
then, using scipy, you could create an interpolation function:
import scipy.interpolate as interpolate
func = interpolate.interp1d(x, df['f'], kind='linear')
and then use a root finder to solve f(x)-y=0 for x:
import scipy.optimize as optimize
root = optimize.brentq(lambda x: func(x)-y, x.min(), x.max())
import numpy as np
import pandas as pd
import scipy.optimize as optimize
import scipy.interpolate as interpolate
def unknown_func(x):
    return -x ** 3 + 1
x = np.linspace(-10, 10, 100)
df = pd.DataFrame({'f': unknown_func(x)}, index=x)
y = 50
func = interpolate.interp1d(x, df['f'], kind='linear')
root = optimize.brentq(lambda x: func(x)-y, x.min(), x.max())
print(root)
# -3.6566397064
print(func(root))
# 50.0
idx = np.searchsorted(df.index.values, root)
print(df.iloc[idx-1:idx+1])
# f
# -3.737374 53.203496
# -3.535354 45.187410
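One caveat not covered above: brentq needs a sign change of func(x) - y over the bracket and returns only one root. If the data can cross the level y more than once, a common workaround is to bracket each sign change on the grid and root-find per interval. Here is a sketch reusing the variables from the example above (with this monotonic example it finds the same single root):
# Sketch: locate every grid interval where f(x) - y changes sign, then run brentq on each
resid = df['f'].values - y
crossings = np.where(np.diff(np.sign(resid)) != 0)[0]
roots = [optimize.brentq(lambda x_: func(x_) - y, x[i], x[i + 1]) for i in crossings]
print(roots)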
Notice that you need some model for your data. Above, the linear interpolator interp1d implicitly imposes a model for the unknown function that generated the data.
If you already have a model function (such as unknown_func), then you could use that instead of the func returned by interp1d. If you have a parametrized model function, then instead of interp1d you could use optimize.curve_fit to find the best-fitting parameters. And if you do choose to interpolate, there are many other kinds of interpolation (e.g. quadratic or cubic) you might use too. What to choose depends on what you think best models your data.
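For instance, here is a minimal sketch of the curve_fit route, continuing from the example above; the cubic model form is purely an assumption for illustration:
# Sketch (assumed model form), reusing x, df, y and the optimize import from above
def model(xx, a, b):
    return a * xx ** 3 + b

params, _ = optimize.curve_fit(model, x, df['f'].values)
root = optimize.brentq(lambda xx: model(xx, *params) - y, x.min(), x.max())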
I am trying to fit a polynomial expression to my function (signal). I am using the numpy.polynomial.polynomial.Polynomial.fit function to fit my function (signal) and obtain the coefficients. Now, after generating the coefficients, I want to put those coefficients back into the polynomial equation, get the corresponding y-values, and plot them on the graph. But I am not getting what I want (the orange line). What am I doing wrong here?
Thanks.
import math
def getYValueFromCoeff(f, coeff_list):  # low to high order
    y_plot_values = []
    for j in range(len(f)):
        item_list = []
        for i in range(len(coeff_list)):
            item = (coeff_list[i]) * ((f[j]) ** i)
            item_list.append(item)
        y_plot_values.append(sum(item_list))
    print(len(y_plot_values))
    return y_plot_values
from numpy.polynomial import Polynomial as poly
import numpy as np
import matplotlib.pyplot as plt
no_of_coef= 10
#original signal
x = np.linspace(0, 0.01, 10)
period = 0.01
y = np.sin(np.pi * x / period)
#poly fit
test1= poly.fit(x,y,no_of_coef)
coeffs= test1.coef
#print(test1.coef)
coef_y= getYValueFromCoeff(x, test1.coef)
#print(coef_y)
plt.plot(x,y)
plt.plot(x, coef_y)
If you check out the documentation, consider the two properties poly.domain and poly.window. To avoid numerical issues, the range poly.domain = [x.min(), x.max()] of the independent variable (x) that we pass to fit() is normalized to poly.window = [-1, 1]. This means the coefficients you get from poly.coef apply to this normalized range. But you can adjust this behaviour (sacrificing numerical stability) accordingly; that is, adjusting poly.window will make your curves match:
...
test1 = poly.fit(x, y, deg=no_of_coef, window=[x.min(), x.max()])
...
But unless you have a good reason to do that, I'd stick to the default behaviour of fit().
As a side note: Evaluating polynomials or lists of coefficients is already implemented in numpy, e.g. using directly
coef_y = test1(x)
or alternatively using np.polyval.
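If you do want coefficients in the unscaled x variable (for example to feed np.polyval), one option, assuming you keep the default fit() behaviour, is to convert the series back to the identity domain first; a short sketch continuing from the code above:
# test1.convert() re-expresses the polynomial in plain x (coefficients low to high)
plain_coeffs = test1.convert().coef
# np.polyval expects the highest power first, so reverse the order before evaluating
coef_y = np.polyval(plain_coeffs[::-1], x)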
I always like to see original solutions to problems. I urge you to continue to pursue that as that is the best way to learn how to fit functions programmatically. I also wanted to provide the solution that is much more tailored towards a standard numpy implementation. As for your custom function, you did really well. The only issue is that the coefficients are from high to low order, while you were counting up in powers from 0 to highest power. Simply counting down from highest power to 0, allows your function to give the correct result. Notice how your function overlays perfectly with the numpy polyval.
import numpy as np
import matplotlib.pyplot as plt
def getYValueFromCoeff(f, coeff_list):  # high to low order (np.polyfit convention)
    y_plot_values = []
    for j in range(len(f)):
        item_list = []
        for i in range(len(coeff_list)):
            item = (coeff_list[i]) * ((f[j]) ** (len(coeff_list) - i - 1))
            item_list.append(item)
        y_plot_values.append(sum(item_list))
    print(len(y_plot_values))
    return y_plot_values
no_of_coef = 10
#original signal
x = np.linspace(0, 0.01, 10)
period = 0.01
y = np.sin(np.pi * x / period)
#poly fit
coeffs = np.polyfit(x,y,no_of_coef)
coef_y = np.polyval(coeffs,x)
COEF_Y = getYValueFromCoeff(x,coeffs)
plt.figure()
plt.plot(x,y)
plt.plot(x, coef_y)
plt.plot(x, COEF_Y)
plt.legend(['Original Function', 'Fitted Function', 'Custom Fitting'])
plt.show()
Output: (plot not shown; the fitted and custom curves overlay the original as described above)
Here's the simple way of doing it if you didn't know that already...
import math
from numpy.polynomial import Polynomial as poly
import numpy as np
import matplotlib.pyplot as plt
no_of_coef= 10
#original signal
x = np.linspace(0, 0.01, 10)
period = 0.01
y = np.sin(np.pi * x / period)
#poly fit
test1= poly.fit(x,y,no_of_coef)
plt.plot(x, y, 'r', label='original y')
x = np.linspace(0, 0.01, 1000)
plt.plot(x, test1(x), 'b', label='y_fit')
plt.legend()
plt.show()
So this may be hard to explain because it's a chunk of some really large code - I don't expect it to be reproducible.
But essentially it's a simulation which (using multiple simulated datasets) creates a one-way or two-way regression and calculates the respective t-values and p-values for them.
However, for some of the datasets (with the same information and no missing values), statsmodels.formula.api.ols(...).fit() returns the pvals / tvals as a pandas Series instead of a numpy array (even for one-way studies).
Could someone please explain why / if there is a way to specify the output?
An example dataframe looks like this: (x0-x187 is our y, genotype and treatment are the desired factors, staging is a factor used for normalisation)
                         x0        x1        ...  treatment  genotype
200926_ku20_e1_wt_veh    0.075821  0.012796  ...  veh        wt
201210_ku25_e7_wt_veh    0.082307  0.007596  ...  veh        wt
201127_ku55_e6_wt_veh    0.083049  0.008978  ...  veh        wt
201220_ku52_e2_wt_veh    0.078414  0.013488  ...  veh        wt
...                      ...       ...       ...  ...        ...
210913_b6ku_22297_e5_wt  0.067858  0.008081  ...  treat      wt
210821_b6ku_3_e5_wt      0.070417  0.012396  ...  treat      wt
And then the code:
import subprocess as sub
import os
import struct
from pathlib import Path
import tempfile
from typing import Tuple
import shutil
from logzero import logger as logging
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

for col in range(data.shape[1]):
    if not df[f'x{col}'].any():
        p = np.nan
        t = np.nan
    else:
        if two_way:
            # two-way model - if it's just the geno or treat comparison, the one-factor col will be ignored
            # for some simulations smf is returning a Series.
            fit = smf.ols(formula=f'x{col} ~ genotype * treatment + staging', data=df, missing='drop').fit()
            # get all pvals except intercept and staging
            p = fit.pvalues[~fit.pvalues.index.isin(['Intercept', 'staging'])]
            t = fit.tvalues[~fit.tvalues.index.isin(['Intercept', 'staging'])]
        else:
            fit = smf.ols(formula=f'x{col} ~ genotype + staging', data=df, missing='drop').fit()
            p = fit.pvalues['genotype[T.wt]']
            t = fit.tvalues['genotype[T.wt]']
    pvals.append(p)
    tvals.append(t)

p_all = np.array(pvals)
print("example", p_all[0])
print(type(p_all[0][0]), p_all[0][0])
And finally some output:
Desired output:
example [1.63688492e-01 6.05907115e-06 7.70710934e-02]
<class 'numpy.float64'> 0.16368849176977607
"Error" output:
example genotype[T.wt]                     0.862423
treatment[T.veh]                           0.000177
genotype[T.wt]:treatment[T.veh]            0.522066
dtype: float64
<class 'numpy.float64'> 0.8624226150886212
I've manually corrected the data but I would rather not have to do dumb fixes in the future.
I would like to calculate the rolling exponentially weighted mean with df.rolling().mean(). I get stuck at win_type='exponential'.
I have tried other win_types such as 'gaussian', which works, but I think 'exponential' must work a little differently.
dfTemp.rolling(window=21, min_periods=10, win_type='gaussian').mean(std=1)
# works fine
but when it comes to 'exponential',
dfTemp.rolling(window=21, min_periods=10, win_type='exponential').mean(tau=10)
# ValueError: The 'exponential' window needs one or more parameters -- pass a tuple.
How do I use win_type='exponential'? Thanks~~~
I faced the same issue and asked it on the Russian SO, and got the following answer:
x.rolling(window=(2,10), min_periods=1, win_type='exponential').mean(std=0.1)
You should pass the tau value directly via the window=(2, 10) parameter, where 10 is the value of tau.
I hope it will help! Thanks to @MaxU
You can easily implement any kind of window by definining your kernel function.
Here's an example for a backward-looking exponential average:
import pandas as pd
import numpy as np
# Kernel function (backward-looking exponential)
def K(x):
    return np.exp(-np.abs(x)) * np.where(x <= 0, 1, 0)

# Exponential average function
def exp_average(values):
    N = len(values)
    exp_weights = list(map(K, np.arange(-N, 0) / N))
    return values.dot(exp_weights) / N

# Create a sample DataFrame (pd.Timestamp instead of the deprecated pd.datetime)
df = pd.DataFrame({
    'date': [pd.Timestamp(2020, 1, 1)] * 50 + [pd.Timestamp(2020, 1, 2)] * 50,
    'x': np.random.randn(100)
})

# Finally, compute the exponential moving average using `rolling` and `apply`
df['mu'] = df.groupby(['date'])['x'].rolling(5).apply(exp_average, raw=True).values
df.head(10)
Notice that, if N is fixed, you can significantly reduce the execution time by keeping the weights constant:
N = 10
exp_weights = list(map(K, np.arange(-N, 0) / N))

def exp_average(values):
    return values.dot(exp_weights) / N
Short answer: you should pass tau to the applied function, e.g., rolling(d, win_type='exponential').sum(tau=10). Note that the mean function does not respect the exponential window as expected, so you may need to use sum(tau=10)/window_size to calculate the exponential mean. This is a bug in the current version of pandas (1.0.5).
Full example:
# To calculate the rolling exponential mean
import numpy as np
import pandas as pd
window_size = 10
tau = 5
a = pd.Series(np.random.rand(100))
rolling_mean_a = a.rolling(window_size, win_type='exponential').sum(tau=tau) / window_size
The answer of #Илья Митусов is not correct. With pandas 1.0.5, running the following code raises ValueError: exponential window requires tau:
import pandas as pd
import numpy as np
pd.Series(np.arange(10)).rolling(window=(4, 10), min_periods=1, win_type='exponential').mean(std=0.1)
This code has many problems. First, the 10 in window=(4, 10) is not tau, and will lead to wrong answers. Second, the exponential window does not need the std parameter -- only the gaussian window does. Last, tau should be provided to mean (although mean does not respect the win_type).
Assume a dataframe df with a single column (say latency, i.e. a univariate sample). The exceedance function is calculated and plotted as follows:
sorted_df = df.sort_values('latency')
samples = len(sorted_df)
exceedance = [1-(x/samples) for x in range(1, samples + 1)]
ax.plot(sorted_df['latency'], exceedance, 'o')
Is there a simpler/more elegant way to calculate and plot the exceedance function of a univariate sample using seaborn (maybe distplot)? I recently learnt to use seaborn's distplot function, but I can only plot the cdf as follows:
sns.distplot(df['latency'], hist=False, kde_kws={'cumulative':True})
I'm specifically interested in seaborn because I plan to use this function along with Seaborn.FacetGrid to get an exceedance plot for several factors.
Because you asked for a more elegant way, the following saves you two lines of code and is faster.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
def plot_exceedance(data, **kwargs):
    sorted_data = data.sort_values()
    exceedance = 1. - np.arange(1., len(sorted_data) + 1.) / len(sorted_data)
    plt.plot(sorted_data, exceedance, **kwargs)
g = sns.FacetGrid(df, row='factorA',col='factorB',hue='factorC')
g.map(plot_exceedance, 'latency')
There is no predefined API/parameter to calculate exceedance, so I had to use the code listed above. But considering that I was specifically interested in getting an exceedance plot across several factors and that I could use plt.plot along with seaborn.FacetGrid, the following piece of code worked.
def plot_exceedance(data, **kwargs):
    sorted_df = data.sort_values()
    samples = len(sorted_df)
    exceedance = [1 - (x / samples) for x in range(1, samples + 1)]
    ax = plt.gca()
    ax.plot(sorted_df, exceedance, **kwargs)
g = sns.FacetGrid(df, row='factorA',col='factorB',hue='factorC')
g.map(plot_exceedance, 'latency')
where factorA, factorB and factorC are additional columns in df.
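As a side note beyond the original answers: if you can rely on seaborn 0.11 or newer, ecdfplot (and displot with kind='ecdf') accepts a complementary flag that plots 1 - ECDF, which is exactly the exceedance. A minimal sketch, assuming that version and the same column names:
import seaborn as sns
# Assumes seaborn >= 0.11; complementary=True plots 1 - ECDF (the exceedance)
g = sns.displot(data=df, x='latency', kind='ecdf', complementary=True,
                row='factorA', col='factorB', hue='factorC')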
I have some categorical features in my data along with continuous ones. Is it a good or absolutely bad idea to one-hot encode the categorical features in order to find their correlation with the labels, along with the other continuous features?
There is a way to calculate a correlation coefficient without one-hot encoding the categorical variable. Cramér's V statistic is one method for calculating the correlation between categorical variables, and it can be computed as follows. The following link is helpful: Using pandas, calculate Cramér's coefficient matrix. For variables with continuous values, you can categorize them first using pandas' cut.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns
print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0
tips = sns.load_dataset("tips")
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
np.arange(0, 55, 5),
include_lowest=True,
right=False)
def cramers_v(confusion_matrix):
""" calculate Cramers V statistic for categorial-categorial association.
uses correction from Bergsma and Wicher,
Journal of the Korean Statistical Society 42 (2013): 323-328
"""
chi2 = ss.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum()
phi2 = chi2 / n
r, k = confusion_matrix.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221
confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
Please note that .as_matrix() has been deprecated in pandas since version 0.23.0; use .values instead.
I found the phik library quite useful for calculating correlation between categorical and interval features. It is also useful for binning numerical features. Try it once: phik documentation
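A minimal usage sketch, assuming the phik package is installed (importing it registers a DataFrame accessor) and using placeholder column names:
import pandas as pd
import phik  # noqa: F401 -- registers the .phik_matrix() accessor on DataFrames
# df holds a mix of categorical and numeric columns; interval_cols marks the
# columns phik should treat as continuous and bin automatically.
corr_matrix = df.phik_matrix(interval_cols=['some_numeric_column'])
print(corr_matrix)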
I was looking to do the same thing in BigQuery.
For numeric features you can use the built-in CORR(x, y) function.
For categorical features, you can calculate it as:
cardinality(cat1 x cat2) / max(cardinality(cat1), cardinality(cat2)).
Which translates to following SQL:
SELECT
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...
Higher number means lower correlation.
I used the following Python script to generate the SQL:
import itertools

arr = range(1, 10)
query = ',\n'.join(
    'COUNT(DISTINCT(CONCAT({a}, {b}))) / GREATEST(COUNT(DISTINCT({a})), COUNT(DISTINCT({b}))) as cat{a}_{b}'.format(a=a, b=b)
    for (a, b) in itertools.combinations(arr, 2))
query = 'SELECT \n ' + query + '\n FROM `...`;'
print(query)
It should be straightforward to do the same thing in numpy or pandas.
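For reference, here is a rough pandas sketch of the same cardinality ratio (the DataFrame and column names are placeholders):
import itertools
import pandas as pd

def cardinality_ratio(df, a, b):
    # distinct (a, b) pairs divided by the larger single-column cardinality;
    # a ratio near 1 means one column nearly determines the other (high association)
    n_pairs = df[[a, b]].drop_duplicates().shape[0]
    return n_pairs / max(df[a].nunique(), df[b].nunique())

cat_cols = ['cat1', 'cat2', 'cat3']  # placeholder column names in a placeholder df
ratios = {(a, b): cardinality_ratio(df, a, b)
          for a, b in itertools.combinations(cat_cols, 2)}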