How to use df.rolling(window, min_periods, win_type='exponential').sum() - pandas

I would like to calculate the rolling exponentially weighted mean with df.rolling().mean(). I get stuck at the win_type = 'exponential'.
I have tried other *win_types such as 'gaussian'. I think there would be sth a little different from 'exponential'.
dfTemp.rolling(window=21, min_periods=10, win_type='gaussian').mean(std=1)
# works fine
but when it comes to 'exponential',
dfTemp.rolling(window=21, min_periods=10, win_type='exponential').mean(tau=10)
# ValueError: The 'exponential' window needs one or more parameters -- pass a tuple.
How to use win_type='exponential'... Thanks~~~

I faced same issue and asked it on Russian SO:
Got the following answer:
x.rolling(window=(2,10), min_periods=1, win_type='exponential').mean(std=0.1)
You should pass tau value to window=(2, 10) parameter directly where 10 is a value for tau.
I hope it will help! Thanks to #MaxU

You can easily implement any kind of window by definining your kernel function.
Here's an example for a backward-looking exponential average:
import pandas as pd
import numpy as np
# Kernel function ( backward-looking exponential )
def K(x):
return np.exp(-np.abs(x)) * np.where(x<=0,1,0)
# Exponenatial average function
def exp_average(values):
N = len(values)
exp_weights = list(map(K, np.arange(-N,0) / N ))
return values.dot(exp_weights) / N
# Create a sample DataFrame
df = pd.DataFrame({
'date': [pd.datetime(2020,1,1)]*50 + [pd.datetime(2020,1,2)]*50,
'x' : np.random.randn(100)
})
# Finally, compute the exponenatial moving average using `rolling` and `apply`
df['mu'] = df.groupby(['date'])['x'].rolling(5).apply(exp_average, raw=True).values
df.head(10)
Notice that, if N is fixed, you can significantly reduce the execution time by keeping the weights constant:
N = 10
exp_weights = list(map(K, np.arange(-N,0) / N ))
def exp_average(values):
return values.dot(exp_weights) / N

Short answer: you should use pass tau to the applied function, e.g., rolling(d, win_type='exponential').sum(tau=10). Note that the mean function does not respect the exponential window as expected, so you may need to use sum(tau=10)/window_size to calculate the exponential mean. This is a BUG of current version Pandas (1.0.5).
Full example:
# To calculate the rolling exponential mean
import numpy as np
import pandas as pd
window_size = 10
tau = 5
a = pd.Series(np.random.rand(100))
rolling_mean_a = a.rolling(window_size, win_type='exponential').sum(tau=tau) / window_size
The answer of #Илья Митусов is not correct. With pandas 1.0.5, running the following code raises ValueError: exponential window requires tau:
import pandas as pd
import numpy as np
pd.Series(np.arange(10)).rolling(window=(4, 10), min_periods=1, win_type='exponential').mean(std=0.1)
This code has many problems. First, the 10 in window=(4, 10) is not tau, and will lead to wrong answers. Second, exponential window does not need the parameter std -- only gaussian window needs. Last, the tau should be provided to mean (although mean does not respect the win_type).

Related

how is pandas kurtosis defined?

I am trying to get kurtosis using pandas. By doing some exploration, I have
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
however, the output is:
-0.006755982906479385
But I think the kurtosis (https://en.wikipedia.org/wiki/Kurtosis) should be close to (maybe normalize over N-1 instead of N, but this does not matter here)
(test_series - test_series.mean()).pow(4).mean()/np.power(test_series.std(),4)
which is
2.9908543104146026
The pandas documentation says the following
Return unbiased kurtosis over requested axis using Fisher’s definition of kurtosis (kurtosis of normal == 0.0)
This is probably the excess kurtosis, defined as kurtosis - 3.
Pandas is calculating the UNBIASED estimator of the excess Kurtosis. Kurtosis is the normalized 4th central moment. To find the unbiased estimators of the cumulants you need the k-statistics.
So the unbiased estimator of kurtosis is (k4/k2**2)
To illustrate this:
import pandas as pd
import numpy as np
np.random.seed(11234)
test_series = pd.Series(np.random.randn(5000))
test_series.kurtosis()
#-0.0411811269445872
Now we can calculate this explicitly using the k-statistics:
n = len(test_series)
S1 = test_series.pow(1).sum()
S2 = test_series.pow(2).sum()
S3 = test_series.pow(3).sum()
S4 = test_series.pow(4).sum()
# Eq (7) and (5) from the k-statistics link
k4 = (-6*S1**4 + 12*n*S1**2*S2 - 3*n*(n-1)*S2**2 -4*n*(n+1)*S1*S3 + n**2*(n+1)*S4)/(n*(n-1)*(n-2)*(n-3))
k2 = (n*S2-S1**2)/(n*(n-1))
# k2 is the same as the N-1 variance: test_series.std(ddof=1)**2
k4/k2**2
#-0.04118112694458816
If you want better agreement to more decimal places, you'll need to be careful with the sums as they get rather large. But they're identical to 12 places.

Categorical features correlation

I have some categorical features in my data along with continuous ones. Is it a good or absolutely bad idea to hot encode category features to find correlation of it to labels along with other continuous creatures?
There is a way to calculate the correlation coefficient without one-hot encoding the category variable. Cramers V statistic is one method for calculating the correlation of categorical variables. It can be calculated as follows. The following link is helpful. Using pandas, calculate Cramér's coefficient matrix For variables with other continuous values, you can categorize by using cut of pandas.
import numpy as np
import pandas as pd
import scipy.stats as ss
import seaborn as sns
print('Pandas version:', pd.__version__)
# Pandas version: 1.3.0
tips = sns.load_dataset("tips")
tips["total_bill_cut"] = pd.cut(tips["total_bill"],
np.arange(0, 55, 5),
include_lowest=True,
right=False)
def cramers_v(confusion_matrix):
""" calculate Cramers V statistic for categorial-categorial association.
uses correction from Bergsma and Wicher,
Journal of the Korean Statistical Society 42 (2013): 323-328
"""
chi2 = ss.chi2_contingency(confusion_matrix)[0]
n = confusion_matrix.sum()
phi2 = chi2 / n
r, k = confusion_matrix.shape
phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
rcorr = r - ((r-1)**2)/(n-1)
kcorr = k - ((k-1)**2)/(n-1)
return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
confusion_matrix = pd.crosstab(tips["day"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[2]: 0.9386619340722221
confusion_matrix = pd.crosstab(tips["total_bill_cut"], tips["time"])
cramers_v(confusion_matrix.values)
# Out[3]: 0.1649870749498837
please note the .as_matrix() is deprecated in pandas since verison 0.23.0 . use .values instead
I found phik library quite useful in calculating correlation between categorical and interval features. This is also useful for binning numerical features. Try this once: phik documentation
I was looking to do same thing in BigQuery.
For numeric features you can use built in CORR(x,y) function.
For categorical features, you can calculate it as:
cardinality (cat1 x cat2) / max (cardinality(cat1), cardinality(cat2).
Which translates to following SQL:
SELECT
COUNT(DISTINCT(CONCAT(cat1, cat2))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat2))) as cat1_2,
COUNT(DISTINCT(CONCAT(cat1, cat3))) / GREATEST (COUNT(DISTINCT(cat1)), COUNT(DISTINCT(cat3))) as cat1_3,
....
FROM ...
Higher number means lower correlation.
I used following python script to generate SQL:
import itertools
arr = range(1,10)
query = ',\n'.join(list('COUNT(DISTINCT(CONCAT({a}, {b}))) / GREATEST (COUNT(DISTINCT({a})), COUNT(DISTINCT({b}))) as cat{a}_{b}'.format(a=a,b=b)
for (a,b) in itertools.combinations(arr,2)))
query = 'SELECT \n ' + query + '\n FROM `...`;'
print (query)
It should be straightforward to do same thing in numpy.

The random number generator in numpy

I am using the numpy.random.randnand numpy.random.randto generate random numbers. I am confusing about the difference between random.randn and random.rand?
The main difference between the two is mentioned in the docs. Links to Doc rand and Doc randn
For numpy.rand, you get random values generated from a uniform distribution within 0 - 1
But for numpy.randn you get random values generated from a normal distribution, with mean 0 and variance 1.
Just a small example.
>>> import numpy as np
>>> np.random.rand(10)
array([ 0.63067838, 0.61371053, 0.62025104, 0.42751699, 0.22862483,
0.75287427, 0.90339087, 0.06643259, 0.17352284, 0.58213108])
>>> np.random.randn(10)
array([ 0.19972981, -0.35193746, -0.62164336, 2.22596365, 0.88984545,
-0.28463902, 1.00123501, 1.76429108, -2.5511792 , 0.09671888])
>>>
As you can see that rand gives me values within 0-1,
whereas randn gives me values with mean == 0 and variance == 1
To explain further, let me generate a large enough sample:
>>> a = np.random.rand(100)
>>> b = np.random.randn(100)
>>> np.mean(a)
0.50570149531258946
>>> np.mean(b)
-0.010864958465191673
>>>
you can see that the mean of a is close to 0.50, which was generated using rand. The mean of b on the other hand is close to 0.0, which was generated using randn
You can also get a conversion from rand numbers to randn numbers in Python by the application of percent point function (ppf) for the Normal Distribution with random variables distributed ~ N(0,1). It is a well-known method of projecting any uniform random variables (0,1) onto ppf in order to get random variables for a desired cumulative distribution.
In Python we can visualize that process as follows:
from numpy.random import rand
import matplotlib.pyplot as plt
from scipy.stats import norm
u = rand(100000) # uniformly distributed rvs
z = norm.ppf(u) # ~ N(0,1) rvs
plt.hist(z,bins=100)
plt.show()

pandas: finding the root of a function

I have some data frame in pandas, where the columns can be viewed as smooth functions of the index:
f g
x ------------
0.1 f(0.1) g(0.1)
0.2 f(0.2) g(0.2)
...
And I want to know the x value for some f(x) = y -- where y is a given, and I don't necessarily have a point at the x that I am looking for.
Essentially I want to find the intersection of a line and a data series in pandas. Is there a best way to do this?
Suppose your DataFrame looks something like this:
import numpy as np
import pandas as pd
def unknown_func(x):
return -x ** 3 + 1
x = np.linspace(-10, 10, 100)
df = pd.DataFrame({'f': unknown_func(x)}, index=x)
then, using scipy, you could create an interpolation function:
import scipy.interpolate as interpolate
func = interpolate.interp1d(x, df['f'], kind='linear')
and then use a root finder to solve f(x)-y=0 for x:
import scipy.optimize as optimize
root = optimize.brentq(lambda x: func(x)-y, x.min(), x.max())
import numpy as np
import pandas as pd
import scipy.optimize as optimize
import scipy.interpolate as interpolate
def unknown_func(x):
return -x ** 3 + 1
x = np.linspace(-10, 10, 100)
df = pd.DataFrame({'f': unknown_func(x)}, index=x)
y = 50
func = interpolate.interp1d(x, df['f'], kind='linear')
root = optimize.brentq(lambda x: func(x)-y, x.min(), x.max())
print(root)
# -3.6566397064
print(func(root))
# 50.0
idx = np.searchsorted(df.index.values, root)
print(df.iloc[idx-1:idx+1])
# f
# -3.737374 53.203496
# -3.535354 45.187410
Notice that you need some model for your data. Above, the linear interpolator,
interp1d is implicitly imposing a model for the unknown function that
generated the data.
If you already have a model function (such as unknown_func), then you could use that instead of the func returned by interp1d. If
you have a parametrized model function, then instead of interp1d you could use
optimize.curve_fit to find the best fitting parameters. And if you do choose
to interpolate, there are many other choices (e.g. quadratic or cubic
interpolation) for interpolation which you might use too. What to choose depends on what you think best models your data.

inverse of FFT not the same as original function

I don't understand why the ifft(fft(myFunction)) is not the same as my function. It seems to be the same shape but a factor of 2 out (ignoring the constant y-offset). All the documentation I can see says there is some normalisation that fft doesn't do, but that ifft should take care of that. Here's some example code below - you can see where I've bodged the factor of 2 to give me the right answer. Thanks for any help - its driving me nuts.
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
# get FFT
myfft = fftp.fft(y, n)
# kill higher freqs above wavenumber wn
myfft[wn:] = 0
# make new series
y2 = fftp.ifft(myfft).real
# find constant y offset
myfft[1:]=0
c = fftp.ifft(myfft)[0]
# remove c, apply factor of 2 and re apply c
y2 = (y2-c)*2 + c
plt.figure(num=None)
plt.plot(x, y, x, y2)
plt.show()
if __name__=='__main__':
x = np.array([float(i) for i in range(0,360)])
y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
fourier_series(x, y, 3, 360)
You're removing half the spectrum when you do myfft[wn:] = 0. The negative frequencies are those in the top half of the array and are required.
You have a second fudge to get your results which is taking the real part to find y2: y2 = fftp.ifft(myfft).real (fftp.ifft(myfft) has a non-negligible imaginary part due to the asymmetry in the spectrum).
Fix it with myfft[wn:-wn] = 0 instead of myfft[wn:] = 0, and remove the fudges. So the fixed code looks something like:
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
# get FFT
myfft = fftp.fft(y, n)
# kill higher freqs above wavenumber wn
myfft[wn:-wn] = 0
# make new series
y2 = fftp.ifft(myfft)
plt.figure(num=None)
plt.plot(x, y, x, y2)
plt.show()
if __name__=='__main__':
x = np.array([float(i) for i in range(0,360)])
y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
fourier_series(x, y, 3, 360)
It's really worth paying attention to the interim arrays that you are creating when trying to do signal processing. Invariably, there are clues as to what is going wrong that should direct you to the problem. In this case, you taking the real part masked the problem and made your task more difficult.
Just to add another quick point: Sometimes taking the real part of the resultant array is exactly the correct thing to do. It's often the case that you end up with an imaginary part to the signal output which is just down to numerical errors in the input to the inverse FFT. Typically this manifests itself as very small imaginary values, so taking the real part is basically the same array.
You are killing the negative frequencies between 0 and -wn.
I think what you mean to do is to set myfft to 0 for all frequencies outside [-wn, wn].
Change the following line:
myfft[wn:] = 0
to:
myfft[wn:-wn] = 0