polynomial fitting of a signal and plotting the fitted signal - numpy

I am trying to use a polynomial expression that would fit my function (signal). I am using numpy.polynomial.polynomial.Polynomial.fit function to fit my function(signal) using the coefficients. Now, after generating the coefficients, I want to put those coefficients back into the polynomial equation - get the corresponding y-values - and plot them on the graph. But I am not getting what I want (orange line) . What am I doing wrong here?
Thanks.
import math
def getYValueFromCoeff(f,coeff_list): # low to high order
y_plot_values=[]
for j in range(len(f)):
item_list= []
for i in range(len(coeff_list)):
item= (coeff_list[i])*((f[j])**i)
item_list.append(item)
y_plot_values.append(sum(item_list))
print(len(y_plot_values))
return y_plot_values
from numpy.polynomial import Polynomial as poly
import numpy as np
import matplotlib.pyplot as plt
no_of_coef= 10
#original signal
x = np.linspace(0, 0.01, 10)
period = 0.01
y = np.sin(np.pi * x / period)
#poly fit
test1= poly.fit(x,y,no_of_coef)
coeffs= test1.coef
#print(test1.coef)
coef_y= getYValueFromCoeff(x, test1.coef)
#print(coef_y)
plt.plot(x,y)
plt.plot(x, coef_y)

If you check out the documentation, consider the two properties: poly.domain and poly.window. To avoid numerical issues, the range poly.domain = [x.min(), x.max()] of independent variable (x) that we pass to the fit() is being normalized to poly.window = [-1, 1]. This means the coefficients you get from poly.coef apply to this normalized range. But you can adjust this behaviour (sacrificing numerical stability) accordingly, that is, adjustig the poly.window will make your curves match:
...
test1 = poly.fit(x, y, deg=no_of_coef, window=[x.min(), x.max()])
...
But unless you have a good reason to do that, I'd stick to the default behaviour of fit().
As a side note: Evaluating polynomials or lists of coefficients is already implemented in numpy, e.g. using directly
coef_y = test1(x)
or alternatively using np.polyval.

I always like to see original solutions to problems. I urge you to continue to pursue that as that is the best way to learn how to fit functions programmatically. I also wanted to provide the solution that is much more tailored towards a standard numpy implementation. As for your custom function, you did really well. The only issue is that the coefficients are from high to low order, while you were counting up in powers from 0 to highest power. Simply counting down from highest power to 0, allows your function to give the correct result. Notice how your function overlays perfectly with the numpy polyval.
import numpy as np
import matplotlib.pyplot as plt
def getYValueFromCoeff(f,coeff_list): # low to high order
y_plot_values=[]
for j in range(len(f)):
item_list= []
for i in range(len(coeff_list)):
item= (coeff_list[i])*((f[j])**(len(coeff_list)-i-1))
item_list.append(item)
y_plot_values.append(sum(item_list))
print(len(y_plot_values))
return y_plot_values
no_of_coef = 10
#original signal
x = np.linspace(0, 0.01, 10)
period = 0.01
y = np.sin(np.pi * x / period)
#poly fit
coeffs = np.polyfit(x,y,no_of_coef)
coef_y = np.polyval(coeffs,x)
COEF_Y = getYValueFromCoeff(x,coeffs)
plt.figure()
plt.plot(x,y)
plt.plot(x, coef_y)
plt.plot(x, COEF_Y)
plt.legend(['Original Function', 'Fitted Function', 'Custom Fitting'])
plt.show()
Output

Here's the simple way of doing it if you didn't know that already...
import math
from numpy.polynomial import Polynomial as poly
import numpy as np
import matplotlib.pyplot as plt
no_of_coef= 10
#original signal
x = np.linspace(0, 0.01, 10)
period = 0.01
y = np.sin(np.pi * x / period)
#poly fit
test1= poly.fit(x,y,no_of_coef)
plt.plot(x, y, 'r', label='original y')
x = np.linspace(0, 0.01, 1000)
plt.plot(x, test1(x), 'b', label='y_fit')
plt.legend()
plt.show()

Related

statsmodels: IntegrationWarning: The maximum number of subdivisions (50) has been achieved

Trying to plot a CDF with seaborns, then encountered this error:
../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The maximum number of subdivisions (50) has been achieved.
If increasing the limit yields no improvement it is advised to analyze
the integrand in order to determine the difficulties. If the position of a
local difficulty can be determined (singularity, discontinuity) one will
probably gain from splitting up the interval and calling the integrator
on the subranges. Perhaps a special-purpose integrator should be used.
args=endog)[0] for i in range(1, gridsize)]
Some minutes after pressing the return key
../venv/lib/python3.7/site-packages/statsmodels/nonparametric/kde.py:178: IntegrationWarning: The integral is probably divergent, or slowly convergent.
args=endog)[0] for i in range(1, gridsize)]
Code:
plt.figure()
plt.title('my distribution')
plt.ylabel('CDF')
plt.xlabel('x-labelled')
sns.kdeplot(data,cumulative=True)
plt.show()
If it could be of help:
print(len(data))
4360700
Sample data:
print(data[:10])
[ 0.00362846 0.00123409 0.00013711 -0.00029235 0.01515175 0.02780404
0.03610236 0.03410224 0.03887933 0.0307084 ]
Have no idea what the subdivisions are, is there a way to increase it?
A kde plot is created by summing one gaussian bell shape for every data point. Summing 4 million curves will create memory and performance problems, which might cause come functions to fail. The exact error message can be very cryptic.
The easiest way to work around the problem, is to subsample the data, as for a more or less smooth distribution the kde (and the cumultative kde or cdf) will look very similar whether the data is subsampled or not. Subsampling every 100th entry is easy using slicing data[::100].
Alternatively, with that many data, the "real" cdf can be drawn by plotting the sorted data versus N evenly spaced numbers from 0 to 1. (Where N is the number of data points.)
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=True, color='g', label='cumulative kde')
q = np.linspace(0, 1, data.size)
data.sort()
plt.plot(data, q, ':r', lw=2, label='cdf from sorted data')
plt.legend()
plt.show()
Note that in a similar, though slightly more involved, way you can draw a "more honest" kde given the differences of a large enough array of sorted data. np.interp interpolates the quantiles to a regularly spaced x-axis. As the raw differences are rather jaggy, some smoothing is needed.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import statsmodels.api as sm
N = 1000000
data = np.random.normal(np.repeat(np.random.uniform(10, 20, 10), N // 10), 1)
sns.kdeplot(data[::100], cumulative=False, color='g', label='kde')
p = np.linspace(0, 1, data.size)
data.sort()
x = np.linspace(data.min(), data.max(), 1000)
y = np.interp(x, data, p)
# use lowess filter to smoothen the curve
lowess = sm.nonparametric.lowess(np.diff(y) * 1000 / (data.max() - data.min()), (x[:-1] + x[1:]) / 2, frac=0.05)
plt.plot(lowess[:, 0], lowess[:, 1], '-r', label='smoothed diff of sorted data')
# plt.plot((x[:-1]+x[1:])/2,
# np.convolve(np.diff(y), np.ones(20)/20, mode='same')*1000/(data.max() - data.min()),
# label='test np.diff')
plt.legend()
plt.show()

How can I use matplotlib ticklabel_format to not use scientific notation on y axis labels

I'm creating a plot (in a colab worksheet) and want the y tick labels to not use scientific notation. The ticklabel_format doesn't make any difference to the final graph. The y axis labels are still shown as 10^3 instead of 1000. How do I format the y tick labels to not use scientific notation?
Here is my code
import matplotlib.pyplot as plt
plt.ticklabel_format(style='plain', axis='y')
plt.plot(Cd_rank,Cd_raw,linewidth=4)
plt.plot(Cd_rank,Cd_sed,linewidth=4)
plt.plot(Cd_rank,Cd_filter,linewidth=4)
plt.plot([0,1],[0.3,0.3],linewidth=4)
plt.plot([0,1],[5,5],linewidth=4)
plt.ylabel('Turbidez (UTN)')
plt.xlabel('Datos ordenados')
plt.yscale('log')
plt.legend(['Agua cruda','Decantada','Filtrada','Norma EPA','Norma ENACAL'])
The ScalarFormatter shows the tick labels in a default format. Note that depending on your concrete situation, matplotlib still might be using scientific notation:
When the numbers are too high (default this is about 4 digits). set_powerlimits((n, m)) can be used to change the limits.
In case the numbers are very close together, matplotlib describes the range using an offset. That offset is placed at the top of the axis. This can be suppressed with the useOffset=None parameter of the formatter.
In some cases with a logarithmic scale, there are very few major ticks. Then also some (but not all) minor ticks get a label. Also for these, the formatter could be changed. A problem can be that a simple ScalarFormatter will set too many labels. Either suppress all these minor labels using a NullFormatter or you'll need a very custom formatter that returns empty strings for the minor tick labels that need to be suppressed.
A simple example:
from matplotlib import pyplot as plt
from matplotlib import ticker
import numpy as np
N = 50
Cd_rank = np.linspace(0, 100, N)
Cd_raw = np.random.normal(1, 20, N).cumsum() + 100
plt.plot(Cd_rank, Cd_raw, linewidth=4)
plt.plot([0, 1], [0.3, 0.3], linewidth=4)
plt.plot([0, 1], [5, 5], linewidth=4)
plt.yscale('log')
plt.gca().yaxis.set_major_formatter(ticker.ScalarFormatter())
plt.gca().yaxis.set_minor_formatter(ticker.NullFormatter())
plt.show()
And here is a more complicated example, with both minor (green) and major (red) ticks.
from matplotlib import pyplot as plt
from matplotlib import ticker
import numpy as np
N = 50
Cd_rank = np.linspace(0, 100, N)
Cd_raw = np.random.normal(10, 5, N).cumsum() + 80
plt.plot(Cd_rank, Cd_raw, linewidth=4)
plt.yscale('log')
mticker = ticker.ScalarFormatter(useOffset=False)
mticker.set_powerlimits((-6, 6))
ax = plt.gca()
ax.yaxis.set_major_formatter(mticker)
ax.yaxis.set_minor_formatter(mticker)
ax.tick_params(axis='y', which='major', colors='crimson')
ax.tick_params(axis='y', which='minor', colors='seagreen')
plt.show()
PS: When the ticks involve both powers of 10 larger than 1 and smaller than 1 (so, e.g. 100, 10, 1, 0.1, 0.01) the ScalarFormatter doesn't display the numbers smaller than 1 well (it displays 0.1 and 0.01 as 0). In that case, the StrMethodFormatter can be used instead:
plt.gca().yaxis.set_major_formatter(ticker.StrMethodFormatter("{x}"))
Here is code that turns off scientific notation and handles numbers that are smaller than 1 correctly. Thanks to #Johanc for this code.
from matplotlib import pyplot as plt
from matplotlib import ticker
import numpy as np
N = 50
x = np.linspace(0,1,N)
y = np.logspace(-3, 2, N)
plt.plot(x, y, linewidth=4)
plt.yscale('log')
plt.ylim(bottom=0.001,top=100)
plt.gca().yaxis.set_major_formatter(ticker.ScalarFormatter())
plt.gca().yaxis.set_major_formatter(ticker.StrMethodFormatter("{x}"))
plt.show()```

Numpy FFT give unexpected result when signal frequency falls exactly on a fft bin

When a signal's frequency falls exactly on an FFT bin, the amplitude becomes 0!
But if I offset the signal frequency a little bit off, the result is ok.
Reproducing code:
Here the signal's frequency is 30
import numpy as np
import matplotlib.pyplot as plt
N = 1024
Freq = 30
t = np.arange(N)
x = np.sin(2*np.pi*Freq/N*t)
f = np.fft.fft(x)
plt.plot(t, x)
plt.plot(t, f)
I would expect the output to have a huge spike in the 30th bin, but it's flat, as in the following figure.
However, if just slightly change the frequency to 30.1 to not let it fall on the exact bin,
import numpy as np
import matplotlib.pyplot as plt
N = 1024
Freq = 30.1
t = np.arange(N)
x = np.sin(2*np.pi*Freq/N*t)
f = np.fft.fft(x)
plt.plot(t, x)
plt.plot(t, f)
The result is correct as in the following figure:
WHY? Is this a numpy FFT implementation issue? Or is it a limitation of the standard FFT algorithm?
To get the power spectrum, you need to take the magnitude of the Fourier coefficient. Plotting the Fourier coefficient directly discards the imaginary component and only plots the real component.
Technically x and f shouldn't be plotted on the same x-axis, since they have different meanings.
import numpy as np
import matplotlib.pyplot as plt
T = 1 # Total signal duration (s)
N = 1024 # samples over signal duration
Freq = 30 # frequency: (Hz)
t = np.arange(N)/N*T # time array
df = 1.0/T # resolution of angular frequency
f = np.arange(N)*df
x = np.sin(2*np.pi*Freq*t)
xhat = np.fft.fft(x) # Fourier series of x
plt.plot(t, x)
plt.xlabel("t (s)")
plt.ylabel("x")
plt.savefig("fig1.png")
plt.cla()
plt.plot(f, np.abs(xhat))
plt.xlabel("f (Hz)")
plt.ylabel("|fft(x)|")
plt.savefig("fig2.png")
f is complex number, I should be using abs(f) for plotting.
It had slipped my mind :P

Fitting partial Gaussian

I'm trying to fit a sum of gaussians using scikit-learn because the scikit-learn GaussianMixture seems much more robust than using curve_fit.
Problem: It doesn't do a great job in fitting a truncated part of even a single gaussian peak:
from sklearn import mixture
import matplotlib.pyplot
import matplotlib.mlab
import numpy as np
clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
data = np.random.randn(10000)
data = [[x] for x in data]
clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = matplotlib.pyplot.hist(data, range=(rangeMin, rangeMax), normed=True);
plt.plot(np.linspace(rangeMin, rangeMax),
mlab.normpdf(np.linspace(rangeMin, rangeMax),
clf.means_, np.sqrt(clf.covariances_[0]))[0])
gives
now changing data = [[x] for x in data] to data = [[x] for x in data if x <0] in order to truncate the distribution returns
Any ideas how to get the truncation fitted properly?
Note: The distribution isn't necessarily truncated in the middle, there could be anything between 50% and 100% of the full distribution left.
I would also be happy if anyone can point me to alternative packages. I've only tried curve_fit but couldn't get it to do anything useful as soon as more than two peaks are involved.
A bit brutish, but simple solution would be to split the curve in two halfs (data = [[x] for x in data if x < 0]), mirror the left part (data.append([-data[d][0]])) and then do the regular Gaussian fit.
import numpy as np
from sklearn import mixture
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
np.random.seed(seed=42)
n = 10000
clf = mixture.GaussianMixture(n_components=1, covariance_type='full')
#split the data and mirror it
data = np.random.randn(n)
data = [[x] for x in data if x < 0]
n = len(data)
for d in range(n):
data.append([-data[d][0]])
clf.fit(data)
data = [item for sublist in data for item in sublist]
rangeMin = int(np.floor(np.min(data)))
rangeMax = int(np.ceil(np.max(data)))
h = plt.hist(data[0:n], bins=20, range=(rangeMin, rangeMax), normed=True);
plt.plot(np.linspace(rangeMin, rangeMax),
mlab.normpdf(np.linspace(rangeMin, rangeMax),
clf.means_, np.sqrt(clf.covariances_[0]))[0] * 2)
plt.show()
lhcgeneva the problem is once you have data that doesn't include the maximum of the curve more and more possible Gaussians can fit:
Black point represent the data, red points the fitted Gaussian
In the figure, black points represent the data to fit a curve, the red points the fitted results. This result was achieved by using A Simple Algorithm for Fitting a Gaussian Function

inverse of FFT not the same as original function

I don't understand why the ifft(fft(myFunction)) is not the same as my function. It seems to be the same shape but a factor of 2 out (ignoring the constant y-offset). All the documentation I can see says there is some normalisation that fft doesn't do, but that ifft should take care of that. Here's some example code below - you can see where I've bodged the factor of 2 to give me the right answer. Thanks for any help - its driving me nuts.
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
# get FFT
myfft = fftp.fft(y, n)
# kill higher freqs above wavenumber wn
myfft[wn:] = 0
# make new series
y2 = fftp.ifft(myfft).real
# find constant y offset
myfft[1:]=0
c = fftp.ifft(myfft)[0]
# remove c, apply factor of 2 and re apply c
y2 = (y2-c)*2 + c
plt.figure(num=None)
plt.plot(x, y, x, y2)
plt.show()
if __name__=='__main__':
x = np.array([float(i) for i in range(0,360)])
y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
fourier_series(x, y, 3, 360)
You're removing half the spectrum when you do myfft[wn:] = 0. The negative frequencies are those in the top half of the array and are required.
You have a second fudge to get your results which is taking the real part to find y2: y2 = fftp.ifft(myfft).real (fftp.ifft(myfft) has a non-negligible imaginary part due to the asymmetry in the spectrum).
Fix it with myfft[wn:-wn] = 0 instead of myfft[wn:] = 0, and remove the fudges. So the fixed code looks something like:
import numpy as np
import scipy.fftpack as fftp
import matplotlib.pyplot as plt
def fourier_series(x, y, wn, n=None):
# get FFT
myfft = fftp.fft(y, n)
# kill higher freqs above wavenumber wn
myfft[wn:-wn] = 0
# make new series
y2 = fftp.ifft(myfft)
plt.figure(num=None)
plt.plot(x, y, x, y2)
plt.show()
if __name__=='__main__':
x = np.array([float(i) for i in range(0,360)])
y = np.sin(2*np.pi/360*x) + np.sin(2*2*np.pi/360*x) + 5
fourier_series(x, y, 3, 360)
It's really worth paying attention to the interim arrays that you are creating when trying to do signal processing. Invariably, there are clues as to what is going wrong that should direct you to the problem. In this case, you taking the real part masked the problem and made your task more difficult.
Just to add another quick point: Sometimes taking the real part of the resultant array is exactly the correct thing to do. It's often the case that you end up with an imaginary part to the signal output which is just down to numerical errors in the input to the inverse FFT. Typically this manifests itself as very small imaginary values, so taking the real part is basically the same array.
You are killing the negative frequencies between 0 and -wn.
I think what you mean to do is to set myfft to 0 for all frequencies outside [-wn, wn].
Change the following line:
myfft[wn:] = 0
to:
myfft[wn:-wn] = 0