Elbow Method for GaussianMixture - numpy

I'd like to plot an elbow curve for a GMM to determine the optimal number of clusters. I'm using means_, assuming it represents the cluster centers for measuring distance, but I'm not getting a typical elbow plot. Any ideas?
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from scipy.spatial.distance import cdist
def elbow_report(X):
    meandist = []
    n_clusters = range(2,15)
    for n_cluster in n_clusters:
        gmm = GaussianMixture(n_components=n_cluster)
        gmm.fit(X)
        meandist.append(
            sum(
                np.min(
                    cdist(X, gmm.means_, 'mahalanobis', VI=gmm.precisions_),
                    axis=1
                ),
                X.shape[0]
            )
        )
    plt.plot(n_clusters,meandist,'bx-')
    plt.xlabel('Number of Clusters')
    plt.ylabel('Mean Mahalanobis Distance')
    plt.title('GMM Clustering for n_cluster=2 to 15')
    plt.show()

I played around with some test data and your function. Here are my findings and suggestions:
1. Minor bug
I believe there might be a little bug in your code. Change the , X.shape[0] to / X.shape[0] in the function to compute the mean distance. In particular,
meandist.append(
    sum(
        np.min(
            cdist(X, gmm.means_, 'mahalanobis', VI=gmm.precisions_),
            axis=1
        ) / X.shape[0]
    )
)
When creating test data, e.g.
import numpy as np
import random
from matplotlib import pyplot as plt
means = [[-5,-5,-5], [6,6,6], [0,0,0]]
sigmas = [0.4, 0.4, 0.4]
sizes = [500, 500, 500]
L = [np.random.multivariate_normal(mean=np.array(loc), cov=scale*np.eye(len(loc)), size=size).tolist() for loc,scale,size in zip(means,sigmas, sizes)]
L = [x for l in L for x in l]
random.shuffle(L)
# design matrix
X = np.array(L)
elbow_report(X)
the output looks somewhat reasonable.
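A side note on the distance itself: cdist takes a single VI matrix, whereas GaussianMixture with the default covariance_type='full' stores one precision matrix per component in precisions_, with shape (n_components, n_features, n_features). A minimal sketch of a per-component Mahalanobis distance, assuming that layout, could look like the following. The helper name nearest_mahalanobis and the toy arrays are made up for illustration and are not part of scikit-learn or SciPy.

```python
import numpy as np


def nearest_mahalanobis(X, means, precisions):
    """Distance from each row of X to its closest component.

    X:          (n_samples, n_features)
    means:      (n_components, n_features)
    precisions: (n_components, n_features, n_features), one inverse
                covariance per component (assumed layout, matching
                covariance_type='full').
    """
    # diff[n, k] is the offset of sample n from the mean of component k.
    diff = X[:, None, :] - means[None, :, :]
    # Squared distance diff^T P_k diff, evaluated for every (n, k) pair.
    d2 = np.einsum('nkd,kde,nke->nk', diff, precisions, diff)
    # Clip tiny negative round-off before taking the square root.
    return np.sqrt(np.clip(d2, 0.0, None)).min(axis=1)


# Toy check with identity precisions, where Mahalanobis equals Euclidean.
X_demo = np.array([[0.0, 0.0], [3.0, 4.0]])
means_demo = np.array([[0.0, 0.0], [10.0, 10.0]])
precisions_demo = np.stack([np.eye(2), np.eye(2)])
demo_dist = nearest_mahalanobis(X_demo, means_demo, precisions_demo)
```

With a fitted model, the call would be nearest_mahalanobis(X, gmm.means_, gmm.precisions_), and the mean of the result would take the place of the cdist-based expression.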
2. y-axis in log-scale
Sometimes a bad fit for one particular n_cluster value can throw off the entire plot, particularly when the metric is the sum rather than the mean of the distances. Adding plt.yscale("log") to the plot can tame such outliers and make the plot easier to read.
3. Optimization instability during fitting
Note that you compute the in-sample error, since gmm is fitted on the same data X on which the metric is subsequently evaluated. Leaving aside stability issues in the optimization underlying the fit, the more clusters there are, the better the fit should be (and, in turn, the lower the errors/distances). In the extreme, each data point gets its own cluster center, and the average distance should be close to 0. I assume this is what you want to observe for the elbow.
However, the lower effective sample size per cluster makes the optimization unstable. So rather than seeing an exponential decay toward 0, you see occasional spikes even far along the x-axis. I cannot judge how severe this issue is in your case, as you didn't provide sample sizes. In general, it becomes an issue when the sample size of the data is of the same order of magnitude as n_clusters and/or the intra-class/inter-class heterogeneity is large.
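One way to see the in-sample effect is to score each fitted model on data it has not seen. The sketch below is an illustration under stated assumptions rather than the method used above: it swaps the Mahalanobis metric for GaussianMixture.score (the average per-sample log-likelihood), and the helper name heldout_log_likelihood, the blob parameters, and the 600/300 split are all made up for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def heldout_log_likelihood(X_train, X_test, component_range, seed=0):
    """Fit one GMM per component count and score it on unseen data."""
    scores = {}
    for k in component_range:
        gmm = GaussianMixture(n_components=k, n_init=3, random_state=seed)
        gmm.fit(X_train)
        # score() returns the average per-sample log-likelihood.
        scores[k] = gmm.score(X_test)
    return scores


# Synthetic data: three well-separated isotropic blobs (illustrative only).
rng = np.random.default_rng(0)
centers = np.array([[-5.0, -5.0, -5.0], [6.0, 6.0, 6.0], [0.0, 0.0, 0.0]])
X_all = np.vstack([c + rng.normal(scale=0.6, size=(300, 3)) for c in centers])
rng.shuffle(X_all)  # shuffle rows so the split mixes all blobs
X_train, X_test = X_all[:600], X_all[600:]

heldout = heldout_log_likelihood(X_train, X_test, range(1, 7))
```

On data like this, the held-out score should jump sharply up to the true number of blobs and change little afterwards, instead of improving indefinitely as the in-sample metric tends to.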
4. Simulated vs. real data
This brings us to the final (catch-all) point. I'd suggest checking the plot on simulated data to get a feeling when things break. The simulated data above (multivariate Gaussian, isotropic noise, etc.) fits the assumptions to a T. However, some plots still look wonky (even when the sample size is moderately high and volatility somewhat low). Unfortunately, textbook-like plots are hard to come by on real data. As my former statistics professor put it: "real-world data is dirty." In turn, the plots will be, too.

Related

Difficulty fitting with Gaussian distribution

I am given two (long) finite sequences (i.e. numpy arrays) x and y of the same length. Their graph is given here: [image]
Array x uses the x-axis and is monotonically increasing. My goal is to fit the graph with a Gaussian such that the "major peak" is preserved, which looks something like this: [image]
Here is a part of my code:
import numpy as np
import matplotlib.pyplot as plt
from astropy import modeling
fitter = modeling.fitting.LevMarLSQFitter()
model = modeling.models.Gaussian1D(amplitude = np.max(y), mean = y[np.argmax(x)],stddev = 1) #(1)
fitted_model = fitter(model, x, y)
plt.plot(x,fitted_model(x),linewidth=0.7, color = 'black')
plt.plot(x,y,linewidth=0.1, color = 'black')
plt.savefig('result.png', dpi = 1200)
My code results in the following: [image]
It remains the same if I change the standard deviation in line (1). I figure I must have made some mistake in line (1), but I have no idea why it is not working. If this is not possible in astropy, are there any workarounds?
Update:
As noted in the comments, I think a Gaussian may not be the best distribution. I think I am actually looking for something similar to a perfusion curve. (In the picture, AUC means "area under curve for infinite time" and "mTT" means "mean transit time".) The equation in the picture is not precise. The goal is to make sure the peak is best fitted. The curve does not need to follow the original data very closely as x gets close to 0 or infinity. It only needs to maintain smoothness and to roughly go down to zero (as a Gaussian does). I need hints on what kind of function may best satisfy such a demand.
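One thing worth checking in line (1) is the initial guess for mean: y[np.argmax(x)] is the y value at the largest x, whereas the location of the tallest sample is x[np.argmax(y)]. The sketch below shows that seeding on synthetic data. It is an assumption-laden illustration: it swaps astropy for scipy.optimize.curve_fit, and the helper names gaussian and fit_major_peak, the half-maximum width guess, and the synthetic signal are all made up for the example.

```python
import numpy as np
from scipy.optimize import curve_fit


def gaussian(x, amplitude, mean, stddev):
    return amplitude * np.exp(-0.5 * ((x - mean) / stddev) ** 2)


def fit_major_peak(x, y):
    """Fit a single Gaussian, seeding the optimizer at the tallest sample."""
    peak = np.argmax(y)
    # Rough width guess from the x-span where y stays above half its maximum
    # (FWHM is about 2.355 * stddev); fall back to the mean sample spacing.
    above_half = x[y >= 0.5 * y[peak]]
    width_guess = max((above_half[-1] - above_half[0]) / 2.355,
                      np.diff(x).mean())
    p0 = [y[peak], x[peak], width_guess]
    params, _ = curve_fit(gaussian, x, y, p0=p0)
    return params  # amplitude, mean, stddev


# Synthetic stand-in for the question's data: one peak plus small noise.
rng = np.random.default_rng(0)
x_demo = np.linspace(0.0, 100.0, 2000)
y_demo = gaussian(x_demo, 5.0, 30.0, 4.0) + rng.normal(0.0, 0.05, x_demo.size)
fitted = fit_major_peak(x_demo, y_demo)
```

The same starting values (amplitude from y.max(), mean from x[np.argmax(y)], and a width on the scale of the peak) can be passed to any least-squares fitter.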

Linear regression graph interpretation

I have a histogram showing frequency of some data.
I have two types of files: Pdbs and Uniprots. Each Uniprot file is associated with a certain number of Pdbs. So this histogram shows how many Uniprot files are associated with 0 Pdb files, 1 Pdb file, 2 Pdb files ... 80 Pdb files.
Y-axis is in a log scale.
I did a regression on the same dataset and this is the result.
Here is the code I'm using for the regression graph:
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
x = np.array(x).reshape((-1, 1))
y = np.array(y)
regressor.fit(x, y)
# Predicting the Test set results
y = regressor.predict(x)
# Visualizing the Training set results
plt.scatter(x, y, color = 'red')
plt.plot(x, regressor.predict(x), color = 'blue')
plt.title('Uniprot vs Pdb')
plt.xlabel('Pdbs')
plt.ylabel('Uniprot')
plt.savefig('regression_test.png')
plt.show()
Can you help me interpret the regression graph?
I can understand that as the number of Pdbs increases, there will be fewer Uniprots associated with them.
But why is it going negative on the y-axis? Is this normal?
The correct way to interpret this linear regression is "this linear regression is 90% meaningless." In fact, some of that 90% is worse than meaningless, it's downright misleading, as you have pointed out with the negative y values. OTOH, there is about 10% of it that we can interpret to good effect, but you have to know what you're looking for.
The Why: Amongst other often less apparent things, one of the assumptions of a linear regression model is that the data are more-or-less linear. If your data aren't linear with some very regular "noise" added in, then all bets are off. Your data aren't linear. They're not even close. So all bets are off.
Since all bets are off, it is helpful to examine the sort of things that we might have otherwise wanted to do with a linear regression model. The hardest thing is extrapolation, which is predicting y outside of the original x range. Your model's abilities at extrapolation are pretty well illustrated by its behavior at the endpoints. This is where you noticed "hey, my graph is all negative!". This is, in a very simplistic sense, because you took a linear model, fit it to data that did not satisfy the "linear" assumption, and then tried to make it do the hardest thing for a model to do. The second hardest thing for a model to do is interpolation which is making predictions inside the original x range. This linear regression isn't very good at that either. Further down the list is, if we simply look at the slope of the linear regression line, we can get a general idea of whether our data are increasing or decreasing. Note that even this bet is off if your data aren't linear. However, it generally works out in a not-entirely-useless sort of way for large classes of even non-linear real-world data. So, this one thing, your linear regression model gets kind of right. Your data are decreasing, and the linear model is also decreasing. That's the 10% I spoke of previously.
What to do: Try to fit a better model. You say that you log-transformed your original data, but it doesn't look like that helped much. In general, the whole point of "transforming" data is to make it look linear. The log transform is helpful for exponential data. If your starting data didn't look exponential-like, then the log transform probably isn't going to help. Since you are trying to do density estimation, you almost certainly want to fit a probability distribution to this stuff, for which you don't even need to do a transform to make the data linear. Here is another Stack Overflow answer with details about how to fit a beta distribution to data. However, there are many options.
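To make the point about linearity and transforms concrete, here is a small sketch on made-up counts that decay exactly exponentially. It is not the question's data, and it uses np.polyfit in place of LinearRegression to stay self-contained. A straight line fitted to the raw counts dips below zero, while a line fitted to the log of the counts and mapped back with exp() cannot.

```python
import numpy as np

# Hypothetical counts that decay exponentially with x (illustrative only).
x = np.arange(0, 81)
counts = 1000.0 * np.exp(-0.1 * x)

# Straight line fitted to the raw counts.
slope, intercept = np.polyfit(x, counts, 1)
linear_pred = slope * x + intercept

# Straight line fitted to log(counts), then mapped back with exp().
log_slope, log_intercept = np.polyfit(x, np.log(counts), 1)
exp_pred = np.exp(log_slope * x + log_intercept)

print(linear_pred.min() < 0)   # does the raw-scale line dip below zero?
print(exp_pred.min() > 0)      # does the back-transformed fit stay positive?
```

Real count data with zeros would need handling before taking the log, and data that is not exponential-like will not be straightened by this transform.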
Can you help me interpret the regression graph?
Linear regression tries to build a line between the x-variables and a target y-variable that approximates the 'real' values as closely as possible (you can also find the graph here: https://en.wikipedia.org/wiki/Linear_regression):
The line here is the blue line, and the original points are the black dots. The goal is to minimize the error (the distance from each black dot to the blue line) over all black dots.
The regression line is the blue line. That means you can describe the Uniprot count with a linear equation y = m*x + b, with constant values such as m=0.1 and b=0.2 (examples), where x is the number of Pdbs.
I can understand that as the number of Pdbs increases, there will be less Uniprots associated with them. But why is it going negative on the y-axis?
This is normal. You could extend this line to -10000000 Pdbs or wherever; it is just an equation, not a real line.
But there is one mistake in your plot: you need to plot the original points as well, don't you?
y = regressor.predict(x)
plt.scatter(x, y, color = 'red')
This is wrong. You should plot the original values instead to get a plot like my graphic, something like:
y = df['Uniprot']
plt.scatter(x, y, color = 'red')
should help to understand it.

Power spectrum incorrectly yielding negative values

I have a real signal in time given by:
And I am simply trying to compute its power spectrum, which is the Fourier transform of the autocorrelation of the signal, and is also a purely real and positive quantity in this case. To do this, I simply write:
import numpy as np
from scipy.fftpack import fft, arange, rfftfreq, rfft
from pylab import *
lags1, c1, line1, b1 = acorr(((Y_DATA)), usevlines=False, normed=True, maxlags=3998, lw=2)
Power_spectrum = (fft(np.real(c1)))
freqs = np.fft.fftfreq(len(c1), dx)
plt.plot(freqs,Power_spectrum)
plt.xlabel('f (Hz)')
plt.xlim([-20000,20000])
plt.show()
But the output gives:
which has negative-valued output. If I simply take the absolute value of the data on the y-axis and plot it (i.e. np.abs(Power_spectrum)), then the output is:
which is exactly what I expect. But why is this only fixed by taking the absolute value of my power spectrum? I checked my autocorrelation and plotted it; it seems to be working as expected and matches what others have computed.
What appears odd is the next step, when I take the FFT. The FFT function outputs negative values, which is contrary to the theory discussed in the link above, and I don't quite understand why. Any thoughts on what is going wrong?
The power spectrum is the FFT of the autocorrelation, but that's not an efficient way to calculate it.
The autocorrelation is probably calculated with an FFT and iFFT, anyway.
The power spectrum is also just the squared magnitude of the FFT coefficients.
Do that instead so that the total work will be one FFT instead of 3.
An FFT produces a complex result (real and imaginary components that represent both the magnitude and the phase of the spectrum). You have to take the (squared) magnitude of the complex vector to get the power spectrum.
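A short numerical sketch of both routes follows, on a random test signal rather than the question's data. It also illustrates one plausible contributor to the negative values, which is an assumption since Y_DATA is not available: an autocorrelation array stored with lag 0 in the middle (as acorr returns it, with lags running from -maxlags to maxlags) picks up a phase ramp under the FFT, so its real part changes sign. Moving lag 0 to index 0 with np.fft.ifftshift removes that ramp.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(256)
n_lags = 2 * signal.size - 1

# Full linear autocorrelation; lag 0 sits at the middle index.
autocorr = np.correlate(signal, signal, mode="full")

# FFT of the centred array: the half-length shift multiplies the spectrum
# by a phase ramp, so the real part takes both signs.
spectrum_centred = np.fft.fft(autocorr)

# Move lag 0 to index 0 first; the transform is then real and non-negative.
spectrum_shifted = np.fft.fft(np.fft.ifftshift(autocorr))

# Direct route: squared magnitude of the zero-padded FFT of the signal.
spectrum_direct = np.abs(np.fft.fft(signal, n=n_lags)) ** 2
```

Taking np.abs of the centred version gives the same magnitudes, which is consistent with the absolute value "fixing" the plot in the question.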

Bad result plotting windowing FFT

I'm playing with Python and SciPy to understand windowing. I made a plot to see how windowing behaves under the FFT, but the result is not what I was expecting.
The plot is:
The middle plots are the pure FFT plots; this is where I get weird things.
Then I changed the trig function to get leakage, setting the first 300 items of the array to a constant 1. The result:
The code:
import numpy as np
import matplotlib.pyplot as plt
from numpy import pi, sqrt
sign_freq=80
sample_freq=3000
num=np.linspace(0,1,num=sample_freq)
i=0
#wave data:
sin=np.sin(2*pi*num*sign_freq)+np.sin(2*pi*num*sign_freq*2)
while i<1000:
    sin[i]=1
    i=i+1
#wave fft:
fft_sin=np.fft.fft(sin)
fft_freq_axis=np.fft.fftfreq(len(num),d=1/sample_freq)
#wave Linear Spectrum (Rms)
lin_spec=sqrt(2)*np.abs(np.fft.rfft(sin))/len(num)
lin_spec_freq_axis=np.fft.rfftfreq(len(num),d=1/sample_freq)
#window data:
hann=np.hanning(len(num))
#window fft:
fft_hann=np.fft.fft(hann)
#window fft Linear Spectrum:
wlin_spec=sqrt(2)*np.abs(np.fft.rfft(hann))/len(num)
#window + sin
wsin=hann*sin
#window + sin fft:
wsin_spec=sqrt(2)*np.abs(np.fft.rfft(wsin))/len(num)
wsin_spec_freq_axis=np.fft.rfftfreq(len(num),d=1/sample_freq)
fig=plt.figure()
ax1 = fig.add_subplot(431)
ax2 = fig.add_subplot(432)
ax3 = fig.add_subplot(433)
ax4 = fig.add_subplot(434)
ax5 = fig.add_subplot(435)
ax6 = fig.add_subplot(436)
ax7 = fig.add_subplot(413)
ax8 = fig.add_subplot(414)
ax1.plot(num,sin,'r')
ax2.plot(fft_freq_axis,abs(fft_sin),'r')
ax3.plot(lin_spec_freq_axis,lin_spec,'r')
ax4.plot(num,hann,'b')
ax5.plot(fft_freq_axis,fft_hann)
ax6.plot(lin_spec_freq_axis,wlin_spec)
ax7.plot(num,wsin,'c')
ax8.plot(wsin_spec_freq_axis,wsin_spec)
plt.show()
EDIT: As asked in the comments, I plotted the functions in dB scale, obtaining much clearer plots. Thanks a lot, @SleuthEye!
It appears the plot which is problematic is the one generated by:
ax5.plot(fft_freq_axis,fft_hann)
resulting in the graph:
instead of the expected graph from Wikipedia.
There are a number of issues with the way the plot is constructed. The first is that this command essentially attempts to plot a complex-valued array (fft_hann). You may in fact be getting the warning ComplexWarning: Casting complex values to real discards the imaginary part as a result. To generate a graph which looks like the one from Wikipedia, you would have to take the magnitude (instead of the real part) with:
ax5.plot(fft_freq_axis,abs(fft_hann))
Then we notice that there is still a line striking through our plot. Looking at np.fft.fft's documentation:
The values in the result follow so-called “standard” order: If A = fft(a, n), then A[0] contains the zero-frequency term (the sum of the signal), which is always purely real for real inputs. Then A[1:n/2] contains the positive-frequency terms, and A[n/2+1:] contains the negative-frequency terms, in order of decreasingly negative frequency.
[...]
The routine np.fft.fftfreq(n) returns an array giving the frequencies of corresponding elements in the output.
Indeed, if we print the fft_freq_axis we can see that the result is:
[ 0. 1. 2. ..., -3. -2. -1.]
To get around this problem we simply need to swap the lower and upper parts of the arrays with np.fft.fftshift:
ax5.plot(np.fft.fftshift(fft_freq_axis),np.fft.fftshift(abs(fft_hann)))
Then you should note that the graph on Wikipedia is actually shown with amplitudes in decibels. You would then need to do the same with:
ax5.plot(np.fft.fftshift(fft_freq_axis),np.fft.fftshift(20*np.log10(abs(fft_hann))))
We should then be getting closer, but the result is not quite the same as can be seen from the following figure:
This is due to the fact that the plot on Wikipedia actually has a higher frequency resolution and captures the value of the frequency spectrum as it oscillates, whereas your plot samples the spectrum at fewer points, and a lot of those points have near-zero amplitudes. To resolve this problem, we need to get the frequency spectrum of the window at more frequency points.
This can be done by zero padding the input to the FFT, or more simply setting the parameter n (desired length of the output) to a value much larger than the input size:
N = 8*len(num)
fft_freq_axis=np.fft.fftfreq(N,d=1/sample_freq)
fft_hann=np.fft.fft(hann, N)
ax5.plot(np.fft.fftshift(fft_freq_axis),np.fft.fftshift(20*np.log10(abs(fft_hann))))
ax5.set_xlim([-40, 40])
ax5.set_ylim([-50, 80])
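The steps above can be collected into one self-contained sketch that leaves out the plotting. It also checks that zero-padding only adds points between the existing frequency samples: every 8th bin of the padded FFT equals the corresponding bin of the unpadded one. The variable names and the 1e-12 floor are choices made for this example.

```python
import numpy as np

sample_freq = 3000            # Hz, as in the question
hann = np.hanning(sample_freq)
pad_factor = 8
n_fft = pad_factor * hann.size   # zero-pad for a denser frequency grid

fft_plain = np.fft.fft(hann)
fft_padded = np.fft.fft(hann, n_fft)

# Reorder both axes so frequencies run from negative to positive.
freqs = np.fft.fftshift(np.fft.fftfreq(n_fft, d=1 / sample_freq))
magnitude = np.abs(np.fft.fftshift(fft_padded))

# A tiny floor avoids log10(0) at exact nulls of the window spectrum.
magnitude_db = 20 * np.log10(np.maximum(magnitude, 1e-12))

# Every pad_factor-th bin of the padded FFT matches the unpadded FFT.
same_samples = np.allclose(fft_padded[::pad_factor], fft_plain)
```

Plotting magnitude_db against freqs with the axis limits used above should give a curve with the main lobe centred at 0 Hz.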

Zoom in on np.fft2 result

Is there a way to choose the x/y output axes range of np.fft.fft2?
I have a piece of code computing the diffraction pattern of an aperture. The aperture is defined in a 2k x 2k pixel array. The diffraction pattern is basically the inner part of the 2D FT of the aperture. np.fft.fft2 gives me an output array the same size as the input, but with a preset range for the x/y axes. Of course I can zoom in using the image viewer, but by then I have already lost detail. What is the solution?
Thanks,
Gert
import numpy as np
import matplotlib.pyplot as plt
r= 500
s= 1000
y,x = np.ogrid[-s:s+1, -s:s+1]
mask = x*x + y*y <= r*r
aperture = np.ones((2*s+1, 2*s+1))
aperture[mask] = 0
plt.imshow(aperture)
plt.show()
ffta= np.fft.fft2(aperture)
plt.imshow(np.log(np.abs(np.fft.fftshift(ffta))**2))
plt.show()
Unfortunately, much of the speed and accuracy of the FFT come from the outputs being the same size as the input.
The conventional way to increase the apparent resolution in the output Fourier domain is by zero-padding the input: np.fft.fft2(aperture, [4 * (2*s+1), 4 * (2*s+1)]) tells the FFT to pad your input to be 4 * (2*s+1) pixels tall and wide, i.e., make the input four times larger (sixteen times the number of pixels).
Aside: I say "apparent" resolution because the actual amount of data you have hasn't increased, but the Fourier transform will appear smoother because zero-padding in the input domain causes the Fourier transform to interpolate the output. In the example above, any feature that could be seen with one pixel will be shown with four pixels. Just to make this fully concrete, this example shows that every fourth pixel of the zero-padded FFT is numerically the same as every pixel of the original unpadded FFT:
# Generate your `ffta` as above, then
N = 2 * s + 1
Up = 4
fftup = np.fft.fft2(aperture, [Up * N, Up * N])
relerr = lambda dirt, gold: np.abs((dirt - gold) / gold)
print(np.max(relerr(fftup[::Up, ::Up] , ffta))) # ~6e-12.
(That relerr is just a simple relative error, which you want to be close to machine precision, around 2e-16. The largest error between every 4th sample of the zero-padded FFT and the unpadded FFT is 6e-12 which is quite close to machine precision, meaning these two arrays are nearly numerically equivalent.)
Zero-padding is the most straightforward way around your problem. But it does cost you a lot of memory. And it is frustrating because you might only care about a tiny, tiny part of the transform. There's an algorithm called the chirp z-transform (CZT, or colloquially the "zoom FFT") which can do this. If your input is N (for you 2*s+1) and you want just M samples of the FFT's output evaluated anywhere, it will compute three Fourier transforms of size N + M - 1 to obtain the desired M samples of the output. This would solve your problem too, since you can ask for M samples in the region of interest, and it wouldn't require a prohibitive amount of memory, though it would need at least 3x more CPU time. The downside is that, when this answer was written, a solid implementation of CZT wasn't in Numpy/Scipy yet: see the scipy issue and the code it references. (SciPy 1.8 and later include scipy.signal.czt and scipy.signal.zoom_fft.) Matlab's CZT seems reliable, if that's an option; Octave-forge has one too and the Octave people usually try hard to match/exceed Matlab.
But if you have the memory, zero-padding the input is the way to go.
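If memory is tight and only a small patch of the transform is needed, a third option is to evaluate the 2D DFT directly at the wanted frequencies with two matrix products. To be clear, this is neither the chirp z-transform nor anything built into NumPy; it is a plain direct DFT, sketched here under the assumption that the region of interest is small, because its cost grows with the number of requested frequencies times the image size. The helper name zoom_dft2 and the reduced aperture size are made up for the example.

```python
import numpy as np


def zoom_dft2(image, freqs_y, freqs_x):
    """Evaluate the 2D DFT of `image` at chosen frequencies (cycles/sample).

    Direct matrix-product DFT, not the chirp z-transform.
    """
    n_y, n_x = image.shape
    kernel_y = np.exp(-2j * np.pi * np.outer(freqs_y, np.arange(n_y)))
    kernel_x = np.exp(-2j * np.pi * np.outer(freqs_x, np.arange(n_x)))
    return kernel_y @ image @ kernel_x.T


# Small version of the question's aperture so the check runs quickly.
s, r = 32, 16
y, x = np.ogrid[-s:s + 1, -s:s + 1]
aperture = np.ones((2 * s + 1, 2 * s + 1))
aperture[x * x + y * y <= r * r] = 0

N, Up, M = 2 * s + 1, 4, 20
# Same frequency spacing as a 4x zero-padded FFT, but only M samples per axis.
freqs = np.arange(M) / (Up * N)
zoomed = zoom_dft2(aperture, freqs, freqs)

# Reference: the corresponding corner of the zero-padded FFT.
padded = np.fft.fft2(aperture, [Up * N, Up * N])
```

Here zoomed should agree with padded[:M, :M] to round-off. Using np.arange(-M, M) / (Up * N) instead would give a patch centred on zero frequency, which corresponds to the inner part of the diffraction pattern.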