k-mean clustering - inertia only gets larger - k-means

I am trying to use the KMeans clustering from faiss on a human pose dataset of body joints. I have 16 body parts so a dimension of 32. The joints are scaled in a range between 0 and 1. My dataset consists of ~ 900.000 instances. As mentioned by faiss (faiss_FAQ):
As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points
Applying this to my problem I randomly select 50000 instances for training. As I want to check for a number of clusters k between 1 and 30.
Now to my "problem":
The inertia is increasing directly as the number of cluster increases (n_cluster on the x-axis):
I tried varying the number of iterations, the number of redos, verbose and spherical, but the results stay the same or get worse. I do not think that it is a problem of my implementation; I tested it on a small example with 2D data and very clear clusters and it worked.
Is it that the data is just bad clustered or is there another problem/mistake I have missed? Maybe the scaling of the values between 0 and 1? Should I try another approach?

I found my mistake. I had to increase the parameter max_points_per_centroid. As I have so many data points it sampled a sub-batch for the fit. For a larger number of clusters this sub-batch is larger. See FAQ of faiss:
max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled
The larger subbatch of course has a larger inertia as there are more points in total.

Related

Why the value to explained various ratio is too low in binary classification problem?

My input data $X$ has 100 samples and 1724 features. I want to classify my data into two classes. But explained various ratio value is too low like 0.05,0.04, No matter how many principal components I fix while performing PCA. I have seen in the literature that usually explained variance ratio is close to 1. I'm unable to understand my mistake. I tried to perform PCA by reducing number of features and increasing number of samples. But it doesn't make any significant effect on my result.
`pca = PCA(n_components=10).fit(X)
Xtrain = pca.transform(X)
explained_ratio=pca.explained_variance_ratio_
EX=explained_ratio
fig,ax1=plt.subplots(ncols=1,nrows=1)
ax1.plot(np.arange(len(EX)),EX,marker='o',color='blue')
ax1.set_ylabel(r"$\lambda$")
ax1.set_xlabel("l")

Does deeper LSTM need more units?

I'm applying LSTM on time series forecasting with 20 lags. Suppose that we have two cases. The first one just using five lags and the second one (like my case) is using 20 lags. Is it correct that for the second case we need more units compared to the former one? If yes, how can we support this idea? I have 2000 samples for training the model, so this is the main limitation for increasing number of units here.
It is very difficult to give an exact answer as the relationship between timesteps and number of hidden units is not an exact science. For example, following factors can affect the number of units required.
Short term memory problem vs long-term memory problem
If your problem can be solved with relatively less memory (i.e. requires to remember only a few time steps) you wouldn't get much benefit from adding more neurons while increasing the number of steps.
The amount of data
If you don't have enough data for the model to learn from (which I feel like you will run into with 2000 data points - but I could be wrong), then increasing the number of timesteps won't help you much.
The type of model you use
Depending on the type of model you use (e.g. LSTM / GRU ) you might get different results (this is not always true but can happen for certain problems)
I'm sure there are other factors out there, but these are few that came to my mind.
Proving more units give better results while having more time steps (if true)
That should be relatively easy as you can try few different options,
5 lags with 10 / 20 / 50 hidden units
20 lags with 10 / 20 / 50 hidden units
And if you get better performance (e.g. lower MSE) with 20 lags problem than 5 lags problem (when you use 50 units), then you have gotten your point across. And you can reinforce your claims by showing results with different types of models (e.g. LSTMs vs GRUs).

How to speed up the rjags model training in Bayesian ranking?

All,
I am doing Bayesian modeling using rjags. However, when the number of observation is larger than 1000. The graph size is too big.
More specifically, I am doing a Bayesian ranking problem. Traditionally, one observation means one X[i, 1:N]-Y[i] pair, where X[i, 1:N] means the i-th item is represented by a N-size predictor vector, and Y[i] is a response. The objective is to minimize the point-wise error of predicted values,for example, least square error.
A ranking problem is different. Since we more care about the order, we use a pair-wise 1-0 indicator to represent the order between Y[i] and Y[j], for example, when Y[i]>Y[j], I(i,j)=1; otherwise I(i,j)=0. We treat this 1-0 indicator as an observation. Therefore, assuming we have K items: Y[1:K], the number of indicator is 0.5*K*(K-1). Hence when K is increased from 500 to 5000, the number of observations is very large, i.e. from 500^2 to 5000^2. The garph size of the rjags model is large too, for example graph size > 500,000. And the log-posterior will be very small.
And it takes a long time to complete the training. I think the consumed time is >40 hours. It is not practical for me to do further experiment. Therefore, do you have any idea to speed up the rjags. I heard that the RStan is faster than Rjags. Any one who has similar experience?

k-means empty cluster

I try to implement k-means as a homework assignment. My exercise sheet gives me following remark regarding empty centers:
During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.
That confuses me a bit, firstly Wikipedia or other sources I read do not mention that at all. I further read about a problem with 'choosing a good k for your data' - how is my algorithm supposed to converge if I start setting new centers for cluster that were empty.
If I ignore empty clusters I converge after 30-40 iterations. Is it wrong to ignore empty clusters?
Check out this example of how empty clusters can happen: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html
It basically means either 1) a random tremor in the force, or 2) the number of clusters k is wrong. You should iterate over a few different values for k and pick the best.
If during your iterating you should encounter an empty cluster, place a random data point into that cluster and carry on.
I hope this helped on your homework assignment last year.
Handling empty clusters is not part of the k-means algorithm but might result in better clusters quality. Talking about convergence, it is never exactly but only heuristically guaranteed and hence the criterion for convergence is extended by including a maximum number of iterations.
Regarding the strategy to tackle down this problem, I would say randomly assigning some data point to it is not very clever since we might be affecting the clusters quality since the distance to its currently assigned center is large or small. An heuristic for this case would be to choose the farthest point from the biggest cluster and move that the empty cluster, then do so until there are no empty clusters.
Statement: k-means can lead to
Consider above distribution of data points.
overlapping points mean that the distance between them is del. del tends to 0 meaning you can assume arbitary small enough value eg 0.01 for it.
dash box represents cluster assign
legend in footer represents numberline
N=6 points
k=3 clusters (coloured)
final clusters = 2
blue cluster is orphan and ends up empty.
Empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, you need to choose a replacement centroid otherwise SSE would be larger than neccessary.
*Choose the point that contributes most to SSE
*Choose a point from the cluster with the highest SSE
*If there are several empty clusters, the above can be repeated several times.
***SSE = Sum of Square Error.
Check this site https://chih-ling-hsu.github.io/2017/09/01/Clustering#
You should not ignore empty clusters but replace it. k-means is an algorithm could only provides you local minimums, and the empty clusters are the local minimums that you don't want.
your program is going to converge even if you replace a point with a random one. Remember that at the beginning of the algorithm, you choose the initial K points randomly. if it can converge, how come K-1 converge points with 1 random point can't? just a couple more iterations are needed.
"Choosing good k for your data" refers to the problem of choosing the right number of clusters. Since the k-means algorithm works with a predetermined number of cluster centers, their number has to be chosen at first. Choosing the wrong number could make it hard to divide the data points into clusters or the clusters could become small and meaningless.
I can't give you an answer on whether it is a bad idea to ignore empty clusters. If you do, you might end up with a smaller number of clusters than you defined at the beginning. This will confuse people who expect k-means to work in a certain way, but it is not necessarily a bad idea.
If you re-locate any empty cluster centers, your algorithm will probably converge anyway if that happens a limited number of times. However, you if you have to relocate too often, it might happen that your algorithm doesn't terminate.
For "Choosing good k for your data", Andrew Ng gives the example of a tee shirt manufacturer looking at potential customer measurements and doing k-means to decide if you want to offer S/M/L (k=3) or 2XS/XS/S/M/L/XL/2XL (k=7). Sometimes the decision is driven by the data (k=7 gives empty clusters) and sometimes by business considerations (manufacturing costs are less with only three sizes, or marketing says customers want more choices).
Set a variable to track the farthest distanced point and its cluster based on the distance measure used.
After the allocation step for all the points, check the number of datapoints in each cluster.
If any is 0, as is the case for this question, split the biggest cluster obtained and split further into 2 sub-clusters.
Replace the selected cluster with these sub-clusters.
I hope the issue is fixed now.. Random assignment will affect the clustering structure of the already obtained clustering.

VB FFT - stuck understanding relationship of results to frequency

Trying to understand an fft (Fast Fourier Transform) routine I'm using (stealing)(recycling)
Input is an array of 512 data points which are a sample waveform.
Test data is generated into this array. fft transforms this array into frequency domain.
Trying to understand relationship between freq, period, sample rate and position in fft array. I'll illustrate with examples:
========================================
Sample rate is 1000 samples/s.
Generate a set of samples at 10Hz.
Input array has peak values at arr(28), arr(128), arr(228) ...
period = 100 sample points
peak value in fft array is at index 6 (excluding a huge value at 0)
========================================
Sample rate is 8000 samples/s
Generate set of samples at 440Hz
Input array peak values include arr(7), arr(25), arr(43), arr(61) ...
period = 18 sample points
peak value in fft array is at index 29 (excluding a huge value at 0)
========================================
How do I relate the index of the peak in the fft array to frequency ?
If you ignore the imaginary part, the frequency distribution is linear across bins:
Frequency#i = (Sampling rate/2)*(i/Nbins).
So for your first example, assumming you had 256 bins, the largest bin corresponds to a frequency of 1000/2 * 6/256 = 11.7 Hz.
Since your input was 10Hz, I'd guess that bin 5 (9.7Hz) also had a big component.
To get better accuracy, you need to take more samples, to get smaller bins.
Your second example gives 8000/2*29/256 = 453Hz. Again, close, but you need more bins.
Your resolution here is only 4000/256 = 15.6Hz.
It would be helpful if you were to provide your sample dataset.
My guess would be that you have what are called sampling artifacts. The strong signal at DC ( frequency 0 ) suggests that this is the case.
You should always ensure that the average value in your input data is zero - find the average and subtract it from each sample point before invoking the fft is good practice.
Along the same lines, you have to be careful about the sampling window artifact. It is important that the first and last data point are close to zero because otherwise the "step" from outside to inside the sampling window has the effect of injecting a whole lot of energy at different frequencies.
The bottom line is that doing an fft analysis requires more care than simply recycling a fft routine found somewhere.
Here are the first 100 sample points of a 10Hz signal as described in the question, massaged to avoid sampling artifacts
> sinx[1:100]
[1] 0.000000e+00 6.279052e-02 1.253332e-01 1.873813e-01 2.486899e-01 3.090170e-01 3.681246e-01 4.257793e-01 4.817537e-01 5.358268e-01
[11] 5.877853e-01 6.374240e-01 6.845471e-01 7.289686e-01 7.705132e-01 8.090170e-01 8.443279e-01 8.763067e-01 9.048271e-01 9.297765e-01
[21] 9.510565e-01 9.685832e-01 9.822873e-01 9.921147e-01 9.980267e-01 1.000000e+00 9.980267e-01 9.921147e-01 9.822873e-01 9.685832e-01
[31] 9.510565e-01 9.297765e-01 9.048271e-01 8.763067e-01 8.443279e-01 8.090170e-01 7.705132e-01 7.289686e-01 6.845471e-01 6.374240e-01
[41] 5.877853e-01 5.358268e-01 4.817537e-01 4.257793e-01 3.681246e-01 3.090170e-01 2.486899e-01 1.873813e-01 1.253332e-01 6.279052e-02
[51] -2.542075e-15 -6.279052e-02 -1.253332e-01 -1.873813e-01 -2.486899e-01 -3.090170e-01 -3.681246e-01 -4.257793e-01 -4.817537e-01 -5.358268e-01
[61] -5.877853e-01 -6.374240e-01 -6.845471e-01 -7.289686e-01 -7.705132e-01 -8.090170e-01 -8.443279e-01 -8.763067e-01 -9.048271e-01 -9.297765e-01
[71] -9.510565e-01 -9.685832e-01 -9.822873e-01 -9.921147e-01 -9.980267e-01 -1.000000e+00 -9.980267e-01 -9.921147e-01 -9.822873e-01 -9.685832e-01
[81] -9.510565e-01 -9.297765e-01 -9.048271e-01 -8.763067e-01 -8.443279e-01 -8.090170e-01 -7.705132e-01 -7.289686e-01 -6.845471e-01 -6.374240e-01
[91] -5.877853e-01 -5.358268e-01 -4.817537e-01 -4.257793e-01 -3.681246e-01 -3.090170e-01 -2.486899e-01 -1.873813e-01 -1.253332e-01 -6.279052e-02
And here is the resulting absolute values of the fft frequency domain
[1] 7.160038e-13 1.008741e-01 2.080408e-01 3.291725e-01 4.753899e-01 6.653660e-01 9.352601e-01 1.368212e+00 2.211653e+00 4.691243e+00 5.001674e+02
[12] 5.293086e+00 2.742218e+00 1.891330e+00 1.462830e+00 1.203175e+00 1.028079e+00 9.014559e-01 8.052577e-01 7.294489e-01
I'm a little rusty too on math and signal processing but with the additional info I can give it a shot.
If you want to know the signal energy per bin you need the magnitude of the complex output. So just looking at the real output is not enough. Even when the input is only real numbers. For every bin the magnitude of the output is sqrt(real^2 + imag^2), just like pythagoras :-)
bins 0 to 449 are positive frequencies from 0 Hz to 500 Hz. bins 500 to 1000 are negative frequencies and should be the same as the positive for a real signal. If you process one buffer every second frequencies and array indices line up nicely. So the peak at index 6 corresponds with 6Hz so that's a bit strange. This might be because you're only looking at the real output data and the real and imaginary data combine to give an expected peak at index 10. The frequencies should map linearly to the bins.
The peaks at 0 indicates a DC offset.
It's been some time since I've done FFT's but here's what I remember
FFT usually takes complex numbers as input and output. So I'm not really sure how the real and imaginary part of the input and output map to the arrays.
I don't really understand what you're doing. In the first example you say you process sample buffers at 10Hz for a sample rate of 1000 Hz? So you should have 10 buffers per second with 100 samples each. I don't get how your input array can be at least 228 samples long.
Usually the first half of the output buffer are frequency bins from 0 frequency (=dc offset) to 1/2 sample rate. and the 2nd half are negative frequencies. if your input is only real data with 0 for the imaginary signal positive and negative frequencies are the same. The relationship of real/imaginary signal on the output contains phase information from your input signal.
The frequency for bin i is i * (samplerate / n), where n is the number of samples in the FFT's input window.
If you're handling audio, since pitch is proportional to log of frequency, the pitch resolution of the bins increases as the frequency does -- it's hard to resolve low frequency signals accurately. To do so you need to use larger FFT windows, which reduces time resolution. There is a tradeoff of frequency against time resolution for a given sample rate.
You mention a bin with a large value at 0 -- this is the bin with frequency 0, i.e. the DC component. If this is large, then presumably your values are generally positive. Bin n/2 (in your case 256) is the Nyquist frequency, half the sample rate, which is the highest frequency that can be resolved in the sampled signal at this rate.
If the signal is real, then bins n/2+1 to n-1 will contain the complex conjugates of bins n/2-1 to 1, respectively. The DC value only appears once.
The samples are, as others have said, equally spaced in the frequency domain (not logarithmic).
For example 1, you should get this:
alt text http://home.comcast.net/~kootsoop/images/SINE1.jpg
For the other example you should get
alt text http://home.comcast.net/~kootsoop/images/SINE2.jpg
So your answers both appear to be correct regarding the peak location.
What I'm not getting is the large DC component. Are you sure you are generating a sine wave as the input? Does the input go negative? For a sinewave, the DC should be close to zero provided you get enough cycles.
Another avenue is to craft a Goertzel's Algorithm of each note center frequency you are looking for.
Once you get one implementation of the algorithm working you can make it such that it takes parameters to set it's center frequency. With that you could easily run 88 of them or what ever you need in a collection and scan for the peak value.
The Goertzel Algorithm is basically a single bin FFT. Using this method you can place your bins logarithmically as musical notes naturally go.
Some pseudo code from Wikipedia:
s_prev = 0
s_prev2 = 0
coeff = 2*cos(2*PI*normalized_frequency);
for each sample, x[n],
s = x[n] + coeff*s_prev - s_prev2;
s_prev2 = s_prev;
s_prev = s;
end
power = s_prev2*s_prev2 + s_prev*s_prev - coeff*s_prev2*s_prev;
The two variables representing the previous two samples are maintained for the next iteration. This can be then used in a streaming application. I thinks perhaps the power calculation should be inside the loop as well. (However it is not depicted as such in the Wiki article.)
In the tone detection case there would be 88 different coeficients, 88 pairs of previous samples and would result in 88 power output samples indicating the relative level in that frequency bin.
WaveyDavey says that he's capturing sound from a mic, thru the audio hardware of his computer, BUT that his results are not zero-centered. This sounds like a problem with the hardware. It SHOULD BE zero-centered.
When the room is quiet, the stream of values coming from the sound API should be very close to 0 amplitude, with slight +- variations for ambient noise. If a vibratory sound is present in the room (e.g. a piano, a flute, a voice) the data stream should show a fundamentally sinusoidal-based wave that goes both positive and negative, and averages near zero. If this is not the case, the system has some funk going on!
-Rick