Proof that k-means always converges? - k-means

I understand k-means algorithms steps.
However I'm not sure if the algorithm will always converge? Or can the observations always switch from one centroid to another?

The algorithm always converges (by-definition) but not necessarily to global optimum.
The algorithm may switch from centroid to centroid but this is a parameter of the algorithm (precision, or delta). This is sometimes refered as "cycling". The algorithm after a while cycles through centroids. There are two solutions (which both can be used at the same time). Precision parameter, maximum number of iterations parameter.
Precision parameter, if centroids amount of change is less than a threshold delta, stop the algorithm.
Max Num Iterations, if algorithm reaches that number of iterations stop the algorithm.
Note that the above schemes do not spoil the convergence characteristics of the algorithm. it still will converge but not necessarily to global optimum (this is irrelevant of the scheme used, as in many optimisation algorithms).
You may be interested in the related question on stats.SE Cycling in k-means algorithm and a referenced proof of convergence

Related

Asynchrony loss function over an array of 1D signals

So I have an array of N 1D-signals (e.g. time series) with same number of samples per signal (all in equal resolution) and I want to define a differentiable loss function to penalize asynchrony among them and therefore be zero if all N 1D signals will be equal to each other. I've been searching the literature to find something but haven't had luck yet.
Few remarks:
1 - since N (number of signals) could be quite large I can not afford to calculate Mean squared loss between every single pair which could grow combinatorialy large. also I'm not quite sure whether it would be optimal in any mathematical sense for the goal to achieve.
There are two naive loss functions that I could think of :
a) Total variation loss for each time sample across all signals (to force to reach ideally zero variation). the problem is here the weight needs to be very large to yield zero varion. masking any other loss term that is going to be added and also there is no inherent order among the N signals, which doesnt make it suitable to TV loss to begin with.
b) minimizing the sum of variance at each time point among all signals. however, choice of the reference of variance (aka mean) could be crucial I believe as just using the sample mean might not really yield the desired result, not quite sure.

How Fast is Convolution Using FFT

I read that in order to compute the convolution of two signals x,y (1D for example), the naïve method takes O(NM).
However FFT is used to compute FFT^-1(FFT(x)FFT(y)), which takes O(N log(N)), in the case where N>M.
I wonder why is this complexity considered better than the former one, as M isn't necessarily bigger than log(N). Moreover, M is very often the length of a filter, which doesn't scale with the signal to be filtered, and will actually provide us with a complexity more similar to O(N) than to O(N^2).
Fast convolution in the frequency domain is typically more efficient than direct convolution when the size of the filter exceeds a particular threshold. So for relatively small filters direct convolution is more efficient, whereas for longer filters there comes a point at which FFT-based convolution is more efficient. The actual value of m for this "tipping point" depends on a lot of factors, but it's typically somewhere between 10 and 100.

Time complexity of a genetic algorithm for bin packing

I am trying to explore genetic algorithms (GA) for the bin packing problem, and compare it to classical Any-Fit algorithms. However the time complexity for GA is never mentioned in any of the scholarly articles. Is this because the time complexity is very high? and that the main goal of a GA is to find the best solution without considering the time? What is the time complexity of a basic GA?
Assuming that termination condition is number of iterations and it's constant then in general it would look something like that:
O(p * Cp * O(Crossover) * Mp * O(Mutation) * O(Fitness))
p - population size
Cp - crossover probability
Mp - mutation probability
As you can see it not only depends on parameters like eg. population size but also on implementation of crossover, mutation operations and fitness function implementation. In practice there would be more parameters like for example chromosome size etc.
You don't see much about time complexity in publications because researchers most of the time compare GA using convergence time.
Edit Convergence Time
Every GA has some kind of a termination condition and usually it's convergence criteria. Let's assume that we want to find the minimum of a mathematical function so our convergence criteria will be the function's value. In short we reach convergence during optimization when it's no longer worth it to continue optimization because our best individual doesn't get better significantly. Take a look at this chart:
You can see that after around 10000 iterations fitness doesn't improve much and the line is getting flat. Best case scenario reaches convergence at around 9500 iterations, after that point we don't observe any improvement or it's insignificantly small. Assuming that each line shows different GA then Best case has the best convergence time becuase it reaches convergence criteria first.

Numerical Accuracy: to scale or not?

I am working on a n-body gravitational simulator that takes input and produces output in metric MKS units. This involves dealing with some very large numbers (like solar masses expressed in kilograms, semimajor axes of planetary orbits expressed in meters, and timescales of years expressed in seconds), which get multiplied by some very small numbers (notably, the gravitational constant, which is 6.67384e-11 in MKS units), and also the occasional very small number getting added to or subtracted from a very large number (mainly when summing up pairwise accelerations), which gets me concerned about the effects of rounding errors.
I've already taken the step of replacing all masses m by Gm (premultiplying by the gravitational constant), which significantly reduces the total number of multiplies, and makes the mass numbers much smaller, and that seems to have had a positive effect on both efficiency and accuracy, as judged by how well the simulator conserves energy.
I am wondering, however: is potentially it worth trying to do some internal re-scaling into different units to further minimize floating point errors? And if so, what kind of range (for double-precision floats) should I be trying to get my numbers centered on for maximum accuracy?
In general if you want precise results in physical based rendering you don't want to use floats or doubles since they have massive rounding problems and thus introduce errors in your simulation.
If you need or want to stick with floats/double you probably should rescale around zero. The reason is that often floating point representations have a higher "density" of values around this point and tend to have fewer on the min/max sides. Image example from google
I would suggest that you change all values to integer based number variables. This erases rounding errors (over/underflow can still happen!) and speeds up the calculation process by an order of magnitude because normal CPUs work faster with integer operations. In case of GPU its basically the same but thats another story all by its own...
But before you take such an effort to further improve your accuracy i would strongly advise an arbitrary precision number library. This may come with an performance loss but should be way easier and yield better results than a rescaling of your values.
Most of the numerical mathematicians come across this problem.
At first let me remind you that you can not deal with numbers (or phsycal values) smaller than the machine epsilon for each calculation. Unfortunately the epsilon depends around which number you are analyzing. You can try eps(a) for any value of a in MATLAB, as far as I remember eps(1.0)~=2.3e-16 and eps(0)~1e-298.
That's why in numerical methods you avoid calculations using very different scaled numbers. Because one is just an ignored (smaller than its epsilon) by the other value and rounding errors are inevitable.
But what else people do? If they encounter such physical problems, before coding, mathematicians analyse the problem theoritically, they make simplifications to use similarly scaled numbers.

How to analyse 'noisiness' of an array of points

Have done fft (see earlier posting if you are interested!) and got a result, which helps me. Would like to analyse the noisiness / spikiness of an array (actually a vb.nre collection of single). Um, how to explain ...
When signal is good, fft power results is 512 data points (frequency buckets) with low values in all but maybe 2 or 3 array entries, and a decent range (i.e. the peak is high, relative to the noise value in the nearly empty buckets. So when graphed, we have a nice big spike in the values in those few buckets.
When signal is poor/noisy, data values spread (max to min) is low, and there's proportionally higher noise in many more buckets.
What's a good, computationally non-intensive was of analysing the noisiness of this data set? Would some kind of statistical method, standard deviations or something help ?
The key is defining what is noise and what is signal, for which modelling assumptions must be made. Often an assumption is made of white noise (constant power per frequency band) or noise of some other power spectrum, and that model is fitted to the data. The signal to noise ratio can then be used to measure the amount of noise.
Fitting a noise model depends on the nature of your data: if you know that the real signal will have no power in the high frequency components, you can look there for an indication of the noise level, and use the model to predict what the noise will be at the lower frequency components where there is both signal and noise. Alternatively, if your signal is constant in time, taking multiple FFTs at different points in time and comparing them to get a standard deviation for each frequency band can give the level of noise present.
I hope I'm not patronising you to mention the issues inherent with windowing functions when performing FFTs: these can have the effect of introducing spurious "noise" into the frequency spectrum which is in fact an artifact of the periodic nature of the FFT. There's a tradeoff between getting sharp peaks and 'sideband' noise - more here www.ee.iitm.ac.in/~nitin/_media/ee462/fftwindows.pdf
Calculate a standard deviation and then you decide the threshold that will indicate noise. In practice this is usually easy and allows you to easily tweak the "noise level" as needed.
There is a nice single pass stddev algorithm in Knuth. Here is link that describes an implementation.
Standard Deviation
calculate the signal to noise ratio
http://en.wikipedia.org/wiki/Signal-to-noise_ratio
you could also check the stdev for each point and if it's under some level you choose then the signal is good else it's not.
wouldn't the spike be
treated as a noise glitch in SNR, an
outlier to be discarded, as it were?
If it's clear from the time-domain data that there are such spikes, then they will certainly create a lot of noise in the frequency spectrum. Chosing to ignore them is a good idea, but unfortunately the FFT can't accept data with 'holes' in it where the spikes have been removed. There are two techniques to get around this. The 'dirty trick' method is to set the outlier sample to be the average of the two samples on either site, and compute the FFT with a full set of data.
The harder but more-correct method is to use a Lomb Normalised Periodogram (see the book 'Numerical Recipes' by W.H.Press et al.), which does a similar job to the FFT but can cope with missing data properly.