k-means empty cluster - k-means

I try to implement k-means as a homework assignment. My exercise sheet gives me following remark regarding empty centers:
During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.
That confuses me a bit, firstly Wikipedia or other sources I read do not mention that at all. I further read about a problem with 'choosing a good k for your data' - how is my algorithm supposed to converge if I start setting new centers for cluster that were empty.
If I ignore empty clusters I converge after 30-40 iterations. Is it wrong to ignore empty clusters?

Check out this example of how empty clusters can happen: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html
It basically means either 1) a random tremor in the force, or 2) the number of clusters k is wrong. You should iterate over a few different values for k and pick the best.
If during your iterating you should encounter an empty cluster, place a random data point into that cluster and carry on.
I hope this helped on your homework assignment last year.

Handling empty clusters is not part of the k-means algorithm but might result in better clusters quality. Talking about convergence, it is never exactly but only heuristically guaranteed and hence the criterion for convergence is extended by including a maximum number of iterations.
Regarding the strategy to tackle down this problem, I would say randomly assigning some data point to it is not very clever since we might be affecting the clusters quality since the distance to its currently assigned center is large or small. An heuristic for this case would be to choose the farthest point from the biggest cluster and move that the empty cluster, then do so until there are no empty clusters.

Statement: k-means can lead to
Consider above distribution of data points.
overlapping points mean that the distance between them is del. del tends to 0 meaning you can assume arbitary small enough value eg 0.01 for it.
dash box represents cluster assign
legend in footer represents numberline
N=6 points
k=3 clusters (coloured)
final clusters = 2
blue cluster is orphan and ends up empty.

Empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, you need to choose a replacement centroid otherwise SSE would be larger than neccessary.
*Choose the point that contributes most to SSE
*Choose a point from the cluster with the highest SSE
*If there are several empty clusters, the above can be repeated several times.
***SSE = Sum of Square Error.
Check this site https://chih-ling-hsu.github.io/2017/09/01/Clustering#

You should not ignore empty clusters but replace it. k-means is an algorithm could only provides you local minimums, and the empty clusters are the local minimums that you don't want.
your program is going to converge even if you replace a point with a random one. Remember that at the beginning of the algorithm, you choose the initial K points randomly. if it can converge, how come K-1 converge points with 1 random point can't? just a couple more iterations are needed.

"Choosing good k for your data" refers to the problem of choosing the right number of clusters. Since the k-means algorithm works with a predetermined number of cluster centers, their number has to be chosen at first. Choosing the wrong number could make it hard to divide the data points into clusters or the clusters could become small and meaningless.
I can't give you an answer on whether it is a bad idea to ignore empty clusters. If you do, you might end up with a smaller number of clusters than you defined at the beginning. This will confuse people who expect k-means to work in a certain way, but it is not necessarily a bad idea.
If you re-locate any empty cluster centers, your algorithm will probably converge anyway if that happens a limited number of times. However, you if you have to relocate too often, it might happen that your algorithm doesn't terminate.

For "Choosing good k for your data", Andrew Ng gives the example of a tee shirt manufacturer looking at potential customer measurements and doing k-means to decide if you want to offer S/M/L (k=3) or 2XS/XS/S/M/L/XL/2XL (k=7). Sometimes the decision is driven by the data (k=7 gives empty clusters) and sometimes by business considerations (manufacturing costs are less with only three sizes, or marketing says customers want more choices).

Set a variable to track the farthest distanced point and its cluster based on the distance measure used.
After the allocation step for all the points, check the number of datapoints in each cluster.
If any is 0, as is the case for this question, split the biggest cluster obtained and split further into 2 sub-clusters.
Replace the selected cluster with these sub-clusters.
I hope the issue is fixed now.. Random assignment will affect the clustering structure of the already obtained clustering.

Related

Karger's Algorithm - Running Time - Edge Contraction

In Karger's Min-Cut Algorithm for undirected (possibly weighted) multigraphs, the main operation is to contract a randomly chosen edge and merge it's incident vertices into one metavertex. This process is repeated until two vertices remain. These vertices correspond to a cut. The algo can be implemented with an adjacency list.
Questions:
how can I find the particular edge, that has been chosen to be contracted?
how does an edge get contracted (in an unweighted and/or weighted graph)?
Why does this procedure take quadratic time?
Edit: I have found some information that the runtime can be quadratic, due to the fact that we have O(n-2) contractions of vertices and each contraction can take O(n) time. It would be great if somebody could explain me, why a contraction takes linear time in an adjacency list. Note a contraction consists of: deleting one adjacent edge, merging the two vertices into a supernode, and making sure that the remaining adjacent edges are connected to the supernode.
Pseudocode:
procedure contract(G=(V,E)):
while |V|>2
choose edge uniformly at random
contract its endpoints
delete self loops
return cut
I have read the related topic Karger Min cut algorithm running time, but it did not help me. Also I do not have so much experience, so a "laymens" term explanation would be very much appreciated!

K-means:Updaing centroids successively after each addition

Say we have five data points A,B,C,D,E and we are using K-means clustering algorithm to cluster them into two clusters.Can we update the centroids as follow:
Let's select first two i.e. A,B as centroids of initial clusters.
Then calculate the distance of C from A as well as from B.Say C is nearer to A.
Update the centroid of cluster with centroid A before the next step i.e. now new centroids are (A+C)/2 and B.
Then calculate the distances of D from these new centroids and so on.
Yes, it seems like we can update centroids incrementally in k-means as explained in chapter 8 of "Introduction to Data Mining" by Kumar. Here is the actual text:
Updating Centroids Incrementally
Instead of updating cluster centroids after all points have been assigned to a cluster, the centroids can be updated incrementally, after each assignment of a point to a cluster. Notice that this requires either zero or two updates to cluster centroids at each step, since a point either moves to a new cluster (two updates) or stays in its current cluster (zero updates). Using an incremental update strategy guarantees that empty clusters are not produced since all clusters start with a single point, and if a cluster ever has only one point, then that point will always be reassigned to the same cluster.
In addition, if incremental updating is used, the relative weight of the point
being added may be adjusted; e.g., the weight of points is often decreased as
the clustering proceeds. While this can result in better accuracy and faster
convergence, it can be difficult to make a good choice for the relative weight, especially in a wide variety of situations. These update issues are similar to those involved in updating weights for artificial neural networks.
Yet another benefit of incremental updates has to do with using objectives
other than “minimize SSE.” Suppose that we are given an arbitrary objective
function to measure the goodness of a set of clusters. When we process an
individual point, we can compute the value of the objective function for each
possible cluster assignment, and then choose the one that optimizes the objective. Specific examples of alternative objective functions are given in Section 8.5.2.
On the negative side, updating centroids incrementally introduces an order dependency. In other words, the clusters produced may depend on the order in which the points are processed. Although this can be addressed by
randomizing the order in which the points are processed, the basic K-means
approach of updating the centroids after all points have been assigned to clusters has no order dependency. Also, incremental updates are slightly more
expensive. However, K-means converges rather quickly, and therefore, the
number of points switching clusters quickly becomes relatively small.

Finding Optimal Parameters In A "Black Box" System

I'm developing machine learning algorithms which classify images based on training data.
During the image preprocessing stages, there are several parameters which I can modify that affect the data I feed my algorithms (for example, I can change the Hessian Threshold when extracting SURF features). So the flow thus far looks like:
[param1, param2, param3...] => [black box] => accuracy %
My problem is: with so many parameters at my disposal, how can I systematically pick values which give me optimized results/accuracy? A naive approach is to run i nested for-loops (assuming i parameters) and just iterate through all parameter combinations, but if it takes 5 minute to calculate an accuracy from my "black box" system this would take a long, long time.
This being said, are there any algorithms or techniques which can search for optimal parameters in a black box system? I was thinking of taking a course in Discrete Optimization but I'm not sure if that would be the best use of my time.
Thank you for your time and help!
Edit (to answer comments):
I have 5-8 parameters. Each parameter has its own range. One parameter can be 0-1000 (integer), while another can be 0 to 1 (real number). Nothing is stopping me from multithreading the black box evaluation.
Also, there are some parts of the black box that have some randomness to them. For example, one stage is using k-means clustering. Each black box evaluation, the cluster centers may change. I run k-means several times to (hopefully) avoid local optima. In addition, I evaluate the black box multiple times and find the median accuracy in order to further mitigate randomness and outliers.
As a partial solution, a grid search of moderate resolution and range can be recursively repeated in the areas where the n-parameters result in the optimal values.
Each n-dimensioned result from each step would be used as a starting point for the next iteration.
The key is that for each iteration the resolution in absolute terms is kept constant (i.e. keep the iteration period constant) but the range decreased so as to reduce the pitch/granular step size.
I'd call it a ‘contracting mesh’ :)
Keep in mind that while it avoids full brute-force complexity it only reaches exhaustive resolution in the final iteration (this is what defines the final iteration).
Also that the outlined process is only exhaustive on a subset of the points that may or may not include the global minimum - i.e. it could result in a local minima.
(You can always chase your tail though by offsetting the initial grid by some sub-initial-resolution amount and compare results...)
Have fun!
Here is the solution to your problem.
A method behind it is described in this paper.

Metric/density based clustering/grouping

I have a finite number of points (cloud), with a metric defined on them. I would like to find the maximum amount of clusters in this cloud such that:
1) the maximum distance between any two points in one cluster is smaller a given epsilon (const)
2) each cluster has exactly k (const) points in it
I looked at all kind of different clustering methods and clustering with a restriction on the inner maximum distance is not a problem (density based). The 2) constrain and the requirement to find "the maximum amount of clusters s.t." seem to be problematic though. Any suggestions for an efficient solution?
Thank you,
A~
Given your constraints, there might be no solution. And actually, that may happen quite often...
The most obvious case is when you don't have a multiple of k points.
But also if epsilon is set too low, there might be points that cannot be put into clusters anymore.
I think you need to rethink your requirements and problem, instead of looking for an algorithm to solve an unreasonably hard requirement that might not be satisfiable.
Also consider whether you really need to find the guaranteed maximum, or just a good solution.
There are some rather obvious approaches that will at least find a good approximation fast.
I have the same impression as #Anony-Mousse, actually: you have not understood your problem and requirements yet.
If you want your cluster sizes to be k, there is no question of how many clusters you will get: it's obviously n /k. So you can try to use a k-means variant that produces clusters of the same size as e.g. described in this tutorial: Tutorial on same-size k-means and set the desired number of cluster to n/k.
Note that this is not a particular sensible or good clustering algorithm. It does something to satisfy the constraints, but the clusters are not really meaningful from a cluster analysis point of view. It's constraint satisfaction, but not cluster analysis.
In order to also satisfy your epsilon constraint, you can then start off with this initial solution (which probably is what #Anony-Mousse referred to as "obvious approaches") and try to perform the same kind of optimization-by-swapping-elements in order to satisfy the epsilon condition.
You may need a number of restarts, because there may be no solution.
See also:
Group n points in k clusters of equal size
K-means algorithm variation with equal cluster size
for essentially redundant questions.

Need help generating discrete random numbers from distribution

I searched the site but did not find exactly what I was looking for... I wanted to generate a discrete random number from normal distribution.
For example, if I have a range from a minimum of 4 and a maximum of 10 and an average of 7. What code or function call ( Objective C preferred ) would I need to return a number in that range. Naturally, due to normal distribution more numbers returned would center round the average of 7.
As a second example, can the bell curve/distribution be skewed toward one end of the other? Lets say I need to generate a random number with a range of minimum of 4 and maximum of 10, and I want the majority of the numbers returned to center around the number 8 with a natural fall of based on a skewed bell curve.
Any help is greatly appreciated....
Anthony
What do you need this for? Can you do it the craps player's way?
Generate two random integers in the range of 2 to 5 (inclusive, of course) and add them together. Or flip a coin (0,1) six times and add 4 to the result.
Summing multiple dice produces a normal distribution (a "bell curve"), while eliminating high or low throws can be used to skew the distribution in various ways.
The key is you are going for discrete numbers (and I hope you mean integers by that). Multiple dice throws famously generate a normal distribution. In fact, I think that's how we were first introduced to the Gaussian curve in school.
Of course the more throws, the more closely you approximate the bell curve. Rolling a single die gives a flat line. Rolling two dice just creates a ramp up and down that isn't terribly close to a bell. Six coin flips gets you closer.
So consider this...
If I understand your question correctly, you only have seven possible outcomes--the integers (4,5,6,7,8,9,10). You can set up an array of seven probabilities to approximate any distribution you like.
Many frameworks and libraries have this built-in.
Also, just like TokenMacGuy said a normal distribution isn't characterized by the interval it's defined on, but rather by two parameters: Mean μ and standard deviation σ. With both these parameters you can confine a certain quantile of the distribution to a certain interval, so that 95 % of all points fall in that interval. But resticting it completely to any interval other than (−∞, ∞) is impossible.
There are several methods to generate normal-distributed values from uniform random values (which is what most random or pseudorandom number generators are generating:
The Box-Muller transform is probably the easiest although not exactly fast to compute. Depending on the number of numbers you need, it should be sufficient, though and definitely very easy to write.
Another option is Marsaglia's Polar method which is usually faster1.
A third method is the Ziggurat algorithm which is considerably faster to compute but much more complex to program. In applications that really use a lot of random numbers it may be the best choice, though.
As a general advice, though: Don't write it yourself if you have access to a library that generates normal-distributed random numbers for you already.
For skewing your distribution I'd just use a regular normal distribution, choosing μ and σ appropriately for one side of your curve and then determine on which side of your wanted mean a point fell, stretching it appropriately to fit your desired distribution.
For generating only integers I'd suggest you just round towards the nearest integer when the random number happens to fall within your desired interval and reject it if it doesn't (drawing a new random number then). This way you won't artificially skew the distribution (such as you would if you were clamping the values at 4 or 10, respectively).
1 In testing with deliberately bad random number generators (yes, worse than RANDU) I've noticed that the polar method results in an endless loop, rejecting every sample. Won't happen with random numbers that fulfill the usual statistic expectations to them, though.
Yes, there are sophisticated mathematical solutions, but for "simple but practical" I'd go with Nosredna's comment. For a simple Java solution:
Random random=new Random();
public int bell7()
{
int n=4;
for (int x=0;x<6;++x)
n+=random.nextInt(2);
return n;
}
If you're not a Java person, Random.nextInt(n) returns a random integer between 0 and n-1. I think the rest should be similar to what you'd see in any programming language.
If the range was large, then instead of nextInt(2)'s I'd use a bigger number in there so there would be fewer iterations through the loop, depending on frequency of call and performance requirements.
Dan Dyer and Jay are exactly right. What you really want is a binomial distribution, not a normal distribution. The shape of a binomial distribution looks a lot like a normal distribution, but it is discrete and bounded whereas a normal distribution is continuous and unbounded.
Jay's code generates a binomial distribution with 6 trials and a 50% probability of success on each trial. If you want to "skew" your distribution, simply change the line that decides whether to add 1 to n so that the probability is something other than 50%.
The normal distribution is not described by its endpoints. Normally it's described by it's mean (which you have given to be 7) and its standard deviation. An important feature of this is that it is possible to get a value far outside the expected range from this distribution, although that will be vanishingly rare, the further you get from the mean.
The usual means for getting a value from a distribution is to generate a random value from a uniform distribution, which is quite easily done with, for example, rand(), and then use that as an argument to a cumulative distribution function, which maps probabilities to upper bounds. For the standard distribution, this function is
F(x) = 0.5 - 0.5*erf( (x-μ)/(σ * sqrt(2.0)))
where erf() is the error function which may be described by a taylor series:
erf(z) = 2.0/sqrt(2.0) * Σ∞n=0 ((-1)nz2n + 1)/(n!(2n + 1))
I'll leave it as an excercise to translate this into C.
If you prefer not to engage in the exercise, you might consider using the Gnu Scientific Library, which among many other features, has a technique to generate random numbers in one of many common distributions, of which the Gaussian Distribution (hint) is one.
Obviously, all of these functions return floating point values. You will have to use some rounding strategy to convert to a discrete value. A useful (but naive) approach is to simply downcast to integer.