Elbow Method for optimal no. of clusters - data-science

I have a dataset that I am analysing to find the optimal number of clusters using k-means.
I am testing the number of clusters from [1..11] - which produces the following plot:
The original dataset has six classes but the elbow plot shows the bend really occurring at 3 clusters. For curiosity I overlaid a line on the plot from 11 clusters and back and it is almost a straight line to 6 clusters - which indicates to me that the real elbow is at 6, but it is subtle to see.
So, visually 3 looks to be the right answer, but given the known number of classes (6) the straight line I drew indicates 6...
Question:
How should you correctly interpret an elbow plot like this (especially when you are not given the classes)?
Would you say the elbow is at 3 or 6?

Based on the plot I'd say that there are 6 clusters. From my experience and intuition, I believe it makes sense to say that the "elbow" is where the "within cluster sum of squares" begins to decrease linearly.
However, for cluster validation, I recommend using silhouette coefficients, as the "right answer" is then obtained objectively. In addition, the silhouette coefficient takes the separation of clusters into account as well.
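To make the silhouette suggestion concrete, here is a minimal sketch in plain Python. The function name `silhouette_score` and the toy data are assumptions for illustration, not part of the question. The idea is to cluster the data for each candidate k and prefer the k with the highest mean silhouette.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (coordinate tuples)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster (separation)
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Toy data: two tight groups far apart.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(toy_points, [0, 0, 1, 1])  # groups match the clusters
bad = silhouette_score(toy_points, [0, 1, 0, 1])   # clusters cut across the groups
```

A score near 1 means compact, well-separated clusters, while values near 0 or below suggest a poor split. In practice a library routine such as scikit-learn's `silhouette_score` would normally be used instead of hand-written code.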


How can I study the properties of outliers in high-dimensional data?

I have a bundle of high-dimensional data and the instances are labeled as outliers or not. I am looking to get some insights around where these outliers reside within the data. I seek to answer questions like:
Are the outliers spread far apart from each other? Or are they clustered together?
Are the outliers lying 'in-between' clusters of good data? Or are they on the 'edge' boundaries of the data?
If outliers are clustered together, how do these cluster densities compare with clusters of good data?
'Where' are the outliers?
What kind of techniques will let me find these insights? If the data were 2- or 3-dimensional, I could easily plot it and just look at it. But I can't do that with high-dimensional data.
Analyzing the Statistical Properties of Outliers
First, you can choose to focus on specific features. For example, if you know a feature is subject to high variation, you can draw a box plot of it. You can also draw a 2D scatter plot if you want to focus on two features. This shows how much the labelled outliers vary.
Next, there's a metric called the Z-score, which says how many standard deviations a point lies from the mean. The Z-score is signed: if a point is below the mean, its Z-score is negative. It can be computed for every feature of the dataset. In your labelled dataset, you can then look for a threshold such that all points whose Z-score exceeds it are labelled outliers.
Lastly, we can compute the interquartile range (IQR) and filter on it in a similar way. The IQR is simply the difference between the 75th percentile and the 25th percentile, and it can be used much like the Z-score.
Using these techniques, we can analyze some of the statistical properties of the outliers.
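As a rough illustration of the Z-score and IQR checks on a single feature, here is a small sketch using only Python's `statistics` module. The data, the labels and the 1.5 multiplier on the IQR are assumptions chosen for the example.

```python
import statistics

def z_scores(values):
    """Signed number of standard deviations each value lies from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def iqr_bounds(values, k=1.5):
    """Lower and upper fences: quartiles widened by k times the IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical feature column with labels (True = labelled outlier).
feature = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0]
is_outlier = [False, False, False, False, False, False, True]

zs = z_scores(feature)
# Largest |z| among the normal points: everything above it is a labelled outlier.
threshold = max(abs(z) for z, flag in zip(zs, is_outlier) if not flag)
outlier_zs = [abs(z) for z, flag in zip(zs, is_outlier) if flag]

low, high = iqr_bounds(feature)
flagged_by_iqr = [v for v in feature if v < low or v > high]
```

Repeating this per feature shows which dimensions actually separate the labelled outliers from the rest.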
If you also want to analyze the clusters, you can adapt the DBSCAN algorithm to your problem. This algorithm clusters data based on densities, so it will be easy to apply the techniques to outliers.

How to create a synthetic dataset

I want to run some Machine Learning clustering algorithms on some big data.
The problem is that I'm having trouble finding interesting data for this purpose on the web. Also, such data is often inconvenient to use because the format doesn't fit my needs.
I need a txt file in which each line represents a mathematical vector, with elements separated by spaces, for example:
1 2.2 3.1
1.12 0.13 4.46
1 2 54.44
Therefore, I decided to first run those algorithms on some synthetic data which I'll create myself. How can I do this in a smart way with numpy?
By a smart way, I mean that the data won't be generated uniformly, because that's a little bit boring. How can I generate some interesting clusters?
I want to have 5GB / 10GB of data at the moment.
You need to define what you mean by "clusters", but I think what you are asking for is several random-parameter normal distributions combined together, for each of your coordinate values.
From http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.randn.html#numpy.random.randn:
For random samples from N(\mu, \sigma^2), use:
sigma * np.random.randn(...) + mu
And use <range> * np.random.rand(<howmany>) for each of sigma and mu.
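The recipe above can be sketched as follows. To keep the example dependency-free it uses the standard `random` module rather than numpy, but the formula is the same `sigma * randn + mu` idea. The function names, ranges and seed are assumptions for illustration.

```python
import random

def make_clusters(n_clusters, points_per_cluster, dims,
                  mu_range=100.0, sigma_range=5.0, seed=0):
    """Yield points from several Gaussian blobs with random centres and spreads."""
    rng = random.Random(seed)
    for _ in range(n_clusters):
        # One random mean and one random standard deviation per coordinate.
        mu = [mu_range * rng.random() for _ in range(dims)]
        sigma = [sigma_range * rng.random() for _ in range(dims)]
        for _ in range(points_per_cluster):
            yield [s * rng.gauss(0.0, 1.0) + m for s, m in zip(sigma, mu)]

def format_row(row):
    """One vector per line, elements separated by spaces."""
    return " ".join(f"{x:.4f}" for x in row)

def write_vectors(path, rows):
    """Stream rows to a text file without holding them all in memory."""
    with open(path, "w") as f:
        for row in rows:
            f.write(format_row(row) + "\n")

sample = list(make_clusters(n_clusters=3, points_per_cluster=50, dims=4))
```

Because `make_clusters` is a generator, `write_vectors("data.txt", make_clusters(...))` can produce files of many gigabytes while keeping memory use small. The file name here is only a placeholder.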
There is no single good answer to such a question. What is interesting? For clustering, unfortunately, there is no such thing as an interesting or even well-posed problem. Clustering as such has no well-defined evaluation, so each method is equally good or bad as long as it has a well-defined internal objective. So k-means will always be good at minimizing intra-cluster Euclidean distance and will struggle with sparse data and with non-convex or imbalanced clusters. DBSCAN will always be the best in a greedy, density-based sense and will struggle with clusters of varying density. GMM will always be great at fitting Gaussian mixtures and will struggle with clusters that are not Gaussian (for example lines, squares, etc.).
From the question one could deduce that you are at the very beginning of your work with clustering and so need "just anything more complex than uniform", so I suggest you take a look at dataset generators, in particular those available in scikit-learn (Python) http://scikit-learn.org/stable/datasets/ or in clusterSim (R) http://www.inside-r.org/packages/cran/clusterSim/docs/cluster.Gen or clusterGeneration (R) https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf

k-means empty cluster

I am trying to implement k-means as a homework assignment. My exercise sheet gives me the following remark regarding empty centers:
During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.
That confuses me a bit. Firstly, Wikipedia and the other sources I read do not mention that at all. I also read about the problem of 'choosing a good k for your data'. How is my algorithm supposed to converge if I start setting new centers for clusters that were empty?
If I ignore empty clusters I converge after 30-40 iterations. Is it wrong to ignore empty clusters?
Check out this example of how empty clusters can happen: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html
It basically means either 1) a random tremor in the force, or 2) the number of clusters k is wrong. You should iterate over a few different values for k and pick the best.
If during your iterating you should encounter an empty cluster, place a random data point into that cluster and carry on.
I hope this helped on your homework assignment last year.
Handling empty clusters is not part of the standard k-means algorithm, but it may result in better cluster quality. As for convergence, it is not guaranteed exactly, only heuristically, so the convergence criterion is usually extended with a maximum number of iterations.
Regarding the strategy to tackle this problem, I would say that randomly assigning some data point to the empty cluster is not very clever, since it might hurt cluster quality: the distance from that point to its currently assigned center may be large or small. A heuristic for this case would be to choose the farthest point from the biggest cluster and move it to the empty cluster, repeating until there are no empty clusters.
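The heuristic just described could look roughly like this. It is a sketch under assumptions: points are coordinate tuples, `assignment[i]` is the cluster index of point i, and the function name `fill_empty_clusters` is made up for the example.

```python
import math

def fill_empty_clusters(points, assignment, centroids):
    """Move the farthest point of the biggest cluster into each empty cluster.

    Modifies assignment and centroids in place and returns them. The centroid
    of the donor cluster is left for the next k-means update step to recompute.
    """
    k = len(centroids)
    while True:
        members = {c: [] for c in range(k)}
        for i, c in enumerate(assignment):
            members[c].append(i)
        empty = [c for c in range(k) if not members[c]]
        if not empty:
            return assignment, centroids
        biggest = max(members, key=lambda c: len(members[c]))
        if len(members[biggest]) < 2:
            # Fewer points than clusters: nothing left to give away.
            return assignment, centroids
        far = max(members[biggest],
                  key=lambda i: math.dist(points[i], centroids[biggest]))
        assignment[far] = empty[0]
        centroids[empty[0]] = points[far]

pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
assign, cents = fill_empty_clusters(pts, [0, 0, 0, 0], [(3.25, 0), (50, 50)])
```

Calling this after every assignment step keeps all k clusters populated, at the cost of one extra pass over the assignments per empty cluster.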
Statement: k-means can lead to
Consider above distribution of data points.
Overlapping points mean that the distance between them is del, where del tends to 0, meaning you can assume an arbitrarily small value for it, e.g. 0.01.
The dashed box represents the cluster assignment.
The legend in the footer represents the number line.
N=6 points
k=3 clusters (coloured)
final clusters = 2
blue cluster is orphan and ends up empty.
Empty clusters can occur if no points are allocated to a cluster during the assignment step. If this happens, you need to choose a replacement centroid, otherwise the SSE would be larger than necessary.
*Choose the point that contributes most to SSE
*Choose a point from the cluster with the highest SSE
*If there are several empty clusters, the above can be repeated several times.
***SSE = Sum of Squared Errors.
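The two SSE-based choices listed above can be sketched like this. The helper names are hypothetical, and points and centroids are assumed to be coordinate tuples.

```python
def squared_error(point, centroid):
    """Squared Euclidean distance between a point and its centroid."""
    return sum((a - b) ** 2 for a, b in zip(point, centroid))

def worst_point(points, assignment, centroids):
    """Index of the point that contributes most to the total SSE."""
    return max(
        range(len(points)),
        key=lambda i: squared_error(points[i], centroids[assignment[i]]),
    )

def worst_cluster(points, assignment, centroids):
    """Index of the cluster with the highest SSE."""
    sse = [0.0] * len(centroids)
    for p, c in zip(points, assignment):
        sse[c] += squared_error(p, centroids[c])
    return max(range(len(centroids)), key=lambda c: sse[c])

data = [(0, 0), (1, 0), (9, 0), (10, 0), (20, 0)]
labels = [0, 0, 1, 1, 1]
centres = [(0.5, 0), (13, 0)]
candidate = worst_point(data, labels, centres)
donor = worst_cluster(data, labels, centres)
```

Either `data[candidate]` or some point drawn from cluster `donor` would then become the new centroid of the empty cluster.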
Check this site https://chih-ling-hsu.github.io/2017/09/01/Clustering#
You should not ignore empty clusters but replace them. k-means is an algorithm that can only give you local minima, and empty clusters are local minima that you don't want.
Your program is going to converge even if you replace a center with a random point. Remember that at the beginning of the algorithm, you choose the initial K points randomly. If it can converge from there, why couldn't it converge from K-1 converged points plus 1 random point? Just a couple more iterations are needed.
"Choosing good k for your data" refers to the problem of choosing the right number of clusters. Since the k-means algorithm works with a predetermined number of cluster centers, their number has to be chosen at first. Choosing the wrong number could make it hard to divide the data points into clusters or the clusters could become small and meaningless.
I can't give you an answer on whether it is a bad idea to ignore empty clusters. If you do, you might end up with a smaller number of clusters than you defined at the beginning. This will confuse people who expect k-means to work in a certain way, but it is not necessarily a bad idea.
If you relocate any empty cluster centers, your algorithm will probably still converge, provided that happens only a limited number of times. However, if you have to relocate too often, your algorithm might not terminate.
For "Choosing good k for your data", Andrew Ng gives the example of a tee shirt manufacturer looking at potential customer measurements and doing k-means to decide if you want to offer S/M/L (k=3) or 2XS/XS/S/M/L/XL/2XL (k=7). Sometimes the decision is driven by the data (k=7 gives empty clusters) and sometimes by business considerations (manufacturing costs are less with only three sizes, or marketing says customers want more choices).
Set a variable to track the farthest distanced point and its cluster based on the distance measure used.
After the allocation step for all the points, check the number of datapoints in each cluster.
If any is 0, as is the case for this question, take the biggest cluster obtained and split it into 2 sub-clusters.
Replace the selected cluster with these sub-clusters.
I hope the issue is fixed now. Random assignment will affect the structure of the clustering already obtained.

Simplification / optimization of GPS track

I've got a GPS track produced by gpxlogger(1) (supplied as a client for gpsd). The GPS receiver updates its coordinates every second, and gpxlogger's logic is very simple: every n seconds (n = 3 in my case) it writes down the location (lat, lon, ele) and a timestamp (time) received from the GPS.
After recording several hours' worth of track, gpxlogger saves a GPX file several megabytes long that includes several thousand points. Afterwards, I try to plot this track on a map using OpenLayers. It works, but several thousand points make using the map a sluggish experience.
I understand that having several thousand points is suboptimal. There are myriads of points that can be deleted while losing almost nothing: when several points make up roughly a straight line and we're moving at the same constant speed between them, we can keep just the first and the last point and throw away everything in between.
I thought of using gpsbabel for this track simplification / optimization job but, alas, its simplification filter works only with routes, i.e. it analyzes only the geometrical shape of the path, without timestamps (i.e. without checking that the speed was roughly constant).
Is there some ready-made utility / library / algorithm available to optimize tracks? Or maybe I'm missing some clever option in gpsbabel?
Yes, as mentioned before, the Douglas-Peucker algorithm is a straightforward way to simplify 2D connected paths. But as you have pointed out, you will need to extend it to the 3D case to properly simplify a GPS track with an inherent time dimension associated with every point. I have done so for a web application of my own using a PHP implementation of Douglas-Peucker.
It's easy to extend the algorithm to the 3D case with a little understanding of how the algorithm works. Say you have input path consisting of 26 points labeled A to Z. The simplest version of this path has two points, A and Z, so we start there. Imagine a line segment between A and Z. Now scan through all remaining points B through Y to find the point furthest away from the line segment AZ. Say that point furthest away is J. Then, you scan the points between B and I to find the furthest point from line segment AJ and scan points K through Y to find the point furthest from segment JZ, and so on, until the remaining points all lie within some desired distance threshold.
This will require some simple vector operations. Logically, it's the same process in 3D as in 2D. If you find a Douglas-Peucker algorithm implemented in your language, it might have some 2D vector math implemented, and you'll need to extend those to use 3 dimensions.
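The procedure described above can be sketched in a dimension-agnostic way, so the same code handles (x, y) and (x, y, t) points. This is an illustrative Python version, not the PHP or C++ code mentioned in this answer, and it assumes the coordinates have already been brought to comparable units.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b, in any number of dimensions."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    # Projection parameter clamped to the segment; 0 if a and b coincide.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest)

def douglas_peucker(points, tolerance):
    """Keep end points plus every point farther than tolerance from its segment."""
    if len(points) < 3:
        return list(points)
    keep = [False] * len(points)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]  # explicit stack instead of recursion
    while stack:
        first, last = stack.pop()
        best_d, best_i = 0.0, None
        for i in range(first + 1, last):
            d = point_segment_distance(points[i], points[first], points[last])
            if d > best_d:
                best_d, best_i = d, i
        if best_i is not None and best_d > tolerance:
            keep[best_i] = True
            stack.append((first, best_i))
            stack.append((best_i, last))
    return [p for p, k in zip(points, keep) if k]

# Straight line in space, but the speed changes after the third point.
flat = douglas_peucker([(0, 0), (1, 0), (2, 0), (3, 0)], 0.1)
timed = douglas_peucker([(0, 0, 0), (1, 0, 1), (2, 0, 2), (3, 0, 10)], 0.1)
```

In the 2D run only the two end points survive, while the 3D run also keeps the point where the speed changes, which is exactly the extra information the time dimension adds.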
You can find a 3D C++ implementation here: 3D Douglas-Peucker in C++
Your x and y coordinates will probably be in degrees of latitude/longitude, and the z (time) coordinate might be in seconds since the unix epoch. You can resolve this discrepancy by deciding on an appropriate spatial-temporal relationship; let's say you want to view one day of activity over a map area of 1 square mile. Imagining this relationship as a cube of 1 mile by 1 mile by 1 day, you must prescale the time variable. Conversion from degrees to surface distance is non-trivial, but for this case we simplify and say one degree is 60 miles; then one mile is .0167 degrees. One day is 86400 seconds; then to make the units equivalent, our prescale factor for your timestamp is .0167/86400, or about 1/5,000,000.
If, say, you want to view the GPS activity within the same 1 square mile map area over 2 days instead, time resolution becomes half as important, so scale it down twice further, to 1/10,000,000. Have fun.
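The prescale arithmetic above is easy to get wrong by a factor, so here is a small worked sketch. The constant and function names are assumptions, and subtracting a start time `t0` is an extra step added here to keep the scaled values small.

```python
SECONDS_PER_DAY = 86_400
DEGREES_PER_MILE = 1 / 60  # the simplification used above: one degree is 60 miles

def time_scale(map_miles, days):
    """Factor that turns seconds into 'degrees' for a map_miles x map_miles x days cube."""
    return (map_miles * DEGREES_PER_MILE) / (days * SECONDS_PER_DAY)

def to_xyz(lat, lon, unix_seconds, t0, scale):
    """Build an (x, y, z) point with the timestamp scaled into degree-like units."""
    return (lon, lat, (unix_seconds - t0) * scale)

one_day = time_scale(map_miles=1, days=1)   # about 1/5,000,000
two_days = time_scale(map_miles=1, days=2)  # half of that
example = to_xyz(60.0, 30.0, 1_000_003, 1_000_000, one_day)
```

The resulting (x, y, z) tuples can then be fed to any 3D line simplification routine with a single distance tolerance expressed in degrees.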
Have a look at the Ramer-Douglas-Peucker algorithm for smoothing complex polygons; the Douglas-Peucker line simplification algorithm can also help you reduce your points.
GeoKarambola is an open-source Java library (no Android dependencies, but it can be used on Android) that includes a GpxPathManipulator class that does both route and track simplification/reduction (3D/elevation aware).
If the points have timestamp information, it will not be discarded.
https://sourceforge.net/projects/geokarambola/
This is the algorithm in action, interactively:
https://lh3.googleusercontent.com/-hvHFyZfcY58/Vsye7nVrmiI/AAAAAAAAHdg/2-NFVfofbd4ShZcvtyCDpi2vXoYkZVFlQ/w360-h640-no/movie360x640_05_82_05.gif
This algorithm is based on reducing the number of points by eliminating those that have the greatest XTD (cross track distance) error until a tolerated error is satisfied or the maximum number of points is reached (both are parameters of the function), whichever comes first.
An alternative algorithm, for on-the-run stream like track simplification (I call it "streamplification") is:
You keep a small buffer of the points the GPS sensor gives you. Each time a GPS point is added to the buffer (elevation included), you calculate the max XTD (cross track distance) of all the points in the buffer to the line segment that joins the first point with the (newly added) last point of the buffer. If the point with the greatest XTD violates your max tolerated XTD error (25m has given me great results), then you cut the buffer at that point, register it as a selected point to be appended to the streamplified track, trim the trailing part of the buffer up to that cut point, and keep going. At the end of the track the last point of the buffer is also added/flushed to the solution.
This algorithm is lightweight enough that it runs on an AndroidWear smartwatch and gives optimal output regardless of whether you move slow or fast, or stand idle at the same place for a long time. The ONLY thing that matters is the SHAPE of your track. You can go for many minutes/kilometers and, as long as you are moving in a straight line (a corridor within +/- tolerated XTD error deviations), the streamplify algorithm will only output 2 points: those of the exit from the last curve and the entry to the next curve.
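A rough Python sketch of the buffered approach described above follows. It is not the GeoKarambola code: it assumes points are already projected to planar coordinates in metres, uses a plain point-to-segment distance as the XTD, and always emits the first point of the track.

```python
import math

def xtd(p, a, b):
    """Planar stand-in for cross track distance: distance from p to segment a-b."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    return math.dist(p, [ai + t * c for ai, c in zip(a, ab)])

def streamplify(points, max_xtd):
    """Simplify a stream of points, keeping only where the shape bends."""
    out, buffer = [], []
    for p in points:
        buffer.append(p)
        if len(buffer) == 1:
            out.append(p)  # assumption: the first fix is always kept
            continue
        worst_i, worst_d = None, 0.0
        for i in range(1, len(buffer) - 1):
            d = xtd(buffer[i], buffer[0], buffer[-1])
            if d > worst_d:
                worst_i, worst_d = i, d
        if worst_i is not None and worst_d > max_xtd:
            out.append(buffer[worst_i])  # selected point
            buffer = buffer[worst_i:]    # restart the buffer at the cut point
    if len(buffer) > 1:
        out.append(buffer[-1])           # flush the last point of the track
    return out

track = [(0, 0), (10, 0), (20, 0), (30, 0), (30, 10), (30, 20)]
simplified = streamplify(track, max_xtd=1.0)
```

On the L-shaped example only the start, the corner and the end are kept. For real latitude/longitude input the distance function would need to be replaced by a proper geodesic cross track distance.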
I ran into a similar issue. The rate at which the GPS unit takes points is much higher than needed. Many of the points are not geographically far away from each other. The approach that I took is to calculate the distance between the points using the haversine formula. If the distance was not larger than my threshold (0.1 miles in my case) I threw away the point. This quickly gets the number of points down to a manageable size.
I don't know what language you are looking for. Here is a C# project that I was working on. At the bottom you will find the haversine code.
http://blog.bobcravens.com/2010/09/gps-using-the-netduino/
Hope this gets you going.
Bob
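For readers who prefer something other than C#, the haversine filter described in the answer above can be sketched in Python. The Earth radius constant and the choice to measure each new point against the last kept point (rather than the previous raw point) are assumptions made for this example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def thin_track(points, min_miles=0.1):
    """Drop every point that is not farther than min_miles from the last kept point."""
    kept = []
    for lat, lon in points:
        if not kept or haversine_miles(kept[-1][0], kept[-1][1], lat, lon) > min_miles:
            kept.append((lat, lon))
    return kept

raw = [(0.0, 0.0), (0.0001, 0.0), (0.0002, 0.0), (0.01, 0.0)]
thinned = thin_track(raw)
one_degree = haversine_miles(0.0, 0.0, 1.0, 0.0)
```

This filter ignores timestamps and shape entirely, so it is best seen as a cheap first pass before a shape-aware simplification.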
This is probably NP-hard. Suppose you have points A, B, C, D, E.
Let's try a simple deterministic algorithm. Suppose you calculate the distance from point B to line A-C and it's smaller than your threshold (1 meter). So you delete B. Then you try the same for C against line A-D, but the distance is bigger, and for D against line C-E, which is also bigger.
But it turns out that the optimal solution is A, B, E, because point C and D are close to the line B-E, yet on opposite sides.
If you delete 1 point, you cannot be sure that it should be a point that you should keep, unless you try every single possible solution (which can be n^n in size, so on n=80 that's more than the minimum number of atoms in the known universe).
Next step: try a brute force or branch and bound algorithm. Doesn't scale, doesn't work for real-world size. You can safely skip this step :)
Next step: First run a deterministic algorithm and improve upon that with a metaheuristic algorithm (tabu search, simulated annealing, genetic algorithms). In Java there are a couple of open source implementations, such as Drools Planner.
All in all, you'll probably have a workable solution (although not optimal) with the first simple deterministic algorithm, because you only have 1 constraint.
A far cousin of this problem is probably the Traveling Salesman Problem variant in which the salesman cannot visit all cities but has to select a few.
You want to throw away uninteresting points. So you need a function that computes how interesting a point is, then you can compute how interesting all the points are and throw away the N least interesting points, where you choose N to slim the data set sufficiently. It sounds like your definition of interesting corresponds to high acceleration (deviation from straight-line motion), which is easy to compute.
Try this, it's free and opensource online Service:
https://opengeo.tech/maps/gpx-simplify-optimizer/
I guess you need to keep points where you change direction. If you split your track into the set of intervals of constant direction, you can leave only boundary points of these intervals.
And, as Raedwald pointed out, you'll want to leave points where your acceleration is not zero.
Not sure how well this will work, but how about this: take your list of points, work out the distance between consecutive points and therefore the total distance of the route, decide on a resolution distance, and then linearly interpolate the position at each step of x meters. I.e., for each fix you have a "distance from start" measure, and you interpolate where n*x falls along your entire route. (You could decide how many points you want and divide the total distance by this to get your resolution distance.) On top of this you could add a windowing function, taking maybe the current point +/- z points and applying a weighting like exp(-k * dist^2 / accuracy^2) to get the weighted average of a set of points, where dist is the distance from the raw interpolated point and accuracy is the supposed accuracy of the GPS position.
One really simple method is to repeatedly remove the point that creates the largest angle (in the range of 0° to 180° where 180° means it's on a straight line between its neighbors) between its neighbors until you have few enough points. That will start off removing all points that are perfectly in line with their neighbors and will go from there.
You can do that in Ο(n log(n)) by making a list of each index and its angle, sorting that list in descending order of angle, keeping how many you need from the front of the list, sorting that shorter list in descending order of index, and removing the indexes from the list of points.
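# Interior angle at b in degrees (180 = straight line); points are assumed to be [x, y] pairs.
def angle_between(a, b, c)
  v1 = [a[0] - b[0], a[1] - b[1]]
  v2 = [c[0] - b[0], c[1] - b[1]]
  norm = Math.hypot(*v1) * Math.hypot(*v2)
  return 180.0 if norm.zero? # duplicate point: treat it as redundant
  cos = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
  Math.acos(cos.clamp(-1.0, 1.0)) * 180.0 / Math::PI
end

def simplify_points(points, how_many_points_to_remove)
  removal_list = []
  (2..points.length - 1).each { |next_index|
    removal_list << [next_index - 1, angle_between(points[next_index - 2], points[next_index - 1], points[next_index])]
  }
  removal_list = removal_list.sort_by { |index, angle| angle }.reverse
  removal_list = removal_list.first(how_many_points_to_remove)
  removal_list = removal_list.sort_by { |index, angle| index }.reverse
  removal_list.each { |index, _angle| points.delete_at(index) }
  return points
end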

Need help generating discrete random numbers from distribution

I searched the site but did not find exactly what I was looking for... I wanted to generate a discrete random number from normal distribution.
For example, say I have a range with a minimum of 4, a maximum of 10 and an average of 7. What code or function call (Objective-C preferred) would I need to return a number in that range? Naturally, due to the normal distribution, more of the numbers returned would center around the average of 7.
As a second example, can the bell curve/distribution be skewed toward one end or the other? Let's say I need to generate a random number with a minimum of 4 and a maximum of 10, and I want the majority of the numbers returned to center around the number 8, with a natural fall-off based on a skewed bell curve.
Any help is greatly appreciated....
Anthony
What do you need this for? Can you do it the craps player's way?
Generate two random integers in the range of 2 to 5 (inclusive, of course) and add them together. Or flip a coin (0,1) six times and add 4 to the result.
Summing multiple dice produces a normal distribution (a "bell curve"), while eliminating high or low throws can be used to skew the distribution in various ways.
The key is you are going for discrete numbers (and I hope you mean integers by that). Multiple dice throws famously generate a normal distribution. In fact, I think that's how we were first introduced to the Gaussian curve in school.
Of course the more throws, the more closely you approximate the bell curve. Rolling a single die gives a flat line. Rolling two dice just creates a ramp up and down that isn't terribly close to a bell. Six coin flips gets you closer.
So consider this...
If I understand your question correctly, you only have seven possible outcomes--the integers (4,5,6,7,8,9,10). You can set up an array of seven probabilities to approximate any distribution you like.
Many frameworks and libraries have this built-in.
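As one example of such a built-in, Python's `random.choices` accepts a list of weights. The weights below are assumptions for illustration: the symmetric set uses the binomial coefficients for six coin flips, and the skewed set is a hand-picked shape peaking at 8.

```python
import random

VALUES = [4, 5, 6, 7, 8, 9, 10]
SYMMETRIC = [1, 6, 15, 20, 15, 6, 1]  # bell shape centred on 7
SKEWED = [1, 2, 4, 8, 16, 8, 3]       # hypothetical weights peaking at 8

def draw(weights, rng=random):
    """Pick one of the seven outcomes according to the given weights."""
    return rng.choices(VALUES, weights=weights, k=1)[0]

rng = random.Random(1)  # seeded only to make the demonstration repeatable
samples = [draw(SKEWED, rng) for _ in range(5000)]
most_common = max(set(samples), key=samples.count)
```

Because the table is explicit, any shape at all can be produced by editing the weights, with no distribution theory required.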
Also, just like TokenMacGuy said, a normal distribution isn't characterized by the interval it's defined on, but rather by two parameters: mean μ and standard deviation σ. With both these parameters you can confine a certain quantile of the distribution to a certain interval, so that, say, 95 % of all points fall in that interval. But restricting it completely to any interval other than (−∞, ∞) is impossible.
There are several methods to generate normally distributed values from uniform random values (which is what most random or pseudorandom number generators produce):
The Box-Muller transform is probably the easiest, although not exactly fast to compute. Depending on how many numbers you need, it should be sufficient, though, and it is definitely very easy to write.
Another option is Marsaglia's polar method, which is usually faster (see note 1).
A third method is the Ziggurat algorithm which is considerably faster to compute but much more complex to program. In applications that really use a lot of random numbers it may be the best choice, though.
As general advice, though: don't write it yourself if you already have access to a library that generates normally distributed random numbers for you.
For skewing your distribution I'd just use a regular normal distribution, choosing μ and σ appropriately for one side of your curve and then determine on which side of your wanted mean a point fell, stretching it appropriately to fit your desired distribution.
For generating only integers I'd suggest you just round towards the nearest integer when the random number happens to fall within your desired interval and reject it if it doesn't (drawing a new random number then). This way you won't artificially skew the distribution (such as you would if you were clamping the values at 4 or 10, respectively).
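The skewing and rounding steps from the last two paragraphs can be combined into one short sketch. The function name and parameters are assumptions; the range test is applied to the rounded value so that the end values 4 and 10 get a full-width bin, which is a small design choice on top of what is described above.

```python
import random

def bounded_normal_int(mu, sigma_left, sigma_right, low, high, rng=random):
    """Draw an integer in [low, high] from a (possibly skewed) bell curve.

    Values below mu are spread with sigma_left and values above with
    sigma_right; equal sigmas give an ordinary normal distribution.
    """
    while True:
        z = rng.gauss(0.0, 1.0)
        x = mu + z * (sigma_left if z < 0 else sigma_right)
        n = round(x)
        if low <= n <= high:  # reject and redraw instead of clamping
            return n

rng = random.Random(42)  # seeded only to make the demonstration repeatable
draws = [bounded_normal_int(8, 2.0, 0.8, 4, 10, rng) for _ in range(5000)]
peak = max(set(draws), key=draws.count)
```

With a wide left side and a narrow right side the results cluster around 8 and tail off slowly towards 4, while rejection keeps the ends from piling up the way clamping would.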
Note 1: In testing with deliberately bad random number generators (yes, worse than RANDU) I've noticed that the polar method can end up in an endless loop, rejecting every sample. This won't happen with random numbers that fulfill the usual statistical expectations, though.
Yes, there are sophisticated mathematical solutions, but for "simple but practical" I'd go with Nosredna's comment. For a simple Java solution:
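import java.util.Random;

public class Bell {
    private final Random random = new Random();

    // Returns an integer in [4, 10]: 4 plus the number of heads in six fair coin flips.
    public int bell7() {
        int n = 4;
        for (int x = 0; x < 6; ++x) {
            n += random.nextInt(2); // 0 or 1 with equal probability
        }
        return n;
    }
}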
If you're not a Java person, Random.nextInt(n) returns a random integer between 0 and n-1. I think the rest should be similar to what you'd see in any programming language.
If the range was large, then instead of nextInt(2)'s I'd use a bigger number in there so there would be fewer iterations through the loop, depending on frequency of call and performance requirements.
Dan Dyer and Jay are exactly right. What you really want is a binomial distribution, not a normal distribution. The shape of a binomial distribution looks a lot like a normal distribution, but it is discrete and bounded whereas a normal distribution is continuous and unbounded.
Jay's code generates a binomial distribution with 6 trials and a 50% probability of success on each trial. If you want to "skew" your distribution, simply change the line that decides whether to add 1 to n so that the probability is something other than 50%.
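The skewed variant can be sketched in a few lines of Python. The function name is an assumption. For a minimum of 4 and six trials, the mean is 4 + 6p, so p = 2/3 centres the results around 8.

```python
import random

def binomial_int(low, trials, p, rng=random):
    """low plus the number of successes in `trials` coin flips with success probability p."""
    return low + sum(1 for _ in range(trials) if rng.random() < p)

rng = random.Random(0)  # seeded only to make the demonstration repeatable
values = [binomial_int(4, 6, 2 / 3, rng) for _ in range(10000)]
mean = sum(values) / len(values)
```

The result can never fall outside 4 to 10, so no clamping or rejection is needed.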
The normal distribution is not described by its endpoints. Normally it's described by its mean (which you have given to be 7) and its standard deviation. An important feature of this is that it is possible to get a value far outside the expected range from this distribution, although that becomes vanishingly rare the further you get from the mean.
The usual means of getting a value from a distribution is to generate a random value from a uniform distribution, which is quite easily done with, for example, rand(), and then use that as an argument to the inverse of the cumulative distribution function, which maps probabilities to upper bounds. For the normal distribution, the cumulative distribution function is
F(x) = 0.5 + 0.5*erf( (x-μ)/(σ * sqrt(2.0)))
where erf() is the error function, which may be described by a Taylor series:
erf(z) = 2.0/sqrt(π) * Σ_{n=0}^{∞} ((-1)^n * z^(2n+1)) / (n! * (2n+1))
I'll leave it as an exercise to translate this into C.
If you prefer not to engage in the exercise, you might consider using the GNU Scientific Library, which, among many other features, has a technique to generate random numbers from one of many common distributions, of which the Gaussian distribution (hint) is one.
Obviously, all of these functions return floating point values. You will have to use some rounding strategy to convert to a discrete value. A useful (but naive) approach is to simply downcast to integer.
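For comparison, the uniform-to-normal approach described in this answer can be sketched in Python, where `statistics.NormalDist` provides both the cumulative distribution function and its inverse. The function name and the seed are assumptions, and the result is still a floating point value that needs one of the rounding strategies discussed above.

```python
import random
from statistics import NormalDist

def normal_via_inverse_cdf(mu, sigma, rng=random):
    """Map a uniform random value through the inverse normal CDF."""
    u = rng.random()
    while u == 0.0:  # inv_cdf is only defined for 0 < u < 1
        u = rng.random()
    return NormalDist(mu, sigma).inv_cdf(u)

rng = random.Random(7)  # seeded only to make the demonstration repeatable
floats = [normal_via_inverse_cdf(7.0, 1.0, rng) for _ in range(10000)]
average = sum(floats) / len(floats)
median_probability = NormalDist(7.0, 1.0).cdf(7.0)  # CDF at the mean is 0.5
```

The values are unbounded in principle, so a range such as 4 to 10 still has to be enforced separately.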