K-Means calculation on a distributed computation - k-means

I am running k-means clustering on Spark 0.9.0 and I am trying to understand how the data is distributed among n machines to compute the k cluster centers.
I understand what k-means clustering is, but I want to know how the data is divided and how the computation is done in a distributed fashion (map and reduce). In this Spark version, KMeansDataGenerator has an option to generate the data points into n partitions. Does each slave node get one partition of the data file?

KMeansDataGenerator uses sc.parallelize to generate the data, and one of sc.parallelize's parameters is the number of partitions, which you can change via KMeansDataGenerator's option.
After that, SparkKMeans uses this partition count throughout the k-means algorithm.
Does each slave node get one partition of the data file?
Spark does not guarantee where partitions are placed. However, it will try to schedule the computation on the node closest to the one that holds the partition's data.
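For illustration, here is a minimal PySpark sketch (my own example, not the Scala 0.9.0 code) showing how the partition count passed to sc.parallelize carries through: each task processes one partition, and Spark spreads those tasks over the executors, preferring nodes that already hold the data.
from pyspark import SparkContext
sc = SparkContext(appName="kmeans-partitions")
num_partitions = 4  # plays the role of KMeansDataGenerator's partition option
points = sc.parallelize([(float(i), float(i % 7)) for i in range(1000)],
                        num_partitions)
# 4 partitions -> 4 parallel tasks per stage; a node may hold several of them.
print(points.getNumPartitions())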

Related

Setting number of shuffle partitions per shuffle in the same Spark job

Is there a way, within the same Spark application or even the same job, to specify a different number of shuffle partitions for each shuffle, rather than a global number of shuffle partitions that applies for all?
In other words, can spark.sql.shuffle.partitions be set dynamically to a different value for each DataFrame transformation that involves shuffling?
This is for a scenario in which a job is a large DAG, and some shuffle outputs might be small and others very large.
Thanks!
Of course you can.
Issue sqlContext.setConf("spark.sql.shuffle.partitions", "nnn") before each join or aggregation. It has no effect on the broadcast hash join part of a query, though.
Try it and see.
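As a hedged sketch (using spark.conf.set, the Spark 2.x equivalent of sqlContext.setConf; df_small, df_large and the output paths are placeholders), the idea is to change the setting right before the action that triggers each shuffle, since DataFrames are evaluated lazily:
spark.conf.set("spark.sql.shuffle.partitions", "8")
df_small.groupBy("key").count().write.mode("overwrite").parquet("/tmp/small_counts")  # small shuffle: 8 partitions
spark.conf.set("spark.sql.shuffle.partitions", "2000")
df_large.join(df_small, "key").write.mode("overwrite").parquet("/tmp/large_join")     # large shuffle: 2000 partitions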

What are Redis timeseries module restrictions

I didn't find this in any official document: what are the Redis TimeSeries module's restrictions in terms of the following?
Max number of labels which can be added?
Max size of keys?
Max number of keys or time series?
Please let me know
Max number of labels which can be added?
A. There is no hard limit but you might experience some performance degradation when querying a very large number of labels.
Max size of keys?
A. There is no limit as long as you have memory available for the module. It is recommended to downsample your data and retire raw data by using the RETENTION option. Data compression was recently added to the module, which reduces the memory footprint significantly.
Max number of keys or time series?
A. There is no limit beyond the usual limits on Redis itself.
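As a small illustrative sketch of the RETENTION and LABELS options mentioned above (the key name, label values and retention period are made up), using redis-py's generic command interface:
import redis
r = redis.Redis()
# Keep raw samples for 24 hours (retention is given in milliseconds) and
# attach labels that can later be used for filtered queries.
r.execute_command("TS.CREATE", "sensor:1:temp",
                  "RETENTION", 24 * 60 * 60 * 1000,
                  "LABELS", "sensor_id", "1", "metric", "temperature")
# Add one sample; "*" lets the server choose the timestamp.
r.execute_command("TS.ADD", "sensor:1:temp", "*", 21.5)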

K-means: Updating centroids successively after each addition

Say we have five data points A, B, C, D, E and we are using the k-means clustering algorithm to cluster them into two clusters. Can we update the centroids as follows?
Let's select the first two, i.e. A and B, as the centroids of the initial clusters.
Then calculate the distance of C from A as well as from B. Say C is nearer to A.
Update the centroid of the cluster centered at A before the next step, i.e. the new centroids are now (A+C)/2 and B.
Then calculate the distances of D from these new centroids, and so on.
Yes, it seems like we can update centroids incrementally in k-means as explained in chapter 8 of "Introduction to Data Mining" by Kumar. Here is the actual text:
Updating Centroids Incrementally
Instead of updating cluster centroids after all points have been assigned to a cluster, the centroids can be updated incrementally, after each assignment of a point to a cluster. Notice that this requires either zero or two updates to cluster centroids at each step, since a point either moves to a new cluster (two updates) or stays in its current cluster (zero updates). Using an incremental update strategy guarantees that empty clusters are not produced since all clusters start with a single point, and if a cluster ever has only one point, then that point will always be reassigned to the same cluster.
In addition, if incremental updating is used, the relative weight of the point being added may be adjusted; e.g., the weight of points is often decreased as the clustering proceeds. While this can result in better accuracy and faster convergence, it can be difficult to make a good choice for the relative weight, especially in a wide variety of situations. These update issues are similar to those involved in updating weights for artificial neural networks.
Yet another benefit of incremental updates has to do with using objectives other than "minimize SSE." Suppose that we are given an arbitrary objective function to measure the goodness of a set of clusters. When we process an individual point, we can compute the value of the objective function for each possible cluster assignment, and then choose the one that optimizes the objective. Specific examples of alternative objective functions are given in Section 8.5.2.
On the negative side, updating centroids incrementally introduces an order dependency. In other words, the clusters produced may depend on the order in which the points are processed. Although this can be addressed by randomizing the order in which the points are processed, the basic K-means approach of updating the centroids after all points have been assigned to clusters has no order dependency. Also, incremental updates are slightly more expensive. However, K-means converges rather quickly, and therefore, the number of points switching clusters quickly becomes relatively small.
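A minimal NumPy sketch of this idea (my own toy illustration, not code from the book or the question): after each assignment the receiving centroid, and, if the point switched clusters, also the losing centroid, is kept up to date as a running mean.
import numpy as np

def incremental_kmeans(points, k, n_passes=5, seed=0):
    # Toy incremental k-means: centroids are updated after every single
    # assignment instead of once per full pass over the data.
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    counts = np.zeros(k)
    labels = np.full(len(points), -1)
    for _ in range(n_passes):
        for i, x in enumerate(points):
            j = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))
            if j == labels[i]:
                continue                 # stays in its cluster: zero updates
            old = labels[i]
            if old != -1:
                counts[old] -= 1
                if counts[old] > 0:      # remove x from the old running mean
                    centroids[old] -= (x - centroids[old]) / counts[old]
            counts[j] += 1               # add x to the new running mean
            centroids[j] += (x - centroids[j]) / counts[j]
            labels[i] = j
    return centroids, labels

# The five points A..E from the question (here 1-D values), with k = 2.
pts = np.array([[1.0], [1.5], [1.2], [8.0], [9.0]])
centroids, labels = incremental_kmeans(pts, k=2)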

Any existing implementation of distributed matrix multiplication in tensorflow?

From the GitHub code, it seems the MatMul op doesn't support partitioned matrices. So is there any tool in TensorFlow that supports multiplication of two huge matrices that are distributed across multiple nodes?
Support for distributing computation across machines is built into TensorFlow. I would recommend reading the distributed TensorFlow docs to figure out how to set up a TensorFlow cluster.
Once the cluster is set up, you can decide how to partition your problem and use with tf.device blocks to assign each worker its share of the work.
For instance, suppose you are multiplying a*a', you want to split the intermediate multiplications evenly over 2 workers, and you want to aggregate the results on the 3rd.
You would do something like this:
import tensorflow as tf

# Assumption: a cluster with a "worker" job of three tasks; the device
# strings below follow TensorFlow's /job:<name>/task:<index> convention.
worker0, worker1, worker2 = ("/job:worker/task:%d" % i for i in range(3))

with tf.device(worker0):
    # load a1 (e.g. the first half of a)
    b1 = tf.matmul(a1, tf.transpose(a1))
with tf.device(worker1):
    # load a2 (e.g. the second half of a)
    b2 = tf.matmul(a2, tf.transpose(a2))
with tf.device(worker2):
    result = b1 + b2
What the "load a1" part looks like depends on how your matrix is stored. If it's huge, then perhaps "load a1" will read it from disk. If it fits in memory, you can use a1 = a[:n//2, :] to get a partition of it.

k-means empty cluster

I am trying to implement k-means as a homework assignment. My exercise sheet gives me the following remark regarding empty centers:
During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.
That confuses me a bit: firstly, Wikipedia and the other sources I read do not mention this at all, and I had also read about the problem of 'choosing a good k for your data'. How is my algorithm supposed to converge if I keep setting new centers for clusters that were empty?
If I ignore empty clusters, I converge after 30-40 iterations. Is it wrong to ignore empty clusters?
Check out this example of how empty clusters can happen: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html
It basically means either 1) a random tremor in the force, or 2) the number of clusters k is wrong. You should iterate over a few different values for k and pick the best.
If during your iterating you should encounter an empty cluster, place a random data point into that cluster and carry on.
I hope this helped on your homework assignment last year.
Handling empty clusters is not part of the k-means algorithm itself, but it can result in better cluster quality. As for convergence, it is never guaranteed exactly, only heuristically, which is why the convergence criterion is usually extended with a maximum number of iterations.
Regarding the strategy to tackle this problem, I would say that randomly assigning some data point to the empty cluster is not very clever, since it can hurt cluster quality depending on how far that point is from its currently assigned center. A better heuristic for this case is to choose the point farthest from the centroid of the biggest cluster, move it to the empty cluster, and repeat until there are no empty clusters.
Statement: k-means can lead to empty clusters.
[The original answer illustrated this with a figure: N = 6 points on a number line, where overlapping points are separated by an arbitrarily small distance del (e.g. 0.01); k = 3 initial clusters are shown in colour, with dashed boxes marking the assignments. Only 2 final clusters remain; the blue cluster is orphaned and ends up empty.]
Empty clusters can occur if no points are allocated to a cluster during the assignment step. If this happens, you need to choose a replacement centroid, otherwise the SSE will be larger than necessary. Possible strategies (a small sketch follows after this answer):
* Choose the point that contributes most to the SSE.
* Choose a point from the cluster with the highest SSE.
* If there are several empty clusters, the above can be repeated several times.
(SSE = Sum of Squared Errors.)
Check this site: https://chih-ling-hsu.github.io/2017/09/01/Clustering#
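A small NumPy sketch of the first strategy above (my own illustration): one assignment-and-update step in which any empty cluster's centroid is replaced by the point that currently contributes most to the SSE.
import numpy as np

def kmeans_step_fix_empty(points, centroids):
    # One assignment + update step; an empty cluster's centroid is moved onto
    # the point with the largest squared distance to its assigned centroid,
    # i.e. the point that contributes most to the SSE.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members) > 0:
            centroids[j] = members.mean(axis=0)
        else:
            worst = dists[np.arange(len(points)), labels].argmax()
            centroids[j] = points[worst]
            labels[worst] = j
            dists[worst, j] = 0.0   # its SSE contribution is now zero
    return centroids, labels

# 6 points in two tight groups; the third centroid starts far away and
# would end up empty without the fix.
pts = np.array([[0.0], [0.01], [0.02], [1.0], [1.01], [1.02]])
cents = np.array([[0.0], [1.0], [5.0]])
cents, labels = kmeans_step_fix_empty(pts, cents)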
You should not ignore empty clusters but replace them. k-means is an algorithm that only gives you local minima, and the empty clusters are local minima that you don't want.
Your program is going to converge even if you replace an empty cluster's center with a random point. Remember that at the beginning of the algorithm you choose the initial K centers randomly; if that can converge, why wouldn't K-1 converged centers plus 1 random one? It just takes a couple more iterations.
"Choosing good k for your data" refers to the problem of choosing the right number of clusters. Since the k-means algorithm works with a predetermined number of cluster centers, their number has to be chosen at first. Choosing the wrong number could make it hard to divide the data points into clusters or the clusters could become small and meaningless.
I can't give you an answer on whether it is a bad idea to ignore empty clusters. If you do, you might end up with a smaller number of clusters than you defined at the beginning. This will confuse people who expect k-means to work in a certain way, but it is not necessarily a bad idea.
If you re-locate any empty cluster centers, your algorithm will probably still converge, provided that happens only a limited number of times. However, if you have to relocate too often, it might happen that your algorithm doesn't terminate.
For "Choosing good k for your data", Andrew Ng gives the example of a tee shirt manufacturer looking at potential customer measurements and doing k-means to decide if you want to offer S/M/L (k=3) or 2XS/XS/S/M/L/XL/2XL (k=7). Sometimes the decision is driven by the data (k=7 gives empty clusters) and sometimes by business considerations (manufacturing costs are less with only three sizes, or marketing says customers want more choices).
Set a variable to track the farthest point and its cluster, based on the distance measure used.
After the allocation step for all the points, check the number of data points in each cluster.
If any is 0, as is the case in this question, split the biggest cluster obtained into 2 sub-clusters.
Replace the selected cluster with these two sub-clusters, using one of them to fill the empty cluster (a sketch follows below).
I hope the issue is fixed now; random assignment would disturb the clustering structure that has already been obtained.
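A hedged sketch of this split idea (my own interpretation of the steps above, using scikit-learn's KMeans for the 2-way sub-clustering; all names are illustrative):
import numpy as np
from sklearn.cluster import KMeans

def replace_empty_by_splitting(points, centroids, labels, empty_j):
    # Split the biggest cluster into two sub-clusters and give one of the
    # resulting sub-centroids to the empty cluster empty_j.
    sizes = np.bincount(labels, minlength=len(centroids))
    biggest = int(sizes.argmax())
    members = points[labels == biggest]
    sub = KMeans(n_clusters=2, n_init=10, random_state=0).fit(members)
    centroids[biggest] = sub.cluster_centers_[0]
    centroids[empty_j] = sub.cluster_centers_[1]
    return centroids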