What is the benefit of using same-size clusters in the k-means algorithm? - k-means

Since k-means uses centroids and distance measurements, I was wondering: what is the benefit of producing same-size clusters with the k-means algorithm?
There is also the related observation that producing clusters of different sizes in k-means is considered a drawback.

Related

k-means clustering - inertia only gets larger

I am trying to use the KMeans clustering from faiss on a human pose dataset of body joints. I have 16 body parts, so a dimension of 32. The joints are scaled to a range between 0 and 1. My dataset consists of ~900,000 instances. As mentioned by faiss (faiss_FAQ):
As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points
Applying this to my problem, I randomly select 50,000 instances for training, as I want to check numbers of clusters k between 1 and 30.
Now to my "problem":
The inertia increases directly as the number of clusters increases (n_cluster on the x-axis):
I tried varying the number of iterations, the number of redos, verbose and spherical, but the results stay the same or get worse. I do not think that it is a problem of my implementation; I tested it on a small example with 2D data and very clear clusters and it worked.
Is the data simply badly clustered, or is there another problem or mistake I have missed? Could it be the scaling of the values between 0 and 1? Should I try another approach?
I found my mistake. I had to increase the parameter max_points_per_centroid. As I have so many data points, it sampled a sub-batch for the fit, and for a larger number of clusters this sub-batch is larger. See the faiss FAQ:
If there are more than max_points_per_centroid * k training points, there are too many points, making k-means unnecessarily slow. Then the training set is sampled.
The larger sub-batch of course has a larger inertia, as there are more points in total.
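For reference, a minimal sketch of how this might look with the faiss Python wrapper, assuming it forwards max_points_per_centroid to the underlying ClusteringParameters; the dataset x here is a random stand-in for the pose data:

```python
import numpy as np
import faiss

d = 32                  # 16 joints x 2 coordinates
k = 30                  # number of clusters to test
x = np.random.rand(900_000, d).astype('float32')  # stand-in for the real dataset

# Raise max_points_per_centroid so the full training set is used
# instead of a sampled sub-batch (faiss defaults to 256 points per centroid).
kmeans = faiss.Kmeans(d, k, niter=20, verbose=True,
                      max_points_per_centroid=x.shape[0] // k + 1)
kmeans.train(x)

# Final k-means objective (sum of squared distances to the nearest centroid);
# compare this value across different k on the same training set.
print(kmeans.obj[-1])
```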

How to optimally use a run length morton-encoded chunk of voxels to generate a mesh?

I am currently toying around with meshing voxel chunks, that is, an N^3 list of voxel values. Since in my use-case most of the voxels are going to be of the same type, a lot of the neighbors will share the same value. Thus, using RLE (run-length encoding) makes sense, as it drastically cuts down the actual storage requirement at the cost of an O(log(n)) (n being the number of runs in a chunk) random lookup. At the same time, I encode each voxel position as a z-order curve, specifically using morton encoding. This all works in my use-case and gives the expected space reductions.
The issue is with meshing the individual chunks. Generating meshes takes up to 500ms (for N=48), which is simply too long for fluid gameplay. My current algorithm heavily borrows from the one described here and here.
Pseudocode algorithm:
- For each axis in [X, Y, Z]:
  - For each slice k in 0..N along that axis:
    - Create a mask of size [N * N]
    - For each u in 0..N along the first orthogonal axis:
      - For each v in 0..N along the second orthogonal axis:
        - Check if (k, u, v) has a face between itself and (k+1, u, v)
        - If yes, set mask[u + v*N] = true
    - Using the mask, make a greedy mesh and output it
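For concreteness, here is a rough Python sketch of the mask-building step described above. It assumes a hypothetical get_voxel(chunk, x, y, z) accessor that hides the morton/RLE lookup, and treats a solid/empty transition as a face:

```python
import numpy as np

def build_face_mask(chunk, N, axis, k, get_voxel):
    """Build the N*N face mask for slice k along the given axis (0, 1 or 2).

    get_voxel(chunk, x, y, z) is assumed to return the voxel value (0 = empty);
    in the original setup it would decode the morton index and do the RLE lookup.
    """
    u_axis, v_axis = [a for a in range(3) if a != axis]
    mask = np.zeros(N * N, dtype=bool)
    for u in range(N):
        for v in range(N):
            pos = [0, 0, 0]
            pos[axis], pos[u_axis], pos[v_axis] = k, u, v
            here = get_voxel(chunk, *pos)
            pos[axis] = k + 1
            neighbour = get_voxel(chunk, *pos) if k + 1 < N else 0
            # A face exists where a solid voxel borders an empty one.
            mask[u + v * N] = (here != 0) != (neighbour != 0)
    return mask
```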
The mesh generation itself is very fast (<<< 1ms) and wouldn't gain much from being optimized, but building the mask itself is very costly. As one can see in the tracing output, each mesh_mask_building takes on average ~4ms, and this happens 3*N times per mesh!
My first thought to optimize this was to use the inherent runs of the chunk and simply traverse those, building a mask from each, but this did not work out: the morton encoding winds a lot throughout the primitive and as such would not be much better at constructing a mesh. It would also be highly suboptimal when one considers a chunk where only one layer is set to visible voxels. The current method generates a simple cuboid because it does not care about runs, but with my suggested method each run would be separate and generate more faces.
So my question is, how can I use a morton-run-length encoding to generate a mesh?

Time complexity of counting cycles and complete subgraphs of a given size

I am very new to graph theory and I am trying to understand the following. Given an undirected graph G with V vertices and E edges, what is the time complexity of counting all the complete subgraphs of size k, and what is the time complexity of counting all the cycles of length k? Basically, which one is faster? I am only considering cases where k is even. Are there any well-known references for this?
Counting all cycles of length k can, in general, be achieved in O(VE) time (V = number of vertices, E = number of edges), although this can be greatly improved depending on k (Ref). Counting all complete subgraphs or cliques can be achieved in O(n^k · k^2) in the general case (with n = V), but again, this can be improved upon depending on the structure of the graph and the algorithm used (Ref).
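As an illustration of where the O(n^k · k^2) bound for cliques comes from, here is a naive Python sketch that checks every k-subset of vertices (roughly n^k subsets) and, for each, tests all ~k^2 pairs for an edge; the adjacency-set representation is just an assumption for the example:

```python
from itertools import combinations

def count_k_cliques(adj, k):
    """Count complete subgraphs of size k.

    adj maps each vertex to the set of its neighbours (undirected graph).
    Runtime: C(n, k) subsets ~ O(n^k), times O(k^2) pair checks each.
    """
    count = 0
    for subset in combinations(list(adj), k):
        if all(v in adj[u] for u, v in combinations(subset, 2)):
            count += 1
    return count

# Example: a triangle plus a pendant vertex has exactly one 3-clique.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(count_k_cliques(graph, 3))  # -> 1
```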

K-means: Updating centroids successively after each addition

Say we have five data points A, B, C, D, E and we are using the k-means clustering algorithm to cluster them into two clusters. Can we update the centroids as follows:
Select the first two, i.e. A and B, as the centroids of the initial clusters.
Then calculate the distance of C from A as well as from B. Say C is nearer to A.
Update the centroid of the cluster containing A before the next step, i.e. the new centroids are now (A+C)/2 and B.
Then calculate the distances of D from these new centroids and so on.
Yes, it seems like we can update centroids incrementally in k-means as explained in chapter 8 of "Introduction to Data Mining" by Kumar. Here is the actual text:
Updating Centroids Incrementally
Instead of updating cluster centroids after all points have been assigned to a cluster, the centroids can be updated incrementally, after each assignment of a point to a cluster. Notice that this requires either zero or two updates to cluster centroids at each step, since a point either moves to a new cluster (two updates) or stays in its current cluster (zero updates). Using an incremental update strategy guarantees that empty clusters are not produced since all clusters start with a single point, and if a cluster ever has only one point, then that point will always be reassigned to the same cluster.
In addition, if incremental updating is used, the relative weight of the point being added may be adjusted; e.g., the weight of points is often decreased as the clustering proceeds. While this can result in better accuracy and faster convergence, it can be difficult to make a good choice for the relative weight, especially in a wide variety of situations. These update issues are similar to those involved in updating weights for artificial neural networks.

Yet another benefit of incremental updates has to do with using objectives other than “minimize SSE.” Suppose that we are given an arbitrary objective function to measure the goodness of a set of clusters. When we process an individual point, we can compute the value of the objective function for each possible cluster assignment, and then choose the one that optimizes the objective. Specific examples of alternative objective functions are given in Section 8.5.2.

On the negative side, updating centroids incrementally introduces an order dependency. In other words, the clusters produced may depend on the order in which the points are processed. Although this can be addressed by randomizing the order in which the points are processed, the basic K-means approach of updating the centroids after all points have been assigned to clusters has no order dependency. Also, incremental updates are slightly more expensive. However, K-means converges rather quickly, and therefore, the number of points switching clusters quickly becomes relatively small.
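To make the scheme in the question concrete, here is a minimal Python sketch of a single incremental pass: each point is assigned to the nearest centroid, and that centroid is immediately recomputed as the mean of its members. This is only the assignment-by-assignment variant described above, not the full iterative algorithm:

```python
import numpy as np

def incremental_pass(points, k):
    """One incremental pass of k-means: seed the first k points as centroids,
    then assign each remaining point and update its centroid immediately."""
    points = np.asarray(points, dtype=float)
    centroids = points[:k].copy()          # e.g. A and B for k = 2
    members = [[i] for i in range(k)]      # indices assigned to each cluster
    for i in range(k, len(points)):
        # Nearest centroid by Euclidean distance.
        j = np.argmin(np.linalg.norm(centroids - points[i], axis=1))
        members[j].append(i)
        # Recompute the centroid of the updated cluster right away,
        # e.g. (A + C) / 2 after C joins A's cluster.
        centroids[j] = points[members[j]].mean(axis=0)
    return centroids, members

# Five 1-D points A..E clustered into two clusters.
cents, groups = incremental_pass([[1.0], [10.0], [2.0], [9.0], [1.5]], k=2)
print(cents, groups)
```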

k-means empty cluster

I am trying to implement k-means as a homework assignment. My exercise sheet gives me the following remark regarding empty centers:
During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.
That confuses me a bit; firstly, Wikipedia and the other sources I read do not mention that at all. I further read about the problem of 'choosing a good k for your data' - how is my algorithm supposed to converge if I keep setting new centers for clusters that were empty?
If I ignore empty clusters I converge after 30-40 iterations. Is it wrong to ignore empty clusters?
Check out this example of how empty clusters can happen: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html
It basically means either 1) a random tremor in the force, or 2) the number of clusters k is wrong. You should iterate over a few different values for k and pick the best.
If during your iterating you should encounter an empty cluster, place a random data point into that cluster and carry on.
I hope this helped on your homework assignment last year.
Handling empty clusters is not part of the k-means algorithm itself, but it might result in better cluster quality. As for convergence, it is only heuristically guaranteed, never exactly, and hence the convergence criterion is usually extended with a maximum number of iterations.
Regarding the strategy to tackle this problem, I would say randomly assigning some data point to the empty cluster is not very clever, since it might hurt cluster quality regardless of how far that point is from its currently assigned center. A heuristic for this case would be to choose the farthest point from the biggest cluster and move it to the empty cluster, and to repeat this until there are no empty clusters.
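A minimal Python sketch of that heuristic, assuming points is an (n, d) array, centroids is a (k, d) array, and labels holds the current assignment of each point:

```python
import numpy as np

def fill_empty_clusters(points, centroids, labels):
    """For every empty cluster, move the farthest point of the biggest cluster
    into it and make that point the new centroid."""
    k = len(centroids)
    for j in range(k):
        if np.any(labels == j):
            continue                      # cluster j is not empty
        # Biggest cluster and its farthest member from its own centroid.
        sizes = np.bincount(labels, minlength=k)
        big = np.argmax(sizes)
        idx = np.where(labels == big)[0]
        dists = np.linalg.norm(points[idx] - centroids[big], axis=1)
        far = idx[np.argmax(dists)]
        # Re-seed the empty cluster with that point.
        labels[far] = j
        centroids[j] = points[far]
        centroids[big] = points[labels == big].mean(axis=0)
    return centroids, labels
```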
Statement: k-means can lead to empty clusters.
Consider the following distribution of data points (the original answer included a figure, which is not reproduced here). Overlapping points mean that the distance between them is del, where del tends to 0, i.e. you can assume an arbitrarily small value such as 0.01. The dashed boxes represent cluster assignments and the legend represents the number line.
- N = 6 points
- k = 3 clusters (coloured)
- final clusters = 2
- the blue cluster is orphaned and ends up empty
Empty clusters can be obtained if no points are allocated to a cluster during the assignment step. If this happens, you need to choose a replacement centroid, otherwise the SSE would be larger than necessary.
- Choose the point that contributes most to the SSE.
- Choose a point from the cluster with the highest SSE.
- If there are several empty clusters, the above can be repeated several times.
(SSE = sum of squared errors.)
Check this site https://chih-ling-hsu.github.io/2017/09/01/Clustering#
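A small Python sketch of the first strategy above, replacing the empty cluster's centroid with the point that currently contributes most to the SSE (points, centroids and labels are assumed arrays as in the earlier sketch):

```python
import numpy as np

def reseed_with_max_sse_point(points, centroids, labels, empty_j):
    """Replace the centroid of empty cluster empty_j with the point whose
    squared distance to its current centroid (its SSE contribution) is largest."""
    sq_err = np.sum((points - centroids[labels]) ** 2, axis=1)
    worst = np.argmax(sq_err)
    labels[worst] = empty_j
    centroids[empty_j] = points[worst]
    return centroids, labels
```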
You should not ignore empty clusters but replace them. k-means is an algorithm that only gives you local minima, and empty clusters are local minima that you don't want.
Your program is going to converge even if you replace an empty center with a random point. Remember that at the beginning of the algorithm you choose the initial K centers randomly; if the algorithm can converge from there, why wouldn't it converge from K-1 already-converged centers plus 1 random one? Just a couple more iterations are needed.
"Choosing good k for your data" refers to the problem of choosing the right number of clusters. Since the k-means algorithm works with a predetermined number of cluster centers, their number has to be chosen at first. Choosing the wrong number could make it hard to divide the data points into clusters or the clusters could become small and meaningless.
I can't give you an answer on whether it is a bad idea to ignore empty clusters. If you do, you might end up with a smaller number of clusters than you defined at the beginning. This will confuse people who expect k-means to work in a certain way, but it is not necessarily a bad idea.
If you re-locate any empty cluster centers, your algorithm will probably converge anyway, provided that happens a limited number of times. However, if you have to relocate too often, it might happen that your algorithm doesn't terminate.
For "Choosing good k for your data", Andrew Ng gives the example of a tee shirt manufacturer looking at potential customer measurements and doing k-means to decide if you want to offer S/M/L (k=3) or 2XS/XS/S/M/L/XL/2XL (k=7). Sometimes the decision is driven by the data (k=7 gives empty clusters) and sometimes by business considerations (manufacturing costs are less with only three sizes, or marketing says customers want more choices).
- Keep a variable that tracks the farthest point and its cluster, based on the distance measure used.
- After the allocation step for all the points, check the number of data points in each cluster.
- If any count is 0, as is the case in this question, take the biggest cluster and split it into 2 sub-clusters.
- Replace the affected clusters with these sub-clusters: one takes the place of the split cluster, the other fills the empty one.
That should fix the issue. Random assignment, by contrast, would disturb the clustering structure that has already been obtained.
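A rough Python sketch of that split-and-replace idea, using scikit-learn's KMeans with k=2 on the biggest cluster; the helper name and the use of scikit-learn are my own choices for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

def split_biggest_into_empty(points, centroids, labels, empty_j):
    """Fill empty cluster empty_j by splitting the biggest cluster into two
    sub-clusters and using their centroids for the big and the empty cluster."""
    k = len(centroids)
    big = np.argmax(np.bincount(labels, minlength=k))
    idx = np.where(labels == big)[0]
    sub = KMeans(n_clusters=2, n_init=10).fit(points[idx])
    centroids[big] = sub.cluster_centers_[0]
    centroids[empty_j] = sub.cluster_centers_[1]
    labels[idx[sub.labels_ == 1]] = empty_j  # points in the second sub-cluster move over
    return centroids, labels
```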