Distance between Centroids of cluster and max data points

Distance between Centroids of cluster and max data points - k-means

I want to detect anomalies using K-means clustering. When I use the fit method, each data point is assigned to one of the clusters. I want to find the point that is farthest from each centroid.
How can I get the maximum distance for each cluster and define some fraction of that distance beyond which I can flag a data point as an anomaly?
I have the centroid of each cluster and I also have the data points.
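
One way to read the question is sketched below in plain Python: given the points, their cluster labels, and the centroids, find the farthest point in each cluster, then flag every point whose distance exceeds a chosen fraction of that maximum. The function names, the `fraction` parameter, and the toy data are assumptions made for illustration; they are not part of any clustering library.

```python
import math

def farthest_per_cluster(points, labels, centroids):
    """Return {cluster: (max_distance, index_of_farthest_point)}."""
    farthest = {}
    for i, (p, c) in enumerate(zip(points, labels)):
        d = math.dist(p, centroids[c])
        if c not in farthest or d > farthest[c][0]:
            farthest[c] = (d, i)
    return farthest

def flag_anomalies(points, labels, centroids, fraction=0.8):
    """Indices of points farther than fraction * (max distance in their cluster)."""
    farthest = farthest_per_cluster(points, labels, centroids)
    return [
        i for i, (p, c) in enumerate(zip(points, labels))
        if math.dist(p, centroids[c]) > fraction * farthest[c][0]
    ]

# Toy data (hypothetical): cluster 0 contains one far-out point at index 3.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (6.0, 0.0), (10.0, 10.0), (11.0, 10.0)]
labels = [0, 0, 0, 0, 1, 1]
centroids = [(1.75, 0.25), (10.5, 10.0)]  # per-cluster means of the points above

print(farthest_per_cluster(points, labels, centroids))
print(flag_anomalies(points, labels, centroids, fraction=0.8))
```

Note that with any fraction below 1 the farthest point of every cluster is flagged by construction, even in a tight cluster, so a single global threshold or a percentile of all distances may be a better fit for some data.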

Related

Centroids of K-means clustering

I was trying to cluster cities and ran into a problem:
I wanted each centroid to be located in a city, but they ended up in a desert area. I want to know if it is possible to "say" that the centroids have to be points in the input data. In other words, I want to find the n points in the input data that minimize the sum of the distances of all points to this set.

K-medoids (or PAM) is the algorithm you are looking for.
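
A minimal sketch of the idea, assuming Euclidean distance and seeding with the first k points: it alternates between assigning points to the nearest medoid and re-picking each medoid as the cluster member with the smallest summed distance to the other members. This is the simple alternating variant, not the full PAM swap search, and the function name and toy data are hypothetical.

```python
import math

def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids; returns sorted indices into `points`."""
    medoids = list(range(k))  # assumption: seed with the first k points
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # summed distance to all members of its cluster.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # can only happen with duplicate points
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(math.dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)

# Toy "cities": two well-separated groups of three points each.
cities = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoid_indices = k_medoids(cities, k=2)
print([cities[i] for i in medoid_indices])
```

Because every medoid is an index into the input, the returned centers are always actual data points, which is the property the question asks for.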

K-means: Updating centroids successively after each addition

Say we have five data points A, B, C, D, E and we are using the K-means clustering algorithm to cluster them into two clusters. Can we update the centroids as follows:
Let's select the first two, i.e. A and B, as the centroids of the initial clusters.
Then calculate the distance of C from A as well as from B. Say C is nearer to A.
Update the centroid of the cluster with centroid A before the next step, i.e. the new centroids are now (A+C)/2 and B.
Then calculate the distances of D from these new centroids, and so on.

Yes, centroids can be updated incrementally in k-means, as explained in chapter 8 of "Introduction to Data Mining" by Tan, Steinbach, and Kumar. Here is the relevant text:
Updating Centroids Incrementally
Instead of updating cluster centroids after all points have been assigned to a cluster, the centroids can be updated incrementally, after each assignment of a point to a cluster. Notice that this requires either zero or two updates to cluster centroids at each step, since a point either moves to a new cluster (two updates) or stays in its current cluster (zero updates). Using an incremental update strategy guarantees that empty clusters are not produced since all clusters start with a single point, and if a cluster ever has only one point, then that point will always be reassigned to the same cluster.
In addition, if incremental updating is used, the relative weight of the point being added may be adjusted; e.g., the weight of points is often decreased as the clustering proceeds. While this can result in better accuracy and faster convergence, it can be difficult to make a good choice for the relative weight, especially in a wide variety of situations. These update issues are similar to those involved in updating weights for artificial neural networks.
Yet another benefit of incremental updates has to do with using objectives other than “minimize SSE.” Suppose that we are given an arbitrary objective function to measure the goodness of a set of clusters. When we process an individual point, we can compute the value of the objective function for each possible cluster assignment, and then choose the one that optimizes the objective. Specific examples of alternative objective functions are given in Section 8.5.2.
On the negative side, updating centroids incrementally introduces an order dependency. In other words, the clusters produced may depend on the order in which the points are processed. Although this can be addressed by randomizing the order in which the points are processed, the basic K-means approach of updating the centroids after all points have been assigned to clusters has no order dependency. Also, incremental updates are slightly more expensive. However, K-means converges rather quickly, and therefore, the number of points switching clusters quickly becomes relatively small.
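
The one-pass scheme from the question (first k points as seeds, then a centroid update after each assignment) can be sketched as a running mean. This is only an illustration under the assumption of Euclidean distance and equal point weights; the function name and the sample coordinates for A to E are hypothetical, and the result depends on the order of the points, as the quoted text warns.

```python
import math

def sequential_k_means(points, k):
    """One pass over the data, updating a centroid after every assignment."""
    centroids = [list(p) for p in points[:k]]  # first k points seed the clusters
    counts = [1] * k
    labels = list(range(k))
    for p in points[k:]:
        # Assign the point to the nearest current centroid.
        j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        counts[j] += 1
        # Running mean: c_new = c_old + (p - c_old) / n
        centroids[j] = [c + (x - c) / counts[j] for c, x in zip(centroids[j], p)]
        labels.append(j)
    return centroids, labels

# A, B, C, D, E on a line; C and E are near A, D is near B.
A, B, C, D, E = (0.0, 0.0), (10.0, 0.0), (2.0, 0.0), (9.0, 0.0), (1.0, 0.0)
centroids, labels = sequential_k_means([A, B, C, D, E], k=2)
print(centroids)
print(labels)
```

After C is processed, the first centroid equals (A+C)/2, which matches the update described in the question.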

Kinect Fusion volume voxel settings?

I need some help figuring out the Volume Voxels Per Meter and Resolution settings in Kinect Fusion, mostly whether and how they interact with the Depth Threshold settings in the Kinect Fusion Explorer program. If the depth threshold minimum is increased and the maximum is reduced, does that smaller range increase the overall precision of the scanned volume, or does it stay the same?
Say I set Kinect Fusion's depth threshold minimum to 2 m and the maximum to 3 m, so the scanned range is 3 m - 2 m = 1 m. Does a volume voxels per meter setting of, say, 256 and a resolution of also 256 then mean that I would get a voxel depth precision of 1 m / 256 ≈ 0.0039 m ≈ 0.39 cm (about 4 mm)? Or is the resolution applicable only to the complete Kinect depth range instead of the one set via the depth threshold? Also, how are width and height affected by the depth threshold settings, and how do I calculate the precision along those two remaining axes?
Thanks in advance
P.S.
If the volume voxel resolution is set to the maximum for all three axes (768x768x768), what is the minimum amount of GPU memory needed to make Kinect Fusion work?

Answering an old topic because there is no other answer:
A. Simple Answer:
The depth threshold settings simply decide which region of the depth map you are interested in. Any value below the minimum depth threshold or above the maximum depth threshold is replaced with 0 during depth map generation.
B. Detailed Answer:
Volume Voxels per Meter: the number of voxels per meter along each axis, which fixes the edge length represented by a single voxel. So 1000 mm / 256 (voxels_per_meter) = ~3.9 mm/voxel
( See: PCL documentation )
Voxel Resolution: The number of voxels along each axis of the volume you are constructing.
So:
Voxel Resolution / Voxels per Meter = side length of the reconstruction volume (in meters)
E.g.: 512 voxels / 256 vpm = 2.0 m (the side length of the reconstruction cube, given that the number of voxels per side is the same; each axis can be defined independently.)
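
The two divisions above can be written out as a small sketch. The function names are hypothetical helpers for the arithmetic only; they are not part of the Kinect SDK.

```python
def voxel_size_mm(voxels_per_meter):
    """Edge length of one voxel in millimeters."""
    return 1000.0 / voxels_per_meter

def volume_extent_m(voxels, voxels_per_meter):
    """Side length of the reconstruction volume along one axis, in meters."""
    return voxels / voxels_per_meter

print(voxel_size_mm(256))         # 3.90625
print(volume_extent_m(512, 256))  # 2.0
print(volume_extent_m(256, 256))  # 1.0
```

Neither function takes a depth threshold as input, which mirrors the point that the voxel settings and the depth clipping range are set independently.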
If you have the Kinect SDK installed; see the descriptions of the following variables:
minDepthClip = FusionDepthProcessor.DefaultMinimumDepth;
maxDepthClip = FusionDepthProcessor.DefaultMaximumDepth;
voxelsPerMeter; voxelsX; voxelsY; voxelsZ;
So, these values do not depend on the depth threshold values, and the depth thresholds do not depend on them.
A good example of using the depth threshold values is in the great video by Daniel Shiffman ([Kinect & Processing])

Within cluster sum of square of the next iteration is bigger than the previous when K-means is applied to SURF features?

I am using K-Means algorithm to classify SIFT vector features.
My K-Means skeleton is
choose first K points of the data as initial centers
do {
    // assign each point to the corresponding cluster
    data assignment;
    // get the recalculated center of each cluster
    // suppose each point is multi-dimensional data, e.g. (x1, x2, x3, ...)
    // the center is composed of the average value of each dimension,
    // e.g. ((x1 + y1 + ...)/n, (x2 + y2 + ...)/n, ...)
    new centroid;
    sum total memberShipChanged;
} while (some point still changes its membership, i.e. total memberShipChanged != 0)
We all know that K-Means aims to minimize the within-cluster sum of squares, as illustrated in the snapshot below.
We can use a do-while loop to reach that target. Now I prove why the within-cluster sum of squares gets smaller after every iteration.
Proof:
For simplicity, I only consider 2 iterations.
After the data assignment step, every descriptor vector has its new nearest cluster center, so the within-cluster sum of squares decreases after this step. After all, the within-cluster sum of squares is the sum of the squared distances from every vector to its center, so if every vector chooses its own new nearest center, there is no doubt that the sum decreases.
In the new centroid step, I use the arithmetic mean to calculate the new center vector, so the local sum of each cluster must decrease.
So the within-cluster sum of squares decreases twice in one iteration. After several iterations, no descriptor vector changes its membership any more, and the within-cluster sum of squares reaches a local minimum.
===============================================================================
Now My question comes:
My SURF data is derived from 3000 images, every descriptor vector is 128-dimensional, and there are 1,296,672 vectors in total. In my code, I print
1) the number of vectors in each cluster
2) the total memberShipChanged in one iteration
3) the within-cluster sum of squares before one iteration.
Here is output:
sum of square : 8246977014860
90504 228516 429755 266828 1653711 398631 193081 240072
memberShipChanged : 3501098
sum of square : 4462579627000
244521 284626 448700 228211 1361902 303864 317464 311810
memberShipChanged : 975442
sum of square : 4561378972772
323746 457785 388988 228431 993328 304606 473668 330546
memberShipChanged : 828709
sum of square : 4678353976030
359537 480818 346767 222646 789858 332876 612672 355924
memberShipChanged : 563256
......
I only list the output of 4 iterations. From the output, we can see that after the first iteration the within-cluster sum of squares really decreases, from 8246977014860 to 4462579627000. But the other iterations are of nearly no use in minimizing it (the sum even grows slightly), although we can still observe that memberShipChanged is converging. I don't know why this happens. I think the first k-means iteration is overwhelmingly important.
Besides, what should I set the new center coordinates of an empty cluster to when memberShipChanged has not converged to 0 yet? For now I use (0, 0, 0, 0, 0 ...). But is this correct? Perhaps the within-cluster sum of squares increases because of it.
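
The monotonicity argument from the proof above can be checked numerically with a minimal sketch like the following. It follows the skeleton in the question (first K points as initial centers, loop until no membership changes) and records the within-cluster sum of squares after every assignment step. As one possible choice for an empty cluster, it keeps that cluster's previous center, which leaves the sum unchanged for that cluster; this is an assumption of the sketch, not the only option. The function names and the random test data are hypothetical.

```python
import math
import random

def wcss(points, labels, centers):
    """Within-cluster sum of squared Euclidean distances."""
    return sum(math.dist(p, centers[c]) ** 2 for p, c in zip(points, labels))

def k_means(points, k, max_iter=100):
    centers = [tuple(p) for p in points[:k]]  # first K points, as in the question
    labels = [None] * len(points)
    history = []
    for _ in range(max_iter):
        # Assignment step: every point picks its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        changed = sum(a != b for a, b in zip(labels, new_labels))
        labels = new_labels
        history.append(wcss(points, labels, centers))
        if changed == 0:
            break
        # Update step: each non-empty cluster moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
            # Empty cluster: keep the previous center (assumption of this sketch).
    return centers, labels, history

rng = random.Random(0)
points = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
centers, labels, history = k_means(points, k=4)
print(history)
```

If a printed sequence like this ever grows in a real implementation, a bug in the assignment or update step, or numeric overflow in the accumulator, is worth ruling out before suspecting the algorithm itself.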

How to depict multidimensional vectors on a two-dimensional plot?

I have a set of vectors in multidimensional space (may be several thousands of dimensions). In this space, I can calculate distance between 2 vectors (as a cosine of the angle between them, if it matters). What I want is to visualize these vectors keeping the distance. That is, if vector a is closer to vector b than to vector c in multidimensional space, it also must be closer to it on 2-dimensional plot. Is there any kind of diagram that can clearly depict it?

I don't think so. Imagine any two-dimensional picture of a tetrahedron. There is no way of depicting the four vertices in two dimensions with equal distances from each other. So you will have a hard time trying to depict more than three n-dimensional vectors in 2 dimensions while conserving their mutual distances.
(But right now I can't think of a rigorous proof.)
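
The tetrahedron argument can be made concrete with a short numeric check, offered here as a sketch rather than a formal proof. Place three points at mutual distance 1 (an equilateral triangle). The only point in the plane that is equidistant from all three is the circumcenter, and its distance to the vertices is 1/sqrt(3), not 1, so a fourth point at distance 1 from all three cannot exist in the plane.

```python
import math

# Three points at mutual distance 1: an equilateral triangle in the plane.
a, b, c = (0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)

# For an equilateral triangle the circumcenter coincides with the centroid.
center = ((a[0] + b[0] + c[0]) / 3, (a[1] + b[1] + c[1]) / 3)
radius = math.dist(center, a)

print(radius)               # about 0.577
print(1 / math.sqrt(3))     # the same value
```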
Update:
Ok, second idea, maybe it's dumb: if you are trying to find clusters of closely associated objects/texts, then calculate the center or mean vector of each cluster. That reduces the problem space. First find a 2D arrangement of the clusters that preserves their relative distances. Then insert the original vectors, accounting only for their relative distances within a cluster and their distances to the centers of the two or three closest clusters.
This approach will be ok for a large number of vectors. But it will not be accurate in that there always will be somewhat similar vectors ending up at distant places.
