Metric/density based clustering/grouping - optimization

I have a finite number of points (a cloud), with a metric defined on them. I would like to find the maximum number of clusters in this cloud such that:
1) the maximum distance between any two points in one cluster is smaller than a given epsilon (const)
2) each cluster has exactly k (const) points in it
I looked at all kinds of different clustering methods, and clustering with a restriction on the inner maximum distance is not a problem (density based). Constraint 2) and the requirement to find "the maximum number of clusters s.t." seem to be problematic though. Any suggestions for an efficient solution?
Thank you,
A~

Given your constraints, there might be no solution. And actually, that may happen quite often...
The most obvious case is when you don't have a multiple of k points.
But also if epsilon is set too low, there might be points that cannot be put into clusters anymore.
I think you need to rethink your requirements and problem, instead of looking for an algorithm to solve an unreasonably hard requirement that might not be satisfiable.
Also consider whether you really need to find the guaranteed maximum, or just a good solution.
There are some rather obvious approaches that will at least find a good approximation fast.
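One such fast approximation could look like the following. This is only a greedy sketch, not a method for finding the guaranteed maximum: the function name `greedy_eps_groups` and the assumption that the metric is available as a precomputed distance matrix `D` are illustrative choices, not part of the question.

```python
import numpy as np

def greedy_eps_groups(D, k, eps):
    """Greedily collect disjoint groups of exactly k points whose
    pairwise distances are all smaller than eps.

    D is an (n, n) matrix of pairwise distances (an assumption of this sketch).
    Returns a list of groups, each a list of k point indices.
    """
    n = len(D)
    free = np.ones(n, dtype=bool)  # points not yet placed in a group
    groups = []
    for i in range(n):
        if not free[i]:
            continue
        group = [i]
        # Try the nearest neighbours of i first.
        for j in np.argsort(D[i]):
            if len(group) == k:
                break
            # j joins only if it is within eps of every current member.
            if j != i and free[j] and all(D[j, g] < eps for g in group):
                group.append(int(j))
        if len(group) == k:
            groups.append(group)
            free[group] = False
    return groups
```

Points that cannot be completed to a full group are simply left unassigned, which matches the observation above that not every point can necessarily be clustered. The result depends on the order in which points are visited, so several runs over shuffled orders may give more groups.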

I have the same impression as @Anony-Mousse, actually: you have not understood your problem and requirements yet.
If you want your cluster sizes to be k, there is no question of how many clusters you will get: it's obviously n/k. So you can try to use a k-means variant that produces clusters of the same size, as e.g. described in this tutorial: Tutorial on same-size k-means, and set the desired number of clusters to n/k.
Note that this is not a particularly sensible or good clustering algorithm. It does something to satisfy the constraints, but the clusters are not really meaningful from a cluster analysis point of view. It's constraint satisfaction, but not cluster analysis.
In order to also satisfy your epsilon constraint, you can then start off with this initial solution (which probably is what @Anony-Mousse referred to as "obvious approaches") and try to perform the same kind of optimization-by-swapping-elements in order to satisfy the epsilon condition.
You may need a number of restarts, because there may be no solution.
See also:
Group n points in k clusters of equal size
K-means algorithm variation with equal cluster size
for essentially redundant questions.


How can I study the properties of outliers in high-dimensional data?

I have a bundle of high-dimensional data and the instances are labeled as outliers or not. I am looking to get some insights around where these outliers reside within the data. I seek to answer questions like:
Are the outliers spread far apart from each other? Or are they clustered together?
Are the outliers lying 'in-between' clusters of good data? Or are they on the 'edge' boundaries of the data?
If outliers are clustered together, how do these cluster densities compare with clusters of good data?
'Where' are the outliers?
What kind of techniques will let me find these insights? If the data were 2- or 3-dimensional, I could easily plot the data and just look at it. But I can't do that with high-dimensional data.
Analyzing the Statistical Properties of Outliers
First of all, you can choose to focus on specific features. For example, if you know a feature is subject to high variation, you can draw a box plot. You can also draw a 2D graph if you want to focus on 2 features. This shows how much the labelled outliers vary.
Next, there's a metric called the Z-score, which basically says how many standard deviations a point lies from the mean. The Z-score is signed, meaning that if a point is below the mean, the Z-score will be negative. This can be used to analyze all the features of the dataset. You can find the threshold value in your labelled dataset for which all the points above that threshold are labelled outliers.
Lastly, we can find the interquartile range and similarly filter based on it. The IQR is simply the difference between the 75th percentile and the 25th percentile. You can use it in a similar way to the Z-score.
Using these techniques, we can analyze some of the statistical properties of the outliers.
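The Z-score and IQR checks could be sketched per feature with NumPy as follows. The function names are made up for illustration, and the factor 1.5 in the IQR fences is the usual Tukey convention, an assumption that the text above does not specify.

```python
import numpy as np

def zscores(X):
    """Signed Z-score of every value, computed per feature (column).
    Note: a constant feature has zero standard deviation and would
    produce a division by zero here."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def iqr_flags(X, factor=1.5):
    """True where a value lies outside [Q1 - factor*IQR, Q3 + factor*IQR]."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    return (X < q1 - factor * iqr) | (X > q3 + factor * iqr)

def compare_groups(X, is_outlier):
    """Mean absolute Z-score per feature for labelled outliers vs. the rest."""
    z = np.abs(zscores(X))
    return z[is_outlier].mean(axis=0), z[~is_outlier].mean(axis=0)
```

Features where the first array returned by `compare_groups` is much larger than the second are the ones along which the labelled outliers sit far from the bulk of the data.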
If you also want to analyze the clusters, you can adapt the DBSCAN algorithm to your problem. This algorithm clusters data based on densities, so it will be easy to apply the techniques to outliers.

How to estimate the Scoring Scheme in Pairwise Alignment

I'm not a specialist in bioinformatics. I want to align two nucleotide sequences using a global alignment method. Each sequence is a combination of the letters {A,C,T,G}.
The problem is that I don't know how to choose the best scoring scheme (substitution scores and gap penalties).
Currently, I'm using the values +1, -1, -2 for match, mismatch and gap penalty. I'm also aware that the number of transitions in human DNA is larger than the number of transversions.
My question is how to estimate the scores for match, mismatch and gap based on my dataset. Is there any statistical model that can help?
To answer this question we would need to know the dataset well and your exact scope, but generally match/mismatch can be scored as +1/-1. This does not distinguish between transitions and transversions.
For that, I advise you to take a look at this model and Kimura.
Finally, for the gap penalty, you may use a "low, medium, or high" penalty according to how divergent the sequences are. I mean, if the organisms are closely related then you may use the low gap penalty, and the high penalty for more divergent organisms, so the gap penalty depends on how divergent the sequences you are aligning are.
Whether the sequences are divergent or not depends on your data, as I said, but you may take a look at these examples for some sequences: link1, link2, link3, link4, and link5
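To make concrete how the scoring scheme enters a global alignment, here is a minimal Needleman-Wunsch scoring sketch. The defaults are the +1/-1/-2 values from the question; the separate `transition` and `transversion` parameters are a hypothetical extension for experimenting with a milder penalty for transitions, not values estimated from any dataset.

```python
def nw_score(a, b, match=1, transition=-1, transversion=-1, gap=-2):
    """Optimal global alignment score of sequences a and b
    (Needleman-Wunsch with a linear gap penalty)."""
    purines = {"A", "G"}

    def sub(x, y):
        if x == y:
            return match
        # Transition: purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
        same_class = (x in purines) == (y in purines)
        return transition if same_class else transversion

    # prev[j] holds the best score of aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + sub(a[i - 1], b[j - 1]),  # (mis)match
                           prev[j] + gap,                          # gap in b
                           cur[j - 1] + gap))                      # gap in a
        prev = cur
    return prev[-1]
```

For example, `nw_score("ACGT", "ACT")` evaluates to 1 under the default scheme (three matches and one gap). Changing `transition` or `gap` and re-scoring known related sequences is one simple way to see how sensitive the alignment is to the chosen scheme.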

How to create a synthetic dataset

I want to run some machine learning clustering algorithms on some big data.
The problem is that I'm having trouble finding interesting data for this purpose on the web. Also, such data is often inconvenient to use because the format doesn't fit my needs.
I need a txt file in which each line represents a mathematical vector, with the elements separated by spaces, for example:
1 2.2 3.1
1.12 0.13 4.46
1 2 54.44
Therefore, I decided to first run those algorithms on some synthetic data which I'll create myself. How can I do this in a smart way with numpy?
By a smart way, I mean that it shouldn't be generated uniformly, because that's a little bit boring. How can I generate some interesting clusters?
I want to have 5 GB to 10 GB of data at the moment.
You need to define what you mean by "clusters", but I think what you are asking for is several random-parameter normal distributions combined together, for each of your coordinate values.
From http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.randn.html#numpy.random.randn:
For random samples from N(\mu, \sigma^2), use:
sigma * np.random.randn(...) + mu
And use <range> * np.random.rand(<howmany>) for each of sigma and mu.
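Put together, that recipe might look like the sketch below. The function names, the default ranges for `mu` and `sigma`, and the output format string are assumptions for illustration; it uses NumPy's `default_rng` generator rather than the legacy `np.random.randn`, which is a stylistic swap, not something the answer above prescribes.

```python
import numpy as np

def gaussian_cluster(rng, n_points, dim, mu_range=100.0, sigma_range=5.0):
    """One cluster: per-coordinate normal samples with random mean and spread."""
    mu = mu_range * rng.random(dim)        # random centre for this cluster
    sigma = sigma_range * rng.random(dim)  # random spread per coordinate
    return sigma * rng.standard_normal((n_points, dim)) + mu

def write_clusters(path, n_clusters, points_per_cluster, dim, seed=0):
    """Write the clusters one at a time, so memory use stays at one cluster."""
    rng = np.random.default_rng(seed)
    with open(path, "wb") as f:
        for _ in range(n_clusters):
            block = gaussian_cluster(rng, points_per_cluster, dim)
            # Default delimiter is a single space: one vector per line.
            np.savetxt(f, block, fmt="%.4f")
```

Writing cluster by cluster keeps memory bounded, which matters at the 5 GB to 10 GB scale. Note that the rows end up grouped by cluster in the file, so shuffle them afterwards if the order matters for your algorithm.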
There is no one good answer to such a question. What is interesting? For clustering, unfortunately, there is no such thing as an interesting or even well-posed problem. Clustering as such has no well-defined evaluation; consequently each method is equally good/bad, as long as it has a well-defined internal objective. So k-means will always be a good one for minimizing within-cluster Euclidean distance and will struggle with sparse data and with non-convex or imbalanced clusters. DBSCAN will always be the best in the greedy density-based sense and will struggle with clusters of diverse density. GMM will always be great at fitting Gaussian mixtures, and will struggle with clusters which are not Gaussian (for example lines, squares etc.).
From the question one could deduce that you are at the very beginning of your work with clustering and so need "just anything more complex than uniform", so I suggest you take a look at dataset generators, in particular those accessible in scikit-learn (Python) http://scikit-learn.org/stable/datasets/ or in clusterSim (R) http://www.inside-r.org/packages/cran/clusterSim/docs/cluster.Gen or clusterGeneration (R) https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf

k-means empty cluster

I am trying to implement k-means as a homework assignment. My exercise sheet gives me the following remark regarding empty centers:
During the iterations, if any of the cluster centers has no data points associated with it, replace it with a random data point.
That confuses me a bit. Firstly, Wikipedia and the other sources I read do not mention that at all. I also read about the problem of 'choosing a good k for your data': how is my algorithm supposed to converge if I start setting new centers for clusters that were empty?
If I ignore empty clusters I converge after 30-40 iterations. Is it wrong to ignore empty clusters?
Check out this example of how empty clusters can happen: http://www.ceng.metu.edu.tr/~tcan/ceng465_f1314/Schedule/KMeansEmpty.html
It basically means either 1) a random tremor in the force, or 2) the number of clusters k is wrong. You should iterate over a few different values for k and pick the best.
If during your iterating you should encounter an empty cluster, place a random data point into that cluster and carry on.
I hope this helped on your homework assignment last year.
Handling empty clusters is not part of the k-means algorithm itself, but it might result in better cluster quality. Regarding convergence, it is only heuristically guaranteed in practice, so the convergence criterion is usually extended with a maximum number of iterations.
Regarding the strategy for tackling this problem, I would say that randomly assigning some data point to the empty cluster is not very clever, since it may affect the cluster quality: the distance of that point to its currently assigned center may be large or small. A heuristic for this case would be to choose the farthest point from the biggest cluster and move it to the empty cluster, and to repeat this until there are no empty clusters.
Statement: k-means can lead to empty clusters.
Consider the distribution of data points shown above.
overlapping points mean that the distance between them is del, where del tends to 0, so you can assume an arbitrarily small value for it, e.g. 0.01
the dashed box represents the cluster assignment
the legend in the footer represents the number line
N=6 points
k=3 clusters (coloured)
final clusters = 2
The blue cluster is orphaned and ends up empty.
Empty clusters can arise if no points are allocated to a cluster during the assignment step. If this happens, you need to choose a replacement centroid, otherwise the SSE would be larger than necessary.
*Choose the point that contributes most to SSE
*Choose a point from the cluster with the highest SSE
*If there are several empty clusters, the above can be repeated several times.
***SSE = Sum of Square Error.
Check this site https://chih-ling-hsu.github.io/2017/09/01/Clustering#
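The first strategy in the list (re-seed an empty cluster with the point that contributes most to the SSE) could be sketched as follows. This is an illustrative variant of Lloyd's iteration, not a reference implementation: the function name, the `n_iter` cap, and the rule of never taking the last point away from a cluster are assumptions of this sketch.

```python
import numpy as np

def kmeans_reseed(X, k, n_iter=100, seed=0):
    """k-means where each empty cluster is repaired by moving in the point
    with the largest squared distance to its assigned centre.
    Assumes len(X) >= k."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every centre.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        err = d2[np.arange(len(X)), labels].copy()  # per-point SSE contribution
        counts = np.bincount(labels, minlength=k)
        for j in range(k):
            if counts[j] == 0:
                # Only take points from clusters that keep at least one member.
                masked = np.where(counts[labels] > 1, err, -1.0)
                worst = masked.argmax()
                counts[labels[worst]] -= 1
                labels[worst] = j
                counts[j] = 1
                err[worst] = 0.0
        # Update step: every cluster is non-empty here, so the mean is defined.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The `n_iter` cap is there because, once centres are re-seeded, the usual argument that the SSE decreases at every step no longer strictly applies.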
You should not ignore empty clusters but replace them. k-means is an algorithm that can only give you local minima, and empty clusters are local minima that you don't want.
Your program is going to converge even if you replace a center with a random point. Remember that at the beginning of the algorithm, you choose the initial K points randomly. If it can converge from that, why couldn't K-1 converged points plus 1 random point? It just needs a couple more iterations.
"Choosing good k for your data" refers to the problem of choosing the right number of clusters. Since the k-means algorithm works with a predetermined number of cluster centers, their number has to be chosen at first. Choosing the wrong number could make it hard to divide the data points into clusters or the clusters could become small and meaningless.
I can't give you an answer on whether it is a bad idea to ignore empty clusters. If you do, you might end up with a smaller number of clusters than you defined at the beginning. This will confuse people who expect k-means to work in a certain way, but it is not necessarily a bad idea.
If you relocate any empty cluster centers, your algorithm will probably converge anyway, provided that this happens only a limited number of times. However, if you have to relocate too often, it might happen that your algorithm doesn't terminate.
For "Choosing good k for your data", Andrew Ng gives the example of a tee shirt manufacturer looking at potential customer measurements and doing k-means to decide if you want to offer S/M/L (k=3) or 2XS/XS/S/M/L/XL/2XL (k=7). Sometimes the decision is driven by the data (k=7 gives empty clusters) and sometimes by business considerations (manufacturing costs are less with only three sizes, or marketing says customers want more choices).
Set a variable to track the farthest point and its cluster, based on the distance measure used.
After the allocation step for all the points, check the number of data points in each cluster.
If any is 0, as is the case in this question, take the biggest cluster obtained and split it into 2 sub-clusters.
Replace the selected cluster with these sub-clusters.
I hope the issue is fixed now. Random assignment will affect the structure of the clustering already obtained.

Weighted Bipartite Matching covering one partition

I have a problem here that I managed to reduce to a weighted bipartite matching problem. Basically, I have a bipartite graph with partitions A and B, and a set of edges with weights. In my case, |A|~=20 and |B|=300.
I want to find a set of edges which minimizes the total weight AND COVERS 'A' (each vertex in A has an associated solution edge).
Questions:
-Is there a special name for this kind of problem, so I can look for algorithms and solutions?
-I know I can reduce it to a weighted bipartite perfect matching by adding dummy vertices to A, with infinite weight. But I'm worried about practical performance since |B|>>|A|.
-Any suggestions on Java libraries? I found this: http://algs4.cs.princeton.edu/code/. I think 'AssignmentProblem.java' is almost what I need (but I guess it doesn't ensure a perfect matching?)
Thanks in advance, and sorry about the bad English.
a) maximum weighted perfect matching
b) ???
c) the Floyd or Floyd-Warshall algorithm is your friend
I've found a C implementation on the web, and you can also use Edmonds' blossom algorithm.
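For the setting in the question (minimum total weight, every vertex of the smaller side A matched, |A| <= |B|), the problem is, as far as I can tell, usually called the rectangular (or unbalanced) assignment problem. Below is a sketch of the Hungarian (Kuhn-Munkres) method with potentials that works directly on a rectangular cost matrix, so no dummy vertices are needed. It is written in Python for illustration rather than Java, and it assumes a dense matrix `cost[i][j]` with rows for A and columns for B; a missing edge would have to be encoded as a very large cost, which is an assumption of this sketch.

```python
def assign_min_cost(cost):
    """Hungarian method for an n x m cost matrix with n <= m.
    Returns (match, total): match[i] is the column assigned to row i."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)    # row potentials (1-based)
    v = [0] * (m + 1)    # column potentials (1-based)
    p = [0] * (m + 1)    # p[j]: row currently matched to column j (0 = none)
    way = [0] * (m + 1)  # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:  # reached a free column: augmenting path found
                break
        while j0 != 0:      # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match, sum(cost[i][match[i]] for i in range(n))
```

The running time is O(n^2 m), which for n = 20 and m = 300 is small, so the size imbalance between A and B should not be a practical concern with this formulation.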