ML.NET KMeans clustering - What is the Davies Bouldin Index? - k-means

I am using the KMeans clustering algorithm from ML.NET (here) and when evaluating the model, I see Davies Bouldin Index in the model metrics.
What is the range of this index? What does its value of zero mean?

According to the documentation the Davies Bouldin Index is:
"The average ratio of within-cluster distances to between-cluster distances. The tighter the cluster, and the further apart the clusters are, the lower this value is."
Also:
"Values closer to 0 are better. Clusters that are farther apart and less dispersed will result in a better score."
In other words, the index ranges from 0 upwards, and a value of 0 is the theoretical optimum: perfectly compact clusters that are far apart from each other. You can find more information on the Davies Bouldin Index here.
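If you want to sanity-check the metric outside ML.NET, scikit-learn exposes the same index as davies_bouldin_score. A minimal sketch, assuming a small synthetic dataset (the blob parameters and the choice of k below are made up for illustration):

# Minimal sketch: Davies-Bouldin index for a KMeans clustering in scikit-learn.
# The synthetic data and the choice of k are illustrative only.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Lower is better; 0 is the theoretical minimum (perfectly compact,
# well-separated clusters), and there is no upper bound.
print(davies_bouldin_score(X, labels))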

Related

What impurity index (Gini, entropy?) is used in TensorFlow Random Forests with CART trees?

I was looking for this information in the tensorflow_decision_forests docs (https://github.com/tensorflow/decision-forests, https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/wrappers/CartModel) and in the yggdrasil_decision_forests docs (https://github.com/google/yggdrasil-decision-forests).
I've also taken a look at the code of these two libraries, but I didn't find that information.
I'm also curious if I can specify an impurity index to use.
I'm looking for an analogue of the scikit-learn decision tree, where you can specify the impurity measure with the criterion parameter:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
For TensorFlow Random Forests I found only the parameter uplift_split_score:
uplift_split_score: For uplift models only. Splitter score i.e. score optimized by the splitters. The scores are introduced in "Decision trees for uplift modeling with single and multiple treatments", Rzepakowski et al. Notation: p probability / average value of the positive outcome, q probability / average value in the control group.
- KULLBACK_LEIBLER or KL: - p log (p/q)
- EUCLIDEAN_DISTANCE or ED: (p-q)^2
- CHI_SQUARED or CS: (p-q)^2/q
Default: "KULLBACK_LEIBLER".
I'm not sure if it's a good lead.
No, you shouldn't use uplift_split_score, because, as the documentation states, it is "For uplift models only".
Uplift modeling is used to estimate treatment effects and for other tasks in causal inference.
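For reference, the scikit-learn analogy mentioned in the question looks like this; the dataset below is only a placeholder:

# The scikit-learn analogy from the question: the impurity measure used for
# splitting is selected via the criterion parameter. The dataset is a placeholder.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
gini_tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
entropy_tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)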

Remove Outliers from a Multitrace in PYMC3

I have a model which has 3 parameters A, n, and Beta.
I did a Bayesian analysis using pymc3 and got the posterior distributions of the parameters in a multitrace called "trace". Is there any way to remove the outliers of A (and thus the corresponding values of n and Beta) from the multitrace?
Stating that specific values of A are outliers implies that you have enough "domain expertise" to know that the ranges these values fall into have a very low probability of occurrence in the experiment/system you are modelling.
You could therefore narrow your chosen prior distribution for A, such that these "outliers" remain in the tails of the distribution.
Reducing the overall model entropy with such an informative prior choice is somewhat risky, but it can be considered a valid approach if you know that values within these specific ranges just do not happen in real-life experiments.
Once Bayes' rule is applied, your posterior distribution will put much less weight on these ranges and should better reflect the actual system behaviour.
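As a rough sketch of that idea in PyMC3, you could replace a wide prior on A with a truncated or tighter one; the bounds, the other priors and the omitted likelihood below are placeholders, not taken from your model:

# Rough sketch: narrowing the prior on A with PyMC3. All numeric values,
# the other priors and the (omitted) likelihood are placeholders.
import pymc3 as pm

with pm.Model() as model:
    # A truncated prior keeps A inside the range you consider physically
    # plausible, so implausible "outlier" values get little or no prior mass.
    A = pm.TruncatedNormal("A", mu=1.0, sigma=0.5, lower=0.0, upper=3.0)
    n = pm.Normal("n", mu=0.0, sigma=1.0)
    beta = pm.Normal("Beta", mu=0.0, sigma=1.0)

    # ... your likelihood goes here, e.g. an observed pm.Normal built from A, n, Beta

    trace = pm.sample(2000, tune=1000)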

How can I study the properties of outliers in high-dimensional data?

I have a bundle of high-dimensional data and the instances are labeled as outliers or not. I am looking to get some insights around where these outliers reside within the data. I seek to answer questions like:
Are the outliers spread far apart from each other? Or are they clustered together?
Are the outliers lying 'in-between' clusters of good data? Or are they on the 'edge' boundaries of the data?
If outliers are clustered together, how do these cluster densities compare with clusters of good data?
'Where' are the outliers?
What kind of techniques will let me find these insights? If the data were 2- or 3-dimensional, I could easily plot it and just look at it. But I can't do that with high-dimensional data.
Analyzing the Statistical Properties of Outliers
First of all, you can choose to focus on specific features. For example, if you know a feature is subject to high variation, you can draw a box plot. You can also draw a 2D graph if you want to focus on 2 features. This shows how much the labelled outliers vary.
Next, there's a metric called the Z-score, which basically says how many standard deviations a point is from the mean. The Z-score is signed, meaning that if a point is below the mean, its Z-score will be negative. This can be computed for every feature of the dataset. You can then look for the threshold value in your labelled dataset above which all points are labelled outliers.
Lastly, we can compute the interquartile range (IQR) and filter based on it in a similar way. The IQR is simply the difference between the 75th percentile and the 25th percentile. You can use it much like the Z-score.
Using these techniques, we can analyze some of the statistical properties of the outliers.
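A minimal numpy sketch of both checks, assuming the usual |z| > 3 and 1.5 * IQR rules of thumb (X below is placeholder data):

# Minimal sketch: per-feature Z-scores and IQR bounds. X stands in for your
# high-dimensional data; the thresholds are common rules of thumb.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                   # placeholder data

# Z-score: signed number of standard deviations from the per-feature mean
z = (X - X.mean(axis=0)) / X.std(axis=0)
z_outlier_mask = (np.abs(z) > 3).any(axis=1)      # any feature beyond |z| = 3

# IQR: difference between the 75th and 25th percentile, per feature
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
iqr_outlier_mask = ((X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)).any(axis=1)

# Compare these masks against your existing outlier labels to see which
# features and thresholds separate the labelled outliers from the good data.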
If you also want to analyze the clusters, you can adapt the DBSCAN algorithm to your problem. This algorithm clusters data based on densities, so it will be easy to apply the techniques to outliers.
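For the clustering side, a sketch along those lines with scikit-learn's DBSCAN; eps and min_samples are illustrative and need tuning for your data, and the data and labels here are random placeholders:

# Sketch: density-based clustering with DBSCAN to see where labelled outliers
# end up. Data, labels, eps and min_samples are placeholders.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))                   # placeholder data
is_outlier = rng.random(1000) < 0.05              # placeholder outlier labels

cluster_labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(X)   # -1 means noise

# Do the labelled outliers fall into DBSCAN's noise bucket, form small dense
# clusters of their own, or mix into the big clusters of good data?
for cluster_id in np.unique(cluster_labels):
    members = cluster_labels == cluster_id
    print(cluster_id, members.sum(), is_outlier[members].mean())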

How to create a synthetic dataset

I want to run some Machine Learning clustering algorithms on some big data.
The problem is that I'm having trouble finding interesting data for this purpose on the web. Also, such data is often inconvenient to use because its format doesn't fit my needs.
I need a txt file in which each line represents a mathematical vector, with elements separated by spaces, for example:
1 2.2 3.1
1.12 0.13 4.46
1 2 54.44
Therefore, I decided to first run those algorithms on some synthetic data which I'll create myself. How can I do this in a smart way with numpy?
By a smart way, I mean that it shouldn't be generated uniformly, because that's a little bit boring. How can I generate some interesting clusters?
I want to have 5GB / 10GB of data at the moment.
You need to define what you mean by "clusters", but I think what you are asking for is several random-parameter normal distributions combined together, for each of your coordinate values.
From http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.randn.html#numpy.random.randn:
For random samples from N(\mu, \sigma^2), use:
sigma * np.random.randn(...) + mu
And use <range> * np.random.rand(<howmany>) for each of sigma and mu.
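A small sketch of that approach, writing the space-separated format from the question; the number of clusters, points per cluster and parameter ranges are arbitrary:

# Sketch: several normal "blobs" with a random mu/sigma per cluster, written
# as space-separated vectors, one per line. All constants are arbitrary.
import numpy as np

n_clusters, dim, points_per_cluster = 5, 3, 100_000
rng = np.random.default_rng(42)

chunks = []
for _ in range(n_clusters):
    mu = 50 * rng.random(dim)        # random cluster centre in [0, 50)^dim
    sigma = 5 * rng.random(dim)      # random per-dimension spread
    chunks.append(sigma * rng.standard_normal((points_per_cluster, dim)) + mu)

data = np.vstack(chunks)
rng.shuffle(data)                    # so clusters are not stored contiguously
np.savetxt("synthetic.txt", data, fmt="%g", delimiter=" ")

To reach the 5-10 GB scale, generate and append such chunks in a loop instead of holding everything in memory at once.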
There is no one good answer to such a question. What is interesting? For clustering, unfortunately, there is no such thing as an interesting or even well-posed problem. Clustering as such has no well-defined evaluation; consequently, each method is equally good/bad, as long as it has a well-defined internal objective. So k-means will always be good at minimizing within-cluster Euclidean distance and will struggle with sparse data and non-convex, imbalanced clusters. DBSCAN will always be the best in a greedy, density-based sense and will struggle with clusters of diverse density. GMM will always be great at fitting Gaussian mixtures, and will struggle with clusters which are not Gaussian (for example lines, squares, etc.).
From the question one can deduce that you are at the very beginning of your work with clustering and so need "just anything more complex than uniform", so I suggest you take a look at dataset generators, in particular those accessible in scikit-learn (Python) http://scikit-learn.org/stable/datasets/ or in clusterSim (R) http://www.inside-r.org/packages/cran/clusterSim/docs/cluster.Gen or clusterGeneration (R) https://cran.r-project.org/web/packages/clusterGeneration/clusterGeneration.pdf
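For example, scikit-learn's blob generator can be dumped straight into the text format above; the parameters are only an illustration:

# Sketch: scikit-learn's make_blobs, saved in the space-separated format from
# the question. The parameters are illustrative.
import numpy as np
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500_000, n_features=3, centers=5,
                  cluster_std=[0.5, 1.0, 2.0, 0.3, 1.5], random_state=0)
np.savetxt("blobs.txt", X, fmt="%g", delimiter=" ")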

How does Lucene build its VSM?

I understand the concepts of VSM, TF-IDF and cosine similarity; however, I am still confused about how Lucene builds its VSM and calculates similarity for each query, even after reading the Lucene website.
As I understood it, a VSM is a matrix filled with the TF-IDF values of each term. When I tried building a VSM from a set of documents, it took a long time with this tool: http://sourceforge.net/projects/wvtool/
This is not really about the coding; intuitively, building a VSM matrix over large data is time-consuming, but that does not seem to be the case for Lucene.
In addition, even with a VSM prebuilt, finding the most similar document, which is basically computing the similarity between two documents or between a query and a document, is often time-consuming (assume millions of documents, because one has to compute similarity to everyone else), but Lucene seems to do it really fast. I guess that's also related to how it builds the VSM internally. If possible, can someone also explain this?
So please help me to understand two points here:
1. How does Lucene build its VSM so fast that it can be used for calculating similarity?
2. How come Lucene's similarity calculation among millions of documents is so fast?
I'd appreciate it if a real example were given.
Thanks
As I understood it, a VSM is a matrix filled with the TF-IDF values of each term.
This is more properly called a term-document matrix. The VSM is more of a conceptual framework from which this matrix, and the notion of cosine similarity arise.
Lucene stores term frequencies and document frequencies that can be used to get tf-idf weights for document and query terms. It uses those to compute a variant of cosine similarity outlined here. So, the rows of the term-document matrix are represented in the index, which is a hash table mapping terms to (document, tf) pairs plus a separate table mapping terms to their df value.
one has to compute similarity to everyone else
That's not true. If you review the textbook definition of cosine similarity, you'll find that it's the sum of products of corresponding term weights in a query and a document, normalized. Terms that occur in the document but not the query, or vice versa, have no effect on the similarity. It follows that, to compute cosine similarity, you only need to consider those documents that have some term in common with the query. That's how Lucene gets its speed: it does a hash table lookup for the query terms and computes similarities only to the documents that have non-zero intersection with the query's bag of words.
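A toy sketch of that mechanism, an inverted index plus scoring of only those documents that share a term with the query; this is not Lucene's actual implementation, just the idea in miniature:

# Toy sketch of the mechanism described above: an inverted index mapping each
# term to (doc_id, tf) postings plus a df table, so the scoring loop only
# touches documents that share at least one term with the query.
# This is not Lucene's actual implementation.
import math
from collections import defaultdict, Counter

docs = ["the quick brown fox", "the lazy dog", "the quick dog barks"]

postings = defaultdict(list)   # term -> [(doc_id, tf), ...]
df = Counter()                 # term -> document frequency
for doc_id, text in enumerate(docs):
    for term, count in Counter(text.split()).items():
        postings[term].append((doc_id, count))
        df[term] += 1

def search(query, n_docs=len(docs)):
    scores = defaultdict(float)
    for term, q_tf in Counter(query.split()).items():
        if term not in postings:
            continue
        idf = math.log(n_docs / df[term])
        for doc_id, d_tf in postings[term]:      # only docs containing the term
            scores[doc_id] += (q_tf * idf) * (d_tf * idf)
    # A real engine also normalizes by document length / vector norms.
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search("quick fox"))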