How to do text clustering from cosine similarity - k-means

I am using WEKA to perform text clustering. Suppose I have n documents with text. I calculated TF-IDF as the feature vector for each document and then calculated the cosine similarity between each pair of documents, which generated an n×n matrix. Now I wonder how to use this n×n matrix in the k-means algorithm. I know I can apply some dimension reduction such as MDS or PCA. What confuses me is how I will identify each document after applying dimension reduction. For example, if I have 3 documents d1, d2, d3, then cosine will give me the distances
d11, d12, d13
d21, d22, d23
d31, d32, d33
Now I am not sure what the output will be after PCA or MDS, and how I will identify the documents after k-means. Please suggest. I hope I have put my question clearly.

PCA is used on the raw data, not on distances, i.e. PCA(X).
MDS uses a distance function, i.e. MDS(X, cosine).
You appear to believe you need to run PCA(cosine(X))? That doesn't work.
You want to run MDS(X, cosine).
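To make the row bookkeeping concrete, here is a minimal NumPy sketch of MDS(X, cosine) followed by k-means. It uses classical (Torgerson) MDS and a tiny Lloyd's loop rather than WEKA, and the toy vectors, `doc_ids`, and all function names are made up for illustration. The point is that row i of every intermediate array still belongs to document i, so cluster labels map back to documents by position.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T

def classical_mds(D, n_components=2):
    """Embed n points in n_components dimensions from an n x n distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    top = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale                # row i is still document i

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [points[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Hypothetical TF-IDF-like vectors: d1/d2 share terms, d3/d4 share other terms.
doc_ids = ["d1", "d2", "d3", "d4"]
X = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.8],
    [0.0, 0.0, 0.9, 1.0],
])

D = cosine_distance_matrix(X)              # n x n, row/column i = document i
coords = classical_mds(D, n_components=2)  # n x 2, row i = document i
labels = kmeans(coords, k=2)               # length n, entry i = cluster of document i
clusters = dict(zip(doc_ids, labels.tolist()))
print(clusters)
```

Nothing in the pipeline reorders rows, so keeping a parallel list of document ids is enough to recover which document ended up in which cluster.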

Related

How to find closeness between two keras pad_sequences?

I am writing a small proof of concept where I turn a catalog into a JSON file that has a URL and a label that explains the web page. I read this JSON in Python, tokenize it, and create a pad_sequences.
I then need to compare some free-flow text to find which index of the pad_sequences has the most words from the free-flow text.
I am generating a pad_sequences() from the text too, but I am not sure whether I can somehow compare the two sequences for closeness.
Please help.
You can use cosine similarity or euclidean distance to compare two vectors.
https://www.tensorflow.org/api_docs/python/tf/keras/metrics/CosineSimilarity
https://www.tutorialexample.com/calculate-euclidean-distance-in-tensorflow-a-step-guide-tensorflow-tutorial/
For sequences, you can first embed them into vectors of the same length.
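As a rough illustration of that advice, the sketch below turns padded token-id sequences into fixed-length count vectors and ranks catalog entries by cosine similarity to the query. The bag-of-words count vector is an assumption chosen for simplicity; a learned embedding would be another option. The token ids, `VOCAB_SIZE`, and function names are hypothetical, and id 0 is treated as padding, which is the default pad value of `pad_sequences`.

```python
import math

VOCAB_SIZE = 8  # hypothetical: token ids 1..7, with 0 reserved for padding

def to_count_vector(padded_ids, vocab_size=VOCAB_SIZE):
    """Map a padded id sequence to a fixed-length vector of token counts."""
    vec = [0] * vocab_size
    for token_id in padded_ids:
        if token_id != 0:          # skip padding
            vec[token_id] += 1
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical padded sequences for three catalog labels and one free-text query.
catalog = [
    [0, 0, 1, 2, 3],
    [0, 4, 5, 6, 2],
    [0, 0, 0, 7, 3],
]
query = [0, 0, 0, 2, 3]

query_vec = to_count_vector(query)
scores = [cosine_similarity(query_vec, to_count_vector(row)) for row in catalog]
best_index = max(range(len(scores)), key=scores.__getitem__)
print(best_index, scores)
```

Comparing the raw padded sequences position by position would mostly measure word order and padding, which is why the sequences are converted to order-free vectors first.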

Explained variance calculation

My questions are specific to https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.
I don't understand why you square the eigenvalues here: https://github.com/scikit-learn/scikit-learn/blob/55bf5d9/sklearn/decomposition/pca.py#L444
Also, explained_variance is not computed for new transformed data other than original data used to compute eigen-vectors. Is that not normally done?
from sklearn.decomposition import PCA
pca = PCA(n_components=2, svd_solver='full')
pca.fit(X)
pca.transform(Y)
In this case, won't you separately calculate the explained variance for data Y as well? For that purpose, I think we would have to use point 3 instead of using the eigenvalues.
Explained variance can also be computed by taking the variance of each axis in the transformed space and dividing by the total variance. Is there any reason that is not done here?
Answers to your questions:
1) The square roots of the eigenvalues of the scatter matrix (e.g. XX.T) are the singular values of X (see here: https://math.stackexchange.com/a/3871/536826). So you square them. Important: the initial matrix X should be centered (data has been preprocessed to have zero mean) in order for the above to hold.
2) Yes this is the way to go. explained_variance is computed based on the singular values. See point 1.
3) It's the same but in the case you describe you HAVE to project the data and then do additional computations. No need for that if you just compute it using the eigenvalues / singular values (see point 1 again for the connection between these two).
Finally, keep in mind that not everyone wants to project the data. Someone can get only the eigenvalues and then immediately estimate the explained variance WITHOUT projecting the data. So that is the standard way to do it.
EDIT 1:
Answer to edited Point 2
No. PCA is an unsupervised method. It only transforms the X data not the Y (labels).
Again, the explained variance can be computed quickly and easily, in half a line of code, using the eigenvalues/singular values OR, as you said, using the projected data, e.g. by estimating the covariance of the projected data; the variances of the PCs will then be on the diagonal.
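The equivalence described in this answer can be checked numerically. This is a minimal NumPy sketch on random data (the data, seed, and variable names are made up). It is not scikit-learn's code, only the same arithmetic: square the singular values of the centered X and divide by n - 1, then compare with the variance of the projected data and with the eigenvalues of the covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
n_samples = X.shape[0]

# PCA assumes centered data; the identities below rely on it.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Route 1: squared singular values, no projection needed.
explained_variance = S ** 2 / (n_samples - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Route 2: project the data, then take the variance along each component.
projected = Xc @ Vt.T
variance_of_projection = projected.var(axis=0, ddof=1)

# Route 3: eigenvalues of the sample covariance matrix, sorted descending.
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]

print(np.allclose(explained_variance, variance_of_projection))
print(np.allclose(explained_variance, cov_eigvals))
```

All three routes give the same numbers on the data used for fitting; route 1 simply skips the projection step.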

How TensorFlow deals with large Variables which cannot be stored in one box

I want to train a DNN model on training data with more than one billion feature dimensions, so the shape of the first-layer weight matrix will be (1,000,000,000, 512). This weight matrix is too large to be stored in one box.
Is there currently any solution for dealing with such large variables, for example partitioning the large weight matrix across multiple boxes?
Update:
Thanks Olivier and Keveman. let me add more detail about my problem.
The examples are very sparse and all features are binary values: 0 or 1. The parameter weight looks like tf.Variable(tf.truncated_normal([1000000000, 512], stddev=0.1))
The solutions Keveman gave seem reasonable, and I will update with results after trying them.
The answer to this question depends greatly on what operations you want to perform on the weight matrix.
The typical way to handle such a large number of features is to treat the 512-dimensional vector per feature as an embedding. If each example in your data set has only one of the 1 billion features, then you can use the tf.nn.embedding_lookup function to look up the embeddings for the features present in a mini-batch of examples. If each example has more than one feature, but presumably only a handful of them, then you can use tf.nn.embedding_lookup_sparse to look up the embeddings.
In both of these cases, your weight matrix can be distributed across many machines. That is, the params argument to both of these functions is a list of tensors. You would shard your large weight matrix and place the shards on different machines. Please look at tf.device and the primer on distributed execution to understand how data and computation can be distributed across many machines.
If you really want to do some dense operation on the weight matrix, say, multiply the matrix with another matrix, that is still conceivable, although there are no ready-made recipes in TensorFlow to handle that. You would still shard your weight matrix across machines. But then, you have to manually construct a sequence of matrix multiplies on the distributed blocks of your weight matrix, and combine the results.
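To illustrate the sharding idea without a cluster, here is a small NumPy sketch (not TensorFlow code). It splits a toy weight matrix row-wise into shards using a modulo rule, which is an assumption made for this example, and computes the first-layer output for a sparse binary example by summing only the rows of the active features. The sizes and all names are hypothetical stand-ins for the 1,000,000,000 x 512 case.

```python
import numpy as np

NUM_FEATURES, HIDDEN, NUM_SHARDS = 1000, 8, 4  # toy sizes

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(NUM_FEATURES, HIDDEN))

# Assumed layout: row i lives in shard i % NUM_SHARDS at local row i // NUM_SHARDS.
# In a real setup each shard would sit on a different machine.
shards = [W[s::NUM_SHARDS] for s in range(NUM_SHARDS)]

def lookup(feature_ids):
    """Fetch the embedding row of each feature id from the shard that owns it."""
    rows = [shards[i % NUM_SHARDS][i // NUM_SHARDS] for i in feature_ids]
    return np.stack(rows)

def first_layer_output(active_feature_ids):
    """For binary features, x @ W is the sum of the rows of the active features."""
    return lookup(active_feature_ids).sum(axis=0)

active = [3, 42, 998]                      # indices of the features equal to 1
sparse_result = first_layer_output(active)

# Dense reference, only feasible here because the toy matrix is small.
x_dense = np.zeros(NUM_FEATURES)
x_dense[active] = 1.0
dense_result = x_dense @ W

print(np.allclose(sparse_result, dense_result))
```

Only the rows for active features are ever touched, which is why the lookup approach scales to weight matrices that no single machine can hold.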

Multiscale morphological dilation and erosion

Can anyone please explain what is meant by multiscale morphological filtering? I understand the basic concepts of dilation and erosion, but in multiscale filtering a scaled structuring function is used. What does the term "scaled" mean?
Please find more relevant information at the link. I want to apply this structuring element in MATLAB code but cannot do so. Can anyone please help me?
Here the multiscale operator is described as:
F(x, s1, s2) = (f ⊖ s1) ⊕ s2
where f(x) is the original function, s1(x) and s2(x) are the structure functions, and ⊖ and ⊕ denote erosion and dilation. Apparently, erosion and dilation with different scales can filter positive and negative noise better. This operation satisfies the four quantification principles of a morphological filter. (from paper)
This operator is known in the morphology community as an Alternating Sequential Filter, which performs filtering using an alternating series of dilations and erosions, or openings and closings, of increasing radii on the same image. The series of radii for the given structuring function can be chosen based on the structure of the object or detail to be extracted or filtered. Note that two different structuring elements, s1 and s2, are used to set different scales for the erosions and dilations. This MATLAB thread discusses how to test it.
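As a small illustration, here is a 1-D NumPy sketch, assuming the operator is read as an erosion by s1 followed by a dilation by s2, and assuming flat structuring elements whose "scale" is just their half-width. The function names, radii, and test signal are made up; the paper's scaled structuring function may well be non-flat.

```python
import numpy as np

def erode(f, radius):
    """Flat grayscale erosion: sliding minimum over 2*radius + 1 samples."""
    padded = np.pad(f, radius, mode="edge")
    width = 2 * radius + 1
    return np.array([padded[i:i + width].min() for i in range(len(f))])

def dilate(f, radius):
    """Flat grayscale dilation: sliding maximum over 2*radius + 1 samples."""
    padded = np.pad(f, radius, mode="edge")
    width = 2 * radius + 1
    return np.array([padded[i:i + width].max() for i in range(len(f))])

def multiscale_filter(f, r1, r2):
    """Erosion at scale r1 followed by dilation at scale r2."""
    return dilate(erode(f, r1), r2)

# A one-sample positive spike (noise) next to a five-sample plateau (signal).
signal = np.array([0, 0, 5, 0, 0, 3, 3, 3, 3, 3, 0, 0])
filtered = multiscale_filter(signal, r1=1, r2=1)
print(filtered.tolist())
```

With r1 = r2 this is an ordinary opening: the narrow spike is removed and the wider plateau survives. Choosing r1 and r2 differently changes which feature widths are removed and how much the survivors are grown back.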

How to depict multidimensional vectors on a two-dimensional plot?

I have a set of vectors in a multidimensional space (possibly several thousand dimensions). In this space, I can calculate the distance between 2 vectors (as the cosine of the angle between them, if it matters). What I want is to visualize these vectors while preserving the distances. That is, if vector a is closer to vector b than to vector c in the multidimensional space, it must also be closer to it on the 2-dimensional plot. Is there any kind of diagram that can clearly depict this?
I don't think so. Imagine any two-dimensional picture of a tetrahedron. There is no way of depicting the four vertices in two dimensions with equal distances from each other. So you will have a hard time trying to depict more than three n-dimensional vectors in 2 dimensions while conserving their mutual distances.
(But right now I can't think of a rigorous proof.)
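One way to make the tetrahedron argument concrete is the standard classical-MDS construction (a sketch added for illustration, not part of the original answer). For points that embed exactly in k Euclidean dimensions, the double-centered matrix of squared distances has at most k nonzero eigenvalues. For four mutually equidistant points it has three positive eigenvalues, so two dimensions cannot reproduce the distances exactly.

```python
import numpy as np

n = 4
D = np.ones((n, n)) - np.eye(n)        # four points, every pairwise distance equal to 1

J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
B = -0.5 * J @ (D ** 2) @ J            # Gram matrix of the centered configuration

eigvals = np.linalg.eigvalsh(B)
needed_dims = int(np.sum(eigvals > 1e-9))  # number of dimensions an exact embedding needs
print(needed_dims)
```

The same check applied to any distance matrix tells you how many dimensions an exact Euclidean picture would require; anything plotted in fewer dimensions is necessarily an approximation.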
Update:
Ok, second idea, maybe it's dumb: if you try to find clusters of closely associated objects/texts, then calculate the center or mean vector of each cluster, you can reduce the problem space. First find a 2D arrangement of the clusters that preserves their relative distances. Then insert the primary vectors, accounting only for their relative distances within a cluster and their distances to the centers of the two or three closest clusters.
This approach will be ok for a large number of vectors. But it will not be accurate in that there always will be somewhat similar vectors ending up at distant places.