Hierarchical clustering with different sample sizes in Python

I would like to know if it's possible to do hierarchical clustering with different sample sizes in Python, more precisely with Ward's minimum variance method.
For instance, I have 5 lists of integers, A, B, C, D, E, of different lengths. What I want to do is group these 5 lists into 3 groups according to Ward's method (merging at each step the pair of clusters that gives the smallest increase in within-cluster variance).
Does anyone know how to do this?

We can consider these 5 lists as the samples you want to cluster into 3 groups.
Hierarchical clustering, as you may know, can take a distance matrix as input.
A distance matrix holds some sort of pairwise distance (or dissimilarity) between your samples.
You have to construct this 5x5 matrix by choosing a meaningful distance function, which greatly depends on what your samples/integers represent. Since your samples do not all have the same length, you can't compute metrics like the Euclidean distance directly.
For example, if the integers in your lists can be interpreted as classes (set members), you could compute the Jaccard index to express some sort of dissimilarity.
[1 2 3 4 5] and [1 3 4] have a Jaccard similarity index of 3/5 (or a dissimilarity of 2/5), 0 being entirely different and 1 perfectly identical.
https://en.wikipedia.org/wiki/Jaccard_index
Once your dissimilarity matrix is computed (it really only contains 5 choose 2 = 10 distinct values, since the matrix is symmetric with a zero diagonal), you can apply hierarchical clustering to it.
The important part is finding a distance function adapted to your problem.
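For instance, here is a minimal sketch using SciPy. The lists A-E below are made-up placeholders, and note that Ward linkage formally assumes Euclidean distances, so applying it to Jaccard dissimilarities is only a heuristic grouping:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Placeholder data: 5 lists of integers with different lengths.
lists = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, 3, 4],
    "C": [2, 4, 6, 8],
    "D": [1, 2, 3],
    "E": [5, 6, 7, 8, 9],
}

def jaccard_dissimilarity(a, b):
    # 1 - |intersection| / |union| of the two lists treated as sets.
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)

names = list(lists)
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = jaccard_dissimilarity(lists[names[i]], lists[names[j]])
        dist[i, j] = dist[j, i] = d

# squareform converts the symmetric 5x5 matrix into the condensed form linkage expects.
Z = linkage(squareform(dist), method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 groups
print(dict(zip(names, labels)))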

Related

Getting 2 values of focal length when finding Intrinsic camera matrix (F not Fx,Fy)?

The following image is the example that was given in my computer vision class. Now I can't understand why we are getting 2 unique values of f. I can understand mx*f and my*f being different, but shouldn't the focal length f itself be the same?
I believe you have an fx and an fy. This is so that the matrix transform can scale f in the two directions x and y. IIRC this is why you get 2 f values.
If you really want a single f, it has to be modeled that way in the camera model used for calibration,
e.g. by giving mx and my to the camera model as known constants and estimating only f.
However, the calibration process that produced that K was probably not set up that way; it treated the two elements K(0,0) and K(1,1) independently.
In other words, mx and my were effectively estimated as well, in the sense of absorbing the aspect ratio.
The estimated values are therefore not the same as the mx and my calculated from the sensor specifications.
This is why you got 2 values.
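As a toy illustration of that point (all numbers below are made up), you can build K yourself from f, mx and my and see that calibration only hands you back the products mx*f and my*f, i.e. K(0,0) and K(1,1):

import numpy as np

f = 4.0e-3                   # focal length in metres (assumed value)
mx, my = 250_000, 240_000    # pixels per metre in x and y (non-square pixels)
cx, cy = 320, 240            # principal point in pixels

K = np.array([[mx * f, 0.0,    cx],
              [0.0,    my * f, cy],
              [0.0,    0.0,    1.0]])

# A calibration routine estimates K[0, 0] and K[1, 1] independently, so the two
# implied focal lengths only coincide if mx and my are known exactly:
fx, fy = K[0, 0], K[1, 1]
print(fx / mx, fy / my)      # both recover f only when mx and my are correct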

Self-Attention Explainability of the Output Score Matrix

I am learning about attention models, and following along with Jay Alammar's amazing blog tutorial on The Illustrated Transformer. He gives a great walkthrough of how the attention scores are calculated, but I get a bit lost at a certain point: I am not seeing how the output matrix Z he describes is used to interpret the strength of associations between different words within an input sequence.
He mentions that given some input matrix X, with shape N x D, where N is the number of elements in an input sequence, and D is the input dimensionality, we multiply X with three separate weight matrices of shape D x d, where d is some lower dimensionality that represents the projected space of the query, key, and value matrices:
The query and key matrices are dotted together, divided by a scaling factor (usually the square root of the projected dimensionality), and then run through a softmax function. This produces a weight matrix of size N x N, which is multiplied by the value matrix to get an output Z of shape N x d, about which Jay says:
That concludes the self-attention calculation. The resulting vector is
one we can send along to the feed-forward neural network.
A screenshot from his blog illustrates this calculation.
However, this is where I'm confused. Z is N x d, and I don't really understand what I'm supposed to do with this matrix from an interpretability standpoint. As far as I understand, for a particular sequence element (e.g. the word cats in the sequence I love pets, especially cats), self-attention is supposed to score the other parts of the sequence highly when they are relevant to, or strongly associated with, that word. I would therefore expect Z to be N x N, so that I could select Z[i, j] and say that for the i-th word in the sequence, the j-th word relates or associates with it this or that much.
In fact, wouldn't it make much more sense to use only the softmax output of the weights (without multiplying it by the value matrix), since that already is N x N? In essence, how is Jay determining the strength of these associations with the word it in this particular sequence?
What he is showing is an N x 1 relationship: there are N values, one per token, giving the strength of association with the word it. Those values are a row of the N x N softmax weight matrix, not of Z.
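To make that concrete, here is a minimal numpy sketch of single-head scaled dot-product attention with random placeholder weights. The N x N softmax matrix (called weights below) is what visualizations like the one for the word it are built from, not the N x d output Z:

import numpy as np

N, D, d = 5, 16, 8                  # sequence length, input dim, projected dim
rng = np.random.default_rng(0)
X  = rng.normal(size=(N, D))        # placeholder token embeddings
Wq = rng.normal(size=(D, d))        # in a real model these three are learned
Wk = rng.normal(size=(D, d))
Wv = rng.normal(size=(D, d))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                    # N x N raw scores
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
Z = weights @ V                                  # N x d, passed to the feed-forward layer

# For interpretability, inspect weights, not Z:
# weights[i, j] is how strongly token i attends to token j.
print(weights[4])                                # how the last token attends to every token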

effective number

In Gelman's book, the effective sample size is defined in terms of the following:
R hat
the between- and within-sequence variances, B and W
the number of MCMC draws per chain, denoted by n
the number of chains, denoted by m
I do not know how sampling() calculates the between-sequence variance when chains = 1, so I cannot compute these quantities (B, W, m). I want to implement the algorithm from this paper: https://arxiv.org/abs/1804.06788.
Roughly speaking, the paper constructs a test statistic that is uniformly distributed under the null hypothesis that the MCMC sampling is correct. If the sampling is not correct, the histogram of the test statistic becomes skewed, and this deviation from uniformity tells us that the MCMC contains bias. I want to implement this, but it requires the quantities above.
In rstan, is there a function to extract these quantities? I assume that during the calculation of the R hat statistic, the quantities B, W and m are retained somewhere in the stanfit S4 object.
Sorry, I have since found n_eff, but I still do not know what m is when chains = 1.
In the case that only one chain is estimated (which should not be happening anyway), m = 2, because the post-warmup draws from the single chain are split into a first half and a second half. This splitting is discussed in the documentation.
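If you just need B, W, m and R hat themselves, they are cheap to recompute from the raw draws. Below is a hedged numpy sketch of the split-chain bookkeeping described above (it mirrors the formulas in Gelman et al., not rstan's internal code); with chains = 1, the single chain is split into m = 2 halves:

import numpy as np

def split_rhat_quantities(draws):
    # draws: 1-D array of post-warmup draws of one parameter from a single chain.
    draws = np.asarray(draws, dtype=float)
    half = len(draws) // 2
    chains = np.stack([draws[:half], draws[half:2 * half]])   # shape (m, n) with m = 2
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    B = n * chain_means.var(ddof=1)          # between-sequence variance
    W = chains.var(axis=1, ddof=1).mean()    # within-sequence variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    rhat = np.sqrt(var_hat / W)
    return {"m": m, "n": n, "B": B, "W": W, "Rhat": rhat}

# Example with synthetic draws; with a stanfit object you would first export the
# post-warmup draws of one parameter (e.g. via as.matrix(fit) in R).
print(split_rhat_quantities(np.random.default_rng(1).normal(size=1000)))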

Understanding multidimensional full covariance of normal multivariate distribution in TensorFlow

Suppose I have, say, 3 identically distributed random vectors: w, v and x generally with different lengths. w is length 2, v is length 3 and x is length 4.
How should I define the full covariance matrix sigma of these vectors for tf.contrib.distributions.MultivariateNormalFullCovariance(mean, sigma)?
I think of the full covariance in this case as a [(2 + 3 + 4) x (2 + 3 + 4)] square matrix (a rank-2 tensor), where the diagonal elements are the variances and the off-diagonal elements are the covariances between components, including cross-covariances between components of different vectors. How should I switch my thinking to the terms of multidimensional covariance? What is it?
Or should I build the full covariance matrix by concatenating it from pieces (e.g. the individual covariance matrices; assuming independence of these vectors I would build a block-diagonal matrix) and then cut (split) the sampling results back into the particular vectors I want? (I did that in R.) Or is there an easier way?
What I want is full control over all random vectors including their covariances and cross-covariances.
There is no special consideration about dimensionality just because your random variables are spread across multiple vectors. From a probabilistic point of view, three normally distributed vectors of sizes 2, 3 and 4, a normally distributed vector of size 9, and a normally distributed matrix of size 3x3 are all the same thing: a 9-dimensional normal distribution. Of course, you could instead have three separate distributions of 2, 3 and 4 dimensions, but that is a different model; it does not let you express correlations between variables in different vectors (just as having a one-dimensional normal per number does not let you model any correlation at all). That may or may not be enough for your use case.
If you want to use a single distribution, you just need to establish a bijection between the domain of your problem (tuples of three vectors of sizes 2, 3 and 4) and the domain of the distribution (9-dimensional vectors). In this case it is pretty obvious: flatten (if necessary) and concatenate the vectors to obtain a 9-dimensional vector, and split a sample into three parts of sizes 2, 3 and 4 to recover the vectors.
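A minimal sketch of that concatenate-then-split approach, using tensorflow_probability (the successor of tf.contrib.distributions); the covariance blocks below are placeholders, and the block-diagonal structure encodes an independence assumption, with cross-covariances going into the off-diagonal blocks if you want them:

import numpy as np
import tensorflow as tf
import tensorflow_probability as tfp
from scipy.linalg import block_diag

tfd = tfp.distributions

cov_w = np.eye(2)                # 2x2 covariance of w (placeholder)
cov_v = 0.5 * np.eye(3)          # 3x3 covariance of v (placeholder)
cov_x = 2.0 * np.eye(4)          # 4x4 covariance of x (placeholder)

# Assuming independence between w, v and x: a block-diagonal 9x9 covariance.
sigma = block_diag(cov_w, cov_v, cov_x).astype(np.float32)
mean = np.zeros(9, dtype=np.float32)

mvn = tfd.MultivariateNormalFullCovariance(loc=mean, covariance_matrix=sigma)
sample = mvn.sample(10)                          # shape (10, 9)
w, v, x = tf.split(sample, [2, 3, 4], axis=-1)   # recover the three vectors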

Random projection in Python Pandas using a dataframe containing NaN values

I have a dataframe data containing real values and some NaN values. I'm trying to perform locality-sensitive hashing using random projections to reduce the dimension to 25 components, specifically with the sklearn.random_projection.GaussianRandomProjection class. However, when I run:
tx = random_projection.GaussianRandomProjection(n_components = 25)
data25 = tx.fit_transform(data)
I get Input contains NaN, infinity or a value too large for dtype('float64'). Is there a work-around for this? I tried changing all the NaN values to a value that never appears in my dataset, such as -1. How valid would my output be in that case? I'm not an expert on the theory behind locality-sensitive hashing/random projections, so any insight would be helpful as well. Thanks.
NA / NaN values (not-available / not-a-number) are, I have found, just plain troublesome.
You don't want to just substitute an arbitrary value like -1. If you are inclined to do that, use one of the imputation classes instead. Otherwise you are likely to change the distances between points very substantially, and you want to preserve distances as much as possible if you are using random projection:
The dimensions and distribution of random projections matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.
However, this may or may not result in reasonable values for learning. As far as I know, imputation is an open field of study, which (for instance) this gentleman has specialized in.
If you have enough examples, consider dropping rows or columns that contain NaN values. Another possibility is to train a generative model such as a Restricted Boltzmann Machine and use it to fill in the missing values:
import numpy as np
import sklearn.neural_network
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

rbm = sklearn.neural_network.BernoulliRBM().fit(data_with_no_nans)          # train on complete rows only
mean_imputed_data = SimpleImputer(strategy="mean").fit_transform(all_data)  # fill NaNs with column means
rbm_imputation = rbm.gibbs(mean_imputed_data)                               # one Gibbs step gives RBM reconstructions
nan_mask = np.isnan(all_data)
all_data[nan_mask] = rbm_imputation[nan_mask]                               # overwrite only the originally-missing cells
Finally, you might consider imputing using nearest neighbors. For a given column, train a nearest-neighbors model on all the other variables, using only the complete rows. Then, for a row missing that column, find its k nearest neighbors and use their average value for that column. (This gets very costly, especially if rows can have more than one missing value, since you end up training a model for every combination of missing columns.)
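For reference, recent versions of scikit-learn also ship a k-nearest-neighbors imputer that does essentially this without training a model per column (n_neighbors=5 below is an arbitrary choice):

from sklearn import random_projection
from sklearn.impute import KNNImputer

imputed = KNNImputer(n_neighbors=5).fit_transform(data)   # fill each NaN from the 5 nearest rows
data25 = random_projection.GaussianRandomProjection(n_components=25).fit_transform(imputed)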