Understanding multidimensional full covariance of normal multivariate distribution in TensorFlow

Suppose I have, say, 3 identically distributed random vectors w, v and x, generally of different lengths: w has length 2, v has length 3 and x has length 4.
How should I define the full covariance matrix sigma of these vectors for tf.contrib.distributions.MultivariateNormalFullCovariance(mean, sigma)?
I think of the full covariance in this case as a [(2 + 3 + 4) x (2 + 3 + 4)] square matrix (a rank-2 tensor), where the diagonal elements are variances and the off-diagonal elements are the covariances between every pair of components, including components of different vectors. How can I switch my thinking to the terms of multidimensional covariance? What is it?
Or should I build the full covariance matrix by concatenating it from pieces (e.g. the individual covariances; assuming independence of these vectors, I would build a partitioned block-diagonal matrix) and then split the sampling results into the particular vectors I want? (I did that with R.) Or is there an easier way?
What I want is full control over all random vectors including their covariances and cross-covariances.

There is no special consideration about the dimensionality just because your random variables are distributed across multiple vectors. From a probabilistic point of view, three jointly normally-distributed vectors of sizes 2, 3 and 4, a normally-distributed vector of size 9 and a normally-distributed matrix of size 3x3 are all the same thing: a 9-dimensional normal distribution. Of course, you could instead have three separate distributions of 2, 3 and 4 dimensions, but that is a different thing: it does not allow you to model correlations among variables of different vectors (just like having a one-dimensional normal distribution per number does not allow you to model any correlation at all). This may or may not be enough for your use case.
If you want to use a single distribution, you just need to establish a bijection between the domain of your problem (e.g. tuples of three vectors of sizes 2, 3 and 4) and the domain of the distribution (e.g. 9-dimensional vectors). In this case it is pretty obvious: flatten (if necessary) and concatenate the vectors to obtain a distribution sample, and split a sample into three parts of sizes 2, 3 and 4 to obtain the vectors.
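The concatenate-and-split idea above can be sketched as follows. This is a minimal illustration that uses NumPy as a stand-in for the TensorFlow distribution in the question; the covariance values are made up, and any symmetric positive semi-definite 9x9 matrix would do.

```python
import numpy as np

rng = np.random.default_rng(0)

sizes = [2, 3, 4]                      # lengths of w, v and x
dim = sum(sizes)                       # 9

# Hypothetical full covariance (assumption, not from the question):
# a @ a.T plus a positive diagonal is symmetric positive definite.
a = rng.normal(size=(dim, dim))
sigma = a @ a.T + dim * np.eye(dim)
mean = np.zeros(dim)

# Draw joint 9-dimensional samples, then split each one back into w, v and x.
samples = rng.multivariate_normal(mean, sigma, size=1000)   # shape (1000, 9)
w, v, x = np.split(samples, np.cumsum(sizes)[:-1], axis=1)

# The cross-covariance between w and v is simply a block of sigma.
cov_wv = sigma[0:2, 2:5]               # shape (2, 3)
```

Setting the off-diagonal blocks such as `sigma[0:2, 2:5]` (and their transposes) to zero would give the block-diagonal, independent-vectors case mentioned in the question.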

Related

Does variational autoencoder make distribution based on only latent representation?

If the latent representation of my variational autoencoder (VAE) is r and my dataset is x, does the VAE's latent representation follow a normal distribution based on r or on x?
If r = 10, does that mean it has 10 means and variances (a multivariate Gaussian) and the distribution comes from the whole dataset x?
Or does r = 10 construct one distribution based on r, which every sample tries to follow?
I'm confused about which one is correct.

A VAE constructs a mapping e(x) -> Z (encoder) and d(z) -> X (decoder). This means that every element x of your input space will be mapped through the encoder e(x) into a single r-dimensional Gaussian. It is not a "mixture"; it is just a single Gaussian with a diagonal covariance matrix.

I'll add my 2 cents to @lejlot's answer.
The encoder in your VAE will map each sample to a distribution, which in your case has 10 dimensions. That distribution is used to say "OK, my best estimate of this property of this sample is mu, but I'm not too sure, so consider that it might have variance sigma".
Therefore, you have a distribution for each sample.
However, in order to make sampling easier, we ask the VAE to keep these distributions as close as possible to a known one, the standard normal distribution. That way we know "where the distributions are located"; if you check the latent space of a plain AE, you will see groups that lie far from each other.
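A minimal NumPy sketch of the "one distribution per sample" idea follows. The encoder outputs here are random stand-ins rather than a trained network, and the KL term is the usual closed form for a diagonal Gaussian against the standard normal, which is assumed (not stated in the answers) to be how closeness is enforced.

```python
import numpy as np

rng = np.random.default_rng(0)
r = 10                                   # latent dimensionality
batch = 4                                # number of input samples

# Stand-ins for encoder outputs (assumption): one mean vector and one
# log-variance vector of length r for every input sample.
mu = rng.normal(size=(batch, r))
log_var = rng.normal(size=(batch, r))

# Draw one latent point from each sample's own diagonal Gaussian
# N(mu, diag(sigma^2)) using z = mu + sigma * eps.
eps = rng.standard_normal((batch, r))
z = mu + np.exp(0.5 * log_var) * eps

# KL divergence of each per-sample Gaussian from N(0, I); penalising it
# pulls every sample's distribution towards the standard normal.
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=1)
```

Each row of `mu` and `log_var` describes a separate 10-dimensional Gaussian, so there are `batch` distributions, not one shared distribution and not a mixture.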

Why does 'dimension' mean several different things in the machine-learning world? [closed]

I've noticed that the AI community refers to various tensors as 512-d, meaning a 512-dimensional tensor, where the term 'dimension' seems to mean 512 different float values in the representation of a single datapoint. For example, a 512-d word embedding means a length-512 vector of floats used to represent one English word, e.g. https://medium.com/#jonathan_hui/nlp-word-embedding-glove-5e7f523999f6
But it isn't 512 different dimensions; isn't it only a 1-dimensional vector? Why is the term dimension used in such a different manner than usual?
When we use the terms conv1d or conv2d, which are convolutions over 1 and 2 dimensions, 'dimension' is used in the way it is typically used in math and the sciences. But in the word-embedding context, a 1-d vector is said to be a 512-d vector. Or am I missing something?
Why is the term dimension overloaded like this? What context determines what dimension means in machine learning?

In the context of word embeddings in neural networks, dimensionality reduction, and many other machine learning areas, it is indeed correct to call the vector (which is typically a 1D array or tensor) n-dimensional, where n is usually greater than 2. This is because we usually work in Euclidean space, where a (data) point in an n-dimensional (Euclidean) space is represented as an n-tuple of real numbers (i.e. real n-space ℝⁿ).
As an example (see the ref below), consider a (data) point in a 3D (Euclidean) space. To represent any point in this space, say d1, we need a tuple of three real numbers (x1, y1, z1).
Now, your confusion is about why this point d1 is called 3-dimensional rather than a 1-dimensional array. The reason is that it lies, or lives, in this 3D space. The same argument extends to all points in any n-dimensional real space, as is done for embeddings with 300d, 512d, 1024d vectors, etc.
However, in all nD array compute frameworks such as NumPy, PyTorch, TensorFlow, etc., these are still 1D arrays, because the length of such a vector can be represented using a single number.
But, what if you have more than 1 data point? Then, you have to stack them in some (unique) way. And this is where the need for a second dimension arises. So, let's say you stack 4 of these 512d vectors vertically, then you'd end up with a 2D array/tensor of shape (4, 512). Note that here we call the array as 2D because two integer numbers are required to represent the extent/length along each axis.
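The two senses of "dimension" described above can be seen side by side in a short NumPy sketch (NumPy is used here only as one example of an nD array framework):

```python
import numpy as np

vec = np.zeros(512)                 # one "512-d" embedding: a point in 512-dimensional space
print(vec.ndim, vec.shape)          # 1 (512,)  -> a 1D array of length 512

batch = np.stack([vec] * 4)         # four such embeddings stacked vertically
print(batch.ndim, batch.shape)      # 2 (4, 512) -> a 2D array
```

Here `ndim` counts array axes, while the last entry of `shape` is the dimensionality of the space each embedding lives in.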
To understand this better, please refer to my other answer on axis parameter visualization for nD arrays.
ref: Euclidean space wiki

It is not overloading, but standard usage. What are the elements of a 512-dimensional vector space? They are 512-dimensional vectors. Each of them can be represented by 512 floating-point numbers, as in your example. Each such nonzero vector spans a 1-dimensional subspace of the 512-dimensional space.
When you talk of the dimension of a tensor, a tensor is a multilinear map (roughly speaking; I am omitting the duals) from the product of N vector spaces to the reals. The dimension of a tensor is this N.

If you want to be more specific, you need to be clear on the terms dimension, rank, and shape.
The dimensionality of a tensor means the rank, which has a specific definition: the rank is the number of indices. When you see "3-dimensional tensor", you can take that to mean that the tensor has 3 indices, namely T[i][j][k]. So a vector has rank 1, a matrix has rank 2, a cube has rank 3, etc.
When you want to specify the size of each dimension, you should prefer to use the term shape. A 3-dimensional (aka rank 3) tensor can have shape [10, 20, 30] if the 0th dimension has 10 values, the 1st dimension has 20 values, and the 2nd dimension has 30 values. (This shape might represent, say, a batch of 10 images, each of shape 20x30.)
Note, though, that when talking about vectors, it is common to say "512-D vector". As you mentioned, this terminology comes up a lot with word embeddings (e.g. "we used 512-D word embeddings"). Since "vector" by definition means rank 1, people will interpret that statement to mean "a structure of rank 1 with 512 values".
You might encounter someone saying "I have a 5-d vector", in which case you'd need to follow up with "wait, do you mean a 5-d tensor or a 1-d vector with 5 values?".
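As a small illustration of rank versus shape, here is a NumPy sketch; `ndim` and `shape` are NumPy's names for the two notions, and the array contents are placeholders.

```python
import numpy as np

t = np.zeros((10, 20, 30))   # rank 3: three indices, t[i][j][k]
rank = t.ndim                # number of indices
shape = t.shape              # number of values along each dimension
element = t[9][19][29]       # one index per dimension reaches a single value

emb = np.zeros(512)          # a "512-D vector": rank 1 with 512 values
```

With these definitions, `rank` is 3 and `shape` is `(10, 20, 30)`, while `emb.ndim` is 1 and `emb.shape` is `(512,)`.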
I am not a mathematician, by the way.

Self-Attention Explainability of the Output Score Matrix

I am learning about attention models, and following along with Jay Alammar's amazing blog tutorial on The Illustrated Transformer. He gives a great walkthrough for how the attention scores are calculated, but I get a bit lost at a certain point, and am not seeing how the attention score Z matrix he explains is used to interpret strength of associations between different words within an input sequence.
He mentions that given some input matrix X with shape N x D, where N is the number of elements in an input sequence and D is the input dimensionality, we multiply X by three separate weight matrices of shape D x d, where d is some lower dimensionality that represents the projected space of the query, key, and value matrices.
The query and key matrices are dotted, divided by a scaling factor (usually the square root of the projected dimensionality), and then run through a softmax function. This produces a weight matrix of size N x N, which is multiplied by the value matrix to get an output Z of shape N x d, about which Jay says:
"That concludes the self-attention calculation. The resulting vector is one we can send along to the feed-forward neural network."
[Screenshot of this calculation from his blog]
However, this is where I'm confused. Z is N x d, and I don't understand what I'm supposed to do with this matrix from an interpretability standpoint. As far as I understand, for a particular sequence element (i.e. the word "cats" in the sequence "I love pets, especially cats"), self-attention is supposed to score other parts of the sequence highly when they are relevant to, or strongly associated with, that word embedding. I'd therefore expect Z to be N x N, so that I could select Z[i,j] and say that, for the i-th word in the sequence, the j-th word associates with it this or that much.
In fact, wouldn't it make much more sense to use only the softmax output of the weights (without multiplying them by the value matrix), since it is already N x N? In essence, how is Jay determining the strength of these associations in this particular sequence with the word "it"?
This is an N by 1 relationship he is showing: there are N values that correspond to the strength of association with the word "it".
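The shapes described in the question can be checked with a small NumPy sketch of single-head scaled dot-product attention. The sizes and random weights are hypothetical; the point is only which matrix is N x N and which is N x d.

```python
import numpy as np

def softmax(s, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
N, D, d = 6, 16, 4                       # hypothetical sizes

X = rng.normal(size=(N, D))              # input sequence, N x D
W_q, W_k, W_v = (rng.normal(size=(D, d)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # each N x d
A = softmax(Q @ K.T / np.sqrt(d))        # N x N weights, each row sums to 1
Z = A @ V                                # N x d output passed on to the next layer

i = 2
weights_for_token_i = A[i]               # N values: how much token i attends to each token
```

In this sketch the pairwise, interpretable quantity is `A[i, j]`, the softmax output before it is multiplied by the value matrix, and a single row `A[i]` has exactly the N-by-1 form described above.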

dimension of a tensor created by tf.zeros(n)

I'm confused by the dimension of a tensor created with tf.zeros(n). For instance, if I write tf.zeros(6).eval().shape, this returns (6,). What dimension is this? Is this a matrix of 6 rows and an arbitrary number of columns? Or is it a matrix of 6 columns with an arbitrary number of rows?
weights = tf.random_uniform([3, 6], minval=-1, maxval=1, seed=1) - this is a 3x6 matrix
b = tf.zeros(6).eval() - I'm not sure what dimension this is.
Why am I able to add the two, as in weights + b? If I understand correctly, in order for the two to be added, b would need to have dimension 3x1.

Why am I able to add the two, as in weights + b?
Operator + is the same as using tf.add() (<obj>.__add__() calls the tf.add() or tf.math.add()) and if you read the documentation it says:
NOTE: math.add supports broadcasting. AddN does not. More about broadcasting here
Now I'm quoting from numpy broadcasting rules (which are the same for tensorflow):
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
they are equal, or
one of them is 1
So you're able to add two tensors with different shapes because their trailing dimensions match (both are 6). If you change the shape of your weights tensor to, let's say, [3, 5], you will get an InvalidArgumentError exception because the trailing dimensions differ.
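Since the quoted rules are NumPy's, the behaviour can be reproduced with NumPy alone. This is a sketch with NumPy arrays standing in for the TensorFlow tensors; note that NumPy reports the mismatch as a ValueError rather than a TensorFlow exception.

```python
import numpy as np

weights = np.random.uniform(-1, 1, size=(3, 6))
b = np.zeros(6)                       # shape (6,)

# Trailing dimensions match (6 and 6), so b is added to every row of weights.
out = weights + b                     # result has shape (3, 6)

try:
    np.zeros((3, 5)) + b              # trailing dimensions 5 and 6 differ
    failed = False
except ValueError:
    failed = True                     # NumPy refuses to broadcast these shapes
```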

(6,) is Python syntax for a tuple with 6 as its single element. Hence the shape here is that of a one-dimensional vector of length 6.

Hierarchical clustering with different sample size on Python

I would like to know if it's possible to do hierarchical clustering with different sample sizes in Python. More precisely, with Ward's minimum variance method.
For instance, I have 5 lists of integers, A, B, C, D, E, of different lengths. What I want to do is to group these 5 lists into 3 groups according to Ward's method (the decrease in variance for the cluster being merged).
Does anyone know how to do so?

We can consider these 5 lists to be the samples you want to cluster into 3 groups.
Hierarchical clustering, as you may know, can take distance matrices as input.
Distance matrices hold some sort of pairwise distance (or dissimilarity) between your samples.
You have to construct this 5x5 matrix by choosing a meaningful distance function. This greatly depends on what your samples/integers represent. As your samples do not have constant length, you can't compute metrics like the Euclidean distance.
For example, if the integers in your lists can be interpreted as classes, you could compute the Jaccard index to express some sort of dissimilarity.
[1 2 3 4 5] and [1 3 4] have a Jaccard similarity index of 3/5 (or a dissimilarity of 2/5), with a similarity of 0 meaning entirely different and 1 meaning perfectly identical.
https://en.wikipedia.org/wiki/Jaccard_index
Once your dissimilarity matrix is computed (in fact it contains only 5 choose 2 = 10 different values, as the matrix is symmetric), you can apply hierarchical clustering to it.
The important part is finding a distance function appropriate to your problem.
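The matrix construction described above can be sketched in plain Python. The five lists are made-up examples (only A and B come from the answer), and treating the integers as class labels is the answer's assumption.

```python
from itertools import combinations

def jaccard_dissimilarity(a, b):
    """Return 1 - |A ∩ B| / |A ∪ B|, treating both lists as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:                      # two empty lists: treat as identical
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical samples of different lengths.
samples = {
    "A": [1, 2, 3, 4, 5],
    "B": [1, 3, 4],
    "C": [2, 5, 7, 8],
    "D": [7, 8, 9],
    "E": [1, 2],
}
names = list(samples)
n = len(names)

# Symmetric 5x5 dissimilarity matrix with zeros on the diagonal.
dist = [[0.0] * n for _ in range(n)]
for i, j in combinations(range(n), 2):
    d_ij = jaccard_dissimilarity(samples[names[i]], samples[names[j]])
    dist[i][j] = dist[j][i] = d_ij

# The 10 distinct values (upper triangle, row by row).
condensed = [dist[i][j] for i, j in combinations(range(n), 2)]
```

A condensed list like this is the form that `scipy.cluster.hierarchy.linkage` accepts, after which `fcluster` with `criterion="maxclust"` can cut the tree into 3 groups. SciPy's documentation notes that Ward linkage is only correctly defined for Euclidean distances, so with a Jaccard dissimilarity a method such as average linkage may be the safer choice.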
