Why does 'dimension' mean several different things in the machine-learning world?

I've noticed that the AI community refers to various tensors as 512-d, meaning a 512-dimensional tensor, where the term 'dimension' seems to mean 512 different float values in the representation of a single datapoint. For example, a 512-d word embedding is a length-512 vector of floats used to represent one English word, e.g. https://medium.com/#jonathan_hui/nlp-word-embedding-glove-5e7f523999f6
But that isn't 512 different dimensions; isn't it only a 1-dimensional vector? Why is the term dimension used so differently from its usual meaning?
When we use the terms conv1d or conv2d, which are convolutions over 1 and 2 dimensions, 'dimension' is used in the way that is typical in math and the sciences. But in the word-embedding context, a 1-d vector is said to be a 512-d vector, or am I missing something?
Why is the term dimension overloaded like this? What context determines what dimension means in machine learning?

In the context of word embeddings in neural networks, dimensionality reduction, and many other machine learning areas, it is indeed correct to call the vector (which is typically a 1D array or tensor) n-dimensional, where n is usually greater than 2. This is because we usually work in Euclidean space, where a (data) point in an n-dimensional (Euclidean) space is represented as an n-tuple of real numbers (i.e. real n-space ℝⁿ).
Consider a (data) point in a 3D (Euclidean) space (see ref). To represent any point in this space, say d1, we need a tuple of three real numbers (x1, y1, z1).
Now, your confusion is about why this point d1 is called 3-dimensional rather than a 1-dimensional array. The reason is that it lies (or lives) in this 3D space. The same argument extends to all points in any n-dimensional real space, as is done for embeddings with 300d, 512d, 1024d vectors, etc.
However, in all nD array compute frameworks such as NumPy, PyTorch, TensorFlow, etc., these are still 1D arrays, because the length of such a vector can be represented using a single number.
But what if you have more than 1 data point? Then you have to stack them in some (unique) way, and this is where the need for a second dimension arises. Say you stack 4 of these 512d vectors vertically; you then end up with a 2D array/tensor of shape (4, 512). Note that here we call the array 2D because two integers are required to represent the extent/length along each axis.
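The two meanings can be seen side by side in a minimal NumPy sketch. The zero-filled `embedding` is a hypothetical stand-in for a real 512-d word vector; only the shapes matter here.

```python
import numpy as np

# One hypothetical 512-d embedding: a point in a 512-dimensional space,
# but stored as a 1D array (a single axis of length 512).
embedding = np.zeros(512, dtype=np.float32)
print(embedding.ndim, embedding.shape)

# Stacking four such vectors introduces a second axis.
batch = np.stack([embedding] * 4)
print(batch.ndim, batch.shape)
```

Here `ndim` counts axes (the array sense of "dimension"), while the length of the last axis, 512, is the dimension of the space the vector lives in.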
To understand this better, please refer to my other answer on axis parameter visualization for nD arrays, which includes a visual representation.
ref: Euclidean space wiki

It is not overloading, but standard usage. What are the elements of a 512-dimensional vector space? They are 512-dimensional vectors, each of which can be represented by 512 floating point numbers, as in your question. Each such (nonzero) vector spans a 1-dimensional subspace of the 512-dimensional space.
When you talk of the dimension of a tensor, a tensor is a linear map (roughly speaking, I am omitting the duals) from the product of N vector spaces to the reals. The dimension of a TENSOR is the N.

If you want to be more specific, you need to be clear on the terms dimension, rank, and shape.
The dimensionality of a tensor means the rank, which has a specific definition: the rank is the number of indices. When you see "3-dimensional tensor", you can take that to mean that the tensor has 3 indices, namely T[i][j][k]. So a vector has rank 1, a matrix has rank 2, a cube has rank 3, etc.
When you want to specify the size of each dimension, you should prefer to use the term shape. A 3-dimensional (aka rank 3) tensor can have shape [10, 20, 30] if the 0th dimension has 10 values, the 1st dimension has 20 values, and the 2nd dimension has 30 values. (This shape might represent, say, a batch of 10 images, each of shape 20x30.)
Note, though, that when talking about vectors, it is common to say "512-D vector". As you mentioned, this terminology comes up a lot with word embeddings (e.g. "we used 512-D word embeddings"). Since "vector" by definition means rank 1, people will interpret that statement to mean "a structure of rank 1 with 512 values".
You might encounter someone saying "I have a 5-d vector", in which case you'd need to follow up with "wait, do you mean a 5-d tensor or a 1-d vector with 5 values?".
I am not a mathematician, by the way.

Related

How to customize a Deep Learning model if the output is one-hot vectors?

I am trying to build a Deep Learning model with TensorFlow and Keras. This is a sequential model for tasks of Single-Instance Multi-Label, which is a simplified version of Multi-Instance Multi-Label.
Concretely, the input of my model is an array of fixed length, so it can be represented as a vector like this:
The output of my model is a sequence of letters drawn from an alphabet of fixed size, for example {A,B,C,D} with only 4 possible members. So I can use a one-hot vector to represent each letter in a sequence.
The length of the sequences is variable, but for simplicity I use a fixed length (equal to that of the longest sequence) to store all sequences.
If a sequence is shorter than the fixed length, it is represented by one-hot vectors (as many as the sequence's actual length) followed by zero vectors (for the remaining length). For example, CADB is represented by a 4 * 5 matrix like this:
Please note: the first 4 columns of this matrix are one-hot vectors, each of which has one and only one 1 entry, and all other entries are 0s.
But the entries of the last column are all 0s, which can be seen as a zero padding because the sequence of letters is not long enough.
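The encoding described above can be sketched in NumPy. The row order (A, B, C, D) and the helper name `encode` are assumptions made for illustration; the question does not state them.

```python
import numpy as np

ALPHABET = "ABCD"   # assumed row order: A, B, C, D
MAX_LEN = 5         # fixed (padded) sequence length from the example


def encode(sequence, alphabet=ALPHABET, max_len=MAX_LEN):
    """Return a (len(alphabet), max_len) matrix with one one-hot column per letter."""
    matrix = np.zeros((len(alphabet), max_len), dtype=np.int64)
    for position, letter in enumerate(sequence):
        matrix[alphabet.index(letter), position] = 1
    return matrix  # columns beyond len(sequence) stay all-zero (padding)


target = encode("CADB")
print(target)
```

Under the assumed row order this prints the rows `[0 1 0 0 0]`, `[0 0 0 1 0]`, `[1 0 0 0 0]` and `[0 0 1 0 0]`: four one-hot columns followed by one all-zero padding column.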
In short, the input is a vector and the output is a matrix.
Different from the link posted above, the output matrix should be seen as a whole. So one input vector is assigned to a whole matrix, not to a row or column of this matrix.
My question is : how to customize my deep learning model for this special output, for example:
What loss function and accuracy metric should I choose or design?
Do I need to customize a special layer at the very end of my model?
You should use softmax activation on the output layer and have categorical_crossentropy as the loss function.
However, as you can see in the links above, the problem is that these two functions are applied on the last axis (axis=-1) by default, while in your situation it is the second-to-last axis (the columns of the matrix) that is one-hot encoded.
To use the right axis, one option is to define your own versions of these functions like so:
import tensorflow as tf

def softmax_columns(x):
    # Softmax over the second-to-last axis, i.e. within each one-hot column.
    return tf.keras.backend.softmax(x, axis=-2)

def categorical_crossentropy_columns(target, output):
    return tf.keras.backend.categorical_crossentropy(target, output, axis=-2)
Then, you can use these like so:
model.add(SomeLayer(..., activation=softmax_columns, ...)) # output layer
model.compile(loss=categorical_crossentropy_columns, ...)
One good alternative (in general, not only here) is to pass from_logits=True in the categorical_crossentropy call. This effectively builds the softmax into the loss function, so that your model itself no longer needs (in fact, must not have) the final softmax activation. This not only saves work, but is also more numerically stable.
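To illustrate why folding the softmax into the loss is more stable, here is a NumPy sketch of a cross-entropy computed directly from logits over axis -2. It is a stand-in for illustration, not Keras's implementation; the function names and the test values are assumptions.

```python
import numpy as np


def log_softmax(logits, axis):
    # Subtracting the maximum first keeps exp() from overflowing.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def crossentropy_from_logits(target, logits, axis=-2):
    # Sum over the one-hot axis; all-zero (padding) columns contribute 0.
    return -(target * log_softmax(logits, axis)).sum(axis=axis)


# "CADB" as a 4 x 5 target: rows are A, B, C, D; the last column is padding.
target = np.zeros((4, 5))
target[[2, 0, 3, 1], [0, 1, 2, 3]] = 1.0

uniform_logits = np.zeros((4, 5))    # equally unsure about every letter
confident_logits = 1000.0 * target   # very large logits on the correct letters

loss_uniform = crossentropy_from_logits(target, uniform_logits)
loss_confident = crossentropy_from_logits(target, confident_logits)
print(loss_uniform)
print(loss_confident)
```

With uniform logits each real column costs log(4) and the padding column costs 0. With logits of 1000, a separate softmax followed by a log would have to evaluate exp(1000), which overflows; the shifted form stays finite.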

If a tensor's shape defines how many entities are in each dimension, can a tensor's shape have a different number of entities for each dimension?

I'm following a few tutorials to gain an understanding of tensors in TensorFlow. I understand that rank specifies the number of dimensions a tensor has. Now I'm curious about the term 'shape', and I want to know whether it's possible or common to have one dimension with more entities than the next, or whether the number of elements will always be equal across dimensions.
I hope this makes sense and thank you in advance.
I'm not entirely sure I understand your question correctly, but I'll try to answer anyway, if only to provide some clarity.
A tensor is simply an N-dimensional array. The shape of the tensor is the list of sizes along each dimension, and the rank is the number of dimensions.
So take for example a 3D array of size 10x20x5. Then the shape is (10, 20, 5) and the rank is 3, the total number of elements of such array is 10*20*5=1000.
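The same numbers can be checked with a short NumPy sketch (NumPy is used here as an assumption for brevity; TensorFlow tensors expose the same notions of rank and shape).

```python
import numpy as np

tensor = np.zeros((10, 20, 5))

print(tensor.ndim)    # rank: number of dimensions
print(tensor.shape)   # size along each dimension
print(tensor.size)    # total number of elements
```

So the sizes may differ freely from one dimension to the next; what has to be uniform is the size within a single dimension (every one of the 10 slices is 20x5).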

How to train with inputs of variable size?

This question is rather abstract and not necessarily tied to tensorflow or keras. Say that you want to train a language model, and you want to use inputs of different sizes for your LSTMs. Particularly, I'm following this paper: https://www.researchgate.net/publication/317379370_A_Neural_Language_Model_for_Query_Auto-Completion.
The authors use, among other things, word embeddings and one-hot encoding of characters. Most likely, the dimensions of each of these inputs are different. Now, to feed that into a network, I see a few alternatives but I'm sure I'm missing something and I would like to know how it should be done.
Create a 3D tensor of shape (instances, 2, max(embeddings,characters)). That is, padding the smaller input with 0s.
Create a 3D tensor of shape (instances, embeddings+characters, 1). That is, concatenating inputs.
It looks to me that both alternatives are bad for efficiently training the model. So, what's the best way to approach this? I see the authors use an embedding layer for this purpose, but technically, what does that mean?
EDIT
Here are more details. Let's call these inputs X (character-level input) and E (word-level input). On each character of a sequence (a text), I compute x, e and y, the label.
x: character one-hot encoding. My character index is of size 38, so this is a vector filled with 37 zeros and one 1.
e: precomputed word embedding of dimension 200. If the character is a space, I fetch the word embedding of the previous word in the sequence; otherwise, I assign the vector for an incomplete word (INC, also of size 200). Real example with the sequence "red car": r>INC, e>INC, d>INC, _>embeddings["red"], c>INC, a>INC, r>INC.
y: the label to be predicted, which is the next character, one-hot encoded. This output is of the same dimension as x because it uses the same character index. In the example above, for "r", y is the one-hot encoding of "e".
According to the Keras documentation, padding seems to be the way to go. The embedding layer has a masking parameter (mask_zero) that makes Keras skip these values instead of processing them. In theory, you don't lose much performance: if the library is well built, skipping really does avoid the extra processing.
You just need to take care not to assign the value zero to any other character, not even spaces or unknown words.
An embedding layer is not only for masking (masking is just an option in an embedding layer).
The embedding layer transforms integer values from a word/character dictionary into actual vectors of a certain shape.
Suppose you have this dictionary:
1: hey
2: ,
3: I'm
4: here
5: not
And you form sentences like
[1,2,3,4,0] -> this is "hey, I'm here"
[1,2,3,5,4] -> this is "hey, I'm not here"
[1,2,1,2,1] -> this is "hey, hey, hey"
The embedding layer will transform each of those integers into a vector of a certain size. This does two good things at the same time:
Transforms the words into vectors, because neural networks can only handle vectors or intensities. A list of indices cannot be processed by a neural network directly, since there is no logical relation between indices and words.
Creates a vector that will be a "meaningful" set of features for each word.
After training, they become "meaningful" vectors. Each element starts to represent a certain feature of the word, although that feature is obscure to humans. An embedding may become capable of detecting words that are verbs, nouns, feminine, masculine, etc., with everything encoded in a combination of numeric values (presence/absence/intensity of features).
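Conceptually, the lookup an embedding layer performs is a table indexing. The sketch below uses NumPy with a random, untrained table; the embedding size of 3 and the variable names are assumptions chosen for display, and index 0 is reserved for padding as described above.

```python
import numpy as np

VOCAB_SIZE = 6      # indices 0..5; index 0 is reserved for padding
EMBEDDING_DIM = 3   # hypothetical vector size, kept small for display

rng = np.random.default_rng(0)
# In a real layer this table is a trainable weight; here it is just random.
table = rng.normal(size=(VOCAB_SIZE, EMBEDDING_DIM))

sentences = np.array([
    [1, 2, 3, 4, 0],   # "hey, I'm here" plus one padding index
    [1, 2, 3, 5, 4],   # "hey, I'm not here"
    [1, 2, 1, 2, 1],   # "hey, hey, hey"
])

vectors = table[sentences]   # integer indexing performs the lookup
mask = sentences != 0        # False wherever a position is padding

print(vectors.shape)
print(mask[0])
```

Each integer is replaced by its row of the table, so a (3, 5) batch of indices becomes a (3, 5, 3) batch of vectors, and the mask marks which positions later layers may skip.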
You may also try the approach in this question, which instead of using masking, needs to separate batches by length, so each batch can be trained at a time without needing to pad them: Keras misinterprets training data shape

Understanding multidimensional full covariance of normal multivariate distribution in TensorFlow

Suppose I have, say, 3 identically distributed random vectors: w, v and x generally with different lengths. w is length 2, v is length 3 and x is length 4.
How should I define the full covariance matrix sigma of these vectors for tf.contrib.distributions.MultivariateNormalFullCovariance(mean, sigma)?
I think of the full covariance in this case as a [(2 + 3 + 4) x (2 + 3 + 4)] square matrix (tensor rank 2), where the diagonal elements are variances and the off-diagonal elements are the covariances between each pair of components, including components of different vectors. How can I switch my mind to the terms of multidimensional covariance? What is it?
Or should I build full covariance matrix by concatenating it from pieces (e.g. particular covariances and, for instance, assuming independence of these vectors I should build partitioned block diagonal matrix) and cut (split) results of sampling into particular vectors I want to get? (I did that with R.) Or is there an easier way?
What I want is full control over all random vectors including their covariances and cross-covariances.
There is no special consideration about the dimensionality just because your random variables are distributed across multiple vectors. From a probabilistic point of view, three normally-distributed vectors of sizes 2, 3 and 4, a normally-distributed vector of size 9, and a normally-distributed matrix of size 3x3 are all the same: a 9-dimensional normal distribution. Of course, you could have three distributions of 2, 3 and 4 dimensions, but that's a different thing: it doesn't allow you to model correlations among variables of different vectors (just like having a one-dimensional normal distribution per number does not allow you to model any correlation at all). This may or may not be enough for your use case.
If you want to use a single distribution, you just need to establish a bijection between the domain of your problem (e.g. tuples of three vectors of sizes 2, 3 and 4) and the domain of the distribution (e.g. 9-dimensional vectors). In this case it is pretty obvious: flatten (if necessary) and concatenate the vectors to obtain a distribution sample, and split a sample into three parts of sizes 2, 3 and 4 to obtain the vectors.
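The concatenate-and-split bijection can be sketched with NumPy's sampler in place of the TensorFlow distribution from the question. The covariance matrix below is an arbitrary valid one, built only for illustration.

```python
import numpy as np

sizes = [2, 3, 4]      # lengths of w, v and x
total = sum(sizes)     # 9-dimensional joint distribution

rng = np.random.default_rng(0)
mean = np.zeros(total)

# Any symmetric positive-definite 9 x 9 matrix is a valid full covariance;
# its off-diagonal blocks hold the cross-covariances between the vectors.
a = rng.normal(size=(total, total))
sigma = a @ a.T + np.eye(total)

sample = rng.multivariate_normal(mean, sigma)   # one draw of length 9

# Split positions are the cumulative sizes without the last one: [2, 5].
w, v, x = np.split(sample, np.cumsum(sizes)[:-1])
print(w.shape, v.shape, x.shape)
```

Assuming independence between the vectors would correspond to zeroing the off-diagonal blocks of `sigma`, which gives the block-diagonal matrix mentioned in the question.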

Variable length dimension in tensor

I'm trying to implement the paper "End-to-End memory networks" (http://arxiv.org/abs/1503.08895)
Each training example consists of a number of phrases, a question and then the answer. The number of sentences is variable, as is the number of words in each sentence and the question. Each word is encoded as an integer. So my input would have the form [batch size, # of sentences, # words in sentence].
Now my problem is that the second and third dimensions are unknown for each mini-batch. Can I still somehow represent this input as a single tensor, or do I have to use lists of tensors, so that I have a list of length batch_size, then a sublist whose length is the number of sentences, and then for each sentence a tensor, whose size is also not known in advance, corresponding to the words encoded as integers?
Can I use this second approach, or will tensorflow then not be able to backpropagate? E.g. I have an operation where I have to calculate the following sum: \sum_i tf.scalar_mul(p_i, c_i), where p_i is a scalar and c_i is an embedding vector that was previously calculated. The tensors for the p and c values are then stored in a list, so I would have to sum over the elements of the two lists in a loop. I'm assuming that tensorflow would not be able to incorporate this loop in the computation graph, correct? I'm sceptical since theano has a special scan function that allows one to loop over input, so I'm assuming that a regular loop would cause problems in the computation graph. How does tensorflow handle this?
Moving Yaroslav's comment to an answer:
TensorFlow has tf.scan. Dimensions may also be dynamic as in Theano.
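For the specific sum in the question, once the c_i are stacked into one array, the loop collapses into a single vector-matrix product. The following is a NumPy sketch of that observation, not the tf.scan approach from the answer; the array names and sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_sentences, embedding_dim = 7, 4   # assumed sizes; may differ per example

p = rng.random(num_sentences)                          # scalar weights p_i
c = rng.normal(size=(num_sentences, embedding_dim))    # embedding vectors c_i

# Explicit loop, as described in the question.
looped = np.zeros(embedding_dim)
for p_i, c_i in zip(p, c):
    looped += p_i * c_i

# The same sum written as one product over the sentence axis.
vectorized = p @ c

print(np.allclose(looped, vectorized))
```

Because the product contracts over the sentence axis, it works for any number of sentences without fixing that length in advance.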