How to customize a Deep Learning model if the output is one-hot vectors? [closed] - tensorflow

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am trying to build a Deep Learning model with TensorFlow and Keras. This is a sequential model for tasks of Single-Instance Multi-Label, which is a simplified version of Multi-Instance Multi-Label.
Concretely, the input of my model is an array of fixed length, so it can be represented as a vector like this:
The output of my model is a sequence of letters drawn from an alphabet of fixed size, for example {A,B,C,D} with only 4 possible members. So I can use a one-hot vector to represent each letter in a sequence.
The length of the sequences is variable, but for simplicity I use a fixed length (equal to that of the longest sequence) to store all sequences.
If a sequence is shorter than the fixed length, it is represented by one-hot vectors (as many as the sequence's actual length) followed by zero vectors (for the remaining length). For example, CADB is represented by a 4 * 5 matrix like this:
Please note: the first 4 columns of this matrix are one-hot vectors, each of which has one and only one 1 entry, and all other entries are 0s.
But the entries of the last column are all 0s, which can be seen as a zero padding because the sequence of letters is not long enough.
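The encoding described above can be sketched in NumPy as follows. This is only an illustration: the row order A, B, C, D and the helper name encode are assumptions, not something fixed by the question.

```python
import numpy as np

ALPHABET = "ABCD"  # assumed row order: A, B, C, D
MAX_LEN = 5        # assumed fixed (maximum) sequence length

def encode(sequence, alphabet=ALPHABET, max_len=MAX_LEN):
    """Return a (len(alphabet), max_len) matrix with one one-hot column per letter."""
    matrix = np.zeros((len(alphabet), max_len), dtype=np.float32)
    for position, letter in enumerate(sequence):
        matrix[alphabet.index(letter), position] = 1.0
    # Columns beyond len(sequence) are never written, so they stay all-zero (padding).
    return matrix

target = encode("CADB")
print(target)
```

Each of the first four columns contains exactly one 1, and the fifth column is the zero padding.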
So, in short, the input is a vector and the output is a matrix.
Different from the link posted above, the output matrix should be seen as a whole. So one input vector is assigned to a whole matrix, not to a row or column of this matrix.
My question is: how should I customize my deep learning model for this special output? For example:
What loss function and accuracy metric should I choose or design?
Do I need to customize a special layer at the very end of my model?

You should use softmax activation on the output layer and have categorical_crossentropy as the loss function.
However, as you can see in the links above, the problem is that these two functions are by default applied along the last axis (axis=-1), while in your situation it is the second-to-last axis (the columns of the matrix) that is one-hot encoded.
To use the right axis, one option is to define your own versions of these functions like so:
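import tensorflow as tf
def softmax_columns(x):
    # Softmax along the second-to-last axis, i.e. within each column.
    return tf.keras.backend.softmax(x, axis=-2)

def categorical_crossentropy_columns(target, output):
    # Cross-entropy along the same axis, so each column is one distribution.
    return tf.keras.backend.categorical_crossentropy(target, output, axis=-2)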
Then, you can use these like so:
model.add(SomeLayer(..., activation=softmax_columns, ...)) # output layer
model.compile(loss=categorical_crossentropy_columns, ...)
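To see what the choice of axis changes, here is a small NumPy sketch. It is not the Keras implementation; softmax_along is a hypothetical helper and the shapes are assumed from the question (alphabet of 4, sequence length 5).

```python
import numpy as np

def softmax_along(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 4, 5))  # (batch, alphabet size, sequence length)

probs_columns = softmax_along(logits, axis=-2)  # normalizes each column (one letter position)
probs_rows = softmax_along(logits, axis=-1)     # the default axis: normalizes each row instead

column_sums = probs_columns.sum(axis=-2)  # shape (2, 5): one sum per column
row_sums = probs_rows.sum(axis=-1)        # shape (2, 4): one sum per row
```

With axis=-2 every column sums to 1, which is what a one-hot column target expects; with the default axis it is the rows that sum to 1.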
One good alternative (in general, not only here) is to pass from_logits=True in the categorical_crossentropy call. This effectively builds the softmax into the loss function, so that your model itself no longer needs (in fact, must not have) the final softmax activation. This not only saves work, but is also more numerically stable.
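The stability argument can be illustrated with a NumPy sketch of a cross-entropy computed directly from logits along axis=-2. This is an assumption-laden illustration, not Keras code: crossentropy_from_logits_columns is a hypothetical name, and in Keras the corresponding call would pass from_logits=True together with axis=-2.

```python
import numpy as np

def crossentropy_from_logits_columns(target, logits):
    """Per-column cross-entropy computed from raw logits along axis -2."""
    # Log-softmax via the log-sum-exp trick; no explicit softmax is ever formed.
    shifted = logits - logits.max(axis=-2, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-2, keepdims=True))
    return -(target * log_probs).sum(axis=-2)  # shape: (..., sequence length)

# Target for "CADB" with rows A, B, C, D (assumed order) and one padded column.
target = np.zeros((4, 5))
target[[2, 0, 3, 1], [0, 1, 2, 3]] = 1.0

# Very large logits would overflow a naive exp(), but are fine here.
confident_logits = np.zeros((4, 5))
confident_logits[[2, 0, 3, 1], [0, 1, 2, 3]] = 1000.0
confident_loss = crossentropy_from_logits_columns(target, confident_logits)

# All-equal logits give a uniform prediction over the 4 letters.
uniform_loss = crossentropy_from_logits_columns(target, np.zeros((4, 5)))
```

Note that the padded column has an all-zero target, so its term in this loss is zero regardless of what the model predicts there.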

Related

Why does 'dimension' mean several different things in the machine-learning world? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
I've noticed that the AI community refers to various tensors as 512-d, meaning a 512-dimensional tensor, where the term 'dimension' seems to mean 512 different float values in the representation of a single data point. For example, "512-d word embeddings" means a length-512 vector of floats used to represent one English word, e.g. https://medium.com/#jonathan_hui/nlp-word-embedding-glove-5e7f523999f6
But that isn't 512 different dimensions; isn't it only a 1-dimensional vector? Why is the term dimension used so differently from usual?
When we use the terms conv1d or conv2d, which are convolutions over 1 dimension and 2 dimensions, dimension is used in the way that is typical in math and the sciences. But in the word-embedding context, a 1-d vector is said to be a 512-d vector. Or am I missing something?
Why is the term dimension overloaded like this? What context determines what dimension means in machine learning?
In the context of word embeddings in neural networks, dimensionality reduction, and many other machine learning areas, it is indeed correct to call the vector (which is typically a 1D array or tensor) n-dimensional, where n is usually greater than 2. This is because we usually work in Euclidean space, where a (data) point in an n-dimensional (Euclidean) space is represented as an n-tuple of real numbers (i.e. real n-space ℝ^n).
Below is an example [ref] of a (data) point in a 3D (Euclidean) space. To represent any point in this space, say d1, we need a tuple of three real numbers (x1, y1, z1).
Now, your confusion is about why this point d1 is called 3-dimensional rather than a 1-dimensional array. The reason is that it lies, or lives, in this 3D space. The same argument extends to all points in any n-dimensional real space, as is done for embeddings with 300d, 512d, 1024d vectors, etc.
However, in all nD array compute frameworks such as NumPy, PyTorch, TensorFlow etc, these are still 1D arrays because the length of the above said vectors can be represented using a single number.
But what if you have more than one data point? Then you have to stack them in some (unique) way, and this is where the need for a second dimension arises. Say you stack 4 of these 512d vectors vertically: you end up with a 2D array/tensor of shape (4, 512). Note that here we call the array 2D because two integers are required to represent the extent/length along each axis.
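A quick NumPy check of the two meanings, as a minimal sketch (the variable names are only illustrative):

```python
import numpy as np

embedding = np.zeros(512)               # one data point living in 512-dimensional space
print(embedding.ndim, embedding.shape)  # 1 (512,)

batch = np.stack([embedding] * 4)       # stack four such vectors vertically
print(batch.ndim, batch.shape)          # 2 (4, 512)
```

The "512-d" in "512-d embedding" is the size of the last axis, while ndim counts how many axes the array has.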
To understand this better, please refer to my other answer on axis parameter visualization for nD arrays, whose visual representation I include below.
ref: Euclidean space wiki
It is not overloading, but standard usage. What are the elements of a 512-dimensional vector space? They are 512-dimensional vectors, each of which can be represented by 512 floating point numbers, as in your question. Each such (nonzero) vector spans a 1-dimensional subspace of the 512-dimensional space.
When you talk of the dimension of a tensor, a tensor is a linear map (roughly speaking, I am omitting the duals) from the product of N vector spaces to the reals. The dimension of a TENSOR is the N.
If you want to be more specific, you need to be clear on the terms dimension, rank, and shape.
The dimensionality of a tensor means the rank, which has a specific definition: the rank is the number of indices. When you see "3-dimensional tensor", you can take that to mean that the tensor has 3 indices, namely T[i][j][k]. So a vector has rank 1, a matrix has rank 2, a cube has rank 3, etc.
When you want to specify the size of each dimension, you should prefer to use the term shape. A 3-dimensional (aka rank 3) tensor can have shape [10, 20, 30] if the 0th dimension has 10 values, the 1st dimension has 20 values, and the 2nd dimension has 30 values. (This shape might represent, say, a batch of 10 images, each of shape 20x30.)
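In NumPy terms, and only as a sketch with an arbitrary example array, rank corresponds to ndim and shape to shape:

```python
import numpy as np

T = np.arange(10 * 20 * 30).reshape(10, 20, 30)

rank = T.ndim         # number of indices needed to address one element
shape = T.shape       # size along each of those indices
element = T[1][2][3]  # three indices, same element as T[1, 2, 3]

vector = np.zeros(512)  # a "512-D vector": rank 1, shape (512,)
```

Here rank is 3 and shape is (10, 20, 30), whereas the "512-D vector" has rank 1 and a single axis of length 512.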
Note, though, that when talking about vectors, it is common to say "512-D vector". As you mentioned, this terminology comes up a lot with word embeddings (e.g. "we used 512-D word embeddings"). Since "vector" by definition means rank 1, then people will interpret that statement to mean "a structure of rank 1 with 512 values".
You might encounter someone saying "I have a 5-d vector", in which case you'd need to follow up with "wait, do you mean a 5-d tensor or a 1-d vector with 5 values?".
I am not a mathematician, by the way.

Dynamic RNN: padding word vector

I am a bit confused about padding. My first question is:
Is it possible to pad a shorter sequence with values that are not 0? How do you then deal with that in the RNN?
Generally 0 is used for padding. Is there a specific reason why? Does it make training easier because it does not affect the calculation, or do you still need to mask the loss function?
If your sentence is composed of embedding vectors from a word2vec model, would the padding be applied as a zero vector?
Thanks in advance for any hint!
Your question is addressed in How to overcome training example's different lengths when working with Word Embeddings (word2vec).
For details on the alternating min/max padding method, see Apply word embeddings to entire document, to get a feature vector.
See also: keras.preprocessing.sequence.pad_sequences, which can take a value to pad with as an argument.
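As a rough illustration of padding with an arbitrary value, here is a NumPy sketch. pad_sequences_post is a hypothetical helper written for this example (it pads on the right, which is an assumption), not the Keras function. It also returns a boolean mask, since a mask or the true lengths are what let later steps ignore the padded positions whatever value was used.

```python
import numpy as np

def pad_sequences_post(sequences, max_len, value=0):
    """Hypothetical helper: right-pad (or truncate) integer sequences to max_len."""
    padded = np.full((len(sequences), max_len), value, dtype=np.int64)
    lengths = np.zeros(len(sequences), dtype=np.int64)
    for row, seq in enumerate(sequences):
        kept = seq[:max_len]
        padded[row, :len(kept)] = kept
        lengths[row] = len(kept)
    # mask[i, t] is True for real tokens and False for padding.
    mask = np.arange(max_len)[None, :] < lengths[:, None]
    return padded, mask

padded, mask = pad_sequences_post([[5, 3, 8], [7]], max_len=4, value=-1)
```

Here the padding value is -1 rather than 0, and the mask alone identifies which positions are real.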

Tensorflow: Using one tensor to index slices of another [duplicate]

This question already has answers here:
Get the last output of a dynamic_rnn in TensorFlow
(4 answers)
Closed 4 years ago.
As motivation for this question, I'm trying to use variable length sequences with tf.nn.dynamic_rnn. When I was training with batch_size=1 (one element at a time), everything was going swimmingly, but now I'm trying to increase the batch size, which means zero-padding sequences to the same length.
I've zero-padded (or truncated) all of my sequences up to the max length of 15000.
outputs (from the RNN) has shape [batch_size, max_seq_length, num_units], which for concreteness is right now [16, 15000, 64].
I also create a seq_lengths tensor, which is [batch_size], so [16], corresponding to the actual sequence length of all the zero-padded sequences.
I've added a fully connected layer, to multiply what was previously outputs[:,-1,:] by W, then add a bias term, since ultimately I'm just trying to predict a single value (or rather batch_size values). However, now, I can't just naively use -1 as the index, because all of the sequences have been variously padded! I have seq_lengths, but I'm not sure exactly how to use it to index outputs. I've searched around, and I think the answer is some clever use of tf.gather_nd, but I can't quite figure it out. I can easily see how to take individual values, but I want to preserve entire slices. Do I need to create some sort of enormous 3D mask?
Here's what I want in terms of a Python comprehension (outputs is an np.array): outputs = np.array([outputs[i, seq_lengths[i] - 1, :] for i in range(batch_size)]).
I'd appreciate any help! Thank you.
Actually, Alex it turns out you've already answered my question for me :).
After some more research, I came across the following, which is exactly my use case: https://stackoverflow.com/a/43298689/5526865 . I won't copy the code here, but just check that out.
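For reference, the same selection can be sketched in NumPy with integer-array indexing. This is not the code from the linked answer; the small shapes are stand-ins for [16, 15000, 64], and it assumes seq_lengths holds the true lengths, so the last valid step of sequence i sits at index seq_lengths[i] - 1.

```python
import numpy as np

batch_size, max_seq_length, num_units = 3, 6, 2  # small stand-ins for [16, 15000, 64]
rng = np.random.default_rng(0)
outputs = rng.normal(size=(batch_size, max_seq_length, num_units))
seq_lengths = np.array([6, 2, 4])  # assumed true (unpadded) lengths

# Pair each batch index with the index of its last valid time step.
last_outputs = outputs[np.arange(batch_size), seq_lengths - 1]  # (batch_size, num_units)

# Equivalent explicit comprehension, for comparison.
reference = np.array([outputs[i, seq_lengths[i] - 1, :] for i in range(batch_size)])
```

In TensorFlow, the analogous approach would be tf.gather_nd with [batch index, time index] pairs as indices, which keeps the whole num_units slice for each pair without building a 3D mask.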

How to train with inputs of variable size?

This question is rather abstract and not necessarily tied to tensorflow or keras. Say that you want to train a language model, and you want to use inputs of different sizes for your LSTMs. Particularly, I'm following this paper: https://www.researchgate.net/publication/317379370_A_Neural_Language_Model_for_Query_Auto-Completion.
The authors use, among other things, word embeddings and one-hot encoding of characters. Most likely, the dimensions of each of these inputs are different. Now, to feed that into a network, I see a few alternatives but I'm sure I'm missing something and I would like to know how it should be done.
Create a 3D tensor of shape (instances, 2, max(embeddings,characters)). That is, padding the smaller input with 0s.
Create a 3D tensor of shape (instances, embeddings+characters, 1). That is, concatenating inputs.
It looks to me that both alternatives are bad for efficiently training the model. So, what's the best way to approach this? I see the authors use an embedding layer for this purpose, but technically, what does that mean?
EDIT
Here are more details. Let's call these inputs X (character-level input) and E (word-level input). On each character of a sequence (a text), I compute x, e and y, the label.
x: character one-hot encoding. My character index is of size 38, so this is a vector filled with 37 zeros and one 1.
e: precomputed word embedding of dimension 200. If the character is a space, I fetch the word embedding of the previous word in the sequence; otherwise, I assign the vector for incomplete word (INC, also of size 200). Real example with the sequence "red car": r>INC, e>INC, d>INC, _>embeddings["red"], c>INC, a>INC, r>INC.
y: the label to be predicted, which is the next character, one-hot encoded. This output is of the same dimension as x because it uses the same character index. In the example above, for "r", y is the one-hot encoding of "e".
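The construction of x, e and y described above could look roughly like this in NumPy. The concrete 38-symbol character set, the random embedding values and the all-zero INC vector are placeholders invented for the example; only the sizes (38 and 200) come from the description. The final character is skipped here because it has no next character to serve as a label.

```python
import numpy as np

# Assumed 38-symbol character index: 26 letters, 10 digits, space, one spare symbol.
CHARS = "abcdefghijklmnopqrstuvwxyz0123456789 #"
CHAR_TO_ID = {ch: i for i, ch in enumerate(CHARS)}
EMB_DIM = 200

rng = np.random.default_rng(0)
# Hypothetical stand-ins for the precomputed word embeddings and the INC vector.
embeddings = {"red": rng.normal(size=EMB_DIM), "car": rng.normal(size=EMB_DIM)}
INC = np.zeros(EMB_DIM)

def one_hot(ch):
    vec = np.zeros(len(CHARS))
    vec[CHAR_TO_ID[ch]] = 1.0
    return vec

def build_examples(text):
    """Return (X, E, Y) with one row per character that has a next character."""
    xs, es, ys = [], [], []
    word = ""
    for ch, next_ch in zip(text, text[1:]):
        if ch == " ":
            es.append(embeddings[word])  # embedding of the word just completed
            word = ""
        else:
            es.append(INC)               # word still incomplete
            word += ch
        xs.append(one_hot(ch))
        ys.append(one_hot(next_ch))      # label: the next character
    return np.array(xs), np.array(es), np.array(ys)

X, E, Y = build_examples("red car")
```

For "red car" this yields six time steps, with X and Y of width 38 and E of width 200.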
According to the Keras documentation, padding seems to be the way to go. The Embedding layer has a masking parameter (mask_zero) that makes Keras skip these values instead of processing them. In theory, you don't lose much performance: if the library is well built, skipping really does avoid the extra processing.
You just need to take care not to attribute the value zero to any other character, not even spaces or unknown words.
An embedding layer is not only for masking (masking is just an option in an embedding layer).
The embedding layer transforms integer values from a word/character dictionary into actual vectors of a certain shape.
Suppose you have this dictionary:
1: hey
2: ,
3: I'm
4: here
5: not
And you form sentences like
[1,2,3,4,0] -> this is "hey, I'm here"
[1,2,3,5,4] -> this is "hey, I'm not here"
[1,2,1,2,1] -> this is "hey, hey, hey"
The embedding layer will transform each of those integers into vectors of a certain size. This does two good things at the same time:
Transforms the words into vectors, because neural networks can only handle vectors or intensities. A list of indices cannot be processed by a neural network directly, since there is no logical relation between indices and words.
Creates a vector that will be a "meaningful" set of features for each word.
And after training, they become "meaningful" vectors. Each element starts to represent a certain feature of the word, although that feature is obscure to humans. An embedding may be capable of detecting words that are verbs, nouns, feminine, masculine, etc., with everything encoded in a combination of numeric values (presence/absence/intensity of features).
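Conceptually, the lookup an embedding layer performs can be sketched in NumPy as indexing into a matrix with one row per dictionary index. This is only an illustration of the idea, not Keras internals: the embedding size of 3 and the random (untrained) values are assumptions, and index 0 is treated as padding as described above.

```python
import numpy as np

vocab = {1: "hey", 2: ",", 3: "I'm", 4: "here", 5: "not"}  # 0 is reserved for padding
embedding_dim = 3  # assumed small size for readability
rng = np.random.default_rng(0)
# One row per index, including row 0 for the padding index.
embedding_matrix = rng.normal(size=(len(vocab) + 1, embedding_dim))

sentences = np.array([
    [1, 2, 3, 4, 0],  # "hey, I'm here" plus one padding step
    [1, 2, 3, 5, 4],  # "hey, I'm not here"
    [1, 2, 1, 2, 1],  # "hey, hey, hey"
])

vectors = embedding_matrix[sentences]  # shape (3, 5, 3): one vector per token
mask = sentences != 0                  # False where the token is padding
```

In a real model the matrix entries are trained, and the mask is what lets later layers skip the padded positions.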
You may also try the approach in this question, which instead of using masking, needs to separate batches by length, so each batch can be trained at a time without needing to pad them: Keras misinterprets training data shape
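The idea of separating batches by length can be sketched in plain Python. batches_by_length is a hypothetical helper for illustration, not code from the linked question.

```python
from collections import defaultdict

def batches_by_length(sequences, batch_size):
    """Group sequences of equal length, then yield batches that need no padding."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    for length in sorted(buckets):
        group = buckets[length]
        for start in range(0, len(group), batch_size):
            yield group[start:start + batch_size]

data = [[1, 2], [3, 4, 5], [6, 7], [8, 9, 1], [2, 3]]
batches = list(batches_by_length(data, batch_size=2))
```

Every batch then contains sequences of a single length, so each one can be turned into a rectangular array without padding.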

Variable length dimension in tensor

I'm trying to implement the paper "End-to-End memory networks" (http://arxiv.org/abs/1503.08895)
Each training example consists of a number of phrases, a question and then the answer. The number of sentences is variable, as is the number of words in each sentence and the question. Each word is encoded as an integer. So my input would have the form [batch size, # of sentences, # words in sentence].
Now my problem is that the second and third dimensions are unknown for each mini-batch. Can I still somehow represent this input as a single tensor, or do I have to use lists of tensors, so that I have a list of length batch_size, then a sublist of length number of sentences, and then for each sentence a tensor whose size is also not known in advance, corresponding to the words encoded as integers?
Can I use this second approach, or will TensorFlow then be unable to backpropagate? For example, I have an operation where I have to calculate the following sum: \sum_i tf.scalar_mul(p_i, c_i), where p_i is a scalar and c_i is an embedding vector that was previously calculated. The tensors for the p and c values are stored in lists, so I would have to sum over the elements of the two lists in a loop. I'm assuming that TensorFlow would not be able to incorporate this loop into the computation graph, correct? I'm sceptical since Theano has a special scan function that allows one to loop over input, so I'm assuming that a regular loop would cause problems in the computation graph. How does TensorFlow handle this?
Moving Yaroslav's comment to an answer:
TensorFlow has tf.scan. Dimensions may also be dynamic as in Theano.
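As a side note on the sum \sum_i p_i c_i from the question: if the p values and c vectors are stacked into arrays, the loop can be written as a single reduction. Below is a NumPy sketch under assumed sizes; a TensorFlow version would presumably use tf.reduce_sum over tf.expand_dims(p, -1) * c in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
num_sentences, embedding_dim = 4, 3  # assumed sizes
p = rng.random(num_sentences)                        # scalars p_i
c = rng.normal(size=(num_sentences, embedding_dim))  # embedding vectors c_i

# Explicit loop, as described in the question.
looped = np.zeros(embedding_dim)
for p_i, c_i in zip(p, c):
    looped += p_i * c_i

# The same sum as one broadcasted multiply and a reduction over i.
vectorized = (p[:, None] * c).sum(axis=0)
```

Both give the same vector of length embedding_dim; the vectorized form is also equal to the matrix product p @ c.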