Keras: why must an embedding layer be used only as the first layer? - tensorflow

In the Keras documentation it states that the Embedding layer "can only be used as the first layer in a model." This makes no sense to me: I might want to reshape or flatten an input before passing it to the embedding layer, but this is not allowed. Why must the embedding layer be used only as the first layer?

"can only be used as the first layer in a model." This makes no sense
to me
Generally, an embedding layer maps discrete values to continuous values. In the subsequent layers we already have a continuous vector representation, so there is no need to convert the vectors again.
I might want to do a reshape/flatten on input before passing it to
the embedding layer
Of course, you can reshape or flatten an input, but in most cases it is meaningless. For example, assume we have sentences with a length of 30 and want to flatten them before passing them to the embedding:
from tensorflow.keras.layers import Input, Flatten, Embedding
input_layer = Input(shape=(30,))
flatten = Flatten()(input_layer)
embedd = Embedding(1000, 100)(flatten)
In the above example, the Flatten layer has no effect at all: before and after flattening, the tensor shape is [batch, 30].
Let's look at another example. Assume our input vectors are 2D, with the shape [batch, 30, 2]. After flattening the input, the vectors have the shape [batch, 60]. We can feed them into the Embedding layer, but in most scenarios this has no meaning. In fact, we destroy the logical relationship between features.
input_layer = Input(shape=(30, 2))
flatten = Flatten()(input_layer)
embedd = Embedding(1000, 100)(flatten)
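To make the shapes concrete, here is a minimal NumPy sketch that treats the embedding as a plain table lookup. This is an assumption made to keep the example independent of any Keras version; `table`, `ids`, and the batch size of 4 are hypothetical names and values.

```python
import numpy as np

rng = np.random.default_rng(0)
table = rng.normal(size=(1000, 100))            # stand-in for Embedding(1000, 100) weights

ids = rng.integers(0, 1000, size=(4, 30, 2))    # hypothetical input of shape [batch, 30, 2]

direct = table[ids]                             # lookup without flattening
flat = table[ids.reshape(4, -1)]                # lookup after flattening to [batch, 60]

print(direct.shape)  # (4, 30, 2, 100)
print(flat.shape)    # (4, 60, 100)
```

Both lookups return the same vectors, but after flattening the output shape no longer shows which two features belonged to the same position.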


What is the network structure inside a Tensorflow Embedding Layer?

The TensorFlow Embedding layer is easy to use, and there are a great many articles about how to use it. However, I want to know how the Embedding layer itself is implemented in TensorFlow or PyTorch.
Is it a word2vec?
Is it a Cbow?
Is it a special Dense Layer?
Structure wise, both Dense layer and Embedding layer are hidden layers with neurons in it. The difference is in the way they operate on the given inputs and weight matrix.
A Dense layer multiplies its inputs by the weight matrix, adds biases, and applies an activation function. An Embedding layer, by contrast, uses the weight matrix as a look-up dictionary.
The Embedding layer is best understood as a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, it looks up these integers in an internal dictionary, and it returns the associated vectors. It’s effectively a dictionary lookup.
from keras.layers import Embedding
embedding_layer = Embedding(1000, 64)
Here 1000 is the number of words in the dictionary and 64 is the dimensionality of each word vector. Intuitively, the embedding layer, just like any other layer, will try to find a vector of 64 real numbers [n1, n2, ..., n64] for each word. This vector will represent the semantic meaning of that particular word. It is learned during training using backpropagation, just like the weights of any other layer.
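As a rough illustration of the dictionary view, the following NumPy sketch stands in for the layer's weight matrix. It is not the Keras implementation; the names `weights` and `token_ids` and the initial value range are assumptions chosen for the example.

```python
import numpy as np

vocab_size, embed_dim = 1000, 64
rng = np.random.default_rng(42)
# Randomly initialised table: one 64-dimensional row per word index.
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

token_ids = np.array([[4, 20, 4],
                      [7, 0, 999]])      # two sequences of three word indices
vectors = weights[token_ids]             # integer indices in, dense vectors out

print(vectors.shape)  # (2, 3, 64)
```

The same index always returns the same row, so both occurrences of index 4 map to an identical vector.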
When you instantiate an Embedding layer, its weights (its internal dictionary of token vectors) are initially random, just as with any other layer. During training, these word vectors are gradually adjusted via backpropagation, structuring the space into something the downstream model can exploit. Once fully trained, the embedding space will show a lot of structure—a kind of structure specialized for the specific problem for which you’re training your model.
-- Deep Learning with Python by F. Chollet
Edit - How "Backpropagation" is used to train the look-up matrix of the Embedding Layer ?
An Embedding layer is similar to a linear layer without a bias or an activation function. In theory, it performs a matrix multiplication between a one-hot encoding of each input index and the weight matrix, and it adds no non-linearity. Backpropagation through the Embedding layer is therefore the same as through any linear layer. In practice, no matrix multiplication is carried out: the inputs are integer indices, and multiplying a one-hot vector by the weight matrix just selects one row, so it is implemented as a look-up.
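The equivalence can be checked with a small NumPy sketch. The sizes and the all-ones upstream gradient are assumptions for illustration only; this is not how a framework stores or applies the gradient internally.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))            # 5 "words", 3-dimensional vectors
ids = np.array([2, 0, 2])              # index 2 appears twice

one_hot = np.eye(5)[ids]               # shape (3, 5)
via_matmul = one_hot @ W               # linear layer, no bias, no activation
via_lookup = W[ids]                    # row selection

# Backward pass of the linear form: dL/dW = one_hot^T @ dL/d(output)
upstream = np.ones_like(via_matmul)    # pretend gradient from the next layer
grad_W = one_hot.T @ upstream          # shape (5, 3)

print(np.allclose(via_matmul, via_lookup))  # True
```

Only the rows that were looked up (0 and 2) receive a non-zero gradient, and row 2 accumulates the contribution of both of its occurrences.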

Clarification on Tensorflow 2.0 Masking

From the Tensorflow documentation when using Keras subclassing API, they give this example on how to pass a mask along to other layers that implement masking. I am wondering if this is explicitly required or if it is handled correctly after the Embedding layer has mask_zero=True.
import numpy as np
from tensorflow.keras import layers

class MyLayer(layers.Layer):
    def __init__(self, **kwargs):
        super(MyLayer, self).__init__(**kwargs)
        self.embedding = layers.Embedding(input_dim=5000, output_dim=16, mask_zero=True)
        self.lstm = layers.LSTM(32)

    def call(self, inputs):
        x = self.embedding(inputs)
        # Note that you could also prepare a `mask` tensor manually.
        # It only needs to be a boolean tensor
        # with the right shape, i.e. (batch_size, timesteps).
        mask = self.embedding.compute_mask(inputs)
        output = self.lstm(x, mask=mask)  # The layer will ignore the masked values
        return output

layer = MyLayer()
x = np.random.random((32, 10)) * 100
x = x.astype('int32')
My confusion comes from another area of the documentation which states:
This layer supports masking for input data with a variable number of
timesteps. To introduce masks to your data, use an Embedding layer
with the mask_zero parameter set to True.
Which seems to mean that if mask_zero=True no further commands need to be done on subsequent layers.
If you read about the Masking layer, it will also support that once you used the mask at the beginning, all the rest of the layers get the mask automatically.
For each timestep in the input tensor (dimension #1 in the tensor), if all values in the input tensor at that timestep are equal to mask_value, then the timestep will be masked (skipped) in all downstream layers (as long as they support masking).
If any downstream layer does not support masking yet receives such an input mask, an exception will be raised.
This other link also states the same. The mask will be propagated to all layers.
When using the Functional API or the Sequential API, a mask generated by an Embedding or Masking layer will be propagated through the network for any layer that is capable of using them (for example, RNN layers). Keras will automatically fetch the mask corresponding to an input and pass it to any layer that knows how to use it.
The second link is really full of details on masking.
Notice that the code you showed is for a custom embedding. It teaches you how to "create and pass" a mask, in case you want to write a layer that creates a mask. It's basically showing what the normal Embedding layer does.
So, we can conclude that if you're using a normal Embedding layer, all you need is mask_zero=True and everything will go down the stream.
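For intuition about what travels down the stream, here is a NumPy sketch of the kind of mask that `mask_zero=True` describes (a boolean array that is False wherever the input index is 0), together with a hypothetical mask-aware consumer. The `features` array and the masked mean are illustrative assumptions, not Keras internals.

```python
import numpy as np

# Two padded sequences; 0 is the padding index.
ids = np.array([[5, 8, 3, 0, 0],
                [7, 0, 0, 0, 0]])

mask = ids != 0                        # boolean, shape (batch_size, timesteps)

# Hypothetical per-step features: every feature at step t has the value t + 1.
features = np.ones((2, 5, 4)) * np.arange(1, 6).reshape(1, 5, 1)

# Mask-aware average over time: padded steps contribute nothing.
masked_sum = (features * mask[..., None]).sum(axis=1)
masked_mean = masked_sum / mask.sum(axis=1, keepdims=True)

print(masked_mean[:, 0])  # [2. 1.]
```

The first sequence averages steps with values 1, 2, 3 and the second only its single real step, which is the effect a mask-consuming layer is meant to achieve.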
In addition to the high-level answer given, let's have a look at some important technical details.
If in doubt, inspect the masking source code to understand how it works.
Masking adds a _keras_mask attribute to the tensor, which flags entries to be skipped, effectively letting other API methods know about it.
You can check whether a layer supports masking via its supports_masking attribute. Example: tf.keras.layers.GlobalMaxPool1D().supports_masking
The masking logic is: skip a timestep if all features are equal to the masked value (the TF source code uses not_equal and any to flag what remains).
import numpy as np
import tensorflow as tf
arr = np.arange(6).reshape((1, 6, 1))
arr_masked = tf.keras.layers.Masking(mask_value=5)(arr)
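The not_equal/any rule described above can be reproduced in plain NumPy. This is a sketch of the logic only, not the TensorFlow source; `keep`, `two_features`, and `keep_two` are names made up for the example.

```python
import numpy as np

arr = np.arange(6).reshape((1, 6, 1))       # same values as the array above
mask_value = 5

# Keep a timestep if ANY of its features differs from mask_value.
keep = np.any(arr != mask_value, axis=-1)   # shape (1, 6)
print(keep)  # [[ True  True  True  True  True False]]

# With two features per step, one differing feature is enough to keep the step.
two_features = np.array([[[5, 5], [5, 1]]])
keep_two = np.any(two_features != mask_value, axis=-1)
print(keep_two)  # [[False  True]]
```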
I think you have to pass the mask from layer to layer in a subclassing layer.
From the TensorFlow documentation:
Note that in the call method of a subclassed model or layer, masks aren't automatically propagated, so you will need to manually pass a mask argument to any layer that needs one.

Keras Conv3D Layer with Discrete Values

I'm trying to build a model that will learn features of a 3D space. Unlike image processing, the values of the 3D matrix are not continuous; they represent some discrete value of what "material" can be found at that specific coordinate (grass with value 1 or stairs with value 2 for example).
Is it possible to train a model to learn the features of the space without interpolating in-between values? For example, I don't want the neural net to deduce 1.5 to be some kind of grass stairs.
You'll want to use one-hot encoding, which represents categorical values as arrays of zeroes with a single value set to one. This means that grass (id = 1) would be [0, 1, 0, 0, ...] and stairs (id = 2) would be [0, 0, 1, 0, ...]. To perform one-hot encoding, look into keras' to_categorical function.
Further reading:
one-hot encoding tutorial
one-hot preprocessing using to_categorical
one-hot on the fly using an embedding layer
As with any categorical model, this should be "one-hot" data.
The "channels" dimension of your data should have a size of n-materials.
Values = 0 mean there is no presence of that material
Values = 1 mean there is presence of that material
So, your input shape will be something like (samples, spatial1, spatial2, spatial3, materials). If your data is currently shaped as (samples, s1, s2, s3) and has the materials as integers as you described, you can use to_categorical to transform the integers to "one-hot".
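As a sketch of that transformation, the NumPy snippet below builds the extra materials axis by indexing an identity matrix. It is meant as a dependency-free stand-in for the to_categorical step mentioned above; the material ids and the volume size are hypothetical.

```python
import numpy as np

n_materials = 3                         # hypothetical ids: 0 = air, 1 = grass, 2 = stairs
rng = np.random.default_rng(1)
volume = rng.integers(0, n_materials, size=(2, 4, 4, 4))   # (samples, s1, s2, s3)

# Row i of the identity matrix is the one-hot vector for material i.
one_hot = np.eye(n_materials, dtype=np.float32)[volume]

print(one_hot.shape)  # (2, 4, 4, 4, 3)
```

Each voxel now holds exactly one 1 in the channel of its material, so there is no numeric "in-between" of grass and stairs for the network to interpolate.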
Although I am not sure if this is what you are asking for, I would imagine that after the bottleneck of the convolutional network, one would typically use a flatten layer whose output goes to a dense layer. The output layer, if using sigmoid activation, will give you probabilities for each of the classes, which have to be one-hot encoded, as others have suggested.
If you want the output of the network itself to be in discrete values, I suppose you can use some sort of step-wise activation function in the output layer. However, you have to take care that your loss remains differentiable throughout the network (which is why such activation functions are not available in Keras).

Shape of tensor for 2D image in Keras

I am a newbie to Keras (and somewhat to TF), and I have found the shape definition for the input layer very confusing.
So in the examples, when we have a 1D vector of length 20 for input, the shape gets defined as (20,).
And when a 2D tensor for greyscale images needs to be defined for MNIST, it is defined as:
...Input(shape=(28, 28, 1)...)
So my question is: why is the tensor not defined as (20) and (28, 28)? Why, in the first case, is a second dimension added and left empty? And why, in the second case, does the number of channels have to be defined?
I understand that it depends on the layer so Conv1D, Dense or Conv2D take different shapes but it seems the first parameter is implicit?
According to the docs, Dense needs input of shape (batch_size, ..., input_dim), but how is this related to the example:
Dense(32, input_shape=(784,))
Tuples vs numbers
input_shape must be a tuple, so only (20,) can satisfy it. The number 20 is not a tuple. -- There is the parameter input_dim, to make your life easier if you have only one dimension. This parameter can take 20. (But really, I find it just confusing, I always work with input_shape and use tuples, to keep a consistent understanding).
Dense(32, input_shape=(784,)) is the same as Dense(32, input_dim=784).
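The tuple-versus-number point is plain Python behaviour and can be checked without Keras. The variable names below are only for illustration.

```python
not_a_tuple = (20)     # parentheses alone do nothing: this is the int 20
a_tuple = (20,)        # the trailing comma makes a one-element tuple

print(type(not_a_tuple).__name__)  # int
print(type(a_tuple).__name__)      # tuple
print(len((28, 28, 1)))            # 3
```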
Images don't have only pixels, they also have channels (red, green, blue).
A black/white image has only one channel.
So, (28pixels, 28pixels, 1channel)
But notice that there isn't any obligation to follow this shape for images everywhere. You can shape them the way you like. But some kinds of layers do demand a certain shape, otherwise they couldn't work.
Some layers demand specific shapes
It's the case of the 2D convolutional layers, which need (size1,size2,channels). They need this shape because they must apply the convolutional filters accordingly.
It's also the case of recurrent layers, which need (timeSteps,featuresPerStep) to perform their recurrent calculations.
MNIST models
Again, there isn't any obligation to shape your image in a specific way. You must do it according to which first layer you choose and what you intend to achieve. It's a free thing.
Many examples simply don't care about an image being a 2d structured thing, and they just use models that take 784 pixels. That's enough. They probably start with Dense layers, which demand shapes like (size,)
Other examples may care, and use a shape (28,28), but then these models will have to reshape the input to fit the needs of the next layer.
2D convolutional layers will demand (28,28,1).
The main idea is: input arrays must match input_shape or input_dim.
Tensor shapes
Be careful, though, when reading Keras error messages or working with custom / lambda layers.
All these shapes we defined before omit an important dimension: the batch size or the number of samples.
Internally all tensors will have this additional dimension as the first dimension. Keras will report it as None (a dimension that will adapt to any batch size you have).
So, input_shape=(784,) will be reported as (None,784).
And input_shape=(28,28,1) will be reported as (None,28,28,1)
And your actual input data must have a shape that matches that reported shape.
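A short NumPy sketch of that matching rule follows. The helper `matches` is hypothetical (it is not a Keras function) and only mimics the idea that None accepts any batch size; the 64 blank images are made-up data.

```python
import numpy as np

images = np.zeros((64, 28, 28), dtype=np.float32)   # 64 hypothetical grayscale images

as_vectors = images.reshape(len(images), 784)        # for input_shape=(784,)
as_channels = images[..., np.newaxis]                # for input_shape=(28, 28, 1)

def matches(data, reported_shape):
    """Return True if data fits a reported shape, where None accepts any size."""
    if data.ndim != len(reported_shape):
        return False
    return all(want is None or got == want
               for got, want in zip(data.shape, reported_shape))

print(matches(as_vectors, (None, 784)))          # True
print(matches(as_channels, (None, 28, 28, 1)))   # True
print(matches(images, (None, 784)))              # False
```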

Why do we flatten the data before we feed it into tensorflow?

I'm following the Udacity MNIST tutorial, and the MNIST data is originally a 28*28 matrix. However, right before feeding that data in, they flatten it into a 1D array with 784 columns (784 = 28 * 28).
For example,
original training set shape was (200000, 28, 28).
200000 rows (samples). Each sample is a 28*28 matrix.
They converted this into the training set whose shape is (200000, 784)
Can someone explain why they flatten the data out before feeding to tensorflow?
Because when you're adding a fully connected layer, you always want your data to be a (1 or) 2 dimensional matrix, where each row is the vector representing your data. That way, the fully connected layer is just a matrix multiplication between your input (of size (batch_size, n_features)) and the weights (of shape (n_features, n_outputs)) (plus the bias and the activation function), and you get an output of shape (batch_size, n_outputs). Plus, you really don't need the original shape information in a fully connected layer, so it's OK to lose it.
It would be more complicated and less efficient to get the same result without reshaping first, which is why we always do it before a fully connected layer. For a convolutional layer, by contrast, you'll want to keep the data in its original format (width, height).
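The matrix-multiplication view of a fully connected layer can be sketched in NumPy as follows. The batch of 8 random images, the 10 outputs, and the choice of ReLU as the activation are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.random((8, 28, 28))               # 8 hypothetical 28x28 images

x = batch.reshape(8, -1)                      # flatten to (batch_size, n_features) = (8, 784)
W = rng.normal(scale=0.01, size=(784, 10))    # weights: (n_features, n_outputs)
b = np.zeros(10)                              # one bias per output

y = np.maximum(x @ W + b, 0.0)                # matrix multiply, add bias, apply ReLU

print(x.shape)  # (8, 784)
print(y.shape)  # (8, 10)
```

Each row of `x` is one flattened image, so a single matrix product handles the whole batch at once.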
That is a convention with fully connected layers. Fully connected layers connect every node in the previous layer with every node in the successive layer so locality is not an issue for this type of layer.
Additionally by defining the layer like this we can efficiently calculate the next step by calculating the formula: f(Wx + b) = y. This would not be as easily possible with multidimensional input and reshaping the input is low cost and easy to accomplish.