I'm searching for a data leak in my model. I'm using tf.layers.dense before a masking operation and am concerned that the model could just learn to switch positions in the middle dimension of my input tensor.
Given an input tensor x = tf.ones((2,3,4)), would tf.layers.dense(x,8) flatten x into a fully connected layer with 2*3*4=24 input neurons and 2*3*8=48 output neurons and then reshape the result to [2,3,8], or would it create 2*3=6 fully connected layers with 4 input and 8 output neurons each and then concatenate them?
As for the Keras Dense layer, another answer has already mentioned that its input is not flattened; instead, the layer is applied on the last axis of its input.
As for the TensorFlow Dense layer, it inherits from the Keras Dense layer and, as a result, it is likewise applied on the last axis of its input. In terms of the question, this is closest to the second option, except that the 6 "layers" share one set of weights (a single kernel of shape (4, 8)).
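To make the last-axis behaviour concrete, here is a minimal NumPy sketch (not the TensorFlow implementation itself) that mimics what a dense layer with 8 units does to a (2, 3, 4) input. The names kernel and bias and the random values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(2, 3, 4))       # (batch, positions, features)
kernel = rng.normal(size=(4, 8))     # one shared weight matrix: 4 inputs -> 8 units
bias = np.zeros(8)

# Dense-style computation: contract only the last axis of x with the kernel.
y = np.tensordot(x, kernel, axes=([2], [0])) + bias
print(y.shape)  # (2, 3, 8)

# Changing one position in the middle dimension only changes that position's output.
x_changed = x.copy()
x_changed[:, 1, :] += 1.0
y_changed = np.tensordot(x_changed, kernel, axes=([2], [0])) + bias

untouched_equal = np.allclose(y[:, [0, 2], :], y_changed[:, [0, 2], :])
print(untouched_equal)  # True
```

In this sketch the layer holds 4*8 + 8 = 40 parameters, and no output at one position depends on the input at another position, so it cannot swap positions in the middle dimension.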
Say I have some extracted features and they form 10x10 data (maybe an image or a cepstrogram).
Usually I would feed this into my 2D conv and I'll be on my way.
My question is: if I had to convert this into a 1D input of 100 values, what disadvantages would I get besides the obvious one, namely that my filter would no longer detect patterns among the surrounding neighbors but only among the previous and next values, which might lead to worse performance?
And if I had to do this, would I just reshape, use a Reshape layer, or use a Permute layer?
Thanks
Yes, you are correct regarding the GNA: our Intel GNA hardware natively supports only 1D convolutions, and support for 2D convolutions is experimental.
This article (GNA Plugin - OpenVINO™ Toolkit) specifies the steps to add Permute layers before or after convolutions.
You could try both methods and see which one works for you.
Generally, a 1D convolution in TensorFlow is created as a 2D convolution wrapped in reshape layers, which add an H dimension before the 2D convolution and remove it afterwards.
At the same time, MO (the Model Optimizer) inserts permutes before and after the reshape layers, since they change the interpretation of the data.
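The idea of wrapping a 2D convolution in reshapes can be sketched in plain NumPy. This is only an illustration with naive, hypothetical helper functions (conv1d_valid and conv2d_valid are not TensorFlow or OpenVINO APIs); it shows that adding a height dimension of size 1, convolving in 2D, and removing the dimension again gives the same result as a direct 1D convolution.

```python
import numpy as np

def conv1d_valid(signal, kernel):
    """Naive 1D cross-correlation ('valid' padding, stride 1)."""
    k = len(kernel)
    return np.array([np.dot(signal[i:i + k], kernel)
                     for i in range(len(signal) - k + 1)])

def conv2d_valid(image, kernel):
    """Naive 2D cross-correlation ('valid' padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

signal = np.arange(10, dtype=float)      # 1D input of width 10
kernel = np.array([1.0, 0.0, -1.0])      # 1D filter of width 3

direct = conv1d_valid(signal, kernel)

# Add a height dimension of size 1, run the 2D version, then remove it again.
wrapped = conv2d_valid(signal.reshape(1, -1), kernel.reshape(1, -1)).reshape(-1)

same = np.allclose(direct, wrapped)
print(same)  # True
```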
For advantages & disadvantages of 2D/1D CNN you may refer to this detailed thread
In TensorFlow, these are the steps to build a CNN architecture:
1. Reshape the input if necessary using tf.reshape() to match the convolutional layer you intend to build (for example, if using a 2D convolution, reshape each example into a three-dimensional format)
2. Create a convolutional layer using tf.nn.conv1d(), tf.nn.conv2d(), or tf.nn.conv3d(), depending on the dimensionality of the input.
3. Create a pooling layer using tf.nn.max_pool()
4. Repeat steps 2 and 3 for additional convolution and pooling layers
5. Reshape the output of the convolution and pooling layers, flattening it to prepare for the fully connected layer
6. Create a fully connected layer using the tf.matmul() function, add an activation using, for example, tf.nn.relu(), and apply dropout using tf.nn.dropout()
7. Create a final layer for class prediction, again using tf.matmul()
8. Store weights and biases using TensorFlow variables
These are just the basic steps to create the CNN model; there are additional steps to define training and evaluation, execute the model, and tune it.
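The shape bookkeeping in the steps above can be followed in a small forward-pass sketch. It uses NumPy stand-ins rather than the tf.nn functions, and the helpers (conv2d, max_pool_2x2, relu), the 10x10 input size, the layer sizes and the 3 output classes are all assumptions chosen for illustration; dropout and training are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w, b):
    """'Valid' 2D convolution. x: (N, H, W, C_in), w: (kh, kw, C_in, C_out)."""
    n, h, wd, _ = x.shape
    kh, kw, _, c_out = w.shape
    out = np.zeros((n, h - kh + 1, wd - kw + 1, c_out))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            patch = x[:, i:i + kh, j:j + kw, :]              # (N, kh, kw, C_in)
            out[:, i, j, :] = np.tensordot(patch, w, axes=3) + b
    return out

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 (H and W assumed even)."""
    n, h, w, c = x.shape
    return x.reshape(n, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))

def relu(x):
    return np.maximum(x, 0.0)

# Step 1: reshape flat inputs to (batch, height, width, channels).
flat = rng.normal(size=(2, 100))
x = flat.reshape(-1, 10, 10, 1)

# Steps 2-3: convolution followed by pooling.
w_conv = rng.normal(size=(3, 3, 1, 4)) * 0.1
b_conv = np.zeros(4)
h = max_pool_2x2(relu(conv2d(x, w_conv, b_conv)))   # (2, 4, 4, 4)

# Step 5: flatten for the fully connected layer.
h_flat = h.reshape(h.shape[0], -1)                  # (2, 64)

# Step 6: fully connected layer as a matrix multiplication plus activation.
w_fc = rng.normal(size=(64, 16)) * 0.1
b_fc = np.zeros(16)
fc = relu(h_flat @ w_fc + b_fc)

# Step 7: final layer producing class scores (3 classes assumed).
w_out = rng.normal(size=(16, 3)) * 0.1
b_out = np.zeros(3)
logits = fc @ w_out + b_out
print(logits.shape)  # (2, 3)
```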
In step 2 of CNN development you create a 2D convolutional layer using tf.nn.conv2d(); this function computes a 2-D convolution given 4-D input and filter tensors.
So if you have a 1D vector, as in the MNIST dataset examples with 784 features, you can convert it to the 4D input required by conv2d() using TensorFlow's reshape method. The reshape matches the picture format [Height x Width x Channel], and the input tensor then becomes 4-D: [Batch Size, Height, Width, Channel]:
x = tf.reshape(x, shape=[-1, 28, 28, 1])
where x is a placeholder vector:
x = tf.placeholder(tf.float32, [None, num_input])
You may refer to the official Tensorflow documentation
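As a quick check of what that reshape does to the data layout, here is a NumPy sketch; NumPy's reshape uses the same row-major ordering as tf.reshape. The batch of 3 examples and its values are made up for illustration.

```python
import numpy as np

num_input = 784                                   # 28 * 28 pixels per image
batch = np.arange(3 * num_input, dtype=np.float32).reshape(3, num_input)

# -1 lets the batch size be inferred, as in tf.reshape(x, shape=[-1, 28, 28, 1]).
images = batch.reshape(-1, 28, 28, 1)
print(images.shape)  # (3, 28, 28, 1)

# Row-major order: pixel (row r, column c) comes from flat index r * 28 + c.
r, c = 5, 7
same_pixel = images[0, r, c, 0] == batch[0, r * 28 + c]
print(same_pixel)  # True
```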
After checking the official doc (the Keras masking tutorial), it is still not clear to me whether a Keras Dense layer can propagate the mask to the layers that follow it (layers 4 and 5 in the example below).
Another question is: when calculating the loss at the 5th layer, should we apply the mask?
We could say it is not needed because the 2nd layer, the LSTM, already ignores the <pad> tokens in the input sequences. However, I've read somewhere that the LSTM outputs for <pad> tokens are NOT zero but instead repeat the last valid token's output. That would affect the value of the loss. Do we therefore need to apply the mask at the 5th layer?
Our inputs are padded sequences, and we have a sequential model in Keras:
1. Embedding layer with mask_zero = True //can generate mask
2. LSTM layer //can consume mask
3. Dense layer //Question: can this layer propagate mask to other layers in this model
4. other layers...
5. output layer with sigmoid as activation function
Thanks for your kind help!
I am trying to understand the Keras layers better. I am working on a sequence-to-sequence model where I embed a sentence and pass it to an LSTM that returns sequences. After that, I want to apply a Dense layer to each timestep (word) in the sentence, and it seems like TimeDistributed does the job for three-dimensional tensors like in this case.
In my understanding, Dense layers only work for two-dimensional tensors and TimeDistributed just applies the same dense on every timestep in three dimensions. Could one then not simply flatten the timesteps, apply a dense layer and perform a reshape to obtain the same result or are these not equivalent in some way that I am missing?
Imagine you have a batch of 4 time steps, each containing a 3-element vector. Let's represent that with this:
Now you want to transform this batch using a dense layer, so you get 5 features per time step. The output of the layer can be represented as something like this:
You consider two options: using a TimeDistributed dense layer, or reshaping to a flat input, applying a dense layer and reshaping back to time steps.
In the first option, you would apply a dense layer with 3 inputs and 5 outputs to every single time step. This could look like this:
Each blue circle here is a unit in the dense layer. By doing this with every input time step you get the total output. Importantly, these five units are the same for all the time steps, so you only have the parameters of a single dense layer with 3 inputs and 5 outputs.
The second option would involve flattening the input into a 12-element vector, applying a dense layer with 12 inputs and 20 outputs, and then reshaping that back. This is how it would look:
Here the input connections of only one unit are drawn for clarity, but every unit would be connected to every input. Here, obviously, you have many more parameters (those of a dense layer with 12 inputs and 20 outputs), and also note that each output value is influenced by every input value, so values in one time step would affect outputs in other time steps. Whether this is something good or bad depends on your problem and model, but it is an important difference with respect to the previous, where each time step input and output were independent. In addition to that, this configuration requires you to use a fixed number of time steps on each batch, whereas the previous works independently of the number of time steps.
You could also consider the option of having four dense layers, each applied independently to each time step (I didn't draw it but hopefully you get the idea). That would be similar to the previous one, only each unit would receive input connections only from its respective time step inputs. I don't think there is a straightforward way to do that in Keras, you would have to split the input into four, apply dense layers to each part and merge the outputs. Again, in this case the number of time steps would be fixed.
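The parameter counts of the three options can be worked out with a few lines of arithmetic. The helper dense_params below is a hypothetical name introduced only for this illustration; it counts the weights plus biases of one fully connected layer.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

timesteps, in_features, out_features = 4, 3, 5

# Option 1: one dense layer (3 -> 5) shared by every time step.
shared = dense_params(in_features, out_features)

# Option 2: flatten to 12 inputs, one dense layer with 20 outputs.
flattened = dense_params(timesteps * in_features, timesteps * out_features)

# Option 3: a separate dense layer (3 -> 5) for each of the 4 time steps.
per_step = timesteps * dense_params(in_features, out_features)

print(shared, flattened, per_step)  # 20 260 80
```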
A Dense layer can act on any tensor, not necessarily one of rank 2. And I think that the TimeDistributed wrapper does not change anything in the way the Dense layer acts. Applying a Dense layer directly to a tensor of rank 3 does exactly the same as applying a TimeDistributed wrapper around the Dense layer. Here is an illustration:
from tensorflow.keras.layers import *
from tensorflow.keras.models import *
model = Sequential()
model.add(Dense(5,input_shape=(50,10)))
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_5 (Dense) (None, 50, 5) 55
=================================================================
Total params: 55
Trainable params: 55
Non-trainable params: 0
_________________________________________________________________
model1 = Sequential()
model1.add(TimeDistributed(Dense(5),input_shape=(50,10)))
model1.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
time_distributed_3 (TimeDist (None, 50, 5) 55
=================================================================
Total params: 55
Trainable params: 55
Non-trainable params: 0
_________________________________________________________________
Adding to the above answers,
here are a few pictures comparing the output shapes of the two layers. Using one or the other of these layers after an LSTM (for example) would therefore behave differently.
"Could one then not simply flatten the timesteps, apply a dense layer and perform a reshape to obtain the same result"
No, flattening timesteps into input dimensions (input_dim) is the wrong operation. As illustrated by yuva-rajulu, if you flatten a 3D input (batch_size,timesteps,input_dim) = (1000,50,10), you end up with a flattened input (batch_size,input_dim)=(1000,500), resulting in a network architecture in which timesteps interact with each other (see jdehesa). This is not what is intended (i.e., we want to apply the same dense layer to each timestep independently).
What needs to be done instead is to reshape the 3D input as (batch_size * timesteps, input_dim) = (50000,10), then apply the dense layer to this 2D input. That way the same dense layer operates on each of the 50000 input vectors (of length 10) independently. You end up with a (50000,n_units) output that you should reshape back into a (1000,50,n_units) output. Fortunately, when you pass a 3D input to a dense layer, Keras does this automatically for you. See the official reference:
"If the input to the layer has a rank greater than 2, then Dense computes the dot product between the inputs and the kernel along the last axis of the inputs and axis 0 of the kernel (using tf.tensordot). For example, if input has dimensions (batch_size, d0, d1), then we create a kernel with shape (d1, units), and the kernel operates along axis 2 of the input, on every sub-tensor of shape (1, 1, d1) (there are batch_size * d0 such sub-tensors). The output in this case will have shape (batch_size, d0, units)."
Another way to see it is that Dense() computes its output simply by applying the kernel, i.e., the weight matrix of size (input_dim, n_units), to the last dimension of your 3D input, treating all other dimensions like batch dimensions, and then sizes the output accordingly.
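The equivalence between the reshape route and the last-axis contraction can be checked numerically. The following is a NumPy sketch rather than Keras code; kernel and bias are random stand-ins for the layer's weights, and n_units = 5 is an assumed value.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, timesteps, input_dim, n_units = 1000, 50, 10, 5

x = rng.normal(size=(batch_size, timesteps, input_dim))
kernel = rng.normal(size=(input_dim, n_units))
bias = rng.normal(size=n_units)

# Route 1: fold time steps into the batch axis, apply the layer, unfold.
x_2d = x.reshape(batch_size * timesteps, input_dim)      # (50000, 10)
out_2d = x_2d @ kernel + bias                            # (50000, 5)
out_reshaped = out_2d.reshape(batch_size, timesteps, n_units)

# Route 2: contract the last axis directly, as the quoted documentation describes.
out_direct = np.tensordot(x, kernel, axes=([2], [0])) + bias

equivalent = np.allclose(out_reshaped, out_direct)
n_params = kernel.size + bias.size
print(equivalent, n_params)  # True 55
```

The parameter count of 10*5 + 5 = 55 matches the model.summary() output shown earlier for both Dense(5) and TimeDistributed(Dense(5)).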
I think that there may have been a time when the TimeDistributed layer was needed in Keras with Dense() (see the discussion here). Today, we do not need the TimeDistributed wrapper, as Dense() and TimeDistributed(Dense()) do exactly the same thing; see Andrey Kite Gorin or mujjiga.
I have problems understanding how to get the correct output when using word embeddings in Keras. My settings are as follows:
My inputs are batches of shape (batch_size, sequence_length). Each row in a batch represents one sentence; the words are represented by word ids. The sentences are padded with zeros so that all are of the same length.
For example a (3,6) input batch might look like: np.array([[1,3,5,6,0,0],[1,7,4,5,8,0],[1,3,8,2,7,2]])
My targets are given by the input batch shifted one step to the right.
So for each input word I want to predict the next word: np.array([[3,5,6,0,0,0],[7,4,5,8,0,0],[3,8,2,7,2,0]])
I feed such an input batch into the Keras embedding layer. My embedding size is 100, so the output will be a 3D tensor of shape (batch_size, sequence_length, embedding_size). So in the little example it's (3,6,100)
This 3D batch is fed into an LSTM layer
The output of the LSTM layer is fed into a Dense layer with (sequence_length) output neurons having a softmax activation function. So the shape of the output will be like the shape of the input, namely (batch_size, sequence_length)
As a loss I am using the categorical crossentropy between the input and target batch
My question:
The output batch will contain probabilities because of the softmax activation function. But what I want is the network to predict integers such that the output fits the target batch of integers.
How can I "decode" the output such that I know which word the network is predicting? Or do I have to construct the network differently?
Edit 1:
I have changed the output and target batches from 2D arrays to 3D tensors. So instead of using a target batch of size (batch_size, sequence_length) with integer id's I am now using a one-hot encoded 3D target tensor (batch_size, sequence_length, vocab_size). To get the same format as an output of the network, I have changed the network to output sequences (by setting return_sequences=True in the LSTM layer). Further, the number of output neurons was changed to vocab_size such that the output layer now produces a batch of size (batch_size, sequence_length, vocab_size).
With this 3D encoding I can get the predicted word id using tf.argmax(outputs, 2). This approach seems to work for the moment but I would still be interested whether it's possible to keep the 2D targets/outputs
One solution, perhaps not the best, is to output one-hot vectors the size of your dictionary (including dummy words).
Your last layer must output (sequence_length, dictionary_size+1).
Your dense layer will already output the sequence_length if you don't add any Flatten() or Reshape() before it, so it should be a Dense(dictionary_size+1)
You can use the function keras.utils.to_categorical() to transform an integer into a one-hot vector and keras.backend.argmax() to transform a one-hot vector into an integer.
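The round trip between integer ids and one-hot vectors can be sketched with NumPy equivalents of those two functions (np.eye indexing in place of to_categorical, and ndarray.argmax in place of the backend argmax). The vocabulary size and the target values below are illustrative assumptions.

```python
import numpy as np

vocab_size = 9                                  # assumed: ids 0..8, with 0 as padding
targets = np.array([[3, 5, 6, 0, 0, 0],
                    [7, 4, 5, 8, 0, 0],
                    [3, 8, 2, 7, 2, 0]])        # (batch_size, sequence_length)

# Integer ids -> one-hot vectors: (3, 6) -> (3, 6, vocab_size).
one_hot = np.eye(vocab_size)[targets]

# One-hot (or softmax probabilities) -> integer ids: argmax over the last axis.
decoded = one_hot.argmax(axis=-1)

print(one_hot.shape)                      # (3, 6, 9)
print(np.array_equal(decoded, targets))   # True
```

The same argmax over the last axis applies to a softmax output of shape (batch_size, sequence_length, vocab_size), where it picks the most probable word id at each position.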
Unfortunately, this is sort of unpacking your embedding. It would be nice if it were possible to have a reverse embedding or something like that.
I'm following the Udacity MNIST tutorial, and the MNIST data is originally a 28*28 matrix. However, right before feeding that data in, they flatten it into a 1D array with 784 columns (784 = 28 * 28).
For example,
the original training set shape was (200000, 28, 28).
200000 rows (data). Each data item is a 28*28 matrix
They converted this into a training set whose shape is (200000, 784)
Can someone explain why they flatten the data out before feeding to tensorflow?
Because when you're adding a fully connected layer, you always want your data to be a (1 or) 2 dimensional matrix, where each row is the vector representing your data. That way, the fully connected layer is just a matrix multiplication between your input (of size (batch_size, n_features)) and the weights (of shape (n_features, n_outputs)) (plus the bias and the activation function), and you get an output of shape (batch_size, n_outputs). Plus, you really don't need the original shape information in a fully connected layer, so it's OK to lose it.
It would be more complicated and less efficient to get the same result without reshaping first; that's why we always do it before a fully connected layer. For a convolutional layer, by contrast, you'll want to keep the data in its original format (width, height).
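Here is a minimal NumPy sketch of that flatten-then-multiply pattern. The batch of 200 random images stands in for the (200000, 28, 28) training set, and the 10 output units, the weight scale and the ReLU activation are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.normal(size=(200, 28, 28))      # small stand-in for the (200000, 28, 28) set
n_outputs = 10                               # assumed number of output units

# Flatten each 28x28 image into a 784-element row.
flat = images.reshape(images.shape[0], -1)   # (200, 784)

weights = rng.normal(size=(784, n_outputs)) * 0.01
bias = np.zeros(n_outputs)

# The fully connected layer: one matrix multiplication plus bias, then ReLU.
outputs = np.maximum(flat @ weights + bias, 0.0)
print(outputs.shape)  # (200, 10)
```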
That is a convention with fully connected layers. Fully connected layers connect every node in the previous layer with every node in the successive layer so locality is not an issue for this type of layer.
Additionally, by defining the layer like this we can efficiently calculate the next step using the formula f(Wx + b) = y. This would not be as easy with multidimensional input, and reshaping the input is cheap and easy to do.