From https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense (*Note Section)
For an input of (batch_size, d0, d1) why is the same (d1, units) kernel used for every sub-tensor (1, 1, d1)?
Additionally, why is the higher dimension dense layer operation broken down to work on subsets of input nodes, instead of having a weight from all d0xd1 inputs to an output node?
I apologize if I am missing something obvious and thank you for any help!
So I dug through the source code and found that TensorFlow changed how it implements the dense operation, and I asked my boss at uni why they made this change.
In TF1, for input of rank > 2, they flattened the input and just did a regular 1-D dense operation.
https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/python/keras/layers/core.py
In TF2, for input of rank > 2, they use the tensordot operation. This uses a smaller (d1, units) kernel and shares it across all input sub-tensors, which has the effect of sharing the learned channel-wise information.
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/ops/core.py
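To make the difference concrete, here is a minimal NumPy sketch (not the TensorFlow source) of the two behaviours described above. The sizes batch_size, d0, d1 and units are made-up values for illustration, and the random kernels stand in for learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, d0, d1, units = 2, 3, 4, 5   # illustrative sizes only
x = rng.normal(size=(batch_size, d0, d1))

# Shared-kernel version: one (d1, units) kernel contracts the last axis of x.
kernel = rng.normal(size=(d1, units))
shared = np.tensordot(x, kernel, axes=[[2], [0]])   # shape (batch_size, d0, units)

# The same result, computed one (d1,) sub-tensor at a time with the same kernel.
per_position = np.stack(
    [np.stack([x[b, i] @ kernel for i in range(d0)]) for b in range(batch_size)]
)

# Flatten-first version: each of the d0*d1 inputs has its own weight per output unit.
flat_kernel = rng.normal(size=(d0 * d1, units))
flattened = x.reshape(batch_size, d0 * d1) @ flat_kernel   # shape (batch_size, units)

print(shared.shape, flattened.shape)
print(kernel.size, flat_kernel.size)   # number of weights in each version
```

The shared kernel keeps the d0 axis in the output and needs d1 * units weights, while the flatten-first version collapses d0 and needs d0 * d1 * units weights.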
Related
I have been building a word-based neural machine translation model in Tensorflow using LSTMs. I have been following a couple of tutorials, including:
https://towardsdatascience.com/implementing-neural-machine-translation-using-keras-8312e4844eb8
My question is specifically about how the final Dense layer (with softmax activation) works.
All the words in the corpus are assigned to an integer. No word is assigned to the integer 0.
When you get your output from the final Dense (+ softmax) layer, what happens if index 0 has the maximum value? How does Tensorflow interpret this? No word in the target language has been assigned to the index 0. Yet this output needs to be fed as the input for the next time-step.
Could someone explain what's going on here?
I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?
First of all, let's make sure your representation is actually what you think it is. The input to an LSTM (or any recurrent layer, for that matter) must have the per-sample shape (timesteps, features), i.e. if you have 1000 training samples, each consisting of 100 timesteps, with each timestep having 10 values, your input shape will be (100,10,) and the full training array will have shape (1000,100,10). Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce 4-dimensional representations with absolutely no problem, just like so:
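from keras.layers import Input, LSTM
myLongInput = Input(shape=(8192, 8))  # 8192 timesteps, 8 values per step
myRecurrentFunction = LSTM(4)  # returns only the final 4-dimensional output
myShortOutput = myRecurrentFunction(myLongInput)
myShortOutput.shape
TensorShape([Dimension(None), Dimension(4)])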
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction = LSTM(4, return_sequences=True)  # keep the output at every timestep
myLongOutput = myNewRecurrentFunction(myLongInput)
myLongOutput.shape
TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see the new output is now a 3rd order tensor, with the second dimension now being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Sidenote: make sure to set the activation of your last layer to a softmax if your output is in one-hot format)
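The shape bookkeeping can also be checked without Keras. The following is only a rough NumPy sketch: simple_recurrent is a hypothetical helper that uses a plain tanh recurrence as a stand-in for an LSTM, with random weights and a sequence shortened from 8192 to 16 steps so it runs quickly. It is meant to show the shapes only, not how an LSTM computes its state.

```python
import numpy as np

def simple_recurrent(x, units, return_sequences=False, seed=0):
    """Plain tanh recurrence (a stand-in for an LSTM) over x of shape (batch, timesteps, features)."""
    rng = np.random.default_rng(seed)
    batch, timesteps, features = x.shape
    w_in = rng.normal(scale=0.1, size=(features, units))
    w_rec = rng.normal(scale=0.1, size=(units, units))
    h = np.zeros((batch, units))
    states = []
    for t in range(timesteps):
        h = np.tanh(x[:, t, :] @ w_in + h @ w_rec)
        states.append(h)
    # Return every intermediate state, or only the last one.
    return np.stack(states, axis=1) if return_sequences else h

x = np.ones((3, 16, 8))   # 3 samples, 16 timesteps (shortened from 8192), 8 values each
all_steps = simple_recurrent(x, units=4, return_sequences=True)
last_only = simple_recurrent(all_steps, units=4)   # the second layer still sees a time axis
print(all_steps.shape, last_only.shape)
```

The first call keeps the time axis, so a second recurrent layer can consume it; the final call drops it and leaves one 4-dimensional vector per sample, which matches the shape of the one-hot Y data.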
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.
I've been going through the docs recently, and in many different functions, like tf.layers.dense or tf.nn.conv2d, I came across the arguments units and filters respectively, and I can't understand the point of them. Can someone clearly describe the meaning of
dimensionality of the output space
in the above cases or maybe more general terms? Thanks in advance.
In my opinion:
units in tf.layers.dense:
This is the number of output nodes the dense layer returns.
A fully connected (dense) layer consists of input nodes and output nodes,
so "dimensionality of the output space" can be read as the number of output nodes.
If units = 1, all the input nodes are connected to a single output node.
In Inception v3 and other classifier models, the units of the dense layer are always equal to the number of classes.
filters in tf.nn.conv2d:
As stated in the API doc:
filter: A Tensor. Must have the same type as input. A 4-D tensor of shape [filter_height, filter_width, in_channels, out_channels]
The confusing part is probably out_channels.
I understand out_channels as the number of filters we want to scan over the input tensor,
so out_channels can be regarded as the number of kernels.
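Here is a small NumPy sketch of both cases, just to show where units and out_channels end up in the output shape. It is not how TensorFlow implements these ops: the sizes are made up, the weights are random, and the convolution assumes stride 1 with no padding.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dense: units is the number of output nodes, i.e. the second kernel dimension.
inputs = rng.normal(size=(2, 10))            # batch of 2 samples, 10 input nodes
units = 3
dense_kernel = rng.normal(size=(10, units))
dense_out = inputs @ dense_kernel            # shape (2, units)

# Convolution: out_channels is the number of filters slid over the input.
image = rng.normal(size=(1, 6, 6, 3))        # [batch, height, width, in_channels]
filt = rng.normal(size=(3, 3, 3, 4))         # [filter_height, filter_width, in_channels, out_channels]
fh, fw, _, out_channels = filt.shape
out_h = image.shape[1] - fh + 1
out_w = image.shape[2] - fw + 1
conv_out = np.zeros((1, out_h, out_w, out_channels))
for i in range(out_h):
    for j in range(out_w):
        patch = image[:, i:i + fh, j:j + fw, :]                     # (1, fh, fw, in_channels)
        # Contract height, width and in_channels; one value per filter remains.
        conv_out[:, i, j, :] = np.tensordot(patch, filt, axes=3)

print(dense_out.shape, conv_out.shape)
```

In both cases the last dimension of the output is the "dimensionality of the output space": units for the dense layer and out_channels for the convolution.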
I'm looking for a way to achieve multiple classifications for an input. The number of outputs is specified, and the class sets may or may not be the same for the outputs. The sample belongs to one class of each class set.
My question is, what should the target data and the output layer look like? What activation, loss and training functions could be used, and how should the layer be connected to the hidden layer? I'm not necessarily looking for an optimal solution, just a working one.
My current guess at what could work is to make the target data multiple concatenated one-hot vectors and to give the output layer as many softmax units as the number of vectors. I don't know how the layers would be connected with that solution or how the net would figure out the sizes of the class sets. I think a label powerset would not work for my needs.
I think the matlab patternnet function can create a net that does that, but I don't know how the resulting net works. Code for TensorFlow or Keras would be very welcome.
This may be a late response to the question, but I am working on multi-label classification and just found a solution.
As for Keras, here's an example:
target label: [1, 0, 0, 1, 0]
output layer: Dense(5, activation='sigmoid')
loss: 'binary_crossentropy'
That will work well if the dataset is big enough.
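To show what that output layer and loss compute, here is a minimal NumPy sketch. The logits are hypothetical pre-activation values chosen for illustration, and binary_crossentropy below is a hand-written helper that averages over the five labels, not the Keras function itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(target, predicted, eps=1e-7):
    # Clip to avoid log(0), then average the per-label cross-entropy.
    predicted = np.clip(predicted, eps, 1.0 - eps)
    return -np.mean(target * np.log(predicted) + (1 - target) * np.log(1 - predicted))

target = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
logits = np.array([2.0, -1.0, -3.0, 0.5, -2.0])   # hypothetical outputs before the sigmoid
probs = sigmoid(logits)                # each label gets its own independent probability
labels = (probs > 0.5).astype(int)     # threshold each label separately
loss = binary_crossentropy(target, probs)
print(labels, loss)
```

Unlike softmax, the sigmoid outputs do not have to sum to 1, which is why several labels can be active at once.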
There is a type of architecture that I would like to experiment with in TensorFlow.
The idea is to compose 2-D filter kernels by a combination of 1-D filters.
From the paper:
Simplifying ConvNets through Filter Compositions
The essence of our proposal consists of decomposing the ND kernels of a traditional network into N consecutive layers of 1D kernels.
...
We propose DecomposeMe which is an architecture consisting of decomposed layers. Each decomposed layer represents a N-D convolutional layer as a composition of 1D filters and, in addition, by including a non-linearity
φ(·) in-between.
...
Converting existing structures to decomposed ones is a straight forward process as
each existing ND convolutional layer can systematically be decomposed into sets of
consecutive layers consisting of 1D linearly rectified kernels and 1D transposed kernels
as shown in Figure 1.
If I understand correctly, a single 2-D convolutional layer is replaced with two consecutive 1-D convolutions?
Considering that the weights are shared and transposed, it is not clear to me how exactly to implement this in TensorFlow.
I know this question is old and you probably already figured it out, but it might help someone else with the same problem.
Separable convolution can be implemented in tensorflow as follows (details omitted):
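# Shorthand for the TensorFlow ops; conv2 stands for a 2-D convolution (strides and padding omitted).
# d = filter length, K = intermediate feature maps, N = final feature maps.
X = placeholder(float32, shape=[None, 100, 100, 3])
v1 = Variable(truncated_normal([d, 1, 3, K], stddev=0.001))  # K column filters of shape [d, 1]
h1 = Variable(truncated_normal([1, d, K, N], stddev=0.001))  # N row filters of shape [1, d]
M1 = relu(conv2(conv2(X, v1), h1))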
Standard 2-D convolution with a column vector is the same as convolving each column of the input with that vector. Convolution with v1 produces K feature maps (or an output image with K channels), which are then convolved with h1 to produce the final desired number of feature maps N.
Weight sharing, to my knowledge, is simply a misleading term meant to emphasize that one filter is convolved with each patch in the image. Obviously you're going to use the same filter to obtain the result for each output pixel, which is how everyone does it in image/signal processing.
Then, to "decompose" a convolution layer as shown on page 5, simply add activation units in between the convolutions (ignoring biases):
M1 = relu(conv2(relu(conv2(X, v1)), h1))
Note that each filter in v1 is a column vector [d,1], and each filter in h1 is a row vector [1,d]. The paper is a little vague, but when performing separable convolution, this is how it's done. That is, you convolve the image with the column vectors, then you convolve the result with the row vectors, obtaining the final result.
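As a sanity check on the column-then-row idea, here is a single-channel NumPy sketch. conv2_valid is a hypothetical helper written for this example (stride 1, no padding, cross-correlation as CNN libraries compute it), and the image and filters are random. It shows that without the inner non-linearity the two 1-D passes equal one pass with the d x d outer-product kernel, and that the decomposed layer simply adds a ReLU in between.

```python
import numpy as np

def conv2_valid(image, kernel):
    """Single-channel 'valid' 2-D cross-correlation with stride 1."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(a):
    return np.maximum(a, 0.0)

rng = np.random.default_rng(0)
d = 3
image = rng.normal(size=(8, 8))
v = rng.normal(size=(d, 1))   # column filter, shape [d, 1]
h = rng.normal(size=(1, d))   # row filter, shape [1, d]

two_pass = conv2_valid(conv2_valid(image, v), h)   # column pass, then row pass
one_pass = conv2_valid(image, v @ h)               # one pass with the d x d rank-1 kernel
decomposed = relu(conv2_valid(relu(conv2_valid(image, v)), h))   # ReLU in between

print(np.allclose(two_pass, one_pass), decomposed.shape)
```

With the ReLU in between, the two passes no longer collapse into a single 2-D kernel, which is the point of including the non-linearity in the decomposed layer.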