What are the effects of padding a tensor? - tensorflow

I'm working on a problem using Keras that has been presenting me with issues:
My X data is all of shape (num_samples, 8192, 8), but my Y data is of shape (num_samples, 4), where 4 is a one-hot encoded vector.
Both X and Y data will be run through LSTM layers, but the layers are rejecting the Y data because it doesn't match the shape of the X data.
Is padding the Y data with 0s so that it matches the dimensions of the X data unreasonable? What kind of effects would that have? Is there a better solution?
Edited for clarification:
As requested, here is more information:
My Y data represents the expected output of passing the X data through my model. This is my first time working with LSTMs, so I don't have an architecture in mind, but I'd like to use an architecture that works well with classifying long (8192-length) sequences of words into one of several categories. Additionally, the dataset that I have is of an immense size when fed through an LSTM, so I'm currently using batch-training.
Technologies being used:
Keras (Tensorflow Backend)
TL;DR Is padding one tensor with zeroes in all dimensions to match another tensor's shape a bad idea? What could be a better approach?

First of all, let's make sure your representation is actually what you think it is; the input to an LSTM (or any recurrent layer, for that matter) must be of dimensionality: (timesteps, shape), i.e. if you have 1000 training samples, each consisting of 100 timesteps, with each timestep having 10 values, your input shape will be (100,10,). Therefore I assume from your question that each input sample in your X set has 8192 steps and 8 values per step. Great; a single LSTM layer can iterate over these and produce 4-dimensional representations with absolutely no problem, just like so:
myLongInput = Input(shape=(8192,8,))
myRecurrentFunction = LSTM(4)
myShortOutput = myRecurrentFunction(myLongInput)
myShortOutput.shape
TensorShape([Dimension(None), Dimension(4)])
I assume your problem stems from trying to apply yet another LSTM on top of the first one; the next LSTM expects a tensor that has a time dimension, but your output has none. If that is the case, you'll need to let your first LSTM also output the intermediate representations at each time step, like so:
myNewRecurrentFunction=LSTM(4, return_sequences=True)
myLongOutput = myNewRecurrentFunction(myLongInput)
myLongOutput.shape
TensorShape([Dimension(None), Dimension(None), Dimension(4)])
As you can see the new output is now a 3rd order tensor, with the second dimension now being the (yet unassigned) timesteps. You can repeat this process until your final output, where you usually don't need the intermediate representations but rather only the last one. (Sidenote: make sure to set the activation of your last layer to a softmax if your output is in one-hot format)
On to your original question, zero-padding has very little negative impact on your network. The network will strain itself a bit in the beginning trying to figure out the concept of the additional values you have just thrown at it, but will very soon be able to learn they're meaningless. This comes at a cost of a larger parameter space (therefore more time and memory complexity), but doesn't really affect predictive power most of the time.
I hope that was helpful.

Related

Keras class weight for multi-label binary classification on temporal data

I'm training a network with temporal data, and determine which of ~60 outputs are "active" at any given timestep (classified as 1 or 0 in the label data) - so I have an output of 60x1 floats that should represent a probability.
My input data is shaped as (X, 1, frames, dataPoints) - where X is the number of recorded sequences I have (I'm new to ML, I think this is 'batches'), frames is how long the longest sequence is (the rest are -1 padded and masked), and dataPoints is the actual input data for any given frame.
This is mostly an LTSM layer with return_sequences, but my input data is unbalanced.
For any given timestep, odds are ~85% that AN output is activated - but for any given output it's likely active at most 5% of the time.
When I attempted to apply a class weight of {0: 0.01, 1:0.99} (pending tuning), I get an error stating "class_weight not supported for 3+ dimensional targets". I've done some googling and people are suggesting compiling with sample_weight_mode of temporal and modifying sample weight, but (A) that doesn't seem right for my data (no individual sample is more important, but each 1 classification within all the samples is important), and (B) I don't understand the dimensionality of what that's doing.
How can I apply the class weighting to help balance each 1 classification with this data structure?
Side note: I'm rescaling the output of the LSTM to 0->1 since it uses tanh activation (and must use tanh activation for CUDA acceleration), and from_logits=False in my binary cross entropy loss.
Extra points if I can just use built-in tf/keras stuff and not have to write a custom loss function.
EDIT to include some code:
I have a data generator that outputs x and y in the shape of:
x.shape == (1, frameCount, inputFeatureLength) where frameCount is the number of frames in the temporal sequence, and inputFeatureLength is the size of the input data (around 100).
y.shape == (1, frameCount, outputSize) where outputSize is about 60 features.
I can successfully compile the mode, but when I try to model.fit with class_weight={0:0.01, 1:0.99} as an argument, I get the error ValueError: class_weight not supported for 3+ dimensional targets.
I've looked into sample weights, but as far as I can tell even using sample_weight_mode="temporal" on model.fit it'll let me give sample weights per frame of output, but not per each of the ~60 outputs per frame.

How to change the tensor shape in middle layers?

Saying I have a 2000x100 matrix, I put it into 10 dimension embedding layer, which gives me 2000x100x10 tensor. so it's 2000 examples and each example has a 100x10 matrix. and then, I pass it to a conv1d and KMaxpolling to get 2000x24 matrix, which is 2000 examples and each example has a 24 dimension vector. and now, I would like to recombine those examples before I apply another layer. I would like to combine the first 10 examples together, and such and such, so I get a tuple. and then I pass that tuple to the next layer.
My question is, Can I do that with Keras? and any idea on how to do it?
The idea of using "samples" is that these samples should be unique and not relate to each other.
This is something Keras will demand from your model: if it started with 2000 samples, it must end with 2000 samples. Ideally, these samples do not talk to each other, but you can use custom layers to hack this, but only in the middle. You will need to end with 2000 samples anyway.
I believe you're going to end your model with 200 groups, so maybe you should already start with shape (200,10,100) and use TimeDistributed wrappers:
inputs = Input((10,100)) #shape (200,10,100)
out = TimeDistributed(Embedding(....))(inputs) #shape (200,10,100,10)
out = TimeDistributed(Conv1D(...))(out) #shape (200,10,len,filters)
#here, you use your layer that will work on the groups without TimeDistributed.
To reshape a tensor without changing the batch size, use the Reshape(newShape) layer, where newShape does not include the first dimension (batch size).
To reshape a tensor including the batch size, use a Lambda(lambda x: K.reshape(x,newShape)) layer, where newShape includes the first dimension (batch size) - Here you must remember the warning above: somewhere you will need to undo this change so you end up with the same batch size as the input.

Why do we flatten the data before we feed it into tensorflow?

I'm following udacity MNIST tutorial and MNIST data is originally 28*28 matrix. However right before feeding that data, they flatten the data into 1d array with 784 columns (784 = 28 * 28).
For example,
original training set shape was (200000, 28, 28).
200000 rows (data). Each data is 28*28 matrix
They converted this into the training set whose shape is (200000, 784)
Can someone explain why they flatten the data out before feeding to tensorflow?
Because when you're adding a fully connected layer, you always want your data to be a (1 or) 2 dimensional matrix, where each row is the vector representing your data. That way, the fully connected layer is just a matrix multiplication between your input (of size (batch_size, n_features)) and the weights (of shape (n_features, n_outputs)) (plus the bias and the activation function), and you get an output of shape (batch_size, n_outputs). Plus, you really don't need the original shape information in a fully connected layer, so it's OK to lose it.
It would be more complicated and less efficient to get the same result without reshaping first, that's why we always do it before a fully connected layer. For a convolutional layer, on the opposite, you'll want to keep the data in original format (width, height).
That is a convention with fully connected layers. Fully connected layers connect every node in the previous layer with every node in the successive layer so locality is not an issue for this type of layer.
Additionally by defining the layer like this we can efficiently calculate the next step by calculating the formula: f(Wx + b) = y. This would not be as easily possible with multidimensional input and reshaping the input is low cost and easy to accomplish.

How to implement the DecomposeMe architecture in TensorFlow?

There is a type of architecture that I would like to experiment with in TensorFlow.
The idea is to compose 2-D filter kernels by a combination of 1-D filters.
From the paper:
Simplifying ConvNets through Filter Compositions
The essence of our proposal consists of decomposing the ND kernels of a traditional network into N consecutive layers of 1D kernels.
...
We propose DecomposeMe which is an architecture consisting of decomposed layers. Each decomposed layer represents a N-D convolutional layer as a composition of 1D filters and, in addition, by including a non-linearity
φ(·) in-between.
...
Converting existing structures to decomposed ones is a straight forward process as
each existing ND convolutional layer can systematically be decomposed into sets of
consecutive layers consisting of 1D linearly rectified kernels and 1D transposed kernels
as shown in Figure 1.
If I understand correctly, a single 2-D convolutional layer is replaced with two consecutive 1-D convolutions?
Considering that the weights are shared and transposed, it is not clear to me how exactly to implement this in TensorFlow.
I know this question is old and you probably already figured it out, but it might help someone else with the same problem.
Separable convolution can be implemented in tensorflow as follows (details omitted):
X= placeholder(float32, shape=[None,100,100,3]);
v1=Variable(truncated_normal([d,1,3,K],stddev=0.001));
h1=Variable(truncated_normal([1,d,K,N],stddev=0.001));
M1=relu(conv2(conv2(X,v1),h1));
Standard 2d convolution with a column vector is the same as convolving each column of the input with that vector. Convolution with v1 produces K feature maps (or an output image with K channels), which is then passed on to be convolved by h1 producing the final desired number of featuremaps N.
Weight sharing, according to my knowledge, is simply a a misleading term, which is meant to emphasize the fact that you use one filter that is convolved with each patch in the image. Obviously you're going to use the same filter to obtain the results for each output pixel, which is how everyone does it in image/signal processing.
Then in order to "decompose" a convolution layer as shown on page 5, it can be done by simply adding activation units in between the convolutions (ignoring biases):
M1=relu(conv2(relu(conv2(X,v1)),h1));
Not that each filter in v1 is a column vector [d,1], and each h1 is a row vector [1,d]. The paper is a little vague, but when performing separable convolution, this is how it's done. That is, you convolve the image with the column vectors, then you convolve the result with the horizontal vectors, obtaining the final result.

What is num_units in tensorflow BasicLSTMCell?

In MNIST LSTM examples, I don't understand what "hidden layer" means. Is it the imaginary-layer formed when you represent an unrolled RNN over time?
Why is the num_units = 128 in most cases ?
From this brilliant article
num_units can be interpreted as the analogy of hidden layer from the feed forward neural network. The number of nodes in hidden layer of a feed forward neural network is equivalent to num_units number of LSTM units in a LSTM cell at every time step of the network.
See the image there too!
The number of hidden units is a direct representation of the learning capacity of a neural network -- it reflects the number of learned parameters. The value 128 was likely selected arbitrarily or empirically. You can change that value experimentally and rerun the program to see how it affects the training accuracy (you can get better than 90% test accuracy with a lot fewer hidden units). Using more units makes it more likely to perfectly memorize the complete training set (although it will take longer, and you run the risk of over-fitting).
The key thing to understand, which is somewhat subtle in the famous Colah's blog post (find "each line carries an entire vector"), is that X is an array of data (nowadays often called a tensor) -- it is not meant to be a scalar value. Where, for example, the tanh function is shown, it is meant to imply that the function is broadcast across the entire array (an implicit for loop) -- and not simply performed once per time-step.
As such, the hidden units represent tangible storage within the network, which is manifest primarily in the size of the weights array. And because an LSTM actually does have a bit of it's own internal storage separate from the learned model parameters, it has to know how many units there are -- which ultimately needs to agree with the size of the weights. In the simplest case, an RNN has no internal storage -- so it doesn't even need to know in advance how many "hidden units" it is being applied to.
A good answer to a similar question here.
You can look at the source for BasicLSTMCell in TensorFlow to see exactly how this is used.
Side note: This notation is very common in statistics and machine-learning, and other fields that process large batches of data with a common formula (3D graphics is another example). It takes a bit of getting used to for people who expect to see their for loops written out explicitly.
The argument n_hidden of BasicLSTMCell is the number of hidden units of the LSTM.
As you said, you should really read Colah's blog post to understand LSTM, but here is a little heads up.
If you have an input x of shape [T, 10], you will feed the LSTM with the sequence of values from t=0 to t=T-1, each of size 10.
At each timestep, you multiply the input with a matrix of shape [10, n_hidden], and get a n_hidden vector.
Your LSTM gets at each timestep t:
the previous hidden state h_{t-1}, of size n_hidden (at t=0, the previous state is [0., 0., ...])
the input, transformed to size n_hidden
it will sum these inputs and produce the next hidden state h_t of size n_hidden
From Colah's blog post:
If you just want to have code working, just keep with n_hidden = 128 and you will be fine.
An LSTM keeps two pieces of information as it propagates through time:
A hidden state; which is the memory the LSTM accumulates using its (forget, input, and output) gates through time, and
The previous time-step output.
Tensorflow’s num_units is the size of the LSTM’s hidden state (which is also the size of the output if no projection is used).
To make the name num_units more intuitive, you can think of it as the number of hidden units in the LSTM cell, or the number of memory units in the cell.
Look at this awesome post for more clarity
Since I had some problems to combine the information from the different sources I created the graphic below which shows a combination of the blog post (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) and (https://jasdeep06.github.io/posts/Understanding-LSTM-in-Tensorflow-MNIST/) where I think the graphics are very helpful but an error in explaining the number_units is present.
Several LSTM cells form one LSTM layer. This is shown in the figure below. Since you are mostly dealing with data that is very extensive, it is not possible to incorporate everything in one piece into the model. Therefore, data is divided into small pieces as batches, which are processed one after the other until the batch containing the last part is read in. In the lower part of the figure you can see the input (dark grey) where the batches are read in one after the other from batch 1 to batch batch_size. The cells LSTM cell 1 to LSTM cell time_step above represent the described cells of the LSTM model (http://colah.github.io/posts/2015-08-Understanding-LSTMs/). The number of cells is equal to the number of fixed time steps. For example, if you take a text sequence with a total of 150 characters, you could divide it into 3 (batch_size) and have a sequence of length 50 per batch (number of time_steps and thus of LSTM cells). If you then encoded each character one-hot, each element (dark gray boxes of the input) would represent a vector that would have the length of the vocabulary (number of features). These vectors would flow into the neuronal networks (green elements in the cells) in the respective cells and would change their dimension to the length of the number of hidden units (number_units). So the input has the dimension (batch_size x time_step x features). The Long Time Memory (Cell State) and Short Time Memory (Hidden State) have the same dimensions (batch_size x number_units). The light gray blocks that arise from the cells have a different dimension because the transformations in the neural networks (green elements) took place with the help of the hidden units (batch_size x time_step x number_units). The output can be returned from any cell but mostly only the information from the last block (black border) is relevant (not in all problems) because it contains all information from the previous time steps.
I think it is confusing for TF users by the term "num_hidden". Actually it has nothing to do with the unrolled LSTM cells, and it just is the dimension of the tensor, which is transformed from the time-step input tensor to and fed into the LSTM cell.
This term num_units or num_hidden_units sometimes noted using the variable name nhid in the implementations, means that the input to the LSTM cell is a vector of dimension nhid (or for a batched implementation, it would a matrix of shape batch_size x nhid). As a result, the output (from LSTM cell) would also be of same dimensionality since RNN/LSTM/GRU cell doesn't alter the dimensionality of the input vector or matrix.
As pointed out earlier, this term was borrowed from Feed-Forward Neural Networks (FFNs) literature and has caused confusion when used in the context of RNNs. But, the idea is that even RNNs can be viewed as FFNs at each time step. In this view, the hidden layer would indeed be containing num_hidden units as depicted in this figure:
Source: Understanding LSTM
More concretely, in the below example the num_hidden_units or nhid would be 3 since the size of hidden state (middle layer) is a 3D vector.
I think this is a correctly answer for your question. LSTM always make confusion.
You can refer this blog for more detail Animated RNN, LSTM and GRU
Most LSTM/RNN diagrams just show the hidden cells but never the units of those cells. Hence, the confusion.
Each hidden layer has hidden cells, as many as the number of time steps.
And further, each hidden cell is made up of multiple hidden units, like in the diagram below. Therefore, the dimensionality of a hidden layer matrix in RNN is (number of time steps, number of hidden units).
The Concept of hidden unit is illustrated in this image https://imgur.com/Fjx4Zuo.
Following #SangLe answer, I made a picture (see sources for original pictures) showing cells as classically represented in tutorials (Source1: Colah's Blog) and an equivalent cell with 2 units (Source2: Raimi Karim 's post). Hope it will clarify confusion between cells/units and what really is the network architecture.