Dataset Preparation for LSTM (multiple variables) - tensorflow

I am struggling to conceptualize the correct way to prepare a time series dataset for LSTM training. My main concern is how to train the network to 'remember' the N previous steps. I have two possible approaches in mind, but I am not sure which one is correct.
I am really confused by this. I have tried both approaches (with only 1 variable, however) and both seem to give plausible results.
1.) The dataset should be in a tensor format like this:
X = [[var1_t_1, var2_t_1, var3_t_1],
     [var1_t_2, var2_t_2, var3_t_2],
     ...]
X.shape = [N, 3]
y = [[target_t_1],
     [target_t_2],
     ...]
y.shape = [N, 1]
During training the LSTM gets N inputs, one for each timestep, and returns N predictions that are used to compute the loss and update the weights.
The network on its own "creates memory" about previous time step values through its cell states. But for how many previous steps can it create memory, and is there any way to define this memory? (If possible, answer with a PyTorch example.)
2.) The dataset should already contain the previous timestep values as features, so a 3rd dimension is necessary, e.g.
X = [ [var1_t_1, var1_t_2,..., var1_t_10], [var2_t_1,..., var2_t_10], [var3_t_1,..., var3_t_10],
[var1_t_2, var1_t_3,..., var1_t_11], [var2_t_2,..., var2_t_11], [var3_t_2,..., var3_t_11],
...]
X.shape = [N-10, 10, 3]
y = [ [target_t_11],
[target_t_12],
... ]
y.shape = [N-10, 1]
In this way we define the number of previous steps the LSTM should try to remember. For the example above we "ask" the LSTM to remember at least 10 previous prices in order to make predictions.
Any help clarifying the concept is greatly appreciated. PyTorch code would be extremely welcome as well.
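To make the second layout concrete, here is a minimal NumPy sketch that builds the [N-10, 10, 3] windows and the matching [N-10, 1] targets described above. The helper name make_windows and the toy data are assumptions for illustration only. The resulting (samples, timesteps, features) array is the 3-D layout that Keras LSTM layers take as input, and that PyTorch's nn.LSTM takes when constructed with batch_first=True.

```python
import numpy as np

def make_windows(features, targets, window):
    """Slice a series into overlapping windows (hypothetical helper).

    features: array of shape [N, num_vars]
    targets:  array of shape [N]
    Returns X of shape [N - window, window, num_vars] and
    y of shape [N - window, 1], where y[i] is the target that
    immediately follows the i-th window.
    """
    features = np.asarray(features)
    targets = np.asarray(targets)
    n = len(features)
    # One window per start index; stacking adds the leading sample axis
    X = np.stack([features[i:i + window] for i in range(n - window)])
    # The target for window i is the value at time step i + window
    y = targets[window:].reshape(-1, 1)
    return X, y

# Toy data: N = 15 time steps, 3 variables
features = np.arange(45, dtype=float).reshape(15, 3)
targets = np.arange(15, dtype=float)
X, y = make_windows(features, targets, window=10)
print(X.shape, y.shape)
```

The window length only fixes how much history each training sample carries; it is a data-preparation choice, not a parameter of the LSTM layer itself.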

Related

Increase dimension of RNN LSTM cell in Keras

I want to increase the number of recurrent weights in an RNN or LSTM cell.
If you look at the code below, you will see that the LSTM cell's input shape is (2,1), which means 2 timesteps and 1 feature.
%tensorflow_version 2.x
import tensorflow as tf
m = tf.keras.models.Sequential()
lstm = tf.keras.layers.LSTM(1, use_bias=False)
input = tf.keras.Input(shape=(2,1))
m.add(input)
m.add(lstm)
lstm.get_weights()
The output is
[array([[ 0.878217 , 0.89324415, 0.404307 , -1.0542995 ]], dtype=float32),
array([[-0.24181306, -0.341401 , 0.65207034, 0.63227856]], dtype=float32)]
4 weights for each feature, and 4 weights for previous outputs
Now if I change Input shape like this
input = tf.keras.Input(shape=(2,1))
then the output of get_weights function will be like this:
[array([[-0.9725287 , -0.90078545, 0.97881985, -0.9623983 ],
[-0.9644511 , 0.90705967, 0.05965471, 0.32613564]], dtype=float32),
array([[-0.24867296, -0.22346373, -0.6410606 , 0.69084513]], dtype=float32)]
Now my question is: how do I increase the number of weights in the second array, which keeps the (4,1) shape?
The idea is that I want the RNN or LSTM to take not only the previous output (the t-1 moment) but also earlier values, such as the t-2, t-3, and t-4 moments.
Is there a way to do this in Keras with the TF backend?
I can't understand the change, I think you had a typo in your question, but:
Length - Time steps:
The number of time steps will never change the number of weights. The layer is "recurrent", meaning it will "loop" the time steps. It's not supposed to have different weights for each step.
The whole purpose of the layer is to apply the same operations over and over and over for each time step.
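As a rough illustration of that looping, here is a NumPy sketch of an LSTM-style forward pass. It is not the Keras implementation: the function name, the gate ordering, and the absence of biases are assumptions made for brevity. The point is only that the same two weight matrices are reused at every step, so their shapes never depend on the sequence length.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x, kernel, recurrent_kernel):
    """Run an LSTM-style loop over x of shape (timesteps, input_dim).

    kernel:           (input_dim, 4 * units), applied to each input step
    recurrent_kernel: (units, 4 * units), applied to the previous output
    Returns the final hidden state of shape (units,).
    """
    units = recurrent_kernel.shape[0]
    h = np.zeros(units)  # previous output
    c = np.zeros(units)  # cell state (the "memory")
    for x_t in x:  # the same weights are used at every time step
        z = x_t @ kernel + h @ recurrent_kernel
        i, f, g, o = np.split(z, 4)  # gate order is illustrative
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

rng = np.random.default_rng(0)
kernel = rng.normal(size=(1, 4))            # 1 input feature, 1 unit
recurrent_kernel = rng.normal(size=(1, 4))  # 1 unit

# Sequences of different lengths run through the very same weights
short = lstm_forward(np.ones((2, 1)), kernel, recurrent_kernel)
long = lstm_forward(np.ones((50, 1)), kernel, recurrent_kernel)
print(short.shape, long.shape)
```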
Input features:
Input features are the last dimension of the input. They define one dimension of the weights.
Units = Output features:
Output features, also the last dimension of the output, are another dimension of the weights.
Two types of kernels
The LSTM layers have two groups of kernels:
What they call simply kernels - with shape=(input_dim, self.units * 4)
What they call recurrent kernels - with shape=(self.units, self.units * 4)
The first group acts on the input data, they have shape considering the input features and the output features.
The second group acts on the inner states and has a shape that depends only on the output features (units).
From the source code:
self.kernel = self.add_weight(shape=(input_dim, self.units * 4),
                              name='kernel',
                              initializer=self.kernel_initializer,
                              regularizer=self.kernel_regularizer,
                              constraint=self.kernel_constraint)
self.recurrent_kernel = self.add_weight(
    shape=(self.units, self.units * 4),
    name='recurrent_kernel',
    initializer=self.recurrent_initializer,
    regularizer=self.recurrent_regularizer,
    constraint=self.recurrent_constraint)
The last array in the list:
The last array in the list of weights holds the 4 recurrent kernels, each with shape (1, 1), grouped into one array.
So:
You can increase the kernels with more input features. Transform Input((anything, 1)) into Input((anything, more)) for instance.
You can increase the kernels and the recurrent_kernels (and biases, when considered) with bigger output features. Transform LSTM(1, ...) into LSTM(more, ...)
Weights are independent of the length. It's even possible to have Input((None, 1)), meaning a variable length.
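The shape rules above can be summed up in a few lines of plain Python. The helper name lstm_weight_shapes is made up for this sketch; it only restates the (input_dim, units * 4) and (units, units * 4) shapes from the source code, plus a bias of length units * 4 when biases are enabled.

```python
def lstm_weight_shapes(input_dim, units, use_bias=True):
    """Return the expected shapes of an LSTM layer's weight arrays.

    Note that the number of time steps is not an argument at all.
    """
    shapes = [
        (input_dim, units * 4),  # kernel: acts on the input features
        (units, units * 4),      # recurrent kernel: acts on the inner state
    ]
    if use_bias:
        shapes.append((units * 4,))
    return shapes

# Input((2, 1)) with LSTM(1, use_bias=False), as in the question
print(lstm_weight_shapes(1, 1, use_bias=False))  # [(1, 4), (1, 4)]
# More input features grow only the first array
print(lstm_weight_shapes(2, 1, use_bias=False))  # [(2, 4), (1, 4)]
# More units grow both arrays
print(lstm_weight_shapes(2, 3, use_bias=False))  # [(2, 12), (3, 12)]
```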
Using more than just the last step
This should be automatic. LSTM layers are designed to have memory. The memory is an inner state that participates in all time steps. There are gates (the kernels) that decide how a new step will participate in this memory. Since all steps participate in the same memory, LSTM layer theoretically considers "all" time steps from the beginning.
So, you shouldn't really worry about this.
But if you do want this, there are maybe two ways. Don't know if they will bring any improvement, though.
One is to concatenate shifted inputs as features:
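import tensorflow as tf
from tensorflow.keras.layers import Lambda, LSTM

def pad_and_shift(x):
    steps = 3
    # left-pad the time axis with zeros so every step has steps-1 predecessors
    paddings = tf.constant([[0, 0], [steps - 1, 0], [0, 0]])
    x = tf.pad(x, paddings)
    # shifted views of the padded sequence, oldest first
    to_concat = [x[:, i:i - steps + 1] for i in range(steps - 1)]
    # append (not +=) so the tensor is added as a single list element
    to_concat.append(x[:, steps - 1:])
    return tf.concat(to_concat, axis=-1)

given_inputs = ....
out = Lambda(pad_and_shift)(given_inputs)
out = LSTM(units, ...)(out)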
The other involves editing the source code of the LSTM, which would be very complicated and probably not worth the effort.
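To see what the shifted-inputs idea produces, here is a NumPy analogue of pad_and_shift. The name pad_and_shift_np and the steps argument are assumptions for this sketch. Each time step ends up with the features of the current step and of the steps - 1 preceding steps, with zeros where no history exists.

```python
import numpy as np

def pad_and_shift_np(x, steps=3):
    """Concatenate each step with its steps-1 predecessors as features.

    x: array of shape (batch, timesteps, features)
    Returns an array of shape (batch, timesteps, features * steps).
    """
    timesteps = x.shape[1]
    # Zero-pad the start of the time axis
    padded = np.pad(x, ((0, 0), (steps - 1, 0), (0, 0)), mode="constant")
    # Slice i holds the values from (steps - 1 - i) steps ago
    shifted = [padded[:, i:i + timesteps] for i in range(steps)]
    return np.concatenate(shifted, axis=-1)

x = np.array([1.0, 2.0, 3.0, 4.0]).reshape(1, 4, 1)
out = pad_and_shift_np(x, steps=3)
print(out[0])
# Rows are [x_(t-2), x_(t-1), x_t]:
# [[0. 0. 1.]
#  [0. 1. 2.]
#  [1. 2. 3.]
#  [2. 3. 4.]]
```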

Using TensorFlow hessians for second partial derivative test

The second partial derivative test is a simple way to tell whether a critical point is a minimum, a maximum, or a saddle. I am currently toying with the idea of implementing such a test for a simple neural network in TensorFlow. The following set of weights is used for modeling an XOR neural network with 2 inputs, 1 hidden layer with 2 hidden units, and 1 output unit:
weights = {
    'h1': tf.Variable(np.empty([2, 2]), name="h1", dtype=tf.float64),
    'b1': tf.Variable(np.empty([2]), name="b1", dtype=tf.float64),
    'h2': tf.Variable(np.empty([2, 1]), name="h2", dtype=tf.float64),
    'b2': tf.Variable(np.empty([1]), name="b2", dtype=tf.float64)
}
Both the gradients and the hessians can now be obtained as follows:
gradients = tf.gradients(mse_op, [weights['h1'], weights['b1'], weights['h2'], weights['b2']])
hessians = tf.hessians(mse_op, [weights['h1'], weights['b1'], weights['h2'], weights['b2']])
Where mse_op is the MSE error of the network.
Both gradients and hessians compute just fine. The dimensionality of the gradients is equal to the dimensionality of the original inputs. The dimensionality of the hessians obviously differs.
The question: is it a good idea, and is it even possible, to conveniently compute the eigenvalues of the hessians generated by tf.hessians applied to the given set of weights? Will the eigenvalues be representative of what I think they represent? That is, if both positive and negative values are present overall, can we conclude that the point is a saddle point?
So far, I have tried the following out-of-the-box approach to calculate the eigenvalues of each of the hessians:
eigenvals1 = tf.self_adjoint_eigvals(hessians[0])
eigenvals2 = tf.self_adjoint_eigvals(hessians[1])
eigenvals3 = tf.self_adjoint_eigvals(hessians[2])
eigenvals4 = tf.self_adjoint_eigvals(hessians[3])
1, 2, and 4 work, but the 3rd one bombs out, complaining that Dimensions must be equal, but are 2 and 1 for 'SelfAdjointEigV2_2' (op: 'SelfAdjointEigV2') with input shapes: [2,1,2,1]. Should I just reshape the hessian somehow and carry on, or am I on the wrong track entirely?
After some fiddling, I have figured out that, given an n*m matrix of input variables, TensorFlow's tf.hessians produces an [n,m,n,m] tensor, which can be reshaped into a square [n*m, n*m] Hessian matrix as follows:
sq_hess = tf.reshape(hessians[0], [tf.size(weights['h1']), tf.size(weights['h1'])])
Further, one can calculate the eigenvalues of the resulting square hessian:
eigenvals = tf.self_adjoint_eigvals(sq_hess)
This might be trivial, but it took me some time to wrap my head around this. I believe the behaviour of tf.hessians is not very well documented. Once you put together the dimensionalities, though, everything makes sense!
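The reshape-then-eigenvalues step can be checked outside TensorFlow. Below is a NumPy sketch under stated assumptions: the function name classify_critical_point, the tolerance, and the hand-built Hessian of f(w) = w0^2 - w1^2 for a [2, 1] weight are all illustrative and not taken from the network above.

```python
import numpy as np

def classify_critical_point(hessian, tol=1e-9):
    """Classify a critical point from an [n, m, n, m] Hessian tensor."""
    # Flatten [n, m, n, m] into a square [n*m, n*m] matrix
    size = int(np.prod(hessian.shape[: hessian.ndim // 2]))
    square = hessian.reshape(size, size)
    # eigvalsh assumes a symmetric matrix, as a Hessian should be
    eigenvalues = np.linalg.eigvalsh(square)
    if np.all(eigenvalues > tol):
        return "minimum"
    if np.all(eigenvalues < -tol):
        return "maximum"
    if np.any(eigenvalues > tol) and np.any(eigenvalues < -tol):
        return "saddle"
    return "inconclusive"

# Hessian of f(w) = w0^2 - w1^2 stored in the [2, 1, 2, 1] layout
hess = np.zeros((2, 1, 2, 1))
hess[0, 0, 0, 0] = 2.0
hess[1, 0, 1, 0] = -2.0
print(classify_critical_point(hess))  # saddle
```

One caveat worth keeping in mind: tf.hessians returns one Hessian per variable, so each result covers only that variable's block and omits the cross terms between, say, h1 and h2. A negative eigenvalue in any block already rules out a minimum, but all blocks being positive definite does not by itself prove the full Hessian is.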

loss function: mean pairwise squared error

When I use the
tf.losses.mean_pairwise_squared_error(labels, predictions, weights=1.0, scope=None, loss_collection=tf.GraphKeys.LOSSES)
function, I am sure the data is right. However, the loss shown in TensorBoard is always zero. I have tried hard to find the cause, but I cannot work out why. The relevant part of my code follows. Am I using the wrong shape?
score_a = tf.reshape(score, [-1])  # shape: [1,39]
ys_a = tf.reshape(ys, [-1])  # shape: [1,39]
with tf.name_scope('loss'):
    loss = tf.losses.mean_pairwise_squared_error(score_a, ys_a)
To use tf.losses.mean_pairwise_squared_error(), labels and predictions should be of rank at least 2, because the first dimension will be used as batch_size. This means that you should not flatten score and ys with tf.reshape(..., [-1]). (I assume that they hold one batch of 39 entries.)
If labels and predictions are of rank 1, every data entry is a rank-0 tensor (a scalar), so there are no pairs within an entry and the result of tf.losses.mean_pairwise_squared_error() is always zero.
One more thing. In my opinion, the current implementation (2018-01-03) of tf.losses.mean_pairwise_squared_error() looks imperfect. For example, as shown in the API document of the function, put the following data as labels and predictions:
labels = tf.constant([[0., 0.5, 1.]])
predictions = tf.constant([[1., 1., 1.]])
tf.losses.mean_pairwise_squared_error(labels, predictions)
In this case, the result should be [(0-0.5)^2+(0-1)^2+(0.5-1)^2]/3=0.5 which is different from the result 0.3333333134651184 by tensorflow.
[Update] The bug (mentioned above) in tf.losses.mean_pairwise_squared_error() was fixed and applied from tensorflow 1.6.0.
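The arithmetic in the example can be reproduced in plain Python. This sketch follows the formula used in the answer (the squared differences between every pair of per-element errors, divided by the number of pairs) and is not TensorFlow's implementation. Averaging over the batch and returning zero when a sample has no pairs are assumptions of this sketch.

```python
from itertools import combinations

def mean_pairwise_squared_error(labels, predictions):
    """Pairwise squared error per sample, averaged over the batch.

    labels, predictions: lists of equal-length lists (rank 2).
    """
    per_sample = []
    for y, p in zip(labels, predictions):
        # Per-element error within one sample
        diffs = [pi - yi for yi, pi in zip(y, p)]
        pairs = list(combinations(diffs, 2))
        if not pairs:
            # A single-element sample has no pairs, hence zero loss
            per_sample.append(0.0)
            continue
        total = sum((a - b) ** 2 for a, b in pairs)
        per_sample.append(total / len(pairs))
    return sum(per_sample) / len(per_sample)

labels = [[0.0, 0.5, 1.0]]
predictions = [[1.0, 1.0, 1.0]]
print(mean_pairwise_squared_error(labels, predictions))  # 0.5

# Each sample holding one scalar gives zero, matching the rank-1 symptom
print(mean_pairwise_squared_error([[0.3], [0.7]], [[0.1], [0.9]]))  # 0.0
```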
If you are doing classification, you may have your label tensor with categorical value in one dimension. If that is the case, you will need to perform one hot transformation to match the shape of the predictions before calling tf.losses.mean_pairwise_squared_error(). You do not need to reshape the predictions.
labels_a = tf.one_hot(labels, 2)
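For readers who want to check the shapes without a TensorFlow session, a NumPy equivalent of the one-hot step might look like the following. The function name one_hot and the sample labels are assumptions for illustration.

```python
import numpy as np

def one_hot(labels, depth):
    """Turn integer class labels of shape [batch] into [batch, depth]."""
    labels = np.asarray(labels)
    # Row k of the identity matrix is the one-hot vector for class k
    return np.eye(depth, dtype=np.float32)[labels]

labels = [0, 1, 1]
labels_a = one_hot(labels, 2)
print(labels_a.shape)  # (3, 2)
```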

Weighted random tensor select in tensorflow

I have a list of tensors and a list representing their probability mass function. On each session run, how can I tell TensorFlow to randomly pick one tensor according to that probability mass function?
I see a few possible ways to do that:
One is to pack the list of tensors into a tensor of one rank higher and select one with slice and squeeze, based on a TensorFlow variable to which I assign the chosen index. What would the performance penalty be for this approach? Would TensorFlow evaluate the other, unneeded tensors?
Another is to use tf.case in a similar fashion, picking one tensor out of many. Same question: what is the performance penalty, given that I plan on having quite a few (~100s) conditional statements per graph run?
Is there a better way of doing this?
I think you should use tf.multinomial(logits, num_samples).
Say you have:
a batch of tensors of shape [batch_size, num_features]
a probability distribution of shape [batch_size]
You want to output:
1 example from the batch of tensors, of shape [1, num_features]
batch_tensors = tf.constant([[0., 1., 2.], [3., 4., 5.]]) # shape [batch_size, num_features]
probabilities = tf.constant([0.7, 0.3]) # shape [batch_size]
# we need to convert probabilities to log_probabilities and reshape it to [1, batch_size]
rescaled_probas = tf.expand_dims(tf.log(probabilities), 0) # shape [1, batch_size]
# We can now draw one example from the distribution (we could draw more)
indice = tf.multinomial(rescaled_probas, num_samples=1)
output = tf.gather(batch_tensors, tf.squeeze(indice, [0]))
What's the performance penalty since I plan on having quite a few(~100s) conditional statements per one graph run?
If you want to do multiple draws, you should do it in one run by increasing the parameter num_samples. You can then gather these num_samples examples in one run with tf.gather.
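The same "sample indices, then gather" pattern can be sketched in NumPy, which may help when checking the shapes. The function name sample_rows and the seed argument are assumptions of this sketch. Unlike tf.multinomial, NumPy's Generator.choice takes probabilities directly, so no log transform is needed here.

```python
import numpy as np

def sample_rows(batch, probabilities, num_samples, seed=None):
    """Draw num_samples rows from batch according to probabilities."""
    rng = np.random.default_rng(seed)
    # All draws happen in a single call, mirroring num_samples above
    indices = rng.choice(len(batch), size=num_samples, p=probabilities)
    # Fancy indexing plays the role of tf.gather
    return batch[indices]

batch_tensors = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])
picked = sample_rows(batch_tensors, [0.7, 0.3], num_samples=5, seed=0)
print(picked.shape)  # (5, 3)
```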

How can I determine several labels in parallel (in a neural network) by using a softmax-output-layer in tensorflow?

As part of the project work for my master's studies, I am implementing a neural network using Google's TensorFlow library. At the output layer of my feed-forward network I would like to determine several labels in parallel, using softmax as the output activation function.
Specifically, I want the output to be a vector that looks like this:
vec = [0.1, 0.8, 0.1, 0.3, 0.2, 0.5]
Here the first three numbers are the probabilities of the three classes of the first classification and the other three numbers are the probabilities of the three classes of the second classification. So in this case I would say that the labels are:
[ class2 , class3 ]
In a first attempt I tried to implement this by first reshaping the (1x6) vector to a (2x3) matrix with tf.reshape(), then applying the softmax function to the matrix with tf.nn.softmax(), and finally reshaping the matrix back to a vector. Unfortunately, due to the reshaping, the Gradient-Descent-Optimizer has problems calculating the gradient, so I tried something different.
What I do now is take the (1x6) vector and multiply it by a matrix that has a (3x3) identity matrix in the upper part and a (3x3) zero matrix in the lower part. With this I extract the first three entries of the vector. Then I can apply the softmax function and bring the result back into the old (1x6) form by another matrix multiplication. This has to be repeated for the other three vector entries as well.
outputSoftmax = tf.nn.softmax( vec * [[1,0,0],[0,1,0],[0,0,1],[0,0,0],[0,0,0],[0,0,0]] ) * tf.transpose( [[1,0,0],[0,1,0],[0,0,1],[0,0,0],[0,0,0],[0,0,0]] )
+ tf.nn.softmax( vec * [[0,0,0],[0,0,0],[0,0,0],[1,0,0],[0,1,0],[0,0,1]] ) * tf.transpose( [[0,0,0],[0,0,0],[0,0,0],[1,0,0],[0,1,0],[0,0,1]] )
It works so far, but I don't like this solution.
In my real problem I have to determine not just two labels at a time but 91, so I would have to repeat the procedure from above 91 times.
Does anyone have a solution for obtaining the desired vector, where the softmax function is applied to only three entries at a time, without writing the "same" code 91 times?
You could apply the tf.split function to obtain 91 tensors (one for each label), then apply softmax to each of them.
# argument order for TensorFlow >= 1.0: tf.split(value, num_splits, axis)
classes_split = tf.split(all_in_one, 91, axis=0)
for c in classes_split:
    softmax_class = tf.nn.softmax(c)
    # use softmax_class to compute some loss, add it to overall loss
Or, instead of computing the loss directly, you could concatenate them together again:
# argument order for TensorFlow >= 1.0: tf.split(value, num_splits, axis)
classes_split = tf.split(all_in_one, 91, axis=0)
# softmax each split individually
classes_split_softmaxed = [tf.nn.softmax(c) for c in classes_split]
# Concatenate again (TensorFlow >= 1.0: tf.concat(values, axis))
all_in_one_softmaxed = tf.concat(classes_split_softmaxed, axis=0)
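For checking the numbers, the group-wise softmax can also be written without a loop by reshaping so that each group becomes one row. The NumPy sketch below shows only the arithmetic; the name grouped_softmax is made up. In TensorFlow the analogous calls would be tf.reshape and tf.nn.softmax (which normalizes over the last axis by default), but whether that avoids the gradient problem reported in the question has not been verified here.

```python
import numpy as np

def grouped_softmax(vec, group_size):
    """Apply softmax independently to consecutive groups of a flat vector."""
    groups = np.asarray(vec, dtype=float).reshape(-1, group_size)
    # Subtract the per-group maximum for numerical stability
    shifted = groups - groups.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    # Back to the original flat layout
    return probs.reshape(-1)

# Two labels with three classes each, as in the question
out = grouped_softmax([1.0, 2.0, 3.0, 1.0, 2.0, 3.0], group_size=3)
print(out[:3].sum(), out[3:].sum())

# 91 labels with three classes each need no extra code
big = grouped_softmax(np.zeros(91 * 3), group_size=3)
print(big.shape)  # (273,)
```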