Multivariate binary sequence prediction with LSTM - tensorflow

I'm working on a sequence forecasting problem and I don't have much experience in this area, so some of the below questions might be naive.
FYI: I've created a follow-up question with a focus on CRFs here
I have the following problem:
I would like to forecast a binary sequence for multiple, non-independent variables.
Inputs:
I have a dataset with the following variables:
Timestamps
Groups A and B
Binary signal corresponding to each group at a particular timestamp
Additionally, suppose the following:
We can extract additional attributes from the timestamps (e.g. hour of day) which can be used as external predictors
We believe that groups A and B are not independent therefore it might be optimal to model their behaviour jointly
binary_signal_group_A and binary_signal_group_B are the 2 non-independent variables that I would like to forecast using (1) their past behaviour and (2) additional information extracted from each timestamp.
What I've done so far:
# required libraries
import re
import numpy as np
import pandas as pd
from keras import Sequential
from keras.layers import LSTM
data_length = 18 # how long our data series will be
shift_length = 3 # how long of a sequence do we want
df = (pd.DataFrame  # create a sample dataframe
      .from_records(np.random.randint(2, size=[data_length, 3]))
      .rename(columns={0: 'a', 1: 'b', 2: 'extra'}))
# NOTE: the 'extra' variable refers to a generic predictor such as for example 'is_weekend' indicator, it doesn't really matter what it is
# shift so that our sequences are in rows (assuming data is sorted already)
colrange = df.columns
shift_range = [_ for _ in range(-shift_length, shift_length+1) if _ != 0]
for c in colrange:
    for s in shift_range:
        if not (c == 'extra' and s > 0):
            charge = 'next' if s > 0 else 'last'  # 'next' variables are what we want to predict
            formatted_s = '{0:02d}'.format(abs(s))
            new_var = '{var}_{charge}_{n}'.format(var=c, charge=charge, n=formatted_s)
            df[new_var] = df[c].shift(s)
# drop unnecessary variables and trim missings generated by the shift operation
df.dropna(axis=0, inplace=True)
df.drop(colrange, axis=1, inplace=True)
df = df.astype(int)
df.head() # check it out
# a_last_03 a_last_02 ... extra_last_02 extra_last_01
# 3 0 1 ... 0 1
# 4 1 0 ... 0 0
# 5 0 1 ... 1 0
# 6 0 0 ... 0 1
# 7 0 0 ... 1 0
# [5 rows x 15 columns]
# separate predictors and response
response_df_dict = {}
for g in ['a', 'b']:
    response_df_dict[g] = df[[c for c in df.columns if 'next' in c and g in c]]
# reformat for LSTM
# the response for every row is a matrix with depth of 2 (the number of groups) and width = shift_length
# the predictors are of the same dimensions except the depth is not 2 but the number of predictors that we have
response_array_list = []
col_prefix = set([re.sub(r'_\d+$', '', c) for c in df.columns if 'next' not in c])
for c in col_prefix:
    current_array = df[[z for z in df.columns if z.startswith(c)]].values
    response_array_list.append(current_array)
# reshape into samples (1), time stamps (2) and channels/variables (0)
response_array = np.array([response_df_dict['a'].values,response_df_dict['b'].values])
response_array = np.reshape(response_array, (response_array.shape[1], response_array.shape[2], response_array.shape[0]))
predictor_array = np.array(response_array_list)
predictor_array = np.reshape(predictor_array, (predictor_array.shape[1], predictor_array.shape[2], predictor_array.shape[0]))
# feed into the model
model = Sequential()
model.add(LSTM(8, input_shape=(predictor_array.shape[1],predictor_array.shape[2]), return_sequences=True)) # the number of neurons here can be anything
model.add(LSTM(2, return_sequences=True)) # should I use an activation function here? the number of neurons here must be equal to the # of groups we are predicting
model.summary()
# _________________________________________________________________
# Layer (type) Output Shape Param #
# =================================================================
# lstm_62 (LSTM) (None, 3, 8) 384
# _________________________________________________________________
# lstm_63 (LSTM) (None, 3, 2) 88
# =================================================================
# Total params: 472
# Trainable params: 472
# Non-trainable params: 0
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # is it valid to use crossentropy and accuracy as metric?
model.fit(predictor_array, response_array, epochs=10, batch_size=1)
model_preds = model.predict_classes(predictor_array) # not gonna worry about train/test split here
model_preds.shape # should return (12, 3, 2) or (# of records, # of timestamps, # of groups which are a and b)
# (12, 3)
model_preds
# array([[1, 0, 0],
# [0, 0, 0],
# [1, 0, 0],
# [0, 0, 0],
# [1, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [0, 0, 0],
# [1, 0, 0],
# [0, 0, 0]])
Questions:
The main question here is this: how do I get this working so that the model would forecast the next N sequences for both groups?
Additionally, I would like to ask the following questions:
Groups A and B are expected to be cross-correlated, however, is it valid to attempt to output both A and B sequences by a single model or should I fit 2 separate models, one predicting A, the other one predicting B but both using historical A and B data as inputs?
While my last layer in the model is an LSTM of shape (None, 3, 2), the prediction output is of shape (12, 3) when I would have expected it to be (12, 2) -- am I doing something wrong here and if so, how would I fix this?
As far as the output LSTM layer is concerned, would it be a good idea to use an activation function here, such as sigmoid? Why/why not?
Is it valid to use a classification type loss (binary cross-entropy) and metrics (accuracy) for optimising a sequence?
Is an LSTM model an optimal choice here? Does anyone think that a CRF or some HMM-type model would work better here?
Many thanks!

I will answer all questions sequentially.
how do I get this working so that the model would forecast the next N
sequences for both groups?
I would suggest two modifications to your model.
The first is using sigmoid activation for the last layer.
Why? Consider the binary cross-entropy loss function (equation borrowed from here):

L = -(y * log(p) + (1 - y) * log(1 - p))

where L is the calculated loss, p is the network prediction and y is the target value. The loss is only defined for p in the open interval (0, 1).
If p falls outside this open interval, the loss is undefined. The default activation of the LSTM layer in Keras is tanh, whose output range is (-1, 1). This means the raw output of the model is not suitable for binary cross-entropy loss; if you try to train the model this way, you might end up with nan for the loss.
The second modification (which implements the first) is to add a sigmoid activation before the output. For this, you have three options:
Add a Dense layer with sigmoid activation between your last LSTM layer and the output.
Or change the activation of the last LSTM layer to sigmoid.
Or add an Activation layer with sigmoid activation after the output layer.
Even though all three would work, I would suggest using a Dense layer with sigmoid activation, because it almost always works better.
Now the model with suggested changes would be
from keras.layers import Dense, TimeDistributed  # additional layers used below

model = Sequential()
model.add(LSTM(8, input_shape=(predictor_array.shape[1], predictor_array.shape[2]), return_sequences=True))
model.add(LSTM(2, return_sequences=True))
model.add(TimeDistributed(Dense(2, activation="sigmoid")))
model.summary()
... is it valid to attempt to output both A and B sequences by a single
model or should I fit 2 separate models ... ?
Ideally, both cases could work. But recent studies such as this one show that the former case (using a single model for both groups) tends to perform better. The approach is generally called Multi-Task Learning. The idea behind Multi-Task Learning is very broad; for simplicity, it can be thought of as adding an inductive bias by forcing the model to learn hidden representations that are common to multiple tasks.
... the prediction output is of shape (12, 3) when I would have expected
it to be (12, 2) -- am I doing something wrong here ... ?
You are getting this because you are using the predict_classes method. Unlike predict, predict_classes returns the index of the maximum value along the channels axis (in your case, the third axis). As explained above, if you use a sigmoid activation for the last layer and replace predict_classes with predict, you will get what you expect.
As far as the output LSTM layer is concerned, would it be a good idea
to use an activation function here, such as sigmoid? Why/why not?
I hope I've explained this above. The answer is YES.
Is it valid to use a classification type loss (binary cross-entropy)
and metrics (accuracy) for optimizing a sequence?
Since your targets are binary signals (i.e. Bernoulli-distributed), yes, it is valid to use binary cross-entropy loss and accuracy metrics. This answer gives more details on why binary cross-entropy is valid for this type of target variable.
Is an LSTM model an optimal choice here? Does anyone think that a CRF
or some HMM-type model would work better here?
This depends on the data available and the complexity of the model you choose. CRFs and HMMs are simpler and work better when the available data is small. But if the dataset is large, an LSTM will almost always outperform both. My suggestion: if you have a lot of data, use an LSTM; if you have little data or are looking for a simple model, use a CRF or HMM.


Does Keras masking impact weight updates and loss calculations?

I'm working with time series, and understand that keras.layers.Masking and keras.layers.Embedding are useful to create a mask value in the network which indicates timesteps to 'skip'. The mask value is propagated throughout the network to be used by any layers that support it.
The Keras documentation doesn't specify any further impacts of the mask value. My expectation is that the mask would be applied through all functions in model training and evaluation, but I don't see any evidence in support of this.
Does the mask value impact back-propagation?
Does the mask value impact the loss function or the metrics?
Would it be wise or foolish to use the sample_weight parameter in model.compile() to tell Keras to 'ignore' the masked timesteps in the loss function?
I've performed some experiments to answer these questions.
Here's my sample code:
import tensorflow as tf
import tensorflow.keras as keras
import numpy as np
# Fix the random seed for repeatable results
np.random.seed(5)
tf.random.set_seed(5)
x = np.array([[[3, 0], [1, 4], [3, 2], [4, 0], [4, 5]],
              [[1, 2], [3, 1], [1, 3], [5, 1], [3, 5]]], dtype='float64')
# Choose some values to be masked out
mask = np.array([[False, False, True, True, True],
                 [True,  True,  False, False, True]])  # True: keep. False: ignore
samples, timesteps, features_in = x.shape
features_out = 1
y_true = np.random.rand(samples, timesteps, features_out)
# y_true[~mask] = 1e6 # TEST MODIFICATION
# Apply the mask to x
mask_value = 0 # Set to any value
x[~mask] = [mask_value] * features_in
input_tensor = keras.Input(shape=(timesteps, features_in))
this_layer = input_tensor
this_layer = keras.layers.Masking(mask_value=mask_value)(this_layer)
this_layer = keras.layers.Dense(10)(this_layer)
this_layer = keras.layers.Dense(features_out)(this_layer)
model = keras.Model(input_tensor, this_layer)
model.compile(loss='mae', optimizer='adam')
model.fit(x=x, y=y_true, epochs=100, verbose=0)
y_pred = model.predict(x)
print("y_pred = ")
print(y_pred)
print("model weights = ")
print(model.get_weights()[1])
print(f"{'model.evaluate':>14s} = {model.evaluate(x, y_true, verbose=0):.5f}")
# See if the loss computed by model.evaluate() is equal to the masked loss
error = y_true - y_pred
masked_loss = np.abs(error[mask]).mean()
unmasked_loss = np.abs(error).mean()
print(f"{'masked loss':>14s} = {masked_loss:.5f}")
print(f"{'unmasked loss':>14s} = {unmasked_loss:.5f}")
Which outputs
y_pred =
[[[-0.28896046]
[-0.28896046]
[ 0.1546848 ]
[-1.1596009 ]
[ 1.5819632 ]]
[[ 0.59000516]
[-0.39362794]
[-0.28896046]
[-0.28896046]
[ 1.7996234 ]]]
model weights =
[-0.06686568 0.06484845 -0.06918766 0.06470951 0.06396528 0.06470013
0.06247645 -0.06492618 -0.06262784 -0.06445726]
model.evaluate = 0.60170
masked loss = 1.00283
unmasked loss = 0.90808
mask and loss calculation
Surprisingly, the 'mae' (mean absolute error) loss calculation does NOT exclude the masked timesteps from the calculation. Instead, it assumes that these timesteps have zero loss - a perfect prediction. Therefore, every masked timestep actually reduces the calculated loss!
To explain in more detail: the above sample code input x has 10 timesteps. 4 of them are removed by the mask, so 6 valid timesteps remain. The 'mean absolute error' loss calculation sums the losses for the 6 valid timesteps, then divides by 10 instead of dividing by 6. This looks like a bug to me.
output values are masked
Output values of masked timesteps do not impact the model training or evaluation (as it should be).
This can be easily tested by setting:
y_true[~mask] = 1e6
The model weights, predictions and losses remain exactly the same.
input values are masked
Input values of masked timesteps do not impact the model training or evaluation (as it should be).
Similarly, I can change mask_value from 0 to any other number, and the resulting model weights, predictions, and losses remain exactly the same.
In summary:
Q1: Effectively yes - the mask impacts the loss function, which is used through backpropagation to update the weights.
Q2: Yes, but the mask impacts the loss in an unexpected way.
Q3: Initially foolish - the mask should already be applied to the loss calculation. However, perhaps sample_weights could be valuable to correct the unexpected method of the loss calculation...
Note that I'm using Tensorflow 2.7.0.
I have been struggling with this on a related issue, namely implementing a mask for a multi-output model where some samples are missing labels for different outputs. Here, I construct features, labels, and sample_weights from a dataset; labels and sample_weights are dictionaries with matching keys, and the weights are 0 or 1 for each sample, indicating whether it should contribute to the relevant loss.
I had hoped that sample_weights would contribute to the loss the way they do when I pass the metric equivalents of the losses via weighted_metrics in model.compile.
I've found that sample_weight does not seem to address this problem. I can tell from the training metrics that the task_loss values differ from the task_metric values when sample weights are used.
I've given up on this and decided to use masking instead. The masked loss values are low in your case (and in mine) because TensorFlow sees the modelled output as perfect; I hope this means it sees no gradient for these points, so the parameters aren't tuned in response.

How to map an array of values for y_true to a single value in order to compare to y_pred in a Tensorflow loss function (Tensorflow/Tensorflow Quantum)

I am trying to implement the circuits listed on page 8 in the following paper: https://arxiv.org/pdf/1905.10876.pdf using Tensorflow Quantum (TFQ). I have done so previously for a subset of circuits using Qiskit, and ended up with accuracies that can be found on page 14 in the following paper: https://arxiv.org/pdf/2003.09887.pdf. In TFQ, my accuracies are way down. I think this delta originates because in TFQ, I only used 1 observable Pauli Z operator on the first qubit, and the circuits do not seem to "transfer all knowledge" to the first qubit. I place this in quotes, because I am sure there is a better way to describe this. In Qiskit on the other hand, 16 states (4^2) get mapped to 2 states.
My question: how can I get my accuracies back up?
Potential answer a): some method of "transferring all information" to a single qubit, potentially an ancilla qubit, and doing a readout on this qubit.
Potential answer b) placing a Pauli Z observable on all qubits (4 in total), mapping half of the 16 states to a label 0 and the other half to a label 1. I attempted this in the code below.
My attempt at answer b):
I have a Tensorflow Quantum (TFQ) circuit implemented in Tensorflow. The circuit has multiple observables, which I try to bring together in my loss function. I prefer to use as many standard components as possible, but need to map my quantum states to a label in order to determine the loss. I think what I am trying to achieve is not unique to TFQ. I define my model in the following way:
def circuit():
    data_qubits = cirq.GridQubit.rect(4, 1)
    circuit = cirq.Circuit()
    ...
    return circuit, [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                     cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])]
model_circuit, model_readout = circuit()
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(), dtype=tf.string),
    # The PQC layer returns the expected value of the readout gate, range [-1, 1].
    tfq.layers.PQC(model_circuit, model_readout),
])
# compile model
model.compile(
    loss=loss_mse,
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
    metrics=[])
In loss_mse (mean squared error), I receive a (32, 4) tensor for y_pred. One row could look like
[-0.2, 0.33, 0.6, 0.3]
This would have to be first mapped from [-1,1] to a binarized version of [0,1], so that it looks like:
[0, 1, 1, 1]
Now, a table lookup needs to happen, which tells whether this combination is 0 or 1. Finally, the regular (y_true - y_pred)^2 can be computed on that row, followed by an np.sum over all rows. I tried to implement this:
def get_label(measurement):
    if measurement == [0, 0, 0, 0]: return 0
    ...
    elif measurement == [1, 1, 1, 1]: return 0
    else: return -1

def py_call(y_true, y_pred):
    # cast tensor to numpy
    y_pred_np = np.asarray(y_pred)
    loss = np.zeros(len(y_pred))  # could be a single variable with += within the loop
    # evaluate all 32 samples
    for pred in range(len(y_pred_np)):
        # map, binarize and look up
        y_labelled = get_label([0 if y < 0 else 1 for y in y_pred_np[pred]])
        # regular loss comparison
        loss[pred] = (y_labelled - y_true[pred])**2
    # reduce
    loss = np.sum(loss) / len(y_true)
    return loss

@tf.function
def loss_mse(y_true, y_pred):
    loss = tf.py_function(py_call, inp=[y_true, y_pred], Tout=[tf.float64])
    return loss
However, the system still appears to expect a (32, 4) tensor. I would have thought I could simply provide a single loss value (a float). My question: how can I map multiple values of y_true to a single number in order to compare it with a single y_pred value in a TensorFlow loss function?
So it looks like there are a couple of things going on here. To answer your question
how can I map multiple values for y_true to a single number in order to compare with a single y_pred value in a tensorflow loss function ?
What you might want is some kind of tf.reduce_* function like tf.reduce_mean or tf.reduce_sum. This function will allow you to apply this reduction operation accross a given tensor axis allowing you to convert a tensor of shape (32, 4) to a tensor of shape (32,) or a tensor of shape (4,). Here is a quick snippet:
@tf.function
def my_loss(y_true, y_pred):
    # y_true is shape (32, 4)
    # y_pred is shape (32, 4)

    # Scale from [-1, 1] to [0, 1]
    y_true += 1
    y_true /= 2
    y_pred += 1
    y_pred /= 2

    # These are now both (32,), with the mean taken along the second axis.
    reduced_true = tf.reduce_mean(y_true, axis=1)
    reduced_pred = tf.reduce_mean(y_pred, axis=1)

    # Now a scalar loss.
    loss = tf.reduce_mean((reduced_true - reduced_pred) ** 2)
    return loss
Now, the above isn't exactly what you want, since it's not entirely clear to me what exact reduction rules you have in mind for taking something like [0,1,1,1] -> 0 vs [0,0,0,0] -> 1; for a table-lookup style mapping, see the sketch below.
Another thing I will also mention is that if you want JUST the sum of these Pauli Operators in cirq that you have term by term in the list [cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]), cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])] and all you care about is the final sum of these expectations, you could just as easily do:
my_operator = sum([cirq.Z(data_qubits[0]), cirq.Z(data_qubits[1]),
                   cirq.Z(data_qubits[2]), cirq.Z(data_qubits[3])])
print(my_operator)
Which should give something like:
cirq.PauliSum(cirq.LinearDict({frozenset({(cirq.GridQubit(0, 0), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 1), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 2), cirq.Z)}): (1+0j), frozenset({(cirq.GridQubit(0, 3), cirq.Z)}): (1+0j)}))
Which is also compatible as a readout operation in the PQC layer. Lastly, I would recommend reading through some of the snippets and examples here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/PQC
and here:
https://www.tensorflow.org/quantum/api_docs/python/tfq/layers/Expectation
Which give a pretty good description of how the input and output signatures of the functions look as well as the shapes you can expect from them.

What would be the output from tensorflow dense layer if we assign itself as input and output while making a neural network?

I have been going through the implementation of neural network in openAI code for any Vanilla Policy Gradient (As a matter of fact, this part is used nearly everywhere). The code looks something like this :
def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = action_space.n
    logits = mlp(x, list(hidden_sizes) + [act_dim], activation, None)
    logp_all = tf.nn.log_softmax(logits)
    pi = tf.squeeze(tf.random.categorical(logits, 1), axis=1)
    logp = tf.reduce_sum(tf.one_hot(a, depth=act_dim) * logp_all, axis=1)
    logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=act_dim) * logp_all, axis=1)
    return pi, logp, logp_pi
and this multi-layered perceptron network is defined as follows :
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(inputs=x, units=h, activation=activation)
    return tf.layers.dense(inputs=x, units=hidden_sizes[-1], activation=output_activation)
My question is: what is the return from this mlp function? I mean the structure or shape. Is it an N-dimensional tensor? If so, how is it given as an input to tf.random.categorical? If not, and it just has the shape [hidden_layer2, output], then what happened to the other layers? According to the website description of random_categorical, it only takes a 2-D input. The complete code of OpenAI's VPG algorithm can be found here. The mlp is implemented here. I would be highly grateful if someone could just tell me what mlp_categorical_policy() is doing.
Note: The hidden size is [64, 64], the action dimension is 3
Thanks and cheers
Note that this is a discrete action space - there are action_space.n different possible actions at every step, and the agent chooses one.
To do this, the MLP returns the logits (a function of the probabilities) of the different actions. This is specified in the code by + [act_dim], which appends the action count (action_space.n) as the size of the final MLP layer. Note that the last layer of an MLP is the output layer; the input layer is not specified in TensorFlow, it is inferred from the inputs.
tf.random.categorical takes the logits and samples a policy action pi from them, which is returned as a number.
mlp_categorical_policy also returns logp, the log probability of the action a (used to assign credit), and logp_pi, the log probability of the policy action pi.
It seems your question is more about the return from the mlp.
The mlp creates a series of fully connected layers in a loop. In each iteration, it creates a new layer using the previous layer x as input and assigns its output back to x, with the line x = tf.layers.dense(inputs=x, units=h, activation=activation).
So the output is not the same as the input; on each iteration x is overwritten with the value of the new layer. This is the same kind of coding trick as x = x + 1, which increments x by 1. It effectively chains the layers together.
The output of tf.layers.dense is a tensor of size [:, h], where : is the batch dimension (and can usually be ignored). The creation of the last layer happens outside the loop; it can be seen that the number of nodes in this layer is act_dim (so its shape is [:, 3]). You can check the shape by doing this:
import tensorflow.compat.v1 as tf
import numpy as np
def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)
obs = np.array([[1.0,2.0]])
logits = mlp(obs, [64, 64, 3], tf.nn.relu, None)
print(logits.shape)
result: TensorShape([1, 3])
Note that the observation in this case is [1.,2.], it is nested inside a batch of size 1.

Increase dimension of RNN LSTM cell in Keras

I want to increase the number of recurrent weights in an RNN or LSTM cell.
If you look at the code below, you will see that the LSTM cell's input shape is (2, 1), which means 2 timesteps and 1 feature.
%tensorflow_version 2.x
import tensorflow as tf
m = tf.keras.models.Sequential()
lstm = tf.keras.layers.LSTM(1, use_bias=False)
input = tf.keras.Input(shape=(2,1))
m.add(input)
m.add(lstm)
lstm.get_weights()
The output is
[array([[ 0.878217 , 0.89324415, 0.404307 , -1.0542995 ]], dtype=float32),
array([[-0.24181306, -0.341401 , 0.65207034, 0.63227856]], dtype=float32)]
4 weights for each feature, and 4 weights for previous outputs
Now if I change Input shape like this
input = tf.keras.Input(shape=(2,1))
then the output of get_weights function will be like this:
[array([[-0.9725287 , -0.90078545, 0.97881985, -0.9623983 ],
[-0.9644511 , 0.90705967, 0.05965471, 0.32613564]], dtype=float32),
array([[-0.24867296, -0.22346373, -0.6410606 , 0.69084513]], dtype=float32)]
Now my question is: how do I increase the number of weights in the second array, which keeps the (4, 1) shape?
The idea is that I want the RNN or LSTM to take not only the previous output (moment t-1) but also earlier values, such as moments t-2, t-3, t-4.
Is there a way to do it in Keras with the TF backend?
I can't understand the change, I think you had a typo in your question, but:
Length - Time steps:
The number of time steps will never change the number of weights. The layer is "recurrent", meaning it will "loop" the time steps. It's not supposed to have different weights for each step.
The whole purpose of the layer is to apply the same operations over and over and over for each time step.
Input features:
Input features are the last dimension of the input. They define one dimension of the weights.
Units = Output features:
Output features, also the last dimension of the output, are another dimension of the weights.
Two types of kernels
The LSTM layers have two groups of kernels:
What they call simply kernels - with shape=(input_dim, self.units * 4)
What they call recurrent kernels - with shape=(self.units, self.units * 4)
The first group acts on the input data, they have shape considering the input features and the output features.
The second group acts on inner states and have shapes considering only the output features (units).
From the source code:
self.kernel = self.add_weight(shape=(input_dim, self.units * 4),
                              name='kernel',
                              initializer=self.kernel_initializer,
                              regularizer=self.kernel_regularizer,
                              constraint=self.kernel_constraint)
self.recurrent_kernel = self.add_weight(
    shape=(self.units, self.units * 4),
    name='recurrent_kernel',
    initializer=self.recurrent_initializer,
    regularizer=self.recurrent_regularizer,
    constraint=self.recurrent_constraint)
The last array in the list:
The last array in the list of weights is the 4 recurrent kernels, each of shape (1, 1), grouped into a single (1, 4) array.
So:
You can increase the kernels with more input features. Transform Input((anything, 1)) into Input((anything, more)) for instance.
You can increase the kernels and the recurrent_kernels (and biases, when considered) with bigger output features. Transform LSTM(1, ...) into LSTM(more, ...)
Weights are independent of the length. It's even possible to have Input((None, 1)), meaning a variable length.
Using more than just the last step
This should be automatic. LSTM layers are designed to have memory. The memory is an inner state that participates in all time steps. There are gates (the kernels) that decide how a new step will participate in this memory. Since all steps participate in the same memory, LSTM layer theoretically considers "all" time steps from the beginning.
So, you shouldn't really worry with this.
But if you do want this, there are maybe two ways. Don't know if they will bring any improvement, though.
One is to concatenate shifted inputs as features:
import tensorflow as tf
from tensorflow.keras.layers import Lambda, LSTM

def pad_and_shift(x):
    steps = 3
    paddings = tf.constant([[0, 0], [steps - 1, 0], [0, 0]])
    x = tf.pad(x, paddings)
    to_concat = [x[:, i:i - steps + 1] for i in range(steps - 1)]  # shifted copies
    to_concat.append(x[:, steps - 1:])                             # the unshifted series
    return tf.concat(to_concat, axis=-1)
given_inputs = ....
out = Lambda(pad_and_shift)(given_inputs)
out = LSTM(units, ...)(out)
The other involves editing the source code of the LSTM, which would be very complicated and probably not very worthy.

Confused about tensorflow variable shapes

I am learning tensorflow using the tensorflow machine learning cookbook (https://github.com/nfmcclure/tensorflow_cookbook). I am currently on the NLP chapter (07). I am very confused about how one decides on the dimensions of the tensorflow variables. For example, in the bag of words example they use:
# Create variables for logistic regression
A = tf.Variable(tf.random_normal(shape=[embedding_size,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
# Initialize placeholders
x_data = tf.placeholder(shape=[sentence_size], dtype=tf.int32)
y_target = tf.placeholder(shape=[1, 1], dtype=tf.float32)
and in the tf-idf example they use:
# Create variables for logistic regression
A = tf.Variable(tf.random_normal(shape=[max_features,1]))
b = tf.Variable(tf.random_normal(shape=[1,1]))
x_data = tf.placeholder(shape=[None, max_features], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)
How does one decide on when to use None vs. 1 in the placeholder shapes? Thank you!
Using None as part of a shape means that dimension will be determined when you run the session.
This is useful for what is called batch training, where you feed each iteration of the training process a fixed-size subset of the data.
So if you keep it at None, you can switch between batch sizes without a problem; each sess.run call can even feed a different batch size within the same session.
When you state a specific shape, that is the only shape that can be fed to it during the session (via the feed_dict parameter).
In your specific example, in the first part of the code the shape of y_target will always be [1, 1], whereas in the second part y_target could be [10, 1] / [200, 1] / [<whatever>, 1].
'None' should be used when the number of elements in the placeholder is unknown in advance. But if, for example, the number of data elements in the x_data placeholder is known in advance to be 1, then you can replace 'None' with 1.