I found a nicely trained LSTM-based network.
The network allows for masking.
for layer in model.layers:
    # Print whether each layer supports masking, followed by its name.
    print(layer.supports_masking)
    print(layer.name)
This prints True for every layer except the input layers.
I also have a time series with missing timestamps, which I replace with the correct mask_value.
Does the network treat the masked values like ordinary values when determining the final prediction, so that all the computations of the forward pass are actually executed (for example, the state update in an LSTM for each input timestamp)? Or are the masked samples skipped completely, so that the computation never takes place?
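Keras will skip those time steps. According to the documentation of the Masking layer, a timestep whose features all equal mask_value is masked (skipped) in all downstream layers, as long as they support masking.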
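To make the effect of skipping concrete, here is a minimal sketch with a toy scalar recurrent cell in plain Python. It is not Keras code and not a real LSTM; MASK_VALUE, step and run are hypothetical names chosen for illustration. It only shows the behaviour one would expect from masking: a masked step leaves the state untouched, so the final state matches that of the same sequence with the gaps removed.

```python
import math

MASK_VALUE = -1.0  # hypothetical sentinel marking a missing timestamp


def step(state, x, w_x=0.5, w_h=0.9):
    """One update of a toy scalar recurrent cell (not a real LSTM)."""
    return math.tanh(w_x * x + w_h * state)


def run(sequence, mask_value=MASK_VALUE):
    """Run the cell over a sequence, carrying the state over masked steps."""
    state = 0.0
    updates = 0
    for x in sequence:
        if x == mask_value:
            continue  # masked step: the state is left unchanged
        state = step(state, x)
        updates += 1
    return state, updates


with_gaps = [0.2, MASK_VALUE, 0.7, MASK_VALUE, 0.1]
without_gaps = [0.2, 0.7, 0.1]

state_a, updates_a = run(with_gaps)
state_b, updates_b = run(without_gaps)
print(state_a == state_b, updates_a, updates_b)
```

Note that this only works if mask_value can never occur as a real measurement, which is worth checking for your data.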
I am working on an experiment with LSTM for time series classification and I have been going through several HOWTOs, but still, I am struggling with some very basic questions:
Is the main idea of training the LSTM to take the same sample position from every time series?
E.g. if I have time series A (with samples a1,a2,a3,a4), B (b1,b2,b3,b4) and C (c1,c2,c3,c4), do I feed the LSTM batches of (a1,b1,c1), then (a2,b2,c2), etc.? Does that mean all time series need to have the same size/number of samples?
If so, could anyone more experienced be so kind as to describe very simply how to approach the whole process of training the LSTM and creating the classifier?
My intention is to use TensorFlow, but I am still new to this.
If your goal is classification, then each training example should be a time series and a label. During training, you feed each time series into the LSTM, look only at the last output, and backprop as necessary.
Judging from your question, you are probably confused about batching -- you can train multiple items at once. However, each item in the batch would get its own hidden state, and only the parameters of the layers are updated.
The time series in a single batch should be of the same length. You should terminate each sequence with an END token and pad items that are too short with a special PAD token -- the LSTM should learn that PADs after an END are useless.
There is no need for different batches to have the same number of items, nor to have items of the same length.
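The padding scheme described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the token strings "<END>" and "<PAD>" and the helper make_batch are hypothetical names, and a real pipeline would map tokens to numeric ids before feeding them to the network.

```python
END = "<END>"  # hypothetical end-of-sequence token
PAD = "<PAD>"  # hypothetical padding token


def make_batch(sequences):
    """Terminate every sequence with END, then pad to the longest length."""
    terminated = [list(seq) + [END] for seq in sequences]
    max_len = max(len(seq) for seq in terminated)
    return [seq + [PAD] * (max_len - len(seq)) for seq in terminated]


batch = make_batch([
    ["a1", "a2", "a3", "a4"],
    ["b1", "b2"],
    ["c1", "c2", "c3"],
])
for row in batch:
    print(row)
```

Every row in the batch now has the same length, while a different batch is free to use a different length.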
I went through this tutorial. The last block says that the dynamic_rnn function cannot be used to calculate attention. What I don't understand is this: isn't the hidden state of the decoder all we need to compute the attention, which is then worked out against the encoder symbols?
An attention mechanism in the context of an encoder-decoder means that the decoder, at each time step, "attends" to the "useful" parts of the encoder. This is implemented, for example, as a weighted average of the encoder's outputs, with that value (called the context) fed into the decoder at the given time step.
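As a rough sketch of how such a context value can be computed, the plain-Python example below scores each encoder output against the current decoder state, normalises the scores with a softmax, and averages the encoder outputs with those weights. The dot-product scoring is an assumption made for brevity (the paper linked below uses a small learned network for scoring instead), and attention_context is a hypothetical helper name.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(decoder_state, encoder_outputs):
    """Return attention weights and the weighted average of encoder outputs."""
    # Assumption: dot-product scoring between decoder state and encoder output.
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_outputs
    ]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_outputs))
        for i in range(dim)
    ]
    return weights, context


encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = attention_context(decoder_state, encoder_outputs)
print(weights, context)
```

The weights depend on the decoder state at the current step, so the context has to be recomputed at every decoder step.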
dynamic_rnn computes the outputs of the LSTM cells across all time steps and returns them together with the final state. So there is no way to tell the model that the cell state at time step t should depend not only on the output of the previous cell and the input, but also on additional information such as the context. You can control the computation at each time step of the encoder or decoder LSTM using raw_rnn.
If I understand correctly, in this tutorial the author feeds ground truth input as input to the decoder at each time step. However, this is not the usual way it is done. Usually, you want to feed the output of decoder at time t as input to decoder at time t+1. In short, the input to the decoder at each time step is variable, whereas in dynamic_rnn it is predefined.
For more technical details, refer to: https://arxiv.org/abs/1409.0473
When I first train an LSTM in Keras on sequence data - my training data -
and then use model.predict() to make predictions with my test data as input, is the hidden state of the LSTM still being adjusted?
The basic operation of a neural network is to take an input (vector) that is connected to the output through connections and, sometimes, other layers such as context layers. These connections vary in strength and are modelled as matrices, which we call weight matrices.
This means that the only thing we do when we are feeding data into the network is to put a vector into the network, multiply the values with the weight matrix and call that the output. In special cases, like recurrent networks, we even keep some values stored in other vectors and combine this stored value with the current input.
During training we not only feed data into the network, we also compute an error value that we evaluate in a clever way so that it tells us how we should change the weight matrices we multiply our inputs (and possibly past inputs for recurrent layers) with.
Therefore: yes, the hidden state is still updated during prediction, because the basic execution behavior does not change for recurrent layers. We are just not updating the weights anymore.
There are layers that do behave differently at execution time because they are treated as regularisers, i.e. methods that make training the network more efficient and are deemed unnecessary during execution. Examples of such layers are noise layers and BatchNormalization. Many neural network layers (including recurrent ones) also offer dropout, another form of regularisation, which disables a random percentage of connections in the layer. This, too, is only done during training.
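The difference between training and execution for dropout can be sketched in plain Python as follows. This assumes the common "inverted dropout" variant, in which surviving values are scaled up during training so that nothing needs rescaling at prediction time; the function name and the fixed seed are illustrative choices, not part of any library API.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout (assumed variant): active only while training."""
    if not training:
        return list(values)  # execution/prediction: values pass through
    keep = 1.0 - rate
    # Keep each value with probability `keep` and rescale it, else zero it.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0, 2.0, 3.0, 4.0]

train_out = dropout(activations, rate=0.5, training=True, rng=rng)
infer_out = dropout(activations, rate=0.5, training=False, rng=rng)
print(train_out)
print(infer_out)
```

At prediction time the function is the identity, which matches the point above: the forward computation still runs, only the training-specific behaviour is switched off.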
I am using Tensorflow's combination of GRUCell + MultiRNNCell + dynamic_rnn to generate a multi-layer LSTM to predict a sequence of elements.
In the few examples I have seen, like character-level language models, once the Training stage is done, the Generation seems to be done by feeding only ONE 'character' (or whatever element) at a time to get the next prediction, and then getting the following 'character' based on the first prediction, etc.
My question is, since Tensorflow's dynamic_rnn unrolls the RNN graph into an arbitrary number of steps of whatever sequence length is fed into it, what is the benefit of feeding only one element at a time, once a prediction is gradually being built out? Doesn't it make more sense to be gradually collecting a longer sequence with each predictive step and re-feeding it into the graph? I.e. after generating the first prediction, feed back a sequence of 2 elements, and then 3, etc.?
I am currently trying out the prediction stage by initially feeding in a sequence of 15 elements (actual historic data), getting the last element of the prediction, and then replacing one element in the original input with that predicted value, and so on in a loop of N predictive steps.
What is the disadvantage of this approach versus feeding just one element at-a-time?
I'm not sure your approach is actually doing what you want it to do.
Let's say we have an LSTM network trained to generate the alphabet. Now, in order to have the network generate a sequence, we start with a clean state h0 and feed in the first character, a. The network outputs a new state, h1, and its prediction, b, which we append to our output.
Next, we want the network to predict the next character based on the current output, ab. If we fed the network ab with the state being h1 at this step, its perceived sequence would be aab, because h1 was calculated after the first a, and now we put in another a and a b.
Alternatively, we could feed ab and a clean state h0 into the network, which would provide a proper output (based on ab), but we would perform unnecessary calculations for the whole sequence except b. We already calculated the state h1, which corresponds to the network having read the sequence a, so in order to get the next prediction and state we only have to feed in the next character, b.
So to answer your question, feeding the network one character at a time makes sense because the network needs to see each character only once, and feeding the same character multiple times would just be unnecessary calculations.
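The argument can be checked with a toy recurrent cell in plain Python. This is a sketch under assumptions, not an LSTM and not TensorFlow code: step and read are hypothetical helpers, and the weights 0.8 and 0.6 are arbitrary. It compares feeding only the new character with the carried state, recomputing from a clean state, and re-feeding the whole sequence on top of the carried state.

```python
import math


def step(state, char):
    """One update of a toy scalar recurrent cell (not a real LSTM)."""
    x = (ord(char) - ord("a") + 1) / 26.0  # crude numeric encoding
    return math.tanh(0.8 * state + 0.6 * x)


def read(sequence, state=0.0):
    """Feed a whole sequence, starting from the given state."""
    for char in sequence:
        state = step(state, char)
    return state


h0 = 0.0
h1 = read("a", h0)

incremental = read("b", h1)      # feed only b, carrying the state h1
from_scratch = read("ab", h0)    # recompute everything from a clean state
double_counted = read("ab", h1)  # the network effectively sees "aab"

print(incremental == from_scratch)
print(double_counted == read("aab", h0))
print(double_counted == from_scratch)
```

The first two comparisons hold and the third does not: carrying the state and feeding one element gives the same result as recomputing from scratch, with less work.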
This is a great question; I asked something very similar here.
The idea being that, instead of sharing weights across time (one element at a time, as you describe it), each time step gets its own set of weights.
I believe there are several reasons for training one-step at a time, mainly computational complexity and training difficulty. The number of weights you'll need to train grows linearly for each time step. You'd need some pretty sporty hardware to train long sequences. Also for long sequences you'll need a very large data set to train all those weights. But imho, I am still optimistic that for the right problem, with sufficient resources, it would show improvement.
I'm using tensorflow to run a cnn for image classification.
I use the TensorFlow cifar10 CNN implementation (tensorflow cifar10).
I want to decrease the number of connections, meaning I want to prune the low-weight connections.
How can I create a new graph (subgraph) without some of the neurons?
As far as I have found, TensorFlow does not allow you to lock/freeze a particular kernel of a particular layer. The only way I've found to do this is to use the tf.assign() function, as shown in
How to freeze/lock weights of one Tensorflow variable (e.g., one CNN kernel of one layer)
It's fairly cave-man but I've seen no other solution that works. Essentially, you have to .assign() the values every so often as you iterate through the data. Since this approach is so inelegant and brute-force, it's very slow. I do the .assign() every 100 batches.
Someone please post a better solution and soon!
The cifar10 model you point to, and for that matter most models written in TensorFlow, do not model the weights (and hence, connections) of individual neurons directly in the computation graph. For instance, for fully connected layers, all the connections between two layers, say with M neurons in the layer below and N neurons in the layer above, are modeled by one MxN weight matrix. If you want to completely remove a neuron and all of its outgoing connections from the layer below, you can simply slice out an (M-1)xN matrix by removing the relevant row, and multiply it with the corresponding M-1 activations of the remaining neurons.
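A minimal sketch of that row-slicing idea, written in plain Python with nested lists rather than TensorFlow tensors (forward and remove_neuron are hypothetical helper names). The neuron being removed here has all-zero outgoing weights, which is an assumption chosen so that the output can be seen to stay the same.

```python
def forward(activations, weights):
    """Multiply M activations by an MxN weight matrix, giving N outputs."""
    n_out = len(weights[0])
    return [
        sum(a * row[j] for a, row in zip(activations, weights))
        for j in range(n_out)
    ]


def remove_neuron(activations, weights, index):
    """Drop one neuron of the lower layer: its activation and its weight row."""
    return (
        activations[:index] + activations[index + 1:],
        weights[:index] + weights[index + 1:],
    )


# M = 3 neurons below, N = 2 neurons above.
weights = [
    [0.5, -1.0],
    [0.0, 0.0],   # neuron 1 has no effective outgoing connections
    [2.0, 0.25],
]
activations = [1.0, 3.0, -2.0]

full_output = forward(activations, weights)
small_acts, small_weights = remove_neuron(activations, weights, index=1)
pruned_output = forward(small_acts, small_weights)
print(full_output, pruned_output)
```

In a real model the layer that produces the activations would also have to lose the corresponding output unit, so the slicing has to be applied consistently on both sides.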
Another way is to add an additional mask to control the connections.
The first step involves adding mask and threshold variables to the
layers that need to undergo pruning. The variable mask is the same
shape as the layer's weight tensor and determines which of the weights
participate in the forward execution of the graph.
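The mask-and-threshold idea in the quoted passage can be sketched in plain Python as below. This is not the model_pruning API; make_mask, apply_mask and the threshold of 0.1 are hypothetical and only illustrate how a mask with the same shape as the weights decides which weights take part in the forward pass.

```python
def make_mask(weights, threshold):
    """1.0 where the weight magnitude reaches the threshold, else 0.0."""
    return [
        [1.0 if abs(w) >= threshold else 0.0 for w in row]
        for row in weights
    ]


def apply_mask(weights, mask):
    """Element-wise product: masked-out weights contribute nothing."""
    return [
        [w * m for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(weights, mask)
    ]


weights = [
    [0.8, -0.05],
    [0.01, -1.2],
]
mask = make_mask(weights, threshold=0.1)
masked_weights = apply_mask(weights, mask)

total = sum(len(row) for row in mask)
sparsity = 1.0 - sum(sum(row) for row in mask) / total
print(mask, masked_weights, sparsity)
```

The original weights are kept, so the mask can be recomputed with a different threshold as training proceeds.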
There is a pruning implementation under tensorflow/contrib/model_pruning to prune the model. Hope this can help you to prune model quickly.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/model_pruning
I think Google has an updated answer here: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/model_pruning
Removing pruning nodes from the trained graph:
$ bazel build -c opt contrib/model_pruning:strip_pruning_vars
$ bazel-bin/contrib/model_pruning/strip_pruning_vars --checkpoint_path=/tmp/cifar10_train --output_node_names=softmax_linear/softmax_linear_2 --filename=cifar_pruned.pb
I suppose that cifar_pruned.pb will be smaller, since the pruned "or zero masked" variables are removed.