I'm having a hard time understanding the different terms used with RNNs. They are the following:
batch_size, time_steps, inputs and instances.
Let me go through my understanding of each input parameter; please correct me where I'm wrong.
Suppose I've got a sequence of numbers and I want to predict the next number. The numbers are the following:
[1,2,3,4,5,....,100]
time_steps: This parameter means how far back the RNN looks into the past before it predicts the future. For simplicity, I want to predict 1 number ahead, after seeing the 10 previous numbers. So, in this case, time_steps will be 10.
inputs: These are the values at each time step. For the first window of time steps, the inputs are
t0: [1]
t1: [2]
.
.
.
t10: [10]
batch_size: This helps the RNN model compute efficiently. Suppose my batch_size is 2. In that case, at the first time step, the RNN input will be
t0: [1]
t0: [11]
Then what is the purpose of instances? E.g. in this post, instances have been used, and there are multiple other cases where instances are used. Does it mean each iteration over the batches? E.g. if there are 5 batches, each of size 2, will there be 5 instances?
Please help me correct my understanding.
Thanks!
batch_size
Batch size, in general, is the size of the mini-batches constructed from the experimental dataset. Since deep learning requires a lot of computation, it is better to work on mini-batches, because processing many examples at once makes good use of the GPU.
time_steps
Since an RNN takes sequential inputs, the index of each element in the input sequence can be referred to as a time step of that sequence. For example, if [1,2,3,4,5,....,100] is a sequence, the index of each element in the sequence is a time step.
inputs
The term inputs has a broader meaning, so I am not sure if my definition is correct. According to my understanding, inputs to an RNN refer to the individual inputs provided to the RNN at each time step. For example, in [1,2,3,4,5,....,100], each element is an input to the RNN at a particular time step.
But in an abstract way, if someone asks, what is the input of your deep neural model? You can say, it is English sentences or images or audio clips or videos etc. In short, the meaning of the term inputs depends on the context.
instances
An instance, in general, refers to one training/dev/test example in the dataset. For example, the sequence [1,2,3,4,5,....,100] can be a training instance in your dataset.
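To tie the four terms together, here is a minimal sketch in plain Python. It assumes the next-number setup from the question (windows of 10 past values, one feature per time step, batches of 2); the helper names make_windows and make_batches are made up for illustration and are not part of any library.

```python
def make_windows(sequence, time_steps):
    """Split a sequence into (window, target) training instances."""
    windows, targets = [], []
    for start in range(len(sequence) - time_steps):
        window = sequence[start:start + time_steps]
        # Each time step carries one feature, so wrap each value in a list.
        windows.append([[value] for value in window])
        # The target is the number that follows the window.
        targets.append(sequence[start + time_steps])
    return windows, targets


def make_batches(items, batch_size):
    """Group instances into mini-batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


sequence = list(range(1, 101))  # [1, 2, ..., 100]
windows, targets = make_windows(sequence, time_steps=10)
batches = make_batches(windows, batch_size=2)

print("instances:", len(windows))
print("batches:", len(batches))
# One batch has the layout (batch_size, time_steps, features).
print("batch layout:", len(batches[0]), len(batches[0][0]), len(batches[0][0][0]))
```

In this sketch every window is one instance, the position inside a window is the time step, the value at that position is the input, and batch_size only controls how many instances are processed together.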
Hope this helps!
Alright pal, you did well learning those concepts. I had a hard time learning them correctly myself. Everything you know seems to be in order. As for "instances": they're basically a set of data. There's no fixed usage of the term "instances" in the deep learning community. Some people use it to refer to a different set of data or to batches of data. I rarely see it in papers.
Correct me if I am wrong but according to the official Keras documentation, by default, the fit function has the argument 'shuffle=True', hence it shuffles the whole training dataset on each epoch.
However, the point of using recurrent neural networks such as LSTM or GRU is to use the precise order of the data, so that the state from the previous data influences the current one.
If we shuffle all the data, all the logical sequences are broken. Thus I don't understand why there are so many examples of LSTMs where the argument is not set to False. What is the point of using an RNN without sequences?
Also, when I set the shuffle option to False, my LSTM model performs worse even though there are dependencies between the data: I use the KDD99 dataset, where the connections are linked.
If we shuffle all the data, all the logical sequences are broken.
No, the shuffling happens on the batches axis, not on the time axis.
Usually, your data for an RNN has a shape like this: (batch_size, timesteps, features)
Usually, you give your network not only one sequence to learn from, but many sequences. Only the order in which these many sequences are being trained on gets shuffled. The sequences themselves stay intact.
Shuffling is usually a good idea, because your network should learn only the training examples themselves, not their order.
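As a rough illustration of what "shuffling on the batches axis" means, the sketch below reorders a small made-up dataset along its first axis only. It uses plain Python lists rather than Keras, so it only mimics the idea; the data values are invented for the example.

```python
import random

# Three hypothetical training sequences, each with 4 time steps and 1 feature.
data = [
    [[1], [2], [3], [4]],
    [[10], [20], [30], [40]],
    [[100], [200], [300], [400]],
]

rng = random.Random(0)
shuffled = data[:]     # shallow copy: the inner sequences are not touched
rng.shuffle(shuffled)  # reorders whole sequences only (the first axis)

for seq in shuffled:
    # Every sequence still has its time steps in the original order.
    print(seq)
```

The order in which the sequences are visited changes, but the time axis inside each sequence is left alone.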
This being said, there are cases where you have indeed only one huge sequence to learn from. In that case you have the option to still divide your sequence into several batches. If this is the case, you are absolutely right with your concern that shuffling would have a huge negative impact, so don't do that in this case!
Note: Keras RNN layers have a stateful parameter that you can set to True. In that case the last state of each sample in the previous batch is passed on as the initial state of the sample at the same index in the following batch, which effectively makes your RNN see consecutive batches as one huge sequence. So definitely do this if you have one huge sequence split over multiple batches.
I'm currently facing a Machine Learning problem and I've reached a point where I need some help to proceed.
I have various time series of positional (x, y, z) data tracked by sensors. I've developed some more features. For example, I rasterized the whole 3D space and calculated a cell_x, cell_y and cell_z for every time step. The time series itself have variable lengths.
My goal is to build a model which classifies every time step with the labels 0 or 1 (binary classification based on past and future values). Therefore I have a lot of training time series where the labels are already set.
One thing which could be very problematic is that very few time steps in the data are labeled 1 (for example, only 3 of 800 samples are labeled with 1).
It would be great if someone can help me in the right direction because there are too many possible problems:
Wrong hyperparameters
Incorrect model
Too few labels of 1, but I think that's not a big problem because I only need the model to suggest the right time steps. So I would only use the peaks of the output.
Bad or too little training data
Bad features
I appreciate any help and tips.
Your model seems very strange. Why use only 2 units in the LSTM layer? Also, your problem is a binary classification. In this case you should use only one neuron in your output layer (try inserting one additional dense layer between the LSTM layer and the output layer, and try dropout layers between them).
Binary crossentropy does not make much sense with 2 output neurons unless you have a multi-label problem. But if you switch to one output neuron, it's the right loss. You then also need sigmoid as the activation function.
As a last piece of advice: try class weights.
http://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
This can make a huge difference if your labels are unbalanced.
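To make the class-weight advice concrete, here is a small sketch in plain Python. It assumes the "balanced" heuristic n_samples / (n_classes * class_count), which is my understanding of what scikit-learn's balanced mode computes, and it applies the weights to a binary crossentropy on a single sigmoid output. Both function names are hypothetical helpers, not library calls.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_binary_crossentropy(logit, label, weights):
    """Loss for one time step with a single sigmoid output neuron."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid activation
    loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return weights[label] * loss        # rare class contributes more


labels = [1] * 3 + [0] * 797  # 3 positives out of 800, as in the question
weights = balanced_class_weights(labels)

print(weights)
print(weighted_binary_crossentropy(0.0, 1, weights))  # mistake on a rare 1
print(weighted_binary_crossentropy(0.0, 0, weights))  # same mistake on a 0
```

With labels this skewed, an equally uncertain prediction costs far more on a 1 than on a 0, which discourages the model from simply predicting 0 everywhere.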
You can create the model using TensorFlow's BasicLSTMCell; the shape of your data fits BasicLSTMCell. You can find the documentation for BasicLSTMCell here, and this documentation contains code that will help you build a BasicLSTMCell model. Hope this helps, cheers.
I am working on an experiment with LSTM for time series classification and I have been going through several HOWTOs, but still, I am struggling with some very basic questions:
Is the main idea for training the LSTM to take the same sample from every time series?
E.g. if I have time series A (with samples a1,a2,a3,a4), B (b1,b2,b3,b4) and C (c1,c2,c3,c4), then I will feed the LSTM with batches of (a1,b1,c1), then (a2,b2,c2) etc.? Meaning that all time series need to be of the same size/number of samples?
If so, could anyone more experienced be so kind as to describe very simply how to approach the whole process of training the LSTM and creating the classifier?
My intention is to use TensorFlow, but I am still new to this.
If your goal is classification, then each training example should be a time series and a label. During training, you feed each time series into the LSTM, look only at the last output, and backpropagate from the loss on that output.
Judging from your question, you are probably confused about batching -- you can train multiple items at once. However, each item in the batch would get its own hidden state, and only the parameters of the layers are updated.
The time series in a single batch should be of the same length. You should terminate each sequence with an END token and pad items that are too short with a special PAD token. The LSTM should learn that PADs after an END are useless.
There is no need for different batches to have the same number of items, nor to have items of the same length.
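The padding rule can be sketched as follows. This is a simplified illustration for numeric series: instead of END and PAD tokens it pads with a hypothetical PAD value and returns a mask marking the real time steps, which is a different mechanism from the token approach described above. The function pad_batch is made up for this example.

```python
PAD = 0.0  # hypothetical padding value; pick one that cannot occur in real data


def pad_batch(sequences, pad_value=PAD):
    """Pad every sequence in one batch to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]
    # mask[i][t] is 1 for a real time step and 0 for padding.
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


batch_a = [[1.0, 2.0, 3.0], [4.0, 5.0]]  # two items, longest has length 3
batch_b = [[6.0, 7.0, 8.0, 9.0, 10.0]]   # another batch: one item of length 5

padded_a, mask_a = pad_batch(batch_a)
padded_b, mask_b = pad_batch(batch_b)

print(padded_a, mask_a)
print(padded_b, mask_b)
```

Padding is done per batch, so batch_a ends up with length 3 and batch_b with length 5, and the two batches also hold different numbers of items.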
I am using Tensorflow's combination of GRUCell + MultiRNNCell + dynamic_rnn to generate a multi-layer LSTM to predict a sequence of elements.
In the few examples I have seen, like character-level language models, once the Training stage is done, the Generation seems to be done by feeding only ONE 'character' (or whatever element) at a time to get the next prediction, and then getting the following 'character' based on the first prediction, etc.
My question is, since Tensorflow's dynamic_rnn unrolls the RNN graph into an arbitrary number of steps of whatever sequence length is fed into it, what is the benefit of feeding only one element at a time, once a prediction is gradually being built out? Doesn't it make more sense to be gradually collecting a longer sequence with each predictive step and re-feeding it into the graph? I.e. after generating the first prediction, feed back a sequence of 2 elements, and then 3, etc.?
I am currently trying out the prediction stage by initially feeding in a sequence of 15 elements (actual historic data), getting the last element of the prediction, and then replacing one element in the original input with that predicted value, and so on in a loop of N predictive steps.
What is the disadvantage of this approach versus feeding just one element at-a-time?
I'm not sure your approach is actually doing what you want it to do.
Let's say we have an LSTM network trained to generate the alphabet. Now, in order to have the network generate a sequence, we start with a clean state h0 and feed in the first character, a. The network outputs a new state, h1, and its prediction, b, which we append to our output. Next, we want the network to predict the next character based on the current output, ab. If we would feed the network ab with the state being h1 at this step, its perceived sequence would be aab, because h1 was calculated after the first a, and now we put in another a and a b. Alternatively, we could feed ab and a clean state h0 into the network, which would provide a proper output (based on ab), but we would perform unnecessary calculations for the whole sequence except b, because we already calculated the state h1 which corresponds to the network reading the sequence a, so in order to get the next prediction and state we only have to feed in the next character, b.
So to answer your question, feeding the network one character at a time makes sense because the network needs to see each character only once, and feeding the same character multiple times would just be unnecessary calculations.
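The argument can be checked with a toy recurrent cell. The sketch below is not an LSTM and not TensorFlow code; it is a scalar tanh cell with made-up weights (w_x, w_h) and made-up numeric stand-ins for the characters a and b, used only to show how the carried state behaves.

```python
import math


def rnn_step(x, h, w_x=0.5, w_h=0.8):
    """One step of a toy scalar RNN cell: new state from input and old state."""
    return math.tanh(w_x * x + w_h * h)


def run(inputs, h=0.0):
    """Feed inputs one by one, starting from state h; return the final state."""
    for x in inputs:
        h = rnn_step(x, h)
    return h


a, b = 1.0, 2.0  # numeric stand-ins for the characters "a" and "b"

h1 = run([a])                       # state after reading "a"
incremental = run([b], h=h1)        # feed only "b", carrying h1
from_scratch = run([a, b])          # re-feed "ab" from a clean state
double_counted = run([a, b], h=h1)  # re-feed "ab" on top of h1

print(incremental == from_scratch)       # same state, less computation
print(double_counted == run([a, a, b]))  # the cell effectively saw "aab"
print(incremental == double_counted)     # so this differs from the intended state
```

Feeding one element with the carried state reaches the same state as re-feeding the whole sequence from a clean state, while re-feeding the sequence on top of the carried state counts the earlier elements twice.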
This is a great question; I asked something very similar here.
The idea is that instead of sharing weights across time (one element at a time, as you describe it), each time step gets its own set of weights.
I believe there are several reasons for training one-step at a time, mainly computational complexity and training difficulty. The number of weights you'll need to train grows linearly for each time step. You'd need some pretty sporty hardware to train long sequences. Also for long sequences you'll need a very large data set to train all those weights. But imho, I am still optimistic that for the right problem, with sufficient resources, it would show improvement.
My problem:
I have a sequence of complex states and I want to predict the future states.
Input:
I have a sequence of states. Each sequence can be of variable length. Each state is a moment in time and is described by several attributes: [att1, att2, ...]. Where each attribute is a number between an interval [[0..5], [1..3651], ...]
The example (and paper) of Seq2Seq is based on that each state (word) is taken from their dictionary. So each state has around 80.000 possibilities. But how would you represent each state when it is taken from a set of vectors and the set is just each possible combination of the attributes.
Is there any method to work with more complex states in TensorFlow? Also, what is a good method to decide the boundaries of your buckets when the relation between input length and output length is unclear?
May I suggest a rephrasing and splitting of your question into two parts? The first is really a general machine learning/LSTM question that's independent of tensorflow: How to use an LSTM to predict when the sequence elements are general vectors, and the second is how to represent this in tensorflow. For the former - there's nothing really magical to do there.
But a very quick answer: You've really just skipped the embedding lookup part of seq2seq. You can feed dense tensors in to a suitably modified version of it -- your state is just a dense vector representation of the state. That's the same thing that comes out of an embedding lookup.
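As a rough sketch of what "a dense vector representation of the state" could look like, the snippet below turns each state into a float vector that could be fed to the RNN in place of an embedding lookup result. The min-max scaling to [0, 1] is my own assumption and not something the answer prescribes, and the attribute ranges are the two examples given in the question.

```python
# Attribute ranges from the question, as (low, high) per attribute.
ATTRIBUTE_RANGES = [(0, 5), (1, 3651)]


def state_to_vector(state, ranges=ATTRIBUTE_RANGES):
    """Scale each attribute to [0, 1] so the state becomes a dense float vector."""
    return [(value - low) / (high - low) for value, (low, high) in zip(state, ranges)]


# One hypothetical sequence of three states; each state has two attributes.
sequence = [[0, 1], [5, 3651], [2, 1826]]
dense_sequence = [state_to_vector(state) for state in sequence]

print(dense_sequence)
```

A batch of such sequences has the layout (batch_size, time_steps, number_of_attributes), so no vocabulary of all possible attribute combinations is needed.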
The vector representation tutorial discusses the preprocessing that turns, e.g., words into embeddings for use in later parts of the learning pipeline.
If you look at line 139 of seq2seq.py you'll see that the embedding_rnn_decoder takes in a 1D batch of things to decode (the dimension is the number of elements in the batch), but then uses the embedding lookup to turn it into a batch_size * cell.input_size tensor. You want to directly input a batch_size * cell.input_size tensor into the RNN, skipping the embedding step.