When training a deep learning model, does the order in which I feed in the elements of my dataset matter? - tensorflow

To be more specific, I'm dealing with an NLP problem and training an LSTM for word prediction given an initial sequence of words. My dataset is 200k Reddit comments.
Does it matter if I randomly feed the examples one at a time (allowing repeated inputs) or if I feed them in a fixed sequence (not allowing repetitions)?

Since your data is actually a set of comments, there is no need to process them in any particular sequence. In fact, it is typically better to process data in random order, to make sure the network does not learn something order-dependent. Repeats do not matter at all, as long as you sample uniformly.
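As a minimal sketch of both feeding strategies using the tf.data API (here comments and labels are hypothetical placeholders for your already-tokenized, padded inputs and next-word targets):

import tensorflow as tf

# Hypothetical placeholders: tokenized, padded comment sequences and targets.
dataset = tf.data.Dataset.from_tensor_slices((comments, labels))

# Feeding without repetitions: reshuffle the order every epoch.
epoch_stream = dataset.shuffle(buffer_size=10_000, reshuffle_each_iteration=True).batch(64)

# Feeding randomly with repeats allowed across the stream: shuffle + repeat.
sampled_stream = dataset.shuffle(buffer_size=10_000).repeat().batch(64)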

Related

How to inject future sequential input for multi-step time series forecasts in LSTM networks

I am trying to do multi-step (i.e., sequence-to-sequence) forecasts for product sales using both (multivariate) sequential and non-sequential inputs.
Specifically, I am using sales numbers as well as some other sequential inputs (e.g., price, is day before holiday, etc...) of the past n days to predict the sales for future m days. Additionally, I have some non-sequential features characterizing the product itself.
Definitions:
n_seq_features <- number of sequential features (in the multivariate time-series) including sales
n_non_seq_features <- number of non-sequential features characterizing a product
I got as far as building a hybrid model, where the sequential input is first passed through some LSTM layers. The output of the final LSTM layer is then concatenated with the non-sequential features and fed into some dense layers.
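For reference, a minimal sketch of such a hybrid model in the Keras functional API might look like the following (all layer sizes and the concrete values of n, m, n_seq_features and n_non_seq_features are placeholder assumptions):

from tensorflow import keras
from tensorflow.keras import layers

n, m = 30, 7                                 # placeholder past/future horizons
n_seq_features, n_non_seq_features = 5, 8    # placeholder feature counts

seq_in = keras.Input(shape=(n, n_seq_features), name="past_sequence")
static_in = keras.Input(shape=(n_non_seq_features,), name="product_features")

x = layers.LSTM(64, return_sequences=True)(seq_in)
x = layers.LSTM(32)(x)                       # final LSTM layer output
x = layers.Concatenate()([x, static_in])     # inject non-sequential features
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(m, name="sales_forecast")(x)

model = keras.Model([seq_in, static_in], out)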
What I can't quite get my head around, though, is how to input future sequential data (everything except sales numbers for the following m days) in a way that efficiently utilizes the sequential information (i.e., causality, etc.). For m=1, I can simply input the sequential data for this one day together with the non-sequential input after the LSTM layers; however, as soon as m becomes greater than 1, this appears to be a waste of causal information.
The only ways I could think of were:
to incorporate the sequential information for future m days as features in the LSTM input blowing up the input shape from (..., n, n_seq_features) to (..., n, n_seq_features + m*(n_seq_features-1))
add a separate LSTM branch handling the future data, the output of which is then 'somehow' fed into the dense layers at the last stage of the model (see the sketch after this list)
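A hedged sketch of that second option, again with placeholder sizes: the known future covariates (everything except sales) get their own small LSTM branch, whose summary is concatenated with the history encoding and the static features before the dense layers.

from tensorflow import keras
from tensorflow.keras import layers

n, m = 30, 7
n_seq_features, n_non_seq_features = 5, 8

past_in = keras.Input(shape=(n, n_seq_features), name="past_sequence")
future_in = keras.Input(shape=(m, n_seq_features - 1), name="future_covariates")
static_in = keras.Input(shape=(n_non_seq_features,), name="product_features")

past = layers.LSTM(32)(past_in)       # encodes the observed history
future = layers.LSTM(16)(future_in)   # encodes the known future inputs

x = layers.Concatenate()([past, future, static_in])
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(m, name="sales_forecast")(x)

model = keras.Model([past_in, future_in, static_in], out)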
I only started using LSTM networks a short while ago, so I unfortunately have only limited intuition on how they are best utilized (especially in hybrid approaches). For this reason, I would like to ask:
Is the general approach of injecting sequential and non-sequential input at different stages of the same model (i.e., trained concurrently) useful or would one rather split it into separate models which can be trained independently for more fine-grained control?
How is future sequential input injected into an LSTM network so as to preserve causal information? Can this be achieved with a high-level frontend like Keras, or does it require a very deep dive into the TensorFlow backend?
Are LSTM networks not the way to go for this specific problem in the first place?
Cheers and thanks in advance for any advice, resources or thoughts on the matter. :)
In case someone is having a similar issue with future sequential (or temporal) data, University of Oxford and Google Cloud AI have come up with a new architecture to handle all three types of input (past temporal, future temporal as well as static). It is called Temporal Fusion Transformer and, at least from reading the paper, looks like a neat fit. However, I have yet to implement and test it. There is also a PyTorch Tutorial available.

Keras : Shuffling dataset while using LSTM

Correct me if I am wrong, but according to the official Keras documentation, the fit function's shuffle argument defaults to True, hence it shuffles the whole training dataset on each epoch.
However, the point of using recurrent neural networks such as LSTMs or GRUs is to exploit the precise order of the data, so that the state from previous data influences the current one.
If we shuffle all the data, all the logical sequences are broken. Thus I don't understand why there are so many examples of LSTMs where the argument is not set to False. What is the point of using an RNN without sequences?
Also, when I set the shuffle option to False, my LSTM model is less performant even though there are dependencies between the data: I use the KDD99 dataset, where the connections are linked.
If we shuffle all the data, all the logical sequences are broken.
No, the shuffling happens on the batch axis, not on the time axis.
Usually, your data for an RNN has a shape like this: (batch_size, timesteps, features)
Usually, you give your network not just one sequence to learn from, but many sequences. Only the order in which these sequences are presented during training gets shuffled; the sequences themselves stay intact.
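A tiny illustration of this (all shapes and numbers are made up): shuffle=True permutes only the first axis, never the time axis inside a sequence.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# 1000 independent sequences, each 20 timesteps of 3 features.
x = np.random.rand(1000, 20, 3).astype("float32")
y = np.random.rand(1000, 1).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(20, 3)),
    layers.LSTM(16),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Only the 1000 samples get reordered each epoch;
# the 20 timesteps inside every sequence keep their order.
model.fit(x, y, batch_size=32, epochs=2, shuffle=True)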
Shuffling is almost always a good idea, because your network should learn only the training examples themselves, not their order.
This being said, there are cases where you have indeed only one huge sequence to learn from. In that case, you still have the option to divide your sequence into several batches. If you do, you are absolutely right in your concern that shuffling would have a huge negative impact, so don't do it in this case!
Note: RNNs have a stateful parameter that you can set to True. In that case, the last state of the previous batch is passed to the following one, which effectively makes your RNN treat all batches as one huge sequence. So absolutely do this if you have a huge sequence spanning multiple batches.
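A minimal sketch of that stateful setup in Keras (batch size and shapes are placeholder assumptions); note that a stateful RNN needs a fixed batch size, shuffle=False, and manual state resets between epochs:

from tensorflow import keras
from tensorflow.keras import layers

batch_size, timesteps, features = 32, 20, 3   # placeholders

model = keras.Sequential([
    keras.Input(shape=(timesteps, features), batch_size=batch_size),
    layers.LSTM(16, stateful=True),  # last state is carried into the next batch
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# With shuffle=False, consecutive batches really are consecutive chunks of
# the one long sequence, so the carried state is meaningful.
# model.fit(x, y, batch_size=batch_size, shuffle=False, epochs=1)
# model.reset_states()  # reset between epochs or independent sequences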

Understanding the Input Parameters in RNN

I'm having a hard time understanding the different "jargon" used in RNNs. The terms are the following:
batch_size, time_steps, inputs and instances.
Let me go through my understanding of each input parameter, and please correct me where I'm wrong.
Suppose I've got a sequence of numbers and I want to predict the next number. The numbers are the following:
[1,2,3,4,5,....,100]
time_steps: This parameter means how far the RNN will look into the past before it predicts the future. For simplicity, I want to predict 1 number ahead, and I want to do so after seeing 10 numbers in the past. So, in this case, time_steps will be 10.
inputs: These are the values at each time_step. Starting from the first time_step, the inputs are
t0: [1]
t1: [2]
.
.
.
t10: [10]
batch_size: This helps in efficient computation of the RNN model. Suppose my batch_size is 2. In that case, the RNN processes 2 sequences in parallel, and at time_step t0 the inputs will be
t0: [1]
t0: [11]
Then what's the usage of instances? E.g., in this post, instances have been used, and there are multiple cases where instances are used. Does it mean each loop over a batch? E.g., if there are 5 batches, each of size 2, then there will be 5 instances.
Please help me correct my understanding.
Thanks!
batch_size
Batch size, in general, represents the size of the mini-batches constructed from the experimental dataset. Since deep learning requires a lot of computation, it is better to use mini-batch operations, because they make much better use of the GPU.
time_steps
Since an RNN takes sequential inputs, the index of each element in the input sequence can be referred to as a time step of that sequence. For example, if [1,2,3,4,5,....,100] is a sequence, the index of each element in the sequence is a time step.
inputs
The term inputs has a broader meaning, so I am not sure if my definition is correct. According to my understanding, inputs to an RNN refers to the individual inputs provided to the RNN at each time step. For example, in [1,2,3,4,5,....,100], each element is an input to the RNN at a particular time step.
But in an abstract way, if someone asks, what is the input of your deep neural model? You can say, it is English sentences or images or audio clips or videos etc. In short, the meaning of the term inputs depends on the context.
instances
Instances, in general, refer to training/dev/test examples in the dataset. For example, the sequence [1,2,3,4,5,....,100] can be a training instance in your dataset.
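To tie these terms together, here is a small illustrative sketch (the window length and variable names are my own choices) that builds training instances from the sequence [1,2,3,4,5,....,100]:

import numpy as np

sequence = np.arange(1, 101)   # [1, 2, ..., 100]
time_steps = 10                # look 10 numbers into the past

# Each instance holds 10 inputs (one per time step); the target is the next number.
X = np.array([sequence[i:i + time_steps] for i in range(len(sequence) - time_steps)])
y = sequence[time_steps:]

X = X.reshape(-1, time_steps, 1)   # (instances, time_steps, features)
print(X.shape, y.shape)            # (90, 10, 1) (90,)

batch_size = 2   # number of instances processed in parallel per training step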
Hope this helps!
Alright pal, you did well learning those concepts. I had a hard time learning them correctly. Everything you know seems to be in order, and as for "instances": they're basically a set of data. There's no fixed usage of the term "instances" in the deep learning community. Some people use it to refer to a different set of data or to batches of data. I rarely see it in papers.

Time series classification using LSTM - How to approach?

I am working on an experiment with LSTMs for time series classification, and I have been going through several HOWTOs, but I am still struggling with some very basic questions:
Is the main idea for training the LSTM to take the same sample from every time series?
E.g., if I have time series A (with samples a1,a2,a3,a4), B (b1,b2,b3,b4) and C (c1,c2,c3,c4), then I will feed the LSTM batches of (a1,b1,c1), then (a2,b2,c2), etc.? Meaning that all time series need to be of the same size/number of samples?
If so, could anyone more experienced be so kind as to describe very simply how to approach the whole process of training the LSTM and building the classifier?
My intention is to use TensorFlow, but I am still new to this.
If your goal is classification, then your data should be pairs of a time series and a label. During training, you feed each series into the LSTM, look only at the last output, and backprop as necessary.
Judging from your question, you are probably confused about batching -- you can train on multiple items at once. However, each item in the batch gets its own hidden state, and only the parameters of the layers are updated.
The time series in a single batch should be of the same length. You should terminate each sequence with an END token and pad items that are too short with a special PAD token -- the LSTM should learn that PADs after an END are useless.
There is no need for different batches to have the same number of items, nor for their items to be of the same length.
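As a minimal sketch of this setup in Keras (all sizes are placeholders, and I use zero-padding with a Masking layer as one common alternative to explicit END/PAD tokens):

from tensorflow import keras
from tensorflow.keras import layers

max_len, n_features, n_classes = 50, 1, 3   # placeholder sizes

model = keras.Sequential([
    keras.Input(shape=(max_len, n_features)),
    layers.Masking(mask_value=0.0),   # skips zero-padded timesteps
    layers.LSTM(32),                  # only the last output feeds the classifier
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])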

LSTM one-step-ahead prediction with Tensorflow

I am using TensorFlow's combination of GRUCell + MultiRNNCell + dynamic_rnn to build a multi-layer RNN that predicts a sequence of elements.
In the few examples I have seen, like character-level language models, once the training stage is done, generation seems to be done by feeding only ONE 'character' (or whatever element) at a time to get the next prediction, and then getting the following 'character' based on the first prediction, etc.
My question is, since Tensorflow's dynamic_rnn unrolls the RNN graph into an arbitrary number of steps of whatever sequence length is fed into it, what is the benefit of feeding only one element at a time, once a prediction is gradually being built out? Doesn't it make more sense to be gradually collecting a longer sequence with each predictive step and re-feeding it into the graph? I.e. after generating the first prediction, feed back a sequence of 2 elements, and then 3, etc.?
I am currently trying out the prediction stage by initially feeding in a sequence of 15 elements (actual historic data), getting the last element of the prediction, and then replacing one element in the original input with that predicted value, and so on in a loop of N predictive steps.
What is the disadvantage of this approach versus feeding just one element at-a-time?
I'm not sure your approach is actually doing what you want it to do.
Let's say we have an LSTM network trained to generate the alphabet. Now, in order to have the network generate a sequence, we start with a clean state h0 and feed in the first character, a. The network outputs a new state, h1, and its prediction, b, which we append to our output.
Next, we want the network to predict the next character based on the current output, ab. If we fed the network ab with the state being h1 at this step, its perceived sequence would be aab, because h1 was calculated after the first a, and now we put in another a and a b. Alternatively, we could feed ab and a clean state h0 into the network, which would produce a proper output (based on ab), but we would perform unnecessary calculations for the whole sequence except b, because we already calculated the state h1, which corresponds to the network having read the sequence a. So in order to get the next prediction and state, we only have to feed in the next character, b.
So to answer your question, feeding the network one character at a time makes sense because the network needs to see each character only once, and feeding the same character multiple times would just be unnecessary calculations.
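A hedged sketch of this one-element-at-a-time loop in modern Keras terms (the question used TF1's dynamic_rnn, but the idea is the same; the vocabulary size, layer sizes, start token, and greedy sampling are all placeholder choices):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim, units = 128, 32, 64   # placeholder sizes

# A one-step model that exposes its recurrent state.
token_in = keras.Input(shape=(1,))           # ONE element per call
state_in = keras.Input(shape=(units,))
x = layers.Embedding(vocab_size, embed_dim)(token_in)
x, state_out = layers.GRU(units, return_state=True)(x, initial_state=state_in)
logits = layers.Dense(vocab_size)(x)
step_model = keras.Model([token_in, state_in], [logits, state_out])

# Generation: feed one element at a time and carry the state forward,
# so every element is processed exactly once.
token = np.zeros((1, 1), dtype="int32")      # placeholder start token
state = np.zeros((1, units), dtype="float32")
generated = []
for _ in range(15):
    out_logits, state = step_model.predict([token, state], verbose=0)
    token = np.argmax(out_logits, axis=-1, keepdims=True)  # greedy choice
    generated.append(int(token[0, 0]))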
This is a great question; I asked something very similar here.
The idea there was that instead of sharing weights across time (one element at a time, as you describe it), each time step gets its own set of weights.
I believe there are several reasons for training one step at a time, mainly computational complexity and training difficulty. The number of weights you need to train grows linearly with the number of time steps, so you'd need some pretty sporty hardware to train on long sequences. Also, for long sequences you'd need a very large dataset to train all those weights. But imho, I am still optimistic that for the right problem, with sufficient resources, it would show an improvement.