Keras : Shuffling dataset while using LSTM - tensorflow

Correct me if I am wrong, but according to the official Keras documentation, the fit function has the argument shuffle=True by default, hence it shuffles the whole training dataset on each epoch.
However, the point of using recurrent neural networks such as LSTM or GRU is to use the precise order of the data, so that the state from the previous data influences the current one.
If we shuffle all the data, all the logical sequences are broken. Thus I don't understand why there are so many examples of LSTM where the argument is not set to False. What is the point of using an RNN without sequences?
Also, when I set the shuffle option to False, my LSTM model performs worse, even though there are dependencies between the data: I use the KDD99 dataset, where the connections are linked.

If we shuffle all the data, all the logical sequences are broken.
No, the shuffling happens on the batches axis, not on the time axis.
Usually, your data for an RNN has a shape like this: (batch_size, timesteps, features)
Usually, you give your network not only one sequence to learn from, but many sequences. Only the order in which these sequences are presented for training gets shuffled. The sequences themselves stay intact.
Shuffling is usually a good idea because your network should learn the training examples themselves, not their order.
This being said, there are cases where you have indeed only one huge sequence to learn from. In that case you have the option to still divide your sequence into several batches. If this is the case, you are absolutely right with your concern that shuffling would have a huge negative impact, so don't do that in this case!
Note: RNNs have a stateful parameter that you can set to True. In that case the last state of the previous batch will be passed to the following one which effectively makes your RNN see all batches as one huge sequence. So, absolutely do this, if you have a huge sequence over multiple batches.
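A minimal sketch of both cases (the shapes, layer sizes and random data below are placeholders, not the KDD99 setup from the question):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

# 1000 independent sequences, each 20 timesteps long with 8 features
X = np.random.rand(1000, 20, 8)
y = np.random.rand(1000, 1)

# Case 1: many independent sequences -> shuffling the batch axis is fine
model = Sequential([
    LSTM(32, input_shape=(20, 8)),   # (timesteps, features)
    Dense(1),
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, y, batch_size=32, shuffle=True)   # only the order of the sequences is shuffled

# Case 2: one long sequence split into consecutive chunks -> keep the order, carry the state
stateful_model = Sequential([
    LSTM(32, batch_input_shape=(32, 20, 8), stateful=True),
    Dense(1),
])
stateful_model.compile(optimizer='adam', loss='mse')
# shuffle=False preserves the chronological order of the chunks;
# with stateful=True the last state of each batch is passed on to the next one
stateful_model.fit(X[:960], y[:960], batch_size=32, shuffle=False)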

Related

Understanding the Input Parameters in RNN

I'm having a hard time understanding the different "jargons" used with RNNs. They are the following:
batch_size, time_steps, inputs and instances.
Let me go through my understanding of each input parameter, and please correct me where I'm wrong.
Suppose I've got a sequence of numbers and I want to predict the next number. The numbers are the following:
[1,2,3,4,5,....,100]
time_steps: This parameter means how far the RNN will look into the past before it predicts the future. For simplicity, I want to predict 1 number ahead, and I want to do that after seeing 10 numbers from the past. So, in this case, time_steps will be 10.
inputs: These are the values at each time_step. For the first window of time_steps, the inputs are
t0: [1]
t1: [2]
.
.
.
t9: [10]
batch_size: This helps in the efficient computation of the RNN model. Suppose my batch_size is 2. In that case, at time_step t0, the 2 RNN inputs will be
t0: [1]
t0: [11]
Then what's the usage of instances? E.g. in this post, instances have been used. And there are multiple cases where instances are used. Does it mean each loop over a batch? E.g. if there are 5 batches, each of size 2, will there be 5 instances?
Please help me correct my understanding.
Thanks!
batch_size
Batch size, in general, represents the size of the mini-batches constructed from the experimental dataset. Since in deep learning we are required to do a lot of computation, it is better to work with mini-batch operations because they make much better use of the GPU.
time_steps
Since an RNN takes sequential inputs, the index of each element in the input sequence can be referred to as a time step of that sequence. For example, if [1,2,3,4,5,....,100] is a sequence, the index of each element in the sequence is a time step.
inputs
The term inputs has a broader meaning, so I am not sure if my definition is correct. According to my understanding, inputs to an RNN refers to the individual inputs provided to the RNN at each time step. For example, in [1,2,3,4,5,....,100], each element is an input to the RNN at a particular time step.
But in an abstract way, if someone asks, what is the input of your deep neural model? You can say, it is English sentences or images or audio clips or videos etc. In short, the meaning of the term inputs depends on the context.
instances
Instances, in general, refer to training/dev/test examples in the dataset. For example, the sequence [1,2,3,4,5,....,100] can be a training instance in your dataset.
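As an illustration (not from the original answer), here is a small numpy sketch of how these terms map onto array shapes for the [1,2,...,100] sequence; the sliding-window construction is just one common way to frame "predict the next number":

import numpy as np

sequence = np.arange(1, 101)          # [1, 2, 3, ..., 100]
time_steps = 10                       # look 10 numbers into the past

# each instance is a window of 10 numbers, each label is the number that follows it
X = np.array([sequence[i:i + time_steps] for i in range(len(sequence) - time_steps)])
y = sequence[time_steps:]

# RNN layers expect (batch_size, time_steps, features)
X = X.reshape(-1, time_steps, 1)      # one feature per time step
print(X.shape, y.shape)               # (90, 10, 1) (90,)

batch_size = 2                        # two instances are processed in parallel per step
first_batch = X[:batch_size]          # shape (2, 10, 1)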
Hope this helps!
Alright pal, you did well learning those concepts. I had a hard time learning them correctly myself. Everything you know seems to be in order, and as for "instances": they're basically a set of data. There's no fixed usage of "instances" in the deep learning community. Some people use it to refer to a different set of data or to batches of data. I rarely hear it in papers.

Predict all probable trajectories in a grid structure using Keras

I'm trying to predict sequences of 2D coordinates. But I don't want only the most probable future path; I want all the probable paths, so I can visualize them in a grid map.
For this I have training data consisting of 40000 sequences. Each sequence consists of 10 2D coordinate pairs as input and 6 2D coordinate pairs as labels.
All the coordinates are in a fixed value range.
What would be my first step to predict all the probable paths? To get all probable paths I would have to apply a softmax at the end, where each cell in the grid is one class, right? But how do I process the data to reflect this grid-like structure? Any ideas?
A softmax activation won't do the trick I'm afraid; if you have an infinite number of combinations, or even a finite number of combinations that do not already appear in your data, there is no way to turn this into a multi-class classification problem (or if you do, you'll lose generality).
The only way forward I can think of is a recurrent model employing variational encoding. To begin with, you have a lot of annotated data, which is good news; a recurrent network fed with a sequence X (10,2,) will definitely be able to predict a sequence Y (6,2,). But since you want not just one but rather all probable sequences, this won't suffice. Your implicit assumption here is that there is some probability space hidden behind your sequences, which affects how they play out over time; so to model the sequences properly, you need to model that latent probability space. A Variational Auto-Encoder (VAE) does just that; it learns the latent space, so that during inference the output prediction depends on sampling over that latent space. Multiple predictions over the same input can then result in different outputs, meaning that you can finally sample your predictions to empirically approximate the distribution of potential outputs.
Unfortunately, VAEs can't really be explained within a single paragraph on Stack Overflow, and even if they could, I wouldn't be the most qualified person to attempt it. Try searching the web for LSTM-VAE and arm yourself with patience; you'll probably need to do some studying, but it's definitely worth it. It might also be a good idea to look into Pyro or Edward, which are probabilistic programming libraries for Python, better suited to the task at hand than Keras.
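As a point of reference (not a substitute for the LSTM-VAE), the deterministic sequence-to-sequence baseline mentioned above might look roughly like this in Keras; the layer sizes are arbitrary placeholders, and this predicts only a single path rather than a distribution over paths:

from keras.models import Sequential
from keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

model = Sequential([
    LSTM(64, input_shape=(10, 2)),       # encode the 10 observed coordinate pairs
    RepeatVector(6),                     # repeat the encoding once per future step
    LSTM(64, return_sequences=True),     # decode into 6 time steps
    TimeDistributed(Dense(2)),           # one (x, y) pair per step
])
model.compile(optimizer='adam', loss='mse')
# model.fit(X_train, Y_train, ...) with X_train of shape (40000, 10, 2)
#                                  and Y_train of shape (40000, 6, 2)

A VAE variant would replace the fixed encoding with a sampled latent vector, which is what lets you draw multiple different futures for the same input.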

Time series classification using LSTM - How to approach?

I am working on an experiment with LSTM for time series classification and I have been going through several HOWTOs, but still, I am struggling with some very basic questions:
Is the main idea for training the LSTM to take the same sample from every time series?
E.g. if I have time series A (with samples a1,a2,a3,a4), B (b1,b2,b3,b4) and C (c1,c2,c3,c4), will I feed the LSTM with batches of (a1,b1,c1), then (a2,b2,c2) etc.? Meaning that all time series need to be of the same size/number of samples?
If so, could anyone more experienced be so kind as to describe very simply how to approach the whole process of training the LSTM and creating the classifier?
My intention is to use TensorFlow, but I am still new to this.
If your goal is classification, then your data should be a time series and a label. During training, you feed each into the lstm, look only at the last output, and backprop as necessary.
Judging from your question, you are probably confused about batching -- you can train multiple items at once. However, each item in the batch would get its own hidden state, and only the parameters of the layers are updated.
The time series in a single batch should be of the same length. You should terminate each sequence with an END token and pad items that are too short with a special PAD token -- the lstm should learn that PADs after an END are useless.
There is no need for different batches to have the same number of items, nor to have items of the same length.
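A minimal Keras sketch of the padding idea (this uses Keras masking instead of explicit END/PAD tokens, and the data and sizes are made-up placeholders):

import numpy as np
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Masking, LSTM, Dense

# three toy series of different lengths and their class labels
series = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6], [0.7, 0.8, 0.9]]
labels = np.array([0, 1, 0])
PAD = -1.0                                       # padding value not used by the real data

# pad every series in the batch to the length of the longest one
X = pad_sequences(series, padding='post', value=PAD, dtype='float32')
X = X[..., np.newaxis]                           # (batch, timesteps, 1 feature)

model = Sequential([
    Masking(mask_value=PAD, input_shape=(X.shape[1], 1)),  # skip the padded steps
    LSTM(32),                                    # only the last output is kept
    Dense(2, activation='softmax'),              # 2 classes in this toy example
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X, labels, batch_size=3)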

Does batch normalization work on balanced dataset?

I trained a classification network using tensorFlow with batch normalization in every convolutional layer. When I predict on a balanced test set where every category is included, the accuracy is normal. However, if I choose any one specific category from the test set, the accuracy is low, even zero.
But when 3 categories are included in the test set, the accuracy becomes higher. As we all know, the weights are fixed when the model finishes training. But I find that the balance of the test set has a great influence on prediction accuracy.
I thought that batch normalization might have an influence on this, so I removed all batch normalization and retrained the model. This time, when I predict pictures of only one category, the accuracy is normal.
Could anyone know why? THANKS!
You're right. If your training set is unbalanced you compute and accumulate mean values (for every layer) that are skewed in favor of the majority class.
In fact, you're not "normalizing" but instead, you're making the unbalancing problem worse.
Use batch normalization when you have a balanced training set and you can be sure that your batches will contain a balanced number of samples. This gives you optimal results.
However, since you added in the comments that you're using tf.contrib.layers.conv2d(x, num_output, kernel_size, stride, padding, activation_fn, normal_fn=tf.contrib.layers.batch_norm)
I spotted the problem: normalizer_fn calls the function you pass (batch_norm), but it uses the default parameters. By default, is_training is True, thus you're computing the mean and the variance over the batch even during the test phase. Just read the documentation of tf.contrib.layers.conv2d carefully and use normalizer_params to pass is_training=True when training and is_training=False when testing/validating.
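A sketch of what that looks like with the TF 1.x contrib API (the input placeholder, sizes and names here are assumptions for illustration, not taken from the question):

import tensorflow as tf

# a boolean placeholder toggles batch norm between training and inference behaviour
is_training = tf.placeholder(tf.bool, name='is_training')
x = tf.placeholder(tf.float32, [None, 32, 32, 3])

net = tf.contrib.layers.conv2d(
    x, num_outputs=64, kernel_size=3, stride=1, padding='SAME',
    activation_fn=tf.nn.relu,
    normalizer_fn=tf.contrib.layers.batch_norm,
    normalizer_params={'is_training': is_training})

# feed is_training=True during training and is_training=False when testing/validating,
# e.g. sess.run(net, feed_dict={x: images, is_training: False})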

In tensorflow seq2seq framework, How to train data of different bucket-size in one batch

I applied a queued reader to the tensorflow seq2seq model to avoid reading the whole dataset into memory and processing it all in advance. I didn't first bucket the dataset into separate bucket files to ensure one bucket size per batch, because that would also take a lot of time. As a consequence, each batch of data from the queued reader may contain sequences of different bucket sizes, which leads to a failure when running the original seq2seq model (it assumes that the data in one batch is of the same bucket size, and chooses only one sub-graph, depending on the bucket size, to execute).
What I have tried:
In the original implementation, sub-graphs, as many as there are buckets, are constructed to share the same parameters. The only difference between them is the number of computation steps taken during their RNN process.
I changed the sub-graphs to conditional ones which, when the switch is True, compute the bucket_loss of that bucket and add it to loss_list, and when the switch is False, do nothing and add tf.constant(0.0) to loss_list. Finally, I use total_loss = tf.reduce_sum(loss_list) to collect all the losses and constructed the gradient graph on it. Also, I feed a switches_list into the model at every step. The size of switches_list is the same as that of the buckets, and if there is any data of the ith bucket size in this batch, the corresponding ith switch in switches_list will be True, otherwise False.
The Problems encountered:
when the backpropagation process went through the tf.cond(...) node, I was warned by gradient.py that some sparse tensors were transformed to dense ones
when I tried to fetch the total_loss or bucket_loss, I was told:
ValueError: Operation u'cond/model_with_one_buckets/sequence_loss/truediv' has been marked as not fetchable.
Would you please help me:
How can I solve the two problems above?
How should I modify the graph to meet my requirement?
Any better ideas for training data of different bucket sizes in one batch?
Any better ideas for applying an asynchronous queued reader to the seq2seq framework without bucketing the whole dataset first?
I would (did) throw out the bucketing entirely. Go with dynamic_rnn. The idea here is to fill up your batch with a padding symbol, as many as needed for THAT batch to arrive at equal length for all members of THAT batch (usually just the length of the longest member of the respective batch). This solves all four of your questions, but yes, it is some hassle to rewrite. (I don't regret it at all though.)
I did many things that were very particular to my case and data on the way, thus sharing makes no sense, but maybe you want to check out this implementation: Variable Sequence Lengths in TensorFlow
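A minimal sketch of the dynamic_rnn approach in TF 1.x (the sizes and placeholder names below are assumptions for illustration, not from the answer's own code):

import tensorflow as tf

batch_size, max_len, embed_dim = 32, 50, 128                # placeholder sizes

# sequences already padded to the length of the longest member of this batch
inputs = tf.placeholder(tf.float32, [batch_size, max_len, embed_dim])
lengths = tf.placeholder(tf.int32, [batch_size])            # true length of each sequence

cell = tf.nn.rnn_cell.LSTMCell(256)
outputs, final_state = tf.nn.dynamic_rnn(
    cell, inputs,
    sequence_length=lengths,    # steps beyond each true length are not computed
    dtype=tf.float32)
# outputs: (batch_size, max_len, 256); entries past each sequence's length are zero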