How to calculate the KL divergence for two multivariate pandas dataframes - pandas

I am training a Gaussian-Process model iteratively. In each iteration, a new sample is added to the training dataset (Pandas DataFrame), and the model is re-trained and evaluated. Each row of the dataset comprises 5 independent variables + the dependent variable. The training ends after 150 iterations (150 samples), but I want to extend this behaviour so the training can automatically stop after a number of iterations for which no meaningful information is added to the model.
My first approach is to compare the distribution of the last 10 samples to that of the previous 10. If the distributions are very similar, I assume that no meaningful knowledge has been added in the last 10 iterations, so I abort the training.
I thought of using Kullback-Leibler divergence, but I am not sure if this can be used for multivariate distributions. Should I use it? If so, how?
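One way to make this concrete is to fit a multivariate Gaussian to each 10-sample window and use the closed-form KL divergence between the two Gaussians. A minimal sketch of that idea (the window slicing, the regularization term and the function name are assumptions of mine, not a definitive recipe):
import numpy as np
import pandas as pd

def gaussian_kl(df_p, df_q, reg=1e-6):
    # Fit a multivariate Gaussian to each window and return KL(P || Q) in closed form
    mu_p, mu_q = df_p.mean().values, df_q.mean().values
    # rowvar=False: each DataFrame column is treated as one variable
    cov_p = np.cov(df_p.values, rowvar=False) + reg * np.eye(df_p.shape[1])
    cov_q = np.cov(df_q.values, rowvar=False) + reg * np.eye(df_q.shape[1])
    k = len(mu_p)
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p) + diff @ cov_q_inv @ diff - k
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

# hypothetical usage: last 10 samples vs. the previous 10
# kl = gaussian_kl(df.iloc[-10:], df.iloc[-20:-10])
Note that with only 10 samples and 6 columns the covariance estimates are very noisy (hence the small ridge term), so the result should be read as a rough similarity score rather than a precise divergence.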
Additionally, is there any other better/smarter way to proceed?
Thanks

Related

How to train an LSTM on multiple independent time-series of sensor-data

I have sensor measurements for 10 different people performing the same experiment, in which they need to complete a specific task. For each timestep in the measurements I have the corresponding label, and my goal is to train a sequential classifier which predicts the action a person is performing given the sensor observations. So, basically, for each person I have a separate dataset containing timesteps, several sensor measurements and the corresponding action (activity) for each timestep. I want to perform a leave-one-out cross validation, which means I take the sequences of measurements and action labels of 9 people for the training part and 1 sequence for the test part. However, I don't know how to train my model on the 9 different independent measurement sequences (they also have different lengths).
My idea is to first apply masking/padding to make the sequences of equal length L, then concatenate the padded sequences and, for training, use a batch size of n, where L is divisible by n without remainder. I am not sure though if this is the right way to go. Maybe Keras already supports training sequential models on independent sequences?
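A minimal sketch of that padding/masking idea (person_sequences, person_labels and n_classes are placeholder names of mine, assuming per-timestep integer labels):
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

# person_sequences: list of arrays with shape (timesteps_i, n_features), one per person
X = pad_sequences(person_sequences, padding='post', dtype='float32', value=0.0)
y = pad_sequences(person_labels, padding='post', value=0)

model = Sequential()
model.add(Masking(mask_value=0.0, input_shape=(X.shape[1], X.shape[2])))  # skip padded steps
model.add(LSTM(64, return_sequences=True))
model.add(Dense(n_classes, activation='softmax'))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
# model.fit(X, y, batch_size=n, ...)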
I would be happy to hear your recommendations. Thank you!

What is Keras doing if my sample size is smaller than my batch size?

I'm fairly new to LSTMs, but I already searched for a solution and could not find anything satisfying or even similar enough.
So here is my problem:
I am dealing with sleep classification and have annotated records for about 6k patients.
To train my bidirectional LSTM, I pick one patient and fit the model on that data instead of putting all the data from all patients into one big matrix, because I want to prevent samples from different patients from mixing when Keras trains with mini-batches.
The sequence length (sample_size) per patient is not the same.
Then I loop over all patients and do an additional loop for the number of epochs I want to train the model for (as described in the Developer Guides).
So, since LSTMs (if not stateful) reset their cell and hidden state after each batch, and the default batch_size for tf.keras.Sequential.fit() is 32, I wanted it to match the sample_size of the patient I am showing to the network. If I do so, I get a warning and the training process errors out after some time. The warning is:
WARNING:tensorflow:6 out of the last 11 calls to .distributed_function at 0x0000023F9D517708> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings is likely due to passing python objects instead of tensors. Also, tf.function has experimental_relax_shapes=True option that relaxes argument shapes that can avoid unnecessary retracing. Please refer to https://www.tensorflow.org/beta/tutorials/eager/tf_function#python_or_tensor_args and https://www.tensorflow.org/api_docs/python/tf/function for more details.
So I looked up what my longest sample_size is and set my batch_size accordingly.
tl;dr: What is Keras doing in all the instances where my variable sample_size does not match my batch_size=max(len(sample_size))?
Is it just showing the available samples to the network?
If so: Why is there the warning mentioned above where setting the batch_size=sample_size leads to the failed training?
Or is it showing the available samples to the network and filling up the rest with zeros to match the given batch_size?
If so: Why is there the necessity of masking when using e.g. stateful mode?
edit:
So, I tried some additional workarounds and built my own data generator, which provides the data of one patient as one batch. I then set steps_per_epoch=len(train_patients) to include all patients in one epoch. No warnings about retracing, which I do not understand either.
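A minimal sketch of such a per-patient generator (class and variable names are hypothetical, not the exact code I used):
import numpy as np
from tensorflow import keras

class PatientBatchGenerator(keras.utils.Sequence):
    # yields the full sequence of exactly one patient as a single batch
    def __init__(self, patient_data, patient_labels):
        self.patient_data = patient_data      # list of arrays, shape (sample_size_i, n_features)
        self.patient_labels = patient_labels  # list of label arrays, one per patient

    def __len__(self):
        return len(self.patient_data)         # one batch per patient -> steps_per_epoch

    def __getitem__(self, idx):
        x = self.patient_data[idx][np.newaxis, ...]   # add a batch dimension of 1
        y = self.patient_labels[idx][np.newaxis, ...]
        return x, y

# model.fit(PatientBatchGenerator(train_X, train_y), epochs=n_epochs)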
It seems to solve my problem of showing one patient per batch without mixing patient data while keeping a variable sample_size, but I really do not understand the differences between all these possibilities and their different warnings.

Binary classification of every time series step based on past and future values

I'm currently facing a Machine Learning problem and I've reached a point where I need some help to proceed.
I have various time series of positional (x, y, z) data tracked by sensors. I've derived some additional features; for example, I rasterized the whole 3D space and calculated a cell_x, cell_y and cell_z for every time step. The time series themselves have variable lengths.
My goal is to build a model which classifies every time step with the labels 0 or 1 (binary classification based on past and future values). Therefore I have a lot of training time series where the labels are already set.
One thing which could be very problematic is that there are very few samples labeled 1 in the data (for example, only 3 of 800 samples are labeled with 1).
It would be great if someone could point me in the right direction, because there are many possible problems:
Wrong hyperparameters
Incorrect model
Too few 1 labels, but I think that's not a big problem because I only need the model to suggest the right time steps. So I would only use the peaks of the output.
Bad or too little training data
Bad features
I appreciate any help and tips.
Your model seems very strange. Why use only 2 units in the LSTM layer? Also, your problem is a binary classification. In this case you should choose only one neuron in your output layer (try inserting an additional Dense layer between the LSTM layer and the output layer, and try Dropout layers between them).
Binary crossentropy does not make much sense with 2 output neurons if you don't have a multi-label problem. But if you're switching to one output neuron, it's the right loss. You then also need sigmoid as the activation function.
As last advice: Try class weights.
http://scikit-learn.org/stable/modules/generated/sklearn.utils.class_weight.compute_class_weight.html
This can make a huge difference if your labels are imbalanced.
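A rough sketch of those suggestions put together (layer sizes and names like timesteps, n_features, train_labels, X_train and y_train are placeholders, not taken from the question):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential()
model.add(LSTM(32, input_shape=(timesteps, n_features)))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))      # one output neuron for binary classification
model.compile(loss='binary_crossentropy', optimizer='adam')

# weight the rare class 1 more heavily during training
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=train_labels)
model.fit(X_train, y_train, class_weight={0: weights[0], 1: weights[1]}, epochs=20)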
You can create the model using TensorFlow's BasicLSTMCell; the shape of your data fits BasicLSTMCell. You can find documentation for BasicLSTMCell here, and for creating the model this documentation contains code that will help you build a BasicLSTMCell model. Hope this helps, cheers.

My Variables are becoming NaN after updating in tensorflow

So I am trying to implement the DQN algorithm in TensorFlow, and I have defined the loss function as given below. But whenever I perform the weight update using the Adam optimizer, after 2-3 updates all my variables become NaN. My actions can take integer values between 0 and 10. Any idea what might be going on?
def Q_Values_of_Given_State_Action(self, actions_, y_targets):
    # Output of the online network, which gives the Q values of all actions in the current state
    self.dense_output = self.dense_output
    # Actions that were taken by the online network
    actions_ = tf.reshape(tf.cast(actions_, tf.int32), shape=(Mini_batch, 1))
    # Row indices, paired with the action indices to gather the Q values of the chosen actions
    z = tf.reshape(tf.range(tf.shape(self.dense_output)[0]), shape=(Mini_batch, 1))
    index_ = tf.concat((z, actions_), axis=-1)
    self.Q_Values_Select_Actions = tf.gather_nd(self.dense_output, index_)
    # Half the sum of squared errors between the selected Q values and the targets
    self.loss_ = tf.divide(tf.reduce_sum(tf.square(self.Q_Values_Select_Actions - y_targets)), 2)
    return self.loss_
The fact that your inputs are often as large as 10 suggests your gradients are exploding. You can check this by reducing the learning rate to something very small (try dividing your current learning rate by 100). If it takes longer to get NaNs, or they don't happen at all, it's your learning rate. If it's your learning rate, then consider using a one-hot vector to represent the actions.
In general, you can track down small bugs using tf.Print and big ones using tfdbg.
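Beyond what the answer mentions, two mitigations that are commonly tried for this kind of NaN problem are gradient clipping and explicit numeric checks; a rough TF1-style sketch (loss is whatever Q_Values_of_Given_State_Action returns, and the learning rate is an arbitrary small value):
import tensorflow as tf

optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)   # much smaller learning rate
grads_and_vars = optimizer.compute_gradients(loss)
# clip gradients so a single large TD error cannot blow up the weights
clipped = [(tf.clip_by_norm(g, 1.0), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)

# fail fast with a readable error instead of silently propagating NaNs
checked_loss = tf.debugging.check_numerics(loss, "loss became NaN or Inf")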

How to deal with multi step time series forecasting in multivariate LSTM in keras

I am trying to do multi-step time series forecasting using a multivariate LSTM in Keras. Specifically, I originally have two variables (var1 and var2) for each time step. Having followed the online tutorial here, I decided to use data at times (t-2) and (t-1) to predict the value of var2 at time step t. As the sample data table shows, I am using the first 4 columns as input and Y as output. The code I have developed can be seen here, but I have three questions.
   var1(t-2)  var2(t-2)  var1(t-1)  var2(t-1)  var2(t)
2        1.5       -0.8        0.9       -0.5     -0.2
3        0.9       -0.5       -0.1       -0.2      0.2
4       -0.1       -0.2       -0.3        0.2      0.4
5       -0.3        0.2       -0.7        0.4      0.6
6       -0.7        0.4        0.2        0.6      0.7
Q1: I have trained an LSTM model with the data above. This model does well in predicting the value of var2 at time step t. However, what if I want to predict var2 at time step t+1? I feel it is hard because the model cannot tell me the value of var1 at time step t. If I want to do it, how should I modify the code to build the model?
Q2: I have seen this question asked a lot, but I am still confused. In my example, what should be the correct time step in [samples, time steps, features]: 1 or 2?
Q3: I just started studying LSTMs. I have read here that one of the biggest advantages of an LSTM is that it learns the temporal dependence/sliding window size by itself, so why must we always convert time series data into a format like the table above?
Update: LSTM result (blue line is the training seq, orange line is the ground truth, green is the prediction)
Question 1:
From your table, I see you have a sliding window over a single sequence, making many smaller sequences with 2 steps.
For predicting t, you take the first line of your table as input.
For predicting t+1, you take the second line as input.
If you're not using the table: see question 3
Question 2:
Assuming you're using that table as input, where it's clearly a sliding window case taking two time steps as input, your timeSteps is 2.
You should probably work as if var1 and var2 were features in the same sequence:
input_shape = (2,2) - Two time steps and two features/vars.
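For instance (df is a hypothetical DataFrame shaped like the table above, with the first four columns as inputs):
X = df[['var1(t-2)', 'var2(t-2)', 'var1(t-1)', 'var2(t-1)']].values.reshape(-1, 2, 2)
y = df['var2(t)'].values
# X now has shape (samples, 2, 2): two time steps, each with two features (var1, var2)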
Question 3:
We do not need to make tables like that or build a sliding window case. That is one possible approach.
Your model is actually capable of learning things and deciding the size of this window itself.
If, on one hand, your model is capable of learning long time dependencies, allowing you not to use windows, on the other hand it may learn to identify different behaviors at the beginning and in the middle of a sequence. In this case, if you want to predict using sequences that start from the middle (not including the beginning), your model may work as if it were the beginning and predict a different behavior. Using windows eliminates this very long influence. Which is better may depend on testing, I guess.
Not using windows:
If your data has 800 steps, feed all the 800 steps at once for training.
Here, we will need to separate two models, one for training, another for predicting. In training, we will take advantage of the parameter return_sequences=True. This means that for each input step, we will get an output step.
For predicting later, we will want only one output, so we will use return_sequences=False. And in case we are going to use the predicted outputs as inputs for following steps, we are going to use a stateful=True layer.
Training:
Have your input data shaped as (1, 799, 2), 1 sequence, taking the steps from 1 to 799. Both vars in the same sequence (2 features).
Have your target data (Y) shaped also as (1, 799, 2), taking the same steps shifted, from 2 to 800.
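For example, the shift can be built like this (series is a hypothetical array of shape (800, 2) holding var1 and var2):
X = series[:-1].reshape(1, 799, 2)   # steps 1..799
Y = series[1:].reshape(1, 799, 2)    # steps 2..800, the same steps shifted by one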
Build a model with return_sequences=True. You may use timeSteps=799, but you may also use None (allowing a variable number of steps).
model.add(LSTM(units, input_shape=(None,2), return_sequences=True))
model.add(LSTM(2, return_sequences=True)) #it could be a Dense 2 too....
....
model.fit(X, Y, ....)
Predicting:
For predicting, create a similar model, now with return_sequences=False.
Copy the weights:
newModel.set_weights(model.get_weights())
You can make an input with length 800, for instance (shape: (1,800,2)) and predict just the next step:
step801 = newModel.predict(X)
If you want to predict more, we are going to use stateful=True layers. Use the same model again, now with return_sequences=False (only in the last LSTM; the others keep True) and stateful=True (all of them). Change the input_shape to batch_input_shape=(1,None,2).
#with stateful=True, your model will never think that the sequence ended
#each new batch will be seen as new steps instead of new sequences
#because of this, we need to call this when we want a sequence starting from zero:
statefulModel.reset_states()
#predicting
X = steps1to800 #input
step801 = statefulModel.predict(X).reshape(1,1,2)
step802 = statefulModel.predict(step801).reshape(1,1,2)
step803 = statefulModel.predict(step802).reshape(1,1,2)
#the reshape is because return_sequences=False eliminates the step dimension
Actually, you could do everything with a single stateful=True and return_sequences=True model, taking care of two things:
When training, reset_states() for every epoch. (Train with a manual loop and epochs=1)
When predicting from more than one step, take only the last step of the output as the desired result.
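A rough sketch of that single-model variant (units, n_epochs, X and Y are placeholders):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

model = Sequential()
model.add(LSTM(units, batch_input_shape=(1, None, 2), return_sequences=True, stateful=True))
model.add(LSTM(2, return_sequences=True, stateful=True))
model.compile(loss='mse', optimizer='adam')

# training: manual epoch loop, resetting states at every epoch
for epoch in range(n_epochs):
    model.reset_states()
    model.fit(X, Y, epochs=1, batch_size=1, shuffle=False)

# predicting several steps ahead: take only the last step of the returned sequence
model.reset_states()
next_step = model.predict(X)[:, -1:, :]   # shape (1, 1, 2)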
Actually you can't just feed in the raw time series data, as the network won't fit to it naturally. The current state of RNNs still requires you to input multiple 'features' (manually or automatically derived) for it to properly learn something useful.
Usually the prior steps needed are (a minimal sketch follows the list):
Detrend
Deseasonalize
Scale (normalize)
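A minimal sketch of the first and third steps with pandas/scikit-learn (series is a hypothetical pandas Series; deseasonalizing is omitted):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

detrended = series.diff().dropna()                 # simple detrending by first differencing
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(detrended.values.reshape(-1, 1))
# invert later with scaler.inverse_transform(...) plus a cumulative sum to undo the differencing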
A great source of information is this post from a Microsoft researcher who won a time series forecasting competition by means of an LSTM network.
Also this post: CNTK - Time series Prediction