How to find entropy or KL divergence between two datasets? - tensorflow

I have two datasets with the same shape: (576, 450, 5), where 576 is the number of examples, 450 is the number of time points, and 5 is the number of channels.
I want to calculate the entropy and KL divergence between these two datasets. I know that entropy and KL divergence are defined on probability distributions, but my data are just numerical values (not probability distributions). So how can I calculate these for my data? Should I convert my data to probability distributions? If so, how can I do it with my 3D data? Thank you.

You can bin each dataset using quantiles and treat the resulting bin frequencies as its empirical distribution. Any measure or distance defined on one or more probability distributions (entropy, mutual information, KL divergence, etc.) can then be computed between the binned distributions.
In TensorFlow, this can be done with tfp.stats.quantiles, for example tfp.stats.quantiles(x, num_quantiles=4, interpolation='nearest'), where x is a dataset and num_quantiles can be set to any reasonable number.
The crucial thing to be careful of here is that the cut points should be the same for the two datasets (i.e., both binned random variables must have the same support).
More generally, you need to train/estimate a statistical model of the two datasets and then use that model to compute these metrics. In the above, the statistical model is a categorical distribution.
In sum, you can either:
1. Call tfp.stats.quantiles with num_quantiles on one dataset and then re-use the resulting cut_points to bin the other dataset. To do so you will need tfp.stats.find_bins.
2. Decide on the cut_points based on some other criterion (for example, equal-width partitions of the support of the data) and then call tfp.stats.find_bins on both datasets.
The alternative I would favour is a variant of option 2. You can use quantiles to get the cut_points that correspond to both datasets if the datasets were concatenated together. You can then use those cut_points for binning both datasets.
Once you have the quantiles and/or the bins, you have a categorical probability distribution describing each dataset and from there these measures/distances can be computed easily.
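The pooled-quantile variant described above can be sketched as follows. This is a minimal illustration in NumPy rather than TensorFlow Probability: np.quantile and np.searchsorted stand in for tfp.stats.quantiles and tfp.stats.find_bins. The synthetic Gaussian data, the choice of 10 bins, the small eps floor for empty bins, and the decision to pool all examples, time points, and channels into one distribution per dataset are all assumptions made for the example.

```python
import numpy as np

def binned_probs(x, edges, eps=1e-12):
    """Categorical probabilities of x over the bins defined by interior cut points."""
    # Map every value to a bin index in [0, len(edges)]
    idx = np.searchsorted(edges, x.ravel(), side="right")
    counts = np.bincount(idx, minlength=len(edges) + 1).astype(float)
    probs = counts / counts.sum()
    # Assumed floor so that empty bins do not produce log(0) or division by zero
    probs = np.clip(probs, eps, None)
    return probs / probs.sum()

def entropy(p):
    """Shannon entropy in nats."""
    return float(-np.sum(p * np.log(p)))

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions on the same bins."""
    return float(np.sum(p * np.log(p / q)))

# Synthetic stand-ins for the two (576, 450, 5) datasets
rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(576, 450, 5))
b = rng.normal(0.5, 1.2, size=(576, 450, 5))

num_bins = 10
# Cut points are quantiles of the concatenated data, so both
# binned variables share the same support
pooled = np.concatenate([a.ravel(), b.ravel()])
edges = np.quantile(pooled, np.linspace(0.0, 1.0, num_bins + 1)[1:-1])

p = binned_probs(a, edges)
q = binned_probs(b, edges)

print("entropy of a:", entropy(p))
print("entropy of b:", entropy(q))
print("KL(a || b):", kl_divergence(p, q))
```

If the channels should be compared separately, the same functions can be applied to a[..., c] and b[..., c] for each channel c, with cut points computed per channel.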

Related

How to calculate the KL divergence for two multivariate pandas dataframes

I am training a Gaussian-Process model iteratively. In each iteration, a new sample is added to the training dataset (Pandas DataFrame), and the model is re-trained and evaluated. Each row of the dataset comprises 5 independent variables + the dependent variable. The training ends after 150 iterations (150 samples), but I want to extend this behaviour so the training can automatically stop after a number of iterations for which no meaningful information is added to the model.
My first approach is to compare the distribution of the last 10 samples to that of the previous 10. If the distributions are very similar, I assume that no meaningful knowledge has been added in the last 10 iterations, so I abort the training.
I thought of using Kullback-Leibler divergence, but I am not sure if this can be used for multivariate distributions. Should I use it? If so, how?
Additionally, is there any other better/smarter way to proceed?
Thanks

How to train an LSTM on multiple independent time-series of sensor-data

I have sensor measurements for 10 different people performing the same experiment in which they need to complete a specific task. For each timestep in the measurements I have the corresponding label, and my goal is to train a sequential classifier which predicts the action a person is performing given the sensor observations. So, basically, for each person I have a separate dataset containing timesteps, several sensor measurements and the corresponding action (activity) for each timestep. I want to perform a leave-one-out cross validation, which means that I will take the sequences of measurements and action labels for 9 people for the training part and 1 sequence for the test part. However, I don't know how to train my model on the 9 different independent measurement sequences (they also have different lengths).
My idea is to first apply masking/padding to make the sequences of equal length L, then concatenate the padded sequences and for the training to use a batch size of n, where L is divisible by n without remainder. I am not sure though if this is the right way to go. Maybe Keras already supports training sequential models on independent sequences?
I would be happy to hear your recommendations. Thank you!

What are good MSE and RMSE for my normalized dataset to between 0 and 1

I am using a training source CSV file read into a master dataframe that I split into 80% training data and 20% test data. Before splitting the data I normalized all columns of the dataframe so that all the independent and dependent data lie between 0 and 1, including the targets (dependent variables). After training, my predicted values all lie between 0 and 1. I then de-normalize a single prediction to see what value I get and compare it to the expected value. My question is: I'm measuring the model by MSE (mean squared error) and RMSE (root mean squared error). My MSE and RMSE on my training data are 0.03 and 0.16, respectively. Are these acceptable values with a normalized data source? If not, what would be acceptable values with my normalized data source? Or should I be normalizing my data at all, given that the ranges of my independent variables do not differ hugely? If I don't normalize my data, should I then use a normalized RMSE to interpret the metric? If I normalize the RMSE when not normalizing the training and test data, what would be an acceptable value for the normalized RMSE? Thanks in advance for any replies.
“Good” should be measured relative to a naive forecast (such as a random walk). That benchmark will vary according to the degree of volatility in the data. An RMSE of 0.5 might be terrible for one forecast and excellent for another.
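As a rough illustration of measuring against a naive benchmark, the sketch below compares a model's RMSE with that of a random-walk forecast, which predicts each value with the one before it. The target values and model predictions are hypothetical numbers invented for the example, and the sketch assumes the rows are ordered in time. For data without a time order, a constant mean predictor could serve as the naive benchmark instead.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical normalized targets in time order and hypothetical model output
y = np.array([0.20, 0.25, 0.22, 0.30, 0.35, 0.33, 0.40])
model_pred = np.array([0.21, 0.24, 0.24, 0.28, 0.34, 0.35, 0.38])

# Random-walk forecast: the prediction for step t is the observed value at t - 1,
# so the first step has no naive prediction and is skipped for both forecasts
naive_rmse = rmse(y[1:], y[:-1])
model_rmse = rmse(y[1:], model_pred[1:])

# A ratio below 1 means the model beats the naive benchmark on this data
ratio = model_rmse / naive_rmse
print("naive RMSE:", naive_rmse)
print("model RMSE:", model_rmse)
print("model / naive:", ratio)
```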

Vector representation in multidimensional time-series prediction in Tensorflow

I have a large data set (~30 million data-points with 5 features) that I have reduced using K-means down to 200,000 clusters. The data is a time-series with ~150,000 time-steps. The data on which I would like to train the model is the presence of particular clusters at each time-step. The purpose of the predictive model is to generate a generalized sequence, similar to generating syntactically correct sentences from a model trained on word sequences. The easiest way to think about this data is that I'm trying to predict the pixels in the next video frame from the pixels in the current video frame in order to generate a new sequence of frames that approximates the original sequence.
The raw and sparse representation at each time-step would be 200,000 binary values representing which clusters are present or not at that time step. Note, no more than 200 clusters may be present in any one time-step and thus this representation is extremely sparse.
What is the best representation to convert this sparse vector to a dense vector that would be more suitable to time-series prediction using Tensorflow?
I initially had in mind a RNN / LSTM trained on the vectors at each time-step, but due to the size of the training vector I'm now wondering if a convolution approach would be more suitable.
Note, I have not actually used tensorflow beyond some simple tutorials, but have previously used OpenCV ML functions. Please consider me a novice in your responses.
Thank you.

How to softmax two types of labels in TensorFlow

TensorFlow is great and we have used it for image classification and recommendation systems. We used softmax and cross entropy as the loss function. It works if we have only one type of label. For example, we choose only one digit from 0 to 9 in the MNIST dataset.
Now we have the features of gender and age. We have one-hot encoding for each example, such as [1, 0, 1, 0, 0, 0, 0]. The first two labels represent the gender and the last five labels represent the age. Each example has two 1s and the others should be 0s.
Now our code looks like this.
logits = inference(batch_features)
softmax = tf.nn.softmax(logits)
But I found that it applies softmax across all seven outputs at once, whereas the labels sum to 2. What I expect is that the first two sum to 1 and the last five sum to 1. I am not sure how to implement that in TensorFlow because these 7 (2+5) features all look the same.
You have your gender and age logits concatenated together.
You want the marginal predictions.
You need to split your logits (tf.slice) into two arrays and softmax them separately.
Just remember that this only gives you the marginal probabilities.
It can't represent "an old man or a young woman", as this doesn't factorize.
So you might want to make joint predictions instead: 5x2 classes instead of 5+2 classes. Obviously this more powerful model is more prone to overfitting.
If you have a lot of classes in each category you could build an intermediate model with a low rank factorization of the joint matrix, by adding together multiple marginal predictions. This gives Nxr+Mxr entries instead of N+M or NxM.
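The split-and-softmax step for the marginal predictions can be sketched as below. This is a minimal NumPy illustration rather than TensorFlow code; in TensorFlow the same slicing could be done with tf.slice or tf.split followed by tf.nn.softmax on each part. The logit values are hypothetical, and the final outer product is only there to show what the factorized joint implied by two marginals looks like.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical batch of 3 examples with 7 logits each: 2 gender + 5 age
logits = np.array([
    [2.0, -1.0, 0.5, 0.1, -0.3, 1.2, 0.0],
    [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [-1.5, 0.5, 3.0, 0.0, 0.0, 0.0, 0.0],
])

# Split the concatenated logits into the two label groups
gender_logits = logits[:, :2]
age_logits = logits[:, 2:]

# Each group is normalized separately, so each sums to 1 per example
gender_probs = softmax(gender_logits)
age_probs = softmax(age_logits)

# Factorized joint implied by the marginals: shape (batch, 2, 5).
# A true joint model would instead apply one softmax over 10 logits.
joint = gender_probs[:, :, None] * age_probs[:, None, :]

print(gender_probs.sum(axis=1))
print(age_probs.sum(axis=1))
print(joint.shape)
```

During training, each group would get its own cross-entropy term against its own one-hot slice of the label, and the two terms would be added together.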