How to generate data that fits the normal distribution within each class? - numpy

Using numpy, I need to produce training and test data for a machine learning problem. The model is able to predict three different classes (X, Y, Z). The classes represent the types of patients in multiple clinical trials, and the model should be able to predict the type of patient based on data gathered about the patient (such as blood analysis, blood pressure, previous history, etc.).
From a previous study we know that, in total, the classes are represented with the following distribution, in terms of a percentage of the total patient count per trial:
X - u=7.2, s=5.3
Y - u=83.7, s=15.2
Z - u=9.1, s=2.3
The u/s values are the parameters of N(u, s) for each class (so, across all trials studied, class X had mean 7.2 and variance 5.3). Unfortunately the data set for the study is not available.
How can I recreate a dataset that follows the same distribution over all classes, and within each class, subject to the constraint that X+Y+Z=100 for each record?
It is easy to generate a dataset that follows the overall distribution (the u values), but how do I get a dataset that also has the right distribution within each class?

The problem you have stated is to sample from a mixture distribution. A mixture distribution is just a number of component distributions, each with a weight, such that the weights are nonnegative and sum to 1. Your mixture has 3 components. Each is a Gaussian distribution with the mean and sd you gave. It is reasonable to assume the mixing weights are the proportion of each class in the population. To sample from a mixture, first select a component using the weights as probabilities for a discrete distribution. Then sample from the component. I assume you know how to sample from a Gaussian distribution.
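For illustration, here is a minimal numpy sketch of that two-step recipe, treating each s as a standard deviation and using the class means (which happen to sum to 100) as the mixing weights; the per-record X+Y+Z=100 constraint would still need separate handling:

import numpy as np

rng = np.random.default_rng(0)

# Component parameters from the question, in the order X, Y, Z
means = np.array([7.2, 83.7, 9.1])
sds = np.array([5.3, 15.2, 2.3])

# Mixing weights: average class proportions, normalised to sum to 1
weights = means / means.sum()

n = 10000
# Step 1: pick a component for each sample (a discrete distribution)
components = rng.choice(3, size=n, p=weights)
# Step 2: sample from the chosen Gaussian component
samples = rng.normal(means[components], sds[components])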

Related

test and train good practice wrt summary feature

When one feature of a dataset is a summary statistic of the entire pool of data, is it good practice to include the train data in your test data in order to calculate the feature for validation?
For instance, let's say I have 1000 data points split into 800 entries for training and 200 entries for validation. I create a feature from the 800 training entries, say a rank quartile (it could be anything), which records as 0-3 the quartile that some other feature falls in. So in the training set there will be 200 data points in each quartile.
Once you train the model and need to calculate the feature again for the validation set, (a) do you use the already-set quartile barriers, meaning the 200 validation entries could have a quartile split other than 50-50-50-50, or (b) do you recalculate the quartiles using all 1000 entries, so there is a new quartile-rank feature with 250 entries in each?
Thanks very much
The ideal practice is to calculate the quartiles on the training dataset and then use those barriers on your holdout/validation dataset. To correctly generate model diagnostics for evaluating predictive performance, you do not want the distribution of the testing dataset to influence your model training, because that data will not be available in real life when you apply the model to unseen data.
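As a minimal numpy sketch of that, with made-up data:

import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1000)
train, valid = feature[:800], feature[800:]

# Quartile barriers come from the training data only
barriers = np.quantile(train, [0.25, 0.5, 0.75])

# Rank 0-3 for both sets, using the training barriers in both cases
train_rank = np.digitize(train, barriers)
valid_rank = np.digitize(valid, barriers)  # split need not be 50-50-50-50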
I also think you will find this article useful when thinking about train-test splitting: https://towardsdatascience.com/3-things-you-need-to-know-before-you-train-test-split-869dfabb7e50

Balance Dataset for Tensorflow Object Detection

I currently want to use TensorFlow's Object Detection API for my custom problem.
I already created the dataset, but it's pretty unbalanced.
The Dataset has 3 classes and my main problem is, that one class has about 16k samples and another class has only about 2.5k samples.
So I think I have to balance the dataset. Someone told me that there is something called sample/class weights (not sure if this is 100% correct), which balances the samples for training, so that the biggest class has a smaller per-sample impact on training than the smallest class.
I'm not able to find this method for balancing. Can someone please give me a hint where to start?
You can do normal cross entropy, giving you a ? x 1 tensor X of losses. If you want class number N to count T times more, you can do
X = X * tf.reduce_sum(tf.multiply(one_hot_label, class_weight), axis=1)
tf.multiply scales each label by whatever weight you want, and tf.reduce_sum collapses the label vector to a scalar, so you end up with a ? x 1 tensor filled with the per-sample class weightings. Then you simply multiply the tensor of losses by the tensor of weightings to achieve the desired result.
Since one class is 6.4 times more common than the other, I would apply the weightings 1 and 6.4 to the more common and less common class respectively. This means that every time the less common class occurs, it has 6.4 times the effect of the more common class, so it is as if the model saw the same number of samples from each.
You might want to rescale the weightings so that they add up to the number of classes; this matches the default case where all of the weightings are 1. With two classes that gives 2 x 1/7.4 and 2 x 6.4/7.4.
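Putting the pieces together, a self-contained sketch with toy logits and labels (all of the values here are invented for illustration):

import tensorflow as tf

# Toy batch: 4 samples, 2 classes; class 1 is the rare one
logits = tf.constant([[2.0, 0.5], [0.1, 1.2], [1.5, 0.3], [0.2, 2.0]])
one_hot_label = tf.one_hot([0, 1, 0, 1], depth=2)

# Rare class weighted 6.4x, as suggested above
class_weight = tf.constant([1.0, 6.4])

# Per-sample cross entropy: one loss per sample
X = tf.nn.softmax_cross_entropy_with_logits(labels=one_hot_label, logits=logits)

# Per-sample weight: picks out the weight of each sample's true class
sample_weight = tf.reduce_sum(tf.multiply(one_hot_label, class_weight), axis=1)

loss = tf.reduce_mean(X * sample_weight)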

Error propagation in a Bayesian analysis of a Markov chain

I'm analysing longitudinal panel data, in which individuals transition between different states in a Markov chain. I'm modelling the transition rates between states using a series of multinomial logistic regressions. This means that I end up with a very large number of regression slopes.
For each regression slope, I obtain a posterior distribution (using WinBUGS). From the posterior distribution, we get the mean, standard deviation, and 95% credible interval associated with the slope in question.
The value I am ultimately interested in is the expected first passage time ('hitting time') through the Markov chain. This is a function of all the different predictor variables, and so is built from the many regression slopes produced by the multinomial logistic regressions.
A simple approach would be to take the mean of each posterior distribution as a point-estimate for each regression slope, and solve for the expected first passage time at a series of different values of the predictor variables. I have now done this, but it is potentially misleading because it doesn't show the uncertainty around the predicted values of expected first passage time.
My question is: how can I calculate a credible interval for the expected first passage time?
My first thought was to approximate the error via simulation, by sampling individual values for the regression slopes from each posterior distribution, obtaining the expected first passage time given those values, and then plotting the standard deviation of all these simulated values. However, I feel like (a) this would make a statistician scream and (b) it doesn't take into account the fact that different posterior distributions will be correlated (it samples from each one independently).
In WinBUGS, you can actually obtain the correlations between the posterior distributions. So if the simulation idea is appropriate, I could in theory simulate the regression slope coefficients incorporating these correlations.
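To make the idea concrete, this is the sort of simulation I have in mind, sketched in numpy with faked draws (expected_fpt is just a stand-in for my actual first-passage-time calculation):

import numpy as np

# Joint posterior draws of the slopes, shape (n_draws, n_slopes);
# in practice these would be the raw MCMC samples exported from
# WinBUGS/CODA, which preserve the posterior correlations automatically
rng = np.random.default_rng(0)
draws = rng.normal(size=(4000, 3))

def expected_fpt(slopes):
    # Placeholder for solving the chain's expected first passage time
    # at a given set of predictor values
    return np.exp(slopes).sum()

# Push each joint draw through the derived quantity...
fpt_draws = np.array([expected_fpt(d) for d in draws])

# ...and summarise, e.g. a 95% interval for the first passage time
lo, hi = np.percentile(fpt_draws, [2.5, 97.5])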
Is there a more direct and less approximate way to find the uncertainty? Could I, for instance, use WinBUGS to find the posterior distribution of the expected first passage time for a given set of values of the predictor variables? Rather like the answer to this question: define a new node and monitor it. I would imagine defining a series of new nodes, where each one is for a different set of actual predictor values, and monitoring each one. Does this make good statistical sense?
Any thoughts about this would be really appreciated!

hmmlearn: Using GaussianHMM, how does one calculate the probability of an observation (as opposed to the probability of a state)?

I'm new to the HMM universe. I've followed the tutorials using a GaussianHMM machine learner, and they work, but I was just wondering how I can use the code to display the probability of an observation given the most likely sequence, assuming I have multiple sequences of observations. Thanks
So, for example, if the observations are:
seq1: [1,2,-1,4,2], seq2: [a,v,s,a,f], and the model has 2 states,
once the model predicts the states, how does one calculate the probability of an observed output [1], [a]?

PyMC: How can I describe a state space model?

I used to code my MCMC using C. But I'd like to give PyMC a try.
Suppose X_n is the underlying state, whose dynamics follow a Markov chain, and Y_n is the observed data. In particular,
Y_n has Poisson distribution with mean depending on X_n and a multidimensional unknown parameter theta
X_n | X_{n-1} has distribution depending on theta
How should I describe this model using PyMC?
Another question: I can find conjugate priors for theta but not for X_n. Is it possible to specify which posteriors are updated using conjugate priors and which using MCMC?
Here is an example of a state-space model in PyMC on the PyMC wiki. It basically involves populating a list and allowing PyMC to treat it as a container of PyMC nodes.
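To make that concrete, here is a rough sketch in the PyMC 2 style, with made-up priors, a scalar theta, and an invented log link between X_n and the Poisson mean (none of which come from your post):

import numpy as np
import pymc as pm

# Hypothetical observed counts
y_obs = np.array([3, 5, 4, 6, 2])
T = len(y_obs)

theta = pm.Normal('theta', mu=0.0, tau=1.0)

# Populate a plain Python list of latent states; PyMC treats the list
# as a container of nodes when it is passed to the model
X = [pm.Normal('x0', mu=0.0, tau=1.0)]
for t in range(1, T):
    X.append(pm.Normal('x%d' % t, mu=X[t - 1] + theta, tau=1.0))

# Poisson observations whose mean depends on the current state
mu_y = [pm.Lambda('mu%d' % t, lambda x=X[t]: np.exp(x)) for t in range(T)]
Y = [pm.Poisson('y%d' % t, mu=mu_y[t], value=y_obs[t], observed=True)
     for t in range(T)]

model = pm.MCMC([theta] + X + mu_y + Y)
model.sample(iter=10000, burn=5000)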
As for the second part of the question, you could certainly calculate some of your conjugate posteriors ahead of time and put them into the model. For example, if you observed binomial data x=4, n=10, you could insert a Beta node p = Beta('p', 5, 7) to represent that posterior (it's really just a prior as far as the model is concerned, but it is the posterior given the data x). PyMC would then draw a sample from this posterior at every iteration, to be used wherever it is needed in the model.
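For example, assuming a flat Beta(1, 1) prior, the conjugate arithmetic behind that node is:

import pymc as pm

# Beta(1, 1) prior updated with x = 4 successes in n = 10 trials:
# posterior is Beta(1 + 4, 1 + (10 - 4)) = Beta(5, 7)
p = pm.Beta('p', alpha=5, beta=7)

# Any node built on p now sees a draw from this precomputed
# posterior at every MCMC iteration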