Calculating RMSEA, SRMR, and CFI after clustering? - model-fitting

I have completed clustering of data on individuals into different groups using the packages "mclust" and "tidyLPA" which uses mclust for the clustering. The clustering method I did is a latent profile analysis which is a type of mixture model. My indicators are continuous and my outcome latent variable is class membership. However, the panel of fit indices I get from the package only has AIC, BIC, and some others I'm not including. I would like to include RMSEA, CFI, and SRMR.
I have my data with each indicator in a separate column, and class membership in a single column. What can I do with this dataset to get the remaining fit indices I need? Any packages or functions?

Related

A huge number of discrete features

I'm developing a regression model. But I ran into a problem when preparing the data. 17 out of 20 signs are categorical, and there are a lot of categories in each of them. Using one-hot-encoding, my data table is transformed into a 10000x6000 table. How should I prepare this type of data?
I used PCA, trying to reduce the dimension, but even 70% of the variance is in 2500 features. That's why I joined.
Unfortunately, I can't attach the dataset, as it is confidential
How do I prepare the data to achieve the best results in the learning process?
Can the data be mapped more accurately in a non-linear manner? If so, you might want to try using an autoencoder for dimensionality reduction.
One thing to note about PCA is that it computes an orthogonal projection of the data into linear space. This means that it only gives a linear mapping of the data. Autoencoders, on the other hand, can give you a non-linear mapping, and so is able to represent a greater amount of variance in the data in fewer dimensions. Just be sure to use non-linear activation functions in your autoencoder architecture.
It really depends on exactly what you are trying to do. Getting a covariance matrix (and also PCA decomp.) will give you great insight about which classes tend to come together (and this requires one-hot encoded categories), but training a model off of that might be problematic.
In general, it really depends on the model you want to use.
One option would be a random forest. They can definitely be used for regression, though they need to be trained specifically for that. SKLearn has a class just for this:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
The benifits of random forest is that it is great for tabular data (as is the case here), and can easily be trained using numerical values for class features, meaning your data vector can only be of dimension 20!
Decision tree models (such as random forest) are being shown to out-preform deep-learning in many cases, and this may be one of them.
TLDR; If you use a random forest, it can take learn even with numerical values for categories, and you can avoid creating incredibly large vectors for data.

how to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train off of it.
I want the pandas dataframe to look something like this
Where each tournament has team members constantly shifting teams.
And based on the inputted teammates, the model makes a prediction on the team's position. Anyone have any suggestions on how I can make a pandas dataframe like this that a model can use as trainnig data? I'm completely stumped. Thanks in advance!
Coming on to the question as to how to create this sheet, you can easily get the data and store in the format you described above. The trick is in how to use it as training data to your model. We need to convert it in numerical form to be able to be used as training data to any model. As we know that the max team size is 3 in most cases, we can divide the three names in three columns (keep the column blank, if there are less than 3 members in the team). Now we can either use Label encoding or One-hot encoding to convert the names to numbers. You should create a combined list of all three columns to fit a LabelEncoder and then use transform function individually on each column (since the names might be shared in these 3 columns). On label encoding, we can easily use tree based models. One-hot encoding might lead to curse of dimensionality as there will be many names, so I would prefer not to use it for an initial simple model.

Multiple trained models vs Multple features and one model

I'm trying to build a regression based M/L model using tensorflow.
I am trying to estimate an object's ETA based on the following:
distance from target
distance from target (X component)
distance from target (Y component)
speed
The object travels on specific journeys. This could be represented as from A->B or from A->C or from D->F (POINT 1 -> POINT 2). There are 500 specific journeys (between a set of points).
These journeys aren't completely straight lines, and every journey is different (ie. the shape of the route taken).
I have two ways of getting around this problem:
I can have 500 different models with 4 features and one label(the training ETA data).
I can have 1 model with 5 features and one label.
My dilemma is that if I use option 1, that's added complexity, but will be more accurate as every model will be specific to each journey.
If I use option 2, the model will be pretty simple, but I don't know if it would work properly. The new feature that I would add are originCode+ destinationCode. Unfortunately these are not quantifiable in order to make any numerical sense or pattern - they're just text that define the journey (journey A->B, and the feature would be 'AB').
Is there some way that I can use one model, and categorize the features so that one feature is just a 'grouping' feature (in order separate the training data with respect to the journey.
In ML, I believe that option 2 is generally the better option. We prefer general models rather than tailoring many models to specific tasks, as that gets dangerously close to hardcoding, which is what we're trying to get away from by using ML!
I think that, depending on the training data you have available, and the model size, a one-hot vector could be used to describe the starting/end points for the model. Eg, say we have 5 points (ABCDE), and we are going from position B to position C, this could be represented by the vector:
0100000100
as in, the first five values correspond to the origin spot whereas the second five are the destination. It is also possible to combine these if you want to reduce your input feature space to:
01100
There are other things to consider, as Scott has said in the comments:
How much data do you have? Maybe the feature space will be too big this way, I can't be sure. If you have enough data, then the model will intuitively learn the general distances (not actually, but intrinsically in the data) between datapoints.
If you have enough data, you might even be able to accurately predict between two points you don't have data for!
If it does come down to not having enough data, then finding representative features of the journey will come into use, ie. length of journey, shape of the journey, elevation travelled etc. Also a metric for distance travelled from the origin could be useful.
Best of luck!
I would be inclined to lean toward individual models. This is because, for a given position along a given route and a constant speed, the ETA is a deterministic function of time. If one moves monotonically closer to the target along the route, it is also a deterministic function of distance to target. Thus, there is no information to transfer from one route to the next, i.e. "lumping" their parameters offers no a priori benefit. This is assuming, of course, that you have several "trips" worth of data along each route (i.e. (distance, speed) collected once per minute, or some such). If you have only, say, one datum per route then lumping the parameters is a must. However, in such a low-data scenario, I believe that including a dummy variable for "which route" would ultimately be fruitless, since that would introduce a number of parameters that rivals the size of your dataset.
As a side note, NEITHER of the models you describe could handle new routes. I would be inclined to build an individual model per route, data quantity permitting, and a single model neglecting the route identity entirely just for handling new routes, until sufficient data is available to build a model for that route.

Using embedded columns

I'm trying to understand the TensorFlow tutorial on wide and deep learning. The demo application creates indicator columns for categorical features with few categories (gender, education), and it creates embedded columns for categorical features with many categories (native_country, occupation).
I don't understand embedded columns. Is there a rule that clarifies when to use embedded columns instead of indicator columns? According to the documentation, the dimension parameter sets the dimension of the embedding. What does that mean?
From the feature columns tutorial:
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
The dimension parameter is the length of the vector you're reducing the categories to.

How to generate data that fits the normal distribution within each class?

Using numpy, I need to produce training and test data for a machine learning problem. The model is able to predict three different classes (X,Y,Z). The classes represent the types of patients in multiple clinical trials, and the model should be able to predict the type of patient based on data gathered about the patient (such as blood analysis and blood pressure, previous history etc.)
From a previous study we know that, in total, the classes are represented with the following distribution, in terms of a percentage of the total patient count per trial:
X - u=7.2, s=5.3
Y - u=83.7, s=15.2
Z - u=9.1, s=2.3
The u/s describe the distribution in N(u, s) for each class (so, for all trials studied, class X had mean 7.2 and variance 5.3). Unfortunately the data set for the study is not available.
How can I recreate a dataset that follows the same distribution over all classes, and within each class, subject to the constraint of X+Y+Z=100 for each record.
It is easy to generate a dataset that follows the overall distribution (the u values), but how do I get a dataset that has the same distribution per each class?
The problem you have stated is to sample from a mixture distribution. A mixture distribution is just a number of component distributions, each with a weight, such that the weights are nonnegative and sum to 1. Your mixture has 3 components. Each is a Gaussian distribution with the mean and sd you gave. It is reasonable to assume the mixing weights are the proportion of each class in the population. To sample from a mixture, first select a component using the weights as probabilities for a discrete distribution. Then sample from the component. I assume you know how to sample from a Gaussian distribution.