I'm trying to fit a multilevel structural equation model on a data set with observations from 32 different countries, clustering the model by country. The model runs, but the output says there were only 29 clusters. Is there a way to check which clusters are being dropped and where listwise deletion is removing cases?
fit.3b <- sem(mod3, data=data_merge, meanstructure=TRUE, std.lv=TRUE, sampling.weights="WEIGHT", cluster = "country", optim.method = "em")
summary(fit.3b, fit.measures=TRUE, estimates=TRUE)
I was expecting 32 clusters in the output. I had already removed countries with missing values on the exogenous variables.
For any fitted model, you can extract the vector of cluster IDs in that model:
library(lavaan)
example(Demo.twolevel)
lavInspect(fit, "cluster.id")
To extract the missing cluster(s), you could use setdiff() to compare that to the unique() values in your data's cluster-ID variable.
setdiff(unique(Demo.twolevel$cluster), # what's in here...
lavInspect(fit, "cluster.id")) # but not in here?
When using GridSearchCV() to perform a k-fold cross-validation analysis on some data, is there a way to know which data was used for each split?
For example, assume the goal is to build a binary classifier of your choosing, named 'model'. There are 100 data points (rows) with 5 features each and an associated 1 or 0 target. 20 of the 100 data points are held out for testing after training and hyperparameter tuning, so GridSearchCV never sees those 20 points. The other 80 rows are passed to the estimator as X and Y, so GridSearchCV only sees 80 rows of data. Various hyperparameters are tuned and laid out in the param_grid variable. For this case, the cross-validation parameter cv is set to 3, as shown:
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3)
grid_result = grid.fit(X, Y)
Is there a way to see which data was used as the training data and which as the validation data for each fold? Maybe by seeing which indices were used for the split?
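One way to make the folds inspectable (a sketch, not the only way): pass an explicit splitter object to GridSearchCV instead of the integer cv=3, and then enumerate its split() yourself. Here model, param_grid, X and Y are the objects from the question.

from sklearn.model_selection import GridSearchCV, StratifiedKFold

# an explicit, reproducible splitter instead of cv=3
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=cv)
grid_result = grid.fit(X, Y)

# the same splitter reproduces the exact row indices used in each fold
for fold, (train_idx, val_idx) in enumerate(cv.split(X, Y)):
    print(f"fold {fold}: train rows {train_idx}, validation rows {val_idx}")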
I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas DataFrame to look something like this, where team members are constantly shifting teams between tournaments.
Based on the inputted teammates, the model should predict the team's position. Does anyone have suggestions on how I can build a pandas DataFrame like this that a model can use as training data? I'm completely stumped. Thanks in advance!
As for how to create this sheet: you can easily collect the data and store it in the format you described above. The trick is in how to use it as training data for your model; it has to be converted to numerical form before it can be fed to any model.
Since the maximum team size is 3 in most cases, we can split the three names into three columns (leaving a column blank if the team has fewer than 3 members). We can then use either label encoding or one-hot encoding to convert the names to numbers. You should create a combined list of all three columns to fit a LabelEncoder, and then call its transform function on each column individually (since the same names may appear in any of the 3 columns). With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, since there will be many names, so I would prefer not to use it for an initial simple model.
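A rough sketch of that encoding step, with hypothetical column names member1/member2/member3 for the three team-member slots (these names are not from the question):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy data in the three-column layout described above
df = pd.DataFrame({
    "member1": ["alice", "bob", "carol"],
    "member2": ["dave", "erin", "bob"],
    "member3": ["frank", None, "alice"],   # blank slot for a 2-person team
    "position": [1, 3, 2],
})
member_cols = ["member1", "member2", "member3"]

# fit one encoder on the combined pool of names so the same player
# gets the same code regardless of which column they appear in
all_names = pd.concat([df[c] for c in member_cols]).fillna("NONE")
encoder = LabelEncoder().fit(all_names)

for c in member_cols:
    df[c] = encoder.transform(df[c].fillna("NONE"))

print(df)   # numeric columns ready for a tree-based model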
I have a case with missing data on endogenous variables, and the endogenous variables are binary/ordinal.
Model 1 below represents it and runs perfectly.
However, as written it assumes that my variables are continuous (they are actually all ordinal/binary), and it does not include the calculation of the indirect effects.
When I try to adjust it (as shown in Model 2) to account for these two things, it says the FIML estimator cannot be used with categorical data (so it excludes all the cases with missing data). Furthermore, the resulting output does not even include standard deviations.
Can anyone help me figure out how to model this?
Thanks in advance
# Model 1
model1 <-'Importance~Seats+PriceRange
Measurement~Importance
Prekitchen~Importance+Measurement
Kitchen~Importance+Measurement
Postkitchen~Importance+Measurement
# Means are mentioned below so that all the information is used, bypassing listwise deletion
Seats~1
PriceRange~1'
library(lavaan)
library(semPlot)   # for semPaths()
fit <- lavaan(model1, data=Mediate, missing="fiml")
summary(fit, fit.measures=TRUE)
semPaths(fit)
# Model2
model2 <- 'Importance~Seats+PriceRange
# Including the paths to calculate the indirect effects
Measurement~a*Importance
Prekitchen~b*Measurement
Prekitchen~c*Importance
Kitchen~d*Measurement
Kitchen~e*Importance
Postkitchen~f*Measurement
Postkitchen~g*Importance
# Indirect effects exerted by Importance
ab:=a*b
totalPre:=c+(a*b)
ad:=a*d
totalKit:=e+(a*d)
af:=a*f
totalPost:=g+(a*f)
Seats~1
PriceRange~1'
# Including the variable type "Ordered" for all the categorical variables.
fit2 <- sem(model2, data=Mediate, missing="fiml", ordered=c("Importance", "Measurement", "Prekitchen", "Kitchen", "Postkitchen"))
summary(fit2, fit.measures=TRUE)
semPaths(fit2)
P.S.: I already tried Mplus, but the problem there is that for such a model there are no goodness-of-fit indices.
Have you tried doing the imputation before fitting your model? I think that would spare you the FIML part of your code.
I usually use the MICE package to do this:
install.packages("mice")
library(mice)
md.pattern(data)
If you want to take a closer look, here is a useful paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3074241/
Also, try using the function sem instead of lavaan and including the argument ordered, to indicate your data is categorical.
e.g.: fit.1 <- sem(cat.1, data=data, std.lv=TRUE, ordered=names[1:n])
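A minimal sketch of that suggestion, combining mice with the ordered argument. Mediate and model2 are the objects from the question; fitting only a single completed data set (rather than pooling over all imputations, e.g. with the semTools package) is a simplification.

library(mice)
library(lavaan)

# impute the missing values first, so FIML is no longer needed
imp <- mice(Mediate, m = 5, seed = 123)
completed <- complete(imp, 1)

# declare the endogenous variables as ordered; lavaan then uses a
# categorical estimator (WLSMV) instead of FIML
fit2 <- sem(model2, data = completed,
            ordered = c("Importance", "Measurement",
                        "Prekitchen", "Kitchen", "Postkitchen"))
summary(fit2, fit.measures = TRUE)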
I am trying to cluster sentences by clustering their sentence embeddings taken from a fastText model. Each sentence embedding has 300 dimensions, and I want to reduce them to 50 (say). I have tried t-SNE, PCA, and UMAP, and I wanted to see how an autoencoder works for my data.
Does it make sense to pass those 300 numbers for each sentence to the NN as separate features, or should they be passed as a single entity? If the latter, is there a way to pass a list as a feature to a NN?
I tried passing the 300 numbers as individual features and clustering on the output. I could get very few meaningful clusters; the rest were either noise or groups of dissimilar sentences (whereas with other techniques like UMAP I could get far more, and more meaningful, clusters). Any leads would be helpful. Thanks in advance :)
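In case a concrete starting point helps, here is a rough Keras sketch of the usual setup: each 300-dimensional embedding is one input row (so the 300 numbers are individual input features, not a single list-valued feature), and the 50-unit bottleneck becomes the reduced representation. The embeddings array is only a placeholder for your fastText sentence vectors.

import numpy as np
from tensorflow import keras

embeddings = np.random.rand(1000, 300).astype("float32")  # stand-in for the fastText vectors

inputs = keras.Input(shape=(300,))
encoded = keras.layers.Dense(128, activation="relu")(inputs)
bottleneck = keras.layers.Dense(50, activation="relu")(encoded)    # 50-d code
decoded = keras.layers.Dense(128, activation="relu")(bottleneck)
outputs = keras.layers.Dense(300, activation="linear")(decoded)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, bottleneck)

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(embeddings, embeddings, epochs=20, batch_size=64, verbose=0)

reduced = encoder.predict(embeddings)   # shape (n_sentences, 50), cluster on this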
I am trying to use GAMS to find the flow of material across a network of nodes. I defined:
set edge(i,n,nn);
positive variable y(i,n,nn);
y.up(i,n,nn)$( not edge(i,n,nn)) = 0;
My intention is to define a 3D matrix of variables for the flux of material i from node n to node nn, and then use the set edge to specify which arcs of the complete graph can carry flow.
This appears to be working, but when I try to save y to a GDX file, I get lots and lots of zeros. I only need the subset of y where edge(i,n,nn) is true.
How can I subset y when saving the GDX file?
Thanks!
You could store things in a reduced parameter:
Parameter yLevel(i,n,nn);
yLevel(i,n,nn)$edge(i,n,nn) = y.l(i,n,nn);
execute_unload 'result.gdx' yLevel;
Just a note: do you really need the complete y(i,n,nn)? This could be huge depending on the size of the indexing sets. Or could you alternatively modify your model to use only y(i,n,nn)$edge(i,n,nn)?