How to use multiply imputed data for further analysis in SVM and ANN? - missing-data

My original data contains some missing values and I used multiple imputation to fill them in. My next objective is to use these data in SVM and ANN. I originally thought MI would give me a "pooled" completed dataset, but it turned out that MI only gives pooled analysis results across the imputed datasets. So my questions are:
1) Is there any way, such as an equation, that I can use to aggregate the imputed datasets into one dataset and use it for further analysis;
2) If not, how should I proceed with my study using the multiple datasets?
Thank you!

This is a common misunderstanding about MI.
The general process is supposed to be like this:
Multiple Imputation
Analysis for each imputed dataset
Pooling
If you do the imputation and then merge all imputed datasets into a single dataset, you lose all the benefits of MI; in that case you could just as well have used any single-imputation method. The idea is to perform your analysis, for example, 5 times, once for each imputed dataset, because you want to account for the different outcomes your analysis could have had with different imputed input datasets. Afterwards you pool/merge the results of your analyses.
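For quantities like means or regression coefficients, the standard pooling equations are Rubin's rules: the pooled estimate is the mean of the per-dataset estimates, and the total variance adds the average within-imputation variance to the between-imputation variance. A minimal sketch with hypothetical numbers (it assumes your analysis yields one point estimate and one standard error per imputed dataset):

import numpy as np

# Hypothetical results from m = 5 per-dataset analyses: point estimates
# and their squared standard errors (within-imputation variances)
estimates = np.array([0.52, 0.47, 0.55, 0.49, 0.51])
variances = np.array([0.010, 0.012, 0.009, 0.011, 0.010])

m = len(estimates)
pooled_estimate = estimates.mean()                   # mean of the m estimates
within_var = variances.mean()                        # W: average within-imputation variance
between_var = estimates.var(ddof=1)                  # B: between-imputation variance
total_var = within_var + (1 + 1 / m) * between_var   # Rubin's total variance
pooled_se = np.sqrt(total_var)
print(pooled_estimate, pooled_se)

Note that this pools analysis results, not the data itself.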
The whole process is not so common in ML, but in your case you could, for example, train an SVM on each of the 5 datasets and then compare the results or come up with a procedure to merge/combine them.
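A minimal sketch of that procedure (using scikit-learn's IterativeImputer as a stand-in for MICE-style imputation; averaging predicted probabilities is just one reasonable way to combine the models):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.svm import SVC

# Toy data with roughly 10% of the entries set to missing
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

m = 5  # number of imputations
imputers = [IterativeImputer(sample_posterior=True, random_state=s) for s in range(m)]
models = [SVC(probability=True, random_state=0).fit(imp.fit_transform(X), y)
          for imp in imputers]

# Pool at prediction time: complete the new rows under each imputation model,
# then average the predicted class probabilities across the m SVMs
X_new = X[:10]
probs = np.mean([mdl.predict_proba(imp.transform(X_new))
                 for mdl, imp in zip(models, imputers)], axis=0)
pooled_pred = probs.argmax(axis=1)

A majority vote over the m predicted labels would be another defensible pooling rule.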

Related

Predict a nonlinear array based on 2 features with scalar values using XGBoost or equivalent

So I have been looking at XGBoost as a place to start with this; however, I am not sure of the best way to accomplish what I want.
My data is set up something like this, where every value, whether it be input or output, is numerical. The issue I'm facing is that I only have 3 input data points per several output data points.
I have seen that XGBoost has a multi-output regression method; however, I have only really seen it used to predict around 2 outputs per 1 input, whereas my data may have upwards of 50 output points needing to be predicted from only a handful of scalar input features.
I'd appreciate any ideas you may have.
For reference, I've been looking mainly at these two demos (they are the same idea; just one uses scikit-learn and the other XGBoost):
https://machinelearningmastery.com/multi-output-regression-models-with-python/
https://xgboost.readthedocs.io/en/stable/python/examples/multioutput_regression.html
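Not from the linked demos, but one possible starting point: scikit-learn's MultiOutputRegressor simply fits one XGBoost model per target, so predicting 50 outputs from 3 scalar features is structurally straightforward (whether it predicts well depends on the data). A minimal sketch with toy data:

import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))    # 3 scalar input features per sample
Y = rng.normal(size=(500, 50))   # 50 output values per sample (toy targets)

# One XGBoost regressor is trained independently for each of the 50 outputs
model = MultiOutputRegressor(XGBRegressor(n_estimators=100, max_depth=3))
model.fit(X, Y)
Y_pred = model.predict(X[:5])    # shape (5, 50)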

sjmisc::merge_imputations() averages across imputed datasets, which seems unjustified?

The sjmisc package has a function sjmisc::merge_imputations()
This function merges multiple imputed data frames from mice::mids()-objects into a single data frame by computing the mean or selecting the most likely imputed value.
I think this is what Stef van Buuren cautions against in section 5.1.2, "Not recommended workflow: Averaging the data"?
the procedure ignores the between-imputation variability, and hence shares all the drawbacks of single imputation
Instead, he advocates for mice::with() and mice::pool().
So when might one use sjmisc::merge_imputations() ?
If:
1) The researcher only cares about means, not about correlations or other more complicated relationships between variables, or is willing to assume that the imputation models were "true" models.
2) The researcher only cares about point estimates and is less concerned about the uncertainty in those estimates (variance, standard errors, confidence intervals, hypothesis tests, coefficients of variation).
3) There is only a small amount of missing data.
Then averaging the imputed values can be a reasonable fix. Averaging the imputed values is basically a version of "stochastic regression imputation". Note, though, that as the number of imputations increases, the average of the imputed values converges to simple regression imputation. It's still wrong, but it may be a practical method. The sjmisc package documentation cites Burns et al. (2011), https://doi.org/10.1016/j.jclinepi.2010.10.011. From that article:
There were practical benefits in providing DYNOPTA investigators an averaged imputation score as it precludes the necessity for investigators to run MICE for different projects using the MMSE, the need to obtain software capable of combining and analyzing multiple imputed datasets, and many investigators are unfamiliar with MI analysis techniques.
Compare also van Buuren 1.3.5
If you have the ability to use proper pooling methods I would recommend using those instead.
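As a tiny numeric illustration of the convergence claim above (all numbers hypothetical): each stochastic imputation is a regression prediction plus noise, so the average of many imputations collapses toward the bare regression prediction.

import numpy as np

rng = np.random.default_rng(42)
regression_prediction = 2.0   # hypothetical fitted value for one missing cell
residual_sd = 1.0             # noise added by stochastic imputation

for m in (5, 50, 5000):
    draws = regression_prediction + rng.normal(0, residual_sd, size=m)
    print(m, draws.mean())    # approaches 2.0: the between-imputation noise averages out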

How to handle skewed categorical data for multiclass-classification task?

I want to know how to handle skewed data that contains a particular column with multiple categorical values. Some of these values have much higher value_counts() than others.
As you can see in this data, the values greater than 7 have value counts far lower than the others. How should I handle this kind of skewed data? (This is not the target variable; I am asking about a skewed independent variable.)
I tried changing these smaller-count values to a single particular value (-1). That way the count of -1 became comparable to the other values. But training a classification model on this data will affect the accuracy.
Oversampling techniques for minority classes/categories may not work well in many scenarios. You could read more about them here.
One thing you could do is to assign different weights to samples from different classes in your model's loss function, inversely proportional to their frequencies. This ensures that even classes with few data points affect the model's loss as much as classes with a large number of data points.
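As a minimal sketch of that weighting idea (the model and data here are hypothetical), scikit-learn's 'balanced' mode computes weights inversely proportional to class frequencies:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
y_train = rng.choice([0, 1, 2], size=1000, p=[0.8, 0.15, 0.05])  # skewed classes

# 'balanced' makes each sample's weight inversely proportional to its
# class frequency, so rare classes contribute as much to the loss as common ones
w = compute_sample_weight(class_weight='balanced', y=y_train)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train, sample_weight=w)

Many scikit-learn classifiers also accept class_weight='balanced' directly in the constructor.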
You could share more details about the dataset or the specific model that you are using, to get more specific suggestions/solutions.

how to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas dataframe to look something like this
Where each tournament has team members constantly shifting teams.
And based on the inputted teammates, the model makes a prediction of the team's position. Does anyone have suggestions on how I can make a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Getting the data and storing it in the format you described is the easy part; the trick is in how to use it as training data for your model. It needs to be converted into numerical form before any model can consume it.
Since the maximum team size is 3 in most cases, we can split the three names into three columns (leaving a column blank if a team has fewer than 3 members). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call its transform function on each column individually (since names may be shared across the 3 columns). With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, since there will be many names, so I would prefer not to use it for an initial simple model. A sketch of this encoding is below.
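For illustration, a minimal sketch of that scheme (all column and player names are hypothetical):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'member1': ['alice', 'bob', 'carol'],
    'member2': ['dave', 'erin', 'bob'],
    'member3': ['frank', None, 'alice'],  # blank slot for a 2-person team
    'position': [1, 3, 2],
})

cols = ['member1', 'member2', 'member3']
all_names = pd.unique(df[cols].fillna('NONE').values.ravel())

enc = LabelEncoder().fit(all_names)  # one shared vocabulary across the 3 columns
for c in cols:
    df[c] = enc.transform(df[c].fillna('NONE'))  # same name -> same integer everywhere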

How to Set the Same Categorical Codes to Train and Test data? Python-Pandas

NOTE:
If someone else is wondering about this topic: I asked this question while getting deeper into the data analysis world, and learned that:
You encode categorical values as INTEGERS only if you're dealing with ordinal classes, e.g. college degree or customer satisfaction surveys.
Otherwise, if you're dealing with nominal classes like gender, colors, or names, you MUST convert them with other methods, since they do not specify any numerical order; the best-known are one-hot encoding and dummy variables.
I encourage you to read more about them and hope this has been useful; a tiny example follows after the link.
Check the link below to see a nice explanation:
https://www.youtube.com/watch?v=9yl6-HEY7_s
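For instance, a minimal sketch of one-hot encoding a nominal column with pandas (hypothetical data):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
dummies = pd.get_dummies(df['color'], prefix='color')  # one 0/1 column per category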
This may be a simple question but I think it can be useful for beginners.
I need to run a prediction model on a test dataset, so to convert the categorical variables into categorical codes that can be handled by the random forest model, I use these lines for all of them:
Train:
# cast to pandas 'category' dtype, then replace each category with its integer code
data_['Col1_CAT'] = data_['Col1'].astype('category')
data_['Col1_CAT'] = data_['Col1_CAT'].cat.codes
So, before running the model I have to apply the same procedure to both, the Train and Test data.
And since both datasets have the same categorical variables/columns, I think it will be useful to apply the same categorical codes to each column respectively.
However, although I'm handling the same variables in each dataset, I get different codes every time I use these two lines.
So, my question is: how can I get the same codes every time I convert the same categorical variables in each dataset?
Thanks for your insights and feedback.
Usually, the way I do this is to do the categorical conversions before the train-test split, so that I get a neatly transformed dataset. Once I have done that, I do the train-test split, train the model, and test it on the test set.
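Alternatively, if the split already exists, one way (not the only one) to keep the codes aligned is to build a single CategoricalDtype from the training column and apply it to both frames, so .cat.codes is identical in train and test. A minimal sketch with hypothetical values:

import pandas as pd

train = pd.DataFrame({'Col1': ['red', 'green', 'blue']})
test = pd.DataFrame({'Col1': ['green', 'red', 'purple']})  # 'purple' unseen in train

# One shared dtype, built from the training categories only
dtype = pd.CategoricalDtype(categories=sorted(train['Col1'].unique()))
train['Col1_CAT'] = train['Col1'].astype(dtype).cat.codes
test['Col1_CAT'] = test['Col1'].astype(dtype).cat.codes  # unseen 'purple' becomes -1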