Imputation of missing data

I would like to know: if I need to impute missing data for only some specific variables, can I impute the whole dataset and then integrate just those specific variables back into the original dataset?
Thanks for helping me.
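A minimal sketch of that idea, assuming pandas and scikit-learn's IterativeImputer (the column names here are made up): impute the full numeric frame so every column can inform the imputation model, then copy back only the variables you actually wanted imputed.

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: only 'age' should end up imputed in the original frame
df = pd.DataFrame({
    "age":    [25, None, 40, 35],
    "income": [50000, 62000, None, 58000],
    "score":  [0.7, 0.9, 0.4, None],
})

# Impute the whole numeric dataset so every column informs the imputation
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

# Integrate only the specific variable(s) back into the original dataset
df_out = df.copy()
df_out["age"] = imputed["age"]
print(df_out)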

Related

Is it possible to mask individual features in tensorflow?

I have a large quantity of missing values that appear at random in my data. Unfortunately, I cannot simply drop observations with missing data as I am grouping observations by a feature and cannot drop NaNs without affecting the entire group.
I was hoping to simply mask features that were missing. So a single group might have 8 items in it, and each item may have 0 to N features, depending on how many got masked due to being missing.
I have been experimenting a lot with RaggedTensors, but have run into a number of issues: not being able to flatten the RaggedTensor, not being able to concatenate it with regular tensors of uniform shape, and Dense layers requiring the last dimension of their input (i.e. the number of features) to be known.
Does anybody know if there is a way to do this?
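One common workaround, not from this thread but sketched here assuming the features are numeric, is to keep the tensors rectangular: zero-fill the missing entries and pass a binary missing-indicator mask alongside the features, so a Dense layer still sees a known last dimension. The shapes below are made up.

import numpy as np
import tensorflow as tf

# Hypothetical toy batch: 2 items with 4 features each, NaN marks a missing feature
x = np.array([[1.0, np.nan, 3.0, 0.5],
              [np.nan, 2.0, np.nan, 1.5]], dtype=np.float32)

mask = (~np.isnan(x)).astype(np.float32)   # 1 where observed, 0 where missing
x_filled = np.nan_to_num(x, nan=0.0)       # zero-fill keeps the tensor rectangular

# Concatenate features and mask so the Dense layer has a fixed last dimension (4 + 4)
inputs = tf.concat([x_filled, mask], axis=-1)
out = tf.keras.layers.Dense(16, activation="relu")(inputs)
print(out.shape)   # (2, 16)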

How to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train off of it.
I want the pandas dataframe to look something like this, where each tournament has team members constantly shifting teams.
Based on the inputted teammates, the model should make a prediction of the team's position. Does anyone have suggestions on how I can make a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Coming to the question of how to create this sheet: you can easily get the data and store it in the format you described above. The trick is in how to use it as training data for your model; we need to convert it into numerical form before any model can use it. Since the max team size is 3 in most cases, we can split the names into three columns (leaving a column blank if the team has fewer than 3 members). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call its transform function on each column individually (since names may be shared across the 3 columns), as sketched below. With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, since there will be many names, so I would not use it for an initial simple model.
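A minimal sketch of that encoding step, assuming scikit-learn's LabelEncoder; the team members and column names are made up:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical teams: up to 3 members, blank if the team is smaller
df = pd.DataFrame({
    "member1":  ["alice", "bob",   "carol"],
    "member2":  ["dave",  "erin",  ""],
    "member3":  ["",      "frank", ""],
    "position": [1, 3, 2],
})
member_cols = ["member1", "member2", "member3"]

# Fit one encoder on the combined pool of names so a shared name
# gets the same code regardless of which column it appears in
le = LabelEncoder()
le.fit(pd.concat([df[c] for c in member_cols]))

for c in member_cols:
    df[c] = le.transform(df[c])
print(df)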

How to Set the Same Categorical Codes to Train and Test data? Python-Pandas

NOTE:
If someone else is wondering about this topic: I understand you're getting deeper into the Data Analysis world, so I asked this question before learning that:
You encode categorical values as INTEGERS only if you're dealing with Ordinal Classes, e.g. college degree or customer satisfaction surveys.
Otherwise, if you're dealing with Nominal Classes like gender, colors or names, you MUST convert them with other methods, since they do not specify any numerical order; the best known are One-hot Encoding and dummy variables.
I encourage you to read more about them and hope this has been useful.
Check the link below to see a nice explanation:
https://www.youtube.com/watch?v=9yl6-HEY7_s
This may be a simple question but I think it can be useful for beginners.
I need to run a prediction model on a test dataset, so to convert the categorical variables into categorical codes that can be handled by the random forests model I use these lines with all of them:
Train:
data_['Col1_CAT'] = data_['Col1'].astype('category')
data_['Col1_CAT'] = data_['Col1_CAT'].cat.codes
So, before running the model I have to apply the same procedure to both the Train and Test data.
And since both datasets have the same categorical variables/columns, I think it would be useful to apply the same categorical codes to each column respectively.
However, although I'm handling the same variables in each dataset, I get different codes every time I use these two lines.
So, my question is: how can I get the same codes every time I convert the same categorical variables in each dataset?
Thanks for your insights and feedback.
Usually, the way I do this is to do the categorical conversions before the train-test split, so that I get one consistently transformed dataset. Once that is done, I do the train-test split, train the model, and evaluate it on the test set, as in the sketch below.
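A minimal sketch of that approach, assuming scikit-learn's train_test_split; the column names and values are made up. Because the codes are assigned once, on the full dataset, both halves share the same mapping:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with one categorical column
data_ = pd.DataFrame({
    "Col1":   ["red", "blue", "green", "blue", "red", "green"],
    "target": [0, 1, 0, 1, 0, 1],
})

# Encode once, before splitting, so train and test share the same codes
data_["Col1_CAT"] = data_["Col1"].astype("category").cat.codes

train, test = train_test_split(data_, test_size=0.33, random_state=42)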

Correlation between continuous independent variable and binary class dependent variable

Can someone please tell me whether it's correct to compute the correlation between a dependent variable with binary classes (0 or 1) and independent variables with continuous values using pandas df.corr()?
I do get correlation output when I use it, but I want to understand whether it's statistically correct to compute Pearson correlation (via df.corr()) between a binary categorical output and continuous input variables.
Pearson correlation is meant for continuous data. If one variable is continuous and the other is binary/categorical, you should use ANOVA to assess the relation between the variables (reference); a sketch follows below.
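A minimal sketch of that check, assuming scipy and a hypothetical DataFrame with a binary column y and a continuous column x:

import pandas as pd
from scipy import stats

# Hypothetical data: binary outcome y, continuous predictor x
df = pd.DataFrame({
    "y": [0, 0, 0, 1, 1, 1],
    "x": [1.2, 0.8, 1.0, 2.3, 2.7, 2.1],
})

# One-way ANOVA: does the mean of x differ between the two classes of y?
f_stat, p_value = stats.f_oneway(df.loc[df["y"] == 0, "x"],
                                 df.loc[df["y"] == 1, "x"])
print(f_stat, p_value)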

How to use multiple imputed data for further analysis in SVM and ANN?

My original data contains some missing values and I used multiple imputation to fill them. My next objective is to use these data in SVM and ANN. I originally thought MI would give me a "pooled" completed dataset but it turned out that MI only gives pooled analysis results regarding the imputed datasets. So my questions are:
1) Is there any way, e.g. some equation, I can use to aggregate the imputed datasets into one dataset and use it for further analysis?
2) If not, how should I proceed with my study using the multiple datasets?
Thank you!
This is a general misunderstanding about MI.
The general process is supposed to be like this:
Multiple Imputation
Analysis for each imputed dataset
Pooling
If you were to do the imputation and then merge all the imputed datasets into one dataset, you would lose all the benefit of MI; you could then have just used any other imputation method. The idea is to perform your analysis, for example, 5 times, once for each imputed dataset, because you want to account for the different outcomes your analysis could have had with different imputed input datasets. Afterwards you pool / merge the results of your analysis.
The whole process is not so common in ML. But in your case you could, for example, train an SVM on each of the 5 datasets and afterwards compare the results or come up with a procedure to merge/combine them, as sketched below.
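A minimal sketch of that procedure, assuming scikit-learn's SVC and a binary classification task; the toy arrays below merely stand in for 5 imputed versions of the same training set, and averaging predicted probabilities is just one possible way to combine the results:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in for 5 imputed versions of the same training data;
# in practice each X would come from one run of the multiple-imputation step
y = np.repeat([0, 1], 10)
base = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
                  rng.normal(2.0, 0.3, size=(10, 2))])
imputed_versions = [base + rng.normal(scale=0.05, size=base.shape) for _ in range(5)]

# Analysis step: fit one SVM per imputed dataset
models = [SVC(probability=True).fit(X, y) for X in imputed_versions]

# Pooling step: average predicted probabilities across the 5 models, then threshold
X_new = np.array([[0.1, 0.0], [1.9, 2.1]])
probs = np.mean([m.predict_proba(X_new)[:, 1] for m in models], axis=0)
print((probs >= 0.5).astype(int))   # pooled class predictions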