How to use Pandas get_dummies on predict data? - one-hot-encoding

After using Pandas get_dummies on 3 categorical columns to get a one-hot-encoded DataFrame, I've trained (with some success) a Perceptron model.
Now I would like to predict the result for a new observation that is not one-hot-encoded.
Is there any way to record the get_dummies column mapping so it can be re-used?

There is no automatic procedure to do this at the moment, to my knowledge. In a future release of sklearn, CategoricalEncoder will be very handy for this job. You can already get your hands on it if you clone the sklearn GitHub master branch and build it yourself. At the moment, two options come to mind:
use a LabelEncoder + OneHotEncoder combination, see this answer, for example;
simply retrieve (and store, if needed) the list of columns in the training OHE output. Then run pd.get_dummies on the test set/example, drop any test OHE columns that do not appear in the training OHE, and add any training columns that are missing from the test OHE, filled with zeros, as sketched below.
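A minimal sketch of the second option (train_df, new_df, and the column names are placeholders, not from the question); pandas' reindex does the drop-and-fill in one step:
import pandas as pd

# during training: remember the one-hot column layout
train_ohe = pd.get_dummies(train_df[['col1', 'col2', 'col3']])
train_columns = list(train_ohe.columns)

# at prediction time: encode, then align to the training layout;
# reindex drops unseen dummy columns and adds missing ones as zeros
test_ohe = pd.get_dummies(new_df[['col1', 'col2', 'col3']])
test_ohe = test_ohe.reindex(columns=train_columns, fill_value=0)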

Related

Predict a nonlinear array based on 2 features with scalar values using XGBoost or equivalent

So I have been looking at XGBoost as a place to start with this; however, I am not sure of the best way to accomplish what I want.
My data is set up something like this, where every value, whether input or output, is numerical. The issue I'm facing is that I only have 3 input data points per several output data points.
I have seen that XGBoost has a multi-output regression method; however, I have only really seen it used to predict around 2 outputs per input, whereas my data may have upwards of 50 outputs that need to be predicted from only a handful of scalar input features.
I'd appreciate any ideas you may have.
For reference, I've been looking mainly at these two demos (they are the same idea; one uses scikit-learn and the other XGBoost):
https://machinelearningmastery.com/multi-output-regression-models-with-python/
https://xgboost.readthedocs.io/en/stable/python/examples/multioutput_regression.html
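A minimal sketch of the pattern from the linked demos, scaled to many outputs: wrap a single-output XGBRegressor in scikit-learn's MultiOutputRegressor, which fits one boosted model per target column (the shapes below are made-up stand-ins for the real data):
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# toy data: 3 scalar input features, 50 output targets per sample
X = np.random.rand(200, 3)
Y = np.random.rand(200, 50)

# one regressor is trained independently for each of the 50 outputs
model = MultiOutputRegressor(XGBRegressor(n_estimators=100))
model.fit(X, Y)
preds = model.predict(X)  # shape (200, 50)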

How to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas DataFrame to look something like this,
where each tournament has team members constantly shifting teams.
Based on the inputted teammates, the model should predict the team's position. Does anyone have suggestions on how I can make a pandas DataFrame like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Coming to the question of how to create this sheet: you can easily get the data and store it in the format you described above. The trick is in how to use it as training data for your model; it has to be converted into numerical form first.
Since the max team size is 3 in most cases, we can split the three names into three columns (leaving a column blank if there are fewer than 3 members in the team). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call its transform function on each column individually, since the same names may appear in any of the 3 columns (see the sketch below).
With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, as there will be many names, so I would prefer not to use it for an initial simple model.
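A minimal sketch of that encoding step (the DataFrame and its column names are made up for illustration):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical data: one row per team, up to 3 members, blank if fewer
df = pd.DataFrame({
    'member1': ['alice', 'bob', 'carol'],
    'member2': ['dave', 'alice', 'bob'],
    'member3': ['erin', '', 'dave'],
})
member_cols = ['member1', 'member2', 'member3']

# fit one encoder on the combined pool of names so each player gets
# the same code no matter which column they appear in
encoder = LabelEncoder()
encoder.fit(pd.unique(df[member_cols].values.ravel()))

for col in member_cols:
    df[col] = encoder.transform(df[col])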

Should I join features and targets dataframes for use with scikit-learn?

I am trying to create a regression model to predict deliverables (dataframe 2) using design parameters (dataframe 1). Both dataframes have an ID number that I used as an index.
Is it possible to use two dataframes to create a dataset for sklearn, or do I need to join them? If I need to join them, what would be the best way?
import pandas as pd

# import data
df1 = pd.read_excel('data.xlsx', sheet_name='Data1', index_col='Unnamed: 0')
df2 = pd.read_excel('data.xlsx', sheet_name='Data2', index_col='Unnamed: 0')
I have only used sklearn on a single dataframe that had all of the columns for the feature and target vectors in it, so I am not sure how to handle the case where one dataframe has the features and the other has the targets.
All estimators in scikit-learn have a signature like estimator.fit(X, y), with X being the training features and y the training targets.
Then, prediction is achieved by calling some kind of estimator.predict(X_test), with X_test being the test features.
Even train_test_split takes two arrays X and y as parameters.
This means that, as long as you keep the rows in the same order, nothing requires you to merge features and targets (see the sketch below).
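For instance, a minimal sketch with the two dataframes from the question passed in directly; the estimator choice here is arbitrary:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# df1 holds the features, df2 the targets; train_test_split keeps
# the row correspondence between the two
X_train, X_test, y_train, y_test = train_test_split(df1, df2, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)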
Completely agree with Guillaume's answer.
Just be aware, as he said, of the row order; that's the key to your problem. If both dataframes are in the same order, you don't need to merge them and you can fit the model directly.
But if they are not in the same order, you have to combine both dataframes (similar to a left join in SQL) in order to relate the features and targets of each ID. You can do it like this (more information here):
df_final = pd.concat([df1, df2], axis=1)
Since you used the ID as the index, pd.concat with axis=1 aligns rows on it, so it should work properly. Be aware that NaN values may appear if some ID appears in one dataframe but not in the other; you will have to handle them.

How to Set the Same Categorical Codes to Train and Test data? Python-Pandas

NOTE:
If someone else is wondering about this topic: I understand you're getting deeper into the Data Analysis world, and I asked this question before learning that:
You encode categorical values as integers only if you're dealing with ordinal classes, e.g. college degree or customer satisfaction surveys.
Otherwise, if you're dealing with nominal classes like gender, colors, or names, you MUST convert them with other methods, since they do not imply any numerical order; the best-known are one-hot encoding and dummy variables.
I encourage you to read more about them and hope this has been useful.
Check the link below for a nice explanation:
https://www.youtube.com/watch?v=9yl6-HEY7_s
This may be a simple question but I think it can be useful for beginners.
I need to run a prediction model on a test dataset, so to convert the categorical variables into categorical codes that the random forest model can handle, I use these lines on all of them:
Train:
data_['Col1_CAT'] = data_['Col1'].astype('category')
data_['Col1_CAT'] = data_['Col1_CAT'].cat.codes
So, before running the model I have to apply the same procedure to both the train and the test data.
And since both datasets have the same categorical variables/columns, it would be useful to apply the same categorical codes to each column in both.
However, although I'm handling the same variables on each dataset, I get different codes every time I use these two lines.
So, my question is: how can I get the same codes every time I convert the same categoricals on each dataset?
Thanks for your insights and feedback.
Usually, the way I do this is to do the categorical conversions before the train-test split, so that I get one consistently transformed dataset. Once I do that, I do the train-test split, train the model, and test it on the test set.
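If the split already exists, another option (not from the answer above, but a standard pandas pattern) is to pin the category list explicitly with a shared CategoricalDtype, so that .cat.codes comes out identical on both frames; Col1 is the placeholder column name from the question:
import pandas as pd
from pandas.api.types import CategoricalDtype

# one fixed list of categories shared by both datasets
categories = sorted(set(train['Col1']) | set(test['Col1']))
col1_type = CategoricalDtype(categories=categories)

# identical values now map to identical codes in both frames
train['Col1_CAT'] = train['Col1'].astype(col1_type).cat.codes
test['Col1_CAT'] = test['Col1'].astype(col1_type).cat.codes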

Adding statsmodels 'predict' results to a Pandas dataframe

It is common to want to append prediction results to the dataset used to make the predictions, but the statsmodels predict function returns (non-indexed) results of a potentially different length than the dataset on which the predictions are based.
For example, if the test dataset, test, contains any null entries, then
mod_fit = sm.Logit.from_formula('Y ~ A + B + C', train).fit()
preds = mod_fit.predict(test)
will produce an array that is shorter than the length of test and cannot be usefully appended with
test['preds'] = preds
And since the result of predict is not indexed, there is no way to recover the rows to which the results should be attached.
What is the idiom for associating predict results to the rows from which they were generated? Is there, perhaps, a way to get predict to return a dataframe that preserves the indices of its argument?
Predict shouldn't drop any rows. Can you post a minimal working example where this happens? Preserving the pandas index is on my radar and should be fixed in master soon.
https://github.com/statsmodels/statsmodels/issues/1501
Edit: Never mind, this is a known issue: https://github.com/statsmodels/statsmodels/issues/1352
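Until that fix lands, one workaround sketch is to predict only on the complete rows and let the index do the alignment (column names as in the question):
# keep only rows with no missing predictors, predict on those,
# and write the results back by index; incomplete rows stay NaN
complete = test.dropna(subset=['A', 'B', 'C'])
test.loc[complete.index, 'preds'] = mod_fit.predict(complete)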