Datatypes in machine learning - dataframe

I have a table with different datatypes. Some of my columns are:
name, time, date, number_of_files, hour_works, type_of_job
Jack, 10:24:54, 2015-02-15, 82, 20, project manager
….etc
I want to train on these features to predict type_of_job in the company using a random forest model.
My question is: should I convert the columns to specific datatypes to get good accuracy, and what about time and date? I have around 48970 rows and this is the first time I have worked with machine learning.

Yes, it is necessary to convert the data. Usually all the columns should be in numeric format:
you can extract features from time: day, hour, week and so on;
type_of_job is a categorical feature; common transformation methods are label encoding and one-hot encoding;
the same can be done with other categorical columns, like name;
if you use a linear model, the numerical features should be normalized.
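As a rough sketch of the conversions listed above, the snippet below builds numeric features from the date and time columns and label-encodes the target before fitting a random forest. The rows are made up to match the column names in the question, and the choice of hour, day of week and month as time features is only an assumption about what might be useful; the name column is left out here for brevity.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Toy rows shaped like the table in the question (values are invented).
df = pd.DataFrame({
    "name": ["Jack", "Anna", "Omar", "Lena"],
    "time": ["10:24:54", "08:05:10", "17:45:00", "09:30:30"],
    "date": ["2015-02-15", "2015-02-16", "2015-03-01", "2015-03-02"],
    "number_of_files": [82, 15, 40, 12],
    "hour_works": [20, 35, 28, 40],
    "type_of_job": ["project manager", "developer",
                    "project manager", "developer"],
})

# Combine date and time into one timestamp, then pull out numeric parts.
ts = pd.to_datetime(df["date"] + " " + df["time"])
features = pd.DataFrame({
    "hour": ts.dt.hour,
    "day_of_week": ts.dt.dayofweek,  # Monday=0 ... Sunday=6
    "month": ts.dt.month,
    "number_of_files": df["number_of_files"],
    "hour_works": df["hour_works"],
})

# Label-encode the target so the classifier sees integer classes.
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(df["type_of_job"])

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(features, y)

# Map predicted integer classes back to the original job names.
predicted_jobs = target_encoder.inverse_transform(model.predict(features))
```

With the real data, the model would of course be evaluated on a held-out split instead of the rows it was trained on.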

Related

Stratified splits constraint in XGBoost

I'm using a large dataset spanning many years to cross-validate hyperparameters for an XGBoost model. This data can look different in different years, so to reduce generalization error I would like to disallow the model from making any splits that are imbalanced with respect to years, i.e. don't let it split on year. For example, adding a constraint that all splits must contain at least n samples from each year, or adding a penalty on how far the ratio of each year's data in the split differs from 1/2. I don't have the timestamp as a feature, but there are other features that would allow the model to split on year in effect. I don't see anything in the documentation that covers this use case, but I was wondering if there might be some trick (e.g. with monotonicity constraints) that could work.

How to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way that the model can train on.
I want the pandas dataframe to look something like this
Where each tournament has team members constantly shifting teams.
Based on the teammates given as input, the model makes a prediction of the team's position. Does anyone have suggestions on how I can make a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
As for how to create this sheet, you can easily collect the data and store it in the format you described above. The trick is in how to use it as training data for your model: it has to be converted to numerical form before any model can train on it.
Since the maximum team size is 3 in most cases, we can split the three names into three columns (leave a column blank if there are fewer than 3 members in the team). We can then use either label encoding or one-hot encoding to convert the names to numbers. You should create a combined list of all three columns to fit a LabelEncoder and then use the transform function on each column individually (since the same names might appear in any of these 3 columns).
With label encoding, we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality as there will be many names, so I would prefer not to use it for an initial simple model.
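A minimal sketch of the shared-encoder idea described above, assuming a hypothetical layout with one row per team and columns named member_1, member_2 and member_3 (these names, the "<none>" placeholder for empty slots, and the sample rows are illustrative, not from the question):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical layout: one row per team per tournament, up to three members.
teams = pd.DataFrame({
    "member_1": ["alice", "bob", "carol"],
    "member_2": ["bob", "dave", "alice"],
    "member_3": ["carol", None, "dave"],
    "position": [1, 2, 1],
})
member_cols = ["member_1", "member_2", "member_3"]

# Replace empty slots with a placeholder so they get their own code.
teams[member_cols] = teams[member_cols].fillna("<none>")

# Fit ONE encoder on the names from all three columns combined, so a player
# gets the same integer no matter which column they appear in.
encoder = LabelEncoder()
encoder.fit(pd.concat([teams[col] for col in member_cols]))

# Transform each column individually with the shared encoder.
for col in member_cols:
    teams[col + "_id"] = encoder.transform(teams[col])
```

The resulting member_*_id columns, together with position as the target, could then be passed to a tree-based model.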

How to Set the Same Categorical Codes to Train and Test data? Python-Pandas

NOTE:
If anyone else is wondering about this topic, I assume you are getting deeper into the data analysis world. I asked this question before I had learned the following:
You encode categorical values as INTEGERS only if you're dealing with ordinal classes, e.g. college degree or customer satisfaction surveys.
Otherwise, if you're dealing with nominal classes like gender, colors or names, you MUST convert them with other methods, since they do not imply any numerical order; the best known are one-hot encoding or dummy variables.
I encourage you to read more about them and hope this has been useful.
Check the link below to see a nice explanation:
https://www.youtube.com/watch?v=9yl6-HEY7_s
This may be a simple question but I think it can be useful for beginners.
I need to run a prediction model on a test dataset, so to convert the categorical variables into categorical codes that can be handled by the random forests model I use these lines with all of them:
Train:
data_['Col1_CAT'] = data_['Col1'].astype('category')
data_['Col1_CAT'] = data_['Col1_CAT'].cat.codes
So, before running the model I have to apply the same procedure to both the train and the test data.
Since both datasets have the same categorical variables/columns, I think it would be useful to apply the same categorical codes to each column respectively.
However, although I'm handling the same variables in each dataset, I get different codes every time I use these two lines.
So, my question is: how can I get the same codes every time I convert the same categoricals in each dataset?
Thanks for your insights and feedback.
Usually, I do the categorical conversions before the train-test split so that I get a neatly transformed dataset. I then do the train-test split, train the model, and test it on the test set.
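Encoding before the split works when both sets are available up front. When the test set arrives separately, one alternative (an assumption on my part, not something the answer above proposes) is to fix the category list from the training data and reuse it for the test data. The column name Col1 comes from the question; the sample values are made up.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

train = pd.DataFrame({"Col1": ["red", "green", "blue", "green"]})
test = pd.DataFrame({"Col1": ["green", "red", "purple"]})

# Fix the category list once, from the training data only.
col1_dtype = CategoricalDtype(categories=sorted(train["Col1"].unique()))

# Casting both frames to the SAME dtype gives the same value -> code mapping.
train["Col1_CAT"] = train["Col1"].astype(col1_dtype).cat.codes
test["Col1_CAT"] = test["Col1"].astype(col1_dtype).cat.codes

# Values never seen in training (here "purple") become -1 in the test codes.
```

The codes differ between datasets in the original two lines because each call to astype('category') infers its own category list from whatever values happen to be present.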

How can I combine two time-series datasets with different time-steps?

I want to train a multivariate LSTM model using data from 2 datasets, MIMIC-1.0 and MIMIC-3. The problem is that the vital signs in the first dataset are recorded minute by minute, while in MIMIC-III the data is recorded hourly, so the two datasets have different recording intervals.
I want to predict diagnosis from the vital signs by giving streams/sequences of vital signs to my model every 5 minutes. How can I merge both data sets for my model?
You need to find a common field on which you can do a merge, e.g. patient_ids or the like. You can do the same with ICU episode identifiers. It's been a while since I worked on the MIMIC dataset, so I can't recall exactly what those fields were.
Dataset | Granularity | Subsampling for 5-minutely
MIMIC-I | Minutely | Subsample every 5th reading
MIMIC-III | Hourly | Interpolate the 11 5-minutely readings between each pair of consecutive hourly readings
The interpolation method you choose to get the between-hour readings could be as simple as forward-filling the last value. If the readings are more volatile, a more complex method may be appropriate.
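A small sketch of bringing both granularities onto a common 5-minute grid with pandas. The series below are synthetic stand-ins, not MIMIC data, and the heart_rate name is hypothetical; linear interpolation is shown only as one possible alternative to forward-filling.

```python
import numpy as np
import pandas as pd

# Synthetic minutely series (2 hours of readings).
minute_index = pd.date_range("2020-01-01 00:00", periods=121, freq="1min")
minutely = pd.Series(np.arange(121, dtype=float),
                     index=minute_index, name="heart_rate")

# Synthetic hourly series covering the same 2 hours.
hour_index = pd.date_range("2020-01-01 00:00", periods=3, freq="60min")
hourly = pd.Series([60.0, 72.0, 66.0], index=hour_index, name="heart_rate")

# Minutely -> 5-minutely: keep every 5th reading.
minutely_5min = minutely.iloc[::5]

# Hourly -> 5-minutely: upsample, then fill the new in-between slots.
hourly_ffill = hourly.resample("5min").ffill()                         # carry last value forward
hourly_linear = hourly.resample("5min").interpolate(method="linear")  # straight line between hours
```

Both results share the same 5-minute timestamps, so they can be fed to the model as sequences of equal step size.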

Feature scaling implemented in DataFrame modelling

I have a dataset with 15 columns, with the scenario below:
9 columns are categorical, so I have converted them with a one-hot encoder;
6 columns are numeric. Of these 6, 3 columns have outliers since the column values are in different ranges, so I chose RobustScaler() to scale them, and for the others I chose StandardScaler.
After that I combined all the data frames and applied logistic regression. My model produced a very low score, even though I got a good score without scaling.
Can anyone help with this?
Please apply column standardization to the data frame and see the output. I guess that since logistic regression is sensitive to outliers, you are facing this problem.
Impute the outliers properly and then apply column standardization.
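One way to apply a different transform to each group of columns while keeping the rows aligned is scikit-learn's ColumnTransformer inside a Pipeline, which avoids scaling separate data frames and concatenating them by hand. This is only a sketch on synthetic data; the column names and the toy target are invented for illustration, and it is not a diagnosis of the low score in the question.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler

# Synthetic data: one categorical column, one numeric column with a heavy
# tail (outliers), and one well-behaved numeric column.
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "cat_a": rng.choice(["x", "y", "z"], size=n),
    "num_outlier": rng.lognormal(mean=0.0, sigma=1.5, size=n),
    "num_regular": rng.normal(loc=50.0, scale=5.0, size=n),
})
y = (X["num_regular"] > 50.0).astype(int)  # toy target for illustration

# Each transformer is applied only to its own columns; outputs are stacked
# side by side in a fixed order, row for row.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["cat_a"]),
    ("robust", RobustScaler(), ["num_outlier"]),
    ("standard", StandardScaler(), ["num_regular"]),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X, y)
train_accuracy = clf.score(X, y)
```

Because the scalers live inside the pipeline, they are fitted on the training data only and reused unchanged when predicting on new data.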