Scaling of Nominal, Ordinal, Binary with Numeric Variable dataset - data-science

If the dataset is given in characters, i.e. categorical, do we need to convert it into numerical data using one-hot encoding?
My second question is: is one-hot encoding only meaningful for the nominal data type, or is it meaningful for both nominal and ordinal data types?

It is indeed required to convert categorical variables into numerical form before feeding them to a model (although some model implementations do this automatically). One-hot encoding is one way to do it, but there are many more encoders you could choose (ordinal encoding, binary encoding, hashing encoding, ...), each of which fits a different situation.
For the second question, it does not really matter whether your data is nominal or ordinal; the only thing that really matters is that your data is categorical.
That said, if your data is ordinal, the model would accept an ordinal encoding. But ordinal encoding can be bad in some situations, as it introduces a "distance notion" between your categories.
For example, if you have this encoding for means of transportation:
1 -> car
2 -> bus
3 -> metro
4 -> bike
The model would understand that bike is closer to metro than to car, which is information that you may not want to give to your model.
One-hot encoding solves this issue by putting every category at the same distance from one another.
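As a minimal sketch of that difference, here is scikit-learn's OneHotEncoder applied to the transport example above (the data is just the four categories):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

transport = np.array([["car"], ["bus"], ["metro"], ["bike"]])

# Each category becomes its own binary column, so every pair of
# categories ends up at the same distance from one another.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(transport).toarray()

print(encoder.categories_)  # [array(['bike', 'bus', 'car', 'metro'], ...)]
print(encoded)
# [[0. 0. 1. 0.]    car
#  [0. 1. 0. 0.]    bus
#  [0. 0. 0. 1.]    metro
#  [1. 0. 0. 0.]]   bike
```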

Related

how to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas dataframe to look something like this, where each tournament has team members constantly shifting teams.
And based on the inputted teammates, the model makes a prediction of the team's position. Does anyone have suggestions on how I can make a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Coming to the question of how to create this sheet: you can easily get the data and store it in the format you described above. The trick is in how to use it as training data for your model; we need to convert it into numerical form before it can be used to train any model.
Since the max team size is 3 in most cases, we can split the three names into three columns (leaving a column blank if the team has fewer than 3 members). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call its transform function on each column individually (since names may be shared across these 3 columns).
With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, as there will be many names, so I would prefer not to use it for an initial simple model.
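A minimal sketch of the combined-fit idea (the data and column names here are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data: up to three teammates per team.
df = pd.DataFrame({
    "member1": ["alice", "bob",  "carol"],
    "member2": ["dave",  "erin", "alice"],
    "member3": ["frank", "",     "bob"],  # blank when the team has < 3 members
    "position": [1, 3, 2],
})

member_cols = ["member1", "member2", "member3"]

# Fit one encoder on the combined list of names so the same player
# gets the same integer no matter which column they appear in.
le = LabelEncoder()
le.fit(pd.concat([df[c] for c in member_cols]))

# Transform each column individually with the shared encoder.
for c in member_cols:
    df[c] = le.transform(df[c])
```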

Confusion about how bucketized feature columns work

I had some confusion about how bucketized feature columns represent input to the model. According to the blog post on feature columns, bucketizing a feature like year puts each value into a bucket based on the defined boundaries and creates a binary vector, turning on the bucket that matches the input value; but the example in the documentation shows the output as a single integer. I'm confused as to what the input to the model is when using a bucketized column. Can anyone clarify this for me, please?
From the dimensions of the first hidden layer of the estimator, it seems that for each feature column that is a tf.feature_column.bucketized_column, a one-hot encoded vector is created based on the boundaries.
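A short sketch of what that looks like with the tf.feature_column API (the boundaries are chosen only for illustration):

```python
import tensorflow as tf

# A numeric source column, split into 4 buckets:
# (-inf, 1960), [1960, 1980), [1980, 2000), [2000, +inf)
year = tf.feature_column.numeric_column("year")
bucketized_year = tf.feature_column.bucketized_column(
    source_column=year, boundaries=[1960, 1980, 2000])

# When this column is fed to a DNN-style model, each example is
# represented as a one-hot vector over the buckets,
# e.g. year = 1987 -> [0, 0, 1, 0].
```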

Dealing with both categorical and numerical variables in a Multiple Linear Regression Python

So I have already performed a multiple linear regression in Python using LinearRegression from sklearn.
My independent variables were all numerical (and so was my dependent one).
But now I'd like to perform a multiple linear regression combining numerical and non-numerical independent variables.
Therefore I have several questions:
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
If yes, do I have to change some parameters?
If not, how should I perform the Linear Regression?
One thing that bothers me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because they shouldn't be encoded the same way, in my opinion.)
The problem is: even if I want to encode nominal and ordinal variables differently,
it seems impossible for Python to tell the difference between the two?
This stuff might be easy for you, but right now, as you can tell, I'm a little confused, so I could really use your help!
Thanks in advance,
Alex
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
In fact, the model has to be fed exclusively numerical data, so you must use one-hot vectors for the categorical data in your input features. For that you can take a look at scikit-learn's LabelEncoder and OneHotEncoder.
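As one way to wire this up (a sketch; the column names are placeholders), scikit-learn's ColumnTransformer can one-hot encode the categorical columns and pass the numerical ones through unchanged:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names for a mixed-type feature DataFrame.
categorical_cols = ["city", "color"]

model = Pipeline([
    ("encode", ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough")),  # numerical columns pass through unchanged
    ("regress", LinearRegression()),
])

# model.fit(X_train, y_train) then works on a DataFrame
# containing both categorical and numerical columns.
```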
One thing that bothers me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because they shouldn't be encoded the same way, in my opinion.)
Yes, as you mention, one-hot methods don't deal with ordinal variables. One way to work with ordinal features is to create a scale map and map those features to that scale. An ordinal encoder is a very useful tool for these cases: you can feed it a mapping dictionary according to a predefined scale, as mentioned. Otherwise, it simply assigns integers to the different categories at random, as it has no knowledge to infer any order. From the documentation:
Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in, in this case we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.
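For a hand-rolled version of that scale map (the column and scale below are hypothetical), a plain pandas mapping works as well:

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# Predefined scale: the integers encode the true order of the categories.
scale_map = {"S": 1, "M": 2, "L": 3}
df["size_encoded"] = df["size"].map(scale_map)
```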
Hope this helps.

A similar approach for LabelEncoder in sklearn.preprocessing?

For encoding categorical data like sex, we normally use LabelEncoder() in scikit-learn. But if I'm going to use TensorFlow instead of scikit-learn, what is the equivalent function or methodology for such a task? I know that we can do one-hot encoding easily with TensorFlow, but then it will create labels like 10 and 01 instead of 1 and 0.
There is a module in TensorFlow called tf.feature_column that contains 4 methods for creating categorical columns from your input data:
categorical_column_with_hash_bucket(...): Hash the input value to a fixed number of categories
categorical_column_with_identity(...): If you have numeric input and you want the value itself to be treated as a categorical column
categorical_column_with_vocabulary_list(...): Outputs a category based on a fixed, in-memory list of words
categorical_column_with_vocabulary_file(...): Same as _list but reads the vocabulary from file
The module also provides many more ways of getting your input data to the model. For an overview, see this blog post written by the developers of the package.
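For example, a rough equivalent of label-encoding a sex column with a known vocabulary would look like this (a sketch using the tf.feature_column API):

```python
import tensorflow as tf

# Maps "male"/"female" to the integer ids 0/1, based on their
# positions in the vocabulary list.
sex = tf.feature_column.categorical_column_with_vocabulary_list(
    key="sex", vocabulary_list=["male", "female"])

# Wrap it in an indicator_column if the model needs a dense
# (one-hot) representation instead of a sparse id.
sex_indicator = tf.feature_column.indicator_column(sex)
```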

Using embedded columns

I'm trying to understand the TensorFlow tutorial on wide and deep learning. The demo application creates indicator columns for categorical features with few categories (gender, education), and it creates embedding columns for categorical features with many categories (native_country, occupation).
I don't understand embedding columns. Is there a rule that clarifies when to use embedding columns instead of indicator columns? According to the documentation, the dimension parameter sets the dimension of the embedding. What does that mean?
From the feature columns tutorial:
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
The dimension parameter is the length of the vector you're reducing the categories to.
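A minimal sketch (the vocabulary size and dimension are chosen only for illustration):

```python
import tensorflow as tf

# A categorical column with many categories, hashed into 1000 buckets...
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    key="occupation", hash_bucket_size=1000)

# ...represented as a learned, dense 8-dimensional vector instead of
# a 1000-dimensional one-hot/indicator vector.
occupation_emb = tf.feature_column.embedding_column(occupation, dimension=8)
```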