Processing column with letters before feeding into a NN - pandas

I wanted to implement a classification algorithm using NN but some columns have complex alphanumeric strings, so I just chose only the simpler columns to check. Here is an example with few elements of the columns I chose...
Few Elements of the COL
As you can see these columns have A,G,C or T..etc. Some had combinations of the 4 letters but I removed it for now. My plan was to map each of these letters to values like 1,2,3 and 4 and then feed them to the NN.
Is this mapping acceptable for feeding into a dense NN?? Or is there any better method for doing this

I would not map it to integers like 1, 2, 3 etc because you are mistakenly giving them a certain order or rank which the NN may capture as important, although this ranking does not truly exist.
If you do not have high cardinality (many unique values) then you can apply One-Hot Encoding. If the cardinality is high, then you should use other encoding techniques, otherwise one-hot encoder will introduce a lot of dimensionality to your data and sparsity, which are not welcomed. You can find here some other interesting methods to encode categorical variables.

Related

how to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train off of it.
I want the pandas dataframe to look something like this
Where each tournament has team members constantly shifting teams.
And based on the inputted teammates, the model makes a prediction on the team's position. Anyone have any suggestions on how I can make a pandas dataframe like this that a model can use as trainnig data? I'm completely stumped. Thanks in advance!
Coming on to the question as to how to create this sheet, you can easily get the data and store in the format you described above. The trick is in how to use it as training data to your model. We need to convert it in numerical form to be able to be used as training data to any model. As we know that the max team size is 3 in most cases, we can divide the three names in three columns (keep the column blank, if there are less than 3 members in the team). Now we can either use Label encoding or One-hot encoding to convert the names to numbers. You should create a combined list of all three columns to fit a LabelEncoder and then use transform function individually on each column (since the names might be shared in these 3 columns). On label encoding, we can easily use tree based models. One-hot encoding might lead to curse of dimensionality as there will be many names, so I would prefer not to use it for an initial simple model.

one_hot Vs Tokenizer for Word representation

I have seen in many blogs , people using one_hot (from tf.keras.preprocessing.text.one_hot ) to convert the string of words into array of numbers which represent indices. This does not ensure unicity. Whereas Tokenizer class ensures unicity (tf.keras.preprocessing.text.Tokenizer ).
Then why is one_hot prefered over tokenizer?
Update: I got to know that hashing is used in One_hot to convert words into numbers but didn't get its importance as we can use the tokenizer class to do the same thing with more accuracy.
Not sure what you mean by uncity. I expect it has to do with the sequential relationship between the words. That of course is lost with ine hot encoding. However one-hot encoding is used when the number of words is limited. If say you have 10 words in the vocubulary you will create 10 new features which is fine for most neural networks to process. If you have other features in your data set beside the word sequences say numeric ordinal parameters you can still create a single input model. However if you have 10,000 words in the vocabulary you would create 10,000 new features which at best will take a lot to process. So in the case of a large vocabularly it is best to use "dense" encoding" versus the sparse encoding generated by one hot encoding. You can use the results of the tokenizer encoding to serve as input to a keras embedding layer which will encode the words into an n dimensional space where N is a value you specify. If you have additional ordinal features then to process the data your model will need multiple inputs. Perhaps that is why some people prefer to one hot encode the words.

Dealing with both categorical and numerical variables in a Multiple Linear Regression Python

So I have already performed a multiple linear regression in Python using LinearRegression from sklearn.
My independant variables were all numerical (and so was my dependant one)
But now I'd like to perform a multiple linear regression combining numerical and non numerical independant variables.
Therefore I have several questions:
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
If yes, do I have to change some parameters?
If not, how should I perform the Linear Regression?
One thing that bother me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because it shouldn't be encoded the same way in my opinion)
Problem is: Even if I want to encode diffently nominal and ordinal variables,
it seems impossible for Python to tell the difference between both of them?
This stuff might be easy for you but right now as you could tell I'm a little confused so I could really use your help !
Thanks in advance,
Alex
If I use dummy variables or One-Hot for the non-numerical ones, will I then be able to perform the LinearRegression from sklearn?
In fact the model has to be fed exclusively with numerical data, thus you must use OneHot vectors for the categorical data in your input features. For that you can take a look at Scikit-Learn's LabelEncoder and OneHotEncoder.
One thing that bother me is that dummy/one-hot methods don't deal with ordinal variables, right? (Because it shouldn't be encoded the same way in my opinion)
Yes. As you mention one-hot methods don't deal with ordinal variables. One way to work with ordinal features is to create a scale map, and map those features to that scale. Ordinal is a very useful tool for these cases. You can feed it a mapping dictionary according to a predifined scale mapping as mentioned. Otherwise, obviously it randomly assigns integers to the different categories as it has no knowledge to infer any order. From the documentation:
Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in, in this case we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.
Hope this helps.

Using embedded columns

I'm trying to understand the TensorFlow tutorial on wide and deep learning. The demo application creates indicator columns for categorical features with few categories (gender, education), and it creates embedded columns for categorical features with many categories (native_country, occupation).
I don't understand embedded columns. Is there a rule that clarifies when to use embedded columns instead of indicator columns? According to the documentation, the dimension parameter sets the dimension of the embedding. What does that mean?
From the feature columns tutorial:
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
The dimension parameter is the length of the vector you're reducing the categories to.

How to train with inputs of variable size?

This question is rather abstract and not necessarily tied to tensorflow or keras. Say that you want to train a language model, and you want to use inputs of different sizes for your LSTMs. Particularly, I'm following this paper: https://www.researchgate.net/publication/317379370_A_Neural_Language_Model_for_Query_Auto-Completion.
The authors use, among other things, word embeddings and one-hot encoding of characters. Most likely, the dimensions of each of these inputs are different. Now, to feed that into a network, I see a few alternatives but I'm sure I'm missing something and I would like to know how it should be done.
Create a 3D tensor of shape (instances, 2, max(embeddings,characters)). That is, padding the smaller input with 0s.
Create a 3D tensor of shape (instances, embeddings+characters, 1)). That is, concatenating inputs.
It looks to me that both alternatives are bad for efficiently training the model. So, what's the best way to approach this? I see the authors use an embedding layer for this purpose, but technically, what does that mean?
EDIT
Here are more details. Let's call these inputs X (character-level input) and E (word-level input). On each character of a sequence (a text), I compute x, e and y, the label.
x: character one-hot encoding. My character index is of size 38, so this is a vector filled with 37 zeros and one 1.
e: precomputed word embedding of dimension 200. If the character is a space, I fetch the word embedding of the previous word in the sequence, Otherwise, I assign the vector for incomplete word (INC, also of size 200). Real example with the sequence "red car": r>INC, e>INC, d>INC, _>embeddings["red"], c>INC, a>INC, r>INC.
y: the label to be predicted, which is the next character, one-hot encoded. This output is of the same dimension as x because it uses the same character index. In the example above, for "r", y is the one-hot encoding of "e".
According to keras documentation, the padding idea seems to be the one. There is the masking parameter in the embedding layer, that will make keras skip these values instead of processing them. In theory, you don't lose that much performance. If the library is well built, the skipping is actually skipping extra processing.
You just need to take care not to attribute the value zero to any other character, not even spaces or unknown words.
An embedding layer is not only for masking (masking is just an option in an embedding layer).
The embedding layer transforms integer values from a word/character dictionary into actual vectors of a certain shape.
Suppose you have this dictionary:
1: hey
2: ,
3: I'm
4: here
5: not
And you form sentences like
[1,2,3,4,0] -> this is "hey, I'm here"
[1,2,3,5,4] -> this is "hey, I'm not here"
[1,2,1,2,1] -> this is "hey, hey, hey"
The embedding layer will tranform each of those integers into vectors of a certain size. This does two good things at the same time:
Transforms the words in vectors because neural networks can only handle vectors or intensities. A list of indices cannot be processed by a neural network directly, there is no logical relation between indices and words
Creates a vector that will be a "meaningful" set of features for each word.
And after training, they become "meaningful" vectors. Each element starts to represent a certain feature of the word, although that feature is obscure to humans. It's possible that an embedding be capable of detecting words that are verbs, nouns, feminine, masculine, etc, everything encoded in a combination of numeric values (presence/abscence/intensity of features).
You may also try the approach in this question, which instead of using masking, needs to separate batches by length, so each batch can be trained at a time without needing to pad them: Keras misinterprets training data shape