how to predict winner based on teammates - pandas

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas DataFrame to look something like this, where team members are constantly shifting teams from tournament to tournament. Based on the input teammates, the model should predict the team's position. Does anyone have suggestions on how I can make a pandas DataFrame like this that a model can use as training data? I'm completely stumped. Thanks in advance!

As for how to create this sheet: you can easily collect the data and store it in the format you described above. The trick is in how to use it as training data for your model, since it has to be converted to numerical form before any model can consume it.

Since the maximum team size is 3 in most cases, we can split the names into three columns (leaving a column blank if the team has fewer than 3 members). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call transform on each column individually, since the same names can appear in any of the 3 columns.

With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, since there will be many names, so I would prefer not to use it for an initial simple model.
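A minimal sketch of that encoding step (the column and player names here are hypothetical):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical layout: up to three members per team, blank if fewer,
# plus the team's finishing position as the target.
df = pd.DataFrame({
    'member1': ['alice', 'bob', 'carol'],
    'member2': ['dave', 'erin', 'alice'],
    'member3': ['', 'frank', ''],
    'position': [1, 3, 2],
})

# Fit one encoder on the union of all three columns so a player gets
# the same code no matter which member slot they occupy.
all_names = pd.concat([df['member1'], df['member2'], df['member3']])
le = LabelEncoder().fit(all_names)
for col in ['member1', 'member2', 'member3']:
    df[col] = le.transform(df[col])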

Related

How to Set the Same Categorical Codes to Train and Test data? Python-Pandas

NOTE:
If someone else is wondering about this topic: I understand you're getting deeper into the Data Analysis world, and I asked this question before learning that:
You encode categorical values as INTEGERS only if you're dealing with ordinal classes, e.g. college degree or customer satisfaction surveys.
Otherwise, if you're dealing with nominal classes like gender, colors, or names, you MUST convert them with other methods, since they do not imply any numerical order; the best known are one-hot encoding and dummy variables (see the short sketch after the link below).
I encourage you to read more about them and hope this has been useful.
Check the link below to see a nice explanation:
https://www.youtube.com/watch?v=9yl6-HEY7_s
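For instance, a quick sketch of dummy variables in pandas (the column and values are made up):

import pandas as pd

# Nominal column: the values have no numerical order.
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})

# One 0/1 indicator column per category.
print(pd.get_dummies(df, columns=['color']))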
This may be a simple question, but I think it can be useful for beginners.
I need to run a prediction model on a test dataset, so to convert the categorical variables into categorical codes that the random forest model can handle, I use these two lines on each of them:
Train:
# Cast to pandas' category dtype, then store the integer codes
data_['Col1_CAT'] = data_['Col1'].astype('category')
data_['Col1_CAT'] = data_['Col1_CAT'].cat.codes
So, before running the model I have to apply the same procedure to both the train and test data.
And since both datasets have the same categorical variables/columns, I think it would be useful to apply the same categorical codes to each column respectively.
However, although I'm handling the same variables on each dataset, I get different codes every time I run these two lines.
So, my question is: how can I get the same codes every time I convert the same categoricals on each dataset?
Thanks for your insights and feedback.
Usually, how I do this is to do the categorical conversions before the train-test split, so that I get one consistently transformed dataset. Once that's done, I do the train-test split, train the model, and evaluate it on the test set.
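A minimal sketch of that order of operations (the dataframe contents are hypothetical):

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with one categorical column.
data_ = pd.DataFrame({'Col1': ['a', 'b', 'c', 'a', 'b', 'c'],
                      'target': [0, 1, 0, 1, 0, 1]})

# Encode once, on the full dataset, so train and test share the same codes...
data_['Col1_CAT'] = data_['Col1'].astype('category').cat.codes

# ...then split.
train, test = train_test_split(data_, test_size=0.33, random_state=42)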

Number of distinct labels and input data shape in tf.data Dataset

The TensorFlow Fashion-MNIST tutorial is great... but it seems clear you have to know in advance that there are 10 distinct labels in the dataset, and that the input data is image data of size 28x28. I would have thought these details should be readily discoverable from the dataset itself - is this possible? Could I discover the same information the same way on a quite different dataset (e.g. the Titanic dataset, which comprises M rows by N columns of CSV data and is a binary classification task)? tf.data.Dataset does not appear to have any obvious get_label_count() or get_input_shape() functions in its API. Call me a newbie, but this surprises/confuses me.
According to the accepted answer to this question, TensorFlow tf.data.Dataset instances are lazily evaluated, meaning that you could, in principle, need to iterate through an entire dataset to establish the number of distinct labels and the input data shape(s) (which can be variable, for example with variable-length sequences of sound or text).
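A short sketch of what is and isn't discoverable up front, assuming TF 2.x and the Fashion-MNIST data from the tutorial:

import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.fashion_mnist.load_data()
ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))

# The per-element shape and dtype are available without iterating:
print(ds.element_spec)  # (TensorSpec(shape=(28, 28), ...), TensorSpec(shape=(), ...))

# The number of distinct labels is not; you have to scan the dataset:
labels = {int(y) for _, y in ds.as_numpy_iterator()}
print(len(labels))  # 10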

Confusion about how bucketized feature columns work

I had some confusion about how bucketized feature columns represent input to the model. According to the blog post on feature columns, bucketizing a feature like year puts each value in a bucket based on the defined boundaries and creates a binary vector, turning on a bucket according to the input value, but the example in the documentation shows the output as a single integer. I'm confused about what the input to the model looks like when using a bucketized column. Can anyone clarify this for me please?
From the dimensions of the first hidden layer of the estimator, it seems that for each feature column that is a tf.feature_column.bucketized_column, a one-hot encoded vector is created based on the boundaries.
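Here is a small sketch that makes this visible, assuming the (now legacy) tf.feature_column API from TF 2.x:

import tensorflow as tf

year = tf.feature_column.numeric_column('year')
year_buckets = tf.feature_column.bucketized_column(
    year, boundaries=[1960, 1980, 2000])

# DenseFeatures materializes the column: each year becomes a one-hot
# vector over len(boundaries) + 1 = 4 buckets.
layer = tf.keras.layers.DenseFeatures([year_buckets])
print(layer({'year': tf.constant([[1955], [1975], [2010]])}))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 1.]]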

A similar approach for LabelEncoder in sklearn.preprocessing?

For encoding categorical data like sex we normally use LabelEncoder() in scikit-learn. But if I'm going to use TensorFlow instead of scikit-learn, what is the equivalent function or methodology for this task? I know that we can do one-hot encoding easily with TensorFlow, but then it will create labels as [1, 0] and [0, 1] instead of 1 and 0.
There is a module in TensorFlow called tf.feature_column that contains four methods to create categorical columns from your input data:
categorical_column_with_hash_bucket(...): Hash the input value to a fixed number of categories
categorical_column_with_identity(...): If you have numeric input and you want the value itself to be treated as a categorical column
categorical_column_with_vocabulary_list(...): Outputs a category based on a fixed, in-memory list of words
categorical_column_with_vocabulary_file(...): Same as _list but reads the vocabulary from a file
The module also provides many more ways of getting your input data to the model. For an overview, see this blog post written by the developers of the package.
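For example, a minimal sketch of the vocabulary-list variant, assuming the (now legacy) tf.feature_column API; the column name and vocabulary are assumptions:

import tensorflow as tf

# Maps each string to an integer id based on its position in the list.
sex = tf.feature_column.categorical_column_with_vocabulary_list(
    'sex', vocabulary_list=['male', 'female'])

# Categorical columns must be wrapped (e.g. in an indicator_column)
# before a dense layer can consume them.
layer = tf.keras.layers.DenseFeatures([tf.feature_column.indicator_column(sex)])
print(layer({'sex': tf.constant([['male'], ['female']])}))
# [[1. 0.]
#  [0. 1.]]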

Using embedded columns

I'm trying to understand the TensorFlow tutorial on wide and deep learning. The demo application creates indicator columns for categorical features with few categories (gender, education), and it creates embedded columns for categorical features with many categories (native_country, occupation).
I don't understand embedded columns. Is there a rule that clarifies when to use embedded columns instead of indicator columns? According to the documentation, the dimension parameter sets the dimension of the embedding. What does that mean?
From the feature columns tutorial:
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
The dimension parameter is the length of the vector you're reducing the categories to.
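A small sketch of the difference, assuming the (now legacy) tf.feature_column API; the vocabulary here is hypothetical:

import tensorflow as tf

occupation = tf.feature_column.categorical_column_with_vocabulary_list(
    'occupation', ['tech', 'sales', 'exec', 'clerical'])

# indicator_column would produce a length-4 one-hot vector per row;
# embedding_column learns a dense length-2 vector per category instead.
occupation_emb = tf.feature_column.embedding_column(occupation, dimension=2)

layer = tf.keras.layers.DenseFeatures([occupation_emb])
print(layer({'occupation': tf.constant([['tech'], ['sales']])}).shape)  # (2, 2)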