How to prepare sparse data for building autoencoder in TF? - tensorflow

I have user-item-rating tuples. The users and items are currently both strings (either hash codes or plain text strings, think book or movie titles). The ratings are integers. I'm trying to figure out the data transformation needed to get these ratings into TF to build an autoencoder.
Let's say I have 100K possible items. My thought is that I should feed the model sparse tensors, where each mini-batch is a set of user-item ratings. Do I need to transform the item strings into integer IDs to do this? Other than that, are there any other details I should know?
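A minimal sketch of that transformation, assuming the tuples sit in a plain Python list (the variable names and example data below are made up): map each user and item string to an integer index, then build a tf.sparse.SparseTensor whose rows are users and whose columns are items.

import tensorflow as tf

# Hypothetical example data: (user, item, rating) tuples keyed by strings
ratings = [("u1", "book_a", 5), ("u1", "book_b", 3), ("u2", "book_a", 4)]

# Map the user and item strings to integer ids (row and column indices)
users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})
user_to_id = {u: n for n, u in enumerate(users)}
item_to_id = {i: n for n, i in enumerate(items)}

# One row per user, one column per item; only the observed ratings are stored
sparse = tf.sparse.SparseTensor(
    indices=[[user_to_id[u], item_to_id[i]] for u, i, _ in ratings],
    values=[float(r) for _, _, r in ratings],
    dense_shape=[len(users), len(items)])
sparse = tf.sparse.reorder(sparse)  # many sparse ops expect sorted indices

A mini-batch is then just a slice of rows from this matrix, and the 100K-item vocabulary becomes the width of dense_shape.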

Related

How do I initialize weights to correspond with amount of inputs

I'm new to neural networks and I'm wondering how to initialize my weights so that there is one per input. I could obviously do it manually (w1, w2, ... w30), but I was wondering if there is a quicker way to do this, with the weights corresponding to the inputs, using just NumPy.
You can use NumPy's random module like this:
np.random.rand(3, 2)
This creates an array of random values with 3 rows and 2 columns.
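A minimal sketch of the one-call version, where n_inputs and n_hidden are placeholder names for your own sizes:

import numpy as np

n_inputs = 30   # number of input features
n_hidden = 10   # number of units in the next layer

# One weight per (input, unit) pair, drawn uniformly from [0, 1)
w = np.random.rand(n_inputs, n_hidden)

For a single neuron, np.random.rand(n_inputs) gives one weight per input.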

How to predict winner based on teammates

I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas dataframe to look something like this, where each tournament has team members constantly shifting between teams.
And based on the inputted teammates, the model makes a prediction of the team's position. Does anyone have suggestions on how I can build a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
As for how to create this sheet, you can easily collect the data and store it in the format you described above. The trick is in how to use it as training data for your model: it has to be converted to numerical form before any model can consume it. Since the maximum team size is 3 in most cases, we can split the three names into three columns (leaving a column blank if the team has fewer than 3 members). Then we can use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call its transform function on each column individually (since the same names may appear in any of the 3 columns). With label encoding, tree-based models work well. One-hot encoding could lead to the curse of dimensionality, since there will be many names, so I would avoid it for an initial simple model.
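A minimal sketch of that encoding step, assuming a hypothetical dataframe with columns member1, member2, member3 and position:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training sheet: up to three member names plus the team's finishing position
df = pd.DataFrame({
    "member1": ["alice", "bob", "carol"],
    "member2": ["dave", "alice", "bob"],
    "member3": ["erin", None, "dave"],
    "position": [1, 3, 2],
})

member_cols = ["member1", "member2", "member3"]
df[member_cols] = df[member_cols].fillna("missing")  # blank slot for 2-member teams

# Fit one encoder on the pooled names so a player gets the same code in every column
encoder = LabelEncoder()
encoder.fit(pd.concat([df[c] for c in member_cols]))
for c in member_cols:
    df[c] = encoder.transform(df[c])

X, y = df[member_cols], df["position"]  # ready for a tree-based model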

Number of distinct labels and input data shape in tf.data Dataset

The Tensorflow Fashion-MNIST tutorial is great... but it seems clear you have to know in advance that there are 10 distinct labels in the dataset, and that the input data is image data of size 28x28. I would have thought these details should be readily discoverable from the dataset itself - is this possible? Could I discover the same information in the same way on a quite different dataset (e.g. the Titanic Dataset, which comprises M rows by N columns of CSV data, and is a binary classification task)? tf.data.Dataset does not appear to have any obvious get_label_count() or get_input_shape() functions in its API. Call me a newbie, but this surprises/confuses me.
According to the accepted answer to this question, Tensorflow tf.data.Dataset instances are lazily evaluated, meaning that you may, in principle, need to iterate through the entire dataset to establish the number of distinct labels and the input data shape(s) (which can be variable, for example with variable-length sequences of sound or text).
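For example (a sketch using a synthetic stand-in for Fashion-MNIST; real shapes and label counts depend on your data), the per-element shape is available up front via element_spec, but counting distinct labels takes a full pass:

import tensorflow as tf

# 100 fake 28x28 "images" with integer labels in [0, 10)
ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([100, 28, 28]),
     tf.random.uniform([100], maxval=10, dtype=tf.int32)))

print(ds.element_spec)             # shapes and dtypes of one element, no iteration needed

labels = {int(y) for _, y in ds}   # distinct labels: requires scanning the whole dataset
print(len(labels))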

Using embedded columns

I'm trying to understand the TensorFlow tutorial on wide and deep learning. The demo application creates indicator columns for categorical features with few categories (gender, education), and it creates embedded columns for categorical features with many categories (native_country, occupation).
I don't understand embedded columns. Is there a rule that clarifies when to use embedded columns instead of indicator columns? According to the documentation, the dimension parameter sets the dimension of the embedding. What does that mean?
From the feature columns tutorial:
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
The dimension parameter is the length of the vector you're reducing the categories to.
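For example, with the feature-column API the tutorial uses (the column name and sizes here are placeholders), dimension=8 means each occupation string is mapped to a learned 8-element vector instead of a 1000-wide one-hot vector:

import tensorflow as tf

# A categorical column with many possible string values
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    "occupation", hash_bucket_size=1000)

# Indicator column: one-hot, 1000 cells per example
occupation_one_hot = tf.feature_column.indicator_column(occupation)

# Embedding column: dense, only 8 cells per example, values learned during training
occupation_embedded = tf.feature_column.embedding_column(occupation, dimension=8)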

How to load categorical attributes in scikit-learn?

I want to create a Bayes model in scikit-learn to predict box office openings for movies.
I'm getting started with scikit-learn, and I've found many examples of how to load CSV and other tabular data, but I haven't found examples of how to load attributes that hold a collection of values, e.g.:
Movie 1: Actors: [Actor 1, Actor 2, Actor 3...], etc.
Can anyone give me a hint?
DictVectorizer is the preferred way of handling categorical data that is not already encoded as a NumPy array. You build one dict per sample, so the input looks like
[{'Tom Hanks': True, 'Halle Berry': True},
{'Tom Hanks': True, 'Kevin Bacon': True}]
etc. The keys must be strings; the values may be either strings (which are expanded using a one-of-k coding), booleans or numbers. DictVectorizer then transforms these dicts to a matrix that can be fed to a learning algorithm. The matrix will have one column per actor (or other movie feature) in the entire input set. Features not occurring in a dict/sample have an implicit value of zero.
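A minimal sketch of that pipeline (the samples are made up, and get_feature_names_out needs a reasonably recent scikit-learn):

from sklearn.feature_extraction import DictVectorizer

movies = [
    {"Tom Hanks": True, "Halle Berry": True},
    {"Tom Hanks": True, "Kevin Bacon": True},
]

vec = DictVectorizer()                 # returns a SciPy sparse matrix by default
X = vec.fit_transform(movies)

print(vec.get_feature_names_out())     # one column per actor seen in any sample
print(X.toarray())                     # actors absent from a sample are implicitly 0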