The TensorFlow Fashion-MNIST tutorial is great... but it seems clear you have to know in advance that there are 10 distinct labels in the dataset, and that the input data consists of 28x28 images. I would have thought these details should be readily discoverable from the dataset itself - is this possible? Could I discover the same information the same way on a quite different dataset (e.g. the Titanic dataset, which comprises M rows by N columns of CSV data and is a binary classification task)? tf.data.Dataset does not appear to have any obvious get_label_count() or get_input_shape() functions in its API. Call me a newbie, but this surprises/confuses me.
According to the accepted answer to this question, TensorFlow tf.data.Dataset instances are lazily evaluated, meaning that you could, in principle, need to iterate through an entire dataset to establish the number of distinct labels and the input data shape(s) (which can be variable, for example with variable-length sequences of sound or text).
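For what it's worth, a minimal sketch of how far that discovery can go in recent TensorFlow versions, assuming a dataset of (image, label) pairs such as Fashion-MNIST: statically known shapes are exposed up front via element_spec, but counting the distinct labels really does require a full pass.

import tensorflow as tf

# Assumes a (features, label) dataset, e.g. Fashion-MNIST.
(train_images, train_labels), _ = tf.keras.datasets.fashion_mnist.load_data()
ds = tf.data.Dataset.from_tensor_slices((train_images, train_labels))

# Element structure (shapes, dtypes) is available without iterating:
print(ds.element_spec)  # (TensorSpec(shape=(28, 28), ...), TensorSpec(shape=(), ...))

# The set of distinct labels requires iterating the whole dataset:
distinct_labels = {int(label) for _, label in ds}
print(len(distinct_labels))  # 10 for Fashion-MNIST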
I am trying to create a machine learning model to predict the position of each team, but I am having trouble organizing the data in a way the model can train on.
I want the pandas dataframe to look something like this, where each tournament has team members constantly shifting between teams.
Based on the inputted teammates, the model should then predict the team's position. Does anyone have suggestions on how I can build a pandas dataframe like this that a model can use as training data? I'm completely stumped. Thanks in advance!
Coming to the question of how to create this sheet: you can easily get the data and store it in the format you described above. The trick is in how to use it as training data for your model. It needs to be converted into numerical form before any model can consume it. Since the maximum team size is 3 in most cases, we can split the three names into three columns (leaving a column blank if a team has fewer than 3 members). We can then use either label encoding or one-hot encoding to convert the names to numbers. You should fit a LabelEncoder on a combined list of all three columns and then call its transform function on each column individually (since names may be shared across the 3 columns), as in the sketch below. With label encoding we can easily use tree-based models. One-hot encoding might lead to the curse of dimensionality, since there will be many distinct names, so I would prefer not to use it for an initial simple model.
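A minimal sketch of that encoding step (the column names and data here are made up for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame: one team member per column, blank if the slot is empty.
df = pd.DataFrame({
    "member1": ["alice", "bob", "carol"],
    "member2": ["dave", "alice", ""],
    "member3": ["", "carol", ""],
})
member_cols = ["member1", "member2", "member3"]

# Fit one encoder on the union of all names so that a player shared across
# columns always maps to the same integer code.
le = LabelEncoder()
le.fit(pd.concat([df[c] for c in member_cols]))
for c in member_cols:
    df[c + "_enc"] = le.transform(df[c])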
NOTE:
If anyone else is wondering about this topic: I understand you're getting deeper into the data analysis world, so here is what I learned when I asked this question before:
You encode categorical values as INTEGERS only if you're dealing with ordinal classes, i.e. classes with a natural order, such as college degrees or customer satisfaction survey responses.
Otherwise, if you're dealing with nominal classes like gender, colors, or names, you MUST convert them with other methods, since they do not imply any numerical order; the best-known methods are one-hot encoding and dummy variables.
I encourage you to read more about them and hope this has been useful.
Check the link below to see a nice explanation:
https://www.youtube.com/watch?v=9yl6-HEY7_s
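To make the distinction concrete, a small made-up pandas example:

import pandas as pd

df = pd.DataFrame({
    "degree": ["highschool", "bachelor", "master"],  # ordinal
    "color": ["red", "green", "blue"],               # nominal
})

# Ordinal class: integer codes that respect the natural order are fine.
df["degree_code"] = df["degree"].map({"highschool": 0, "bachelor": 1, "master": 2})

# Nominal class: one-hot encoding / dummy variables, no order implied.
df = pd.get_dummies(df, columns=["color"])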
This may be a simple question but I think it can be useful for beginners.
I need to run a prediction model on a test dataset, so to convert the categorical variables into categorical codes that the random forest model can handle, I use these lines on each of them:
Train:
# convert the column to pandas' category dtype
data_['Col1_CAT'] = data_['Col1'].astype('category')
# replace each category with its integer code (assigned in sorted order)
data_['Col1_CAT'] = data_['Col1_CAT'].cat.codes
So, before running the model I have to apply the same procedure to both the train and the test data.
And since both datasets have the same categorical variables/columns, it would be useful to apply the same categorical codes to each column in both.
However, although I'm handling the same variables in each dataset, I get different codes every time I use these two lines.
So, my question is: how can I get the same codes every time I convert the same categoricals in each dataset?
Thanks for your insights and feedback.
Usually, how I do this is to do the categorical conversions before the train-test split, so that I get one neatly transformed dataset. Once I've done that, I do the train-test split, train the model, and evaluate it on the test set; a sketch follows.
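Here is a rough sketch of both routes (train_df, test_df, and the column names are illustrative): either concatenate before encoding, or fix the category list from the training data so the test set reuses it.

import pandas as pd

# Option 1: encode before splitting, so train and test share one code table.
# Assumes train_df and test_df are existing DataFrames with a 'Col1' column.
full = pd.concat([train_df, test_df], keys=["train", "test"])
full["Col1_CAT"] = full["Col1"].astype("category").cat.codes
train_df, test_df = full.loc["train"], full.loc["test"]

# Option 2: if the split already exists, reuse the training categories;
# values unseen in training get the code -1.
cats = train_df["Col1"].astype("category").cat.categories
test_df["Col1_CAT"] = pd.Categorical(test_df["Col1"], categories=cats).codes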
I am trying to cluster sentences by clustering their sentence embeddings, taken from a fastText model. Each sentence embedding has 300 dimensions, and I want to reduce them to 50 (say). I tried t-SNE, PCA, and UMAP, and I wanted to see how an autoencoder works on my data.
Now, would passing those 300 numbers for each sentence to the NN as separate features make sense, or should they be passed as a single entity? If the latter, is there a way to pass a list as a feature to an NN?
I tried passing the 300 numbers as individual features and then clustered the output. I got very few meaningful clusters; the rest were either noise or groups of dissimilar sentences (whereas with other techniques like UMAP I could get far more meaningful clusters, and more of them). Any leads would be helpful. Thanks in advance :)
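For what it's worth, a minimal Keras autoencoder sketch that treats each 300-dimensional embedding as a single input vector (the hyperparameters are placeholders, not recommendations):

import numpy as np
import tensorflow as tf

embeddings = np.random.rand(1000, 300).astype("float32")  # stand-in for the fastText vectors

inputs = tf.keras.Input(shape=(300,))
encoded = tf.keras.layers.Dense(50, activation="relu")(inputs)  # 300 -> 50 bottleneck
decoded = tf.keras.layers.Dense(300)(encoded)                   # 50 -> 300 reconstruction

autoencoder = tf.keras.Model(inputs, decoded)
encoder = tf.keras.Model(inputs, encoded)

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(embeddings, embeddings, epochs=20, batch_size=64)

reduced = encoder.predict(embeddings)  # shape (1000, 50): cluster these instead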
Imagine I have hundreds of rectangular patterns that look like the following:
_yx_0zzyxx
_0__yz_0y_
x0_0x000yx
_y__x000zx
zyyzx_z_0y
Say the only variables for the different patterns are the dimensions (width by height in characters) and the value at each cell within the rectangle, drawn from the possible characters _ y x z 0. So another pattern might look like this:
yx0x_x
xz_x0_
_yy0x_
zyy0__
and another like this:
xx0z00yy_z0x000
zzx_0000_xzzyxx
_yxy0y__yx0yy_z
_xz0z__0_y_xz0z
y__x0_0_y__x000
xz_x0_z0z__0_x0
These simplified examples were randomly generated, but imagine there is a deeper structure relating the dimensions and the layout of the characters.
I want to train on this dataset in an unsupervised fashion (no labels) in order to generate similar output. Assuming I have created my dataset appropriately with tf.data.Dataset and categorical identity columns (a sketch of the setup I have in mind follows the questions below):
what is a good general purpose model for unsupervised training (no labels)?
is there a Tensorflow premade estimator that would represent such a model well enough?
once I've trained the model, what is a general approach to using it to generate patterns based on what it has learned? I have in mind Google Magenta, which can be trained on a dataset of musical melodies in order to generate similar ones from a kind of seed/primer melody
I'm not looking for a full implementation (that's the fun part!), just some suggested tutorials and next steps to follow. Thanks!
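For reference, the kind of dataset setup I'm assuming above (a rough sketch; the padding scheme and sizes are just illustrative):

import tensorflow as tf

# Map each of the five possible characters to an integer id.
vocab = {"_": 0, "y": 1, "x": 2, "z": 3, "0": 4}

def encode(pattern, max_h, max_w):
    # pattern is a list of equal-length strings; pad with '_' to max_h x max_w
    grid = [[vocab[ch] for ch in row] for row in pattern]
    grid = [row + [vocab["_"]] * (max_w - len(row)) for row in grid]
    grid += [[vocab["_"]] * max_w] * (max_h - len(grid))
    return grid

patterns = [
    ["yx0x_x", "xz_x0_", "_yy0x_", "zyy0__"],
]
encoded = [encode(p, max_h=6, max_w=15) for p in patterns]
ds = tf.data.Dataset.from_tensor_slices(tf.constant(encoded, dtype=tf.int32))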
I want to train a DNN model on training data with more than one billion feature dimensions, so the shape of the first layer's weight matrix would be (1,000,000,000, 512). This weight matrix is too large to be stored on one box.
As of now, is there any solution for dealing with such large variables, for example partitioning the large weight matrix across multiple boxes?
Update:
Thanks Olivier and Keveman. Let me add more detail about my problem.
The examples are very sparse and all features are binary values: 0 or 1. The parameter weight looks like tf.Variable(tf.truncated_normal([1000000000, 512], stddev=0.1)).
The solutions Keveman gave seem reasonable, and I will update with results after trying them.
The answer to this question depends greatly on what operations you want to perform on the weight matrix.
The typical way to handle such a large number of features is to treat the 512-dimensional vector per feature as an embedding. If each of your examples in the data set has only one of the 1 billion features, then you can use the tf.nn.embedding_lookup function to look up the embeddings for the features present in a mini-batch of examples. If each example has more than one feature, but presumably only a handful of them, then you can use tf.nn.embedding_lookup_sparse to look up the embeddings.
In both cases, your weight matrix can be distributed across many machines. That is, the params argument to both of these functions is a list of tensors. You would shard your large weight matrix and locate the shards on different machines, roughly as in the sketch below. Please look at tf.device and the primer on distributed execution to understand how data and computation can be distributed across many machines.
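A TF 1.x-style sketch of the sharded lookup (toy sizes, and a hypothetical /job:ps cluster layout; in practice each shard would hold roughly 250 million of the 1 billion rows):

import tensorflow as tf

num_shards = 4
shards = []
for i in range(num_shards):
    # Hypothetical cluster layout: one shard per parameter server task.
    with tf.device("/job:ps/task:%d" % i):
        shards.append(tf.get_variable(
            "embedding_shard_%d" % i,
            shape=[1000, 512],  # toy size; ~250M rows per shard in practice
            initializer=tf.truncated_normal_initializer(stddev=0.1)))

# SparseTensor of active feature ids: two examples, a handful of ids each.
sp_ids = tf.SparseTensor(
    indices=[[0, 0], [1, 0], [1, 1]],
    values=tf.constant([3, 3999, 42], dtype=tf.int64),
    dense_shape=[2, 2])

# Sums the 512-d embeddings of each example's active (binary) features;
# with the default 'mod' strategy, id i lives in shards[i % num_shards].
activations = tf.nn.embedding_lookup_sparse(shards, sp_ids, None, combiner="sum")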
If you really want to perform some dense operation on the weight matrix, say multiply the matrix by another matrix, that is still conceivable, although there are no ready-made recipes in TensorFlow to handle it. You would still shard your weight matrix across machines, but then you would have to manually construct a sequence of matrix multiplies on the distributed blocks of your weight matrix and combine the results, roughly as sketched below.
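A rough sketch of that manual block multiply, with toy sizes: W is split row-wise, the input is split column-wise to match, and the partial products are summed (each block product could be placed on a different machine with tf.device):

import tensorflow as tf

# W ([8, 512] here, [1e9, 512] in practice) split into two row blocks.
w_shards = [tf.random.normal([4, 512]) for _ in range(2)]

x = tf.random.normal([32, 8])                          # dense input batch
x_blocks = tf.split(x, num_or_size_splits=2, axis=1)   # column blocks matching W's row blocks

# x @ W == sum over i of x_blocks[i] @ w_shards[i]
partials = [tf.matmul(xb, wb) for xb, wb in zip(x_blocks, w_shards)]
y = tf.add_n(partials)  # shape [32, 512]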