How to embed discrete IDs in Tensorflow? - tensorflow

There are many discrete IDs that I want to embed and feed into a neural network. tf.nn.embedding_lookup only supports a fixed range of IDs, i.e., IDs from 0 to N. How can I embed discrete IDs whose range is 0 to 2^62?

Just to clarify how I understand your question: you want to do something like word embeddings, but instead of words you want to use discrete IDs (not indices). Your IDs can be very large (up to 2^62), but the number of distinct IDs is much smaller.
If we were to process words, we would build a dictionary of the words and feed the indices within that dictionary to the neural network (into the embedding layer). That is basically what you need to do with your discrete IDs too. Usually you would also reserve one number (such as 0) for previously unseen values. You could also later trim the dictionary to only include the most frequent values and put all others into the same unknown bucket (exactly the same options you would have when doing word embeddings or other NLP). A minimal TensorFlow sketch follows the example mapping below.
e.g.:
unknown -> 0
84588271 -> 1
92238356 -> 2
78723958 -> 3
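As an illustration (one possible way to do this, with made-up IDs and dimensions), the mapping can be held in a tf.lookup.StaticHashTable and the resulting dense indices fed into an ordinary embedding layer:

    import tensorflow as tf

    # Hypothetical set of distinct IDs observed in the training data.
    observed_ids = [84588271, 92238356, 78723958]

    # Map each raw ID to a dense index; index 0 is reserved for unseen IDs.
    keys = tf.constant(observed_ids, dtype=tf.int64)
    values = tf.constant(list(range(1, len(observed_ids) + 1)), dtype=tf.int64)
    table = tf.lookup.StaticHashTable(
        tf.lookup.KeyValueTensorInitializer(keys, values),
        default_value=0)  # unknown -> 0

    # The embedding only needs to cover the small index range, not the 2^62 ID space.
    embedding = tf.keras.layers.Embedding(input_dim=len(observed_ids) + 1, output_dim=8)

    ids = tf.constant([84588271, 123456789], dtype=tf.int64)  # second ID was never seen
    vectors = embedding(table.lookup(ids))  # shape: (2, 8)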

Related

Processing column with letters before feeding into a NN

I wanted to implement a classification algorithm using a NN, but some columns have complex alphanumeric strings, so I chose only the simpler columns to check. Here is an example with a few elements of the columns I chose:
(Image: a few elements of the chosen columns)
As you can see, these columns contain A, G, C or T, etc. Some had combinations of the 4 letters, but I removed those for now. My plan was to map each of these letters to values like 1, 2, 3 and 4 and then feed them to the NN.
Is this mapping acceptable for feeding into a dense NN? Or is there a better method for doing this?
I would not map them to integers like 1, 2, 3, etc., because you would mistakenly give them an order or rank that the NN may treat as meaningful, although this ranking does not truly exist.
If you do not have high cardinality (many unique values), you can apply one-hot encoding. If the cardinality is high, you should use other encoding techniques; otherwise the one-hot encoder will introduce a lot of dimensionality and sparsity into your data, which is undesirable. You can also find other interesting methods to encode categorical variables.
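For illustration, here is a minimal sketch of one-hot encoding such a letter column in TensorFlow; the use of tf.keras.layers.StringLookup and the sample letters are my own assumptions, not from the original post:

    import tensorflow as tf

    # Hypothetical column of nucleotide letters.
    letters = tf.constant(["A", "G", "C", "T", "G"])

    # StringLookup assigns one index per distinct letter (index 0 is kept for out-of-vocabulary values).
    lookup = tf.keras.layers.StringLookup(vocabulary=["A", "C", "G", "T"])
    indices = lookup(letters)                        # [1, 3, 2, 4, 3]

    # One-hot encode so no artificial order is implied between the letters.
    one_hot = tf.one_hot(indices, depth=lookup.vocabulary_size())
    print(one_hot.shape)                             # (5, 5): 4 letters + 1 OOV slot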

Variable number of instances for Multiple Instance Learning

I am trying to do Multiple Instance Learning for a binary classification problem, where each bag of instances has an associated 0/1 label. However, the different bags have different numbers of instances. One solution is to take the minimum of all the bags' instance counts. For example:
Bag1 - 20 instances, Bag2 - 5 instances, Bag3 - 10 instances, etc.
I would take the minimum, i.e., 5 instances from every bag. However, this technique discards all the other instances from the bags, which might contribute to the training.
Is there any workaround/algorithm for MIL that can handle a variable number of instances per bag?
You can try using RaggedTensors for this. They're mostly used in NLP work, since sentences have variable numbers of words (and paragraphs have variable numbers of sentences, etc.), but there's nothing special about ragged tensors that limits them to that domain.
See the TensorFlow docs for more information. Not all layers or operations will work, but you can use things like Dense layers to build a Sequential or Functional (or even custom) model, if this works for you.
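As a minimal sketch of the idea (the shared instance scorer and max-pooling as the MIL aggregation are my own assumptions, as is the toy data), ragged tensors let you keep every instance without trimming the bags:

    import tensorflow as tf

    # Hypothetical bags with 20, 5 and 10 instances; each instance is a 3-d feature vector.
    bag_sizes = [20, 5, 10]
    bags = tf.ragged.constant(
        [[[float(i)] * 3 for i in range(n)] for n in bag_sizes],
        ragged_rank=1)
    labels = tf.constant([0.0, 1.0, 0.0])

    # Score every instance with a shared Dense layer applied to the flat values,
    # then max-pool the scores within each bag (a common MIL pooling choice).
    instance_scorer = tf.keras.layers.Dense(1, activation="sigmoid")
    instance_scores = bags.with_flat_values(instance_scorer(bags.flat_values))
    bag_scores = tf.reduce_max(instance_scores, axis=1)   # shape: (3, 1), one score per bag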

one_hot Vs Tokenizer for Word representation

I have seen in many blogs people using one_hot (from tf.keras.preprocessing.text.one_hot) to convert a string of words into an array of numbers that represent indices. This does not ensure unicity, whereas the Tokenizer class (tf.keras.preprocessing.text.Tokenizer) does.
Then why is one_hot preferred over Tokenizer?
Update: I got to know that hashing is used in one_hot to convert words into numbers, but I didn't get its importance, as we can use the Tokenizer class to do the same thing with more accuracy.
I'm not sure what you mean by unicity; I expect it has to do with the sequential relationship between the words, which of course is lost with one-hot encoding. However, one-hot encoding is used when the number of words is limited. If, say, you have 10 words in the vocabulary, you will create 10 new features, which is fine for most neural networks to process. If you have other features in your data set besides the word sequences, say numeric ordinal parameters, you can still create a single-input model.
However, if you have 10,000 words in the vocabulary, you would create 10,000 new features, which at best will take a lot to process. So in the case of a large vocabulary it is best to use "dense" encoding rather than the sparse encoding generated by one-hot encoding. You can use the results of the Tokenizer encoding as input to a Keras Embedding layer, which will encode the words into an n-dimensional space, where n is a value you specify. If you have additional ordinal features, then to process the data your model will need multiple inputs. Perhaps that is why some people prefer to one-hot encode the words.
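To make the contrast concrete, here is a small sketch; the toy corpus, the hashing space size n=10 and the embedding width are made up for illustration:

    import tensorflow as tf

    texts = ["the cat sat", "the dog ran"]               # hypothetical toy corpus

    # Tokenizer builds an explicit word index, so every distinct word gets a unique id.
    tokenizer = tf.keras.preprocessing.text.Tokenizer()
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)      # [[1, 2, 3], [1, 4, 5]]

    # one_hot hashes each word into a fixed-size space, so two words can collide.
    hashed = [tf.keras.preprocessing.text.one_hot(t, n=10) for t in texts]

    # The tokenizer indices can feed a dense Embedding layer directly.
    vocab_size = len(tokenizer.word_index) + 1
    embedding = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=8)
    vectors = embedding(tf.constant(sequences))          # shape: (2, 3, 8)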

How to pass a list of numbers as a single feature to a neural network?

I am trying to cluster sentences by clustering their sentence embeddings taken from a fastText model. Each sentence embedding has 300 dimensions, and I want to reduce them to 50 (say). I tried t-SNE, PCA and UMAP, and I wanted to see how an autoencoder works for my data.
Now, does it make sense to pass those 300 numbers for each sentence as separate features to the NN, or should they be passed as a single entity? If so, is there a way to pass a list as a single feature to a NN?
I tried passing the 300 numbers as individual features and clustered the output. I could get very few meaningful clusters; the rest were either noise or clusters grouping dissimilar sentences (with other techniques like UMAP I could get far more meaningful clusters). Any leads would be helpful. Thanks in advance :)
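One common way to read this setup is to treat each 300-d embedding as a single input vector of shape (300,), one example per row; a minimal autoencoder sketch under that assumption, with made-up stand-in data, might look like this:

    import tensorflow as tf

    # Hypothetical stand-in for the real sentence embeddings: 1000 sentences, 300 dims each.
    embeddings = tf.random.normal((1000, 300))

    # Each 300-d embedding is one input example; the 50-unit bottleneck gives the reduced codes.
    inputs = tf.keras.Input(shape=(300,))
    code = tf.keras.layers.Dense(50, activation="relu")(inputs)
    outputs = tf.keras.layers.Dense(300)(code)

    autoencoder = tf.keras.Model(inputs, outputs)
    encoder = tf.keras.Model(inputs, code)
    autoencoder.compile(optimizer="adam", loss="mse")
    autoencoder.fit(embeddings, embeddings, epochs=10, batch_size=32, verbose=0)

    reduced = encoder(embeddings)    # shape: (1000, 50), ready for clustering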

Using embedded columns

I'm trying to understand the TensorFlow tutorial on wide and deep learning. The demo application creates indicator columns for categorical features with few categories (gender, education), and it creates embedded columns for categorical features with many categories (native_country, occupation).
I don't understand embedded columns. Is there a rule that clarifies when to use embedded columns instead of indicator columns? According to the documentation, the dimension parameter sets the dimension of the embedding. What does that mean?
From the feature columns tutorial:
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons, as the number of categories grow large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
The dimension parameter is the length of the vector you're reducing the categories to.
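As an illustrative sketch (the vocabulary and dimension=3 below are made up), the two column types can be compared directly with the tf.feature_column API used in that tutorial:

    import tensorflow as tf

    # Hypothetical categorical feature with a small vocabulary, just for illustration.
    occupation = tf.feature_column.categorical_column_with_vocabulary_list(
        "occupation", ["engineer", "teacher", "doctor", "lawyer", "artist"])

    # Indicator column: a one-hot vector with one cell per category.
    indicator = tf.feature_column.indicator_column(occupation)

    # Embedding column: each category is mapped to a trainable dense vector of length `dimension`.
    embedded = tf.feature_column.embedding_column(occupation, dimension=3)

    features = {"occupation": tf.constant([["engineer"], ["artist"]])}
    print(tf.keras.layers.DenseFeatures([indicator])(features).shape)  # (2, 5): one cell per category
    print(tf.keras.layers.DenseFeatures([embedded])(features).shape)   # (2, 3): dimension-length vectors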