How to set up a neural network architecture for binary classification - tensorflow

I am reading through the TensorFlow tutorials on neural networks and came across the architecture part, which is a bit confusing. Can someone explain why the author used the following settings in this code?
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000
model = keras.Sequential()
model.add(keras.layers.Embedding(vocab_size, 16))
model.add(keras.layers.Dense(16, activation=tf.nn.relu))
model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))
Why the value of 16 for the Embedding layer?
And regarding the choice of units: I get the intuition behind the last dense layer, because it is a binary classification (1 unit), but why 16 units in the second layer?
Are the 16 in the embedding and the 16 units in the first dense layer related? Should they be equal?
Could someone also explain this paragraph:
The first layer is an Embedding layer. This layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: (batch, sequence, embedding).
Classify movie reviews: binary classification

vocab_size: All words in your corpus (in this case IMDB) are sorted by frequency and the top 10000 are kept. The rest of the vocabulary is ignored. E.g. "This is really Fancyyyyyyy" is converted into [8 7 9]. As you may guess, the word Fancyyyyyyy is ignored because it is not among the top 10000 words.
pad_sequences: Converts all sentences to the same length. In the training corpus the documents have different lengths, so all of them are padded or truncated to seq_len = 256. After this step the input has shape [Batch_size * seq_len].
Embedding: Each word is converted to a vector with 16 dimensions. The output of this step is a tensor of shape [Batch_size * seq_len * embedding_dim].
GlobalAveragePooling1D: Averages over the sequence dimension, converting [Batch_size * seq_len * embedding_dim] into [Batch_size * embedding_dim].
unit: The output size of a dense layer (MLP layer). It converts [Batch_size * embedding_dim] into [Batch_size * unit].
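The shape changes listed above can be traced with a small NumPy sketch. This is not the tutorial's code: the review contents, the random weights and the padding side are assumptions made only so that the shapes can be printed.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, seq_len, embedding_dim, units = 10000, 256, 16, 16

# Two integer-encoded reviews of different lengths (hypothetical word indices).
reviews = [[8, 7, 9], [12, 5, 301, 42, 7]]

# pad_sequences step: pad with 0 so every review has seq_len entries.
batch = np.zeros((len(reviews), seq_len), dtype=np.int64)
for i, review in enumerate(reviews):
    batch[i, :len(review)] = review
print(batch.shape)      # (2, 256)

# Embedding step: each index selects one row of a (vocab_size, embedding_dim) table.
embedding_table = rng.normal(size=(vocab_size, embedding_dim))
embedded = embedding_table[batch]
print(embedded.shape)   # (2, 256, 16)

# GlobalAveragePooling1D step: average over the sequence axis.
pooled = embedded.mean(axis=1)
print(pooled.shape)     # (2, 16)

# Dense(16, relu) step: matrix multiply, add bias, apply ReLU.
W1, b1 = rng.normal(size=(embedding_dim, units)), np.zeros(units)
hidden = np.maximum(pooled @ W1 + b1, 0.0)
print(hidden.shape)     # (2, 16)

# Dense(1, sigmoid) step: one probability per review.
W2, b2 = rng.normal(size=(units, 1)), np.zeros(1)
prob = 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))
print(prob.shape)       # (2, 1)
```

Note that units and embedding_dim are separate variables. Only the first dimension of W1 has to match embedding_dim, so the two 16s do not need to be equal.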

The first layer is given vocab_size because each word is represented as an index into the vocabulary. Conceptually, if the input word is 'word', which is the 500th word in the vocabulary, the input is a vector of length vocab_size with all zeros except a one at index 500. This is commonly referred to as a 'one-hot' representation. (In practice, Keras's Embedding layer takes the integer index directly and looks up the corresponding row, which gives the same result.)
The embedding layer essentially takes this huge input vector and condenses it into a smaller vector (in this case, of length 16) that encodes some information about the word. The embedding weights are learned during training, just like the weights of any other neural network layer. I'd recommend reading up on word embeddings. The length of 16 is somewhat arbitrary here and can be tuned. One could do away with the embedding layer, but then the model would have less expressive power (it would just be logistic regression, which is a linear model).
Then, as you said, the last layer simply predicts the class of the review based on the embedding.
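To illustrate why the one-hot picture and an index lookup amount to the same thing, here is a minimal NumPy sketch. The embedding table is random (an assumption; in the real model it is learned) and index 500 is the example word from above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10000, 16
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

word_index = 500
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

via_matmul = one_hot @ embedding_table      # multiply the one-hot vector by the table
via_lookup = embedding_table[word_index]    # pick row 500 directly

print(via_matmul.shape)                     # (16,)
print(np.allclose(via_matmul, via_lookup))  # True
```

The lookup avoids building and multiplying a 10000-element vector for every word, which is presumably why embedding layers take indices as input.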


How does Keras produce output of different size y, given an input of size x?

I am new to neural networks. I am reading a lot of guides and tutorials that start with an LSTM layer whose input size differs from its output size,
e.g. model.add(LSTM(100, input_shape=(20, 1))) ->
before doing ->
model.add(Dense(80, activation='relu')), etc.
Presumably, the output of the LSTM here has size 100, whereas the input has only 20.
For a dense layer I can imagine how that works, because there are plenty of diagrams depicting it, but how can an LSTM produce an output of a very different size from its input?
Also, importantly, what range of values can the output size effectively take given the input (let's say of size 20)? Would any value make sense?
The output size can be anything. For example, when feeding word embeddings of length 256 and asking for an output of length 1000, it roughly follows these steps:
The embedding goes into the LSTM (here I am ignoring the batch and the sequence length; just one word embedding in one time step).
The weight matrices (Waa, Wax, Way, etc.) are initialized. Their shapes depend on the output size you chose (e.g. the 100 in your LSTM(100, ...)).
All the needed calculations are carried out as per the LSTM equations.
An output vector of length 1000 is generated.
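The steps above can be sketched in NumPy. This is a simplified LSTM with random weights, not Keras's implementation; the function name lstm_last_output and the weight names W and U are made up for illustration (they play roughly the role of the Wax and Waa matrices mentioned above). The point is that the weight shapes are derived from the requested number of units, so any input size can be mapped to any output size.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_output(x_seq, units, seed=0):
    """Run a randomly initialised LSTM over x_seq (timesteps, features)
    and return the final hidden state, a vector of length `units`."""
    rng = np.random.default_rng(seed)
    features = x_seq.shape[1]
    # The four gates (input, forget, candidate, output) are stacked side by side.
    W = rng.normal(scale=0.1, size=(features, 4 * units))  # input -> gates
    U = rng.normal(scale=0.1, size=(units, 4 * units))     # previous hidden state -> gates
    b = np.zeros(4 * units)
    h = np.zeros(units)  # hidden state (the layer's output)
    c = np.zeros(units)  # cell state
    for x_t in x_seq:
        z = x_t @ W + h @ U + b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

# input_shape=(20, 1): 20 time steps with 1 feature each, mapped to 100 units.
x_seq = np.random.default_rng(1).normal(size=(20, 1))
h = lstm_last_output(x_seq, units=100)
print(h.shape)  # (100,)
```

In the question's example, input_shape=(20, 1) means 20 time steps of one feature each, so the 20 is a sequence length rather than a layer width. The 100 is a free hyperparameter, and each output value lies strictly between -1 and 1 because it is a sigmoid times a tanh.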

How is an input translated to the input units of a NN

I am quite new to machine learning and neural nets. I've used the following model for sentiment analysis of short texts. I generally understand how signals are computed, all the way to the output layer. What I don't understand is how the inputs are found. When the model classifies a word, how is that word translated to the 512 input units? What features of the word does the model assess, and how is that decided?
model = Sequential()
model.add(Dense(512, input_shape=(max_words,), activation='relu'))
model.add(Dense(256, activation='sigmoid'))
model.add(Dense(2, activation='softmax'))
When the model classifies a word, how is that word translated to the 512 input units?
As you already noticed, before any kind of written information (single words, sentences or whole texts) can be processed by a neural network, it must be encoded into a vector representation. This is called an embedding or a representation, and finding suitable embeddings is a subfield of Natural Language Processing (NLP) research. (Note that in your model the input has max_words units; 512 is the size of the first hidden layer.)
Over the years a number of different representations have been published. For single words there is, for example, Word2Vec, in which a neural network has "learned" the embedding based on the semantic similarity of the words. That is, words that are similar in context should be close together in the vector space.
The simplest embedding for a sentence is a bag-of-words embedding. We count how many distinct words there are in our corpus of sentences (say N) and transform each sentence into a vector of length N, where each index represents a word and the value at that index is the number of occurrences of that word in the sentence.
Of course there are many more sophisticated text embeddings.
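A minimal sketch of the bag-of-words embedding described above, in plain Python. The three-sentence corpus is invented for illustration, and whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

corpus = [
    "the movie was great",
    "the movie was bad bad bad",
    "great acting",
]

# Vocabulary: every distinct word in the corpus, in a fixed (sorted) order.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
index_of = {word: i for i, word in enumerate(vocabulary)}

def bag_of_words(sentence):
    """Return a count vector of length N, where N is the vocabulary size."""
    counts = Counter(sentence.split())
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in index_of:          # words outside the vocabulary are ignored
            vector[index_of[word]] = count
    return vector

print(vocabulary)
# ['acting', 'bad', 'great', 'movie', 'the', 'was']
print(bag_of_words("the movie was bad bad bad"))
# [0, 3, 0, 1, 1, 1]
```

A vector like this, with N equal to max_words, is presumably what a first layer declared with input_shape=(max_words,) expects as input.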
There are multiple methods by which you can obtain the vector embedding of a word.
Count-based methods: PMI, PPMI and SVD
Prediction-based methods: CBOW and Skip-Gram
The count-based methods create a co-occurrence matrix of shape Vocabulary*Vocabulary, in which each word is represented by some count of its co-occurrences with every other word within a neighborhood of K words.
The prediction-based models train on a corpus and create vector embeddings based on how similar the contexts of two words are.
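As a rough sketch of the count-based route, the following builds a Vocabulary*Vocabulary co-occurrence matrix with a window of K words and turns it into PPMI values. The two-sentence corpus and K = 1 are assumptions chosen to keep the example small.

```python
import numpy as np

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
K = 1  # neighbourhood size: words up to K positions away count as co-occurring

tokens = [sentence.split() for sentence in corpus]
vocabulary = sorted({w for sent in tokens for w in sent})
index_of = {w: i for i, w in enumerate(vocabulary)}
V = len(vocabulary)

# Vocabulary x Vocabulary co-occurrence counts.
counts = np.zeros((V, V))
for sent in tokens:
    for pos, word in enumerate(sent):
        lo, hi = max(0, pos - K), min(len(sent), pos + K + 1)
        for ctx in range(lo, hi):
            if ctx != pos:
                counts[index_of[word], index_of[sent[ctx]]] += 1

# PPMI: max(0, log(p(w, c) / (p(w) * p(c)))), with unseen pairs set to 0.
total = counts.sum()
p_wc = counts / total
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.where(counts > 0, np.maximum(pmi, 0.0), 0.0)

print(counts.shape)  # (7, 7)
print(counts[index_of["sat"], index_of["on"]])  # 2.0
```

Each row of ppmi can then serve as the vector for one word; applying an SVD to the matrix is one way to reduce those rows to fewer dimensions.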

MultiClass Keras Classifier prediction output meaning

I have a Keras classifier built using the Keras wrapper of the Scikit-Learn API. The neural network has 10 output nodes, and the training data is all represented using one-hot encoding.
According to Tensorflow documentation, the predict function outputs a shape of (n_samples,). When I fitted 514541 samples, the function returned an array with shape (514541, ), and each entry of the array ranged from 0 to 9.
Since I have ten different outputs, does the numerical value of each entry correspond exactly to the result that I encoded in my training matrix?
i.e. if index 5 of my one-hot encoding of y_train represents "orange", does a prediction value of 5 mean that the neural network predicted "orange"?
Here is a sample of my model:
model = Sequential()
model.add(Dropout(0.2, input_shape=(32,) ))
model.add(Dense(21, activation='selu'))
model.add(Dense(10, activation='softmax'))
There are some issues with your question.
The neural network has 10 output nodes, and the training data is all represented using one-hot encoding.
Since your network has 10 output nodes and your labels are one-hot encoded, your model's output should also be 10-dimensional, with the same layout as the one-hot labels, i.e. of shape (n_samples, 10). Moreover, since you use a softmax activation for your final layer, each element of the 10-dimensional output should be in [0, 1] and is interpreted as the probability that the sample belongs to the respective (one-hot encoded) class.
According to Tensorflow documentation, the predict function outputs a shape of (n_samples,).
It's puzzling why you refer to Tensorflow, while your model is clearly a Keras one; you should refer to the predict method of the Keras sequential API.
When I fitted 514541 samples, the function returned an array with shape (514541, ), and each entry of the array ranged from 0 to 9.
If something like that happens, it must be due to a later part of your code that you do not show here (or to the scikit-learn wrapper, whose predict returns class labels rather than probabilities). In any case, the idea is to find the index of the highest value in each 10-dimensional network output (since the values are interpreted as probabilities, it is intuitive that the element with the highest value is the most probable class). In other words, somewhere along the way there must be something equivalent to this:
pred = model.predict(x_test)
y = np.argmax(pred, axis=1) # numpy must have been imported as np
which will give an array of shape (n_samples,), with each entry of y an integer between 0 and 9, as you report.
i.e. if index 5 of my one-hot encoding of y_train represents "orange", does a prediction value of 5 mean that the neural network predicted "orange"?
Provided that the above hold, yes.
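A small NumPy sketch of that correspondence. The class names are hypothetical, with "orange" placed at index 5 as in the question, and the pred array is a stand-in for what model.predict would return.

```python
import numpy as np

class_names = ["apple", "banana", "cherry", "date", "elderberry",
               "orange", "grape", "kiwi", "lemon", "mango"]  # hypothetical labels

# One-hot training labels: each row has a 1 in the column of its class.
y_int = np.array([5, 0, 9])
y_one_hot = np.eye(len(class_names))[y_int]
print(y_one_hot.shape)                       # (3, 10)

# Stand-in for model.predict(x_test): softmax rows of shape (n_samples, 10).
logits = np.random.default_rng(0).normal(size=(3, 10))
logits[np.arange(3), y_int] += 10.0          # make the intended class dominate
pred = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

y = np.argmax(pred, axis=1)                  # column index of the largest probability
print(y.shape)                               # (3,)
print([class_names[k] for k in y])           # ['orange', 'apple', 'mango']
```

The integer returned by argmax is the column index of the one-hot encoding, so it maps back to a label only through the same ordering that was used to encode y_train.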

How to use Keras LSTM with word embeddings to predict word id's

I have problems understanding how to get the correct output when using word embeddings in Keras. My settings are as follows:
My inputs are batches of shape (batch_size, sequence_length). Each row in a batch represents one sentence, and the words are represented by word ids. The sentences are padded with zeros so that all have the same length. For example, a (3,6) input batch might look like: np.array([[1,3,5,6,0,0],[1,7,4,5,8,0],[1,3,8,2,7,2]])
My targets are given by the input batch shifted by one step, so for each input word I want to predict the next word: np.array([[3,5,6,0,0,0],[7,4,5,8,0,0],[3,8,2,7,2,0]])
I feed such an input batch into the Keras embedding layer. My embedding size is 100, so the output will be a 3D tensor of shape (batch_size, sequence_length, embedding_size). In the little example it's (3,6,100).
This 3D batch is fed into an LSTM layer.
The output of the LSTM layer is fed into a Dense layer with (sequence_length) output neurons and a softmax activation function, so the shape of the output will be like the shape of the input, namely (batch_size, sequence_length).
As a loss I am using the categorical crossentropy between the input and target batch.
My question:
The output batch will contain probabilities because of the softmax activation function. But what I want is for the network to predict integers, so that the output fits the target batch of integers.
How can I "decode" the output so that I know which word the network is predicting? Or do I have to construct the network differently?
Edit 1:
I have changed the output and target batches from 2D arrays to 3D tensors. So instead of using a target batch of size (batch_size, sequence_length) with integer id's I am now using a one-hot encoded 3D target tensor (batch_size, sequence_length, vocab_size). To get the same format as an output of the network, I have changed the network to output sequences (by setting return_sequences=True in the LSTM layer). Further, the number of output neurons was changed to vocab_size such that the output layer now produces a batch of size (batch_size, sequence_length, vocab_size).
With this 3D encoding I can get the predicted word id using tf.argmax(outputs, 2). This approach seems to work for the moment, but I would still be interested in whether it's possible to keep the 2D targets/outputs.
One solution, perhaps not the best, is to output one-hot vectors the size of your dictionary (including dummy words).
Your last layer must output (sequence_length, dictionary_size+1).
Your dense layer will already output the sequence_length dimension if you don't add any Flatten() or Reshape() before it, so it should be a Dense(dictionary_size+1).
You can use the function keras.utils.to_categorical() to transform an integer into a one-hot vector and keras.backend.argmax() to transform a one-hot vector into an integer.
Unfortunately, this is sort of unpacking your embedding. It would be nice if it were possible to have a reverse embedding or something like that.
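The encode/decode round trip can be sketched with NumPy alone. The target batch below is a small hypothetical example with word ids 1 to 8 and 0 for padding; np.eye indexing is used here as a stand-in for keras.utils.to_categorical, and argmax over the last axis as a stand-in for keras.backend.argmax.

```python
import numpy as np

vocab_size = 9  # word ids 1..8 plus the padding id 0, i.e. dictionary_size + 1

targets = np.array([[3, 5, 6, 0, 0, 0],
                    [7, 4, 5, 8, 0, 0],
                    [3, 8, 2, 7, 2, 0]])     # (batch_size, sequence_length)

# Integer ids -> one-hot vectors along a new last axis.
one_hot_targets = np.eye(vocab_size)[targets]
print(one_hot_targets.shape)                 # (3, 6, 9)

# The network output has this same shape; argmax over the last axis recovers the ids.
decoded = one_hot_targets.argmax(axis=2)
print(np.array_equal(decoded, targets))      # True
```

Applied to the softmax output of the network instead of one_hot_targets, the same argmax yields the predicted word id for every position in every sentence.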

Understanding input/output dimensions of neural networks

Let's take a fully-connected neural network with one hidden layer as an example. The input layer consists of 5 units that are each connected to all hidden neurons. In total there are 10 hidden neurons.
Libraries such as Theano and Tensorflow allow multidimensional input/output shapes. For example, we could use sentences of 5 words where each word is represented by a 300d vector.
How is such an input mapped onto the described neural network? I do not understand what an output shape of (None, 5, 300) (just an example) means. In my imagination we just have a bunch of neurons through which single numbers flow.
When I have an output shape of (None, 5, 300), how many neurons do I have in the corresponding network? How do I connect the words to my neural network?
Yes, we just have a bunch of neurons through which single numbers flow.
But: if you must give your network 5 numbers as input, it's then convenient to give these numbers in an array with length 5.
And if you're giving 30 thousand examples for your network to train, then it's convenient to create an array with 30 thousand elements, each element being an array of 5 numbers.
In the end, this input with 30 thousand examples of 5 numbers is an array with shape (30000,5).
Each layer then has its own output shape. Each layer's output is related to its own number of neurons. Each neuron outputs a number (or sometimes an array, depending on the layer type you're using). So 10 neurons together output 10 numbers per example, which are then packed into an array shaped (30000,10).
The word "None" in those shapes stands for the batch size (the number of examples you give for training or predicting). You don't define that number; it is inferred automatically when you pass a batch.
Looking at your network:
When you have an input of 5 units, you have an input shape of (None,5). But you actually pass only (5,) to your model, because the None part is the batch size, which is only known once you pass data for training or predicting.
This shape means: you have to give your network an array with a number of samples, each sample being an array of 5 numbers.
Then, your hidden layer with 10 neurons will compute 10 numbers per sample as output, in an array shaped (None, 10).
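As a sketch of the arithmetic behind those shapes, here is the 5-input, 10-neuron hidden layer written out in NumPy. The random data and weights and the tanh activation are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(30000, 5))   # 30 thousand examples of 5 numbers each

# A fully-connected layer with 10 neurons: one weight column and one bias per neuron.
W = rng.normal(size=(5, 10))
b = np.zeros(10)
hidden = np.tanh(x @ W + b)       # tanh is an arbitrary choice of activation

print(hidden.shape)               # (30000, 10)
print(W.size + b.size)            # 60
```

The weights have no batch dimension. The same 60 parameters are applied to every row of x, which is why the batch size can stay unspecified (None) when the model is defined.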
What is a (None,5,300)?
If you're saying that each word is a 300d vector, there are a few different ways to translate a word into that.
One of the common ways starts from the question: how many words do you have in your dictionary?
If you have a dictionary with 300 words, you can make each word a vector with 300 elements, all zeros except for one of them.
Say the word "hello" is the first word in your dictionary; its vector will be [1,0,0,0, ...., 0]
Say the word "my" is the second word in your dictionary; its vector will be [0,1,0,0, ...., 0]
And the word "fly" is the last one in the dictionary; its vector will be [0,0,0,0, ...., 1]
You do this for your entire dictionary, and whenever you have to pass the word "hello" to your network, you pass [1,0,0,0 ..., 0] instead.
A sentence with five words will then be an array of five of these arrays. That is, a sentence with five words will be shaped (5, 300). If you pass 30 thousand sentences as examples: (30000,5,300). In the model, "None" appears in place of the batch size: (None, 5, 300).
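A small sketch of this one-hot encoding in NumPy, using a made-up dictionary of 5 words instead of 300 so that the arrays stay readable. The helper name encode is hypothetical.

```python
import numpy as np

dictionary = ["hello", "my", "name", "is", "fly"]  # 5 words instead of 300
index_of = {word: i for i, word in enumerate(dictionary)}

def encode(sentence):
    """Turn a sentence into an array of one-hot rows, one row per word."""
    ids = [index_of[word] for word in sentence.split()]
    return np.eye(len(dictionary))[ids]

sentence = encode("hello my name is fly")
print(sentence[0])      # [1. 0. 0. 0. 0.]  the vector for "hello"
print(sentence.shape)   # (5, 5): five words, each a vector as long as the dictionary

batch = np.stack([sentence, encode("my name is fly hello")])
print(batch.shape)      # (2, 5, 5): (number of sentences, words, dictionary size)
```

With a 300-word dictionary the same code would give (5, 300) per sentence and (number of sentences, 5, 300) per batch.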
There are also other options, such as creating a word embedding, which translates the words into vectors of meanings, meanings which only the network will understand. (Keras has the Embedding layer for that.)
There are also techniques such as CBOW (continuous bag of words).
You have to know what you want to do first, so you can translate your words into an array that fits the network's requirements.
How many neurons do I have for an output of (None,5,300)?
This only tells you about the last layer. The outputs of the earlier layers were all computed and then consumed by the layers that follow, each of which changes the shape. Each layer has its own output. (When you have a model, you can call model.summary() and see the output shape of each layer.)
Even so, it's impossible to answer that question without knowing which types of layers you're using.
There are layers such as Dense that throw out things like (BatchSize,NumberOfNeurons)
But there are layers such as Convolution2D that throw out things like (BatchSize, numberOfChannels, pixelsInX, pixelsInY). For instance, a regular image has three channels: red, blue and green. An array for passing a regular image would be like (3,sizeX,sizeY).
It all depends on which layer type you're using.
Using a word embedding
For using an embedding, it's worth reading the Keras documentation about it.
For that you will have to transform your words into indices.
Instead of saying that each word in your dictionary is a vector, you say it's a number.
Word "hello" is 1
Word "my" is 2
Word "fly" is theSizeOfYourDictionary
If you want each sentence to have 100 words, then your input shape will be (None, 100), where each array of 100 numbers contains the indices of the words in your dictionary.
The first layer in your model will be an Embedding layer.
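model = Sequential()
# Word indices run from 1 to theSizeOfYourDictionary, so the table needs one extra row for index 0.
model.add(Embedding(theSizeOfYourDictionary + 1, 300, input_length=100))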
This way, you're creating vectors of size 300 for each word, passing sequences of 100 words. (I'm not used to embeddings, but it seems 300 is a big number, it could be less).
The output of this embedding will be (None, 100, 300).
Then you connect other layers after it.
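To make the index route concrete, here is a hedged NumPy sketch of what happens before and inside such an embedding layer. The three-word dictionary, the helper to_indices and the random table are all assumptions; in a real model the table is the layer's trainable weight matrix.

```python
import numpy as np

dictionary = ["hello", "my", "fly"]
# Index 0 is kept for padding, so words are numbered from 1.
index_of = {word: i + 1 for i, word in enumerate(dictionary)}

def to_indices(sentence, length=100):
    """Map words to indices, then cut or zero-pad the list to a fixed length."""
    ids = [index_of[word] for word in sentence.split()][:length]
    return ids + [0] * (length - len(ids))

batch = np.array([to_indices("hello my fly"), to_indices("fly fly")])
print(batch.shape)      # (2, 100)

# Stand-in for the embedding weights: one row of 300 numbers per index (including 0).
table = np.random.default_rng(0).normal(size=(len(dictionary) + 1, 300))
embedded = table[batch]
print(embedded.shape)   # (2, 100, 300)
```

With the batch size left open, (2, 100, 300) corresponds to the (None, 100, 300) output shape mentioned above.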