In TensorFlow classification, how are the labels ordered when using "predict"?

I'm using the MNIST handwritten numerals dataset to train a CNN.
After training the model, I call predict like this:
predictions = cnn_model.predict(test_images)
predictions[0]
and i get output as:
array([2.1273775e-06, 2.9292005e-05, 1.2424786e-06, 7.6307842e-05,
7.4305902e-08, 7.2301691e-07, 2.5368356e-08, 9.9952960e-01,
1.2401938e-06, 1.2787555e-06], dtype=float32)
In the output there are 10 probabilities, one for each numeral from 0 to 9. But how do I know which probability refers to which numeral?
In this particular case the probabilities are arranged sequentially for numerals 0 to 9. But why is that? I didn't define that mapping anywhere.
I tried going over the documentation and example implementations found elsewhere on the internet, but no one seems to address this particular behaviour.
Edit:
For context, I've defined my train/test data by:
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = (np.expand_dims(train_images, axis=-1)/255.).astype(np.float32)
train_labels = (train_labels).astype(np.int64)
test_images = (np.expand_dims(test_images, axis=-1)/255.).astype(np.float32)
test_labels = (test_labels).astype(np.int64)
And my model consists of a few convolution and pooling layers, then a Flatten layer, then a Dense layer with 128 neurons and an output Dense layer with 10 neurons.
After that I simply fit my model and use predict like this:
model.fit(train_images, train_labels, batch_size=BATCH_SIZE, epochs=EPOCHS)
predictions = cnn_model.predict(test_images)
I don't see where I've instructed my code to treat the first output neuron as digit 0, the second as digit 1, and so on.
And if I wanted to change the sequence in which the resulting digits are output, where would I do that?
This is really confusing me a lot.

Models work with numbers. Your classes/labels should be represented as numbers (e.g., 0, 1, ..., n). The prediction is always indexed so that the probability for class 0 is at index 0, the probability for class 1 is at index 1, and so on. In the MNIST case you are lucky: the labels already are the integers 0 to 9. Suppose instead you had to classify images into three classes: cars, bicycles, trucks. You must represent those classes as numerical values, and you can arrange the mapping however you wish. If you choose {cars: 0, bicycles: 1, trucks: 2}, in other words, if you label your cars as 0, bicycles as 1, and trucks as 2, then your prediction will show the probability for cars at index 0, bicycles at index 1 and trucks at index 2.
You could also have chosen the mapping {cars: 2, bicycles: 0, trucks: 1}; then your prediction would show the probability for cars at index 2, bicycles at index 0 and trucks at index 1, and so on.
The point is, you have to encode your classes (however many you have) as integers from 0 to n, where n is num_classes - 1. The probabilities in the prediction are indexed the same way. You don't have to tell the model anything beyond that.
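For example, a minimal sketch of that cars/bicycles/trucks idea (the dictionaries and the probabilities here are made up for illustration):
import numpy as np

# A mapping you define yourself; the model only ever sees the integer indices.
class_to_index = {"cars": 0, "bicycles": 1, "trucks": 2}
index_to_class = {v: k for k, v in class_to_index.items()}

# Suppose probs is one row of model.predict(...) for a 3-class model:
probs = np.array([0.10, 0.85, 0.05], dtype=np.float32)

predicted_index = int(np.argmax(probs))             # 1
predicted_class = index_to_class[predicted_index]   # "bicycles"
print(predicted_index, predicted_class)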
Hope this is now clear.

It depends on how you prepare your labels during training. For MNIST classification there are usually two different ways:
One-hot Labels: there are 10 classes (digits) in the MNIST data, so for each example (image) you create a label array (vector) of length 10 where all elements are zero except the one at the index corresponding to the digit your input image shows. For example, if your input image shows the digit 8, your label contains zeros everywhere except at index 8 (i.e. [0,0,0,0,0,0,0,0,1,0]). If your image shows the digit 2, your label would be [0,0,1,0,0,0,0,0,0,0], and so on.
Sparse Labels: you label each image directly with the digit it shows; for example, if your image shows the digit 8, your label is the single number 8.
In both cases you could choose the labels however you want; for MNIST classification it is just intuitive to use labels 0-9 for the digits 0-9.
Thus, in the prediction, the probability at index 0 is for digit 0, index 1 for digit 1, and so on.
You could choose to prepare your labels differently. For example you could decide to show your labels as follows:
label for digit 0: 9
label for digit 1: 8
label for digit 2: 7
label for digit 3: 6
label for digit 4: 5
label for digit 5: 4
label for digit 6: 3
label for digit 7: 2
label for digit 8: 1
label for digit 9: 0
You could train your model the same way but in this case, the probabilities in the prediction would be inverted. Probability at index 0 would be for digit 9, index 1 for digit 8, and so on.
In short, you have to define your labels using integer indices, but it is up to you to decide and remember what index you chose to refer to which label/class.
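In Keras, the only place this labelling choice really shows up is the loss function you pick. A minimal sketch, assuming MNIST-like 28x28x1 inputs and a simplified model (the layer sizes are illustrative):
import numpy as np
import tensorflow as tf

num_classes = 10
sparse_labels = np.array([8, 2, 7])                     # one integer per image
onehot_labels = tf.one_hot(sparse_labels, num_classes)  # shape (3, 10); row 0 has a 1 at index 8

# A minimal classifier head; the convolutional part is omitted here.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

# Pick the loss that matches your label format:
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",   # integer (sparse) labels
              metrics=["accuracy"])
# ...or loss="categorical_crossentropy" if you feed one-hot labels instead.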

Related

How to average input samples by group in Keras?

I want to implement a neural network in Keras with this architecture: say I have some inputs and they belong to some groups. Then the neural network is like this:
input -> some layers -> separate inputs by groups -> average inputs by groups -> output
In brief, I want to separate the inputs by group and then take the average within each group.
For example, if I have an input tensor [1, 2, 3, 4, 5, 6] and the elements belong to two groups [0, 1, 1, 0, 0, 1], then I want the output tensor to be [3.333, 3.666, 3.666, 3.333, 3.333, 3.666]. Here 3.333 is the average of group 0 ([1, 4, 5]) and 3.666 is the average of group 1 ([2, 3, 6]).
I am not sure if you can separate the inputs as you described above directly in Keras or Tensorflow. Here is what I could come up with:
Create a mask for each class, where an element is 1 if the element at that index belongs to the class and 0 otherwise. So in your example you would use [0,1,1,0,0,1] for one class and [1,0,0,1,1,0] for the other. (If you have more classes, you will correspondingly have more masks.)
Stack those vectors to get a 3-D tensor and do a 1D convolution with a stride of 1 using tf.nn.conv1d(). Think of those masks as the filters of a convolution operation that separates the classes. Be sure to reshape your tensors to match the operation's requirements.
After the convolution, you will have a 3-D tensor where each vector contains one class's elements. For your example you should get a tensor with the two vectors [0,2,3,0,0,6] and [1,0,0,4,5,0]. Use tf.reduce_sum() on the correct axis and divide by the number of elements in each class to get the average of each class.
Multiply the tensor of means, [[3.333], [3.666]], with the masks using tf.multiply() and add the resulting vectors using tf.reduce_sum() on the correct axis. That should give you the vector you want.
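Here is a simplified sketch of that mask idea in TF2 eager mode, using element-wise multiplication and reductions instead of an explicit tf.nn.conv1d() call (the numbers come from the example in the question):
import tensorflow as tf

inputs = tf.constant([1., 2., 3., 4., 5., 6.])     # the example input
groups = tf.constant([0, 1, 1, 0, 0, 1])           # group id per element
num_groups = 2

# One mask per group: masks[g, i] == 1 if element i belongs to group g.
masks = tf.transpose(tf.one_hot(groups, num_groups))          # shape (2, 6)

group_sums = tf.reduce_sum(masks * inputs, axis=1)            # (2,)
group_counts = tf.reduce_sum(masks, axis=1)                   # (2,)
group_means = group_sums / group_counts                       # approx. [3.333, 3.667]

# Broadcast each group's mean back to its members and collapse the masks.
output = tf.reduce_sum(masks * group_means[:, None], axis=0)  # (6,)
print(output.numpy())  # approx. [3.333 3.667 3.667 3.333 3.333 3.667]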
I have figured out a method. It can be achieved by matrix manipulation. First turn the cluster vector into a categorical (one-hot) matrix. For example, if the batch size is 6, the categorical matrix (cluster) looks like:
1, 0
1, 0
0, 1
0, 1
1, 0
0, 1
Then we generate a cluster_mean matrix by dividing each column by the number of members of that cluster:
1/3, 0
1/3, 0
0, 1/3
0, 1/3
1/3, 0
0, 1/3
If we have an input matrix of shape b x n (where b is the batch size and n is the number of features), we can get the average by cluster using
cluster * t(cluster_mean) * input
The transpose, averaging and matrix products can all be achieved with TensorFlow functions.
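A minimal TF2 sketch of this cluster * t(cluster_mean) * input idea, assuming the input is a (batch, features) matrix with a single feature column:
import tensorflow as tf

inputs = tf.constant([[1.], [2.], [3.], [4.], [5.], [6.]])   # (b, n) = (6, 1)
groups = tf.constant([0, 1, 1, 0, 0, 1])
num_clusters = 2

cluster = tf.one_hot(groups, num_clusters)                    # (6, 2) categorical matrix
counts = tf.reduce_sum(cluster, axis=0)                       # members per cluster: [3, 3]
cluster_mean = cluster / counts                               # entries of 1/3, as in the example

# t(cluster_mean) @ inputs gives the (k, n) per-cluster means;
# multiplying by cluster broadcasts each mean back to its members.
output = tf.matmul(cluster, tf.matmul(cluster_mean, inputs, transpose_a=True))
print(output.numpy().ravel())  # approx. [3.333 3.667 3.667 3.333 3.333 3.667]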

How does a 1D multi-channel convolutional layer (Keras) train?

I am working with time series EEG data recorded from 10 individual locations on the body to classify future behavior in terms of increasing heart activity. I would like to better understand how my labeled data corresponds to the training inputs.
So far, several RNN configurations as well as countless combinations of vanilla dense networks have not gotten me great results, so I figured a 1D convnet is worth a try.
The things I'm having trouble understanding are:
1.) Feeding data into the model.
orig shape = (30000 timesteps, 10 channels)
array fed to layer = (300 slices, 100 timesteps, 10 channels)
Are the slices separated by 1 time step, giving me 300 slices of timesteps at either end of the original array, or are they separated end to end? If the second is true, how could I create an array of (30000 - 100) slices separated by one ts and is also compatible with the 1D CNN layer?
2) Matching labels with the training and testing data
My understanding is that when you feed in a sequence of train_x_shape = (30000, 10), there are 30000 labels with train_y_shape = (30000, 2) (2 classes) associated with the train_x data.
So, when (300 slices of) 100 timesteps of train_x data with shape = (300, 100, 10) are fed into the model, does the label value correspond to the entire 100 ts (one label per 100 ts, with this label being equal to the last time step's label), or are each 100 rows/vectors in the slice labeled- one for each ts?
Train input:
train_x = train_x.reshape(train_x.shape[0], 1, train_x.shape[1])
n_timesteps = 100
n_channels = 10
layer : model.add(Convolution1D(filters = n_channels * 2, padding = 'same', kernel_size = 3, input_shape = (n_timesteps, n_channels)))
final layer : model.add(Dense(2, activation = 'softmax'))
I use categorical_crossentropy for loss.
Answer 1
This will really depend on how you got those slices.
The answer depends entirely on what you are doing. So, what do you want?
If you have simply reshaped (array.reshape(...)) the original array from shape (30000,10) to shape (300,100,10), the model will see:
300 individual (and not connected) sequences
100 timesteps in each sequence
Sequence 1 goes from step 0 to 99;
Sequence 2 goes from step 100 to 199, and so on.
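A quick numpy check of what that plain reshape does:
import numpy as np

original = np.arange(30000 * 10).reshape(30000, 10)   # (timesteps, channels)
slices = original.reshape(300, 100, 10)               # 300 non-overlapping blocks

assert np.array_equal(slices[0], original[0:100])     # sequence 1 = timesteps 0..99
assert np.array_equal(slices[1], original[100:200])   # sequence 2 = timesteps 100..199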
Creating overlapping slices - Sliding window
If you want to create sequences shifted by only one timestep, make a loop for that.
import numpy as np
originalSequence = someArrayWithShape((30000,10))  # placeholder for your (timesteps, channels) data
newSlices = [] # empty list
start = 0
end = start + 100  # window length: 100 timesteps per slice
while end <= 30000:
    newSlices.append(originalSequence[start:end])
    start += 1
    end += 1
newSlices = np.asarray(newSlices)  # shape (29901, 100, 10)
Beware: if you do this in the input data, you will have to do a similar thing in your output data as well.
Answer 2
Again, that's totally up to you. What do you want to achieve?
Convolutional layers will keep the timesteps with these options:
If you use padding='same', the final length will be the same as the input
If you don't, the final length will be reduced depending on the kernel size you choose
Recurrent layers will keep the timesteps or not depending on:
Whether you use return_sequences=True - Output has timesteps
Or you use return_sequences=False - Output has no timesteps
If you want only one output for each sequence (not per timestep):
Recurrent models:
Use LSTM(...., return_sequences=True) until the last LSTM
The last LSTM will be LSTM(..., return_sequences=False)
Convolutional models:
At some point after the convolutions, choose one of these to add:
GlobalMaxPooling1D
GlobalAveragePooling1D
Flatten (but treat the number of channels later with a Dense(2))
Reshape((2,))
I think I'd go with GlobalMaxPooling1D if using convolutions, but recurrent models seem better for this. (Not a rule, though.)
You can choose to use intermediate MaxPooling1D layers to gradually reduce the length from 100 to 50, then to 25 and so on. This will probably lead to a better result.
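For illustration, a minimal sketch of such a convolutional model (the filter counts and pooling sizes are arbitrary choices, not tuned values):
import tensorflow as tf

n_timesteps, n_channels = 100, 10

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_timesteps, n_channels)),
    tf.keras.layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),       # 100 timesteps -> 50
    tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),       # 50 timesteps -> 25
    tf.keras.layers.GlobalMaxPooling1D(),            # one feature vector per sequence
    tf.keras.layers.Dense(2, activation="softmax"),  # one prediction per sequence
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()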
Remember to keep X and Y paired:
import numpy as np
train_x = someArrayWithShape((30000,10))
train_y = someArrayWithShape((30000,2))
newXSlices = [] # empty list
newYSlices = [] # empty list
start = 0
end = start + 100  # same window length as above
while end <= 30000:
    newXSlices.append(train_x[start:end])
    newYSlices.append(train_y[end-1:end])  # label of the last timestep in the window
    start += 1
    end += 1
newXSlices = np.asarray(newXSlices)
newYSlices = np.asarray(newYSlices)

TensorFlow cookbook skip-gram model with negative similarity

I am currently going through Google's TensorFlow cookbook:
This is a TensorFlow implementation of the skip-gram model.
On line 272, the author decides to negate the similarity matrix (-sim[j, :]). I am a little confused about why we need to negate the similarity matrix in a skip-gram model. Any ideas?
for j in range(len(valid_words)):
    valid_word = word_dictionary_rev[valid_examples[j]]
    top_k = 5  # number of nearest neighbors
    nearest = (-sim[j, :]).argsort()[1:top_k+1]
    log_str = "Nearest to {}:".format(valid_word)
    for k in range(top_k):
        close_word = word_dictionary_rev[nearest[k]]
        score = sim[j, nearest[k]]
        log_str = "%s %s," % (log_str, close_word)
    print(log_str)
Let's go through this example step by step:
First, there's a similarity tensor. It is defined as a matrix of pairwise cosine similarities between embedding vectors:
# Cosine similarity between words
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
similarity = tf.matmul(valid_embeddings, normalized_embeddings, transpose_b=True)
The matrix is computed for all validations words and all dictionary words, and contains numbers between [-1,1]. In this example, the vocab size is 10000 and the validation set consists of 5 words, so the shape of the similarity matrix is (5, 10000).
This matrix is evaluated to a numpy array sim:
sim = sess.run(similarity, feed_dict=feed_dict)
Consequently, sim.shape = (5, 10000) as well.
Next, this line:
nearest = (-sim[j, :]).argsort()[1:top_k+1]
... computes the indices of the top_k words nearest to the current validation word j. Take a look at the numpy.argsort method. The negation is just a numpy way of sorting in descending order. If there were no minus sign, the result would be the top_k furthest words in the dictionary, which would not indicate that word2vec has learned anything.
Also note that the range is [1:top_k+1], not [:top_k], because the 0-th word is the current validation word itself. There's no point in printing that the closest word to "love" is... "love".
The result of this line would be an array like [ 73 1684 850 1912 326], which corresponds to words sex, fine, youd, trying, execution.
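A tiny numpy illustration of the negation trick, with made-up similarity values for one validation word:
import numpy as np

sim_row = np.array([0.99, -0.20, 0.75, 0.10, 0.60])  # toy similarities for one validation word
top_k = 3

nearest = (-sim_row).argsort()[1:top_k + 1]
print(nearest)  # [2 4 3]: indices sorted by descending similarity,
                # skipping index 0, which is the validation word itself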

Tensorflow One-Hot

I'm new to TensorFlow (and neural networks) and I am working on a simple classification problem. I would like to ask 2 questions.
Say I have 120 labels, each taking a value from [1,2,3,4,5]. Is it really necessary for me to one-hot encode them before feeding them into my graph? If yes, should I encode before feeding into TensorFlow?
And if I do one-hot encode, the softmax prediction will give something like [0.001 0.202 0.321 ... 0.002 0.0003 0.0004]. Running arg_max will produce the right index. How would I get TensorFlow to return the correct label instead of a one-hot result?
Thank you.
So your input is 120 labels in {1, 2, 3, 4, 5} (each of which can be any digit from 1 to 5)?
# Your input, a 1D tensor of 120 elements from 1-5.
# Better shift your label space to 0-4 instead.
labels = labels - 1
# Now convert to a 2D tensor of 120 x 5 onehot labels.
onehot_labels = tf.one_hot(labels, 5)
# Now some computations.
....
# You end up with some onehot_output
# of the same shape as your labels (120x5).
# As you said, arg_max will give you the index of the result,
# which is a 1D index label of 120 elements.
output = tf.argmax(onehot_output, axis=1)
# You might want to shift back to {1,2,3,4,5}.
output = output + 1
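If your original labels are not just a shifted range, a small sketch of mapping the argmax index back to an arbitrary label set (the class_values tensor here is hypothetical):
import tensorflow as tf

# Hypothetical original label values, in the order you assigned the indices.
class_values = tf.constant([1, 2, 3, 4, 5])           # index 0 -> label 1, index 1 -> label 2, ...

onehot_output = tf.constant([[0.10, 0.20, 0.05, 0.60, 0.05],
                             [0.70, 0.10, 0.10, 0.05, 0.05]])

indices = tf.argmax(onehot_output, axis=1)            # [3, 0]
labels = tf.gather(class_values, indices)             # [4, 1]
print(labels.numpy())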

Why use tf.mul in the word2vec training process?

The Word2vec model uses noise-contrastive estimation (NCE) loss to train the model.
Why does it use tf.mul in the true sample logit calculation, but uses tf.matmul in the negative calculation?
See the source code.
One way you can think of the NCE loss calculation is as a batch of independent, binary logistic regression classification problems. In both cases we are performing the same calculation, even though it does not look like it at first glance.
To show that we are actually calculating the same thing, assume the following for the true-input part:
emb_dim = 3 # dimensions of your embedding vector
batch_size = 2 # number of examples in your trainings batch
vocab_size = 6 # number of total words in your text
# (so your word ids range from 0 - 5)
Furthermore, assume the following training example in your batch:
1 => 0 # given word with word_id=1, I expect word with word_id=0
1 => 2 # given word with word_id=1, I expect word with word_id=2
Then your embedding matrix example_emb has the dimensions [2,3] and your true weight matrix true_w also has the dimensions [2,3], and should look like this:
example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word
true_w = [ [w1,w2,w3], [w4,w5,w6] ] # [2,3] target word
The example_emb is a subset of the total word embeddings (emb) that you are trying to learn, and true_w is a subset of the weights (smb_w_t). Each row in example_emb represents an input vector, and each row in the weight matrix represents a target vector.
So [e1,e2,e3] is the word vector of the input word with word_id = 1 taken from emb, and [w1,w2,w3] is the word vector of the expected target word with word_id = 0.
Now, intuitively stated, the classification task you are trying to solve is: given that I see this input word and this target word, is this observation correct?
The two classification tasks then are (without the bias, and TensorFlow has the handy sigmoid_cross_entropy_with_logits function, which applies the sigmoid later):
logit( 1=>0 ) = dot( [e1,e2,e3], transpose( [w1,w2,w3] ) ) =>
logit( 1=>0 ) = e1*w1 + e2*w2 + e3*w3
and
logit( 1=>2 ) = dot( [e1,e2,e3], transpose( [w4,w5,w6] ) ) =>
logit( 1=>2 ) = e1*w4 + e2*w5 + e3*w6
We can calculate [[logit(1=>0)], [logit(1=>2)]] most easily by performing an element-wise multiplication with tf.mul() and then summing up each row.
The output of this calculation is a [batch_size, 1] matrix containing the logits for the correct words. We know the ground truth/label (y') for these examples, which is 1 because they are the correct examples.
true_logits = [
[logit(1=>0)], # first input word of the batch
[logit(1=>2)] # second input word of the batch
]
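As a toy sketch of that row-wise dot product (note that tf.mul is called tf.multiply in current TensorFlow; the numbers are made up):
import tensorflow as tf

example_emb = tf.constant([[0.1, 0.2, 0.3],
                           [0.1, 0.2, 0.3]])   # [batch_size, emb_dim]: input word id=1 twice
true_w = tf.constant([[1.0, 2.0, 3.0],
                      [4.0, 5.0, 6.0]])        # weights of target words 0 and 2

# Row-wise dot product: element-wise multiply, then sum each row.
true_logits = tf.reduce_sum(tf.multiply(example_emb, true_w), axis=1, keepdims=True)
print(true_logits.numpy())  # approx. [[1.4], [3.2]], shape [batch_size, 1]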
Now, for the second part of your question, why we use tf.matmul() in the negative sampling: let's assume that we draw 3 negative samples (num_sampled=3). So sampled_ids = [3,4,5].
Intuitively, this means that you add six more training examples to your batch, namely:
1 => 3 # given word_id=1, do i expect word_id=3? No, because these are negative examples.
1 => 4
1 => 5
1 => 3 # second input word is also word_id=1
1 => 4
1 => 5
So you look up your sampled_w, which turns out to be a [3, 3] matrix. Your parameters now look like this:
example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word
sampled_w = [ [w6,w7,w8], [w9,w10,w11], [w12,w13,w14] ] # [3,3] sampled target words
Similar to the true case, what we want is the logits for all negative training examples. E.g., for the first example:
logit(1 => 3) = dot( [e1,e2,e3], transpose( [w6,w7,w8] ) ) =>
logit(1 => 3) = e1*w6 + e2*w7 + e3*w8
Now in this case, we can use the matrix multiplication after we transpose the sampled_w matrix. This is achieved using the transpose_b=True parameter in the tf.matmul() call. The transposed weight matrix looks like this:
sampled_w_trans = [ [w6,w9,w12], [w7,w10,w13], [w8,w11,w14] ] # [3,3]
So now the tf.matmul() operation will return a [batch_size, 3] matrix, where each row contains the logits for one example of the input batch. Each element represents the logit of one classification task.
The whole result matrix of the negative sampling contains this:
sampled_logits = [
[logit(1=>3), logit(1=>4), logit(1=>5)], # first input word of the batch
[logit(1=>3), logit(1=>4), logit(1=>5)] # second input word of the batch
]
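And a matching toy sketch for the negative-sampling side, with sampled_w set to the identity so the result is easy to verify by hand:
import tensorflow as tf

example_emb = tf.constant([[0.1, 0.2, 0.3],
                           [0.1, 0.2, 0.3]])   # [batch_size, emb_dim] = [2, 3]
sampled_w = tf.constant([[1.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [0.0, 0.0, 1.0]])     # [num_sampled, emb_dim] = [3, 3]

# Every input row is compared against every sampled row in one operation.
sampled_logits = tf.matmul(example_emb, sampled_w, transpose_b=True)
print(sampled_logits.numpy())  # shape [2, 3]; entry (i, j) = dot(example_emb[i], sampled_w[j])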
The labels / ground truth for the sampled_logits are all zeros, because these are the negative examples.
In both cases we perform the same calculation, namely the calculation for a binary logistic regression classification (without the sigmoid, which is applied later).