Why use tf.mul in the word2vec training process? - tensorflow

The Word2vec model uses noise-contrastive estimation (NCE) loss to train the model.
Why does it use tf.mul in the true sample logit calculation, but uses tf.matmul in the negative calculation?
See the source code.

One way you can think of the NCE loss calculation is as a batch of independent, binary logistic regression classifications problems. In both cases we are performing the same calculations, even though it does not look like it at the first place.
To show you that we are actually calculating the same thing, assume the follwoing for the true input part:
emb_dim = 3 # dimensions of your embedding vector
batch_size = 2 # number of examples in your trainings batch
vocab_size = 6 # number of total words in your text
# (so your word ids range from 0 - 5)
Furthermore, assume the following training example in your batch:
1 => 0 # given word with word_id=1, I expect word with word_id=0
1 => 2 # given word with word_id=1, I expect word with word_id=2
Then your embedding matrix example_emb has the dimensions [2,3] and your true weight matrix true_w also has the dimensions [2,3], and should look like this:
example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word
true_w = [ [w1,w2,w3], [w4,w5,w5] ] # [2,3] target word
The example_emb is a subset of the total word embeddings (emb) that you are tryin to learn, and true_w is a a subset of the weights (smb_w_t). Each row in example_emb represents and input vector , and each row in the weight represent a target vector.
So [e1,e2,e3] is the word vector of the input word with word_id = 1 taken from emb, and [w1,w2,w3] is the word vector of the expected target word with word_id = 0.
Now intuitively stated, the classification task you are trying to solve is: given i see input word and target word is this observation correct?
The two classification tasks then are (without the bias, and tensorflow has this handy 'sigmoid_cross_entropy_with_logits' function, which applies the sigmoid later):
logit( 1=>0 ) = dot( [e1,e2,e3], transpose( [w1,w2,w3] ) =>
logit( 1=>0 ) = e1*w1 + e2*w2 + e3*w3
and
logit( 1=>2 ) = dot( [e1,e2,e3], transpose( [w4,w5,w6] ) =>
logit( 1=>2 ) = e1*w4 + e2*w5 + e3*w6
We can calculate [[logit(1=>0)],[logit(1=>2)]] the easiest if we perform an element-wise multiplication tf.mul() and then summing up each row.
The output of this calculations will be a [batch_size, 1] matrix containing the logits for the correct words. We do know the ground truth/label (y') for this examples, which is 1 because these are the correct examples.
true_logits = [
[logit(1=>0)], # first input word of the batch
[logit(1=>2)] # second input word of the batch
]
Now for the second part of your question why you we use tf.matmul() in the negative sampling, let's assume that we draw 3 negative samples (num_sampled=3). So sampled_ids = [3,4,5].
Intuitively, this means that you add six more training examples to your batch, namely:
1 => 3 # given word_id=1, do i expect word_id=3? No, because these are negative examples.
1 => 4
1 => 5
1 => 3 # second input word is also word_id=1
1 => 4
1 => 5
So you look up your sampled_w, which turns out to be a [3, 3] matrix. Your parameters now look like this:
example_emb = [ [e1,e2,e3], [e1,e2,e3] ] # [2,3] input word
sampled_w = [ [w6,w7,w8], [w9,w10,w11], [w12,w13,w14] ] # [3,3] sampled target words
Similar to the true case, what we want is the logits for all negative training examples. E.g., for the first example:
logit(1 => 3) = dot( [e1,e2,e3], transpose( [w6,w7,w8] ) =>
logit(1 => 3) = e1*w6 + e2*w7 + e3*w8
Now in this case, we can use the matrix multiplication after we transpose the sampled_w matrix. This is achieved using the transpose_b=True parameter in the tf.matmul() call. The transposed weight matrix looks like this:
sampled_w_trans = [ [w6,w9,w12], [w7,w10,w13], [w8,w11,w14] ] # [3,3]
So now the tf.matmul() operation will return a [batch_size, 3] matrix, where each row are the logits for one example of the input batch. Each element represents a logit for a classification task.
The whole result matrix of the negative sampling contains this:
sampled_logits = [
[logit(1=>3), logit(1,4), logit(1,5)], # first input word of the batch
[logit(1=>3), logit(1,4), logit(1,5)] # second input word of the batch
]
The labels / ground truth for the sampled_logits are all zeros, because these are the negative examples.
In both cases we perform the same calculation, that is the calculation for a binary classification logistic regression (without the sigmoid, which is applied later).

Related

In Tensorflow Classification, how are the labels ordered when using "predict"?

I'm using the MNIST handwritten numerals dataset to train a CNN.
After training the model, i use predict like this:
predictions = cnn_model.predict(test_images)
predictions[0]
and i get output as:
array([2.1273775e-06, 2.9292005e-05, 1.2424786e-06, 7.6307842e-05,
7.4305902e-08, 7.2301691e-07, 2.5368356e-08, 9.9952960e-01,
1.2401938e-06, 1.2787555e-06], dtype=float32)
In the output, there are 10 probabilities, one for each of numeral from 0 to 9. But how do i know which probability refers to which numeral ?
In this particular case, the probabilities are arranged sequentially for numerals 0 to 9. But why is that ? I didn't define that anywhere.
I tried going over documentation and example implementations found elsewhere on the internet, but no one seems to have addressed this particular behaviour.
Edit:
For context, I've defined my train/test data by:
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()
train_images = (np.expand_dims(train_images, axis=-1)/255.).astype(np.float32)
train_labels = (train_labels).astype(np.int64)
test_images = (np.expand_dims(test_images, axis=-1)/255.).astype(np.float32)
test_labels = (test_labels).astype(np.int64)
And my model consists of a a few convulution and pooling layers, then a Flatten layer, then a Dense layer with 128 neurons and an output Dense layer with 10 neurons.
After that I simply fit my model and use predict like this:
model.fit(train_images, train_labels, batch_size=BATCH_SIZE, epochs=EPOCHS)
predictions = cnn_model.predict(test_images)
I don't see where I've instructed my code to output first neuron as digit 0, second neuron as digit 1 etc
And if i wanted to change the the sequence in which the resulting digits are output, where do i do that ?
This is really confusing me a lot.
Models work with numbers. Your classes/labels should be represented as numbers (e.g., 0, 1, ...., n). The prediction is always indexed to show probabilities for class 0 at index 0, class 1 at index 1. Now in the MNIST case, you are lucky the labels are integers 0 to 9. Suppose you had to classify images into three classes: cars, bicycles, trucks. You must represent those classes as numerical values. You can arrange it as you wish. If you choose this: {cars: 0, bicycles: 1, trucks: 2}, in other words, if you label your cars as 0, bicycles as 1, and trucks as 2, then your prediction would show probability for cars at index 0, bicycles at index 1 and trucks at index 2.
You could have also decided to choose this setting: {cars: 2, bicycles: 0, trucks: 1}, then your prediction would show probability for cars at index 2, bicycles at index 0 and trucks at index 1, and so on.
The point is, you have to show your classes (as many as you have) as integers indexed from 0 to n where n is the num_classes-1. Your probabilities at prediction would be indexed as such. You don't have to tell the model.
Hope this is now clear.
It depends on how you prepare your labels during training. With MNIST classification, usually, there are two different ways:
One-hot Labels: There are 10 labels in the MNIST data, therefore for each example (image), you create a label array (vector) of length 10 where all the elements are zero except the index corresponding to the digit that your input image is showing. For example, if your input image is showing the digit 8, your label contains zeros everywhere except at the 8th index (e.g. [0,0,0,0,0,0,0,0,1,0]). If your image is showing the digit 2, your label would be something like [0,0,1,0,0,0,0,0,0,0] and so on.
Sparse Labels: you just label each image directly by what digit it is showing, for example if your image is showing the digit 8, your label is a single number with value 8.
In both cases, you could choose the labels however you want, in the MNIST classification it is just intuitive to use the labels 0-9 to show digits 0-9.
Thus, in the prediction, the probability at index 0 is for digit 0, index 1 for digit 1, and so on.
You could choose to prepare your labels differently. For example you could decide to show your labels as follows:
label for digit 0: 9
label for digit 1: 8
label for digit 2: 7
label for digit 3: 6
label for digit 4: 5
label for digit 5: 4
label for digit 6: 3
label for digit 7: 2
label for digit 8: 1
label for digit 9: 0
You could train your model the same way but in this case, the probabilities in the prediction would be inverted. Probability at index 0 would be for digit 9, index 1 for digit 8, and so on.
In short, you have to define your labels using integer indices, but it is up to you to decide and remember what index you chose to refer to which label/class.

How does a 1D multi-channel convolutional layer (Keras) train?

I am working with time series EEG data recorded from 10 individual locations on the body to classify future behavior in terms of increasing heart activity. I would like to better understand how my labeled data corresponds to the training inputs.
So far, several RNN configurations as well as countless combinations of vanilla dense networks have not gotten me great results and I'd figure a 1D convnet is worth a try.
The things I'm having trouble understanding are:
1.) Feeding data into the model.
orig shape = (30000 timesteps, 10 channels)
array fed to layer = (300 slices, 100 timesteps, 10 channels)
Are the slices separated by 1 time step, giving me 300 slices of timesteps at either end of the original array, or are they separated end to end? If the second is true, how could I create an array of (30000 - 100) slices separated by one ts and is also compatible with the 1D CNN layer?
2) Matching labels with the training and testing data
My understanding is that when you feed in a sequence of train_x_shape = (30000, 10), there are 30000 labels with train_y_shape = (30000, 2) (2 classes) associated with the train_x data.
So, when (300 slices of) 100 timesteps of train_x data with shape = (300, 100, 10) are fed into the model, does the label value correspond to the entire 100 ts (one label per 100 ts, with this label being equal to the last time step's label), or are each 100 rows/vectors in the slice labeled- one for each ts?
Train input:
train_x = train_x.reshape(train_x.shape[0], 1, train_x.shape[1])
n_timesteps = 100
n_channels = 10
layer : model.add(Convolution1D(filters = n_channels * 2, padding = 'same', kernel_size = 3, input_shape = (n_timesteps, n_channels)))
final layer : model.add(Dense(2, activation = 'softmax'))
I use categorical_crossentropy for loss.
Answer 1
This will really depend on "how did you get those slices"?
The answer is totally dependent on what "you're doing". So, what do you want?
If you have simply reshaped (array.reshape(...)) the original array from shape (30000,10) to shape (300,100,10), the model will see:
300 individual (and not connected) sequences
100 timesteps in each sequence
Sequence 1 goes from step 0 to 299;
Sequence 2 goes from step 300 to 599 and so on.
Creating overlapping slices - Sliding window
If you want to create sequences shifted by only one timestep, make a loop for that.
import numpy as np
originalSequence = someArrayWithShape((30000,10))
newSlices = [] #empty list
start = 0
end = start + 300
while end <= 30000:
newSlices.append(originalSequence[start:end])
start+=1
end+=1
newSlices = np.asarray(newSlices)
Beware: if you do this in the input data, you will have to do a similar thing in your output data as well.
Answer2
Again, that's totally up to you. What do you want to achieve?
Convolutional layers will keep the timesteps with these options:
If you use padding='same', the final length will be the same as the input
If you don't, the final length will be reduced depending on the kernel size you choose
Recurrent layers will keep the timesteps or not depending on:
Whether you use return_sequences=True - Output has timesteps
Or you use return_sequences=False - Output has no timesteps
If you want only one output for each sequence (not per timestep):
Recurrent models:
Use LSTM(...., return_sequences=True) until the last LSTM
The last LSTM will be LSTM(..., return_sequences=False)
Convolutional models:
At some point after the convolutions, choose one of these to add:
GlobalMaxPooling1D
GlobalAveragePooling1D
Flatten (but treat the number of channels later with a Dense(2)
Reshape((2,))
I think I'd go with GlobalMaxPooling2D if using convoltions, but recurrent models seem better for this. (Not a rule, though).
You can choose to use intermediate MaxPooling1D layers to gradually reduce the length from 100 to 50, then to 25 and so on. This will probably reach a better output.
Remember to keep X and Y paired:
import numpy as np
train_x = someArrayWithShape((30000,10))
train_y = someArrayWithShape((30000,2))
newXSlices = [] #empty list
newYSlices = [] #empty list
start = 0
end = start + 300
while end <= 30000:
newXSlices.append(train_x[start:end])
newYSlices.append(train_y[end-1:end])
start+=1
end+=1
newXSlices = np.asarray(newXSlices)
newYSlices = np.asarray(newYSlices)

TensorFlow cookbook skip-gram model with negative similarity

I am currently going through Google's TensorFlow cookbook:
This is a TensorFlow implementation of the skip-gram model.
On line 272, the author decides to negatively multiply the similarity matrix (-sim[j, :]). I am a little bit confused why do we need to negatively multiply the similarity matrix in a skip-gram model. Any ideas?
for j in range(len(valid_words)):
valid_word = word_dictionary_rev[valid_examples[j]]
top_k = 5 # number of nearest neighbors
**nearest = (-sim[j, :]).argsort()[1:top_k+1]**
log_str = "Nearest to {}:".format(valid_word)
for k in range(top_k):
close_word = word_dictionary_rev[nearest[k]]
score = sim[j,nearest[k]]
log_str = "%s %s," % (log_str, close_word)
print(log_str)
Let's go through this example step by step:
First, there's a similarity tensor. It is defined as a matrix of pairwise cosine similarities between embedding vectors:
# Cosine similarity between words
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings,valid_dataset)
similarity= tf.matmul(valid_embeddings,normalized_embeddings,transpose_b=True)
The matrix is computed for all validations words and all dictionary words, and contains numbers between [-1,1]. In this example, the vocab size is 10000 and the validation set consists of 5 words, so the shape of the similarity matrix is (5, 10000).
This matrix is evaluated to a numpy array sim:
sim = sess.run(similarity, feed_dict=feed_dict)
Consequently, sim.shape = (5, 10000) as well.
Next, this line:
nearest = (-sim[j, :]).argsort()[1:top_k+1]
... computes the top_k nearest word indices to the current word j. Take a look at numpy.argsort method. The negation is just a numpy way of sorting in descending order. If there were no minus, the result would be the top_k furthest words from the dictionary, which won't indicate word2vec has learned anything.
Also note that the range is [1:top_k+1], not [:top_k], because the 0-th word is the current validation word itself. There's no point in printing that the closest word to "love" is... "love".
The result of this line would be an array like [ 73 1684 850 1912 326], which corresponds to words sex, fine, youd, trying, execution.

feeding data into an LSTM - Tensorflow RNN PTB tutorial

I am trying to feed data into an LSTM. I am reviewing the code from Tensorflow's RNN tutorial here.
The segment of code of interest is from the reader.py file of the tutorial, in particular the ptb_producer function, which ouputs the X and Y that is used by the LSTM.
raw_data is a list of indexes of words from the ptb.train.txt file. The length of raw_data is 929,589. batch_size is 20, num_steps is 35. Both batch_size and num_steps are based on the LARGEconfig which feeds the data to an LSTM.
I have walked through the code (and added comments for what I've printed) and I understand it up till tf.strided_slice. From the reshape, we have a matrix of indexes of shape (20, 46497).
Strided slice in the first iteration of i, tries to take data from [0, i * num_steps + 1] which is [0,1*35+1] till [batch_size, (i + 1) * num_steps + 1] which is [20, (1+1)*35+1].
Two questions:
where in the matrix is [0,1*35+1] and [20, (1+1)*35+1]? What spots in the (20, 46497) the begin and end in strided_slice is trying to access?
It seems like EVERY iteration of i, will take in data from 0? the very start of the data matrix (20, 46497)?
I guess what I am not understanding is how you would feed data into an LSTM, given the batch size and the num_steps (sequence length).
I have read colahs blog on LSTM and Karpathy's blog on RNN which helps greatly in the understanding of LSTMs, but don't seem to address the exact mechanics of getting data into an LSTM. (maybe I missed something?)
def ptb_producer(raw_data, batch_size, num_steps, name=None):
"""Iterate on the raw PTB data.
This chunks up raw_data into batches of examples and returns Tensors that
are drawn from these batches.
Args:
raw_data: one of the raw data outputs from ptb_raw_data.
batch_size: int, the batch size.
num_steps: int, the number of unrolls.
name: the name of this operation (optional).
Returns:
A pair of Tensors, each shaped [batch_size, num_steps]. The second
element of the tuple is the same data time-shifted to the right by one.
Raises:
tf.errors.InvalidArgumentError: if batch_size or num_steps are too high.
"""
with tf.name_scope(name, "PTBProducer", [raw_data, batch_size, num_steps]):
raw_data = tf.convert_to_tensor(raw_data, name="raw_data", dtype=tf.int32)
data_len = tf.size(raw_data) # prints 929,589
batch_len = data_len // batch_size # prints 46,497
data = tf.reshape(raw_data[0 : batch_size * batch_len],
[batch_size, batch_len])
#this truncates raw data to a multiple of batch_size=20,
#then reshapes to [20, 46497]. prints (20,?)
epoch_size = (batch_len - 1) // num_steps #prints 1327 (number of epoches)
assertion = tf.assert_positive(
epoch_size,
message="epoch_size == 0, decrease batch_size or num_steps")
with tf.control_dependencies([assertion]):
epoch_size = tf.identity(epoch_size, name="epoch_size")
i = tf.train.range_input_producer(epoch_size, shuffle=False).dequeue()
#for each of the 1327 epoches
x = tf.strided_slice(data, [0, i * num_steps], [batch_size, (i + 1) * num_steps]) # prints (?, ?)
x.set_shape([batch_size, num_steps]) #prints (20,35)
y = tf.strided_slice(data, [0, i * num_steps + 1], [batch_size, (i + 1) * num_steps + 1])
y.set_shape([batch_size, num_steps])
return x, y
where in the matrix is [0,1*35+1] and [20, (1+1)*35+1]? What spots in the (20, 46497) the begin and end in strided_slice is trying to access?
So the data matrix is arranged as 20 x 46497. Think of it as 20 sentences with 46497 characters each. The strided slice in your specific example will get character 36 to 70 (71 is not included, typical range semantics) from each line. This is equivalent to
strided = [data[i][36:71] for i in range(20)]
It seems like EVERY iteration of i, will take in data from 0? the very start of the data matrix (20, 46497)?
This is correct. For each iteration we want batch size elements in first dimension i.e. we want a matrix of size batch size x num_steps. Hence the first dimension always goes from 0-19, essentially looping over each sentence. And second dimension increments in num_steps, getting segments of words from each sentence. If we look at the index it extracts from data at each step for x, we will get something like:
- data[0:20,0:35]
- data[0:20,35:70]
- data[0:20,70:105]
- ...
For y we add 1 because we want next character to be predicted.
So your train, label tuples will be like:
- (data[0:20,0:35], data[0:20, 1:36])
- (data[0:20,35:70], data[0:20,36:71])
- (data[0:20,70:105], data[0:20,71:106])
- ...

Tensorflow One-Hot

I'm new to Tensorflow (And neural networks) and i am working on a simple classification problem. Would like to ask 2 questions.
Say i have 120 labels of a permutation of [1,2,3,4,5]. Is it really necessary for me to One-Hot encode before feeding it into my graph? If yes, should i encode before feeding into tensorflow?
And if i do One-Hot encode, the softmax prediction will give [0.001 0.202 0.321……0.002 0.0003 0.0004]. Running arg_max will produce the right index. How would i get tensorflow to return me the correct label instead of a one-hot result?
Thank you.
So your input are 120 labels in {1, 2, 3, 4, 5} (each of which can be either digit from 1 to 5)?
# Your input, a 1D tensor of 120 elements from 1-5.
# Better shift your label space to 0-4 instead.
labels = labels - 1
# Now convert to a 2D tensor of 120 x 5 onehot labels.
onehot_labels = tf.one_hot(labels, 5)
# Now some computations.
....
# You end up with some onehot_output
# of the same shape as your labels (120x5).
# As you said, arg_max will give you the index of the result,
# which is a 1D index label of 120 elements.
output = tf.argmax(onehot_output, axis=1).
# You might want to shift back to {1,2,3,4,5}.
output = output + 1