How does word2vec give a one-hot word vector from the embedding vector? (tensorflow)

I understand how word2vec works.
I want to use word2vec (skip-gram) embeddings as the input for an RNN. The input is an embedding word vector, and the output is also an embedding word vector generated by the RNN.
Here's my question: how can I convert the output vector back into a one-hot word vector? I would need the inverse of the embedding matrix, but I don't have it!

The output of an RNN is not an embedding. We convert the output from the last layer in an RNN cell into a vector of size vocab_size by multiplying it with an appropriate matrix.
Take a look at the PTB Language Model example to get a better idea. Specifically, look at lines 133-136:
softmax_w = tf.get_variable("softmax_w", [size, vocab_size], dtype=data_type())
softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
logits = tf.matmul(output, softmax_w) + softmax_b
The above operation gives you logits: unnormalized scores over your vocabulary. Passing them through tf.nn.softmax turns them into a probability distribution, and numpy.random.choice can then use those probabilities to sample a prediction.
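To make that concrete, here is a minimal sketch; the sizes are illustrative, and rnn_output stands in for your real RNN output rather than anything from the PTB example:

import numpy as np
import tensorflow as tf

# Illustrative sizes; `rnn_output` stands in for the RNN's last-layer output.
hidden_size, vocab_size, batch_size = 128, 10000, 32
rnn_output = tf.zeros([batch_size, hidden_size])

# Project the RNN output back onto the vocabulary.
softmax_w = tf.get_variable("softmax_w", [hidden_size, vocab_size])
softmax_b = tf.get_variable("softmax_b", [vocab_size])
logits = tf.matmul(rnn_output, softmax_w) + softmax_b  # [batch_size, vocab_size]
probs = tf.nn.softmax(logits)                          # probability distribution

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    probs_val = sess.run(probs)
    # Sample a word id from the distribution instead of always taking the argmax.
    word_id = np.random.choice(vocab_size, p=probs_val[0])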

Related

Simple example of CuDnnGRU based RNN implementation in Tensorflow

I am using the following code for a standard GRU implementation:
def BiRNN_deep_dynamic_FAST_FULL_autolength(x, batch_size, dropout, hidden_dim):
    seq_len = length_rnn(x)
    with tf.variable_scope('forward'):
        lstm_cell_fwd = tf.contrib.rnn.GRUCell(
            hidden_dim,
            kernel_initializer=tf.contrib.layers.xavier_initializer(),
            bias_initializer=tf.contrib.layers.xavier_initializer())
        lstm_cell_fwd = tf.contrib.rnn.DropoutWrapper(lstm_cell_fwd, output_keep_prob=dropout)
    with tf.variable_scope('backward'):
        lstm_cell_back = tf.contrib.rnn.GRUCell(
            hidden_dim,
            kernel_initializer=tf.contrib.layers.xavier_initializer(),
            bias_initializer=tf.contrib.layers.xavier_initializer())
        lstm_cell_back = tf.contrib.rnn.DropoutWrapper(lstm_cell_back, output_keep_prob=dropout)
    outputs, _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw=lstm_cell_fwd, cell_bw=lstm_cell_back, inputs=x,
        sequence_length=seq_len, dtype=tf.float32, time_major=False)
    outputs_fwd, outputs_bck = outputs
    # fwd_matrix keeps the last valid ([-1]) vector of each sequence, e.g. shape (99, 64)
    fwd_matrix = tf.gather_nd(outputs_fwd, tf.stack([tf.range(batch_size), seq_len - 1], axis=1))
    outputs_fwd = tf.transpose(outputs_fwd, [1, 0, 2])
    outputs_bck = tf.transpose(outputs_bck, [1, 0, 2])
    return outputs_fwd, outputs_bck, fwd_matrix
Can anyone provide a simple example of how to use tf.contrib.cudnn_rnn.CudnnGRU in a similar fashion? Just swapping out the cells doesn't work.
The first issue is that there is no dropout wrapper for the CuDnnGRU cell, which is fine. Second, it doesn't seem to work with tf.nn.bidirectional_dynamic_rnn. Any help appreciated.
CudnnGRU is not an RNNCell instance. It's more akin to dynamic_rnn.
The tensor manipulations below are equivalent, where input_tensor is a time-major tensor, i.e. of shape [max_sequence_length, batch_size, embedding_size]. CudnnGRU expects the input tensor to be time-major (as opposed to the more standard batch-major format, i.e. of shape [batch_size, max_sequence_length, embedding_size]), and it is good practice to use time-major tensors with RNN ops anyway, since they are somewhat faster.
CudnnGRU:
rnn = tf.contrib.cudnn_rnn.CudnnGRU(
    num_rnn_layers, hidden_size, direction='bidirectional')
rnn_output = rnn(input_tensor)
CudnnCompatibleGRUCell:
rnn_output = input_tensor
# `inputs` below is assumed to be the [max_sequence_length, batch_size]
# tensor of word ids, where id 0 marks padding.
sequence_length = tf.reduce_sum(
    tf.sign(inputs),
    reduction_indices=0)  # Use 1 if `input_tensor` is batch-major.
for layer in range(num_rnn_layers):
    with tf.variable_scope('rnn_layer_%d' % layer):  # avoid variable-name clashes
        fw_cell = tf.contrib.cudnn_rnn.CudnnCompatibleGRUCell(hidden_size)
        bw_cell = tf.contrib.cudnn_rnn.CudnnCompatibleGRUCell(hidden_size)
        outputs, _ = tf.nn.bidirectional_dynamic_rnn(
            fw_cell, bw_cell, rnn_output, sequence_length=sequence_length,
            dtype=tf.float32, time_major=True)  # Set `time_major` accordingly.
        # Concatenate forward and backward outputs to feed the next layer.
        rnn_output = tf.concat(outputs, axis=-1)
Note the following:
If you were using LSTMs, you need not use CudnnCompatibleLSTMCell; you can use the standard LSTMCell. But with GRUs, the Cudnn implementation has inherently different math operations, and in particular, more weights (see the documentation).
Unlike dynamic_rnn, CudnnGRU doesn't allow you to specify sequence lengths. It is still over an order of magnitude faster, but you will have to be careful about how you extract your outputs (e.g. if you're interested in the final hidden state of each padded, variable-length sequence, you will need each sequence's length).
rnn_output is probably a tuple with lots of (distinct) stuff in both cases. Refer to the documentation, or just print it out, to inspect which parts of the output you need.
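As a concrete illustration of that extraction step, here is a minimal sketch (shapes and names are illustrative, not from the original answer) of gathering each sequence's final valid output from a time-major output tensor when the true lengths are known:

import tensorflow as tf

# Illustrative shapes: `outputs` is time-major [max_time, batch, features];
# `seq_len` holds each sequence's true (unpadded) length.
max_time, batch_size, features = 20, 4, 8
outputs = tf.zeros([max_time, batch_size, features])  # stand-in for the real RNN output
seq_len = tf.constant([20, 7, 13, 1])                 # hypothetical lengths

# Gather outputs[seq_len[i] - 1, i, :] for every sequence i in the batch.
indices = tf.stack([seq_len - 1, tf.range(batch_size)], axis=1)
final_outputs = tf.gather_nd(outputs, indices)        # [batch_size, features]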

How to use Keras LSTM with word embeddings to predict word id's

I have problems understanding how to get the correct output when using word embeddings in Keras. My settings are as follows:
My input consists of batches of shape (batch_size, sequence_length). Each row in a batch represents one sentence, and the words are represented by word ids. The sentences are padded with zeros so that all have the same length.
For example, a (3,6) input batch might look like: np.array([[1,3,5,6,0,0],[1,7,4,5,8,0],[1,3,8,2,7,2]])
My targets are given by the input batch shifted by one step, so for each input word I want to predict the next word: np.array([[3,5,6,0,0,0],[7,4,5,8,0,0],[3,8,2,7,2,0]])
I feed such an input batch into the Keras Embedding layer. My embedding size is 100, so the output will be a 3D tensor of shape (batch_size, sequence_length, embedding_size); in the small example that is (3,6,100).
This 3D batch is fed into an LSTM layer.
The output of the LSTM layer is fed into a Dense layer with sequence_length output neurons and a softmax activation function, so the shape of the output is like the shape of the input, namely (batch_size, sequence_length).
As the loss I am using the categorical crossentropy between the input and target batch.
My question:
The output batch will contain probabilities because of the softmax activation function, but what I want is for the network to predict integers, so that the output matches the target batch of integers.
How can I "decode" the output so that I know which word the network is predicting? Or do I have to construct the network differently?
Edit 1:
I have changed the output and target batches from 2D arrays to 3D tensors. So instead of using a target batch of shape (batch_size, sequence_length) with integer ids, I am now using a one-hot encoded 3D target tensor of shape (batch_size, sequence_length, vocab_size). To get the same format as the output of the network, I changed the network to output sequences (by setting return_sequences=True in the LSTM layer). Further, the number of output neurons was changed to vocab_size, so the output layer now produces a batch of shape (batch_size, sequence_length, vocab_size).
With this 3D encoding I can get the predicted word id using tf.argmax(outputs, 2). This approach seems to work for the moment, but I would still be interested in whether it's possible to keep the 2D targets/outputs.
One solution, perhaps not the best, is to output one-hot vectors the size of your dictionary (including dummy words).
Your last layer must output (sequence_length, dictionary_size+1).
Your Dense layer will already output the sequence_length if you don't add any Flatten() or Reshape() before it, so it should be a Dense(dictionary_size+1).
You can use the function keras.utils.to_categorical() to transform an integer into a one-hot vector and keras.backend.argmax() to transform a one-hot vector into an integer.
Unfortunately, this is sort of unpacking your embedding. It would be nice if it were possible to have a reverse embedding or something like that.
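Putting that answer together, here is a minimal Keras sketch under the question's setup (sizes are illustrative and the data is random, so this is a shape check rather than a real model):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.utils import to_categorical

vocab_size, embedding_size, sequence_length = 1000, 100, 6

model = Sequential()
model.add(Embedding(vocab_size, embedding_size, input_length=sequence_length))
model.add(LSTM(128, return_sequences=True))         # one output per time step
model.add(Dense(vocab_size, activation='softmax'))  # distribution over the vocabulary
model.compile(loss='categorical_crossentropy', optimizer='adam')

x = np.random.randint(0, vocab_size, size=(3, sequence_length))
y_ids = np.random.randint(0, vocab_size, size=(3, sequence_length))
y = to_categorical(y_ids.ravel(), num_classes=vocab_size).reshape(
    3, sequence_length, vocab_size)                 # 3D one-hot targets
model.fit(x, y, epochs=1, verbose=0)

# Decode the softmax outputs back to word ids.
predicted_ids = model.predict(x).argmax(axis=-1)    # shape (3, sequence_length)

If you would rather keep the 2D integer targets, sparse_categorical_crossentropy accepts integer class ids directly (depending on the Keras version, the targets may need a trailing singleton dimension).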

Explaining Variational Autoencoder gaussian parameterization

In the original Auto-Encoding Variational Bayes paper, the authors describe the "reparameterization trick" in section 2.4. The trick is to break your latent state z up into a learnable mean and sigma (learned by the encoder) and to add Gaussian noise. You then sample a data point from z (basically you generate an encoded image) and let the decoder map the encoded data point back to the original image.
I have a hard time getting over how strange this is. Could someone explain the latent variable model a bit more, specifically:
Why are we assuming the latent state is Gaussian?
How is it possible that a Gaussian can generate an image?
And how does backprop force the encoder to learn a Gaussian function as opposed to an unknown non-linear function?
Here is an example implementation of the latent model from here in TensorFlow.
# ...neural net code maps input to hidden layers z_mean and z_log_sigma
self.z_mean, self.z_log_sigma_sq = \
    self._recognition_network(network_weights["weights_recog"],
                              network_weights["biases_recog"])

# Draw one sample z from the Gaussian distribution
n_z = self.network_architecture["n_z"]
eps = tf.random_normal((self.batch_size, n_z), 0, 1, dtype=tf.float32)

# z = mu + sigma * epsilon  (tf.mul in pre-1.0 TensorFlow)
self.z = tf.add(self.z_mean,
                tf.multiply(tf.sqrt(tf.exp(self.z_log_sigma_sq)), eps))

# ...neural net code maps z to output
They are not assuming that the activations of the encoder follow a Gaussian distribution; they are enforcing that, of the possible solutions, a Gaussian-resembling one is chosen.
The image is generated by decoding an activation/feature; the activations are distributed resembling a Gaussian.
They minimize the KL divergence between the distribution of the activations and a Gaussian one.
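For reference, that KL term has a closed form (Appendix B of the Kingma & Welling paper). A minimal sketch, with zero tensors standing in for the encoder outputs:

import tensorflow as tf

n_z, batch_size = 20, 32
z_mean = tf.zeros([batch_size, n_z])          # stand-in for the encoder's mean output
z_log_sigma_sq = tf.zeros([batch_size, n_z])  # stand-in for the encoder's log-variance

# Closed-form KL divergence between N(mu, sigma^2) and the unit Gaussian prior,
# summed over the latent dimensions.
kl_loss = -0.5 * tf.reduce_sum(
    1 + z_log_sigma_sq - tf.square(z_mean) - tf.exp(z_log_sigma_sq), axis=1)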

Trying to understand CNNs for NLP tutorial using Tensorflow

I am following this tutorial in order to understand CNNs in NLP. There are a few things which I don't understand despite having the code in front of me. I hope somebody can clear a few things up here.
The first, rather minor, thing is the sequence_length parameter of the TextCNN object. In the example on GitHub this is just 56, which I think is the maximum length of all sentences in the training data. This means that self.input_x is a 56-dimensional vector containing just the dictionary index of each word of a sentence.
This list goes into tf.nn.embedding_lookup(W, self.input_x), which returns a matrix consisting of the word embeddings of the words given by self.input_x. According to this answer, this operation is similar to indexing with numpy:
matrix = np.random.random([1024, 64])
ids = np.array([0, 5, 17, 33])
print(matrix[ids])
But the problem here is that self.input_x most of the time looks like [1 3 44 25 64 0 0 0 0 0 0 0 .. 0 0]. So am I correct if I assume that tf.nn.embedding_lookup ignores the value 0?
Another thing I don't get is how tf.nn.embedding_lookup is working here:
# Embedding layer
with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
I assume that self.embedded_chars is the matrix that is the actual input to the CNN, where each row represents the word embedding of one word. But how does tf.nn.embedding_lookup know about the indices given by self.input_x?
The last thing which I don't understand here is
W is our embedding matrix that we learn during training. We initialize it using a random uniform distribution. tf.nn.embedding_lookup creates the actual embedding operation. The result of the embedding operation is a 3-dimensional tensor of shape [None, sequence_length, embedding_size].
Does this mean that we are actually learning the word embeddings here? The tutorial states at the beginning:
We will not use pre-trained word2vec vectors for our word embeddings. Instead, we learn embeddings from scratch.
But I don't see a line of code where this is actually happening. The code of the embedding layer does not look as if anything is being trained or learned there - so where is it happening?
Answer to question 1 (So am I correct if I assume that tf.nn.embedding_lookup ignores the value 0?):
The 0s in the input vector are the index of the 0th symbol in the vocabulary, which is the PAD symbol. I don't think they get ignored when the lookup is performed; the 0th row of the embedding matrix will be returned.
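A quick toy example (not from the tutorial) to convince yourself of this:

import numpy as np
import tensorflow as tf

# Index 0 (the PAD symbol) is looked up like any other row.
embedding = tf.constant(np.arange(12, dtype=np.float32).reshape(4, 3))
ids = tf.constant([1, 3, 0, 0])  # trailing zeros are PAD indices
vectors = tf.nn.embedding_lookup(embedding, ids)

with tf.Session() as sess:
    print(sess.run(vectors))  # rows 1, 3, 0, 0 of the embedding matrix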
Answer to question 2 (But how can tf.nn.embedding_lookup know about those indices given by self.input_x?):
The size of the embedding matrix is [V x E], where V is the size of the vocabulary and E is the dimension of the embedding vectors. The 0th row of the matrix is the embedding vector for the 0th element of the vocabulary, the 1st row is the embedding vector for the 1st element, and so on.
From the input vector x we get the indices of the words in the vocabulary, and those are used to index the embedding matrix.
Answer to question 3 (Does this mean that we are actually learning the word embeddings here?):
Yes, we are actually learning the embedding matrix.
In the embedding layer, in the line W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name="W"), W is the embedding matrix, and by default trainable=True for a tensorflow variable, so W will also be a learned parameter. To use a pre-trained model, set trainable=False.
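As a sketch of that last point, with a hypothetical random array standing in for real pre-trained word2vec vectors:

import numpy as np
import tensorflow as tf

vocab_size, embedding_size = 10000, 128
pretrained = np.random.rand(vocab_size, embedding_size).astype(np.float32)  # stand-in

# Freeze the embedding so it is not updated during training.
W = tf.Variable(initial_value=pretrained, trainable=False, name="W")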
For detailed explanation of the code you can follow blog: https://agarnitin86.github.io/blog/2016/12/23/text-classification-cnn

What is the softmax_w and softmax_b in this document?

I'm new to TensorFlow and need to train a language model, but I ran into some difficulties while reading the documentation, as shown below.
lstm = rnn_cell.BasicLSTMCell(lstm_size)
# Initial state of the LSTM memory.
state = tf.zeros([batch_size, lstm.state_size])
loss = 0.0
for current_batch_of_words in words_in_dataset:
    # The value of state is updated after processing each batch of words.
    output, state = lstm(current_batch_of_words, state)

    # The LSTM output can be used to make next word predictions
    logits = tf.matmul(output, softmax_w) + softmax_b
    probabilities = tf.nn.softmax(logits)
    loss += loss_function(probabilities, target_words)
I don't understand why this line is needed:
logits = tf.matmul(output, softmax_w) + softmax_b
I thought that once the output is computed and the target_words are known, we could work out the loss directly; the pseudo-code seems to add an additional layer. Also, what are softmax_w and softmax_b, which are not mentioned anywhere before? I feel I may have missed something important in asking such a simple question.
Please point me in the right direction; any suggestions are highly appreciated. Thanks a lot.
All that code is doing is adding an extra linear transformation before computing the softmax. softmax_w should be a tf.Variable containing a matrix of weights. softmax_b should be a tf.Variable containing a bias vector.
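For illustration, softmax_w and softmax_b might be defined like this (sizes assumed; this is a sketch, not the tutorial's actual code):

import tensorflow as tf

lstm_size, vocab_size = 200, 10000
# Weights and bias of the final linear layer mapping the LSTM output
# [batch_size, lstm_size] to unnormalized scores over the vocabulary.
softmax_w = tf.Variable(tf.truncated_normal([lstm_size, vocab_size], stddev=0.1))
softmax_b = tf.Variable(tf.zeros([vocab_size]))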
Take a look at the softmax example in this tutorial for more details:
https://www.tensorflow.org/versions/r0.10/tutorials/mnist/beginners/index.html#softmax-regressions