Trying to understand CNNs for NLP tutorial using Tensorflow

I am following this tutorial in order to understand CNNs in NLP. There are a few things which I don't understand despite having the code in front of me. I hope somebody can clear a few things up here.
The first, rather minor, thing is the sequence_length parameter of the TextCNN object. In the example on GitHub this is just 56, which I think is the maximum length over all sentences in the training data. This means that self.input_x is a 56-dimensional vector which contains, for each word of a sentence, its index in the dictionary.
This list goes into tf.nn.embedding_lookup(W, self.input_x), which returns a matrix consisting of the word embeddings of the words given by self.input_x. According to this answer, this operation is similar to indexing with numpy:
import numpy as np

matrix = np.random.random([1024, 64])
ids = np.array([0, 5, 17, 33])
print(matrix[ids])
But the problem here is that self.input_x most of the time looks like [1 3 44 25 64 0 0 0 0 0 0 0 .. 0 0]. So am I correct if I assume that tf.nn.embedding_lookup ignores the value 0?
Another thing I don't get is how tf.nn.embedding_lookup is working here:
# Embedding layer
with tf.device('/cpu:0'), tf.name_scope("embedding"):
    W = tf.Variable(
        tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
        name="W")
    self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
    self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
I assume that self.embedded_chars is the matrix which is the actual input to the CNN, where each row represents the word embedding of one word. But how can tf.nn.embedding_lookup know about those indices given by self.input_x?
The last thing which I don't understand here is
W is our embedding matrix that we learn during training. We initialize it using a random uniform distribution. tf.nn.embedding_lookup creates the actual embedding operation. The result of the embedding operation is a 3-dimensional tensor of shape [None, sequence_length, embedding_size].
Does this mean that we are actually learning the word embeddings here? The tutorial states at the beginning:
We will not use pre-trained word2vec vectors for our word embeddings. Instead, we learn embeddings from scratch.
But I don't see a line of code where this is actually happening. The code of the embedding layer does not look as if anything is being trained or learned - so where is it happening?

Answer to question 1 (So am I correct if I assume that tf.nn.embedding_lookup ignores the value 0?):
The 0's in the input vector are indices of the 0th symbol in the vocabulary, which is the PAD symbol. They do not get ignored when the lookup is performed; the 0th row of the embedding matrix is returned for each of them.
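A quick toy check makes this visible (a made-up sketch, not from the tutorial; W and ids are illustrative values):

import numpy as np
import tensorflow as tf

# Toy embedding matrix: row i is the vector for vocabulary index i.
W = tf.constant(np.arange(12, dtype=np.float32).reshape(4, 3))
ids = tf.constant([1, 3, 0, 0])  # trailing 0s are PAD indices

lookup = tf.nn.embedding_lookup(W, ids)
with tf.Session() as sess:
    # Prints rows 1, 3, 0 and 0 of W -- the PAD index is NOT skipped,
    # it just returns the 0th embedding vector.
    print(sess.run(lookup))

If you want the padding positions to be truly ignored, that has to happen later in the model (e.g. via masking), not in the lookup itself.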
Answer to question 2 (But how can tf.nn.embedding_lookup know about those indices given by self.input_x?):
The size of the embedding matrix is [V x E], where V is the size of the vocabulary and E is the dimension of the embedding vectors. The 0th row of the matrix is the embedding vector for the 0th element of the vocabulary, the 1st row for the 1st element, and so on.
From the input vector input_x, we get the indices of the words in the vocabulary, and these are used to index the rows of the embedding matrix.
Answer to question 3 (Does this mean that we are actually learning the word embeddings here?):
Yes, we are actually learning the embedding matrix.
In the embedding layer, in the line

W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0), name="W")

W is the embedding matrix, and by default trainable=True for a TensorFlow variable, so W will also be a learned parameter. To use a pre-trained model, set trainable=False.
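For instance, a minimal sketch of plugging in pre-trained vectors (pretrained_embeddings and the file name are hypothetical; the array is assumed to have shape [vocab_size, embedding_size]):

import numpy as np
import tensorflow as tf

# Hypothetical pre-trained vectors, e.g. exported from word2vec.
pretrained_embeddings = np.load("embeddings.npy")  # [vocab_size, embedding_size]

W = tf.Variable(initial_value=pretrained_embeddings,
                dtype=tf.float32,
                trainable=False,  # freeze: excluded from gradient updates
                name="W")
input_x = tf.placeholder(tf.int32, [None, 56], name="input_x")
embedded_chars = tf.nn.embedding_lookup(W, input_x)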
For detailed explanation of the code you can follow blog: https://agarnitin86.github.io/blog/2016/12/23/text-classification-cnn

Related

Tensorflow Keras Embedding LSTM

I would like to use the Embedding layer before feeding my input data into the LSTM network I am attempting to create. Here is the relevant part of the code:
input_step1 = Input(shape=(SEQ_LENGTH_STEP1, NR_FEATURES_STEP1),
                    name='input_step1')
step1_lstm = CuDNNLSTM(50,
                       return_sequences=True,
                       return_state=True,
                       name="step1_lstm")
out_step1, state_h_step1, state_c_step1 = step1_lstm(input_step1)
I am a bit confused regarding how I am supposed to add an Embedding layer here.
Here is the description of the Embedding layer from the documentation:
keras.layers.Embedding(input_dim,
                       output_dim,
                       embeddings_initializer='uniform',
                       embeddings_regularizer=None,
                       activity_regularizer=None,
                       embeddings_constraint=None,
                       mask_zero=False,
                       input_length=None)
The confusing part is that my Input already has a sequence length and a number of features defined. Writing it here again:
input_step1 = Input(shape=(SEQ_LENGTH_STEP1, NR_FEATURES_STEP1),
                    name='input_step1')
When defining an Embedding layer, I am pretty confused about which parameters of the Embedding function correspond to "sequence length" and "number of features in each time step". Can anyone guide me on how to integrate an Embedding layer into the code above?
ADDENDUM:
If I try the following:
SEQ_LENGTH_STEP1 = 5
NR_FEATURES_STEP1 = 10
input_step1 = Input(shape=(SEQ_LENGTH_STEP1, NR_FEATURES_STEP1),
                    name='input_step1')
emb = Embedding(input_dim=NR_FEATURES_STEP1,
                output_dim=15,
                input_length=NR_FEATURES_STEP1)
input_step1_emb = emb(input_step1)
step1_lstm = CuDNNLSTM(50,
                       return_sequences=True,
                       return_state=True,
                       name="step1_lstm")
out_step1, state_h_step1, state_c_step1 = step1_lstm(input_step1_emb)
I get the following error:
ValueError: Input 0 of layer step1_lstm is incompatible with the layer:
expected ndim=3, found ndim=4. Full shape received: [None, 5, 10, 15]
I am obviously not doing the right thing. Is there a way to integrate an Embedding layer into the LSTM network I am trying to build?
From the Keras Embedding documentation:
Arguments
input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.
output_dim: int >= 0. Dimension of the dense embedding.
input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).
Therefore, from your description, I assume that:
input_dim corresponds to the vocabulary size of your dataset, i.e. the maximum integer index + 1. For example, the following dataset has 4 distinct words (Come, back, Peter, Paul); with the words indexed from 1 and 0 reserved for padding, input_dim would be 5:

data = ["Come back Peter,",
        "Come back Paul"]
output_dim is an arbitrary hyperparameter that indicates the dimension of your embedding space. In other words, if you set output_dim=x, each word in the sentence will be characterized with x features.
input_length should be set to SEQ_LENGTH_STEP1 (an integer indicating the length of each sentence), assuming that all the sentences have the same length.
The output shape of an embedding layer is (batch_size, input_length, output_dim).
Further notes regarding the addendum:
Assuming that your first layer is an Embedding layer, the expected shape of the input tensor input_step1 is (batch_size, input_length):
input_step1 = Input(shape=(SEQ_LENGTH_STEP1,),
                    name='input_step1')
Each integer in this tensor corresponds to a word.
As mentioned above, the embedding layer could be instantiated as follows:
emb = Embedding(input_dim=VOCAB_SIZE,
                output_dim=15,
                input_length=SEQ_LENGTH_STEP1)
where VOCAB_SIZE is the size of your vocabulary.
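Putting the pieces together, a minimal sketch of the corrected pipeline (VOCAB_SIZE is an assumed value; note that NR_FEATURES_STEP1 disappears, because the Embedding layer itself produces the feature dimension):

from keras.layers import Input, Embedding, CuDNNLSTM

SEQ_LENGTH_STEP1 = 5
VOCAB_SIZE = 1000  # assumed vocabulary size

# Integer word ids only -- no feature dimension in the Input.
input_step1 = Input(shape=(SEQ_LENGTH_STEP1,), name='input_step1')
input_step1_emb = Embedding(input_dim=VOCAB_SIZE,
                            output_dim=15,
                            input_length=SEQ_LENGTH_STEP1)(input_step1)
# input_step1_emb: (batch_size, SEQ_LENGTH_STEP1, 15) -- ndim=3, as the LSTM expects.
step1_lstm = CuDNNLSTM(50,
                       return_sequences=True,
                       return_state=True,
                       name="step1_lstm")
out_step1, state_h_step1, state_c_step1 = step1_lstm(input_step1_emb)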
This answer contains a reproducible example that you might find useful.

Custom loss in Keras with softmax to one-hot

I have a model that outputs a Softmax, and I would like to develop a custom loss function. The desired behaviour would be:
1) Softmax to one-hot (normally I do numpy.argmax(softmax_vector) and set that index to 1 in a null vector, but this is not allowed in a loss function).
2) Multiply the resulting one-hot vector by my embedding matrix to get an embedding vector (in my context: the word-vector that is associated to a given word, where words have been tokenized and assigned to indices, or classes for the Softmax output).
3) Compare this vector with the target (this could be a normal Keras loss function).
I know how to write a custom loss function in general, but not how to do this. I found this closely related question (unanswered), but my case is a bit different, since I would like to preserve my softmax output.
It is possible to mix TensorFlow and Keras in your custom loss function. Once you have access to all the TensorFlow functions, things become very easy. Here is an example of how this function could be implemented:
import tensorflow as tf

def custom_loss(target, softmax):
    max_indices = tf.argmax(softmax, -1)
    # Get the embedding vectors. In Tensorflow, this can be directly done
    # with tf.nn.embedding_lookup
    embedding_vectors = tf.nn.embedding_lookup(your_embedding_matrix, max_indices)
    # Do anything you want with a normal keras loss function
    loss = some_keras_loss_function(target, embedding_vectors)
    loss = tf.reduce_mean(loss)
    return loss
Fan Luo's answer points in the right direction, but ultimately will not work because it involves non-differentiable operations (argmax and the lookup). Note that such operations are acceptable for the ground-truth value: a loss function takes a ground-truth value and a predicted value, and non-differentiable operations are only fine on the ground-truth side, since no gradient needs to flow through it.
To be fair, that was what I was asking in the first place. It is not possible to do what I wanted, but we can get a similar and derivable behaviour:
1) Element-wise power of the softmax values. This makes smaller values much smaller. For example, with a power of 4, [0.5, 0.2, 0.7] becomes [0.0625, 0.0016, 0.2401]. Note that 0.2 is comparable to 0.7, but 0.0016 is negligible with respect to 0.2401. The higher my_power is, the more similar to a one-hot the final result will be.
soft_extreme = Lambda(lambda x: x ** my_power)(softmax)
2) Importantly, both softmax and one-hot vectors are normalized, but not our "soft_extreme". First, find the sum of the array:
norm = tf.reduce_sum(soft_extreme, axis=1, keepdims=True)  # keepdims so the division below broadcasts
3) Normalize soft_extreme:
almost_one_hot = Lambda(lambda x: x / norm)(soft_extreme)
Note: Setting my_power too high in 1) will result in NaNs. If you need a better softmax to one-hot conversion, then you may do steps 1 to 3 two or more times in a row.
4) Finally we want the vector from the dictionary. Lookup is forbidden, but we can take the average vector using matrix multiplication. Because our almost_one_hot is similar to a one-hot encoding, this average will be similar to the vector associated with the highest argument (the originally intended behaviour). The higher my_power is in (1), the truer this will be:
target_vectors = tf.tensordot(almost_one_hot, embedding_matrix, axes=[[1], [0]])
Note: This will not work directly using batches! In my case, I reshaped my "one hot" (from [batch, dictionary_length] to [batch, 1, dictionary_length]) using tf.reshape, then tiled my embedding_matrix batch times and finally used:
predicted_vectors = tf.matmul(reshaped_one_hot, tiled_embedding)
There may be more elegant solutions (or less memory-hungry, if tiling the embedding matrix is not an option), so feel free to explore more.
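Putting steps 1 to 4 together, a rough sketch (embedding_matrix, my_power and the squared-error comparison are assumptions, not fixed choices; note that tf.tensordot with axes=[[1], [0]] contracts only the dictionary axis, so in this sketch it is applied to the whole batch at once):

import numpy as np
import tensorflow as tf

my_power = 4  # assumed; higher -> closer to one-hot, but risks NaNs
# Hypothetical embedding matrix: [dictionary_length, embedding_size].
embedding_matrix = tf.constant(np.random.rand(1000, 64).astype(np.float32))

def soft_one_hot_loss(target_vectors, softmax):
    # 1) Sharpen the softmax so small entries become negligible.
    soft_extreme = softmax ** my_power
    # 2)-3) Renormalize; keepdims=True so the division broadcasts per row.
    norm = tf.reduce_sum(soft_extreme, axis=1, keepdims=True)
    almost_one_hot = soft_extreme / norm
    # 4) Soft lookup: a weighted average of the embedding rows, dominated
    # by the row of the highest softmax entry.
    predicted_vectors = tf.tensordot(almost_one_hot, embedding_matrix,
                                     axes=[[1], [0]])
    # Compare with the target embedding using any differentiable loss.
    return tf.reduce_mean(tf.square(target_vectors - predicted_vectors))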

Tensorflow num_classes parameter of nce_loss()

My understanding of noise contrastive estimation is that we sample some vectors from our word embeddings (the negative sample), and then calculate the log-likelihood of each. Then we want to maximize the difference between the probability of the target word and the log-likelihood of each of the negative sample words (So if I'm correct about this, we want to optimize the loss function so that it gets as close to 1 as possible).
My question is this:
What is the purpose of the num_classes parameter of the nce_loss function? My best guess is that the number of classes is passed in so that Tensorflow knows the size of the distribution from which the negative samples are drawn, but this might not make sense, since we could just infer the size of the distribution from the variable itself. Otherwise, I can't think of a reason why we would need to know the total possible number of classes, especially if the language model is only outputting k + 1 predictions (negative sample size + 1 for the target word).
Your guess is correct. The num_classes argument is used to sample negative labels from the log-uniform (Zipfian) distribution.
Here's the relevant part of the source code:
# Sample the negative labels.
#   sampled shape: [num_sampled] tensor
#   true_expected_count shape = [batch_size, 1] tensor
#   sampled_expected_count shape = [num_sampled] tensor
if sampled_values is None:
    sampled_values = candidate_sampling_ops.log_uniform_candidate_sampler(
        true_classes=labels,
        num_true=num_true,
        num_sampled=num_sampled,
        unique=True,
        range_max=num_classes)
The range_max=num_classes argument basically defines the shape of this distribution and also the range of the sampled values - [0, range_max). Note that this range can't be accurately inferred from the labels, because a particular mini-batch can have only small word ids, which would skew the distribution significantly.
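For reference, a hedged sketch of a typical nce_loss call (all names and sizes here are illustrative, not taken from the question):

import tensorflow as tf

vocab_size = 10000   # num_classes: the total number of possible labels
embedding_size = 128
num_sampled = 64     # k: negative samples drawn per batch

train_labels = tf.placeholder(tf.int64, [None, 1])          # target word ids
embed = tf.placeholder(tf.float32, [None, embedding_size])  # input embeddings

nce_weights = tf.Variable(
    tf.truncated_normal([vocab_size, embedding_size],
                        stddev=1.0 / embedding_size ** 0.5))
nce_biases = tf.Variable(tf.zeros([vocab_size]))

loss = tf.reduce_mean(
    tf.nn.nce_loss(weights=nce_weights,
                   biases=nce_biases,
                   labels=train_labels,
                   inputs=embed,
                   num_sampled=num_sampled,
                   num_classes=vocab_size))  # sets range_max for the sampler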

LSTM Followed by Mean Pooling (TensorFlow)

I am aware that there is a similar topic at LSTM Followed by Mean Pooling, but that is about Keras and I work in pure TensorFlow.
I have an LSTM network where the recurrence is handled by:
outputs, final_state = tf.nn.dynamic_rnn(cell,
                                         embed,
                                         sequence_length=seq_lengths,
                                         initial_state=initial_state)
where I pass the correct sequence lengths for each sample (padding by zeros). In any case, outputs contains irrelevant values at the padded timesteps, since some samples produce longer outputs than others, based on sequence lengths.
Right now I'm extracting the last relevant output by means of the following method:
def extract_axis_1(data, ind):
    """
    Get specified elements along the first axis of tensor.
    :param data: TensorFlow tensor that will be subsetted.
    :param ind: Indices to take (one for each element along axis 0 of data).
    :return: Subsetted tensor.
    """
    batch_range = tf.range(tf.shape(data)[0])
    indices = tf.stack([batch_range, ind], axis=1)
    res = tf.gather_nd(data, indices)
    return res
where I pass sequence_length - 1 as indices. In reference to the linked topic, I would like to select all relevant outputs followed by average pooling, instead of just the last one.
Now, I tried passing nested lists as indices to extract_axis_1, but tf.stack does not accept this.
Any solution directions for this?
You can exploit the weight parameter of the tf.contrib.seq2seq.sequence_loss function.
From the documentation:
weights: A Tensor of shape [batch_size, sequence_length] and dtype float. weights constitutes the weighting of each prediction in the sequence. When using weights as masking, set all valid timesteps to 1 and all padded timesteps to 0, e.g. a mask returned by tf.sequence_mask.
You need to compute a binary mask that distinguishes between your valid outputs and invalid ones. Then you can just provide this mask to the weights parameter of the loss function (probably, you will want to use a loss like this one); the function will not consider the outputs with a 0 weight in the computation of the loss.
If you can't or don't need to use a sequence loss, you can do exactly the same thing manually: compute a binary mask, multiply your outputs by it, and provide the result as input to your fully connected layer.
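For the mean-pooling variant specifically, here is a sketch of the manual approach (reusing outputs and seq_lengths from your dynamic_rnn call):

import tensorflow as tf

# outputs:     [batch_size, max_time, hidden_size] from tf.nn.dynamic_rnn
# seq_lengths: [batch_size] int tensor of true sequence lengths
mask = tf.sequence_mask(seq_lengths,
                        maxlen=tf.shape(outputs)[1],
                        dtype=tf.float32)      # [batch_size, max_time]
mask = tf.expand_dims(mask, -1)                # [batch_size, max_time, 1]

# Zero out the padded timesteps, then average over the valid ones only.
summed = tf.reduce_sum(outputs * mask, axis=1)                  # [batch, hidden]
lengths = tf.cast(tf.expand_dims(seq_lengths, -1), tf.float32)  # [batch, 1]
mean_pooled = summed / lengths                                  # [batch, hidden]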

How to use --word_embeddings option of syntaxnet

I am studying SyntaxNet in TensorFlow.
However, I'm confused about how to use the --word_embeddings option. Could you show me an example? Thanks a lot.
Not too sure what syntaxnet is, but here is how I use word embeddings in my sequence-to-sequence model.
Here I declare a variable which will store the embedding. It is a matrix with NWORDS rows (one per word in the vocabulary) and WORD_VEC_SIZE columns (the size of each word vector). It is randomly initialized and is a trainable parameter of the model.
word_embedding = tf.get_variable('word_embedding', shape=(NWORDS, WORD_VEC_SIZE), initializer=tf.truncated_normal_initializer(0.0, 1.0))
The input to my model is a list of integers which represent the indices of the words in the embedding. It is of dimension BATCH_SIZE x TIME_STEPS.
source_input = tf.placeholder(tf.int32, (BATCH_SIZE, TIME_STEPS), 'source_input')
Using the embedding and the input I can perform a lookup to convert the integer representations of the words into vectors.
y = tf.nn.embedding_lookup([word_embedding], source_input)
Now y has the shape (BATCH_SIZE, TIME_STEPS, WORD_VEC_SIZE) and can be fed into the model for further processing.
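To make the shapes concrete, a small end-to-end run (the sizes are illustrative):

import numpy as np
import tensorflow as tf

NWORDS, WORD_VEC_SIZE = 1000, 64
BATCH_SIZE, TIME_STEPS = 2, 5

word_embedding = tf.get_variable(
    'word_embedding', shape=(NWORDS, WORD_VEC_SIZE),
    initializer=tf.truncated_normal_initializer(0.0, 1.0))
source_input = tf.placeholder(tf.int32, (BATCH_SIZE, TIME_STEPS),
                              'source_input')
y = tf.nn.embedding_lookup([word_embedding], source_input)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    ids = np.random.randint(0, NWORDS, size=(BATCH_SIZE, TIME_STEPS))
    print(sess.run(y, feed_dict={source_input: ids}).shape)  # (2, 5, 64)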