tflearn - fasttext word vector error in embedding layer - tflearn

I am using fasttext word embeddings to create word vectors for sentences for a sentiment analysis binary classifier. Fasttext vectors have both negative and positive numbers.
My embedding layer is
net = embedding(net, input_dim=20000, output_dim=num_hidden)
However I get the error,
InvalidArgumentError (see above for traceback): indices[99,244] = -1 is not in [0, 20000)
[[Node: Embedding/embedding_lookup = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:#Embedding/W"], validate_indices=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Embedding/W/read, Embedding/Cast)]]
What does this error mean exactly?
I tried to make all the negative values in the word embedding 0, which although it does work, it loses a lot of information
I am not understanding how the input_dim works in the embedding layer in tflearn?
Also I don't understand how to use Fasttext word embeddings with tflearn?
Any help would be much appreciated

Related

Use pre-trained word2vec in lstm language model?

I used tensorflow to train LSTM language model, code is from here.
According to article here, it seems that if I use pre-trained word2vec, it works better.
Using word embeddings such as word2vec and GloVe is a popular method to improve the accuracy of your model. Instead of using one-hot vectors to represent our words, the low-dimensional vectors learned using word2vec or GloVe carry semantic meaning – similar words have similar vectors. Using these vectors is a form of pre-training.
So, I want to use word2vec to redo the training, but I am a little bit confused about how to do this.
The embedding code goes here:
with tf.device("/cpu:0"):
embedding = tf.get_variable(
"embedding", [vocab_size, size], dtype=data_type())
inputs = tf.nn.embedding_lookup(embedding, input_.input_data)
How can I change this code to use pre-trained word2vec?

Understanding word embeddings when converting from bag of words for tweets

The exercise from intro to deep learning this assignment. It uses bag-of-words to represent a tweet. How to use word embeddings to achieve the same?I played around the word2vec tool, I came across following questions:
(i) How to obtain pre-trained embeddings to represent these tweets? (To use word2vec directly instead of training these tweets for embedding vectors.)How to use word2vec to use such pre-trained model?
(ii) How to train a tensorflow 2 hidden layer architecture once we obtain embeddings from word2vec (i.e. dimensions will change due to embedding_size) or (continuation of previous bow model what will be additional changes due to embeddings)
Previously it was:
input dimension : (None, vocab_size)
Layer-1: (input_data * weights_1) + biases_1
Layer-2: (layer_1 * weights_2) + biases_2
output layer: (layer_2 * n_classes) + n_classes
output dimension: (None, n_classes)
(iii) Is it necessary to obtain embeddings for given data of tweets by training word2vec from scratch? How to train data of around 14k tweets using word2vec (not gensim or GloVe)? Will word2vec preprocess # as stopping word?

Trying to understand CNNs for NLP tutorial using Tensorflow

I am following this tutorial in order to understand CNNs in NLP. There are a few things which I don't understand despite having the code in front of me. I hope somebody can clear a few things up here.
The first rather minor thing is the sequence_lengthparameter of the TextCNN object. In the example on github this is just 56 which I think is the max-length of all sentences in the training data. This means that self.input_x is a 56-dimensional vector which will contain just the indices from the dictionary of a sentence for each word.
This list goes into tf.nn.embedding_lookup(W, self.intput_x) which will return a matrix consisting of the word embeddings of those words given by self.input_x. According to this answer this operation is similar to using indexing with numpy:
matrix = np.random.random([1024, 64])
ids = np.array([0, 5, 17, 33])
print matrix[ids]
But the problem here is that self.input_x most of the time looks like [1 3 44 25 64 0 0 0 0 0 0 0 .. 0 0]. So am I correct if I assume that tf.nn.embedding_lookup ignores the value 0?
Another thing I don't get is how tf.nn.embedding_lookup is working here:
# Embedding layer
with tf.device('/cpu:0'), tf.name_scope("embedding"):
W = tf.Variable(
tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),
name="W")
self.embedded_chars = tf.nn.embedding_lookup(W, self.input_x)
self.embedded_chars_expanded = tf.expand_dims(self.embedded_chars, -1)
I assume, taht self.embedded_chars is the matrix which is the actual input to the CNN where each row represents the word embedding of one word. But how can tf.nn.embedding_lookup know about those indices given by self.input_x?
The last thing which I don't understand here is
W is our embedding matrix that we learn during training. We initialize it using a random uniform distribution. tf.nn.embedding_lookup creates the actual embedding operation. The result of the embedding operation is a 3-dimensional tensor of shape [None, sequence_length, embedding_size].
Does this mean that we are actually learning the word embeddings here? The tutorial states at the beginning:
We will not used pre-trained word2vec vectors for our word embeddings. Instead, we learn embeddings from scratch.
But I don't see a line of code where this is actually happening. The code of the embedding layer does not look like as if there is anything being trained or learned - so where is it happening?
Answer to ques 1 (So am I correct if I assume that tf.nn.embedding_lookup ignores the value 0?) :
The 0's in the input vector is the index to 0th symbol in the vocabulary, which is the PAD symbol. I don't think it gets ignored when the lookup is performed. 0th row of the embedding matrix will be returned.
Answer to ques 2 (But how can tf.nn.embedding_lookup know about those indices given by self.input_x?) :
Size of the embedding matrix is [V * E] where is the size of vocabulary and E is dimension of embedding vector. 0th row of matrix is embedding vector for 0th element of vocabulary, 1st row of matrix is embedding vector for 1st element of vocabulary.
From the input vector x, we get the indices of words in vocabulary, which are used for indexing the embedding matrix.
Answer to ques 3 (Does this mean that we are actually learning the word embeddings here?).
Yes, we are actually learning the embedding matrix.
In the embedding layer, in line W = tf.Variable( tf.random_uniform([vocab_size, embedding_size], -1.0, 1.0),name="W") W is the embedding matrix and by default, in tensorflow trainable=TRUE for variable. So, W will also be a learned parameter. To use pre- trained model, set trainable = False.
For detailed explanation of the code you can follow blog: https://agarnitin86.github.io/blog/2016/12/23/text-classification-cnn

Regarding setting up the target tensor shape for sparse_categorical_crossentropy

I am trying to experiment with a multi-layer encoder-decoder type of network. The screenshot of the last several layers of network architecture is as follows. This is how I setup model compiling and training process.
optimizer = SGD(lr=0.001, momentum=0.9, decay=0.0005, nesterov=False)
autoencoder.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer, metrics=['accuracy'])
model.fit(imgs_train, imgs_mask_train, batch_size=batch_size, nb_epoch=nb_epoch, verbose=1,callbacks=[model_checkpoint])
imgs_train and imgs_mask_train are of shape (2000, 1, 128, 128). imgs_train represent the raw image and imgs_mask_train represents the mask image. I am trying to solve a semantic segmentation problem. However, running the program generates the following error message, (I only keep the main related part).
tensorflow.python.pywrap_tensorflow.StatusNotOK: Invalid argument: logits first dimension must match labels size. logits shape=[4096,128] labels shape=[524288]
[[Node: SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/cpu:0"](Reshape_364, Cast_158)]]
It seems to me that the loss function of sparse_categorical_crossentropy causes the problem for the current (imgs_train, imgs_mask_train) shape setting. The Keras API does not include the detail about how to setup the target tensor. Any suggestions are highly appreciated!
I am currently trying to figure the same problem and as far as I can tell it takes a sparse representation of the target category. That means integers as the target label instead of the one-hot encoded binary class matrix.
Concerning your problem, do you have categories in your masking or do you just have information about the outline of an object? With outline information it becomes a pixel wise binary loss instead of a categorical one. If you have categories, the output of your decoder should have dimensionality (None, number_of_classes, 128, 128). On that you should be able to use a sparse target mask but I haven't tried this myself...
Hope that helps

How does word2vec give one hot word vector from the embedding vector?

I understand how word2vec works.
I want to use word2vec(skip-gram) as input for RNN. Input is embedding word vector. Output is also embedding word vector generated by RNN.
Here’s question! How can I convert the output vector to one hot word vector? I need inverse matrix of embeddings but I don’t have!
The output of an RNN is not an embedding. We convert the output from the last layer in an RNN cell into a vector of vocabulary_size by multiplying with an appropriate matrix.
Take a look at the PTB Language Model example to get a better idea. Specifically look at lines 133-136:
softmax_w = tf.get_variable("softmax_w", [size, vocab_size], dtype=data_type())
softmax_b = tf.get_variable("softmax_b", [vocab_size], dtype=data_type())
logits = tf.matmul(output, softmax_w) + softmax_b
The above operation will give you logits. This logits is a probability distribution over your vocabulary. numpy.random.choice might help you to use these logits to make a prediction.