How to deal with a large (>2GB) embedding lookup table in TensorFlow?

When I use pre-trained word vectors to do classification with an LSTM, I wondered how to deal with an embedding lookup table larger than 2GB in TensorFlow.
To do this, I tried to create the embedding lookup table like the code below,
data = tf.nn.embedding_lookup(vector_array, input_data)
but got this value error:
ValueError: Cannot create a tensor proto whose content is larger than 2GB
The variable vector_array in the code is a numpy array, and it contains about 14 million unique tokens with a 100-dimensional word vector for each word.
Thank you for your help.

You need to copy it to a tf.Variable. There's a great answer to this question on StackOverflow:
Using a pre-trained word embedding (word2vec or Glove) in TensorFlow
This is how I did it:
embedding_weights = tf.Variable(tf.constant(0.0, shape=[embedding_vocab_size, EMBEDDING_DIM]),
                                trainable=False, name="embedding_weights")
embedding_placeholder = tf.placeholder(tf.float32, [embedding_vocab_size, EMBEDDING_DIM])
embedding_init = embedding_weights.assign(embedding_placeholder)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
sess.run(embedding_init, feed_dict={embedding_placeholder: embedding_matrix})
You can then use the embedding_weights variable to perform the lookup (remember to store the word-to-index mapping).
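For example, the lookup itself could then look roughly like this, continuing the snippet above (the input placeholder and the example ids are assumptions, not part of the original answer):
input_ids = tf.placeholder(tf.int32, [None, None], name="input_ids")  # ids from your word-to-index mapping
embedded = tf.nn.embedding_lookup(embedding_weights, input_ids)       # shape: (batch, time, EMBEDDING_DIM)
vectors = sess.run(embedded, feed_dict={input_ids: [[0, 1, 2]]})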
Update: Using the variable is not required, but it allows you to save it for future use so that you don't have to redo the whole thing again (it takes a while on my laptop when loading very large embeddings). If that's not important, you can simply use placeholders, as Niklas Schnelle suggested.

For me the accepted answer doesn't seem to work. While there is no error, the results were terrible (compared to a smaller embedding via direct initialization), and I suspect the embeddings were just the constant 0 that the tf.Variable() is initialized with.
Using just a placeholder without an extra variable:
self.Wembed = tf.placeholder(
    tf.float32, self.embeddings.shape,
    name='Wembed')
and then feeding the embedding on every session.run() of the graph does seem to work, however.
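A rough, self-contained sketch of that placeholder-only approach (toy sizes and names, just for illustration):
import numpy as np
import tensorflow as tf

# in practice this is the large pre-trained matrix; toy sizes here for illustration
embeddings = np.random.rand(1000, 100).astype(np.float32)

input_ids = tf.placeholder(tf.int32, [None, None], name='input_ids')
Wembed = tf.placeholder(tf.float32, embeddings.shape, name='Wembed')
lookup = tf.nn.embedding_lookup(Wembed, input_ids)
# ... build the rest of the model on top of `lookup` ...

with tf.Session() as sess:
    # the matrix has to be fed on every session.run(), since a placeholder holds no state
    vecs = sess.run(lookup, feed_dict={Wembed: embeddings, input_ids: [[0, 1, 2]]})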

Using feed_dict with large embeddings was too slow for me with TF 1.8, probably due to the issue mentioned by Niklas Schnelle.
I ended up with the following code, which feeds the matrix only once, when the variable is initialized:
embeddings_ph = tf.placeholder(tf.float32, wordVectors.shape, name='wordEmbeddings_ph')
embeddings_var = tf.Variable(embeddings_ph, trainable=False, name='wordEmbeddings')
embeddings = tf.nn.embedding_lookup(embeddings_var, input_data)
.....
sess.run(tf.global_variables_initializer(), feed_dict={embeddings_ph:wordVectors})

Related

Extend word embedding layer for incremental word2vec training with Tensorflow

I'd like to train word vectors/embeddings incrementally. With each incremental run I want to extend the vocabulary of the model and add new rows to the embeddings matrix.
The embeddings matrix is a partitioned variable, so ideally I want to avoid using assign since it's not implemented for partitioned variables.
One way I tried looks like this:
# Set prev_vocab_size and new_vocab_to_add
# according to the corpus/text of the current run
prev_embeddings = tf.get_variable(
    'prev_embeddings',
    shape=[prev_vocab_size, FLAGS.embedding_size],
    dtype=tf.float32,
    initializer=tf.random_uniform_initializer(-1.0, 1.0))

new_embeddings = tf.get_variable(
    'new_embeddings',
    shape=[new_vocab_to_add, FLAGS.embedding_size],
    dtype=tf.float32,
    initializer=tf.random_uniform_initializer(-1.0, 1.0))

combined_embeddings = tf.concat([prev_embeddings, new_embeddings], 0)

embeddings = tf.Variable(
    combined_embeddings,
    expected_shape=[prev_vocab_size + new_vocab_to_add, FLAGS.embedding_size],
    dtype=tf.float32,
    name='embeddings')
Now, this works well for the first run. But if I do a second run, I get an "Assign requires shapes of both tensors to match" error, because the restored original prev_embeddings variable (from the first run) doesn't match the new shape (based on the extended vocab) that I declare in the second run.
So I modified the tf.train.Saver to save the new_embeddings as the prev_embeddings like this:
saver = tf.train.Saver({"prev_embeddings": new_embeddings})
Now, in the second run, prev_embeddings has the shape that new_embeddings had in the previous run, and I don't get an error for this.
However, the new_embeddings in the second run now has a different shape than in the first run, and therefore when restoring the variables from the first run, I get another "Assign requires shapes of both tensors to match" error.
What's the best way to extend/expand the embeddings variable incrementally with new vectors for new words in the vocabulary while keeping the old and trained vectors?
Any help would be much appreciated.

How to initialize a keras tensor employed in an API model

I am trying to implement a memory-augmented neural network, in which the memory and the read/write/usage weight vectors are updated according to a combination of their previous values. These weights are different from the classic weight matrices between layers that are automatically updated with the fit() function! My problem is the following: how can I correctly initialize these weights as Keras tensors and use them in the model? I explain it better with the following simplified example.
My API model is something like:
input = Input(shape=(5,6))
controller = LSTM(20, activation='tanh',stateful=False, return_sequences=True)(input)
write_key = Dense(4,activation='tanh')(controller)
read_key = Dense(4,activation='tanh')(controller)
w_w = Add()([w_u, w_r]) #<---- UPDATE OF WRITE WEIGHTS
to_write = Dot()([w_w, write_key])
M = Add()([M,to_write])
cos_sim = Dot()([M,read_key])
w_r = Lambda(lambda x: softmax(x,axis=1))(cos_sim) #<---- UPDATE OF READ WEIGHTS
w_u = Add()([w_u,w_r,w_w]) #<---- UPDATE OF USAGE WEIGHTS
retrieved_memory = Dot()([w_r,M])
controller_output = concatenate([controller,retrieved_memory])
final_output = Dense(6, activation='sigmoid')(controller_output)
You can see that, in order to compute w_w^t, I have to have first defined w_r^{t-1} and w_u^{t-1}. So, at the beginning I have to provide a valid initialization for these vectors. What is the best way to do it? The initializations I would like to have are:
M = K.variable(numpy.zeros((10,4))) # MEMORY
w_r = K.variable(numpy.zeros((1,10))) # READ WEIGHTS
w_u = K.variable(numpy.zeros((1,10))) # USAGE WEIGHTS
But, analogously to what is said in #2486 (entron), these commands do not return a Keras tensor with all the needed metadata, and so this returns the following error:
AttributeError: 'NoneType' object has no attribute 'inbound_nodes'
I also thought of using the old M, w_r and w_u as further inputs at each iteration and analogously getting the same variables as outputs to complete the loop. But this means that I have to use the fit() function to train the model online with just the target as final output (Model 1), and employ the predict() function on the model with all the secondary outputs (Model 2) to get the variables to use at the next iteration. I also have to pass the weight matrices from Model 1 to Model 2 using get_weights() and set_weights(). As you can see, it becomes a little messy and too slow.
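For reference, a minimal sketch of that two-model idea (all shapes and layers below are invented stand-ins for the real memory read/write steps):
from keras.layers import Input, Dense, Flatten, concatenate
from keras.models import Model

x_in   = Input(shape=(6,),    name='x')         # regular input
M_in   = Input(shape=(10, 4), name='M_prev')    # previous memory
w_r_in = Input(shape=(10,),   name='w_r_prev')  # previous read weights

# stand-in computation; the real model would do the memory read/write here
h        = concatenate([x_in, Flatten()(M_in), w_r_in])
target   = Dense(6, activation='sigmoid', name='target')(h)
w_r_next = Dense(10, activation='softmax', name='w_r_next')(h)

model_train = Model([x_in, M_in, w_r_in], target)              # Model 1: used with fit()
model_state = Model([x_in, M_in, w_r_in], [target, w_r_next])  # Model 2: used with predict()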
Do you have any suggestions for this problem?
P.S. Please, do not focus too much on the API model above because it is a simplified (almost meaningless) version of the complete one where I skipped several key steps.

How do you create a dynamic_rnn with dynamic "zero_state" (Fails with Inference)

I have been working with the "dynamic_rnn" to create a model.
The model is based upon an 80-time-period signal, and I want to zero the "initial_state" before each run, so I have set up the following code fragment to accomplish this:
state = cell_L1.zero_state(self.BatchSize,Xinputs.dtype)
outputs, outState = rnn.dynamic_rnn(cell_L1,Xinputs,initial_state=state, dtype=tf.float32)
This works great for the training process. The problem is that once I go to inference, where my BatchSize = 1, I get an error back because the rnn "state" doesn't match the new Xinputs shape. So what I figured is that I need to make "self.BatchSize" depend on the input batch size rather than hard-coding it. I tried many different approaches, and none of them have worked. I would rather not pass a bunch of zeros through the feed_dict, as it is a constant based upon the batch size.
Here are some of my attempts. They all generally fail since the input size is unknown upon building the graph:
state = cell_L1.zero_state(Xinputs.get_shape()[0],Xinputs.dtype)
.....
state = tf.zeros([Xinputs.get_shape()[0], self.state_size], Xinputs.dtype, name="RnnInitializer")
Another approach, based on the thought that the initializer might not get called until run time, also failed at graph build:
init = lambda shape, dtype: np.zeros(*shape)
state = tf.get_variable("state", shape=[Xinputs.get_shape()[0], self.state_size],initializer=init)
Is there a way to get this constant initial state to be created dynamically, or do I need to reset it through the feed_dict with tensor-serving code? Is there a clever way to do this only once within the graph, maybe with a tf.Variable.assign()?
The solution to the problem was figuring out how to obtain the "batch_size" so that the variable is not hard-coded.
This was the approach from the given example:
Xinputs = tf.placeholder(tf.int32, (None, self.sequence_size, self.num_params), name="input")
state = cell_L1.zero_state(Xinputs.get_shape()[0],Xinputs.dtype)
The problem is the use of "get_shape()[0]": this returns the static "shape" of the tensor and takes the batch_size value at [0]. The documentation doesn't seem to be that clear, but this appears to be a constant value, so when you load the graph for inference, this value is still hard-coded (it seems to be evaluated only at graph creation).
Using the "tf.shape()" function seems to do the trick. This doesn't return the static shape, but a tensor, so its value is resolved at run time. Using this code fragment solved the problem of training with a batch of 128 and then loading the graph into TensorFlow Serving for inference with a batch of just 1.
Xinputs = tf.placeholder(tf.int32, (None, self.sequence_size, self.num_params), name="input")
batch_size = tf.shape(Xinputs)[0]
state = self.cell_L1.zero_state(batch_size,Xinputs.dtype)
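For comparison, a quick illustration of the static vs. dynamic shape behaviour (TF 1.x; the placeholder shape below is just an example):
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.int32, (None, 80, 3), name="example_input")

static_batch = x.get_shape()[0]   # Dimension(None): fixed when the graph is built
dynamic_batch = tf.shape(x)[0]    # scalar int32 tensor, resolved from the fed batch

with tf.Session() as sess:
    print(static_batch)                                                    # prints "?"
    print(sess.run(dynamic_batch,
                   feed_dict={x: np.zeros((1, 80, 3), dtype=np.int32)}))   # prints 1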
Here is a good link to the TensorFlow FAQ, which describes this approach under 'How do I build a graph that works with variable batch sizes?':
https://www.tensorflow.org/resources/faq

Reusing part of a tensorflow trained graph

So, I trained a tensorflow model with a few layers, more or less like this:
with tf.variable_scope('model1') as scope:
    inputs = tf.placeholder(tf.int32, [None, num_time_steps])
    embeddings = tf.get_variable('embeddings', (vocab_size, embedding_size))
    lstm = tf.nn.rnn_cell.LSTMCell(lstm_units)
    embedded = tf.nn.embedding_lookup(embeddings, inputs)
    _, state = tf.nn.dynamic_rnn(lstm, embedded, dtype=tf.float32, scope=scope)
    # more stuff on the state
Now, I wanted to reuse the embedding matrix and the lstm weights in another model, which is very different from this one except for these two components.
As far as I know, if I load them with a tf.train.Saver object, it will look for variables with the exact same names, but I'm using different variable_scopes in the two graphs.
In this answer, it is suggested to create the graph where the LSTM is trained as a superset of the other one, but I don't think it is possible in my case, given the differences in the two models. Anyway, I don't think it is a good idea to make one graph dependent on the other, if they do independent things.
I thought about changing the variable scope of the LSTM weights and embeddings in the serialized graph. I mean, where it originally read model1/Weights:0 or something, it would be another_scope/Weights:0. Is it possible and feasible?
Of course, if there is a better solution, it is also welcome.
I found out that the Saver can be initialized with a dictionary mapping variable names (without the trailing :0) in the serialized file to the variable objects I want to restore in the graph. For example:
varmap = {'model1/some_scope/weights': variable_in_model2,
          'model1/another_scope/weights': another_variable_in_model2}
saver = tf.train.Saver(varmap)
saver.restore(sess, path_to_saved_file)
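For context, a minimal end-to-end sketch of that mapping (the scope, variable names and checkpoint path below are invented for illustration):
import tensorflow as tf

vocab_size, embedding_size = 10000, 128            # must match the shapes in the checkpoint

with tf.variable_scope('another_scope'):
    embeddings2 = tf.get_variable('embeddings', (vocab_size, embedding_size))

# keys are the names as stored in the checkpoint (no ':0'), values are the new graph's variables
saver = tf.train.Saver({'model1/embeddings': embeddings2})

with tf.Session() as sess:
    saver.restore(sess, '/path/to/model1.ckpt')    # hypothetical checkpoint path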

Word Embedding for Convolution Neural Network

I am trying to apply word2vec to a convolutional neural network. I am new to TensorFlow. Here is my code for the pre-trained embedding layer.
W = tf.Variable(tf.constant(0.0, shape=[vocabulary_size, embedding_size]),
                trainable=False, name="W")
embedding_placeholder = tf.placeholder(tf.float32, [vocabulary_size, embedding_size])
embedding_init = W.assign(embedding_placeholder)
sess = tf.Session()
sess.run(embedding_init, feed_dict={embedding_placeholder: final_embeddings})
I think I should use embedding_lookup but am not sure how to use it. I would really appreciate it if someone could give some advice.
Thanks
TensorFlow has an example using word2vec-cnn for text classification: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/skflow/text_classification_cnn.py
You are on the right track. As embedding_lookup works under the assumption that words are represented as integer ids, you need to transform your input vectors to comply with that. Furthermore, you need to make sure that your transformed words are correctly indexed into the embedding matrix. What I did was use the index-to-word mapping generated by the embedding model (I used gensim for training my embeddings) to create a word-to-index lookup table, which I subsequently used to transform my input vectors.
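A rough sketch of that transformation (toy vocabulary and sizes; with gensim the word-to-index mapping can be built from the trained model's vocabulary, though attribute names vary by version):
import numpy as np
import tensorflow as tf

vocabulary_size, embedding_size = 5, 3                      # toy sizes for illustration
W = tf.Variable(np.random.rand(vocabulary_size, embedding_size).astype(np.float32),
                trainable=False, name="W")

word_to_index = {'the': 0, 'cat': 1, 'sat': 2}              # built from the embedding model
ids = [[word_to_index[w] for w in ['the', 'cat', 'sat']]]   # shape (1, 3)

input_data = tf.placeholder(tf.int32, [None, None])
embedded = tf.nn.embedding_lookup(W, input_data)            # (batch, time, embedding_size)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(embedded, feed_dict={input_data: ids}).shape)   # (1, 3, 3)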
I am doing something similar. I stumbled upon this blog that implements the paper "Convolutional Neural Networks for Sentence Classification". This blog is good. http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/