Why does the LSTM cell unit need the reuse flag?

I was looking at the TensorFlow LSTM tutorial and it had this piece of code:
def lstm_cell():
  # With the latest TensorFlow source code (as of Mar 27, 2017),
  # the BasicLSTMCell will need a reuse parameter which is unfortunately not
  # defined in TensorFlow 1.0. To maintain backwards compatibility, we add
  # an argument check here:
  if 'reuse' in inspect.getargspec(tf.contrib.rnn.BasicLSTMCell.__init__).args:
    return tf.contrib.rnn.BasicLSTMCell(
        size, forget_bias=0.0, state_is_tuple=True,
        reuse=tf.get_variable_scope().reuse)
  else:
    return tf.contrib.rnn.BasicLSTMCell(
        size, forget_bias=0.0, state_is_tuple=True)
which seems confusing to me. There seems to be an issue with the reuse flag. Why is it important? Why do we need it? I know RNNs always share parameters as we produce new states, so why would such an important flag seemingly be removed at random? This makes me think I might not understand what is missing. I understand Chris Olah's blog post on LSTMs very well, so I find this piece of code in the tutorial really mysterious.
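For context while reading that snippet: in TF 1.x, the reuse flag tells tf.get_variable whether to create fresh weights or to return the weights already registered under the same scope name; that is how two instantiations of the model graph (e.g. one for training, one for evaluation) end up sharing a single set of LSTM parameters. A minimal sketch of that mechanism (the scope name and tensor shapes below are made up for illustration):

import tensorflow as tf

def lstm_layer(inputs):
  # The cell forwards the scope's reuse setting, so get_variable inside
  # dynamic_rnn either creates the LSTM kernel/bias or fetches the existing ones.
  cell = tf.contrib.rnn.BasicLSTMCell(128, reuse=tf.get_variable_scope().reuse)
  outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
  return outputs

with tf.variable_scope("model"):
  train_out = lstm_layer(tf.zeros([4, 10, 32]))   # creates model/rnn/basic_lstm_cell/*
with tf.variable_scope("model", reuse=True):
  eval_out = lstm_layer(tf.zeros([4, 10, 32]))    # reuses the same variables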

Related

Tensorflow Embedding for training and inference

I am trying to code a simple neural machine translation model using TensorFlow, but I am a little stuck on understanding embeddings in TensorFlow:
I do not understand the difference between tf.contrib.layers.embed_sequence(inputs, vocab_size=target_vocab_size,embed_dim=decoding_embedding_size)
and
dec_embeddings = tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))
dec_embed_input = tf.nn.embedding_lookup(dec_embeddings, dec_input)
In which cases should I use one over the other?
The second thing I do not understand is tf.contrib.seq2seq.TrainingHelper and tf.contrib.seq2seq.GreedyEmbeddingHelper. I know that in the case of translation, we mainly use TrainingHelper for the training step (use the previous target to predict the next target) and GreedyEmbeddingHelper for the inference step (use the previous timestep to predict the next target).
But I do not understand how they work, in particular the different parameters used. For example, why do we need a sequence length in the case of TrainingHelper (why do we not use an EOS)? Why does neither of them use embedding_lookup or embed_sequence as input?
I suppose that you're coming from this seq2seq tutorial. Even though this question is starting to get old, I'll try to answer for people passing by like me:
For the first question, I looked at the source code behind tf.contrib.layers.embed_sequence, and it actually uses tf.nn.embedding_lookup. So it just wraps it and creates the embedding matrix (tf.Variable(tf.random_uniform([target_vocab_size, decoding_embedding_size]))) for you. Although this is convenient and less verbose, there doesn't seem to be a direct way to access the embeddings when using embed_sequence; if you want them, you have to query the internal variable that serves as the embedding matrix by using the same variable scope. I have to admit that the code in the tutorial above is confusing; I even suspect it uses different embeddings in the encoder and the decoder.
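To make the comparison concrete, here is a minimal sketch of the two options (the sizes and the scope name are made up for illustration):

import tensorflow as tf

target_vocab_size, decoding_embedding_size = 1000, 64
dec_input = tf.placeholder(tf.int32, [None, None])   # [batch, time] token ids

# Option A: embed_sequence creates and owns the embedding matrix for you
# (internally it calls embedding_lookup on that variable).
dec_embed_a = tf.contrib.layers.embed_sequence(
    dec_input, vocab_size=target_vocab_size,
    embed_dim=decoding_embedding_size, scope="decoder_embedding")

# Option B: explicit variable + lookup -- the same computation, but you keep
# a handle to the matrix (handy if you need it again, e.g. for the decoder's
# GreedyEmbeddingHelper or for sharing embeddings between encoder and decoder).
dec_embeddings = tf.Variable(
    tf.random_uniform([target_vocab_size, decoding_embedding_size]))
dec_embed_b = tf.nn.embedding_lookup(dec_embeddings, dec_input)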
For the second question:
I guess it is equivalent to use a sequence length or an EOS token: either way the helper knows where each target sequence stops.
The TrainingHelper doesn't need the embedding_lookup because it only forwards the (already embedded) inputs to the decoder; GreedyEmbeddingHelper does take the embedding as its first argument, as mentioned in the documentation.
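A rough sketch of how the two helpers are wired up (TF 1.x contrib API; the sizes, the GO/EOS ids, and the zero encoder state are placeholders for illustration, not from the tutorial):

import tensorflow as tf

batch_size, num_units, vocab_size, embed_dim = 32, 128, 1000, 64
go_id, eos_id = 1, 2                                          # assumed special token ids

dec_embeddings = tf.Variable(tf.random_uniform([vocab_size, embed_dim]))
dec_targets = tf.placeholder(tf.int32, [batch_size, None])    # ground-truth target ids
target_lengths = tf.placeholder(tf.int32, [batch_size])       # length of each target
dec_cell = tf.contrib.rnn.BasicLSTMCell(num_units)
output_layer = tf.layers.Dense(vocab_size)
encoder_state = dec_cell.zero_state(batch_size, tf.float32)   # stand-in for the real encoder state

with tf.variable_scope("decode"):
  # Training: feed the embedded ground-truth targets; sequence_length tells the
  # helper where each sequence stops.
  train_helper = tf.contrib.seq2seq.TrainingHelper(
      inputs=tf.nn.embedding_lookup(dec_embeddings, dec_targets),
      sequence_length=target_lengths)
  train_decoder = tf.contrib.seq2seq.BasicDecoder(
      dec_cell, train_helper, initial_state=encoder_state, output_layer=output_layer)
  train_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(train_decoder)

with tf.variable_scope("decode", reuse=True):
  # Inference: the helper embeds its own previous predictions, so it takes the
  # raw embedding matrix plus start/end token ids instead of pre-embedded inputs.
  infer_helper = tf.contrib.seq2seq.GreedyEmbeddingHelper(
      embedding=dec_embeddings,
      start_tokens=tf.fill([batch_size], go_id),
      end_token=eos_id)
  infer_decoder = tf.contrib.seq2seq.BasicDecoder(
      dec_cell, infer_helper, initial_state=encoder_state, output_layer=output_layer)
  infer_outputs, _, _ = tf.contrib.seq2seq.dynamic_decode(
      infer_decoder, maximum_iterations=50)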
If I understand you correctly, the first question is about the differences between tf.contrib.layers.embed_sequence and tf.nn.embedding_lookup.
According to the official docs (https://www.tensorflow.org/api_docs/python/tf/contrib/layers/embed_sequence),
Typical use case would be reusing embeddings between an encoder and decoder.
I think tf.contrib.layers.embed_sequence is designed for seq2seq models.
I found the following post:
https://github.com/tensorflow/tensorflow/issues/17417
where @ispirmustafa mentioned:
embedding_lookup doesn't support invalid ids.
Also, in another post: tf.contrib.layers.embed_sequence() is for what?
@user1930402 said:
When building a neural network model that has multiple gates that take features as input, by using tensorflow.contrib.layers.embed_sequence you can reduce the number of parameters in your network while preserving depth. For example, it eliminates the need for each gate of the LSTM to perform its own linear projection of the features.
It allows for arbitrary input shapes, which helps the implementation be simple and flexible.
For the second question, sorry, I haven't used TrainingHelper, so I can't answer it.

Tensorflow Hub Image Modules: Clarity on Preprocessing and Output values

Many thanks for the support!
I currently use TF-Slim, and TF Hub seems like a very useful addition for transfer learning. However, the following things are not clear from the documentation:
1. Is preprocessing done implicitly? Is this based on the "trainable=True/False" parameter in the module constructor?
module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1", trainable=True)
When I use Tf-slim I use the preprocess method:
inception_preprocessing.preprocess_image(image, img_height, img_width, is_training)
2. How do I get access to AuxLogits for an Inception model? It seems to be missing:
import tensorflow_hub as hub
import tensorflow as tf
img = tf.random_uniform([10,299,299,3])
module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1", trainable=True)
outputs = module(dict(images=img), signature="image_feature_vector", as_dict=True)
The output is
dict_keys(['InceptionV3/Mixed_6b', 'InceptionV3/MaxPool_5a_3x3', 'InceptionV3/Mixed_6c', 'InceptionV3/Mixed_6d', 'InceptionV3/Mixed_6e', 'InceptionV3/Mixed_7a', 'InceptionV3/Mixed_7b', 'InceptionV3/Conv2d_2a_3x3', 'InceptionV3/Mixed_7c', 'InceptionV3/Conv2d_4a_3x3', 'InceptionV3/Conv2d_1a_3x3', 'InceptionV3/global_pool', 'InceptionV3/MaxPool_3a_3x3', 'InceptionV3/Conv2d_2b_3x3', 'InceptionV3/Conv2d_3b_1x1', 'default', 'InceptionV3/Mixed_5b', 'InceptionV3/Mixed_5c', 'InceptionV3/Mixed_5d', 'InceptionV3/Mixed_6a'])
These are excellent questions; let me try to give good answers also for readers less familiar with TF-Slim.
1. Preprocessing is not done by the module, because it depends a lot on your data, and not so much on the CNN architecture within the module. The module only handles transforming input values from the canonical [0,1] range into whatever the pre-trained CNN within the module expects.
Lengthy rationale: Preprocessing of images for CNN training usually consists of decoding the input JPEG (or whatever), selecting a (reasonably large) random crop from it, random photometric and geometric transformations (distort colors, flip left/right, etc.), and resizing to the common image size for a batch of training inputs. The TensorFlow Hub modules that implement https://tensorflow.org/hub/common_signatures/images leave all of that to your code around the module.
The primary reason is that the suitable random transformations depend a lot on your training task, but not on the architecture or trained state weights of the module. For example, color distortions will help if you classify cars vs dogs, but probably not for ripe vs unripe bananas, and so on.
Also, a batch of images that have been decoded but not yet cropped/resized are hard to represent as a single tensor (unless you make it a 1-D tensor of encoded strings, but that brings other problems, such as breaking backprop into module inputs for advanced uses).
Bottom line: The Python code using the module needs to do image preprocessing (except scaling values), for example, as in https://github.com/tensorflow/hub/blob/master/examples/image_retraining/retrain.py
The slim preprocessing methods conflate the dataset-specific random transformations (tuned for Imagenet!) with the re-scaling to the architecture's value range (which the Hub module does for you). That means they are not directly applicable here.
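As a sketch, the caller-side preprocessing could look roughly like this (the augmentation choices are illustrative, not prescribed by the module):

import tensorflow as tf
import tensorflow_hub as hub

def preprocess(jpeg_bytes):
  img = tf.image.decode_jpeg(jpeg_bytes, channels=3)
  img = tf.image.convert_image_dtype(img, tf.float32)   # scales to the canonical [0,1] range
  img = tf.image.random_flip_left_right(img)            # task-specific augmentation choice
  return tf.image.resize_images(img, [299, 299])        # the module's expected input size

jpegs = tf.placeholder(tf.string, [None])               # batch of encoded JPEGs
images = tf.map_fn(preprocess, jpegs, dtype=tf.float32) # [batch, 299, 299, 3] in [0,1]
module = hub.Module("https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1")
features = module(images)                               # module rescales to what the CNN expects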
2. Indeed, auxiliary heads are missing from the initial set of modules published under tfhub.dev/google/..., but I expect them to work fine for re-training anyway.
More details: Not all architectures have auxiliary heads, and even the original Inception paper says their effect was "relatively minor" [Szegedy & al. 2015; §5]. Using an image feature vector module for a custom classification task would burden the module consumer code with checking for aux features and, if found, putting aux logits and a loss term on top.
This complication did not seem to pull its weight, but more experiments might refute that assessment. (Please share in a GitHub issue if you know of any.)
For now, the only way to put an aux head onto https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1 is to copy & paste some lines from https://github.com/tensorflow/models/blob/master/research/slim/nets/inception_v3.py (search for "Auxiliary head logits") and apply that to the "InceptionV3/Mixed_6e" output that you saw.
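For anyone who wants to try that, here is a very rough sketch of what such an aux head could look like on top of the outputs dict from the question; num_classes and onehot_labels are assumed to exist, and the layer sizes are from memory, so check them against the Slim source linked above:

aux = outputs['InceptionV3/Mixed_6e']                   # roughly [batch, 17, 17, 768]
aux = tf.layers.average_pooling2d(aux, pool_size=5, strides=3, padding='valid')
aux = tf.layers.conv2d(aux, filters=128, kernel_size=1, activation=tf.nn.relu)
aux = tf.layers.conv2d(aux, filters=768, kernel_size=5, padding='valid',
                       activation=tf.nn.relu)
aux_logits = tf.layers.dense(tf.layers.flatten(aux), units=num_classes)
aux_loss = tf.losses.softmax_cross_entropy(onehot_labels, aux_logits, weights=0.4)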
3. You didn't ask, but: for training, the module's documentation recommends passing hub.Module(..., tags={"train"}), or else batch norm operates in inference mode (and so would dropout, if the module had any).
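For example (illustrative; in a real setup you would typically build the training and evaluation graphs separately):

train_module = hub.Module(
    "https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1",
    trainable=True, tags={"train"})   # batch norm updates its moving statistics
eval_module = hub.Module(
    "https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1",
    trainable=False)                  # default tags: inference behaviour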
Hope this explains how and why things are.
Arno (from the TensorFlow Hub developers)

There is no "name" variable in the constructor of BasicLSTMCell

In order to differentiate LSTMs, I wish to give a name to the BasicLSTMCell variable in my code. But it reported the following error:
num_units=self.config.num_lstm_units, state_is_tuple=True, name="some_basic_lstm")
TypeError: __init__() got an unexpected keyword argument 'name'
And I found the following in the library of my TensorFlow installation, in the file rnn_cell_impl.py:
class BasicLSTMCell(RNNCell):
  """Basic LSTM recurrent network cell.

  The implementation is based on: http://arxiv.org/abs/1409.2329.

  We add forget_bias (default: 1) to the biases of the forget gate in order to
  reduce the scale of forgetting in the beginning of the training.

  It does not allow cell clipping, a projection layer, and does not
  use peep-hole connections: it is the basic baseline.

  For advanced models, please use the full @{tf.nn.rnn_cell.LSTMCell}
  that follows.
  """

  def __init__(self, num_units, forget_bias=1.0,
               state_is_tuple=True, activation=None, reuse=None):
    """Initialize the basic LSTM cell.

    Args:
      num_units: int, The number of units in the LSTM cell.
      forget_bias: float, The bias added to forget gates (see above).
        Must set to `0.0` manually when restoring from CudnnLSTM-trained
        checkpoints.
      state_is_tuple: If True, accepted and returned states are 2-tuples of
        the `c_state` and `m_state`. If False, they are concatenated
        along the column axis. The latter behavior will soon be deprecated.
      activation: Activation function of the inner states. Default: `tanh`.
      reuse: (optional) Python boolean describing whether to reuse variables
        in an existing scope. If not `True`, and the existing scope already has
        the given variables, an error is raised.
    """
Is it a bug in my version of tensorflow? How can I give it a "name"?
I think @aswinids provided the best answer here in the comments, but let me explain why this should not be considered a bug. An LSTM cell is comprised of at least 4 variables (there are a few others used for control flow and such), corresponding to the 4 sub-network operations that occur in an LSTM. The diagram in Colah's blog (http://colah.github.io/posts/2015-08-Understanding-LSTMs/) illustrates the internals of an LSTM cell.
Each of the yellow boxes in that diagram has a set of weights assigned to it and is effectively a single-layer neural network operation (piped together in an interesting way, defined by the LSTM architecture).
A good approach to naming these would then be to wrap the cell in tf.variable_scope('some_name'), so that all 4 of the variables defined in the LSTM share a common base naming structure such as:
lstm_cell/f_t
lstm_cell/i_t
lstm_cell/C_t
lstm_cell/o_t
I suspect that previously they just did this and hard-coded lstm_cell (or whatever name they used) as the prefix for all the variables under the LSTM cell. In later versions, as @aswinids points out, there is a name argument, and I suspect it just replaces the lstm_cell prefix I used in the example here.
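To illustrate the workaround on older versions, a small sketch (the sizes are arbitrary, and the exact variable names differ between TensorFlow versions):

import tensorflow as tf

with tf.variable_scope("some_basic_lstm"):
    cell = tf.contrib.rnn.BasicLSTMCell(num_units=64, state_is_tuple=True)
    inputs = tf.zeros([8, 20, 32])     # [batch, time, features]
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

print([v.name for v in tf.trainable_variables()])
# e.g. ['some_basic_lstm/rnn/basic_lstm_cell/kernel:0',
#       'some_basic_lstm/rnn/basic_lstm_cell/bias:0'] (or weights/biases on older versions)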

What is the difference between the trainable_weights and trainable_variables in the tensorflow basic lstm_cell?

While trying to copy the weights of an LSTM cell in TensorFlow using the BasicLSTMCell as documented here, I stumbled upon both the trainable_weights and the trainable_variables property.
The source code has sadly not been very informative for a noob like me. A little bit of experimenting did yield the following information though:
Both have the exact same layout: a list of length two, where the first entry is a tf.Variable of shape (2*num_units, 4*num_units) and the second entry is of shape (4*num_units,), where num_units is the num_units passed when initializing the BasicLSTMCell.
My intuitive guess is that the first list item is a concatenation of the weights of the four internal layers of the LSTM, and the second item is a concatenation of the respective biases, which fits the expected sizes.
Now the question is whether there is actually any difference between these, or whether they are just a result of inheriting from the rnn_cell class?
From the source code of the Layer class that RNNCell inherits from:
@property
def trainable_variables(self):
  return self.trainable_weights
See here. The RNN classes don't seem to override this definition -- I would assume it's there for special layer types that have trainable variables that don't quite qualify as "weights". Batch normalization comes to mind, but unfortunately I can't find any mention of trainable_variables in that one's source code (except for GraphKeys.TRAINABLE_VARIABLES, which is different).
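A quick way to convince yourself, as a sketch against the TF 1.x API (sizes are arbitrary):

import tensorflow as tf

num_units = 8
cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
inputs = tf.zeros([1, 5, num_units])               # [batch, time, features]
outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

print([v.shape.as_list() for v in cell.trainable_weights])
# [[2*num_units, 4*num_units], [4*num_units]] -> [[16, 32], [32]]
print([v.name for v in cell.trainable_weights] ==
      [v.name for v in cell.trainable_variables])  # True: same variables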

Is there any way to get variable importance with Keras?

I am looking for a proper or best way to get variable importance in a Neural Network created with Keras. The way I currently do it is I just take the weights (not the biases) of the variables in the first layer with the assumption that more important variables will have higher weights in the first layer. Is there another/better way of doing it?
Since everything will be mixed up along the network, the first layer alone can't tell you about the importance of each variable. The following layers can also increase or decrease their importance, and even make one variable affect the importance of another variable. Every single neuron in the first layer will give each variable a different importance too, so it's not that straightforward.
I suggest you run model.predict(inputs) on inputs containing arrays of zeros, with only the variable you want to study set to 1.
That way, you see the result for each variable alone. Even so, this will still not help you with the cases where one variable increases the importance of another variable.
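Something like this, as a sketch (assuming a trained Keras model over purely numeric inputs):

import numpy as np

n_features = model.input_shape[-1]
probes = np.eye(n_features)          # row i: feature i set to 1, all others 0
responses = model.predict(probes)    # one prediction per single-feature probe

for i, r in enumerate(responses):
    print("feature", i, "->", r)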
*Edited to include relevant code to implement permutation importance.
I answered a similar question at Feature Importance Chart in neural network using Keras in Python. It implements what Teque5 mentioned above, namely shuffling the variable within your sample, i.e. permutation importance, using the ELI5 package.
from keras.models import Sequential
from keras.wrappers.scikit_learn import KerasClassifier, KerasRegressor
import eli5
from eli5.sklearn import PermutationImportance

def base_model():
    model = Sequential()
    ...
    return model

X = ...
y = ...

my_model = KerasRegressor(build_fn=base_model, **sk_params)
my_model.fit(X, y)

perm = PermutationImportance(my_model, random_state=1).fit(X, y)
eli5.show_weights(perm, feature_names=X.columns.tolist())
It is not that simple. For example, in later layers the variable's contribution could be reduced to 0.
I'd have a look at LIME (Local Interpretable Model-Agnostic Explanations). The basic idea is to set some inputs to zero, pass them through the model and see if the result is similar. If yes, then that variable might not be that important. But there is more to it, and if you want to know more, you should read the paper.
See marcotcr/lime on GitHub.
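For example, a minimal sketch with the lime package on tabular data (X_train, feature_names and model are assumed to exist):

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train, feature_names=feature_names, mode="regression")

# Explain one sample: which features pushed this prediction up or down?
exp = explainer.explain_instance(
    X_train[0], lambda x: model.predict(x).ravel(), num_features=10)
print(exp.as_list())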
This is a relatively old post with relatively old answers, so I would like to offer another suggestion of using SHAP to determine feature importance for your Keras models. SHAP also allows you to process Keras models using layers requiring 3d input like LSTM and GRU while eli5 cannot.
To avoid double-posting, I would like to point to my answer to a similar question on Stack Overflow on using SHAP.
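For completeness, a rough sketch of what that looks like (model, X_train and X_test are assumed to exist; exact calls may differ between shap versions):

import shap

background = X_train[:100]                        # a small background sample
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(X_test[:10])  # per-feature attributions
# shap_values is a list with one array per model output; use index 0 for a
# single-output model.
shap.summary_plot(shap_values[0], X_test[:10])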