Tensorflow: jointly training CNN + LSTM - tensorflow

There are quite a few examples on how to use LSTMs alone in TF, but I couldn't find any good examples on how to train CNN + LSTM jointly.
From what I see, it is not quite straightforward how to do such training, and I can think of a few options here:
First, I believe the simplest solution (or the most primitive one) would be to train CNN independently to learn features and then to train LSTM on CNN features without updating the CNN part, since one would probably have to extract and save these features in numpy and then feed them to LSTM in TF. But in that scenario, one would probably have to use a differently labeled dataset for pretraining of CNN, which eliminates the advantage of end to end training, i.e. learning of features for final objective targeted by LSTM (besides the fact that one has to have these additional labels in the first place).
Second option would be to concatenate all time slices in the batch
dimension (4-d Tensor), feed it to CNN then somehow repack those
features to 5-d Tensor again needed for training LSTM and then apply a cost function. My main concern, is if it is possible to do such thing. Also, handling variable length sequences becomes a little bit tricky. For example, in prediction scenario you would only feed single frame at the time. Thus, I would be really happy to see some examples if that is the right way of doing joint training. Besides that, this solution looks more like a hack, thus, if there is a better way to do so, it would be great if someone could share it.
Thank you in advance !

For joint training, you can consider using tf.map_fn as described in the documentation https://www.tensorflow.org/api_docs/python/tf/map_fn.
Lets assume that the CNN is built along similar lines as described here https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10.py.
def joint_inference(sequence):
inference_fn = lambda image: inference(image)
logit_sequence = tf.map_fn(inference_fn, sequence, dtype=tf.float32, swap_memory=True)
lstm_cell = tf.contrib.rnn.LSTMCell(128)
output_state, intermediate_state = tf.nn.dynamic_rnn(cell=lstm_cell, inputs=logit_sequence)
projection_function = lambda state: tf.contrib.layers.linear(state, num_outputs=num_classes, activation_fn=tf.nn.sigmoid)
projection_logits = tf.map_fn(projection_function, output_state)
return projection_logits
Warning: You might have to look into device placement as described here https://www.tensorflow.org/tutorials/using_gpu if your model is larger than the memory gpu can allocate.
An Alternative would be to flatten the video batch to create an image batch, do a forward pass from CNN and reshape the features for LSTM.

Related

Function CNN model written in Keras 2.2.4 not learning in TensorFlow or Keras 2.4

I am dealing with an object detection problem and using a model which is actually functioning (its results have been published on a paper and I have the original code). Originally, the code was written with Keras 2.2.4 without importing TensorFlow and trained and tested on the same dataset that I am using at the moment. However, when I try to run the same model with TensorFlow 2.x it just won't learn a thing.
I have tried importing everything from TensorFlow 2.4, but I have the same problem if I import everything (layers, models, optimizers...) from Keras 2.4. And I have tried to do so on two different devices, both using a GPU. Namely, what is happening is that the loss function decreases ridiculously fast, but the accuracy won't increase a bit (or, if it does, it gets stuck around 10% or smth). Also, every now and then this happens from an epoch to the next one:
Loss undergoes HUGE jumps between consecutive epochs, and all this without any changes in accuracy
I have tried to train the network on another dataset (had to change the last layers in order to match the required dimensions) and the model seemed to be learning in a normal way, i.e. the accuracy actually increases and the loss doesn't reach 0.0x in one epoch.
I can't post the script, but the model is an Encoder-Decoder network: consecutive Convolutions with increasing number of filters reduce the dimensions of the image, and a specular path of Transposed Convolutions restores the original dimensions. So basically the network only contains:
Conv2D
Conv2DTranspose
BatchNormalization
Activation("relu")
Activation("sigmoid")
concatenate
6 is used to put together outputs from parallel paths or distant layers; 3 and 4 are used after every Conv or ConvTranspose; 5 is only used as final activation function, i.e. as output layer.
I think the problem is pretty generic and I am honestly surprised that I couldn't find a single question about it. What could be happening here? The problem must have something to do with TF/Keras versions, but I can't find any documentation about it and I have been trying to change so many things but nothing changes. It's crazy because if I didn't know that the model works I would try to rewrite it from scratch so I am afraid that this problem may occurr with a new network and I won't be able to understand whether it's the libraries or the model itself.
Thank you in advance! :)
EDIT
Code snippets:
Convolutional block:
encoder1 = Conv2D(filters=first_layer_channels, kernel_size=2, strides=2)(input)
encoder1 = BatchNormalization()(encoder1)
encoder1 = Activation('relu')(encoder1)
Decoder
decoder1 = Conv2DTranspose(filters=first_layer_channels, kernel_size=2, strides=2)(encoder4)
decoder1 = BatchNormalization()(decoder1)
decoder1 = Activation('relu')(decoder1)
Final layers:
final = Conv2D(filters=total, kernel_size=1)(decoder4)
final = BatchNormalization()(final)
Last_Conv = Activation('sigmoid')(final)
The task is human pose estimation: the network (which, I recall, works on this specific task with Keras 2.2.4) has to predict twenty binary maps containing the positions of specific keypoints.

Purpose of batch channel in tensorflow model on forward pass of 1 input

So far I have trained a couple different models in TensorFlow (with Keras) and I see that getting the batch_size right seems to be important not just for speed of training but also the resultant accuracy of the model.
What confuses me is a case where a model has an actual batch channel as the first dimension on the input (and output as well). If my batch size is 32 but I'm always inputting 1 data at run-time then where does the batch channel apply? How could I utilize the vast majority of it if I'm inherently only using 1/batch_size amount of it in forward pass?
If you are curious the model I am researching, it is this one:
https://github.com/pierluigiferrari/ssd_keras/blob/master/models/keras_ssd300.py
see:
Output shape of predictions: (batch, n_boxes_total, n_classes + 4 + 8)
predictions = Concatenate(axis=2, name='predictions')([mbox_conf_softmax, mbox_loc, mbox_priorbox])
The tensors had run through numerous other layers that had constants and such pretrained with [batch_size] as well. To me it just seems like inputs at various batch index would have to yield different results. Maybe I just need something incredibly obvious pointed out to me.
It would seem that after training you must recompile the model with a batch size of 1, then transfer the weights from the training model to the new model for evaluation. The alternative is performing 'batch_size' count of predictions at once (which of course is not always feasible per application). If there are alternatives (or if I read wrong) please feel free to add an answer.

Questions about tensorflow GetStarted tutorial

So I was reading the tensorflow getstarted tutorial and I found it very hard to follow. There were a lot of explanations missing about each function and why they are necesary (or not).
In the tf.estimator section, what's the meaning or what are they supposed to be the "x_eval" and "y_eval" arrays? The x_train and y_train arrays give the desired output (which is the corresponding y coordinate) for a given x coordinate. But the x_eval and y_eval values are incorrect: for x=5, y should be -4, not -4.1. Where do those values come from? What do x_eval and y_eval mean? Are they necesary? How did they choose those values?
The difference between "input_fn" (what does "fn" even mean?) and "train_input_fn". I see that the only difference is one has
num_epochs=None, shuffle=True
num_epochs=1000, shuffle=False
but I don't understand what "input_fn" or "train_input_fn" are/do, or what's the difference between the two, or if both are necesary.
3.In the
estimator.train(input_fn=input_fn, steps=1000)
piece of code, I don't understand the difference between "steps" and "num_epochs". What's the meaning of each one? Can you have num_epochs=1000 and steps=1000 too?
The final question is, how do i get the W and the b? In the previous way of doing it (not using tf.estimator) they explicitelly found that W=-1 and b=1. If I was doing a more complex neural network, involving biases and weights, I think I would want to recover the actual values of the weights and biases. That's the whole point of why I'm using tensorflow, to find the weights! So how do I recover them in the tf.estimator example?
These are just some of the questions that bugged me while reading the "getStarted" tutorial. I personally think it leaves a lot to desire, since it's very unclear what each thing does and you can at best guess.
I agree with you that the tf.estimator is not very well introduced in this "getting started" tutorial. I also think that some machine learning background would help with understanding what happens in the tutorial.
As for the answers to your questions:
In machine learning, we usually minimizer the loss of the model on the training set, and then we evaluate the performance of the model on the evaluation set. This is because it is easy to overfit the training set and get 100% accuracy on it, so using a separate validation set makes it impossible to cheat in this way.
Here (x_train, y_train) corresponds to the training set, where the global minimum is obtained for W=-1, b=1.
The validation set (x_eval, y_eval) doesn't have to perfectly follow the distribution of the training set. Although we can get a loss of 0 on the training set, we obtain a small loss on the validation set because we don't have exactly y_eval = - x_eval + 1
input_fn means "input function". This is to indicate that the object input_fn is a function.
In tf.estimator, you need to provide an input function if you want to train the estimator (estimator.train()) or evaluate it (estimator.evaluate()).
Usually you want different transformations for training or evaluation, so you have two functions train_input_fn and eval_input_fn (the input_fn in the tutorial is almost equivalent to train_input_fn and is just confusing).
For instance, during training we want to train for multiple epochs (i.e. multiple times on the dataset). For evaluation, we only need one pass over the validation data to compute the metrics we need
The number of epochs is the number of times we repeat the entire dataset. For instance if we train for 10 epochs, the model will see each input 10 times.
When we train a machine learning model, we usually use mini-batches of data. For instance if we have 1,000 images, we can train on batches of 100 images. Therefore, training for 10 epochs means training on 100 batches of data.
Once the estimator is trained, you can access the list of variables through estimator.get_variable_names() and the value of a variable through estimator.get_variable_value().
Usually we never need to do that, as we can for instance use the trained estimator to predict on new examples, using estimator.predict().
If you feel that the getting started is confusing, you can always submit a GitHub issue to tell the TensorFlow team and explain your point.

Tensorflow embeddings

I know what embeddings are and how they are trained. Precisely, while referring to the tensorflow's documentation, I came across two different articles. I wish to know what exactly is the difference between them.
link 1: Tensorflow | Vector Representations of words
In the first tutorial, they have explicitly trained embeddings on a specific dataset. There is a distinct session run to train those embeddings. I can then later on save the learnt embeddings as a numpy object and use the
tf.nn.embedding_lookup() function while training an LSTM network.
link 2: Tensorflow | Embeddings
In this second article however, I couldn't understand what is happening.
word_embeddings = tf.get_variable(“word_embeddings”,
[vocabulary_size, embedding_size])
embedded_word_ids = tf.gather(word_embeddings, word_ids)
This is given under the training embeddings sections. My doubt is: does the gather function train the embeddings automatically? I am not sure since this op ran very fast on my pc.
Generally: What is the right way to convert words into vectors (link1 or link2) in tensorflow for training a seq2seq model? Also, how to train the embeddings for a seq2seq dataset, since the data is in the form of separate sequences for my task unlike (a continuous sequence of words refer: link 1 dataset)
Alright! anyway, I have found the answer to this question and I am posting it so that others might benefit from it.
The first link is more of a tutorial that steps you through the process of exactly how the embeddings are learnt.
In practical cases, such as training seq2seq models or Any other encoder-decoder models, we use the second approach where the embedding matrix gets tuned appropriately while the model gets trained.

pruning tensorflow connections and weights (using cifar10 cnn)

I'm using tensorflow to run a cnn for image classification.
I use tensorflow cifar10 cnn implementation.(tensorflow cifar10)
I want to decrease the number of connections, meaning I want to prune the low-weight connections.
How can I create a new graph(subgraph) without some of the nuerones?
Tensorflow does not allow you lock/freeze a particular kernel of a particular layer, that I have found. The only I've found to do this is to use the tf.assign() function as shown in
How to freeze/lock weights of one Tensorflow variable (e.g., one CNN kernel of one layer
It's fairly cave-man but I've seen no other solution that works. Essentially, you have to .assign() the values every so often as you iterate through the data. Since this approach is so inelegant and brute-force, it's very slow. I do the .assign() every 100 batches.
Someone please post a better solution and soon!
The cifar10 model you point to, and for that matter, most models written in TensorFlow, do not model the weights (and hence, connections) of individual neurons directly in the computation graph. For instance, for fully connected layers, all the connections between the two layers, say, with M neurons in the layer below, and 'N' neurons in the layer above, are modeled by one MxN weight matrix. If you wanted to completely remove a neuron and all of its outgoing connections from the layer below, you can simply slice out a (M-1)xN matrix by removing the relevant row, and multiply it with the corresponding M-1 activations of the neurons.
Another way is add an addition mask to control the connections.
The first step involves adding mask and threshold variables to the
layers that need to undergo pruning. The variable mask is the same
shape as the layer's weight tensor and determines which of the weights
participate in the forward execution of the graph.
There is a pruning implementation under tensorflow/contrib/model_pruning to prune the model. Hope this can help you to prune model quickly.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/model_pruning
I think google has an updated answer here : https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/model_pruning
Removing pruning nodes from the trained graph:
$ bazel build -c opt contrib/model_pruning:strip_pruning_vars
$ bazel-bin/contrib/model_pruning/strip_pruning_vars --checkpoint_path=/tmp/cifar10_train --output_node_names=softmax_linear/softmax_linear_2 --filename=cifar_pruned.pb
I suppose that cifar_pruned.pb will be smaller, since the pruned "or zero masked" variables are removed.