How is this function programmatically building an LSTM - tensorflow

Here is the code:
def lstm(o, i, state):
    # The gates are all calculated separately; there is no overlap until the state update below.
    # input gate: (input * input-gate weights) + (previous output * recurrent input-gate weights) + bias
    input_gate = tf.sigmoid(tf.matmul(i, w_ii) + tf.matmul(o, w_io) + b_i)
    # output gate: (input * output-gate weights) + (previous output * recurrent output-gate weights) + bias
    output_gate = tf.sigmoid(tf.matmul(i, w_oi) + tf.matmul(o, w_oo) + b_o)
    # forget gate: (input * forget-gate weights) + (previous output * recurrent forget-gate weights) + bias
    forget_gate = tf.sigmoid(tf.matmul(i, w_fi) + tf.matmul(o, w_fo) + b_f)
    # candidate cell update
    memory_cell = tf.sigmoid(tf.matmul(i, w_ci) + tf.matmul(o, w_co) + b_c)
    # new cell state: forget part of the old state, add the gated candidate
    state = forget_gate * state + input_gate * memory_cell
    # new output: gated, squashed cell state
    output = output_gate * tf.tanh(state)
    return output, state
And here is the drawing of the LSTM (diagram not reproduced here):
I'm having trouble understanding how the two match up. Any help would be much appreciated.

This is an excellent blog post on LSTMs. The code directly implements an LSTM; it is equivalent to the equations listed on Wikipedia:
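For reference, the standard equations are, with x_t the input (i in the code), h_{t-1} the previous output (o), and c_{t-1} the previous cell state (state):

\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{forget gate (forget\_gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{input gate (input\_gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{output gate (output\_gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{candidate (memory\_cell)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{cell state (state)} \\
h_t &= o_t \odot \tanh(c_t) && \text{output (output)}
\end{aligned}

The one difference from the standard formulation is that the code applies a sigmoid, rather than tanh, to the candidate memory_cell.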
The input and output weights reflect the state of the network. In a simple fully-connected (FC) layer we would have only one weight matrix, which we would use to calculate the output of the layer: output = activation(W x + b).
The advantage of an LSTM, however, is that it combines multiple sources of information, or state; this is what we mean when we say an LSTM has memory. Like the FC layer it produces an output, but it also has the forget gate, the input gate, the output gate, the cell state, and the hidden state. These all provide multiple different sources of information, and the equations show how they come together to produce the output.
In the equations, x_t is the input, i_t is the input gate, and f_t is the forget gate. I would recommend reading the linked blog post and the Wikipedia article to understand how the equations implement an LSTM.
The image depicts the input gate providing output to the cell based on the values from previous cells, and previous values of the input gate. The cell also incorporates the forget gate; the inputs are then fed into the output gate, which also takes previous values of the output gate as inputs.

Related

What Loss function to use for binary classification in CNN using float labels?

So I am building a CNN that takes images with labels that go from 0 to 1.
What I mean is that I am trying to detect one thing in each image, and each image has a label between 0 and 1 that stands for the probability of that type of event being in the image.
I want to output this probability, so I am using a sigmoid activation function in the output layer, but I am having trouble deciding what loss function makes sense in this situation. If my labels were 0s and 1s, I would use binary cross-entropy, but does that still make sense when my labels are floats ranging from 0 to 1?
Cheers.
This solution is for logits (the output of the last linear layer), not for output probabilities:
def loss(logits, soft_labels):
    anti_soft_labels = 1 - soft_labels
    return soft_labels * tf.nn.softplus(-logits) \
        + anti_soft_labels * tf.nn.softplus(logits)

loss(logits=tf.constant([10., 0, -10]), soft_labels=tf.constant([1., 0.5, 0.]))
# [4.53989e-05, 6.93147e-01, 4.53989e-05]
If you need the minimal loss value to be 0 for any soft label, use:
def loss(logits, soft_labels):
    anti_soft_labels = 1 - soft_labels
    return soft_labels * tf.nn.softplus(-logits) \
        + anti_soft_labels * tf.nn.softplus(logits) \
        + tf.math.xlogy(soft_labels, soft_labels) \
        + tf.math.xlogy(anti_soft_labels, anti_soft_labels)

loss(logits=tf.constant([10., 0, -10]), soft_labels=tf.constant([1., 0.5, 0.]))
# [4.53989e-05, 0.00000e+00, 4.53989e-05]
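For what it's worth, the first variant should match TensorFlow's built-in tf.nn.sigmoid_cross_entropy_with_logits, which accepts soft labels as well; a minimal check:

import tensorflow as tf

logits = tf.constant([10., 0., -10.])
soft_labels = tf.constant([1., 0.5, 0.])
# built-in soft-label binary cross-entropy on logits
tf.nn.sigmoid_cross_entropy_with_logits(labels=soft_labels, logits=logits)
# expected: [4.53989e-05, 6.93147e-01, 4.53989e-05], same as the first loss above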

How does multi-input deep learning work in Keras?

I have a multi-input convolutional neural network model that takes 2 images from 2 datasets and gives one output, which is the class of the two inputs. The two datasets have the same classes. I used 2 VGG16 models and concatenated them to classify the two images.
import keras
from keras.layers import Dense, concatenate
from keras.models import Model

vgg16_model = keras.applications.vgg16.VGG16()
input_layer1 = vgg16_model.input
last_layer1 = vgg16_model.get_layer('fc2').output

vgg16_model2 = keras.applications.vgg16.VGG16()
input_layer2 = vgg16_model2.input
last_layer2 = vgg16_model2.get_layer('fc2').output

con = concatenate([last_layer1, last_layer2])  # merge the outputs of the two models
output_layer = Dense(no_classes, activation='softmax', name='prediction')(con)
multimodal_model1 = Model(inputs=[input_layer1, input_layer2], outputs=[output_layer])
My questions are:
1- Which case of the following represents how the images enter the model?
One to One
database1-img1 + database2-img1
database1-img2 + database2-img2
database1-img3 + database2-img3
database1-img4 + database2-img4
.........
Many to many
database1-img1 + database2-img1
database1-img1 + database2-img2
database1-img1 + database2-img3
database1-img1 + database2-img4
database1-img2 + database2-img1
database1-img2 + database2-img2
database1-img2 + database2-img3
database1-img2 + database2-img4
.........
2- In general in deep learning, do the images that enter the model from the two datasets at the same time have the same class (label) or not?
It is a 1:1 mapping; the same holds with multiple outputs as well.
When you have a model such as Model(inputs=[input_layer1, input_layer2], outputs=[output_layer]), or even Model(inputs=[input_layer1, input_layer2], outputs=[output_layer1, output_layer2]), you must feed it with inputs/outputs of the same shape.
Assume the other case: you would need ds1.shape[0] * ds2.shape[0] different labels, one for each possible mix of the two datasets, and you would need to have them ordered in a particular way. That is not really feasible, at least not simply.
2. It is not that the same images have the same label; rather, each pair of images has a single label.
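As a minimal sketch of the 1:1 pairing (a toy stand-in for the two-branch model above; all names and shapes here are illustrative, not from the original post):

import numpy as np
from keras.layers import Input, Dense, concatenate
from keras.models import Model

# small two-input model in the same spirit as multimodal_model1
in1, in2 = Input(shape=(4,)), Input(shape=(4,))
out = Dense(3, activation='softmax')(concatenate([in1, in2]))
model = Model(inputs=[in1, in2], outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy')

# the i-th sample of x1 is paired with the i-th sample of x2, with ONE label per pair
x1 = np.random.rand(8, 4)                   # batch from database1
x2 = np.random.rand(8, 4)                   # batch from database2, same order
y = np.eye(3)[np.random.randint(0, 3, 8)]   # one-hot label per pair
model.fit([x1, x2], y, epochs=1)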

How to handle log(0) when using cross entropy

In order to keep the case simple and intuitive, I will use binary (0 and 1) classification for illustration.
Loss function
loss = np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY)) #cross entropy
cost = -np.sum(loss)/m #num of examples in batch is m
Probability of Y
predY is computed using the sigmoid, and the logits can be thought of as the output of the neural network before the classification step
predY = sigmoid(logits)  # binary case

def sigmoid(X):
    return 1/(1 + np.exp(-X))
Problem
Suppose we are running a feed-forward net.
Inputs: [3, 5]: 3 is the number of examples and 5 is the feature size (fabricated data)
Number of hidden units: 100 (only 1 hidden layer)
Iterations: 10000
Such an arrangement is set up to overfit. When it overfits, we can perfectly predict the probabilities for the training examples; in other words, the sigmoid outputs exactly 1 or 0 because the exponential saturates. If this is the case, we would have np.log(0), which is undefined. How do you usually handle this issue?
If you don't mind the dependency on scipy, you can use scipy.special.xlogy. You would replace the expression
np.multiply(np.log(predY), Y) + np.multiply((1 - Y), np.log(1 - predY))
with
xlogy(Y, predY) + xlogy(1 - Y, 1 - predY)
If you expect predY to contain very small values, you might get better numerical results using scipy.special.xlog1py in the second term:
xlogy(Y, predY) + xlog1py(1 - Y, -predY)
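Putting the xlogy version into the question's cost computation, a minimal sketch:

import numpy as np
from scipy.special import xlogy, xlog1py

def cost_fn(predY, Y, m):
    # xlogy/xlog1py return 0 where the first argument is 0,
    # so saturated predictions (predY exactly 0 or 1) no longer produce nan
    loss = xlogy(Y, predY) + xlog1py(1 - Y, -predY)
    return -np.sum(loss) / m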
Alternatively, knowing that the values in Y are either 0 or 1, you can compute the cost in an entirely different way:
Yis1 = Y == 1
cost = -(np.log(predY[Yis1]).sum() + np.log(1 - predY[~Yis1]).sum())/m
How do you usually handle this issue?
Add a small number (something like 1e-15) to predY; this barely changes the predictions, and it solves the log(0) issue.
BTW, if your algorithm outputs zeros and ones, it might be useful to check the histogram of returned probabilities; when the algorithm is that sure something is happening, it can be a sign of overfitting.
One common way to deal with log(x) and y / x, where x is always non-negative but can become 0, is to add a small constant (as written by Jakub).
You can also clip the value (e.g. tf.clip_by_value or np.clip).
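For instance, a minimal sketch of the clipping approach applied to the cost from the question:

import numpy as np

eps = 1e-15  # small constant; the exact value is a judgment call
predY_safe = np.clip(predY, eps, 1 - eps)  # keeps log(predY) and log(1 - predY) finite
loss = np.multiply(np.log(predY_safe), Y) + np.multiply(1 - Y, np.log(1 - predY_safe))
cost = -np.sum(loss) / m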

change loss function during training

Suppose my loss function is of the following form:
loss = a*loss_1 + (1-a)*loss_2
Suppose also that I am training for 100 steps. How can I dynamically change the loss function in TensorFlow so that "a" gradually changes from 1 to 0 during the 100 steps of training?
To be precise, I want my loss to be
loss = 1*loss_1+0*loss_2 = loss_1
at the beginning of training (at step 1)
and
loss = 0*loss_1+1*loss_2 = loss_2 at the end (step 100)
with some kind of gradual (doesn't have to be continuous) decrease in between.
Assuming that the value of a does not depend on the computation done at the current step, create a placeholder for a, then pass the value you want using the feed dictionary.
You can use tf.train.polynomial_decay.
tf.train.polynomial_decay(learning_rate=1, global_step=step_from_placeholder,
                          decay_steps=100, end_learning_rate=0,
                          power=1.0, cycle=False, name=None)
This computes
global_step = min(global_step, decay_steps)
decayed_learning_rate = (learning_rate - end_learning_rate) * \
                        (1 - global_step / decay_steps) ** power + end_learning_rate
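Putting the two suggestions together, a TF1-style sketch (it assumes loss_1 and loss_2 are tensors you have already defined):

import tensorflow as tf

step_ph = tf.placeholder(tf.int32, shape=[])  # fed with the current step (1..100)
a = tf.train.polynomial_decay(learning_rate=1.0, global_step=step_ph,
                              decay_steps=100, end_learning_rate=0.0, power=1.0)
loss = a * loss_1 + (1 - a) * loss_2  # a decays linearly from 1 to 0

# in the training loop:
# sess.run(train_op, feed_dict={step_ph: step})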

tensorflow : conv2d_transpose : Matching desired output dimensions

How can I force a certain dimensionality of the output of the conv2d_transpose layer? My problem is that I use it for upsampling, and I want to match the dimensionality of my labels and the output of the NN. For example, if I have a feature map of shape Bx25x40xC, how can I make it Bx100x160xC (i.e. upsample exactly 4x)?
It seems like the dimensions of the output can be calculated using
h = ((h_in - 1) * stride_h) + kernel_h - 2 * pad_h
w = ((w_in - 1) * stride_w) + kernel_w - 2 * pad_w
One can manipulate strides and kernels, but padding is controlled by the 'same'/'valid' algorithms which, to my understanding, means it is pretty much uncontrollable, as is the resulting output size. For comparison, in Caffe one can at least force the padding explicitly in an attempt to match the desired output.
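For what it's worth, the formula above can come out to an exact 4x upsample; a sketch using TF1's tf.layers API (the channel count 64 is just an example):

import tensorflow as tf

# h = ((h_in - 1) * stride_h) + kernel_h - 2 * pad_h
# with h_in = 25, stride = 4, kernel = 4 and 'valid' padding (pad = 0):
#   h = ((25 - 1) * 4) + 4 = 100, and likewise w = ((40 - 1) * 4) + 4 = 160
x = tf.placeholder(tf.float32, [None, 25, 40, 64])           # Bx25x40xC, C = 64
up = tf.layers.conv2d_transpose(x, filters=64, kernel_size=4,
                                strides=4, padding='valid')  # -> Bx100x160x64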