Understanding TensorFlow BasicLSTMCell kernel and bias shape

I want to better understand the shapes of the kernel and bias in TensorFlow's BasicLSTMCell.
class BasicLSTMCell(LayerRNNCell):
input_depth = inputs_shape[1].value
h_depth = self._num_units
self._kernel = self.add_variable(
shape=[input_depth + h_depth, 4 * self._num_units])
self._bias = self.add_variable(
shape=[4 * self._num_units],
Why does the kernel have shape [input_depth + h_depth, 4 * self._num_units] and the bias shape [4 * self._num_units]? Does the factor 4 come from the forget gate, block input, input gate and output gate? And what is the reason for summing input_depth and h_depth?
More information about my LSTM Network:
num_input = 12, timesteps = 820, num_hidden = 64, num_classes = 2.
With tf.trainable_variables() I get the following information:
Variable name: Variable:0 Shape: (64, 2) Parameters: 128
Variable name: Variable_1:0 Shape: (2,) Parameters: 2
Variable name: rnn/basic_lstm_cell/kernel:0 Shape: (76, 256) Parameters: 19456
Variable name: rnn/basic_lstm_cell/bias:0 Shape: (256,) Parameters: 256
The following code defines my LSTM network:
def RNN(x, weights, biases):
    x = tf.unstack(x, timesteps, 1)
    lstm_cell = rnn.BasicLSTMCell(num_hidden)
    outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
    return tf.matmul(outputs[-1], weights['out']) + biases['out']

First, about summing input_depth and h_depth: RNNs generally follow equations like h_t = W*h_{t-1} + V*x_t to compute the state h at time t. That is, we apply one matrix multiplication to the previous state and another to the current input, and add the two results. This is equivalent to concatenating h_{t-1} and x_t (call the result c), stacking the two matrices W and V into one matrix (call it S) and computing S*c.
Now we have only one matrix multiplication instead of two; I believe this can be parallelized more effectively, so it is done for performance reasons. Since h_{t-1} has size h_depth and x_t has size input_depth, the concatenated vector c has size input_depth + h_depth, which is the first dimension of the kernel.
Second, you are right about the factor 4 coming from the gates. This is essentially the same trick as above: instead of carrying out four separate matrix multiplications, one for the block input and one for each of the three gates (input, forget and output), we carry out a single multiplication that yields one big vector containing all four results concatenated. Then we can just split this vector into four parts. In the LSTM cell source code this happens in lines 627-633. The bias is fused in the same way, which is why its shape is [4 * self._num_units].
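To make the two points above concrete, here is a minimal framework-free sketch in plain Python with toy sizes. The helper names (row_times_matrix, lstm_param_shapes) are made up for illustration and are not part of TensorFlow. It assumes the input is concatenated in front of the previous state, which is my reading of the TF 1.x source. It checks that one multiplication with the fused kernel matches two separate multiplications added together, and that the result splits into four chunks of num_units values.

```python
import random

def row_times_matrix(vec, mat):
    """Multiply a row vector (length n) by a matrix given as n rows of m columns."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]

random.seed(0)
input_depth, num_units = 3, 2  # toy sizes; the question uses 12 and 64

x_t = [random.uniform(-1, 1) for _ in range(input_depth)]
h_prev = [random.uniform(-1, 1) for _ in range(num_units)]

# Fused kernel of shape [input_depth + num_units, 4 * num_units].
kernel = [[random.uniform(-1, 1) for _ in range(4 * num_units)]
          for _ in range(input_depth + num_units)]

# One multiplication on the concatenated vector [x_t, h_prev].
fused = row_times_matrix(x_t + h_prev, kernel)

# The same result from two multiplications: the first input_depth rows act
# on x_t, the remaining num_units rows act on h_prev.
separate = [a + b for a, b in zip(
    row_times_matrix(x_t, kernel[:input_depth]),
    row_times_matrix(h_prev, kernel[input_depth:]))]

# Split the fused result into four chunks of num_units values each
# (block input plus three gates).
chunks = [fused[k * num_units:(k + 1) * num_units] for k in range(4)]

def lstm_param_shapes(input_depth, num_units):
    """Kernel and bias shapes implied by the fused layout."""
    return (input_depth + num_units, 4 * num_units), (4 * num_units,)

print(lstm_param_shapes(12, 64))  # ((76, 256), (256,))
```

With num_input = 12 and num_hidden = 64 this reproduces the (76, 256) kernel and (256,) bias listed in the question, and 76 * 256 = 19456 parameters.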


Custom TensorFlow loss function with batch size > 1?

I have a neural network with the following code snippets; note that batch_size == 1 and input_dim == output_dim:
net_in = tf.Variable(tf.zeros(shape = [batch_size, input_dim]), dtype=tf.float32)
input_placeholder = tf.compat.v1.placeholder(shape = [batch_size, input_dim], dtype=tf.float32)
assign_input = net_in.assign(input_placeholder)
# Some matmuls, activations, dropouts, normalizations...
net_out = tf.tanh(output_before_activation)
def loss_fn(output, input):
    # input.shape == output.shape == (batch_size, input_dim)
    output = tf.reshape(output, [input_dim,])  # reshape into 1-D vectors
    input = tf.reshape(input, [input_dim,])
    return my_fn_that_only_takes_in_vectors(output, input)
# Create session, preprocess data ...
for epoch in range(epoch_num):
    for batch in range(total_example_num // batch_size):
        sess.run(assign_input, feed_dict = {input_placeholder : some_appropriate_numpy_array})
        sess.run(optimizer.minimize(loss_fn(net_out, net_in)))
Currently the neural network above works fine, but it is very slow because it updates the weights after every sample (batch size = 1). I would like to set batch size > 1, but my_fn_that_only_takes_in_vectors cannot accommodate matrices whose first dimension is not 1. Due to the nature of my custom loss, flattening the batch input into a vector of length (batch_size * input_dim) does not seem to work.
How would I write my new custom loss_fn now that the input and output are N x input_dim where N > 1? In Keras this would not have been an issue because keras somehow takes the average of the gradients of each example in the batch. For my TensorFlow function, should I take each row as a vector individually, pass them to my_fn_that_only_takes_in_vectors, then take the average of the results?
You can use a function that computes the loss on the whole batch and works independently of the batch size. The operations are applied along the whole first dimension of the input (the first dimension indexes the elements of the batch). Here is an example; I hope it helps to show how the operations are carried out:
import tensorflow as tf

def my_loss(y_true, y_pred):
    dx2 = tf.math.squared_difference(y_true[:, 0], y_true[:, 2])  # shape: (BatchSize,)
    dy2 = tf.math.squared_difference(y_true[:, 1], y_true[:, 3])  # shape: (BatchSize,)
    denominator = dx2 + dy2  # shape: (BatchSize,)
    dst_vec = tf.math.squared_difference(y_true, y_pred)  # shape: (BatchSize, n_labels)
    numerator = tf.reduce_sum(dst_vec, axis=-1)  # shape: (BatchSize,)
    # Vector containing the loss of each element of the batch, shape: (BatchSize,)
    loss_vector = tf.cast(numerator / denominator, dtype="float32")
    loss = tf.reduce_sum(loss_vector)  # sums the losses; use tf.reduce_mean to average instead
    return loss
I am not sure whether you need to return the sum or the average of the losses for the batch.
If you sum, make sure to use a validation dataset with the same batch size, otherwise the losses are not comparable.
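The row-by-row scheme the question asks about can also be sketched without TensorFlow. The snippet below is a plain-Python illustration only: vector_loss is a hypothetical stand-in for my_fn_that_only_takes_in_vectors (here simply a sum of squared differences), and batch_loss is an assumed helper name. It applies the per-vector loss to every row of the batch and then sums or averages the results.

```python
def vector_loss(output, target):
    """Hypothetical per-example loss on two 1-D vectors (sum of squared differences)."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

def batch_loss(outputs, targets, reduce="mean"):
    """Apply vector_loss to every row of the batch, then sum or average."""
    per_example = [vector_loss(o, t) for o, t in zip(outputs, targets)]
    total = sum(per_example)
    return total / len(per_example) if reduce == "mean" else total

outputs = [[1.0, 2.0], [0.0, 0.0], [3.0, 1.0]]
targets = [[1.0, 0.0], [1.0, 1.0], [3.0, 1.0]]

print(batch_loss(outputs, targets, reduce="sum"))   # 6.0
print(batch_loss(outputs, targets, reduce="mean"))  # 2.0
```

The mean keeps the same scale when the batch size changes, while the sum grows with it, which is why summed losses are only comparable across equal batch sizes.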

What's the effect of projection layer after n-grams convolution banks?

I'm studying the CBHG module that Tacotron uses to extract representations from sequences.
CBHG consists of a 1-D convolution bank, a highway network and a bidirectional GRU.
The inputs to the 1-D convolutions come from a lookup table that embeds the characters a~z.
After the 1-D convolutions are computed, tf.concat concatenates their results.
A projection is then applied to this concatenated result.
I think it is related to word2vec embeddings such as CBOW, but I find it very hard to understand.
What is the effect of the projection layer after the results of the n-gram convolutions are concatenated? Does it extract the meaningful 'n' from the n-gram convolution bank?
Please help me.
with tf.variable_scope('conv_bank'):
# Convolution bank: concatenate on the last axis
# to stack channels from all convolutions
conv_fn = lambda k: \
conv1d(inputs, k, bank_channel_size,
tf.nn.relu, is_training, 'conv1d_%d' % k)
conv_outputs = tf.concat(
[conv_fn(k) for k in range(1, bank_size+1)], axis=-1,
# Maxpooling:
maxpool_output = tf.layers.max_pooling1d(
# Two projection layers:
proj_out = maxpool_output
for idx, proj_size in enumerate(proj_sizes):
activation_fn = None if idx == len(proj_sizes) - 1 else tf.nn.relu
proj_out = conv1d(
proj_out, proj_width, proj_size, activation_fn,
is_training, 'proj_{}'.format(idx + 1))
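As a rough aid for reading the snippet above, the sketch below only tracks the size of the channel axis through the bank and the projections; it does not run any convolution. The function name and the example sizes are illustrative assumptions, not values taken from the code above. As far as I understand the Tacotron paper, the projections bring the wide concatenated bank output back to a fixed width so that it can be added to the original input through a residual connection.

```python
def cbhg_channel_trace(input_channels, bank_size, bank_channel_size, proj_sizes):
    """Track the size of the last (channel) axis through bank, max-pool and projections."""
    trace = [("input", input_channels)]
    # Each of the bank_size convolutions (widths 1..bank_size) outputs
    # bank_channel_size channels; concatenating on the last axis stacks them.
    trace.append(("conv_bank_concat", bank_size * bank_channel_size))
    # Max pooling over time leaves the channel count unchanged.
    trace.append(("maxpool", bank_size * bank_channel_size))
    # Each projection maps the channel axis to proj_size.
    for idx, proj_size in enumerate(proj_sizes):
        trace.append(("proj_%d" % (idx + 1), proj_size))
    return trace

# Illustrative sizes only.
for name, channels in cbhg_channel_trace(128, 16, 128, [128, 128]):
    print(name, channels)
```

In this toy configuration the bank widens the representation from 128 to 2048 channels and the last projection returns it to 128, matching the input width.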

How to extract all weights from LSTM cell in vanilla TensorFlow?

I am training an LSTM network:
cell_fw = tf.contrib.rnn.BasicLSTMCell(HIDDEN_SIZE)
cell_bw = tf.contrib.rnn.BasicLSTMCell(HIDDEN_SIZE)
rnn_outputs, final_state_fw, final_state_bw = tf.contrib.rnn.static_bidirectional_rnn(
Then I try to save its coefficients:
d = {}
with tf.Session() as sess:
    # train code ...
    variables_names = [v.name for v in tf.global_variables()]
    values = sess.run(variables_names)
    for k, v in zip(variables_names, values):
        d[k] = v
The dictionary d has only 2 objects for each LSTM cell:
[(k,v.shape) for (k,v) in sorted(d.items(), key=lambda x:x[0])]
[('bidirectional_rnn/bw/basic_lstm_cell/biases:0', (1024,)),
('bidirectional_rnn/bw/basic_lstm_cell/weights:0', (272, 1024)),
('bidirectional_rnn/fw/basic_lstm_cell/biases:0', (1024,)),
('bidirectional_rnn/fw/basic_lstm_cell/weights:0', (272, 1024)),
('char_embedding:0', (70, 16)),
('softmax_biases:0', (5068,)),
('softmax_weights:0', (5068, 512))]
I'm puzzled. Shouldn't each LSTM cell contain 4 trainable layers? If so, how do I get all the weights from an LSTM cell?
The 4 weight matrices (and biases) of an LSTM cell are stored as a single tensor, where slices along the second axis correspond to the different kinds of weights (input gate, forget gate, etc.).
For instance, I guess that in your case the value of HIDDEN_SIZE is 256, since 1024 = 4 * 256 and 272 = 16 + 256 (the embedding size plus the hidden size).
To access the different parts, you should slice the tensors along the axis of length 1024 (but I don't know in which order the different kinds of weights are stored...).
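The slicing can be sketched in plain Python as follows. The helper split_lstm_kernel and the gate labels are hypothetical names chosen for illustration. The column order used here (i, j, f, o, meaning input gate, new input, forget gate, output gate) and the row order (input rows first, recurrent rows second) are assumptions based on my reading of the TF 1.x BasicLSTMCell source, so check them against the version you use.

```python
def split_lstm_kernel(kernel, input_size, hidden_size):
    """Split a fused [input_size + hidden_size, 4 * hidden_size] kernel.

    Returns a dict mapping a gate label to (input_weights, recurrent_weights).
    The gate order i, j, f, o is an assumption, not a guarantee.
    """
    assert len(kernel) == input_size + hidden_size
    assert all(len(row) == 4 * hidden_size for row in kernel)
    names = ["i_input_gate", "j_new_input", "f_forget_gate", "o_output_gate"]
    parts = {}
    for k, name in enumerate(names):
        cols = slice(k * hidden_size, (k + 1) * hidden_size)
        block = [row[cols] for row in kernel]
        # Rows 0..input_size-1 multiply the input, the rest multiply h.
        parts[name] = (block[:input_size], block[input_size:])
    return parts

# Dummy kernel with the shape from the question: (272, 1024) = (16 + 256, 4 * 256).
kernel = [[0.0] * 1024 for _ in range(272)]
parts = split_lstm_kernel(kernel, input_size=16, hidden_size=256)
w_in, w_rec = parts["f_forget_gate"]
print(len(w_in), len(w_in[0]), len(w_rec), len(w_rec[0]))  # 16 256 256 256
```

With the NumPy arrays stored in d, the same blocks can be taken directly with slices such as v[:16, k * 256:(k + 1) * 256] and v[16:, k * 256:(k + 1) * 256]; the bias splits the same way along its only axis.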

What does `y_dim` represent in the following GANs model?

This question is about an implementation example of GANs using TensorFlow on images.
I've excerpted the part of the code that I thought was enough to provide context, reference for full code. The following code defines a discriminator function, which can loosely be considered a typical convolutional neural network. As can be seen, the discriminator's operations are conditioned on y_dim. Could someone help explain what y_dim is? Even after looking at Args, I am still very confused about the definition of y_dim.
class DCGAN(object):
def __init__(self, sess, image_size=108, is_crop=True,
batch_size=64, sample_size=64, output_size=64,
y_dim=None, z_dim=100, gf_dim=64, df_dim=64,
gfc_dim=1024, dfc_dim=1024, c_dim=3, dataset_name='default',
checkpoint_dir=None, sample_dir=None):
sess: TensorFlow session
batch_size: The size of batch. Should be specified before training.
output_size: (optional) The resolution in pixels of the images. [64]
y_dim: (optional) Dimension of dim for y. [None]
z_dim: (optional) Dimension of dim for Z. [100]
gf_dim: (optional) Dimension of gen filters in first conv layer. [64]
df_dim: (optional) Dimension of discrim filters in first conv layer. [64]
gfc_dim: (optional) Dimension of gen units for for fully connected layer. [1024]
dfc_dim: (optional) Dimension of discrim units for fully connected layer. [1024]
c_dim: (optional) Dimension of image color. For grayscale input, set to 1. [3]
def discriminator(self, image, y=None, reuse=False):
if reuse:
if not self.y_dim:
h0 = lrelu(conv2d(image, self.df_dim, name='d_h0_conv'))
h1 = lrelu(self.d_bn1(conv2d(h0, self.df_dim * 2, name='d_h1_conv')))
h2 = lrelu(self.d_bn2(conv2d(h1, self.df_dim * 4, name='d_h2_conv')))
h3 = lrelu(self.d_bn3(conv2d(h2, self.df_dim * 8, name='d_h3_conv')))
h4 = linear(tf.reshape(h3, [self.batch_size, -1]), 1, 'd_h3_lin')
return tf.nn.sigmoid(h4), h4
yb = tf.reshape(y, [self.batch_size, 1, 1, self.y_dim])
x = conv_cond_concat(image, yb)
h0 = lrelu(conv2d(x, self.c_dim + self.y_dim, name='d_h0_conv'))
h0 = conv_cond_concat(h0, yb)
h1 = lrelu(self.d_bn1(conv2d(h0, self.df_dim + self.y_dim, name='d_h1_conv')))
h1 = tf.reshape(h1, [self.batch_size, -1])
h1 = tf.concat(1, [h1, y])
h2 = lrelu(self.d_bn2(linear(h1, self.dfc_dim, 'd_h2_lin')))
h2 = tf.concat(1, [h2, y])
h3 = linear(h2, 1, 'd_h3_lin')
return tf.nn.sigmoid(h3), h3
y_dim is the length of the label vector y.
The DCGAN class seems to use it to answer the question 'do we know the number of training labels?'. If we do, all these tensors can be defined with the known dimensions and the discriminator is conditioned on y.
E.g. look at the main.py file where the MNIST dataset is used: y_dim is set to 10 for the labels 0-9, and it is None for the second option.
Hope this helps.
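To illustrate how a label of length y_dim ends up in the shapes, here is a small plain-Python sketch of the conditioning step. The helpers one_hot and cond_concat are hypothetical and only mimic my understanding of what conv_cond_concat does (appending the label to the channels of every pixel); the batch axis is left out for brevity.

```python
def one_hot(label, y_dim):
    """Encode an integer class label as a vector of length y_dim."""
    return [1.0 if k == label else 0.0 for k in range(y_dim)]

def cond_concat(feature_map, y):
    """Append the label vector y to the channels of every pixel.

    feature_map is a nested list of shape [height][width][channels];
    the result has channels + len(y) channels.
    """
    return [[pixel + y for pixel in row] for row in feature_map]

y_dim = 10                                               # e.g. MNIST digits 0-9
y = one_hot(3, y_dim)
image = [[[0.5] for _ in range(28)] for _ in range(28)]  # 28 x 28 x 1
x = cond_concat(image, y)
print(len(x), len(x[0]), len(x[0][0]))                   # 28 28 11
```

This is why the conditioned branch of the discriminator uses sizes such as self.c_dim + self.y_dim: every concatenation with y adds y_dim entries to the channel axis.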

RNN & Batches in Tensorflow

The batch approach for RNNs in TensorFlow is not clear to me. For example, tf.nn.rnn takes as input a list of tensors of shape [BATCH_SIZE x INPUT_SIZE]. We normally feed batches of data to the session, so why does it take a list of batches rather than a single batch?
This leads to my next confusion:
data = []
for _ in range(0, len(train_input)):
data.append(tf.placeholder(tf.float32, [CONST_BATCH_SIZE, CONST_INPUT_SIZE]))
lstm = tf.nn.rnn_cell.BasicLSTMCell(CONST_NUM_OF_HIDDEN_STATES)
val, state = tf.nn.rnn(lstm, data, dtype=tf.float32)
I pass a list of tensors [CONST_BATCH_SIZE x CONST_INPUT_SIZE] to tf.nn.rnn and get an output value that is a list of tensors [CONST_BATCH_SIZE x CONST_NUM_OF_HIDDEN_STATES]. Now I want to apply softmax to all the HIDDEN_STATES outputs, so I need to compute a matmul with the weights plus a bias.
Should I use matmul like this:
weight = tf.Variable(tf.zeros([CONST_NUM_OF_HIDDEN_STATES, CONST_OTPUT_SIZE]))
for i in val:
mult = tf.matmul(i, weight)
bias = tf.Variable(tf.zeros([CONST_OTPUT_SIZE]))
prediction = tf.nn.softmax(mult + bias)
Or should I create a 2-D array from val and then use tf.matmul without the for loop?
This should work. Here output is the batched data from the RNN, and probs will hold the probabilities for every input in the batch.
logits = tf.matmul(output, softmax_w) + softmax_b
probs = tf.nn.softmax(logits)
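The two options from the question give the same numbers, which the following plain-Python sketch checks on toy data. The matmul helper and the sample values are made up for illustration and stand in for tf.matmul and the real RNN outputs.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Toy RNN outputs: a list of 3 time steps, each of shape [batch=2, hidden=2].
val = [[[1.0, 0.0], [0.0, 1.0]],
       [[2.0, 1.0], [1.0, 2.0]],
       [[0.0, 3.0], [3.0, 0.0]]]
weight = [[1.0, 2.0, 3.0],   # shape [hidden=2, output=3]
          [4.0, 5.0, 6.0]]

# Option 1: loop over the time steps, reusing the same weight matrix.
per_step = [matmul(step, weight) for step in val]

# Option 2: stack all steps into one [timesteps * batch, hidden] matrix
# and multiply once.
stacked = [row for step in val for row in step]
all_at_once = matmul(stacked, weight)

# Both options give the same rows, in the same order.
flattened = [row for step in per_step for row in step]
print(flattened == all_at_once)  # True
```

In TensorFlow the stacked variant would typically be a tf.concat or tf.reshape followed by a single tf.matmul. Either way, the weight and bias variables should be created once and shared across all time steps.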