calculating the number of parameters of a GRU layer (Keras) - tensorflow

Why is the number of parameters of the GRU layer 9600?
Shouldn't it be ((16+32)*32 + 32) * 3 * 2 = 9,408?
or, rearranging,
32*(16 + 32 + 1)*3*2 = 9408
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=4500, output_dim=16, input_length=200),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

The key is that TensorFlow separates the biases for the input and recurrent kernels when the parameter reset_after=True in GRUCell. You can look at the relevant source code of GRUCell as follows:
if self.use_bias:
    if not self.reset_after:
        bias_shape = (3 * self.units,)
    else:
        # separate biases for input and recurrent kernels
        # Note: the shape is intentionally different from CuDNNGRU biases
        # `(2 * 3 * self.units,)`, so that we can distinguish the classes
        # when loading and converting saved weights.
        bias_shape = (2, 3 * self.units)
Taking the reset gate as an example, we generally see the following formulas.
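r_t = sigmoid(x_t·W_xr + h_(t-1)·W_hr + b_r)
i.e. a single bias vector b_r (size 32 here) for the reset gate.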
But if we set reset_after=True, the actual formula is as follows:
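r_t = sigmoid(x_t·W_xr + b_xr + h_(t-1)·W_hr + b_hr)
i.e. two separate bias vectors, b_xr for the input kernel and b_hr for the recurrent kernel, each of size 32; this extra bias per gate is exactly the extra +32 in the count below.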
As you can see, the default for GRU is reset_after=True in TensorFlow 2, whereas it is reset_after=False in TensorFlow 1.x.
So the number of parameters of the GRU layer is ((16+32)*32 + 32 + 32) * 3 * 2 = 9600 in TensorFlow 2.
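As a quick check (a minimal sketch reusing the layer sizes from the question), flipping the flag back to the TensorFlow 1.x default reproduces the expected count of 9,408:
model_v1_style = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=4500, output_dim=16, input_length=200),
    tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32, reset_after=False)),
])
model_v1_style.summary()  # the Bidirectional GRU now reports 9,408 parameters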

I figured out a little bit more about this, as an addition to the accepted answer. What Keras does in GRUCell.call() is:
With reset_after=False (default in TensorFlow 1):
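z = sigmoid(x·W_xz + b_xz + h_prev·W_hz + b_hz)
r = sigmoid(x·W_xr + b_xr + h_prev·W_hr + b_hr)
h' = tanh(x·W_xh + b_xh + (r * h_prev)·W_hh + b_hh)
(the reset gate is applied to h_prev before the recurrent matmul; * denotes element-wise multiplication)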
With reset_after=True (default in TensorFlow 2):
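z = sigmoid(x·W_xz + b_xz + h_prev·W_hz + b_hz)
r = sigmoid(x·W_xr + b_xr + h_prev·W_hr + b_hr)
h' = tanh(x·W_xh + b_xh + r * (h_prev·W_hh + b_hh))
(the reset gate is applied after the recurrent matmul, so b_hh stays inside the gated term)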
After training with reset_after=False, b_xz equals b_hz, b_xr equals b_hr and b_xh equals b_hh, because (I assume) TensorFlow realizes that each of these pairs of vectors can be combined into one single parameter vector - just like the OP pointed out in a comment above. However, with reset_after=True, that's not the case for b_xh and b_hh - they can and will be different, so they cannot be combined into one vector, and that's why the total parameter count is higher.
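You can see the two separate bias sets directly in the weights of the model from the question; a small sketch, assuming the Bidirectional layer is model.layers[1]:
gru_fwd = model.layers[1].forward_layer   # the forward GRU inside the Bidirectional wrapper
kernel, recurrent_kernel, bias = gru_fwd.get_weights()
print(kernel.shape)            # (16, 96): W_x for z, r, h stacked
print(recurrent_kernel.shape)  # (32, 96): W_h for z, r, h stacked
print(bias.shape)              # (2, 96): separate input and recurrent biases (reset_after=True)
# 16*96 + 32*96 + 2*96 = 4800 parameters per direction, i.e. 9600 for the Bidirectional layer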

Related

How to specify input layer with Keras

I came across this code for tuning the topology of a neural network. However, I am unsure how I can instantiate the first layer without flattening the input.
My input is a matrix with M features (the rows) and N samples (the columns).
How can I create the first (input) layer?
# Initialize sequential API and start building model.
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=(28, 28)))

# Tune the number of hidden layers and units in each.
# Number of hidden layers: 1 - 5
# Number of Units: 32 - 512 with stepsize of 32
for i in range(1, hp.Int("num_layers", 2, 6)):
    model.add(
        keras.layers.Dense(
            units=hp.Int("units_" + str(i), min_value=32, max_value=512, step=32),
            activation="relu")
    )

# Tune dropout layer with values from 0 - 0.3 with stepsize of 0.1.
model.add(keras.layers.Dropout(hp.Float("dropout_" + str(i), 0, 0.3, step=0.1)))

# Add output layer.
model.add(keras.layers.Dense(units=10, activation="softmax"))
I know that Keras usually instantiates the first hidden layer along with the input layer, but I don't see how I can do it in this framework. Below is the code for instantiating input + first hidden layer at once.
model.add(Dense(100, input_shape=(CpG_num,), kernel_initializer='normal', activation='relu'))
If you have multiple input features and want to set your input shape, suppose you have a dataframe with m rows and n columns; then simply do this:
import numpy as np
import tensorflow as tf

m = 1000  # number of rows (samples)
n = 10    # number of columns (features)
no_of_units = 64

# We do not put m in the shape: the m samples form the batch dimension.
_input = tf.keras.layers.Input(shape=(n,))
dense = tf.keras.layers.Dense(no_of_units)(_input)
output = tf.keras.backend.function([_input], [dense])

# Now check that it is working:
x = np.random.randn(m, n)
print(output([x])[0].shape)  # (1000, 64)
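Applied to the tuner code in the question, you can drop the Flatten layer and let the first Dense declare the input shape instead; a minimal sketch, assuming M is the number of features per sample (note that Keras expects samples along the first axis, so a features-by-samples matrix would be fed as its transpose, with shape (N, M)):
model = keras.Sequential()
# First hidden layer also declares the input shape: M features per sample;
# the number of samples becomes the implicit batch dimension.
model.add(keras.layers.Dense(
    units=hp.Int("units_1", min_value=32, max_value=512, step=32),
    activation="relu",
    input_shape=(M,)))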

Non-linear loss combination

My network has 2 outputs. I'm trying to define a loss over the two outputs that is not a linear sum of two separate losses:
def weightedBCE(y_true, y_pred):
    assert y_pred.shape[2] == 2
    y_pred_val = y_pred[:, :, 0]
    stds = y_pred[:, :, 1]
    bce = K.binary_crossentropy(y_true, y_pred_val)
    loss = bce * (1. + LAM * stds)
    return loss
The final layers of my model are defined like this (outSall has 3 values):
std = make_std_model()(outSall)
final = Dense(1, activation="sigmoid")(outSall)
output = concatenate([final, std], axis=-1)
But it doesn't work, because Keras expects one loss function per output, while my loss uses both outputs of the network together.
The first output is a standard classification output with binary cross-entropy loss, but I want it to be multiplied by (1 + LAM*stds), where LAM is a lambda factor multiplying stds. stds is the second output of the network.
How can I do this?
assert y_pred.shape[2] == 2
IndexError: list index out of range
Update:
I had an extra index, now fixed (see below), but I get the error pasted below.
def weightedBCE(y_true, y_pred):
    assert y_pred.shape[1] == 2
    y_pred_val = y_pred[:, 0]
    stds = y_pred[:, 1]
    bce = K.binary_crossentropy(y_true, y_pred_val)
    loss = bce * (1. + LAM * stds)
    return loss
ValueError: logits and labels must have the same shape ((?,) vs (?, ?)
Update 2:
Keras assumes that y_true has the same shape as y_pred, which was the problem. I changed the loss to:
def weightedBCE(y_true, y_pred):
    assert y_pred.shape[1] == 2
    y_pred_val = y_pred[:, 0]
    stds = y_pred[:, 1]
    bce = K.binary_crossentropy(y_true[:, 0], y_pred_val)
    loss = bce * (1. + LAM * stds)
    return loss
There is still some problem with handling the two outputs; see Binary Cross Entropy not giving similar results when I have 2 outputs.
Instead of creating a Keras model with two outputs, create a Keras model with a single output that is a concatenation of the two tensors (you can use keras.layers.Concatenate for that). Then you can compile the model with a single custom loss function, such as the one you wrote above.
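A minimal sketch of that suggestion, reusing the pieces from the question (make_std_model, outSall and LAM are assumed to be defined as above, and inputs stands for your model's input tensor):
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Concatenate, Dense
from tensorflow.keras.models import Model

final = Dense(1, activation="sigmoid")(outSall)  # classification output
std = make_std_model()(outSall)                  # uncertainty output
output = Concatenate(axis=-1)([final, std])      # single output of shape (batch, 2)

def weightedBCE(y_true, y_pred):
    # column 0 holds the prediction, column 1 holds stds;
    # y_true is padded to the same (batch, 2) shape, with the label in column 0
    bce = K.binary_crossentropy(y_true[:, 0], y_pred[:, 0])
    return bce * (1. + LAM * y_pred[:, 1])

model = Model(inputs=inputs, outputs=output)
model.compile(optimizer="adam", loss=weightedBCE)
# fit with y of shape (batch, 2), e.g. np.stack([labels, np.zeros_like(labels)], axis=1)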

CNTK Python: Dense Layer output size doesn't match expectation?

I'm training the tutorials/language understanding model in CNTK/Python
def create_model():
    with C.layers.default_options(initial_state=0.1):
        return C.layers.Sequential([
            C.layers.Embedding(emb_dim, name='embed'),
            C.layers.Recurrence(C.layers.LSTM(hidden_dim), go_backwards=False),
            C.layers.Dense(num_labels, name='classify')
        ])

model = model_func(x)
For some reason, model.eval(data)[0].shape is (2, 16), not (1, 16), where num_labels = 16. I'm very confused. Why is it (2, 16) instead of (1, 16), given that the last layer is a Dense layer with size num_labels = 16?
Thanks!
Most likely the data element you are passing in has shape (2, x), i.e. you are passing in multiple values for evaluation, so eval() returns a prediction for each of the values you passed to the model.

How to calculate input_dim for a keras sequential model?

Keras Dense layer needs an input_dim or input_shape to be specified. What value do I put in there?
My input is a matrix of 1,000,000 rows and only 3 columns. My output is 1,600 classes.
What do I put there?
Is it the dimensionality of the inputs, (1000000, 1600)?
Or 2, because it's a 2D matrix?
input_dim is the number of dimensions of the features; in your case that is just 3. The equivalent input_shape, which is an actual shape tuple, is (3,).
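For example, these two declarations of the first layer are equivalent (a small sketch with an arbitrary width of 32 units):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# 3 input features per sample; the 1,000,000 rows form the sample/batch axis
model_a = Sequential([Dense(32, input_dim=3, activation='relu')])
model_b = Sequential([Dense(32, input_shape=(3,), activation='relu')])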
In your case,
let's assume x and y (the target variable) look as follows after feature engineering:
x.shape
(1000000, 3)
y.shape
(1000000, 1600)
# as first layer in a sequential model:
model = Sequential()
model.add(Dense(32, input_shape=(x.shape[1],)))  # input layer: 3 features per sample
# now the model will take as input arrays of shape (*, 3)
# and output arrays of shape (*, 32)
...
...
model.add(Dense(y.shape[1], activation='softmax'))  # output layer: 1600 classes
y.shape[1] = 1600 is the number of outputs, which is the number of classes you have, since you are dealing with classification.
X = dataset.iloc[:, 3:13]
meaning X takes all the rows and columns 3 to 12 inclusive (column 13 is excluded), i.e. 10 feature columns.
We will also have an extra X0 input to be given to the neural network, so the total number of inputs becomes 10 + 1 = 11.
Dense(6, input_dim=11, activation='relu', kernel_initializer='he_uniform')  # 6 is an arbitrary hidden-layer width; input_dim matches the 11 inputs

Implementing contrastive loss and triplet loss in Tensorflow

I started to play with TensorFlow two days ago and I'm wondering whether the triplet and contrastive losses are implemented.
I've been looking at the documentation, but I haven't found any example or description about these things.
Update (2018/03/19): I wrote a blog post detailing how to implement triplet loss in TensorFlow.
You need to implement the contrastive loss or the triplet loss yourself, but once you know the pairs or triplets, this is quite easy.
Contrastive Loss
Suppose you have as input the pairs of data and their label (positive or negative, i.e. same class or different class). For instance you have images as input of size 28x28x1:
left = tf.placeholder(tf.float32, [None, 28, 28, 1])
right = tf.placeholder(tf.float32, [None, 28, 28, 1])
label = tf.placeholder(tf.int32, [None, 1])  # 0 if same, 1 if different
label = tf.to_float(tf.squeeze(label, axis=1))  # shape [None], so it broadcasts against d below
margin = 0.2

left_output = model(left)    # shape [None, 128]
right_output = model(right)  # shape [None, 128]

d = tf.reduce_sum(tf.square(left_output - right_output), 1)
d_sqrt = tf.sqrt(d)

loss = label * tf.square(tf.maximum(0., margin - d_sqrt)) + (1 - label) * d
loss = 0.5 * tf.reduce_mean(loss)
Triplet Loss
Same as with contrastive loss, but with triplets (anchor, positive, negative). You don't need labels here.
anchor_output = ... # shape [None, 128]
positive_output = ... # shape [None, 128]
negative_output = ... # shape [None, 128]
d_pos = tf.reduce_sum(tf.square(anchor_output - positive_output), 1)
d_neg = tf.reduce_sum(tf.square(anchor_output - negative_output), 1)
loss = tf.maximum(0., margin + d_pos - d_neg)
loss = tf.reduce_mean(loss)
The real trouble when implementing triplet loss or contrastive loss in TensorFlow is how to sample the triplets or pairs. I will focus on generating triplets because it is harder than generating pairs.
The easiest way is to generate them outside of the TensorFlow graph, i.e. in Python, and feed them to the network through the placeholders. Basically you select images 3 at a time, with the first two from the same class and the third from another class. You then perform a feedforward on these triplets and compute the triplet loss.
The issue here is that generating triplets is complicated. We want them to be valid triplets, i.e. triplets with a positive loss (otherwise the loss is 0 and the network doesn't learn).
To know whether a triplet is good or not you need to compute its loss, so you have already made one feedforward through the network...
Clearly, implementing triplet loss in TensorFlow is hard, and there are ways to make it more efficient than sampling in Python, but explaining them would require a whole blog post!
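For reference, a rough sketch of such offline sampling in plain numpy (assuming images and labels arrays; this only draws random triplets, not the hard ones discussed above):
import numpy as np

def sample_triplet(images, labels):
    classes = np.unique(labels)
    # anchor and positive come from one class, the negative from a different class
    pos_class, neg_class = np.random.choice(classes, size=2, replace=False)
    anchor_idx, positive_idx = np.random.choice(np.where(labels == pos_class)[0], size=2, replace=False)
    negative_idx = np.random.choice(np.where(labels == neg_class)[0])
    return images[anchor_idx], images[positive_idx], images[negative_idx]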
Triplet loss with semihard negative mining is now implemented in tf.contrib, as follows:
triplet_semihard_loss(
    labels,
    embeddings,
    margin=1.0
)
where:
Args:
    labels: 1-D tf.int32 Tensor with shape [batch_size] of multiclass
        integer labels.
    embeddings: 2-D float Tensor of embedding vectors. Embeddings should
        be l2 normalized.
    margin: Float, margin term in the loss definition.
Returns:
    triplet_loss: tf.float32 scalar.
For further information, check the link below:
https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/losses/metric_learning/triplet_semihard_loss
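A minimal usage sketch in TF 1.x, assuming model_output is the [None, 128] embedding tensor produced by your network and labels_ph holds the class id of each example:
labels_ph = tf.placeholder(tf.int32, [None])
embeddings = tf.nn.l2_normalize(model_output, axis=1)  # the loss expects l2-normalized embeddings

loss = tf.contrib.losses.metric_learning.triplet_semihard_loss(
    labels=labels_ph,
    embeddings=embeddings,
    margin=1.0)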
Tiago, I don't think you are using the same formula Olivier gave.
Here is the right code (not sure it will work though, just fixing the formula):
def compute_euclidean_distance(x, y):
    """
    Computes the squared euclidean distance between two tensorflow variables
    """
    d = tf.reduce_sum(tf.square(tf.subtract(x, y)), 1)
    return d

def compute_contrastive_loss(left_feature, right_feature, label, margin):
    """
    Compute the contrastive loss as in

    L = 0.5 * (1-Y) * D^2 + 0.5 * Y * {max(0, margin - D)}^2

    where Y = 0 for a similar pair, Y = 1 for a dissimilar pair,
    and D is the euclidean distance between the two features.

    **Parameters**
     left_feature: First element of the pair
     right_feature: Second element of the pair
     label: Label of the pair (0 or 1)
     margin: Contrastive margin

    **Returns**
     Return the loss operation
    """
    label = tf.to_float(label)
    one = tf.constant(1.0)

    d = compute_euclidean_distance(left_feature, right_feature)
    d_sqrt = tf.sqrt(d)

    first_part = tf.multiply(one - label, d)         # (1-Y) * D^2
    max_part = tf.square(tf.maximum(margin - d_sqrt, 0))
    second_part = tf.multiply(label, max_part)       # Y * {max(0, margin - D)}^2

    loss = 0.5 * tf.reduce_mean(first_part + second_part)
    return loss
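To plug this into the graph from the first answer, something like the following should work (a sketch, reusing left_output, right_output and label from above):
loss = compute_contrastive_loss(left_output, right_output, label, margin=0.2)
train_op = tf.train.AdamOptimizer(1e-4).minimize(loss)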