I have a question about the tf.keras.constraints API.
(1)
class WeightsSumOne(tf.keras.constraints.Constraint):
    def __call__(self, w):
        return tf.nn.softmax(w, axis=0)
output = layers.Dense(1, use_bias=False,
                      kernel_constraint=WeightsSumOne())(input)
(2)
intermediate = layers.Dense(1, use_bias = False)
intermediate.set_weights(tf.nn.softmax(intermediate.get_weights(), axis=0))
Do (1) and (2) perform the same process?
The reason I ask is that the Keras documentation says of constraints:
"They are per-variable projection functions applied to the target variable after each gradient update (when using fit())."
(https://keras.io/api/layers/constraints/)
Unlike in (1), I think the constraint in (2) is applied before each gradient update.
In my opinion, the gradients of the weights in (1) and (2) are different, because the softmax is applied before the gradient calculation in the second case, but after the gradient calculation in the first case.
If I am wrong, I would appreciate it if you could point out where.
They are not the same.
In the first case, the constraint is applied to the weights, but in the second case it is applied to the output of the dense layer (after multiplying with the inputs).
Construct a model for the first case:
import tensorflow as tf
from tensorflow import keras

inp = keras.Input(shape=(3,5))
out = keras.layers.Dense(1, use_bias=False, kernel_initializer=tf.ones_initializer(),
                         kernel_constraint=WeightsSumOne())(inp)
model = keras.Model(inp, out)
model.compile('adam', 'mse')
Dummy run:
inputs = tf.random.normal(shape=(1,3,5))
outputs = tf.random.normal(shape=(1,3,1))
model.fit(inputs, outputs, epochs=1)
Check the layer weights of model:
print(model.layers[1].get_weights()[0])
# outputs
array([[0.2],
       [0.2],
       [0.2],
       [0.2],
       [0.2]])
Construct the model for the second case:
inp = keras.Input(shape=(3,5))
out = keras.layers.Dense(1, activation='softmax', use_bias=False,
                         kernel_initializer=tf.ones_initializer())(inp)
model1 = keras.Model(inp, out)
model1.compile('adam', 'mse')
# dummy run
model1.fit(inputs, outputs, epochs=1)
Check the layer weights of model1:
print(model1.layers[1].get_weights()[0])
# outputs
array([[1.],
       [1.],
       [1.],
       [1.],
       [1.]])
We can see that the layer weights of model are the softmax of the layer weights of model1.
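To make the documented order concrete, here is a minimal NumPy sketch of what a kernel constraint does in one training step: take the gradient step on the raw kernel, then project the result. It assumes plain SGD rather than Adam, and the data, the learning rate `lr`, and the helper `softmax` are made up for illustration; it is not Keras's actual implementation.

```python
import numpy as np

def softmax(v, axis=0):
    # Numerically stable softmax along the given axis.
    shifted = v - v.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 5))   # 3 samples, 5 features (made-up data)
y = rng.normal(size=(3, 1))
w = np.ones((5, 1))           # kernel initialised to ones, as in the answer
lr = 0.1                      # hypothetical learning rate

# Gradient of the mean squared error w.r.t. the raw kernel.
grad = 2.0 * x.T @ (x @ w - y) / len(x)

# Constraint semantics: gradient step first, projection afterwards.
w_updated = w - lr * grad
w_projected = softmax(w_updated, axis=0)

print(w_projected.ravel())
print(w_projected.sum())      # sums to 1 up to rounding
```

The gradient here is computed on the unconstrained kernel; the softmax never enters the backward pass, which is the difference from putting a softmax inside the forward computation.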
I want to extract the gradient of an RNN model starting with an embedding layer using TensorFlow's GradientTape (using TensorFlow 1.14 with eager execution). The model is a simple LSTM binary classifier, which is trained with a binary cross-entropy loss:
inputs = Input(name='inputs', shape=[150])
layer = Embedding(2000, 50, input_length=150)(inputs)
layer = LSTM(64)(layer)
layer = Dense(256, name='FC1')(layer)
layer = Activation('relu')(layer)
layer = Dropout(0.5)(layer)
layer = Dense(1, name='out_layer')(layer)
layer = Activation('sigmoid')(layer)
model = Model(inputs=inputs, outputs=layer)
GradientTape should return "... a list or nested structure of Tensors (or IndexedSlices, or None, or CompositeTensor), one for each element in sources". What is the correct way to use it to recover (and apply) the gradient?
I tried the following code:
with tf.GradientTape() as tape:
    y_ = model(inputs)
    loss_value = BinaryCrossentropy()(y_true=targets, y_pred=y_)
grads = tape.gradient(loss_value, model.trainable_variables)
# some custom processing
optimizer = RMSprop(learning_rate=0.001, name="context")
optimizer.apply_gradients(list(zip(grads, model.trainable_variables)), name="context")
I would expect the returned gradient to be of size (2000,50), i.e., the shape of the weights of the embedding layer. Instead, its size depends on the batch size, and it cannot be used (at least with the code above) with apply_gradients. Changing the number of inputs consistently changes the first dimension of the gradient to batch_size * 150, while the shapes of the trainable variables stay correct. With 8 inputs, for example, I get the following result:
input shape: (8, 150), output shape: (8, 1)
model.trainable_variables shapes: (2000, 50),(50, 256),(64, 256),(256,),(64, 256),(256,),(256, 1),(1,)
tape.gradient shapes: (1200, 50),(50, 256),(64, 256),(256,),(64, 256),(256,),(256, 1),(1,)
With a batch size of 32, the first component would be (4800, 50), and so on. This doesn't match my understanding of GradientTape.gradient, since the returned gradient doesn't have the same size as the sources parameter. What did I miss?
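One likely explanation for the batch-dependent shape is that the gradient of an embedding lookup comes back as a tf.IndexedSlices (one row per looked-up token) rather than a dense tensor. The sketch below is an assumption-laden illustration, not the asker's model: it uses a bare tf.Variable named `table` with tf.nn.embedding_lookup in TensorFlow 2.x eager mode, and toy sizes chosen for readability.

```python
import tensorflow as tf

vocab_size, embed_dim = 20, 4            # toy sizes, not the asker's 2000 x 50
table = tf.Variable(tf.random.normal([vocab_size, embed_dim]))
token_ids = tf.constant([[1, 2, 3],
                         [3, 4, 5]])     # batch of 2 sequences, 3 tokens each

with tf.GradientTape() as tape:
    embedded = tf.nn.embedding_lookup(table, token_ids)
    loss = tf.reduce_sum(embedded ** 2)

grad = tape.gradient(loss, table)
print(type(grad).__name__)               # IndexedSlices
print(grad.values.shape)                 # one row per looked-up token: (6, 4)

# Scatter-add the per-token rows back into a dense (vocab_size, embed_dim) gradient.
dense_grad = tf.math.unsorted_segment_sum(
    grad.values, grad.indices, num_segments=vocab_size)
print(dense_grad.shape)                  # (20, 4)
```

With 8 sequences of length 150, the same mechanism would give 8 * 150 = 1200 rows in `grad.values`, which matches the shapes reported above. If the custom processing step needs a dense tensor, densifying as shown is one option.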
I am building a Siamese network using Keras (TensorFlow) where the target is a binary column, i.e., match or mismatch (1 or 0). But the model's fit method throws an error saying that the shape of y_pred is not compatible with the shape of y_true. I am using the binary_crossentropy loss function.
Here is the error I see:
Here is the code I am using:
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=[tf.keras.metrics.Recall()])
history = model.fit([X_train_entity_1.todense(), X_train_entity_2.todense()], np.array(y_train),
                    epochs=2,
                    batch_size=32,
                    verbose=2,
                    shuffle=True)
My Input data shapes are as follows:
Inputs:
X_train_entity_1.shape is (700,2822)
X_train_entity_2.shape is (700,2822)
Target:
y_train.shape is (700,1)
In the error it throws, y_pred is a variable that was created internally. Why is the dimension of y_pred 2822 when I have a binary target? The dimension 2822 matches the input size, but how should I understand this?
Here is the model I created:
in_layers = []
out_layers = []
for i in range(2):
    input_layer = Input(shape=(1,))
    embedding_layer = Embedding(embed_input_size+1, embed_output_size)(input_layer)
    lstm_layer_1 = Bidirectional(LSTM(1024, return_sequences=True, recurrent_dropout=0.2, dropout=0.2))(embedding_layer)
    lstm_layer_2 = Bidirectional(LSTM(512, return_sequences=True, recurrent_dropout=0.2, dropout=0.2))(lstm_layer_1)
    in_layers.append(input_layer)
    out_layers.append(lstm_layer_2)
merge = concatenate(out_layers)
dense1 = Dense(256, activation='relu', kernel_initializer='he_normal', name='data_embed')(merge)
drp1 = Dropout(0.4)(dense1)
btch_norm1 = BatchNormalization()(drp1)
dense2 = Dense(32, activation='relu', kernel_initializer='he_normal')(btch_norm1)
drp2 = Dropout(0.4)(dense2)
btch_norm2 = BatchNormalization()(drp2)
output = Dense(1, activation='sigmoid')(btch_norm2)
model = Model(inputs=in_layers, outputs=output)
model.summary()
Since my data is very sparse, I used todense. And there the type is as follows:
type(X_train_entity_1) is scipy.sparse.csr.csr_matrix
type(X_train_entity_1.todense()) is numpy.matrix
type(X_train_entity_2) is scipy.sparse.csr.csr_matrix
type(X_train_entity_2.todense()) is numpy.matrix
Summary of last few layers as follows:
The shape in the Input layer is mismatched. The input shape needs to match the shape of a single element passed as x, i.e. dataset.shape[1:]. Your dataset has shape (700, 2822), that is, 700 samples of size 2822, so your input shape should be (2822,).
Change:
input_layer = Input(shape=(1,))
To:
input_layer = Input(shape=(2822,))
You need to set return_sequences in the lstm_layer_2 to False:
lstm_layer_2 = Bidirectional(LSTM(512, return_sequences=False, recurrent_dropout=0.2, dropout=0.2))(lstm_layer_1)
Otherwise, you will still have the timesteps of your input. That is why you have the shape (None, 2822, 1). You can also add a Flatten layer prior to your output layer, but I would recommend setting return_sequences=False.
Note that a Dense layer computes the dot product between the inputs and the kernel along the last axis of the inputs.
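The last-axis behaviour can be illustrated with plain NumPy. This is only a sketch of the shape arithmetic with made-up sizes (7 timesteps instead of 2822) and an all-ones kernel; it does not use Keras itself.

```python
import numpy as np

batch, timesteps, features = 2, 7, 4
x = np.ones((batch, timesteps, features))   # like an LSTM output with return_sequences=True
kernel = np.ones((features, 1))             # Dense(1) kernel has shape (features, units)

# The product is taken along the last axis only, so the time axis survives.
y_seq = x @ kernel
print(y_seq.shape)    # (2, 7, 1)

# With a single vector per sample (like return_sequences=False), the time axis is gone.
y_last = x[:, -1, :] @ kernel
print(y_last.shape)   # (2, 1)
```

This is why the prediction keeps a dimension of size 2822 until the sequence axis is removed, either by return_sequences=False or by flattening.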
I have a model in Keras where I would like to use two loss functions. The model consists of an autoencoder and a classifier on top of it. I would like to have one loss function that makes sure the autoencoder is fitted reasonably well (for example, it can be mse) and another loss function that evaluates the classifier (for example, categorical_crossentropy). I would like to fit my model and use a loss function that would be a linear combination of the two loss functions.
# loss functions
def ae_mse_loss(x_true, x_pred):
    ae_loss = K.mean(K.square(x_true - x_pred), axis=1)
    return ae_loss
def clf_loss(y_true, y_pred):
    return K.sum(K.categorical_crossentropy(y_true, y_pred), axis=-1)
def combined_loss(y_true, y_pred):
    ???
    return ae_loss + w1*clf_loss
where w1 is some weight that defines "importance of clf_loss" in the final combined loss.
# autoencoder
ae_in_layer = Input(shape=in_dim, name='ae_in_layer')
ae_interm_layer1 = Dense(interm_dim, activation='relu', name='ae_interm_layer1')(ae_in_layer)
ae_mid_layer = Dense(latent_dim, activation='relu', name='ae_mid_layer')(ae_interm_layer1)
ae_interm_layer2 = Dense(interm_dim, activation='relu', name='ae_interm_layer2')(ae_mid_layer)
ae_out_layer = Dense(in_dim, activation='linear', name='ae_out_layer')(ae_interm_layer2)
ae_model = Model(ae_in_layer, ae_out_layer)
ae_model.compile(optimizer='adam', loss = ae_mse_loss)
# classifier
clf_in_layer = Dense(interm_dim, activation='sigmoid', name='clf_in_layer')(ae_out_layer)
clf_out_layer = Dense(3, activation='softmax', name='clf_out_layer')(clf_in_layer)
clf_model = Model(clf_in_layer, clf_out_layer)
clf_model.compile(optimizer='adam', loss = combined_loss, metrics = [ae_mse_loss, clf_loss])
What I'm not sure about is how to distinguish y_true and y_pred in the two loss functions (since they refer to true and predicted data at different stages in the model). What I had in mind is something like this (I'm not sure how to implement it since obviously I need to pass only one set of arguments y_true & y_pred):
def combined_loss(y_true, y_pred):
    ae_loss = ae_mse_loss(x_true_ae, x_pred_ae)
    clf_loss = clf_loss(y_true_clf, y_pred_clf)
    return ae_loss + w1*clf_loss
I could define this problem as two separate models and train each model separately but I would really prefer if I could do this all at once if possible (since it would optimize both problems simultaneously). I realize, this model doesn't make much sense but it demonstrates the (much more complicated) problem I'm trying to solve in a simple way.
Any suggestions would be appreciated.
Everything you need is available in native Keras.
You can automatically combine multiple losses using the loss_weights parameter.
In the example below, I tried to reproduce your setup, combining an mse loss for the reconstruction (regression) task and a categorical_crossentropy loss for the classification task:
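import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

in_dim = 10
interm_dim = 64
latent_dim = 32
n_class = 3
n_sample = 100
X = np.random.uniform(0,1, (n_sample,in_dim))
y = tf.keras.utils.to_categorical(np.random.randint(0,n_class, n_sample))
# autoencoder
ae_in_layer = Input(shape=(in_dim,), name='ae_in_layer')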
ae_interm_layer1 = Dense(interm_dim, activation='relu', name='ae_interm_layer1')(ae_in_layer)
ae_mid_layer = Dense(latent_dim, activation='relu', name='ae_mid_layer')(ae_interm_layer1)
ae_interm_layer2 = Dense(interm_dim, activation='relu', name='ae_interm_layer2')(ae_mid_layer)
ae_out_layer = Dense(in_dim, activation='linear', name='ae_out_layer')(ae_interm_layer2)
# classifier
clf_in_layer = Dense(interm_dim, activation='sigmoid', name='clf_in_layer')(ae_out_layer)
clf_out_layer = Dense(n_class, activation='softmax', name='clf_out_layer')(clf_in_layer)
model = Model(ae_in_layer, [ae_out_layer,clf_out_layer])
model.compile(optimizer='adam',
              loss={'ae_out_layer': 'mse', 'clf_out_layer': 'categorical_crossentropy'},
              loss_weights={'ae_out_layer': 1., 'clf_out_layer': 0.5})
model.fit(X, [X,y], epochs=10)
In this specific case, the loss is the result of 1*ae_out_layer_loss + 0.5*clf_out_layer_loss
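The weighted sum can be checked by hand. The following NumPy sketch uses tiny made-up targets and predictions and assumes each loss is averaged over the batch before weighting; it mirrors the arithmetic only and is not what Keras executes internally.

```python
import numpy as np

# Made-up reconstruction targets/predictions (2 samples, 2 features).
x_true = np.array([[0.0, 1.0], [1.0, 0.0]])
x_pred = np.array([[0.5, 1.0], [1.0, 0.5]])

# Made-up one-hot labels and predicted class probabilities (2 samples, 3 classes).
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_pred = np.array([[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]])

mse = np.mean((x_true - x_pred) ** 2)                      # 0.125
cce = np.mean(-np.sum(y_true * np.log(y_pred), axis=-1))   # log(2)

# Same weights as in loss_weights above.
total = 1.0 * mse + 0.5 * cce
print(mse, cce, total)
```

Because both terms are part of one scalar, a single optimizer step updates the shared autoencoder layers with gradients from both tasks at once.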
I have a simple Keras sequential model.
I have N categories and I have to predict which category the next point will fall into, based on the previous one.
The weird thing is that when I remove the softmax activation function from the output layer, the performance is better (lower loss and higher sparse_categorical_accuracy).
As the loss I'm using sparse_categorical_crossentropy with from_logits=True.
Is there any reason for that? Shouldn't it be the opposite?
Thank you in advance for any suggestion!
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                  batch_input_shape=[batch_size, None]),
        tf.keras.layers.GRU(rnn_units,
                            return_sequences=True,
                            stateful=True,
                            recurrent_initializer='glorot_uniform'),
        tf.keras.layers.Dense(vocab_size, activation='softmax')
    ])
    return model
model = build_model(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE)
def loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
model.compile(optimizer='adam', loss=loss, metrics=['sparse_categorical_accuracy'])
EPOCHS = 5
history = model.fit(train_set, epochs=EPOCHS, validation_data=val_set,)
In a nutshell, when you use the option from_logits=True, you are telling the loss function that your neural network's output is not normalized. Since you are using a softmax activation in your last layer, your outputs are in fact already normalized, so you have two options:
1. Remove the softmax activation, as you have already tried. Keep in mind that, after this, the model outputs logits rather than normalized probabilities.
2. Use from_logits=False.
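To see why the mismatch hurts, here is a small NumPy sketch of what happens when already-normalized outputs are treated as logits, so that softmax is effectively applied twice. The helper functions and the example logits are made up for illustration and only mimic the from_logits=True computation.

```python
import numpy as np

def softmax(v):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(v - v.max())
    return e / e.sum()

def sparse_ce_from_logits(label, logits):
    # Cross-entropy that applies softmax internally, as from_logits=True does.
    return -np.log(softmax(logits)[label])

logits = np.array([5.0, 0.0, 0.0])   # a confident, correct prediction for class 0
label = 0

loss_logits = sparse_ce_from_logits(label, logits)           # raw logits in
loss_double = sparse_ce_from_logits(label, softmax(logits))  # probabilities in: softmax twice

print(loss_logits)   # small, roughly 0.013
print(loss_double)   # much larger, roughly 0.56
```

Since probabilities lie in [0, 1], the second softmax squeezes them towards a uniform distribution. For 3 classes the loss can then never drop below log(1 + 2/e), about 0.55, however confident the network is, which is consistent with the lower loss observed once the softmax layer is removed.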
I am trying to construct a basic "vanilla gradient" saliency heatmap (gradient-based feature attribution) for MNIST using keras. I know there are libraries such as this one to compute saliency heatmaps, but I would like to construct this from scratch since the vanilla gradient approach seems conceptually straightforward to implement. I have trained the following digit classifier in Keras using functional model definition:
input = layers.Input(shape=(28,28,1), name='input')
conv2d_1 = layers.Conv2D(32, kernel_size=(3, 3), activation='relu')(input)
maxpooling2d_1 = layers.MaxPooling2D(pool_size=(2, 2), name='maxpooling2d_1')(conv2d_1)
conv2d_2 = layers.Conv2D(64, kernel_size=(3, 3), activation='relu')(maxpooling2d_1)
maxpooling2d_2 = layers.MaxPooling2D(pool_size=(2, 2))(conv2d_2)
flatten = layers.Flatten(name='flatten')(maxpooling2d_2)
dropout = layers.Dropout(0.5, name='dropout')(flatten)
dense = layers.Dense(num_classes, activation='softmax', name='dense')(dropout)
model = keras.models.Model(inputs=input, outputs=dense)
Now, I want to compute the saliency map for a single MNIST image. Since the final layer has a softmax activation and the denominator is a normalization term (so that the output nodes add up to 1), I believe that I need to either take the pre-softmax output or change the activation of the trained model to linear for computing saliency maps. I will do the latter.
model.layers[-1].activation = tf.keras.activations.linear # swap activation to linear
input = loaded_model.layers[0].input
output = loaded_model.layers[-1].output
input_image = x_test[0] # shape is (28, 28, 1)
pred = np.argmax(loaded_model.predict(np.expand_dims(input_image, axis=0))) # predicted class
However, I am not sure what to do beyond this. I know I can use the following K.gradients(output, input) to compute gradients. That being said, I believe I should compute the gradient of the predicted class with respect to the input image, versus computing the gradient of the entire output. How would I do this? Also, I'm not sure how to evaluate the saliency heatmap for a specific image/prediction. I imagine I will have to use sess = tf.keras.backend.get_session() and sess.run(), but not sure exactly. I would greatly appreciate any help with completing the saliency heatmap code. Thanks!
If you add the activation as a single layer after the last dense layer with:
keras.layers.Activation('softmax')
you can do:
linear_model = keras.Model(inputs=model.input, outputs=model.layers[-2].output)
To then compute the gradients like:
def get_saliency_map(model, image, class_idx):
    # image is expected to be a tf.Tensor with a leading batch dimension
    with tf.GradientTape() as tape:
        tape.watch(image)
        predictions = model(image)
        loss = predictions[:, class_idx]
    # Get the gradients of the loss w.r.t. the input image.
    gradient = tape.gradient(loss, image)
    # take maximum across channels
    gradient = tf.reduce_max(gradient, axis=-1)
    # convert to numpy
    gradient = gradient.numpy()
    # normalize between 0 and 1
    min_val, max_val = np.min(gradient), np.max(gradient)
    smap = (gradient - min_val) / (max_val - min_val + keras.backend.epsilon())
    return smap
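As a sanity check of the same GradientTape pattern, the sketch below replaces the CNN with a hypothetical linear "model" (a fixed weight matrix `W`), for which the gradient of one class score with respect to the input is known in closed form: it is simply the corresponding column of `W`. The names and numbers are made up, and TensorFlow 2.x eager mode is assumed.

```python
import numpy as np
import tensorflow as tf

# Hypothetical linear classifier: 4 input features, 3 classes.
W = tf.constant(np.arange(12, dtype=np.float32).reshape(4, 3))
x = tf.constant([[1.0, 2.0, 3.0, 4.0]])   # a batch containing one input
class_idx = 2

with tf.GradientTape() as tape:
    tape.watch(x)                  # x is a constant tensor, so it must be watched explicitly
    logits = tf.matmul(x, W)       # pre-softmax scores, shape (1, 3)
    score = logits[:, class_idx]   # score of the chosen class only

grad = tape.gradient(score, x)     # same shape as x: (1, 4)
print(grad.numpy())                # equals column 2 of W: [[ 2.  5.  8. 11.]]
```

Selecting a single class score before calling tape.gradient is what restricts the saliency map to the predicted class rather than the whole output vector.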