Imagine I have a convolutional neural network to classify MNIST digits, such as this Keras example. This is purely for experimentation so I don't have a clear reason or justification as to why I'm doing this, but let's say I would like to regularize or penalize the output of an intermediate layer. I realize that the visualization below does not correspond to the MNIST CNN example and instead just has several fully connected layers. However, to help visualize what I mean let's say I want to impose a penalty on the node values in layer 4 (either pre or post activation is fine with me).
In addition to having a categorical cross entropy loss term which is typical for multi-class classification, I would like to add another term to the loss function that minimizes the squared sum of the output at a given layer. This is somewhat similar in concept to l2 regularization, except that l2 regularization is penalizing the squared sum of all weights in the network. Instead, I am purely interested in the values of a given layer (e.g. layer 4) and not all the weights in the network.
I realize that this requires writing a custom loss function using keras backend to combine categorical crossentropy and the penalty term, but I am not sure how to use an intermediate layer for the penalty term in the loss function. I would greatly appreciate help on how to do this. Thanks!

Actually, what you are interested in is regularization, and Keras offers two kinds of built-in regularization for most layers (e.g. Dense, Conv1D, Conv2D, etc.):
Weight regularization, which penalizes the weights of a layer. Usually, you can use kernel_regularizer and bias_regularizer arguments when constructing a layer to enable it. For example:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., kernel_regularizer=l1_l2, bias_regularizer=l1_l2)
Activity regularization, which penalizes the output (i.e. activation) of a layer. To enable this, you can use activity_regularizer argument when constructing a layer:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., activity_regularizer=l1_l2)
Note that you can set activity regularization through activity_regularizer argument for all the layers, even custom layers.
In both cases, the penalties are summed into the model's loss function, and the result would be the final loss value which would be optimized by the optimizer during training.
Further, besides the built-in regularization methods (i.e. L1 and L2), you can define your own custom regularizer method (see Developing new regularizers). As always, the documentation provides additional information which might be helpful as well.
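To make the arithmetic concrete, here is a minimal plain-Python sketch of how a squared-sum activity penalty is combined with the cross-entropy term for a single sample. The function names, the rate of 0.01 and the sample numbers are illustrative assumptions, not Keras API; Keras does conceptually similar bookkeeping internally when activity_regularizer is set.

```python
import math

def cross_entropy(probs, true_index):
    """Categorical cross-entropy for one sample with a one-hot label."""
    return -math.log(probs[true_index])

def l2_activity_penalty(activations, rate=0.01):
    """Squared sum of a layer's outputs, scaled by a hypothetical rate."""
    return rate * sum(a * a for a in activations)

# Made-up values for one sample: softmax output and "layer 4" activations.
probs = [0.1, 0.7, 0.2]
layer4 = [0.5, -1.0, 2.0]

ce = cross_entropy(probs, true_index=1)
penalty = l2_activity_penalty(layer4)
total_loss = ce + penalty  # this sum is what the optimizer would minimize
print(round(ce, 4), round(penalty, 4), round(total_loss, 4))
```

Only the activations of the chosen layer enter the penalty, so the weights of the other layers are affected only indirectly, through backpropagation.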

Just specify the hidden layer as an additional output. As tf.keras.Models can have multiple outputs, this is totally allowed. Then define your custom loss using both values.
Extending your example:
input = tf.keras.Input(...)
x1 = tf.keras.layers.Dense(10)(input)
x2 = tf.keras.layers.Dense(10)(x1)
x3 = tf.keras.layers.Dense(10)(x2)
model = tf.keras.Model(inputs=[input], outputs=[x3, x2])
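For the custom loss, keep in mind that a loss passed to compile() is called separately for each output with that output's (y_true, y_pred), and Keras then sums the per-output losses. So one way to express this is one loss per output, something like:
def classification_loss(y_true, y_pred):  # applied to x3
    return f2(y_true, y_pred)  # e.g. categorical cross-entropy
def activity_penalty(y_true, y_pred):  # applied to x2
    return f1(y_pred)  # y_true is ignored; pass a dummy target for x2 when fitting
model.compile(optimizer=..., loss=[classification_loss, activity_penalty])  # same order as outputs=[x3, x2]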

Another way to add loss based on input or calculations at a given layer is to use the add_loss() API. If you are already creating a custom layer, the custom loss can be added directly to the layer. Or a custom layer can be created that simply takes the input, calculates and adds the loss, and then passes the unchanged input along to the next layer.
Here is the code taken directly from the documentation (in case the link is ever broken):
import tensorflow as tf
from tensorflow.keras.layers import Layer

class MyActivityRegularizer(Layer):
    """Layer that creates an activity sparsity regularization loss."""

    def __init__(self, rate=1e-2):
        super(MyActivityRegularizer, self).__init__()
        self.rate = rate

    def call(self, inputs):
        # We use `add_loss` to create a regularization loss
        # that depends on the inputs.
        self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        return inputs


Should feature embeddings be taken before or after dropout layer in neural network?

I am training a binary text classification model using BERT as follows:
def create_model():
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
    preprocessed_text = bert_preprocess(text_input)
    outputs = bert_encoder(preprocessed_text)
    # Neural network layers
    l1 = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
    l2 = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l1)
    # Use inputs and outputs to construct a final model
    model = tf.keras.Model(inputs=[text_input], outputs=[l2])
    return model
This code is borrowed from the example on tfhub:
I want to extract feature embeddings from the penultimate layer and use them for comparison, clustering, visualization, etc between examples. Should this be done before dropout (l1 in the model above) or after dropout (l2 in the model above)?
I am trying to figure out whether this choice makes a significant difference, or is it fine either way? For example, if I extract feature embeddings after dropout and compute feature similarities between two examples, this might be affected by which nodes are randomly set to 0 (but perhaps this is okay).
In order to answer your question let's recall how a Dropout layer works:
The Dropout layer is usually used as a means to mitigate overfitting. Suppose two layers, A and B, are connected through a Dropout layer. Then during the training phase, neurons in layer A are being randomly dropped. That prevents layer B from becoming too dependent upon specific neurons in layer A, as these neurons are not always available. Therefore, layer B has to take into consideration the overall signal coming from layer A, and (hopefully) cannot cling to some noise which is specific to the training set.
An important point to note is that the Dropout mechanism is activated only during the training phase. While predicting, Dropout does nothing.
If I understand you correctly, you want to know whether to take the features before or after the Dropout (note that in your network l1 denotes the features after Dropout has been applied). If so, I would take the features before Dropout, because technically it does not really matter (Dropout is inactive during prediction) and it is more reasonable to do so (Dropout is not meaningful without a following layer).
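The training-versus-prediction behaviour described above can be sketched in plain Python. This is a simplified stand-in for a Dropout layer (inverted dropout, where survivors are rescaled by 1 / (1 - rate)), not the Keras implementation; the function name and the feature values are made up for illustration.

```python
import random

def dropout(values, rate, training, rng=random):
    """Inverted dropout: while training, zero each value with probability
    `rate` and rescale the survivors; otherwise pass values through."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

features = [0.2, -1.3, 0.7, 0.05]  # made-up embedding values
rng = random.Random(0)             # fixed seed so the run is repeatable

train_out = dropout(features, rate=0.5, training=True, rng=rng)
infer_out = dropout(features, rate=0.5, training=False)
print(train_out)  # each entry is either 0.0 or the input divided by 0.5
print(infer_out)  # identical to `features`
```

Because the prediction-time output equals the input, embeddings extracted at prediction time are the same on either side of the Dropout layer.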

where does class_weights or weighted loss penalize the network?

I am working on a semantic segmentation project with multiclass data that is highly imbalanced. I looked for ways to handle this during training and found the class_weights and sample_weights parameters.
I can implement this using a class_weight dictionary such as
{ 0:1, 1:10,2:15 }
I also saw a method of updating weights in loss function
But at what point do these weights get updated?
If class_weights are used where will it get penalized? I already have a kernel_regularizer for each layer so if my classes have to be penalized based on my class weights then will it penalize the output of each layer y=Wx+b or only at the final layer?
Same if I use a weighted loss function will it get penalized only on the final layer before loss calculation or on each layer and then the final loss is calculated?
Any explanation on this would be very useful.
The class_weights you mentioned in your dictionary are there to account for your imbalanced data. They never change during training; they only increase the penalty for misclassified instances of minority classes (that way your network pays more attention to them, and the gradients treat one 'Class2' instance as if it were 15 times more important than one 'Class0' instance).
The kernel_regularizer you mention is added to your loss function and penalizes large weight norms for the weight matrices of the layers it is attached to (if you use kernel_regularizer = tf.keras.regularizers.l1(0.01) in a Dense layer, it only affects that layer). So that is a different kind of weight that has nothing to do with classes, only with the weights inside your network. Your eventual loss will be something like loss = Cross_entropy + a * norm(Weight_matrix), so the network has the additional task of keeping the weight norms low while minimizing the classification loss (cross entropy). Both act only through the loss value; neither one modifies the output y=Wx+b of individual layers.
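As a rough illustration of where each kind of weight enters, the following plain-Python sketch computes the loss for one sample. The class_weight dictionary is the one from the question; the function names, the probabilities, the kernel values and the factor a=0.01 are made-up assumptions, and the squared L2 norm is used as one possible choice of norm.

```python
import math

class_weight = {0: 1, 1: 10, 2: 15}  # dictionary from the question

def weighted_cross_entropy(probs, true_class, weights):
    """Per-sample cross-entropy multiplied by the weight of the true class."""
    return weights[true_class] * -math.log(probs[true_class])

def l2_penalty(weight_matrix, a=0.01):
    """a * squared L2 norm of one layer's kernel."""
    return a * sum(w * w for row in weight_matrix for w in row)

probs = [0.5, 0.3, 0.2]             # made-up softmax output for one pixel
kernel = [[0.1, -0.2], [0.3, 0.4]]  # made-up kernel of one layer

unweighted = -math.log(probs[2])
loss_class2 = weighted_cross_entropy(probs, 2, class_weight)  # 15x unweighted
total = loss_class2 + l2_penalty(kernel)  # both terms live in the loss only
print(unweighted, loss_class2, total)
```

The class weight rescales the cross-entropy term of a sample, the regularizer adds a term that depends only on the kernel, and every layer is then influenced solely through the gradients of this total.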

Keras: Custom loss function with training data not directly related to model

I am trying to convert my CNN written with tensorflow layers to use the keras api in tensorflow (I am using the keras api provided by TF 1.x), and am having trouble writing a custom loss function to train the model.
According to this guide, when defining a loss function it expects the arguments (y_true, y_pred)
def basic_loss_function(y_true, y_pred):
    return ...
However, in every example I have seen, y_true is somehow directly related to the model (in the simple case it is the output of the network). In my problem, this is not the case. How do I implement this if my loss function depends on some training data that is unrelated to the tensors of the model?
To be concrete, here is my problem:
I am trying to learn an image embedding trained on pairs of images. My training data includes image pairs and annotations of matching points between the image pairs (image coordinates). The input feature is only the image pairs, and the network is trained in a siamese configuration.
I am able to implement this successfully with tensorflow layers and train it successfully with tensorflow estimators.
My current implementation builds a tf Dataset from a large database of tf Records, where the features are a dictionary containing the images and arrays of matching points. Before, I could easily feed these arrays of image coordinates to the loss function, but here it is unclear how to do so.
There is a hack I often use that is to calculate the loss within the model, by means of Lambda layers. (When the loss is independent from the true data, for instance, and the model doesn't really have an output to be compared)
In a functional API model:
def loss_calc(x):
    loss_input_1, loss_input_2 = x  # arbitrary inputs, you choose
    # according to what you gave to the Lambda layer
    # here you use some external data that doesn't relate to the samples
    externalData = K.constant(external_numpy_data)
    # calculate the loss
    loss = ...  # your loss expression goes here
    return loss
Using the outputs of the model itself (the tensor(s) that are used in your loss)
loss = Lambda(loss_calc)([model_output_1, model_output_2])
Create the model outputting the loss instead of the outputs:
model = Model(inputs, loss)
Create a dummy keras loss function for compilation:
def dummy_loss(y_true, y_pred):
    return y_pred  # where y_pred is the loss itself, the output of the model above
model.compile(loss=dummy_loss, ...)
When fitting, pass any dummy array with the right number of samples as the targets, for example np.zeros((number_of_samples,)); it will be ignored.
Another way of doing it, is using a custom training loop.
This is much more work, though.
Although you're using TF1, you can still turn eager execution on at the very beginning of your code and do stuff like it's done in TF2. (tf.enable_eager_execution())
Follow the tutorial for custom training loops:
Here, you calculate the gradients yourself, of any result regarding whatever you want. This means you don't need to follow Keras standards of training.
Finally, you can use the approach you suggested of model.add_loss.
In this case, you calculate the loss exactly the same way I did in the first answer, and pass this loss tensor to add_loss.
You can probably compile a model with loss=None then (not sure), because you're going to use other losses, not the standard one.
In this case, your model's output will probably be None too, and you should fit with y=None.

Keras multiple binary outputs

Can someone help me understand a bit better this problem? I must train a neural network which should output 200 mutually independent categories, each of these categories is a percentage ranging from 0 to 1. This seems to me like a binary_crossentropy problem, but every example I see on the internet uses binary_crossentropy with a single output. Since my output should be 200, if I apply binary_crossentropy, would that be correct?
This is what I have in mind, is that a correct approach or should I change it?
inputs = Input(shape=(input_shape,))
hidden = Dense(2048, activation='relu')(inputs)
hidden = Dense(2048, activation='relu')(hidden)
output = Dense(200, name='output_cat', activation='sigmoid')(hidden)
model = Model(inputs=inputs, outputs=[output])
loss_map = {'output_cat': 'binary_crossentropy'}
model.compile(loss=loss_map, optimizer="sgd", metrics=['mae', 'accuracy'])
To optimize for multiple independent binary classification problems (and not multiple category problem where you can use categorical_crossentropy) using Keras, you could do the following (here I take the example of 2 independent binary outputs, but you can extend that as much as needed):
inputs = Input(shape=(input_shape,))
hidden = Dense(2048, activation='relu')(inputs)
hidden = Dense(2048, activation='relu')(hidden)
output = Dense(units = 2, activation='sigmoid')(hidden )
here you split your output using Keras's Lambda layer:
output_1 = Lambda(lambda x: x[...,:1])(output)
output_2 = Lambda(lambda x: x[...,1:])(output)
adad = optimizers.Adadelta()
your model output becomes a list of the different independent outputs
model = Model(inputs, [output_1, output_2])
you compile the model using one loss function for each output, in a list. (In fact, if you give only one kind of loss function, I believe it will apply it to all the outputs independently)
model.compile(optimizer=adad, loss=['binary_crossentropy','binary_crossentropy'])
I know this is an old question, but I believe the accepted answer is incorrect and the most upvoted answer is workable but not optimal. The original poster's method is the correct way to solve this problem. His output is 200 independent probabilities from 0 to 1, so his output layer should be a dense layer with 200 neurons and a sigmoid activation layer. It's not a categorical_crossentropy problem because it's not 200 mutually exclusive categories. Also, there's no reason to split the output using a lambda layer when a single dense layer will do. The original poster's method is correct. Here's another way to do it using the Keras interface.
model = Sequential()
model.add(Dense(2048, input_dim=n_input, activation='relu'))
model.add(Dense(2048, activation='relu'))
model.add(Dense(200, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
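To see why sigmoid with binary_crossentropy fits independent outputs while softmax does not, here is a small plain-Python sketch. It uses 3 outputs instead of 200, the logits and targets are made up, and the helper functions are simplified stand-ins rather than the Keras implementations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of the per-output binary cross-entropies."""
    terms = []
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        terms.append(-(t * math.log(p) + (1.0 - t) * math.log(1.0 - p)))
    return sum(terms) / len(terms)

logits = [2.0, -1.0, 0.5]   # made-up pre-activations for 3 outputs
targets = [0.9, 0.1, 0.6]   # independent targets in [0, 1]

sig = [sigmoid(z) for z in logits]
soft = softmax(logits)
print(sum(sig))   # not constrained to 1: each output is independent
print(sum(soft))  # approximately 1: the outputs compete with each other
print(binary_crossentropy(targets, sig))
```

With softmax, raising one output necessarily lowers the others, which is only appropriate when the categories are mutually exclusive.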
binary_crossentropy with Sigmoid activation function is used for binary (positive and negative) classification, whereas your case is multi-class classification. In the case of multi-class classification, categorical_crossentropy with softmax activation is used. The Sigmoid activation function generates the probability of input being positive class, and SoftMax generates probability corresponding to input being in each class. The class with maximum probability is assigned to the input.
For multiple category classification problems, you should use categorical_crossentropy rather than binary_crossentropy. With this, when your model classifies an input, it is going give a dispersion of probabilities between all 200 categories. The category that receives the highest probability will be the output for that particular input.
You can see this when you call model.predict(). If you were to call this function only on one input, for example, and print the results, you will see a result of 200 percentages (in total summing to 1). The hope is that one of those 200 percentages would be vastly higher than the others, which signals that the model thinks that there is a strong probability that this is the correct output (category) for this particular input.
This video may help clarify the prediction piece. Printing out the predictions starts around 3:17, but to get the full context, you'll need to start from the beginning.
When there are multiple classes, categorical_crossentropy should be used. Refer to another answer here.

DeepLearning Anomaly Detection for images

I am still relatively new to the world of Deep Learning. I wanted to create a Deep Learning model (preferably using Tensorflow/Keras) for image anomaly detection. By anomaly detection I mean, essentially a OneClassSVM.
I have already tried sklearn's OneClassSVM using HOG features from the image. I was wondering if there is some example of how I can do this in deep learning. I looked up but couldn't find one single code piece that handles this case.
The way of doing this in Keras is with the KerasRegressor wrapper module (they wrap sci-kit learn's regressor interface). Useful information can also be found in the source code of that module. Basically you first have to define your Network Model, for example:
from keras.layers import Input, Dense
from keras.models import Model

def simple_model():
    # Input layer
    data_in = Input(shape=(13,))
    # First layer, fully connected, ReLU activation
    layer_1 = Dense(13, activation='relu', kernel_initializer='normal')(data_in)
    # Second layer...etc
    layer_2 = Dense(6, activation='relu', kernel_initializer='normal')(layer_1)
    # Output, single node without activation
    data_out = Dense(1, kernel_initializer='normal')(layer_2)
    # Save and compile model
    model = Model(inputs=data_in, outputs=data_out)
    # You may choose any loss or optimizer function, be careful which you choose
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
Then, pass it to the KerasRegressor builder and fit with your data:
from keras.wrappers.scikit_learn import KerasRegressor
#chose your epochs and batches
regressor = KerasRegressor(build_fn=simple_model, epochs=100, batch_size=64)
#fit with your data (here called data and labels)
regressor.fit(data, labels, epochs=100)
For which you can now do predictions or obtain its score:
p = regressor.predict(data_test) #obtain predicted value
score = regressor.score(data_test, labels_test) #obtain test score
In your case, as you need to detect anomalous images from the ones that are ok, one approach you can take is to train your regressor by passing anomalous images labeled 1 and images that are ok labeled 0.
This will make your model return a value closer to 1 when the input is an anomalous image, enabling you to threshold the desired results. You can think of this output as its R^2 coefficient to the "Anomalous Model" you trained as 1 (perfect match).
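The thresholding step could look like the following minimal sketch. The scores and the 0.5 cut-off are made-up assumptions; in practice the threshold would be tuned on held-out data.

```python
def flag_anomalies(scores, threshold=0.5):
    """Return True for every score above the (hypothetical) threshold."""
    return [score > threshold for score in scores]

# Made-up regressor outputs for four images.
predicted_scores = [0.03, 0.91, 0.48, 0.77]
flags = flag_anomalies(predicted_scores)
print(flags)  # [False, True, False, True]
```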
Also, as you mentioned, Autoencoders are another way to do anomaly detection. For this I suggest you take a look at the Keras Blog post Building Autoencoders in Keras, where they explain in detail about the implementation of them with the Keras library.
It is worth noting that, framed this way, single-class classification can be treated as a regression problem.
Classification tries to find a probability distribution among the N possible classes, and you usually pick the most probable class as the output (that is why most classification networks use a sigmoid or softmax activation on their outputs, as both produce values in the range [0, 1]). Its output is discrete/categorical.
Similarly, Regression tries to find the best model that represents your data, by minimizing the error or some other metric (like the well-known R^2 metric, or Coefficient of Determination). Its output is a real number/continuous (and the reason why most Regression Networks don't use activations on their outputs). I hope this helps, good luck with your coding.