Keras multiple binary outputs - tensorflow

Can someone help me understand this problem a bit better? I must train a neural network that outputs 200 mutually independent categories, each of which is a value ranging from 0 to 1. This seems to me like a binary_crossentropy problem, but every example I see on the internet uses binary_crossentropy with a single output. Since my output should be 200, if I apply binary_crossentropy, would that be correct?
This is what I have in mind, is that a correct approach or should I change it?
inputs = Input(shape=(input_shape,))
hidden = Dense(2048, activation='relu')(inputs)
hidden = Dense(2048, activation='relu')(hidden)
output = Dense(200, name='output_cat', activation='sigmoid')(hidden)
model = Model(inputs=inputs, outputs=[output])
loss_map = {'output_cat': 'binary_crossentropy'}
model.compile(loss=loss_map, optimizer="sgd", metrics=['mae', 'accuracy'])

To optimize for multiple independent binary classification problems (as opposed to a multi-class problem, where you can use categorical_crossentropy) using Keras, you could do the following (here I take the example of 2 independent binary outputs, but you can extend that as much as needed):
inputs = Input(shape=(input_shape,))
hidden = Dense(2048, activation='relu')(inputs)
hidden = Dense(2048, activation='relu')(hidden)
output = Dense(units=2, activation='sigmoid')(hidden)
Here you split your output using Keras's Lambda layer:
output_1 = Lambda(lambda x: x[...,:1])(output)
output_2 = Lambda(lambda x: x[...,1:])(output)
adad = optimizers.Adadelta()
Your model output becomes a list of the different independent outputs:
model = Model(inputs, [output_1, output_2])
You compile the model using one loss function for each output, in a list. (In fact, if you give only one kind of loss function, I believe it will apply it to all the outputs independently.)
model.compile(optimizer=adad, loss=['binary_crossentropy','binary_crossentropy'])

I know this is an old question, but I believe the accepted answer is incorrect and the most upvoted answer is workable but not optimal. The original poster's method is the correct way to solve this problem. His output is 200 independent probabilities from 0 to 1, so his output layer should be a dense layer with 200 neurons and a sigmoid activation layer. It's not a categorical_crossentropy problem because it's not 200 mutually exclusive categories. Also, there's no reason to split the output using a lambda layer when a single dense layer will do. The original poster's method is correct. Here's another way to do it using the Keras interface.
model = Sequential()
model.add(Dense(2048, input_dim=n_input, activation='relu'))
model.add(Dense(2048, activation='relu'))
model.add(Dense(200, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
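To see why a single binary_crossentropy loss is fine for many independent outputs, here is a minimal pure-Python sketch of the per-sample computation: the loss is evaluated for each output separately and then averaged. The three targets stand in for the 200 in the question, and the function name and the eps clipping value are my own assumptions, not Keras internals.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over the independent outputs of one sample."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Three independent targets; soft values such as 0.3 are allowed.
targets = [1.0, 0.0, 0.3]
good = [0.9, 0.1, 0.3]   # predictions close to the targets
bad = [0.2, 0.8, 0.9]    # predictions far from the targets

loss_good = binary_crossentropy(targets, good)
loss_bad = binary_crossentropy(targets, bad)
print(loss_good, loss_bad)
```

Each output contributes its own term, so nothing forces the 200 values to sum to 1. Note that with soft targets the loss is minimized when the prediction equals the target, not when it reaches 0 or 1.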

binary_crossentropy with Sigmoid activation function is used for binary (positive and negative) classification, whereas your case is multi-class classification. In the case of multi-class classification, categorical_crossentropy with softmax activation is used. The Sigmoid activation function generates the probability of input being positive class, and SoftMax generates probability corresponding to input being in each class. The class with maximum probability is assigned to the input.

For multiple category classification problems, you should use categorical_crossentropy rather than binary_crossentropy. With this, when your model classifies an input, it is going to give a distribution of probabilities across all 200 categories. The category that receives the highest probability will be the output for that particular input.
You can see this when you call model.predict(). If you were to call this function only on one input, for example, and print the results, you will see a result of 200 percentages (in total summing to 1). The hope is that one of those 200 percentages would be vastly higher than the others, which signals that the model thinks that there is a strong probability that this is the correct output (category) for this particular input.
This video may help clarify the prediction piece. Printing out the predictions starts around 3:17, but to get the full context, you'll need to start from the beginning.

When there are multiple classes, categorical_crossentropy should be used. Refer to another answer here.


Custom loss function in Keras that penalizes output from intermediate layer

Imagine I have a convolutional neural network to classify MNIST digits, such as this Keras example. This is purely for experimentation so I don't have a clear reason or justification as to why I'm doing this, but let's say I would like to regularize or penalize the output of an intermediate layer. I realize that the visualization below does not correspond to the MNIST CNN example and instead just has several fully connected layers. However, to help visualize what I mean let's say I want to impose a penalty on the node values in layer 4 (either pre or post activation is fine with me).
In addition to having a categorical cross entropy loss term which is typical for multi-class classification, I would like to add another term to the loss function that minimizes the squared sum of the output at a given layer. This is somewhat similar in concept to l2 regularization, except that l2 regularization is penalizing the squared sum of all weights in the network. Instead, I am purely interested in the values of a given layer (e.g. layer 4) and not all the weights in the network.
I realize that this requires writing a custom loss function using keras backend to combine categorical crossentropy and the penalty term, but I am not sure how to use an intermediate layer for the penalty term in the loss function. I would greatly appreciate help on how to do this. Thanks!
Actually, what you are interested in is regularization, and in Keras there are two different kinds of built-in regularization approaches available for most of the layers (e.g. Dense, Conv1D, Conv2D, etc.):
Weight regularization, which penalizes the weights of a layer. Usually, you can use kernel_regularizer and bias_regularizer arguments when constructing a layer to enable it. For example:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., kernel_regularizer=l1_l2, bias_regularizer=l1_l2)
Activity regularization, which penalizes the output (i.e. activation) of a layer. To enable this, you can use activity_regularizer argument when constructing a layer:
l1_l2 = tf.keras.regularizers.l1_l2(l1=1.0, l2=0.01)
x = tf.keras.layers.Dense(..., activity_regularizer=l1_l2)
Note that you can set activity regularization through activity_regularizer argument for all the layers, even custom layers.
In both cases, the penalties are summed into the model's loss function, and the result would be the final loss value which would be optimized by the optimizer during training.
Further, besides the built-in regularization methods (i.e. L1 and L2), you can define your own custom regularizer method (see Developing new regularizers). As always, the documentation provides additional information which might be helpful as well.
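As a rough illustration of what an l1_l2 penalty on a layer's activations adds to the loss, here is a small pure-Python sketch. The activation values and the data loss are made-up numbers, and the helper name is hypothetical; as far as I know Keras may additionally normalize activity penalties by the batch size, which this sketch ignores.

```python
def l1_l2_penalty(values, l1=0.0, l2=0.0):
    """l1 * sum(|v|) + l2 * sum(v^2) over a flat list of values."""
    return l1 * sum(abs(v) for v in values) + l2 * sum(v * v for v in values)

activations = [0.5, -1.0, 2.0]  # hypothetical outputs of one layer for one sample
data_loss = 0.35                # hypothetical cross-entropy value

penalty = l1_l2_penalty(activations, l1=1.0, l2=0.01)
total_loss = data_loss + penalty  # this sum is what the optimizer minimizes
print(penalty, total_loss)
```

With l1=1.0 the penalty dominates the data loss here, which is a hint that regularization factors usually need to be much smaller in practice.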
Just specify the hidden layer as an additional output. As tf.keras.Models can have multiple outputs, this is totally allowed. Then define your custom loss using both values.
Extending your example:
input = tf.keras.Input(...)
x1 = tf.keras.layers.Dense(10)(input)
x2 = tf.keras.layers.Dense(10)(x1)
x3 = tf.keras.layers.Dense(10)(x2)
model = tf.keras.Model(inputs=[input], outputs=[x3, x2])
For the custom loss function, I think it's something like this:
def custom_loss(y_true, y_pred):
    x3, x2 = y_pred  # same order as the model's outputs=[x3, x2]
    label = y_true  # you might need to provide a dummy var for x2
    return f1(x2) + f2(label, x3)  # whatever you want to do with f1, f2
Another way to add loss based on input or calculations at a given layer is to use the add_loss() API. If you are already creating a custom layer, the custom loss can be added directly to the layer. Or a custom layer can be created that simply takes the input, calculates and adds the loss, and then passes the unchanged input along to the next layer.
Here is the code taken directly from the documentation (in case the link is ever broken):
import tensorflow as tf
from tensorflow.keras.layers import Layer

class MyActivityRegularizer(Layer):
    """Layer that creates an activity sparsity regularization loss."""

    def __init__(self, rate=1e-2):
        super(MyActivityRegularizer, self).__init__()
        self.rate = rate

    def call(self, inputs):
        # We use `add_loss` to create a regularization loss
        # that depends on the inputs.
        self.add_loss(self.rate * tf.reduce_sum(tf.square(inputs)))
        return inputs
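To make the combined objective from the question concrete, here is a minimal pure-Python sketch of the quantity being minimized: categorical cross-entropy plus rate times the squared sum of one layer's outputs. The function names and the example numbers are my own assumptions; in a real model the penalty term would come from add_loss or an activity regularizer rather than from a hand-written function.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def penalized_loss(y_true, y_pred, layer_output, rate=1e-2):
    """Classification loss plus rate * sum of squares of an intermediate layer."""
    penalty = rate * sum(a * a for a in layer_output)
    return categorical_crossentropy(y_true, y_pred) + penalty

y_true = [0.0, 1.0, 0.0]
y_pred = [0.1, 0.8, 0.1]

small = penalized_loss(y_true, y_pred, [0.1, -0.2])  # small intermediate activations
large = penalized_loss(y_true, y_pred, [3.0, -4.0])  # large intermediate activations
print(small, large)
```

The predictions are identical in both calls, so the difference in loss comes only from the size of the intermediate activations.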

How to calculate confidence score of a Neural Network prediction

I am using a deep neural network model (implemented in Keras) to make predictions. Something like this:
def make_model():
    model = Sequential()
    model.add(Conv2D(20,(5,5), activation = "relu"))
    model.add(Dense(20, activation = "relu"))
    model.add(Lambda(lambda x: tf.expand_dims(x, axis=1)))
    model.add(SimpleRNN(50, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(loss = "binary_crossentropy", optimizer = adagrad, metrics = ["accuracy"])
    return model
model = make_model()
model.fit(x_train, y_train, validation_data = (x_validation,y_validation), epochs = 25, batch_size = 25, verbose = 1)
prediction = model.predict_classes(x)
probabilities = model.predict_proba(x) #I assume these are the probabilities of the class being predicted
My problem is a binary classification problem. I wish to calculate the confidence score of each of these predictions, i.e. I wish to know: is my model 99% certain it is "0", or is it 58% certain it is "0"?
I have found some views on how to do it, but can't implement them. The approach I wish to follow says: "With classifiers, when you output you can interpret values as the probability of belonging to each specific class. You can use their distribution as a rough measure of how confident you are that an observation belongs to that class."
How should I predict with something like above model so that I get its confidence about each predictions? I would appreciate some practical examples (preferably in Keras).
The softmax is a problematic way to estimate the confidence of a model's prediction.
There are a few recent papers about this topic.
You can look for "calibration" of neural networks in order to find relevant papers.
This is one example you can start with -
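To give a rough idea of what "calibration" measures, here is a pure-Python sketch of one common summary, the expected calibration error: predictions are grouped into confidence bins and, for each bin, the mean confidence is compared with the observed accuracy. The function name, the bin count, and the example data are assumptions of mine and not taken from any particular paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece

# Hypothetical model that says "95% sure" but is right only half of the time.
overconfident = expected_calibration_error(
    [0.95, 0.95, 0.95, 0.95], [True, True, False, False])

# Hypothetical model that says "75% sure" and is right 3 times out of 4.
calibrated = expected_calibration_error(
    [0.75, 0.75, 0.75, 0.75], [True, True, True, False])
print(overconfident, calibrated)
```

A value near zero means the stated confidences match how often the model is actually right; a large value means the raw outputs should not be read as probabilities without further adjustment.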
In Keras, there is a method called predict() that is available for both Sequential and Functional models. It will work fine in your case if you are using binary_crossentropy as your loss function and a final Dense layer with a sigmoid activation function.
Here is how to call it with one test data instance. With a single sigmoid output unit, mymodel.predict() returns one probability per instance: the estimated probability of class 1 (the probability of class 0 is one minus that value). This value is the confidence score that you mentioned. You can further use np.where() as shown below to turn it into the final class (1 if the probability is over 50%, otherwise 0).
yhat_probabilities = mymodel.predict(mytestdata, batch_size=1)
yhat_classes = np.where(yhat_probabilities > 0.5, 1, 0).squeeze().item()
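Assuming a single sigmoid output unit as in the question's model, the "99% certain it is 0" style of answer can be read off the predicted probability like this. This is a minimal sketch in plain Python; the helper name and the 0.5 threshold are my own choices.

```python
def class_and_confidence(p_positive, threshold=0.5):
    """Map a sigmoid output to (predicted class, probability of that class)."""
    if p_positive > threshold:
        return 1, p_positive
    return 0, 1.0 - p_positive

# Hypothetical sigmoid outputs for four inputs.
for p in [0.99, 0.58, 0.42, 0.01]:
    label, confidence = class_and_confidence(p)
    print(p, label, round(confidence, 2))
```

So an output of 0.01 means the model is about 99% confident in class 0, while 0.42 means it leans towards class 0 with only about 58% confidence.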
I've come to understand that the probabilities that are output by logistic regression can be interpreted as confidence.
Here are some links to help you come to your own conclusion.
how to assess the confidence score of a prediction with scikit-learn
How about using a softmax as the activation in the last layer? Let's say something like this:
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer = adagrad, metrics = ["accuracy"])
In this way, for each data point, you will be given a probabilistic-ish result by the model, which tells what is the likelihood that your data point belongs to each of two classes.
For example, for a given X, if the model returns (0.3, 0.7), you will know it is more likely that X belongs to class 1 than class 0, and you know that the likelihood has been estimated to be 0.7 versus 0.3.

Keras: BiLSTM only works when return_sequences=True

I've been trying to implement this BiLSTM in Keras:
Here is where I'm at, and it kind of works:
inputs_w = Input(shape=(sequence_length,), dtype='int32')
inputs_pos = Input(shape=(sequence_length,), dtype='int32')
inputs_cue = Input(shape=(sequence_length,), dtype='int32')
w_emb = Embedding(vocabulary_size+1, embedding_dim, input_length=sequence_length, trainable=False)(inputs_w)
p_emb = Embedding(tag_voc_size+1, embedding_dim, input_length=sequence_length, trainable=False)(inputs_pos)
c_emb = Embedding(2, embedding_dim, input_length=sequence_length, trainable=False)(inputs_cue)
summed = keras.layers.add([w_emb, p_emb, c_emb])
BiLSTM = Bidirectional(CuDNNLSTM(hidden_dims, return_sequences=True))(summed)
DPT = Dropout(0.2)(BiLSTM)
outputs = Dense(2, activation='softmax')(DPT)
checkpoint = ModelCheckpoint('bilstm_one_hot.hdf5', monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
early = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=5, verbose=1, mode='auto')
model = Model(inputs=[inputs_w, inputs_pos, inputs_cue], outputs=outputs)
model.compile('adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
model.fit([X_train, X_pos_train, X_cues_train], Y_train, batch_size=batch_size, epochs=num_epochs, verbose=1, validation_split=0.2, callbacks=[early, checkpoint])
In the original code, in Tensorflow, the author uses masking and softmax cross entropy with logits. I don't get how to implement this in Keras yet. If you have any advice don't hesitate.
My main issue here is with return_sequences=True. The author doesn't appear to be using it in his tensorflow implementation and when I turn it to False, I get this error:
ValueError: Error when checking target: expected dense_1 to have 2 dimensions, but got array with shape (820, 109, 2)
I also tried using:
outputs = TimeDistributed(Dense(2, activation='softmax'))(BiLSTM)
which raises an AssertionError without any information.
Any ideas?
the author uses masking and softmax cross entropy with logits. I don't get how to implement this in Keras yet.
Regarding softmax cross entropy with logits, you are doing it correctly. softmax_cross_entropy_with_logits as the loss function + no activation function on the last layer is the same as your approach with categorical_crossentropy as loss + softmax activation on the last layer. The only difference is that the latter one is numerically less stable. If this turns out to be an issue for you, you can (if your Keras backend is TensorFlow) just pass tf.nn.softmax_cross_entropy_with_logits as your loss. If you have another backend, you will have to look for an equivalent there.
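To illustrate the stability point, here is a small pure-Python sketch comparing a naive "softmax, then log" computation with one that works directly on the logits via the log-sum-exp trick. Both function names are hypothetical, and this only shows the general idea, not how TensorFlow implements its op.

```python
import math

def naive_cross_entropy(logits, target_index):
    """Softmax followed by log; math.exp overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    return -math.log(exps[target_index] / sum(exps))

def stable_cross_entropy(logits, target_index):
    """Same quantity computed from logits with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

small = [2.0, 1.0, 0.1]
print(naive_cross_entropy(small, 0), stable_cross_entropy(small, 0))

large = [1000.0, 0.0]
try:
    naive_cross_entropy(large, 0)
except OverflowError:
    print("naive version overflowed")
print(stable_cross_entropy(large, 0))
```

For moderate logits both versions agree; for extreme logits only the version that stays in log space gives a usable number.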
Regarding masking, I'm not sure if I fully understand what the author is doing. However, in Keras the Embedding layer has a mask_zero parameter that you can set to True. In that case all timesteps that have a 0 will be ignored in all further calculations. In your source, it is not 0 that is being masked, though, so you would have to adjust the indices accordingly. If that doesn't work, there is the Masking layer in Keras that you can put before your recurrent layer, but I have little experience with that.
My main issue here is with return_sequences=True. The author doesn't appear to be using it
What makes you think that he doesn't use it? Just because that keyword does not appear in the code doesn't mean anything. But I'm also not sure: the code is pretty old, and I can no longer find documentation that would tell what the defaults were.
Anyway, if you want to use return_sequences=False (for whatever reason) be aware that this changes the output shape of the layer:
with return_sequences=True the output shape is (batch_size, timesteps, features)
with return_sequences=False the output shape is (batch_size, features)
The error you are getting is basically telling you that your network's output has one dimension less than the target y values you are feeding it.
So, to me it looks like return_sequences=True is just what you need, but without further information it is hard to tell.
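The shape difference can be reproduced with a toy recurrence in plain Python. This is only a sketch: the recurrence h_t = 0.5 * h_{t-1} + x_t is made up, and unlike a real LSTM it keeps the number of features equal to the input size. The batch size of 4 is arbitrary; the 109 timesteps and 2 features mirror the target shape in the error message.

```python
def toy_rnn(batch, return_sequences):
    """Toy recurrence h_t = 0.5 * h_{t-1} + x_t, applied per feature."""
    outputs = []
    for sequence in batch:                  # sequence: list of timesteps
        state = [0.0] * len(sequence[0])
        states = []
        for step in sequence:               # step: list of features
            state = [0.5 * h + x for h, x in zip(state, step)]
            states.append(state)
        # Either keep the state at every timestep or only the last one.
        outputs.append(states if return_sequences else states[-1])
    return outputs

def shape(nested):
    """Shape of a regular nested list, e.g. (4, 109, 2)."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

batch = [[[1.0, 2.0]] * 109] * 4  # (batch_size=4, timesteps=109, features=2)
print(shape(toy_rnn(batch, return_sequences=True)))   # (4, 109, 2)
print(shape(toy_rnn(batch, return_sequences=False)))  # (4, 2)
```

With per-timestep targets of shape (samples, 109, 2), only the first variant produces an output with a matching number of dimensions.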
Then, regarding TimeDistributed. I'm not quite sure what you are trying to achieve with it, but quoting from the docs:
This wrapper applies a layer to every temporal slice of an input.
The input should be at least 3D, and the dimension of index one will be considered to be the temporal dimension.
(emphasis is mine)
I'm not sure from your question, in which scenario the empty assertion occurs.
If you have a recurrent layer with return_sequences=False before, you are again missing a dimension (I can't tell you why the assertion is empty, though).
If you have a recurrent layer with return_sequences=True before, it should work, but it would be completely useless, as Dense is applied in a time distributed way anyways. If I'm not mistaken, this behavior of the Dense layer was changed in some older Keras version (they should really update the example there and stop using Dense!). As the code you are referring to is quite old, it's well possible that TimeDistributed was needed back then, but is not needed anymore.
If your plan was to restore the missing dimension, TimeDistributed won't help you, but RepeatVector would. But, as already said, in that case better use return_sequences=True in the first place.
The problem is that your target values seem to be time distributed: you have 109 timesteps, each with a one-hot target vector of size two. This is why you need return_sequences=True. Otherwise only the last timestep is fed to the Dense layer and you get just one output.
So, depending on what you need, either keep it as it is now or, if the last timestep alone is enough for you, drop return_sequences=True; in that case you would need to adjust the y values accordingly.

Difference between Dense(2) and Dense(1) as the final layer of a binary classification CNN?

In a CNN for binary classification of images, should the shape of output be (number of images, 1) or (number of images, 2)? Specifically, here are 2 kinds of last layer in a CNN:
keras.layers.Dense(2, activation = 'softmax')(previousLayer)
keras.layers.Dense(1, activation = 'softmax')(previousLayer)
In the first case, for every image there are 2 output values (probability of belonging to group 1 and probability of belonging to group 2). In the second case, each image has only 1 output value, which is its label (0 or 1, label=1 means it belongs to group 1).
Which one is correct? Is there intrinsic difference? I don't want to recognize any object in those images, just divide them into 2 groups.
Thanks a lot!
The first one is the correct solution:
keras.layers.Dense(2, activation = 'softmax')(previousLayer)
Usually, we use the softmax activation function for classification tasks, and the output width is the number of categories. This means that if you want to classify one object into three categories with the labels A, B, or C, you would need the Dense layer to generate an output with a shape of (None, 3). Then you can use a cross-entropy loss function to calculate the loss, automatically calculate the gradient, and do the back-propagation process.
If you generate only one value with the Dense layer, you get a tensor with a shape of (None, 1), so it produces a single numeric value, like a regression task, and you are using the value of the output to represent the category. That can give a correct answer, but it is not the general approach to a classification task.
The difference is if the class probabilities are independent of each other (multi-label classification) or not.
When there are 2 classes and you generally have P(c=1) + P(c=0) = 1 then
keras.layers.Dense(2, activation = 'softmax')
keras.layers.Dense(1, activation = 'sigmoid')
both are correct in terms of class probabilities. The only difference being how you supply the labels during training. But
keras.layers.Dense(2, activation = 'sigmoid')
is incorrect in that context. However, it is correct implementation if you have P(c=1) + P(c=0) != 1. This is the case for multi-label classification where an instance may belong to more than one correct class.
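The equivalence of the first two options, and the difference from the third, can be checked numerically. In this pure-Python sketch the logits z0 and z1 are arbitrary made-up values: a 2-way softmax gives the same class-1 probability as a sigmoid applied to the logit difference, whereas two independent sigmoids need not sum to 1.

```python
import math

def softmax(logits):
    """Softmax over a list of logits (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z0, z1 = 0.3, 1.7  # hypothetical logits for class 0 and class 1

p_softmax = softmax([z0, z1])             # Dense(2, softmax): sums to 1
p_sigmoid = sigmoid(z1 - z0)              # Dense(1, sigmoid) on the logit difference
independent = [sigmoid(z0), sigmoid(z1)]  # Dense(2, sigmoid): no sum constraint

print(p_softmax, p_sigmoid, sum(independent))
```

This is why Dense(2, sigmoid) only makes sense when the two labels are not mutually exclusive.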

Getting keras LSTM layer to accept two inputs?

I'm working with padded sequences of maximum length 50. I have two types of sequence data:
1) A sequence, seq1, of integers (1-100) that correspond to event types, e.g. [3,6,3,1,45,45....3]
2) A sequence, seq2, of integers representing time, in minutes, from the last event in seq1. So the last element is zero, by definition. So for example [100, 96, 96, 45, 44, 12,... 0]. seq1 and seq2 are the same length, 50.
I'm trying to run the LSTM primarily on the event/seq1 data, but have the time/seq2 strongly influence the forget gate within the LSTM. The reason for this is I want the LSTM to tend to really penalize older events and be more likely to forget them. I was thinking about multiplying the forget weight by the inverse of the current value of the time/seq2 sequence. Or maybe (1/seq2_element + 1), to handle cases where it's zero minutes.
I see in the keras code (LSTMCell class) where the change would have to be:
f = self.recurrent_activation(x_f + K.dot(h_tm1_f, self.recurrent_kernel_f))
So I need to modify keras' LSTM code to accept multiple inputs. As an initial test, within the LSTMCell class, I changed the call function to look like this:
def call(self, inputs, states, training=None):
    time_input = inputs[1]
    inputs = inputs[0]
So that it can handle two inputs given as a list.
When I try running the model with the Functional API:
# Input 1: event type sequences
# Take the event integer sequences, run them through an embedding layer to get float vectors, then run through LSTM
main_input = Input(shape =(max_seq_length,), dtype = 'int32', name = 'main_input')
x = Embedding(output_dim = embedding_length, input_dim = num_unique_event_symbols, input_length = max_seq_length, mask_zero=True)(main_input)
## Input 2: time vectors
auxiliary_input = Input(shape=(max_seq_length,1), dtype='float32', name='aux_input')
m = Masking(mask_value = 99999999.0)(auxiliary_input)
lstm_out = LSTM(32)(x, time_vector = m)
# Auxiliary loss here from first input
auxiliary_output = Dense(1, activation='sigmoid', name='aux_output')(lstm_out)
# An arbitrary number of dense, hidden layers here
x = Dense(64, activation='relu')(lstm_out)
# The main output node
main_output = Dense(1, activation='sigmoid', name='main_output')(x)
## Compile and fit the model
model = Model(inputs=[main_input, auxiliary_input], outputs=[main_output, auxiliary_output])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'], loss_weights=[1., 0.2])
np.random.seed(21)
model.fit([train_X1, train_X2], [train_Y, train_Y], epochs=1, batch_size=200)
However, I get the following error:
An `initial_state` was passed that is not compatible with `cell.state_size`. Received `state_spec`=[InputSpec(shape=(None, 50, 1), ndim=3)]; however `cell.state_size` is (32, 32)
Any advice?
You can't pass a list of inputs to the default recurrent layers in Keras. The input_spec is fixed and the recurrent code is implemented for a single input tensor, as also pointed out in the documentation; i.e. it doesn't magically iterate over 2 inputs with the same timesteps and pass them to the cell. This is partly because of how the iterations are optimised and the assumptions made if the network is unrolled, etc.
If you would like 2 inputs, you can pass constants (doc) to the cell, which will pass the tensor as is. This is mainly to implement attention models in the future. So 1 input will iterate over timesteps while the other will not. If you really want 2 inputs to be iterated like a zip() in Python, you will have to implement a custom layer.
I would like to throw in a couple of different ideas here. They don't require you to modify the Keras code.
After the embedding layer of the event types, stack the embeddings with the elapsed time. The Keras function is keras.layers.Concatenate(axis=-1). Imagine this: a single event type is mapped to an n-dimensional vector by the embedding layer. You just add the elapsed time as one more dimension after the embedding so that it becomes an (n+1)-dimensional vector.
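The stacking idea can be sketched without Keras, using plain lists: each timestep's n-dimensional embedding gets the elapsed time appended as one extra feature. The embedding values and the helper name are made up for illustration; in a model this step would be the Concatenate(axis=-1) call on the embedding output and the time input.

```python
def append_elapsed_time(embedded_sequence, elapsed_minutes):
    """Turn n-dim embeddings into (n+1)-dim vectors by appending the time value."""
    return [vector + [float(minutes)]
            for vector, minutes in zip(embedded_sequence, elapsed_minutes)]

# Hypothetical 3-dimensional embeddings for a sequence of three events.
embedded = [[0.1, -0.4, 0.7], [0.0, 0.2, 0.5], [0.9, 0.3, -0.1]]
minutes = [100, 45, 0]  # time from the last event, as in seq2

combined = append_elapsed_time(embedded, minutes)
print(combined)
```

The recurrent layer then sees a single input tensor with n+1 features per timestep, so no change to the LSTM code is needed. Since raw minute counts are much larger than typical embedding values, it may be worth rescaling them first.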
Another idea, sort of related to your problem/question and one that may help here, is 1D convolution. The convolution can happen right after the concatenated embeddings. The intuition for applying convolution to event types and elapsed time is actually a 1x1 convolution, in which you linearly combine the two and the parameters are trained. Note that, in convolution terms, the dimensions of the vectors are called channels. Of course, you can also convolve more than 1 event at a step. Just try it. It may or may not help.