For educational purposes I've been building a deep learning library for some time now. A few days ago I received a task
for an intern position: create a model from scratch using numpy that classifies digits from a subset of the MNIST dataset into 2 classes (0 - odd, 1 - prime). Everything was going well until
it was time to create a loss function. Because it is a
binary classification problem, I chose binary crossentropy.
Here is the implementation:
def loss(self, target: np.ndarray, predicted: np.ndarray, epsilon=1e-7) -> np.ndarray:
    predicted = np.clip(predicted, epsilon, 1 - epsilon)
    predicted = np.log(predicted / (1 - predicted))
    return (target * -np.log(self.sigmoid(predicted)) +
            (1 - target) * -np.log(1 - self.sigmoid(predicted)))
Basically it is almost the same function that Keras has for its numpy backend. The output from this loss function with batch
size 16 is as follows:
I strongly suspect the values should not look like this.
Maybe the problem is with the dataset, which we had to refactor
ourselves. To clarify, a typical sample is just a 28x28 matrix of pixel values, and the label is a single number, 0 or 1.
The next problem occurs when I try to sum up the loss for a whole epoch and save it to something like the Keras history object.
Should I sum up the losses for every batch iteration and then
divide by the number of samples (which sounds wrong to me), or do I have to calculate the epoch loss some other way?
Thanks in advance for the help, and stay safe and healthy!
I believe your current output is for a mini-batch; otherwise your 'predicted' would be a single value and not an ndarray.
Also, what do you mean by epoch loss? You should be computing the loss for every mini-batch, which is the average loss as described.
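As a minimal NumPy sketch of how the per-batch average and a per-epoch figure can be reconciled: compute the mean binary cross-entropy for each mini-batch, then combine the batch means weighted by batch size. That weighted mean equals the mean over all samples, even when the last batch is smaller. The names `batch_bce` and `epoch_loss` are hypothetical, and the labels and probabilities are made up for illustration.

```python
import numpy as np

def batch_bce(target, predicted, epsilon=1e-7):
    """Mean binary cross-entropy over one mini-batch (a single scalar)."""
    predicted = np.clip(predicted, epsilon, 1 - epsilon)
    per_sample = -(target * np.log(predicted)
                   + (1 - target) * np.log(1 - predicted))
    return per_sample.mean()

def epoch_loss(batch_losses, batch_sizes):
    """Size-weighted mean of the batch means, i.e. the mean over all samples."""
    return np.average(batch_losses, weights=batch_sizes)

# Made-up data: 40 samples with batch size 16, so the last batch has 8 samples.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=40).astype(float)
p = rng.uniform(0.05, 0.95, size=40)

losses, sizes = [], []
for start in range(0, len(y), 16):
    y_batch, p_batch = y[start:start + 16], p[start:start + 16]
    losses.append(batch_bce(y_batch, p_batch))
    sizes.append(len(y_batch))

epoch = epoch_loss(losses, sizes)
```

A plain unweighted mean of the batch means would give the same number only when every batch has the same size.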
I'm training a network with temporal data, and determine which of ~60 outputs are "active" at any given timestep (classified as 1 or 0 in the label data) - so I have an output of 60x1 floats that should represent a probability.
My input data is shaped as (X, 1, frames, dataPoints) - where X is the number of recorded sequences I have (I'm new to ML, I think this is 'batches'), frames is how long the longest sequence is (the rest are -1 padded and masked), and dataPoints is the actual input data for any given frame.
This is mostly an LSTM layer with return_sequences, but my input data is unbalanced.
For any given timestep, odds are ~85% that AN output is activated - but for any given output it's likely active at most 5% of the time.
When I attempted to apply a class weight of {0: 0.01, 1:0.99} (pending tuning), I get an error stating "class_weight not supported for 3+ dimensional targets". I've done some googling and people are suggesting compiling with sample_weight_mode of temporal and modifying sample weight, but (A) that doesn't seem right for my data (no individual sample is more important, but each 1 classification within all the samples is important), and (B) I don't understand the dimensionality of what that's doing.
How can I apply the class weighting to help balance each 1 classification with this data structure?
Side note: I'm rescaling the output of the LSTM to 0->1 since it uses tanh activation (and must use tanh activation for CUDA acceleration), and from_logits=False in my binary cross entropy loss.
Extra points if I can just use built-in tf/keras stuff and not have to write a custom loss function.
EDIT to include some code:
I have a data generator that outputs x and y in the shape of:
x.shape == (1, frameCount, inputFeatureLength) where frameCount is the number of frames in the temporal sequence, and inputFeatureLength is the size of the input data (around 100).
y.shape == (1, frameCount, outputSize) where outputSize is about 60 features.
I can successfully compile the model, but when I try to fit with class_weight={0:0.01, 1:0.99} as an argument, I get the error ValueError: class_weight not supported for 3+ dimensional targets.
I've looked into sample weights, but as far as I can tell, even sample_weight_mode="temporal" will only let me give sample weights per frame of output, not per each of the ~60 outputs per frame.
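Since class_weight is rejected for 3-D targets, one commonly suggested route is to fold the weighting into the loss itself. The following is only a NumPy sketch of that arithmetic, not a built-in Keras feature: every element of the target gets weight 0.99 if it is 1 and 0.01 if it is 0 (the values from the question). The name `weighted_bce` is hypothetical, and in Keras the same expression would have to be written with TensorFlow ops inside a custom loss.

```python
import numpy as np

def weighted_bce(y_true, y_pred, w0=0.01, w1=0.99, epsilon=1e-7):
    """Binary cross-entropy with a separate weight for 0 and 1 targets.

    Works elementwise, so it accepts (batch, frames, outputs) arrays.
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    bce = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    weights = y_true * w1 + (1 - y_true) * w0   # per-element class weight
    return (weights * bce).mean()

# Made-up target shaped like the question's y: (1, frameCount, outputSize),
# where only output 3 is ever active.
y_true = np.zeros((1, 5, 60))
y_true[0, :, 3] = 1.0

never_active = np.full_like(y_true, 0.01)      # predicts "off" everywhere
mostly_right = np.where(y_true == 1, 0.9, 0.01)

loss_never = weighted_bce(y_true, never_active)
loss_right = weighted_bce(y_true, mostly_right)
```

With these weights, a prediction that misses the rare 1s is penalized far more heavily than one that finds them, which is the balancing effect the question is after.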
I am trying to create a custom loss function for a regression problem that would minimize the number of elements that fall above a certain threshold. My code for this is:
import tensorflow as tf
epsilon = 0.000001
def custom_loss(actual, predicted): # loss
    actual = actual * 12
    predicted = predicted * 12
    # outputs a value between 1 and 20
    vector = tf.sqrt(2 * (tf.square(predicted - actual + epsilon)) / (predicted + actual + epsilon))
    # Count number of elements above threshold value of 5
    fail_count = tf.cast(tf.size(vector[vector>5]), tf.float32)
    return fail_count
However, I run into the following error:
ValueError: No gradients provided for any variable: ...
How do I solve this problem?
I don't think you can use this loss function, because the loss does not vary smoothly as the model parameters vary - it will jump from one value to another as the parameters pass a threshold point. So tensorflow can't calculate gradients, and so can't train the model.
It's the same reason that 'number of images incorrectly classified' isn't used as a loss function, and categorical cross-entropy, which does vary smoothly as parameters change, is used instead.
You may need to find a smoothly varying function that approximates what you want.
[Added after your response below...]
This might do it. It becomes closer to your function as temperature is reduced. But it may not have good training dynamics, and there could be better solutions out there. One approach might be to start training with relatively large temperature, and reduce it as training progresses.
temperature = 1.0
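A minimal sketch of one such smooth approximation, assuming a sigmoid is used as the soft step (this is an illustration, not necessarily the code the answer originally contained). It is written in NumPy only to show the values; the same expression built from tf.sigmoid and tf.reduce_sum would be differentiable. The names `hard_count` and `soft_count` and the sample values are hypothetical.

```python
import numpy as np

def hard_count(vector, threshold=5.0):
    """Number of elements strictly above the threshold (not differentiable)."""
    return float(np.sum(vector > threshold))

def soft_count(vector, threshold=5.0, temperature=1.0):
    """Smooth surrogate: each element contributes sigmoid((v - threshold) / T).

    Elements far above the threshold contribute about 1, elements far below
    about 0, and the transition sharpens as the temperature shrinks.
    """
    z = (vector - threshold) / temperature
    return float(np.sum(1.0 / (1.0 + np.exp(-z))))

vector = np.array([1.0, 4.0, 6.0, 9.0, 12.0])   # made-up values

exact = hard_count(vector)
warm = soft_count(vector, temperature=5.0)
cold = soft_count(vector, temperature=0.01)
```

Here `cold` is practically equal to `exact`, while `warm` is a blurrier estimate with more useful gradients, which is the trade-off behind starting with a large temperature and reducing it.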
Running tensorflow 2.x in Colab with its internal keras version (tf.keras). My model is a 3D convolutional UNET for multiclass segmentation (not sure if it's relevant).
I've successfully trained (high enough accuracy on validation) this model the traditional way but I'd like to do augmentation to improve it, therefore I'm switching to (hand-written) generators. When I use generators I see my loss increasing and my accuracy decreasing a lot (e.g.: loss increasing 4-fold, not some %) in the fit.
To try to localize the issue I've tried loading my trained weights and computing the metrics on the data returned by the generators. And what's happening makes no sense. I can see that the results visually are ok.
2s 2s/step - loss: 0.4037 - categorical_accuracy: 0.8716
2s/step - loss: 1.7825 - categorical_accuracy: 0.7158
7s 2s/step - loss: 1.7478 - categorical_accuracy: 0.7038
Why would the loss vary with the number of steps? I could guess some % due to statistical variations... not 4 fold increase!
If I try
x,y = next(validationGenerator)
nSamples = x.shape[0]
meanLoss = np.zeros(nSamples)
meanAcc = np.zeros(nSamples)
for pIdx in range(nSamples):
    y_pred = model.predict(np.expand_dims(x[pIdx,:,:,:,:],axis=0))
I get accuracy ~85% and loss ~0.44, which is what I expect from the previous fit, and it varies very little from one batch to the other. These are exactly the same numbers that I get if I do model.evaluate() with 1 step (using the same generator function).
However, I need about 30 steps to run through my whole training dataset. What should I do?
If I fit my already good model to this generator, it indeed worsens performance a lot (it goes from a nice segmentation of the image to uniform predictions of 25% for each of the 4 classes!!!!)
Any idea on where to debug the issue? I've also visually looked at the images produced by the generator and at the model predictions and everything looks correct (as confirmed by the numbers I found when evaluating using a single step). I've tried writing a minimal working example with a 2-layer model but... in it the issue does not happen.
UPDATE: Generators code
So, as I've been asked, this is the generator code. It is handwritten.
def dataGen (X,Y_train):
    patchS = 64 #set the size of the patch I extract
    batchS = 16 #number of samples per batch
    nSamples = X.shape[0] #get total number of samples
    immSize = X.shape[1:] #get the shape of the image to crop
    #Get 4 patches from each image
    #extract them randomly, and in random patient order
    patList = np.array(range(0,nSamples),dtype='int16')
    patList = patList.reshape(nSamples,1)
    patList = np.tile(patList,(4,2))
    patList[:nSamples,0]=0 #Use this index to tell the code where to get the patch from
    Xout = np.zeros((batchS,patchS,patchS,patchS,immSize[3])) #allocate output vector
    while True:
        Yout = np.zeros((batchS,patchS,patchS,patchS)) #allocate vector of labels
        for patIdx in range(batchS):
            XSR = 32* (patList[patStart+patIdx,0]//2) #get the index of where to extract the patch
            YSR = 32* (patList[patStart+patIdx,0]%2)
            xStart = random.randrange(XSR,XSR+32) #get a patch randomly somewhere between a range
            yStart = random.randrange(YSR,YSR+32)
            zStart = random.randrange(0,26)
            patInd = patList[patStart+patIdx,1]
            Xout[patIdx,:,:,:,:] = X[patInd,xStart:(xStart+patchS),yStart:(yStart+patchS),zStart:(zStart+patchS),:]
            Yout[patIdx,:,:,:] = Y_train[patInd,xStart:(xStart+patchS),yStart:(yStart+patchS),zStart:(zStart+patchS)]
        np.random.shuffle(patList) #after going through the whole list restart
        patStart = patStart+batchS
        Yout = tf.keras.utils.to_categorical (Yout, num_classes=4, dtype='float32') #convert to one hot encoding
        yield Xout, Yout
Posting the workaround I've found for the future person coming here from google.
Apparently the issue lies in how keras calls a handwritten generator. When it was called multiple times in a row by using evaluate(gen, steps=N) apparently it returned wrong outputs. There's no documentation around about how to address this or how a generator should be written.
I ended up rewriting my code as a tf.keras.utils.Sequence class, and the same code as before now works perfectly. No way to know why.
Here are different factors that affect loss & accuracy:
Accuracy measures how often the prediction is correct, i.e. correct predictions / total predictions.
Loss, on the other hand, tracks the inverse confidence of the prediction.
A high loss indicates that, although the model may still be predicting the right class, it is becoming uncertain of the predictions it is making.
For example, in an image classification scenario, the image of a cat is passed into two models. Model A predicts {cat: 0.8, dog: 0.2} and model B predicts {cat: 0.6, dog: 0.4}.
Both models will score the same accuracy, but model B will have a higher loss.
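To make the cat example concrete, here is a small worked sketch. It assumes the loss is categorical cross-entropy with a one-hot target, so only the probability assigned to the true class ("cat") matters; the helper name `cross_entropy` is hypothetical.

```python
import math

def cross_entropy(prob_of_true_class):
    """Categorical cross-entropy for a one-hot target: -log(p_true)."""
    return -math.log(prob_of_true_class)

# Both models pick "cat" (the larger probability), so both are counted correct.
model_a = {"cat": 0.8, "dog": 0.2}
model_b = {"cat": 0.6, "dog": 0.4}

pred_a = max(model_a, key=model_a.get)
pred_b = max(model_b, key=model_b.get)

loss_a = cross_entropy(model_a["cat"])   # about 0.223
loss_b = cross_entropy(model_b["cat"])   # about 0.511
```

Same accuracy, but model B's loss is more than twice model A's because it is less confident in the correct class.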
On the evaluation part, based on the documentation:
Steps: Integer or None. Total number of steps (batches of samples) before declaring the evaluation round finished. Ignored with the default value of None. If x is a dataset and steps is None, 'evaluate' will run until the dataset is exhausted. This argument is not supported by array inputs.
To put it simply, evaluation runs over N batches of your validation samples.
It could be that the model's predictions become uncertain because the majority of the unfamiliar data falls in specific steps, in your case steps 2 & 3.
So, as the evaluation steps progress, the predictions become more uncertain, leading to a higher loss.
You might need to retrain your model with more training samples, but of course you need to be careful, since you might encounter overfitting.
In terms of data augmentation, you might want to check this link
From a training perspective, proper data augmentation is one of the factors that lead to good model performance.
Suppose my data consists of images of bubbles, and the labels are histograms describing the distribution of sizes, for example:
0-10mm 10%
10-20mm 30%
20-30mm 40%
30-40mm 20%
It is important to note that -
All size percentages sum to 100% (or 1.0 to be more precise).
I don't have annotated data, so I can't train an object detector and then just calculate the distribution by counting detected objects. However, I do have a feature extractor trained on my data.
I implemented a simple CNN that consists of -
Resnet50 backbone.
Global max pooling.
1x1 convolution of 6 filters (6 distribution bins in labels).
After some experiments I came to the conclusion that softmax with cross entropy as the loss function does not suit my problem and needs.
I thought that maybe a cosine similarity loss, with a light modification, may be a good alternative (normalization will be part of post process). This is the implementation:
def cosine_similarity_loss(logits, probs, weights=1.0, label_smoothing=0):
    x1_val = tf.sqrt(tf.reduce_sum(tf.matmul(logits, tf.transpose(logits)), axis=1))
    x2_val = tf.sqrt(tf.reduce_sum(tf.matmul(probs, tf.transpose(probs)), axis=1))
    denom = tf.multiply(x1_val, x2_val)
    num = tf.reduce_sum(tf.multiply(logits, probs), axis=1)
    cosine_sim = tf.math.divide(num, denom)
    cosine_dist = tf.math.reduce_mean(1 - tf.square(cosine_sim)) # Cosine Distance. Reduce mean for shape compatibility.
    return cosine_dist
The loss is the sum of the cosine distance and L2 regularization on the weights. After the first forward pass I got loss: 3.1267 and after the second forward pass I got loss: 96003645440.0000 - meaning the weights exploded (logits: [[-785595.812 -553858.625 -545579.625 -148547.875 -12845.8633 19871.1055]] while probs: [[0.466 0.297 0.19 0.047 0 0]]).
What could be the reason for such rapid and extreme increase?
My guess is that cosine distance does an internal normalisation of the logits, removing the magnitude, and thus there is no gradient to propagate that opposes the values increasing. BTW, weights is not used in your implementation.
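The scale invariance behind that guess can be checked numerically. This is a small NumPy sketch with made-up logits (only `probs` is taken from the question), and `cosine_similarity` is a hypothetical helper: multiplying the logits by a million leaves the similarity, and therefore the cosine part of the loss, unchanged.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores their magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

probs = np.array([0.466, 0.297, 0.19, 0.047, 0.0, 0.0])   # from the question
logits = np.array([0.5, 0.3, 0.2, 0.1, 0.0, 0.0])         # made-up values

sim_small = cosine_similarity(logits, probs)
sim_huge = cosine_similarity(1e6 * logits, probs)   # same direction, huge norm
```

Because the loss cannot tell the two apart, nothing in the cosine term pulls the magnitude of the logits back toward a reasonable range.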
What about just plain Euclidean distance, using sigmoid instead of softmax in the last layer? Also, I would try adding another one or two dense layers (say size 512) between resnet50 and the output dense layer.
I am using autoencoders to do anomaly detection. I have finished training my model, and now I want to calculate the reconstruction loss for each entry in the dataset, so that I can flag data points with high reconstruction loss as anomalies.
This is my current code to calculate the reconstruction loss
But this is really slow. By my estimation, it should take 5 hours to go through the dataset whereas training one epoch occurs in approx 55 mins.
I feel that converting to tensor operation is bottlenecking the code, but I can't find a better way to do it.
I've tried changing the batch sizes but it does not make much of a difference. I have to use the convert to tensor part because K.eval is throwing an error if I do it normally.
for i in range(0, encoded_dataset.shape[0], batch_size):
    y_true = tf.convert_to_tensor(encoded_dataset[i:i+batch_size].values,
    y_pred= tf.convert_to_tensor(ae1.predict(encoded_dataset[i:i+batch_size].values),
    # Append the batch losses (numpy array) to the list
    reconstruction_loss_transaction.append(K.eval(loss_function( y_true, y_pred)))
I was able to train in 55 mins per epoch. So I feel prediction should not take 5 hours per epoch. encoded_dataset is a variable that has the entire dataset in main memory as a data frame.
I am using Azure VM instance.
K.eval(loss_function(y_true, y_pred)) finds the loss for each row of the batch.
So y_true will be of size (batch_size, 2000), and so will y_pred.
K.eval(loss_function(y_true, y_pred)) will give me an output of size (batch_size, 1), evaluating binary cross entropy on each row of y_true and y_pred.
Moved from comments:
My suspicion is that ae1.predict and K.eval(loss_function) are behaving in unexpected ways. ae1.predict can be used to output the loss function value as well as y_pred. When you create the model, specify that the loss value is another output (you can have a list of multiple outputs), then just call predict here once to get both y_pred and the loss value in one call.
But I want the loss for each row. Won't the loss returned by the predict method be the mean loss for the entire batch?
The answer depends on how the loss function is implemented. Both ways produce perfectly valid results in TF under the hood, and they are equivalent up to a constant scale factor. You could average the loss over the batch before taking the gradient w.r.t. the loss, or take the gradient w.r.t. a vector of losses. The gradient operation in TF will reduce the vector of losses for you if you use the latter approach, by summing them (see SO articles on taking the per-sample gradient, it's actually hard to do).
If Keras implements the loss with reduce_mean built into the loss, you could just define your own loss. If you're using squared loss, replace 'mean_squared_error' with lambda y_true, y_pred: tf.square(y_pred - y_true). That would produce the squared error instead of MSE (no difference to the direction of the gradient), but look here for the variant including the mean.
In any case this produces a per sample loss so long as you don't use tf.reduce_mean, which is purely optional in the loss. Another option is to simply compute the loss separately from what you optimize for and make that an output of the model, also perfectly valid.
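As a rough sketch of the per-row idea for the binary cross-entropy case from the question: once the predictions are available as a plain array (for example from a single ae1.predict call over the data), the per-row loss can be computed in NumPy without tensor conversion or K.eval. The function name `per_row_bce` and the random data are assumptions for illustration; the mean over the feature axis mirrors what Keras' binary cross-entropy does per sample.

```python
import numpy as np

def per_row_bce(y_true, y_pred, epsilon=1e-7):
    """Binary cross-entropy averaged over features, one value per row."""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    elementwise = -(y_true * np.log(y_pred)
                    + (1 - y_true) * np.log(1 - y_pred))
    return elementwise.mean(axis=1)   # shape (n_rows,), no mean over the batch

# Made-up stand-ins for the encoded rows and their reconstructions.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(8, 2000)).astype(float)
y_pred = rng.uniform(0.01, 0.99, size=(8, 2000))

row_losses = per_row_bce(y_true, y_pred)
```

Rows with the largest values in `row_losses` would then be the anomaly candidates.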