How to stop model from predicting probabilities? - tensorflow

I have created a model that is meant to predict when a curve has reached a maximum or a minimum, and also to recognize when the curve is at neither.
To do this I labeled my dataset with another column that indicates, "1" if the curve has reached a minimum, "2" if the curve has reached a maximum, and "0" if the curve is not at a maximum or a minimum. I used the savgol_filter from scipy.signal to smooth the data so that the noisy maximums and minimums could be ignored.
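Roughly, the labelling step looks like this (a minimal sketch rather than my exact code; the use of argrelextrema and the filter window/order are illustrative):
import numpy as np
from scipy.signal import savgol_filter, argrelextrema

def label_extrema(y, window=51, polyorder=3):
    # smooth the raw curve so noisy local maxima/minima are ignored
    smooth = savgol_filter(y, window_length=window, polyorder=polyorder)
    labels = np.zeros(len(y), dtype=int)              # 0 = not at a max or min
    labels[argrelextrema(smooth, np.less)[0]] = 1     # 1 = reached a minimum
    labels[argrelextrema(smooth, np.greater)[0]] = 2  # 2 = reached a maximum
    return labels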
The real issue comes during training. Instead of learning, the model predicts probabilities for 0, 1, and 2 that simply reflect how often each label occurs in the dataset. This results in a near-constant spread across all three output columns, with no actual prediction taking place, and the training loss does not improve either.
Example of what the predictions look like:
[0.80979747 0.09480771 0.09539483] - 0.0
[0.8098259 0.09479709 0.09537704] - 1.0
[0.8097819 0.0948175 0.09540048] - 0.0
[0.80970675 0.09484979 0.09544353] - 0.0
[0.80981696 0.09480083 0.09538227] - 0.0
[0.80991155 0.09476246 0.09532603] - 2.0
[0.8098903 0.09477566 0.09533402] - 0.0
Some things that I have tried:
Using weights:
Weights just made the model predict 0 less often, but training still did not improve and the predictions kept the same constant spread.
Predicting only whether the output will be 0:
Reframing the task as predicting whether or not the output is 0 resulted in the same probability-like predictions.
I have a feeling that, with the data I have available, this may be the most accurate prediction the model can make.
Are there any other solutions, or is my hunch correct?

Related

Unstable loss in binary classification for time-series data - extremely imbalanced dataset

I am working on a deep learning model to detect regions of timesteps with anomalies. The model should classify each timestep as containing an anomaly or not.
My labels are something like this:
labels = [0 0 0 1 0 0 0 0 1 0 0 0 ...]
The 0s represent 'normal' timesteps and the 1s represent the existence of an anomaly. In reality, my dataset is very very imbalanced:
My training set consists of over 7000 samples, where only 1400 samples = 20% of those contain at least 1 anomaly (timestep = 1)
I am feeding samples with 4096 timesteps each. The average number of anomalies, in the samples that contain them, is around 2. So, assuming there is an anomaly, the % of anomalous timesteps ranges from 0.02% to 0.04% for each sample.
With that said, I do need to shift from standard binary cross-entropy to something that distinguishes the anomalous timesteps from the anomaly-free ones.
So, I experimented with adding weights to the anomalous class so that the model is forced to learn from the anomalies rather than just reducing its loss on the anomaly-free timesteps. It actually worked well and the model seems to learn to detect anomalous timesteps. One problem, however, is that training can become quite unstable (and unpredictable), with sudden loss spikes appearing and affecting the learning process; my loss and metric charts for two trainings show the effect clearly.
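The kind of weighting I mean is roughly equivalent to a custom loss like the one below (a sketch for illustration only; the weight value of 500.0 is made up and my actual implementation may differ):
import tensorflow as tf

def weighted_bce(anomaly_weight):
    # per-timestep binary cross entropy that up-weights the anomalous (label 1) timesteps
    def loss(y_true, y_pred):
        # y_true, y_pred have shape (batch, timesteps, 1)
        bce = tf.keras.losses.binary_crossentropy(y_true, y_pred)  # -> (batch, timesteps)
        weights = 1.0 + (anomaly_weight - 1.0) * tf.cast(tf.squeeze(y_true, axis=-1), tf.float32)
        return tf.reduce_mean(weights * bce)
    return loss

# used as: model.compile(optimizer='adam', loss=weighted_bce(500.0))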
After going through a debugging process for the trainings, I am confident that the problem comes from occasional predictions given for the anomalous timesteps. That is, in some samples of a certain epoch, and in some anomalous timesteps, the model gives a very low prediction, e.g. 0.01, for the 1 label (which should of course be close to 1). Given the very high (but supposedly necessary) weights assigned to the anomalous timesteps, the penalty is extreme and the loss just skyrockets.
Going deeper: if I inspect the losses of the sample where the jump happened and look at the batch right before the loss jumped, the losses are all around 10^-2 (0.0053, 0.004, 0.0041, ...), with not a single sample above those values and an overall average loss of about 0.005. If I inspect the loss of the following batch in that same sample, however, the average batch loss is already 3.6, with part of the samples at a low loss but another part at a very high loss (e.g. 9.2, 7.7, 8.9, ...). I can confirm that all the high losses come from the penalties incurred when predicting the 1 timesteps. The following batches of the same sample, and some batches of the next epoch, are affected and take some time to start decreasing again and return to a stable learning process.
With this said, I have been having this problem for some weeks now and really need some guidance on what I could try to deal with the spikes, which I assume arise from gradient updates associated with anomalous timesteps that are harder to learn.
I am currently using a simple 2-layer Keras LSTM model with 64 units in each layer, followed by a final dense layer with 1 unit and sigmoid activation. As for the optimizer, I am using Adam, and I am training with a batch size of 128. Some things to consider also:
I have tried changing the weights and trying other loss functions. Ultimately, if I reduce the weights given to the anomalous timesteps, the model doesn't give them much importance and the loss decreases by fitting only the anomaly-free timesteps. I have also considered focal binary cross-entropy loss, but it doesn't seem to do anything to avoid those jumps since, in the end, it is still about adding or reducing weights for certain timesteps.
My current learning rate is Adam's default, 1e-3. I have tried reducing the learning rate, which leads to less impactful spikes (they're still there, though), but the model also takes much more time to train or gets stuck. I am not sure this is the way to go here, as training seems to go well except for these spikes. A decaying learning rate might not make much sense either, since the spikes can happen early in training, not only in later epochs.
I am still investigating gradient clipping as a solution. I am not yet sure what values to use or whether it is actually an effective solution for my case, but from what I understand of it, it should help counter the jumps resulting from those 'almost' exploding gradients (a rough sketch of how this could be wired up in Keras appears after this list).
The spikes could originate from sample noise / bad samples. However, I am already using a batch size of 128, and I have also tested training with simple synthetic samples I created and the spikes were still there, so I don't think the problem lies with specific samples.
The imbalance obviously plays the biggest role here. I am not sure whether undersampling the majority class of 4096-timestep samples (e.g. increasing the share of samples with at least one anomalous timestep from 20% to 50%) would make a big difference, since each sample is itself very imbalanced, containing only around 2 anomalous timesteps. The problem is the imbalance within each sample.
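For reference, the setup I described, with gradient clipping added, would look roughly like this (a sketch only; the clipnorm value of 1.0 and the feature count are guesses I have not validated):
import tensorflow as tf

num_features = 1  # placeholder; depends on what is fed per timestep

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(4096, num_features)),
    tf.keras.layers.LSTM(64, return_sequences=True),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # one prediction per timestep
])

# clipnorm caps the norm of each weight's gradient before the Adam update
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer, loss='binary_crossentropy')  # or the weighted loss described above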
I know this is quite a lot of context, but honestly I have been trying things for weeks and I am at my limit.
The solutions I am inclined to try next are either gradient clipping or changing my samples to be more centered around the anomalous timesteps, so that they contain fewer anomaly-free timesteps and hopefully allow convergence without having to apply such drastic weights to the anomalous timesteps. This last option is more difficult for me to opt for due to some restrictions, but I might look at it if I have nothing else available.
What do you think? I am able to provide more information if needed.

Regression accuracy with neural network in low density regions

I am developing a neural net which needs to predict values between -1 and 1. However, I am only really concerned about the values at the ends of scale, say between -1 and -0.7 and between 0.7 and 1.
I do not mind if 0.6, for example, gets predicted to be 0.1. However, I do want to know if it's 0.8 or 0.9.
The distribution of my data is roughly normal, so there are many more samples in the range where I'm not concerned about the accuracy. It seems therefore that the training process is likely to lead to greater accuracy in the centre.
How can I configure the training or engineer my expected result to overcome this?
Thanks very much.
You could assign the observations to deciles, turn it into a classification problem, and either assign a greater weight to the ranges you care about in the loss or simply oversample them during training. By default, I'd go with weighting the classes in the loss function, as it is straightforward to match with a weighted metric. Oversampling can be useful if you know that the distribution of your training data differs from the real data distribution.
To assign certain classes a greater weight in the loss function with Keras, you can pass a class_weight parameter to Model.fit. If label 0 is the first decile and label 9 is the last decile, you could double the weight of the first and last two deciles as follows:
class_weight = {
    0: 2,
    1: 2,
    2: 1,
    3: 1,
    4: 1,
    5: 1,
    6: 1,
    7: 1,
    8: 2,
    9: 2,
}
model.fit(..., class_weight=class_weight)
To oversample certain classes, you'd include them more often in the batches than the class distribution would suggest. The simplest way to implement this is to sample observation indices with numpy.random.choice that has an optional parameter to specify probabilities for each entry. (Note that Keras Model.fit also has a sample_weight parameter where you can assign weights to each observation in the training data that will be applied when computing the loss function, but the intended use case is to weigh samples by the confidence in their labels, so I don't think it's applicable here.)
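A rough sketch of the oversampling option (illustrative only; labels is assumed to be an integer array of decile classes 0-9 and x_train the matching inputs):
import numpy as np

# double the sampling probability of the two outermost deciles on each side
class_boost = np.array([2, 2, 1, 1, 1, 1, 1, 1, 2, 2], dtype=float)
weights = class_boost[labels]
probabilities = weights / weights.sum()

# draw one epoch's worth of indices, oversampling the boosted deciles
indices = np.random.choice(len(labels), size=len(labels), replace=True, p=probabilities)
x_resampled, y_resampled = x_train[indices], labels[indices]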

Train neural network: Mathematical reason for Nan due to batch size

I am training a CNN. I use Google's pre-trained InceptionV3 with the last layer replaced for my classification task. During training, I had a lot of issues with my cross-entropy loss becoming nan. After trying different things (reducing the learning rate, checking the data, etc.), it turned out that the training batch size was too high.
Reducing the training batch size from 100 to 60 solved the issue. Can you explain why too high a batch size causes this issue with a cross-entropy loss function? Also, is there a way to overcome the issue so I can work with larger batch sizes (there is a paper suggesting batch sizes of 200+ images for better accuracy)?
Large network weights (resulting from exploding gradients) produce skewed probabilities in the softmax layer, for example [0, 1, 0, 0, 0] instead of [0.1, 0.6, 0.1, 0.1, 0.1], and these in turn produce numerically unstable values in the cross-entropy loss function.
cross_entropy = -tf.reduce_sum(y_ * tf.log(y))
Here y_ is the one-hot label and y is the softmax output. When the softmax output contains an exact 0 (as in the skewed example above), tf.log(y) is -inf and the 0 * log(0) terms evaluate to nan, so the loss becomes nan.
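A common workaround (sketched here for illustration; logits is assumed to be the pre-softmax output) is to clip the probabilities before the log, or to compute the loss directly from the logits so TensorFlow handles the numerics:
# clip predictions away from exact 0 so tf.log stays finite
cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

# or, more robustly, let TensorFlow compute the loss from the unnormalized logits
cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=logits))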
The main reason for the weights to become larger and larger is the exploding gradient problem. Let's consider the gradient update
∆w_ij = −η ∂E/∂w_ij
where η is the learning rate and ∂E/∂w_ij is the partial derivative of the loss with respect to the weight w_ij. Note that ∂E/∂w_ij is averaged over a mini-batch B, so the gradient update depends on both the mini-batch size |B| and the learning rate η.
In order to tackle this problem, you can reduce the learning rate. As a rule of thumb, it's better to start with an initial learning rate close to zero and increase it by a really small amount at a time while observing the loss.
Moreover, reducing the mini-batch size increases the variance of the stochastic gradient updates. This sometimes helps to mitigate nan by adding noise to the gradient update direction.

Tensorflow semantic segmentation gives zero loss

I am training a model to segment machine-printed text from images. The images might also contain barcodes and handwritten text. The ground-truth images are processed so that 0 represents machine print and 1 represents everything else, and I am using a 5-layer CNN with dilation that outputs 2 maps at the end.
And my loss is calculated as follows:
def loss(logits, labels):
    logits = tf.reshape(logits, [-1, 2])
    labels = tf.reshape(labels, [-1])
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=labels)
    cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
    return cross_entropy_mean
And I have some images which contain only handwritten text and their corresponding ground truths are blank pages which are represented by 1s.
When I train the model, for these images I am getting a loss of 0 and training accuracy of 100%. Is this correct? How can this loss be zero?
For other images, which contain barcodes or machine print, I am getting some loss and they are converging properly.
And when I test this model, barcodes are correctly ignored, but it outputs both machine print and handwritten text where I need only machine print.
Can someone guide me on where I am going wrong, please!
UPDATE 1:
I was using a learning rate of 0.01 before and changing it to 0.0001 gave me some loss and it seems to converge but not very well.
But then how can a high learning rate give a loss of 0?
When I use the same model in Caffe with a learning rate of 0.01, it gives some loss and converges better than it does in TensorFlow.
Your loss calculation looks fine but a loss of zero is weird in your case. Have you tried playing with the learning rate? Maybe decrease it. I have encountered weird loss values and decreasing the learning rate helped me.
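For example (a sketch only; the question does not say which optimizer is used, so AdamOptimizer here is an assumption):
# try a smaller learning rate, e.g. 1e-4 instead of 1e-2
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
train_op = optimizer.minimize(loss(logits, labels))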

Tensorflow Loss for Non-Independent Classes

I am using a Tensorflow network for classification between classes that are similar to their neighboring classes, i.e. not independent. For example, say we want to predict among 10 classes, but the predictions are not merely "correct" or "incorrect": if the correct class is 7 and the network predicts 6, the loss should be smaller than if the network predicted 5, because 6 is closer to the correct answer than 5. My understanding is that cross entropy with 1-hot vectors provides an "all or nothing" loss rather than a "continuous" loss that reflects the magnitude of the error. If that is correct, how does one implement such a continuous loss in Tensorflow?
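For example, the following quick check illustrates the "all or nothing" behaviour (a numpy sketch added for illustration, not part of my network code):
import numpy as np

true_class = 7
one_hot = np.eye(10)[true_class]

# two predictions: one puts its mass on class 6 (close), the other on class 5 (farther)
pred_close = np.full(10, 0.02); pred_close[6] = 0.82
pred_far   = np.full(10, 0.02); pred_far[5]  = 0.82

# cross entropy only looks at the probability assigned to the true class,
# so both predictions receive exactly the same loss (~3.91)
print(-np.sum(one_hot * np.log(pred_close)))
print(-np.sum(one_hot * np.log(pred_far)))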
-- Update June 13 2016 ----
An example application might be color recognition. If the network predicts "green" but the true color is yellow-green, then the loss should be less than if the network predicted blue because green is a better prediction than blue.
You can choose to implement a continuous quantity (e.g. hue from HSV) as a single output and construct your own loss calculation that reflects what you want to optimize. In that case you'd have just a single output value ranging between 0.0 and 1.0, and the loss would be evaluated based on the distance from the labeled value.
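A minimal sketch of that idea (illustrative only; the layer sizes and the RGB input are assumptions, not from the question):
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 3])  # e.g. an RGB colour as input
y = tf.placeholder(tf.float32, [None, 1])  # labeled value scaled to [0, 1], e.g. normalized hue

hidden = tf.layers.dense(x, 64, activation=tf.nn.relu)
prediction = tf.layers.dense(hidden, 1, activation=tf.sigmoid)  # single continuous output

# the loss grows with the distance from the label, so near misses cost less than far ones
loss = tf.reduce_mean(tf.abs(prediction - y))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)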