In TensorFlow, how to convert scores from a neural net into discrete values as part of the learning process

Hello fellow tensorflowians!
I have the following setup:
I input some continuous variables (actually, word embeddings I took from Google word2vec), and I am trying to predict an output that can be considered either continuous or discrete (sorry, mathematicians! it really depends on one's training goal).
The output takes values from 0 to 1000 with an interval of 0.25 (or a precision hyperparameter), so: 0, 0.25, 0.50, ..., 100.0.
I know that I cannot include something like tf.to_int (I can drop the fractional part if necessary) or tf.round, because these operations are not differentiable, so we can't backpropagate through them. However, I feel there should be some way to let the network "know" that it is searching for a rounded solution: small fractions of integers like 0.25 or 5.75. I just don't know where to look. I looked into quantization, but that seems to be overkill.
So my question is:
How do I inform the graph that I don't accept values below 0.0? Would taking the abs of the network output "logits" (regression predictions) be worth considering? If not, can I modify the loss term to severely punish scores below 0, and use absolute error instead of squared error? I may not be aware of the full consequences of doing that.
I don't care whether a prediction of 4.5 comes out as 4.49999 or 4.4, because I round predictions to the nearest 0.25 to compute accuracy, and that's my final model evaluation metric. If so, can I use something like:
precision = 0.01 # so that sqrt(precision) == 0.1
loss = tf.reduce_mean(tf.maximum(0.0, tf.square(tf.subtract(logits, targets)) - precision))
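One possible direction (a minimal sketch of my own, not a confirmed answer): keep the training loss differentiable, add a penalty term for negative predictions, and round to the 0.25 grid only at evaluation time, where no gradient is needed. The penalty weight 10.0 below is an arbitrary hyperparameter.

import tensorflow as tf

precision = 0.25  # the grid step predictions are rounded to at evaluation time

def loss_fn(logits, targets):
    squared_error = tf.square(logits - targets)
    # punish values below 0 without breaking differentiability
    negative_penalty = 10.0 * tf.square(tf.nn.relu(-logits))
    return tf.reduce_mean(squared_error + negative_penalty)

def eval_predictions(logits):
    # rounding is fine here because evaluation does not need gradients
    return tf.round(logits / precision) * precision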

Related

What is the expected output range from Keras custom loss function?

Context
I would like to implement a custom loss function. Given the input and a predicted output, there is a real-life loss that can be calculated from the predicted output and some known real-life facts associated with the input. I would prefer to use this real-life loss value as the loss function instead of any distance measure between the predicted output and the expected output.
This real-life loss for any given predicted output lies between -10.0 and 50.0, where higher is better; in other words, this is the optimization goal of the training.
Question
What would Keras expect (or use optimally) as the loss function output? Should the loss function output be normalized to, say, [0.0, 1.0]? Or should I just multiply [-10.0, 50.0] by -1 -> [-50.0, 10.0] and subtract 10.0 -> [-60.0, 0.0]?
edit: I meant here: Or just multiply [-10.0, 50.0] by -1 -> [-50.0, 10.0] and add 50.0 -> [0.0, 60.0]?
Note
I am a complete beginner with NNs, so if I am completely missing something here, please just point me in the right direction in as few words as possible.
From reading your question, I interpret the "real-life loss value belonging to the input" as the "ground truth"/"expected output" for any input.
As written, the question is too vague to distinguish your requirement from that.
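As far as the range question goes, Keras only needs a value it can minimize, so negating the score is usually enough; no normalization to [0.0, 1.0] is required. A minimal sketch, assuming the real-life score can be written with differentiable tensor ops (the score function below is a made-up placeholder, not your actual metric):

import tensorflow as tf

def real_life_score(y_true, y_pred):
    # hypothetical stand-in for the domain-specific score in roughly [-10, 50]
    return 50.0 - tf.reduce_mean(tf.abs(y_true - y_pred), axis=-1)

def custom_loss(y_true, y_pred):
    # Keras minimizes the loss, so flip the sign: lower loss == higher score
    return -real_life_score(y_true, y_pred)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss=custom_loss)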

Difference in output between TensorFlow and TF-Lite

I have a TensorFlow model that I converted to TensorFlow Lite, but there is a deviation in inference accuracy. Is that normal behaviour?
I found that the inference outputs of the two models differ after the fourth decimal place.
While training in TensorFlow:
All the variables and constants may be in dtype=float64, which carries more decimal places.
Since they are training variables, their values are not constant.
After converting to TensorFlow Lite:
The training variables are converted to constant operations, so their values are fixed.
When we run the lite model on Android or iOS, these values are converted to float32.
Hence some precision is lost in TensorFlow Lite.
On float32 precision
float32 is the default value type used in TensorFlow. Let's talk a bit about the float32 type and the importance of the order of operations. There is a neat table from this post that shows how the precision of a float degrades as its magnitude increases:
Float Value Float Precision
1 1.19E-07
10 9.54E-07
100 7.63E-06
1,000 6.10E-05
10,000 0.000977
100,000 0.00781
1,000,000 0.0625
10,000,000 1
100,000,000 8
1,000,000,000 64
What does this tell us? In float32 you cannot expect exact values; you only get discretization points that are hopefully close to the real value. The larger the value, the farther the nearest representable point can be from it.
You can learn more about the IEEE 754 single precision format here, here, and here, and you can even google more about it.
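You can reproduce a few rows of that table with numpy, since np.spacing gives the gap to the next representable float at a given magnitude (a small check of my own, not part of the original answer):

import numpy as np

# spacing between adjacent float32 values at several magnitudes
for v in [1, 1_000, 1_000_000, 1_000_000_000]:
    print(v, np.spacing(np.float32(v)))
# 1 1.1920929e-07
# 1000 6.1035156e-05
# 1000000 0.0625
# 1000000000 64.0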
Now back to TF-Lite
What does the conversion from TensorFlow to TF-Lite have to do with this property of float32? Consider the following situation:
sum_1 = a_1 + a_2 + a_3 + a_4 + ... + a_n
sum_2 = a_2 + a_1 + a_4 + a_3 + ... + a_n
i.e. sum_1 and sum_2 differ only in the order of summation. Will they be equal? Maybe, or maybe not! The same holds for other accumulative operations, e.g. multiplications, convolutions, etc. That's the key: in float32 computation, order matters! (This is similar to how calculations of the same model on CPU and GPU differ slightly.) I've stumbled upon this problem countless times when porting models between frameworks (Caffe, TensorFlow, Torch, etc.).
So, even if the implementation of any layer in TF-Lite differs only slightly from TensorFlow's, you will end up with an error around 1e-5, at most 1e-4. That is acceptable for single-precision floats, so don't be bothered by it.
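You can see the order-of-summation effect directly in numpy (my own quick demo, not from the original answer); the two sums below add exactly the same float32 values, just in a different order:

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(1_000_000).astype(np.float32)

s1 = np.sum(a)            # one summation order
s2 = np.sum(np.sort(a))   # a different order over the exact same values
print(s1, s2, abs(s1 - s2))   # usually a small nonzero difference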

Using np.expm1 to compute sigmoid function

When computing a sigmoid function, very small or very large values of x return 0 and 1 respectively due to limited floating-point precision. In numpy, the function np.expm1 computes exp(x) - 1 with better precision for extreme values of x. However, there is no equivalent function for computing exp(x) + 1 (the denominator of the sigmoid function). I could not figure out how to use np.expm1 to compute a sigmoid with increased precision at extreme values. Is there a way to do so?
1/(np.exp(-20)+1)==1.0
#False
1/(np.exp(-50)+1)==1.0
# True
np.expm1 mitigates loss of significance, which occurs when taking the difference of two almost equal numbers: many significant digits cancel each other, so the result has fewer significant digits than the data type can store.
1/(np.exp(-50)+1)==1.0
is a limitation of the data type, not the algorithm: floats cannot resolve a difference from 1.0 as small as exp(-50). Indeed, the nearest floats to the left and right of 1.0 are
>>> np.nextafter(1.0, 0.0)
0.9999999999999999
>>> np.nextafter(1.0, 2.0)
1.0000000000000002
indicating a resolution on the order of 10^-16, nowhere near fine enough to discriminate between 1 and 1 ± exp(-50).
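If what you ultimately need is the logarithm of the sigmoid (as in many loss computations), you can keep precision at extreme x with np.logaddexp, since log(sigmoid(x)) = -log(1 + exp(-x)) = -logaddexp(0, -x). A small sketch of this workaround (my addition, not part of the answer above):

import numpy as np

def log_sigmoid(x):
    # numerically stable log of the sigmoid
    return -np.logaddexp(0.0, -x)

print(log_sigmoid(-800.0))  # -800.0, whereas np.log(1/(np.exp(800)+1)) overflows to -inf
print(log_sigmoid(50.0))    # about -1.9e-22; the sigmoid itself would just round to 1.0 here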

Should RNN attention weights over variable length sequences be re-normalized to "mask" the effects of zero-padding?

To be clear, I am referring to "self-attention" of the type described in Hierarchical Attention Networks for Document Classification and implemented many places, for example: here. I am not referring to the seq2seq type of attention used in encoder-decoder models (i.e. Bahdanau), although my question might apply to that as well... I am just not as familiar with it.
Self-attention basically just computes a weighted average of RNN hidden states (a generalization of mean-pooling, i.e. un-weighted average). When there are variable length sequences in the same batch, they will typically be zero-padded to the length of the longest sequence in the batch (if using dynamic RNN). When the attention weights are computed for each sequence, the final step is a softmax, so the attention weights sum to 1.
However, in every attention implementation I have seen, there is no care taken to mask out, or otherwise cancel, the effects of the zero-padding on the attention weights. This seems wrong to me, but I fear maybe I am missing something since nobody else seems bothered by this.
For example, consider a sequence of length 2, zero-padded to length 5. Ultimately this leads to the attention weights being computed as the softmax of a similarly 0-padded vector, e.g.:
weights = softmax([0.1, 0.2, 0, 0, 0]) = [0.20, 0.23, 0.19, 0.19, 0.19]
and because exp(0)=1, the zero-padding in effect "waters down" the attention weights. This can be easily fixed, after the softmax operation, by multiplying the weights with a binary mask, i.e.
mask = [1, 1, 0, 0, 0]
and then re-normalizing the weights to sum to 1. Which would result in:
weights = [0.48, 0.52, 0, 0, 0]
When I do this, I almost always see a performance boost (in the accuracy of my models - I am doing document classification/regression). So why does nobody do this?
For a while I considered that maybe all that matters is the relative values of the attention weights (i.e., ratios), since the gradient doesn't pass through the zero-padding anyway. But then why would we use softmax at all, as opposed to just exp(.), if normalization doesn't matter? (plus, that wouldn't explain the performance boost...)
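For concreteness, here is a minimal numpy sketch of the mask-and-renormalize step described above (the numbers roughly match the example; this is my illustration, not code from the thread):

import numpy as np

scores = np.array([0.1, 0.2, 0.0, 0.0, 0.0])   # zero-padded attention scores
mask = np.array([1.0, 1.0, 0.0, 0.0, 0.0])     # 1 for real tokens, 0 for padding

weights = np.exp(scores) / np.exp(scores).sum()
print(np.round(weights, 2))   # ~[0.21 0.23 0.19 0.19 0.19], padding "waters down" the weights

masked = weights * mask
masked /= masked.sum()        # re-normalize to sum to 1
print(np.round(masked, 2))    # [0.48 0.52 0.   0.   0.  ]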
Great question! I believe your concern is valid, and zero attention scores for the padded encoder outputs do affect the attention. However, there are a few aspects that you have to keep in mind:
There are different score functions; the one in tf-rnn-attention uses a simple linear + tanh + linear transformation. But even this score function can learn to output negative scores: if you look at the code and imagine the inputs consist of zeros, the vector v is not necessarily zero due to the bias, and the dot product with u_omega can push it further down to low negative numbers (in other words, a plain simple NN with a non-linearity can make both positive and negative predictions). Low negative scores don't water down the high scores in softmax.
Due to the bucketing technique, the sequences within a bucket usually have roughly the same length, so it's unlikely that half of an input sequence is padded with zeros. Of course, this doesn't fix anything; it just means that in real applications the negative effect of the padding is naturally limited.
You mentioned it in the end, but I'd like to stress it too: the final attended output is the weighted sum of encoder outputs, i.e. relative values actually matter. Take your own example and compute the weighted sum in this case:
the first one is 0.2 * o1 + 0.23 * o2 (the rest is zero)
the second one is 0.48 * o1 + 0.52 * o2 (the rest is zero too)
Yes, the magnitude of the second vector is about twice as big, but that isn't a critical issue, because it then goes through a linear layer. And the relative attention on o2 is just 7% higher than it would have been with masking.
What this means is that even if the attention weights don't do a good job of learning to ignore zero outputs, the end effect on the output vector is still good enough for the decoder to take the right outputs into account, in this case to concentrate on o2.
I hope this convinces you that re-normalization isn't that critical, though it will probably speed up learning if actually applied.
The BERT implementation applies a padding mask when calculating the attention scores:
it adds 0 to the non-padding attention scores and -10000 to the padding attention scores, and e^-10000 is vanishingly small compared with the other attention score values.
attention_score = [0.1, 0.2, 0, 0, 0]
mask = [0, 0, -10000, -10000, -10000] # -10000 is a large negative value at the padded positions
attention_score += mask
weights = softmax(attention_score)
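The same pre-softmax masking can be written in a few lines of TensorFlow (a sketch of the idea described above, using -10000.0 as in the BERT trick; mine, not from the answer):

import tensorflow as tf

scores = tf.constant([[0.1, 0.2, 0.0, 0.0, 0.0]])  # [batch, time]
mask = tf.constant([[1.0, 1.0, 0.0, 0.0, 0.0]])    # 1 for real tokens, 0 for padding

masked_scores = scores + (1.0 - mask) * -10000.0   # push padded positions far down
weights = tf.nn.softmax(masked_scores, axis=-1)
print(weights.numpy())   # ~[[0.475 0.525 0. 0. 0.]] -- padding gets ~zero weight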

What is meaning of "parameter optimization of SVM by PSO"?

I can change the parameters C and epsilon manually to obtain an optimised result, but I found that there is such a thing as parameter optimization of an SVM by PSO (or another optimization algorithm). No concrete algorithm is given. What does it mean: how can PSO automatically optimize the SVM parameters? I read several papers on this topic, but I'm still not sure.
Particle Swarm Optimization is a technique that uses the ML parameters (SVM parameters, in your case) as its features.
Each "particle" in the swarm is characterized by those parameter values. For instance, you might have initial coordinates of
degree epsilon gamma C
p1 3 0.001 0.25 1.0
p2 3 0.003 0.20 0.9
p3 2 0.0003 0.30 1.2
p4 4 0.010 0.25 0.5
...
pn ...........................
The "fitness" of each particle (p1-p4 shown here out of a population of n particles) is measured by the accuracy of the resulting model: the PSO algorithm trains and tests a model for each particle, returning that model's error rate as the value analogous to that from the training loss function (which it how the value is computed).
On each iteration, particles move toward their fittest neighbours. The process repeats until a maximum (hopefully the global one) emerges as a convergence point. The procedure is loosely analogous to the familiar gradient-descent family of methods.
There are two basic PSO variants. In gbest (global best), every particle affects every other particle, sort of a universal gravitation principle. It converges quickly, but may well miss a global max in favor of a local max that happened to be nearer to the swarm's original center. In lbest (local best), a particle responds to only its k closest neighbors. This can form localized clusters; it converges more slowly, but is more likely to find the global max in a non-convex space.
I'll try to briefly explain enough to answer your clarification questions. If that doesn't work, I'm afraid you'll probably have to find someone to discuss this in front of a white board.
To use PSO, you have to decide which SVM parameters you'll try to optimize, and how many particles you want to use. PSO is a meta-algorithm, so its features are the SVM parameters. The PSO parameters are the population (how many particles you want to use), the update neighbourhood (lbest size and a distance function; gbest is the all-inclusive case), and the velocity (a learning rate for the SVM parameters).
For a bit of illustration, let's assume the particle table above, extended to a population of 20 particles. We'll use lbest with a neighbourhood of 4, and a velocity of 0.1. We choose (randomly, in a grid, or however we think might give us nice results) the initial values of degree, epsilon, gamma, and C for each of the 20 particles.
Each iteration of PSO works like this:
# Train the model described by each particle's "position"
for each of the 20 particles:
    train an SVM on the input data with that particle's parameters
    test the SVM; use the error rate as the PSO loss function value
# Update the particle positions
for each of the 20 particles:
    find the nearest 4 neighbours (using the PSO distance function)
    identify the neighbour with the lowest loss (SVM error rate)
    adjust this particle's features (degree, epsilon, gamma, C) 0.1 of the way toward that neighbour's features; 0.1 is our learning rate / velocity. (Yes, I realize that changing degree, a discrete value, won't happen cleanly without a special case in the update routine.)
Continue iterating through PSO until the particles have converged to your liking.
gbest is simply lbest with an infinite neighbourhood; in that case, you don't need a distance function on the particle space.
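To make the loop concrete, here is a minimal Python sketch of the lbest scheme described above, using scikit-learn's SVR (my own illustration, not code from the answer; degree is left out because it is discrete, and C, epsilon, gamma are searched in log10 space):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

n_particles, n_iters, velocity = 20, 10, 0.1
# each particle is a point (log10 C, log10 epsilon, log10 gamma)
positions = rng.uniform([-2.0, -4.0, -3.0], [2.0, 0.0, 0.0], size=(n_particles, 3))

def fitness(pos):
    C, eps, gamma = 10.0 ** pos
    model = SVR(C=C, epsilon=eps, gamma=gamma)
    # cross-validated error plays the role of the PSO loss value
    return -cross_val_score(model, X, y, cv=3,
                            scoring="neg_mean_squared_error").mean()

for _ in range(n_iters):
    errors = np.array([fitness(p) for p in positions])
    for i in range(n_particles):
        # lbest update: move 0.1 of the way toward the best of the 4 nearest neighbours
        dists = np.linalg.norm(positions - positions[i], axis=1)
        neighbours = np.argsort(dists)[1:5]
        best = neighbours[np.argmin(errors[neighbours])]
        positions[i] += velocity * (positions[best] - positions[i])

best_pos = positions[np.argmin([fitness(p) for p in positions])]
print("best (C, epsilon, gamma):", 10.0 ** best_pos)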