I have a TensorFlow model that I converted to TensorFlow Lite, but there is a deviation in inference accuracy. Is this normal behaviour?
I found that the inference outputs of the two models differ after the fourth decimal place.
While training in TensorFlow,
all the variables and constants may be stored as dtype=float64, which keeps many more significant digits.
Since they are training variables, their values are not constant.
After converting to TensorFlow Lite,
the training variables are converted to Constant operations, so their values are fixed.
When we run the lite model on Android or iOS, these values are stored as float32.
Hence some precision is lost in TensorFlow Lite.
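As a quick numpy sketch of what that float64 to float32 step costs (just an illustration of the dtype change, not of the converter itself):

import numpy as np

w64 = np.float64(0.123456789123456789)    # a "weight" with many decimal places
w32 = np.float32(w64)                     # what remains after the float32 cast

print(w64)                                # 0.12345678912345678
print(w32)                                # 0.12345679
print(abs(w64 - np.float64(w32)))         # on the order of 1e-9 per value

Each individual value only moves by something like 1e-9 here, but those small shifts accumulate across many multiply-adds, which is consistent with outputs that differ around the fourth or fifth decimal place.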
On float32 precision
float32 is the default value type used in TensorFlow. Let's talk a bit about the float32 type and the importance of the order of operations. There is a neat table from this post that shows how the precision of a float changes as its magnitude increases:
Float Value        Float Precision
1                  1.19E-07
10                 9.54E-07
100                7.63E-06
1,000              6.10E-05
10,000             0.000977
100,000            0.00781
1,000,000          0.0625
10,000,000         1
100,000,000        8
1,000,000,000      64
What does it say? In float32, you cannot expect exact values; you only have discretization points that are hopefully close to the real value. The larger the value, the farther the nearest representable float can be from it.
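If you want to check the numbers in that table yourself, np.spacing gives the gap between a float32 value and the next representable one (a small sketch, assuming numpy is available):

import numpy as np

for v in [1, 10, 100, 1000, 10000, 100000, 1000000]:
    print(v, np.spacing(np.float32(v)))

# 1        1.1920929e-07
# 10       9.536743e-07
# 100      7.6293945e-06
# 1000     6.1035156e-05
# 10000    0.0009765625
# 100000   0.0078125
# 1000000  0.0625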
You can learn more about the IEEE 754 single precision format here, here, and here, and you can even google more about it.
Now back to TF-Lite
What does the conversion from TensorFlow to TF-Lite have to do with the above property of float32? Consider the following situation:
sum_1 = a_1 + a_2 + a_3 + a_4 + ... + a_n
sum_2 = a_2 + a_1 + a_4 + a_3 + ... + a_n
i.e. sum_1 and sum_2 differ only in the order of summation. Will they be equal? Maybe, or maybe not! The same holds for other accumulative operations, e.g. multiplications, convolutions, etc. That's the key: in float32 calculation, order matters! (This is similar to the issue where calculations of the same model on CPU and GPU differ slightly.) I've stumbled upon this problem countless times when porting between frameworks (Caffe, TensorFlow, Torch, etc.).
So if the implementation of any layer in TF-Lite differs even slightly from TensorFlow, you will end up with an error on the order of 1e-5, at most 1e-4. That is acceptable for single-precision floats, so don't be bothered by it.
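Here is a tiny numpy sketch of that order dependence (nothing TF-Lite specific, just the float32 effect):

import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(100000).astype(np.float32)

sum_1 = np.sum(a)            # numpy's (pairwise) summation order
sum_2 = np.float32(0.0)
for x in a:                  # naive left-to-right accumulation
    sum_2 += x

print(sum_1, sum_2)          # same numbers, slightly different totals
print(sum_1 == sum_2)        # very likely False

The same numbers summed in a different order give a slightly different float32 total, and a neural network layer is essentially a long chain of such accumulations.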
Related
In this paper (Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference) published by Google, the quantization scheme for a matrix multiplication is described as follows:

q3[i,k] = Z3 + M * sum_j (q1[i,j] - Z1) * (q2[j,k] - Z2)

where

M = S1 * S2 / S3

and S1, S2 and S3 are the scales of the two inputs and the output, respectively (Z1, Z2 and Z3 are the corresponding zero points).
Both S1 (and zero point Z1) and S2 (and zero point Z2) can be determined easily, whether "offline" or "online". But what about S3 (and zero point Z3)? These parameters depend on the "actual" output scale (i.e., of the float values without quantization), but the output scale is unknown before the output is computed.
According to the TensorFlow documentation:
At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.
But the code below says something different:
tensor_utils::BatchQuantizeFloats(
    input_ptr, batch_size, input_size, quant_data, scaling_factors_ptr,
    input_offset_ptr, params->asymmetric_quantize_inputs);
for (int b = 0; b < batch_size; ++b) {
  // Incorporate scaling of the filter.
  scaling_factors_ptr[b] *= filter->params.scale;
}

// Compute output += weight * quantized_input
int32_t* scratch = GetTensorData<int32_t>(accum_scratch);
tensor_utils::MatrixBatchVectorMultiplyAccumulate(
    filter_data, num_units, input_size, quant_data, scaling_factors_ptr,
    batch_size, GetTensorData<float>(output), /*per_channel_scale=*/nullptr,
    input_offset_ptr, scratch, row_sums_ptr, &data->compute_row_sums,
    CpuBackendContext::GetFromContext(context));
Here we can see:
scaling_factors_ptr[b] *= filter->params.scale;
I think this means:
S1 * S2 is computed.
The weights are still integers; only the final results are floats.
It seems S3 and Z3 don't have to be computed. But if so, how can the final float results be close to the unquantized results?
This inconsistency between the paper, the documentation and the code makes me very confused. I can't tell what I'm missing. Can anyone help me?
Let me answer my own question. All of a sudden I saw what I had missed while riding my bicycle. The code in the question above is from the function tflite::ops::builtin::fully_connected::EvalHybrid(). Here the name explains everything! The value in the output of the matrix multiplication is denoted as r3 in section 2.2 of the paper. In terms of equation (2) in section 2.2, we have:

r3[i,k] = S3 * (q3[i,k] - Z3)
If we want to get the float result of the matrix multiplication, we can use equation (4) in section 2.2 and then convert the result back to floats, OR we can use equation (3) with the left-hand side replaced by r3, as in:

r3[i,k] = sum_j S1 * (q1[i,j] - Z1) * S2 * (q2[j,k] - Z2)
If we choose all the zero points to be 0, then the formula above becomes:

r3[i,k] = S1 * S2 * sum_j q1[i,j] * q2[j,k]
And this is just what EvalHybrid() does (ignoring the bias for the moment). It turns out the paper gives an outline of the quantization algorithm, while the implementation uses different variants.
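To make the hybrid idea concrete, here is a rough numpy sketch of that last formula (an illustration assuming per-tensor symmetric quantization with all zero points set to 0, not the actual TF-Lite kernel):

import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 64)).astype(np.float32)     # float input (activation)
w = rng.standard_normal((32, 64)).astype(np.float32)    # float weights (filter)

s1 = np.abs(x).max() / 127.0                             # input scale, computed at run time
s2 = np.abs(w).max() / 127.0                             # filter scale, known offline
q1 = np.round(x / s1).astype(np.int8)                    # quantized input, Z1 = 0
q2 = np.round(w / s2).astype(np.int8)                    # quantized weights, Z2 = 0

acc = q1.astype(np.int32) @ q2.astype(np.int32).T        # integer accumulation
r3 = (s1 * s2) * acc.astype(np.float32)                  # rescale by S1 * S2; no S3, no Z3

print(np.max(np.abs(r3 - x @ w.T)))                      # small quantization error

The output stays in float, so S3 and Z3 are simply never needed; that is the difference between the hybrid kernels and the fully integer path described in the paper.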
When computing a sigmoid function, small or large values of x will return 0 and 1 respectively due to lack of floating-point precision. In numpy, the function np.expm1 computes exp(x)-1 with better precision for extreme values of x. However, an equivalent function for computing exp(x)+1 (the denominator of the sigmoid function) does not exist. I could not figure out how to use np.expm1 to compute a sigmoid with increased precision at extreme values. Is there a way to do so?
1/(np.exp(-20)+1) == 1.0
# False
1/(np.exp(-50)+1) == 1.0
# True
np.expm1 mitigates the loss of significance which occurs when taking the difference between two almost equal numbers (because many significant digits cancel each other, the result has fewer significant digits than the data type could store).
That
1/(np.exp(-50)+1) == 1.0
is a limitation of the data type, not the algorithm: floats cannot resolve a difference from 1.0 as small as exp(-50). Indeed, the nearest floats to the left and right of 1.0 are
>>> np.nextafter(1.0, 0.0)
0.9999999999999999
>>> np.nextafter(1.0, 2.0)
1.0000000000000002
indicating a resolution on the order of 10^-16, nowhere near fine enough to discriminate between 1 and 1 ± exp(-50).
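You can make that comparison explicit with np.spacing (a tiny check, nothing more):

import numpy as np

print(np.spacing(1.0))              # ~2.22e-16, the gap to the next float64 above 1.0
print(np.exp(-50.0))                # ~1.93e-22, far below that gap
print(1.0 + np.exp(-50.0) == 1.0)   # True: the term is simply absorbed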
I'm trying to run a hyperparameter optimization script for a convolutional neural network using TensorFlow.
As you may know, TF's handling of GPU memory isn't that fancy (and I don't think it ever will be, thanks to the TPU). So my question is: how do I know how to choose the filter dimensions and the batch size so that the GPU memory doesn't get exhausted?
Here's the equation that I'm thinking of:
image_shape = 128x128x3 (3 color channels)
batchSize = 20 (the smallest possible batch size, since I have 20 classes)
filter_shape = fw_fh_fd [filter_width=4, filter_height=4, filter_depth=32]
As far as I understood, using the tf.conv2d function will need the following amount of memory:
image_width * image_height * number_of_channels * batchSize * filter_height * filter_width * filter_depth * 32 bit
since we're using the tf.float32 type for each pixel.
In the given example, the needed memory will be:
128 x 128 x 3 x 20 x 4 x 4 x 32 x 32 = 16106127360 (bits), which is almost 16 GB of memory.
I'm not sure the formula is correct, so I hope to get a validation or a correction of what I'm missing.
Actually, this will take only about 44MB of memory, mostly taken by the output.
Your input is 20x128x128x3
The convolution kernel is 4x4x3x32
The output is 20x128x128x32
When you sum up the total, you get
(20*128*128*3 + 4*4*3*32 + 20*128*128*32) * 4 / 1024**2 ≈ 44MB
(In the above, 4 is for the size in bytes of float32 and 1024**2 is to get the result in MB).
Your batch size can be smaller than your number of classes. Think about ImageNet and its 1000 classes: people are training with batch sizes 10 times smaller.
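For reference, the same back-of-the-envelope calculation in Python:

input_tensor = 20 * 128 * 128 * 3     # batch x height x width x channels
kernel       = 4 * 4 * 3 * 32         # filter_h x filter_w x in_channels x out_channels
output       = 20 * 128 * 128 * 32    # batch x height x width x out_channels ('SAME' padding)

total_bytes = (input_tensor + kernel + output) * 4   # 4 bytes per float32
print(total_bytes / 1024**2)                         # ~43.8 MB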
EDIT
Here is a tensorboard screenshot of the net — it reports 40MB rather than 44MB, probably because it excludes the input — and you also have all the tensor sizes I mentioned earlier.
Hello fellow tensorflowians!
I have the following schema:
I input some continuous variables (actually, word embeddings I took from Google word2vec), and I am trying to predict an output that can be considered either continuous or discrete (sorry, mathematicians! it actually depends on one's training goal).
The output takes values from 0 to 1000 with an interval of 0.25 (or a precision hyperparameter), so: 0, 0.25, 0.50, ..., 1000.0.
I know that it is not possible to include something like tf.to_int (I can omit fractions if necessary) or tf.round, because these are not differentiable, so we can't backpropagate. However, I feel that there is some solution that allows the network to "know" that it is searching for a rounded solution: small fractions of integers like 0.25 or 5.75, but I don't even know where to look. I looked up quantization, but that seems to be a bit of overkill.
So my question is:
How do I inform the graph that we don't accept values below 0.0? Would taking the abs of the network output "logits" (regression predictions) be worth considering? If not, can I modify the loss term to severely punish scores below 0 and use absolute error instead of squared error? I may not be aware of the full consequences of doing that.
I don't care whether a prediction of 4.5 comes out as 4.49999 or 4.4, because I round predictions to the nearest .25 to get accuracy, and that's my final model evaluation metric. If so, can I use something like the following?
precision = 0.01  # so that sqrt(precision) == 0.1
loss = tf.reduce_mean(tf.maximum(0.0, tf.square(tf.subtract(logits, targets)) - precision))
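Here is a runnable sketch of that idea in current TensorFlow (the function name and the negativity penalty are mine, just to illustrate the two points above, not an endorsed solution):

import tensorflow as tf

def banded_loss(logits, targets, precision=0.01, neg_penalty=10.0):
    squared = tf.square(logits - targets)
    band = tf.maximum(0.0, squared - precision)            # no cost inside the tolerance band
    negativity = neg_penalty * tf.maximum(0.0, -logits)    # punish predictions below 0
    return tf.reduce_mean(band + negativity)

logits = tf.constant([4.49, -0.30, 10.00])
targets = tf.constant([4.50, 0.00, 10.25])
print(banded_loss(logits, targets).numpy())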
In order to reduce the tensor size, I defined all the variables with dtype=tf.float16 in my model, and then defined the optimizer:
optimizer = tf.train.AdamOptimizer(self.learning_rate)
self.compute_gradients = optimizer.compute_gradients(self.mean_loss_reg)
train_adam_op = optimizer.apply_gradients(self.compute_gradients, global_step=self.global_step)
Everything works OK! But after I run train_adam_op, the gradients and variables are nan in Python. I wonder whether the apply_gradients() API supports the tf.float16 type? Why do I get nan after apply_gradients() is called by session.run()?
The dynamic range of fp16 is fairly limited compared to that of 32-bit floats. As a result, it's pretty easy to overflow or underflow them, which often results in the NaN that you've encountered.
You can insert a few check_numerics operations in your model to help pinpoint the specific operation(s) that becomes unstable when performed on fp16.
For example, you can wrap an L2 loss operation as follows to check that its result fits in an fp16:
A = tf.nn.l2_loss(some_tensor)
becomes
A = tf.check_numerics(tf.nn.l2_loss(some_tensor), "found the root cause")
The most common sources of overflow and underflow are exp(), log(), and the various classification primitives, so I would start looking there.
Once you've figured out which sequence of operations is problematic, you can update your model to perform that sequence using 32-bit floats: use tf.cast() to convert the inputs of the sequence to 32-bit floats, and cast the result back to fp16.
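As a sketch of that last step (the block boundary and the choice of softmax cross-entropy as the sensitive op are just for illustration), the pattern looks like this:

import tensorflow as tf

def stable_block(logits_fp16, labels_fp32):
    # Promote the numerically sensitive part to float32 ...
    logits_fp32 = tf.cast(logits_fp16, tf.float32)
    loss_fp32 = tf.nn.softmax_cross_entropy_with_logits(
        labels=labels_fp32, logits=logits_fp32)
    # ... and cast the result back down so the rest of the model stays fp16.
    return tf.cast(loss_fp32, tf.float16)

logits = tf.constant([[12.0, -9.0, 0.5]], dtype=tf.float16)
labels = tf.constant([[1.0, 0.0, 0.0]], dtype=tf.float32)
print(stable_block(logits, labels))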