Saving float and double precision in C++ during cin/cout and scanf/printf

Saving float and double precision in C++ during cin/cout and scanf/printf - printf

I want to read floats and doubles from standart input and save its precision (exact the same digits after floating point) and be able to output (cout/printf) as it is. What the most convinient (and simplest way) to do this?
Thanks!

float f;
cin >> f;
cout << f;

Use setprecision.
Here is the solution
cout<<setprecision(the precision you want to set here)<<variablename;
eg. If you want to set precision of the output to 5 for variable var use it like this:
cout<<setprecision(5)<<var;
setprecision is a manipulator. Learn more about manipulators here.
It sets the decimal precision to be used to
format floating-point values on output operations.
Behaves as if member precision were called with n as argument on the
stream on which it is inserted/extracted as a manipulator (it can be
inserted/extracted on input streams or output streams).
This is a manipulator and is declared in header <iomanip>

Since the input has an unknown precision, the simplest method is to read them as strings, not doubles/floats.
If you need the float value, a simple string to double conversion is needed.
Any other method will probably fail since you rely on imperfect conversion from string to float done by the standard library.
The latter can't distinguish between 0.4 and 0.40.

Related

How to handle precision problems of floating point numbers?

I am using Firebird 3.0.4 (both in Windows and Linux) and I have the following procedure that clearly demonstrates my problem with floating point numbers, and that also demonstrates a possible workaround:
create or alter procedure test_float returns (res double precision,
res1 double precision,
res2 double precision)
as
declare variable z1 double precision;
declare variable z2 double precision;
declare variable z3 double precision;
begin
z1=15;
z2=1.1;
z3=0.49;
res=z1*z2*z3; /* one expects res to be 8.085, but internally, inside the procedure
it is represented as 8.084999999999.
The procedure-internal representation is repaired when then
res is sent to the output of the procedure, but the procedure-internal
representation (which is worng) impacts the further calculations */
res1=round(res, 2);
res2=round(round(res, 8), 2);
suspend;
end
On can see the result of the procedure with:
select proc.res, proc.res1, proc.res2
from test_float proc
The result is
RES RES1 RES2
8,085 8,08 8,09
But one can expect that RES2 should be 8.09.
One can clearly see that the internal representation of the res contains 8.0849999 (e.g. one can assign res to the exception message and then raise this exception), it is repaired during output but it leads to the failed calculations when such variable is used in the further calculations.
RES2 demonstrates the repair: I can always apply ROUND(..., 8) to repair the internal representation. I am ready to go with this solution, but my question is - is it acceptable workaround (when the outer ROUND is with strictly less than 5 decimal places) or is there better workaround.
All my tests pass with this workaround, but the feeling is bad.
Of course, I know the minimum that every programmer should know about floats (there is article about that) and I know that one should not use double for business calculations.

This is an inherent problem with calculating with floating point numbers, and is not specific to Firebird. The problem is that the calculation of 15 * 1.1 * 0.49 using double precision numbers is not exactly 8.085. In fact, if you would do 8.085 - RES, you'd get a value that is (approximately) 1.776356839400251e-015 (although likely your client will just present it as 0.00000000).
You would get similar results in different languages. For example, in Java
DecimalFormat df = new DecimalFormat("#.00");
df.format(15 * 1.1 * 0.49);
will also produce 8.08 for exactly the same reason.
Also, if you would change the order of operations, you would get a different result. For example using 15 * 0.49 * 1.1 would produce 8.085 and round to 8.09, so the actual results would match your expectations.
Given round itself also returns a double precision, this isn't really a good way to handle this in your SQL code, because the rounded value with a higher number of decimals might still yield a value slightly less than what you'd expect because of how floating point numbers work, so the double round may still fail for some numbers even if the presentation in your client 'looks' correct.
If you purely want this for presentation purposes, it might be better to do this in your frontend, but alternatively you could try tricks like adding a small value and casting to decimal, for example something like:
cast(RES + 1e-10 as decimal(18,2))
However this still has rounding issues, because it is impossible to distinguish between values that genuinely are 8.08499999999 (and should be rounded down to 8.08), and values where the result of calculation just happens to be 8.08499999999 in floating point, while it would be 8.085 in exact numerics (and therefor need to be rounded up to 8.09).
In a similar vein, you could try to use double casting to decimal (eg cast(cast(res as decimal(18,3)) as decimal(18,2))), or casting the decimal and then rounding (eg round(cast(res as decimal(18,3)), 2). This would be a bit more consistent than double rounding because the first cast will convert to exact numerics, but again this has similar downside as mentioned above.
Although you don't want to hear this answer, if you want exact numeric semantics, you shouldn't be using floating point types.

Kotlin - Converting Float to Double while maintaining precision

In Kotlin 123.456 is a valid Double value, however, 123.456F.toDouble() results in 123.45600128173828 - presumably just the way precision is handled between the two.
I want to be able to convert freely between the two, specifically for cases like this:
123.456F -> 123.456 // Float to Double
123.456 -> 123.456F // Double to Float
How can I convert a float to a double in cases like this, and maintain precision?

It's a big ugly, but you could convert your Float to a String and back out to a Double:
val myDouble: Double = 123.456f.toString().toDouble()
// 123.456d
You could always encapsulate this in an extension function:
fun Float.toExactDouble(): Double =
this.toString().toDouble()
val myDouble = 123.456f.toExactDouble()

In Kotlin 123.456 is a valid Double value
Actually, that's not quite true. There's a Double value very close to 123.456, but it's not exactly 123.456. What you're seeing is the consequences of that.
So you can't maintain precision, because you don't have that precision to start with!
Short answer:
If you need exact values, don't use floating-point!
(In particular: Never store money values in floating-point! See for example this question.)
The best alternative is usually BigDecimal which can store and calculate decimal fractions to an arbitrary precision. They're less efficient, but Kotlin's operator overloading makes them painless to use (unlike Java!).
Or if you're not going to be doing any calculations, you could store them as Strings.
Or if you'll only need a certain number of decimal places, you could scale them all up to Ints (or Longs).
Technical explanation:
Floats and Doubles use binary floating-point; they store an integer, and an integer power of 2 to multiple or divide it by. (For example, 3/4 would be stored as 3*2⁻².) This means they can store a wide range of binary fractions exactly.
However, just as you can't store 1/3 as a decimal fraction (it's 0.3333333333…, but any finite number of digits will only be an approximation), so you can't store 1/10 as a binary fraction (it's 0.000110011001100…). This means that a binary floating-point number can't store most decimal numbers exactly.
Instead, they store the nearest possible value to the number you want. And the routines which convert them to a String will try to undo that difference, by rounding appropriately. But that doesn't always give the result you expect.
Floating-point numbers are great when you need a huge range of values (e.g. in scientific and technical use), but don't care about storing them exactly.

Why BigFloat.to_s is not precise enough?

I am not sure if this is a bug. But I've been playing with big and I cant understand why this code works this way:
https://carc.in/#/r/2w96
Code
require "big"
x = BigInt.new(1<<30) * (1<<30) * (1<<30)
puts "BigInt: #{x}"
x = BigFloat.new(1<<30) * (1<<30) * (1<<30)
puts "BigFloat: #{x}"
puts "BigInt from BigFloat: #{x.to_big_i}"
Output
BigInt: 1237940039285380274899124224
BigFloat: 1237940039285380274900000000
BigInt from BigFloat: 1237940039285380274899124224
First I though that BigFloat requires to change BigFloat.default_precision to work with bigger number. But from this code it looks like it only matters when trying to output #to_s value.
Same with precision of BigFloat set to 1024 (https://carc.in/#/r/2w98):
Output
BigInt: 1237940039285380274899124224
BigFloat: 1237940039285380274899124224
BigInt from BigFloat: 1237940039285380274899124224
BigFloat.to_s uses LibGMP.mpf_get_str(nil, out expptr, 10, 0, self). Where GMP is saying:
mpf_get_str (char *str, mp_exp_t *expptr, int base, size_t n_digits, const mpf_t op)
Convert op to a string of digits in base base. The base argument may vary from 2 to 62 or from -2 to -36. Up to n_digits digits will be generated. Trailing zeros are not returned. No more digits than can be accurately represented by op are ever generated. If n_digits is 0 then that accurate maximum number of digits are generated.
Thanks.

In GMP (it applies to all languages not just Crystal), integers (C mpz_t, Crystal BigInt) and floats (C mpf_t, Crystal BigFloat) have separate default precision.
Also, note that using an explicit precision is better than setting a default one, because the default precision might not be reentrant (it depends on a configure-time switch). Also, if someone reads only a part of your code, they may skip the part with setting the default precision and assume a wrong one. Although I do not know the Crystal binding well, I assume that such functionality is exposed somewhere.
The zero parameter passed to mpf_get_str means to guess the value from the precision. I know the number of significant digits is proportional and close to precision / log2(10). Floating point numbers have finite precision. In that case, it was not the mpf_get_str call which made the last digits zero - it was the internal representation that did not keep such data. It looks like your (default) precision is too small to store all the necessary digits.
To summarize, there are two solutions:
Set a global default precision. Although this approach will work, it will require to either change the default precision frequently, or use one in the whole program. Both ways, the approach with the default precision is a form of procrastination which is going to have its vengeance later.
Set a precision on variable basis. This is a better solution than the former. Although it requires more code (1-2 more lines per variable initialization), it is going to pay back later. For example, in a space object tracking system, the physics calculations have to be super-precise, but other systems could use lower precision numbers for speed and memory saving.
I am still unsure what made the conversion BigFloat --> BigInt yield the missing digits.

In what cases do we need functions for both double, float and long double?

In the math-headers we see
extern float fabsf(float);
extern double fabs(double);
extern long double fabsl(long double);
...
extern float fmodf(float, float);
extern double fmod(double, double);
extern long double fmodl(long double, long double);
Why is there one function for each type?
Isn't this a lot of duplicate code? If I where to say write a lerp-function or a clamp-function would I need to write one for each type?
Seems like we will have duplicate code where there's only one thing changing – the type.
extern float clampf(float value, float min, float max)
{
if(value > max)
return max;
if(value < min)
return min;
return value;
}
extern double clamp(double value, double min, double max)
{
if(value > max)
return max;
if(value < min)
return min;
return value;
}
Question 1: What is the historical reason for this structure?
Question 2: Should I follow the same pattern? Or should I only implement the double-kind since it is the one which is most common?
Question 3: Or should I just use macro's to overcome the type-issue altogether?

Historically (circa C89 and before), the math library contained only the double-precision versions of these functions, which is why those versions have no suffix. If you needed to compute the sine of a float, you either wrote your own implementation, or (more likely!) you simply wrote:
float x;
float y = sin(x);
However, this introduces some overhead on modern architectures. Specifically, on the most common architectures today, it is necessary for the compiler to emit code that looks something like this:
convert x to double
call sin
convert result to float
These conversions are pretty fast (about the same as an addition, usually), but they still have some cost. On top of the cost of conversion, sin needs to deliver a result that has ~53 bits of precision, more than half of which are completely wasted if the result is just going to be converted back to single precision. Between these two factors, it is possible for a dedicated single-precision sin routine to be about twice as fast; that’s a significant win for some very frequently-used library functions!
If we look at functions like fabs (and assume that the compiler does not simply inline and lower them), the situation is much, much worse. fabs, on a typical modern architecture, is a simple bitwise-and operation. So the two conversions bracketing the call (if all you have is double) are significantly more expensive than the operation itself, and can easily cause a 5x slowdown. That’s why multiple versions of these functions were added to support each FP type.
If you don’t want to keep track of all of them, you can #include <tgmath.h>, which will infer the correct function to use based on the type of the argument (meaning
sin((float)x)
will generate a call to sinf(x), whereas
sin((long double)x)
will call sinl(x)).
In your own code, you usually know a priori what the type of your arguments is, and only need to support one or maybe two types. clamp and lerp in particular are graphics operations, and almost universally are used only in single-precision variants.
Incidentally, the fact that you’re using clamp and lerp is a pretty good indication that you might want to look at writing your code in OpenCL instead of C/Obj-C; the OpenCL math library implements these operations (and many other similar operations) for you, and provides implementations that work with a wide range of basic types, including vectors.

float and double are different data types, same as int and long int. You can use the functions which operate on double on float values and implicit conversion will happen to make it work as expected in most circumstances, but if you use functions which operate on float on double values, you will almost inevitably lose precision.
There are other longer explanations available, e.g. What's the difference between a single precision and double precision floating point operation? .

Objective C float is not showing all decimals

I'm passing a float to a method but it's not showing all decimals. I have no idea why this is happening.
Here's an example:
[[LocationApiCliente sharedInstance] nearPlacesUsingLatitude:-58.3645248331830402 andLongitude:-34.6030467894227982];
Then:
- (BOOL)nearPlacesUsingLatitude:(double)latitude andLongitude:(double) longitude {
NSString *urlWithCoords = [NSString stringWithFormat:#"%#&lat=%f&long=%f", CountriesPath, latitude, longitude];
Printing urlWithCoords will result in:
format=json&lat=-58.364525&long=-34.603047
More of this. What I'm getting from the output terminal:
(lldb) p -3.13419834918349f
(float) $4 = -3.1342
(lldb) p -3.13419834918349
(double) $5 = -3.1342
Any ideas?

Change the %fs in your formatting strings to specify the desired number of decimals, e.g., %.16f.
Note that the number of decimals shown does not guarantee that they are correct, but at least they won't be truncated.
Overall the problem is that floating point numbers do not contain information about the precision, and cannot precisely represent some decimal values, so formatting can not in the general case “autodetect” the number of decimals. So you just need to override the default by specifying the desired number and accept that it's not representative of location accuracy. But since you seem to be passing the floats to another program via the URL, this shouldn't be a problem—a larger number of decimals is better.

It looks like CoreLocation uses doubles to represent degrees, so I'd be surprised if there's any more geographic precision to be found on the device.
But, in general, if you want to represent higher precision than double, you can use long double in Objective-C like this...
long double myPi = 3.141592653589793;
NSLog(#"%16.16Lf", myPi);

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas