In Vulkan a format such as VK_FORMAT_R8_UNORM maps a single-precision float in the range [0.0f,1.0f] to an 8-bit unsigned integer. Is the formula for the float->uint8_t direction exactly:
uint8_t unorm(float x) {
    return roundf(x*255.0f);
}
where roundf is the standard C function? Or is it something else? Or is it implementation-defined?
(Note that the above would give half as many values to uint8_t(0) and uint8_t(255) as it would to the other values uint8_t(1), uint8_t(2) through uint8_t(254).)
From 3.9.2. Conversion from Floating-Point to Normalized Fixed-Point
The conversion from a floating-point value f to the corresponding unsigned normalized fixed-point value c is defined by first clamping f to the range [0,1], then computing
c = convertFloatToUint(f × (2^b - 1), b)
where convertFloatToUint(r,b) returns one of the two unsigned binary integer values with exactly b bits which are closest to the floating-point value r. Implementations should round to nearest. If r is equal to an integer, then that integer value must be returned. In particular, if f is equal to 0.0 or 1.0, then c must be assigned 0 or 2^b - 1, respectively.
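As a rough illustration of the quoted rule, here is a minimal Python sketch. The function name `float_to_unorm` is made up for this example, and Python floats are doubles rather than single precision, so treat it as a model of the arithmetic and not as what any particular driver does. It rounds to nearest with ties going up, which is one of the two results the quoted text permits for an exact tie.

```python
import math

def float_to_unorm(f: float, bits: int = 8) -> int:
    """Convert f to a `bits`-wide unsigned normalized integer code."""
    max_code = (1 << bits) - 1        # 2^b - 1, e.g. 255 for b = 8
    clamped = min(max(f, 0.0), 1.0)   # clamp to [0, 1] first
    # Round to nearest; an exact tie goes up here, which is only one of
    # the two values the quoted text allows in that case.
    return int(math.floor(clamped * max_code + 0.5))

print(float_to_unorm(0.0), float_to_unorm(0.5), float_to_unorm(1.0))  # 0 128 255
```

With round-to-nearest, the codes 0 and 255 each cover an input interval half as wide as the other codes, which matches the observation in the question.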
I need to know the maximum values of the float64 and complex128 types in Go. Go doesn't seem to have an equivalent of float.h, and I don't know how to calculate them.
For example,
package main

import (
	"fmt"
	"math"
)

func main() {
	const f = math.MaxFloat64
	fmt.Printf("%[1]T %[1]v\n", f)
	const c = complex(math.MaxFloat64, math.MaxFloat64)
	fmt.Printf("%[1]T %[1]v\n", c)
}
float64 1.7976931348623157e+308
complex128 (1.7976931348623157e+308+1.7976931348623157e+308i)
Package math
import "math"
Floating-point limit values. Max is the largest finite value
representable by the type. SmallestNonzero is the smallest positive,
non-zero value representable by the type.
const (
	MaxFloat32             = 3.40282346638528859811704183484516925440e+38  // 2**127 * (2**24 - 1) / 2**23
	SmallestNonzeroFloat32 = 1.401298464324817070923729583289916131280e-45 // 1 / 2**(127 - 1 + 23)
	MaxFloat64             = 1.797693134862315708145274237317043567981e+308 // 2**1023 * (2**53 - 1) / 2**52
	SmallestNonzeroFloat64 = 4.940656458412465441765687928682213723651e-324 // 1 / 2**(1023 - 1 + 52)
)
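The formulas in the comments can be checked by hand. The sketch below does that in Python, only because it has arbitrary-precision integers built in; the variable names and the `fits_float32` helper are invented for this example. Python's `float` is an IEEE-754 64-bit value, so it should agree with Go's float64 constants.

```python
import math
import struct
import sys

# Largest finite float64: 2**1023 * (2**53 - 1) / 2**52, in exact integers.
max_float64 = float((2**53 - 1) * 2**(1023 - 52))
# Smallest positive float64: 1 / 2**(1023 - 1 + 52) = 2**-1074.
smallest_float64 = math.ldexp(1.0, -(1023 - 1 + 52))
# Largest finite float32: 2**127 * (2**24 - 1) / 2**23.
max_float32 = float((2**24 - 1) * 2**(127 - 23))

def fits_float32(x: float) -> bool:
    """True if x survives a round trip through a 32-bit float unchanged."""
    return struct.unpack("f", struct.pack("f", x))[0] == x

print(max_float64)                 # 1.7976931348623157e+308
print(smallest_float64)            # 5e-324
print(fits_float32(max_float32))   # True
```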
The Go Programming Language Specification
Numeric types
A numeric type represents sets of integer or floating-point values.
The predeclared architecture-independent numeric types are:
uint8 the set of all unsigned 8-bit integers (0 to 255)
uint16 the set of all unsigned 16-bit integers (0 to 65535)
uint32 the set of all unsigned 32-bit integers (0 to 4294967295)
uint64 the set of all unsigned 64-bit integers (0 to 18446744073709551615)
int8 the set of all signed 8-bit integers (-128 to 127)
int16 the set of all signed 16-bit integers (-32768 to 32767)
int32 the set of all signed 32-bit integers (-2147483648 to 2147483647)
int64 the set of all signed 64-bit integers (-9223372036854775808 to 9223372036854775807)
float32 the set of all IEEE-754 32-bit floating-point numbers
float64 the set of all IEEE-754 64-bit floating-point numbers
complex64 the set of all complex numbers with float32 real and imaginary parts
complex128 the set of all complex numbers with float64 real and imaginary parts
byte alias for uint8
rune alias for int32
The value of an n-bit integer is n bits wide and represented using
two's complement arithmetic.
There is also a set of predeclared numeric types with
implementation-specific sizes:
uint either 32 or 64 bits
int same size as uint
uintptr an unsigned integer large enough to store the uninterpreted bits of a pointer value
To avoid portability issues all numeric types are distinct except
byte, which is an alias for uint8, and rune, which is an alias for
int32. Conversions are required when different numeric types are mixed
in an expression or assignment. For instance, int32 and int are not
the same type even though they may have the same size on a particular
architecture.
You can also consider using the Inf function from the math package, which
returns positive or negative infinity (depending on the sign of its argument) as a float64.
I'm not sure whether there is an argument for choosing one over the other between math.MaxFloat64 and math.Inf(). Comparing the two, I found that Go treats positive infinity as larger than math.MaxFloat64 (and negative infinity as smaller than -math.MaxFloat64).
package main

import (
	"fmt"
	"math"
)

func main() {
	infPos := math.Inf(1) // gives positive infinity
	fmt.Printf("%[1]T %[1]v\n", infPos)
	infNeg := math.Inf(-1) // gives negative infinity
	fmt.Printf("%[1]T %[1]v\n", infNeg)
}
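The ordering between infinity and the largest finite value is an IEEE-754 property rather than anything Go-specific. As a quick cross-check, here is the same comparison in Python, where `sys.float_info.max` holds the same value as Go's math.MaxFloat64; the name `max_finite` is just for this sketch.

```python
import math
import sys

max_finite = sys.float_info.max     # same value as Go's math.MaxFloat64

print(math.inf > max_finite)        # True
print(-math.inf < -max_finite)      # True
print(max_finite * 2 == math.inf)   # True: the product overflows to +Inf
print(math.isinf(max_finite))       # False: the maximum is still finite
```

So the choice mostly depends on whether you need a sentinel that compares beyond every finite value (infinity) or the largest number you can still do arithmetic with (the maximum).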
Does anyone know why integer division in C# returns an integer and not a float?
What is the idea behind it? (Is it only a legacy of C/C++?)
In C#:
float x = 13 / 4;
// == operator is overridden here to use epsilon compare
if (x == 3.0)
    Console.WriteLine("Hello world");
Result of this code would be:
'Hello world'
Strictly speaking, there is no such thing as integer division (division by definition is an operation which produces a rational number, and the integers are only a small subset of the rationals).
While it is common for new programmers to perform integer division by mistake when they actually meant floating-point division, in practice integer division is a very common operation. If you are assuming that people rarely use it, and that every time you divide you'll need to remember to cast to floating point, you are mistaken.
First off, integer division is quite a bit faster, so if you only need a whole number result, one would want to use the more efficient algorithm.
Secondly, there are a number of algorithms that use integer division, and if the result of division was always a floating point number you would be forced to round the result every time. One example off of the top of my head is changing the base of a number. Calculating each digit involves the integer division of a number along with the remainder, rather than the floating point division of the number.
Because of these (and other related) reasons, integer division results in an integer. If you want to get the floating point division of two integers you'll just need to remember to cast one to a double/float/decimal.
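To make the base-conversion example concrete, here is a small sketch in Python (chosen only for brevity; the function name `to_base` is made up). Each step needs exactly the integer quotient and the remainder, which `divmod` returns together.

```python
def to_base(n: int, base: int) -> str:
    """Render a non-negative integer n in the given base (2 to 36)."""
    digits = "0123456789abcdefghijklmnopqrstuvwxyz"
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, remainder = divmod(n, base)   # integer quotient and remainder
        out.append(digits[remainder])    # least significant digit first
    return "".join(reversed(out))

print(to_base(255, 2))    # 11111111
print(to_base(255, 16))   # ff
```

With floating-point division every iteration would need an explicit floor or truncation, and large values would lose precision.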
See C# specification. There are three types of division operators
Integer division
Floating-point division
Decimal division
In your case we have Integer division, with following rules applied:
The division rounds the result towards zero, and the absolute value of
the result is the largest possible integer that is less than the
absolute value of the quotient of the two operands. The result is zero
or positive when the two operands have the same sign and zero or
negative when the two operands have opposite signs.
I think the reason why C# uses this type of division for integers (some languages return a floating-point result) is hardware: integer division is faster and simpler.
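The rounding rule quoted above is easy to get wrong for negative operands, so here is a minimal sketch of it in Python. Python's own `//` operator rounds toward negative infinity instead, which makes the difference visible. The helper name `div_toward_zero` is invented for this example.

```python
def div_toward_zero(x: int, y: int) -> int:
    """Integer quotient rounded toward zero, following the quoted rule."""
    magnitude = abs(x) // abs(y)        # largest integer not above |x / y|
    same_sign = (x >= 0) == (y >= 0)
    return magnitude if same_sign else -magnitude

print(div_toward_zero(13, 4))    # 3
print(div_toward_zero(-13, 4))   # -3
print(-13 // 4)                  # -4: Python's // floors instead
```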
Each data type is capable of overloading each operator. If both the numerator and the denominator are integers, the integer type will perform the division operation and it will return an integer type. If you want floating-point division, you must cast at least one of the numbers to a floating-point type before dividing them. For instance:
int y = 13;
int z = 4;
float x = (float)y / (float)z;
or, if you are using literals:
float x = 13f / 4f;
Keep in mind, floating points are not precise. If you care about precision, use something like the decimal type, instead.
Since you don't use any suffix, the literals 13 and 4 are interpreted as integer:
If the literal has no suffix, it has the first of these types in which its value can be represented: int, uint, long, ulong.
Thus, since you declare 13 as integer, integer division will be performed:
For an operation of the form x / y, binary operator overload resolution is applied to select a specific operator implementation. The operands are converted to the parameter types of the selected operator, and the type of the result is the return type of the operator.
The predefined division operators are listed below. The operators all compute the quotient of x and y.
Integer division:
int operator /(int x, int y);
uint operator /(uint x, uint y);
long operator /(long x, long y);
ulong operator /(ulong x, ulong y);
And so the result is rounded toward zero:
The division rounds the result towards zero, and the absolute value of the result is the largest possible integer that is less than the absolute value of the quotient of the two operands. The result is zero or positive when the two operands have the same sign and zero or negative when the two operands have opposite signs.
If you do the following:
int x = 13f / 4f;
You'll receive a compiler error, since a floating-point division (the / operator of 13f) results in a float, which cannot be implicitly converted to int.
Assigning the result to a float does not by itself make the division a floating-point division:
float x = 13 / 4;
Notice that you still divide integers here; the int result is then implicitly converted to float, so x will be 3.0. To get a floating-point division, explicitly declare the operands as float using the f suffix (13f, 4f).
Might be useful:
double a = 5.0/2.0;
Console.WriteLine (a); // 2.5
double b = 5/2;
Console.WriteLine (b); // 2
int c = 5/2;
Console.WriteLine (c); // 2
double d = 5f/2f;
Console.WriteLine (d); // 2.5
It's just a basic operation.
Remember when you learned to divide. In the beginning we solved 9/6 = 1 with remainder 3.
9 / 6 == 1 //true
9 % 6 == 3 // true
The /-operator in combination with the %-operator are used to retrieve those values.
The result will always be of the type that has the greater range of the numerator and the denominator. The exceptions are byte and short, which produce int (Int32).
var a = (byte)5 / (byte)2; // 2 (Int32)
var b = (short)5 / (byte)2; // 2 (Int32)
var c = 5 / 2; // 2 (Int32)
var d = 5 / 2U; // 2 (UInt32)
var e = 5L / 2U; // 2 (Int64)
var f = 5L / 2UL; // 2 (UInt64)
var g = 5F / 2UL; // 2.5 (Single/float)
var h = 5F / 2D; // 2.5 (Double)
var i = 5.0 / 2F; // 2.5 (Double)
var j = 5M / 2; // 2.5 (Decimal)
var k = 5M / 2F; // Not allowed
There is no implicit conversion between floating-point types and the decimal type, so division between them is not allowed. You have to explicitly cast and decide which one you want (Decimal has more precision and a smaller range compared to floating-point types).
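This is not unique to C#. As a loose analogy only (Python is not C#, and the helper name `try_divide` is made up), Python's decimal module makes the same design choice: mixing Decimal with an integer works, mixing it with a binary float raises TypeError until you convert one side explicitly.

```python
from decimal import Decimal

def try_divide(a, b):
    """Return a / b, or the string 'not allowed' if the types do not mix."""
    try:
        return a / b
    except TypeError:
        return "not allowed"

print(try_divide(Decimal(5), 2))      # 2.5 (Decimal and int mix fine)
print(try_divide(Decimal(5), 2.0))    # not allowed (Decimal and float do not)
print(Decimal(5) / Decimal(2.0))      # 2.5 after converting to Decimal
print(float(Decimal(5)) / 2.0)        # 2.5 after converting to float
```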
As a little trick to know what you are obtaining you can use var, so the compiler will tell you the type to expect:
int a = 1;
int b = 2;
var result = a/b;
Your compiler will tell you that result is of type int here.
I'm trying to produce a float by dividing two ints in my program. Here is what I'd expect:
1 / 120 = 0.00833
Here is the code I'm using:
float a = 1 / 120;
However it doesn't give me the result I'd expect. When I print it out I get the following:
Do the following
float a = 1./120.;
You need to specify that you want to use floating point math.
There are a few ways to do this:
If you really are interested in dividing two constants, you can specify that you want floating point math by making the first constant a float (or double). All it takes is a decimal point.
float a = 1./120;
You don't need to make the second constant a float, though it doesn't hurt anything.
Frankly, this is pretty easy to miss so I'd suggest adding a trailing zero and some spacing.
float a = 1.0 / 120;
If you really want to do the math with an integer variable, you can type cast it:
float a = (float)i/120;
float a = 1/120;
float b = 1.0/120;
float c = 1.0/120.0;
float d = 1.0f/120.0f;
NSLog(@"Value of A:%f B:%f C:%f D:%f",a,b,c,d);
Output: Value of A:0.000000 B:0.008333 C:0.008333 D:0.008333
For float variable a: int / int yields an int (0), which is converted to float on assignment, so it prints 0.000000.
For float variable b: double / int yields a double, which is converted to float on assignment, so it prints 0.008333.
For float variable c: double / double yields a double, so again 0.008333.
For float variable d: both literals carry the 'f' suffix, so the division is done in float. In the previous cases the literals are of type double: a floating-point constant is a double unless it is followed by an 'f' to specify a float.
In C (and therefore also in Objective-C), expressions are almost always evaluated without regard to the context in which they appear.
The expression 1 / 120 is a division of two int operands, so it yields an int result. Integer division truncates, so 1 / 120 yields 0. The fact that the result is used to initialize a float object doesn't change the way 1 / 120 is evaluated.
This can be counterintuitive at times, especially if you're accustomed to the way calculators generally work (they usually store all results in floating-point).
As the other answers have said, to get a result close to 0.00833 (which can't be represented exactly, BTW), you need to do a floating-point division rather than an integer division, by making one or both of the operands floating-point. If one operand is floating-point and the other is an integer, the integer operand is converted to floating-point first; there is no direct floating-point by integer division operation.
Note that, as @0x8badf00d's comment says, the result should be 0. Something else must be going wrong for the printed result to be inf. If you can show us more code, preferably a small complete program, we can help figure that out.
(There are languages in which integer division yields a floating-point result. Even in those languages, the evaluation isn't necessarily affected by its context. Python version 3 is one such language; C, Objective-C, and Python version 2 are not.)
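For comparison, this is roughly what the Python 3 behaviour mentioned above looks like. The variable names are only for illustration. Note that even here the operator decides the result, not the context the value is used in.

```python
a = 1 / 120           # true division: always yields a float in Python 3
b = 1 // 120          # floor division: stays an int
c = float(1 // 120)   # converting afterwards cannot bring the fraction back

print(f"{a:.6f}")     # 0.008333
print(b)              # 0
print(c)              # 0.0
```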
I just came across this piece of code:
Dim d As Double
For i = 1 To 10
    d = d + 0.1
Next
MsgBox(d = 1)
MsgBox(1 - d)
Can anyone explain the reason for that to me? Why is d set to 1?
Floating-point and integer values have different binary representations, so an integer is converted to floating point before the two are compared.
The result of adding 0.1 ten times as a floating point type may well be a value that is close to 1, but not exactly.
When comparing floating point values, you need to use a minimum value by which the values can differ and still be considered the same value (this value is normally known as the epsilon). This value depends on the application.
I suggest reading What Every Computer Scientist Should Know About Floating-Point Arithmetic for an in-depth discussion.
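Here is a small sketch of the same experiment with an epsilon-style comparison, written in Python because its float is the same IEEE-754 double as VB's Double. The tolerance passed to `math.isclose` is an arbitrary choice for this example; as said above, the right value depends on the application.

```python
import math

d = 0.0
for _ in range(10):
    d += 0.1          # each addition rounds to the nearest double

print(d == 1.0)       # False
print(1.0 - d)        # 1.1102230246251565e-16
print(math.isclose(d, 1.0, rel_tol=1e-9))   # True: equal within tolerance
```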
As for comparing 1 to 1.0: these are different types, so the integer 1 is converted to a Double before the comparison. That conversion is exact, so it is not the source of the mismatch.
.1 (1/10th) is a repeating fraction when converted to binary:
It would be like trying to show 1/3 as a decimal: you just can't do it accurately.
This is because a Double (like any binary floating-point value) can only store an approximation of such a value, not the exact value itself. When you need an exact decimal value, use a Decimal instead.
Contrast with:
Dim d As Decimal
For i = 1 To 10
    d = d + 0.1D
Next
MsgBox(d = 1)
MsgBox(1 - d)
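The same contrast can be reproduced outside VB. The sketch below uses Python's decimal module as a stand-in for the Decimal type (an assumption for illustration, since the two are not identical) and also shows the repeating binary pattern of 0.1 through its hexadecimal form.

```python
from decimal import Decimal
from fractions import Fraction

# The stored double for 0.1: the hex digits 9999... repeat and get cut off.
print((0.1).hex())                        # 0x1.999999999999ap-4
print(Fraction(0.1) == Fraction(1, 10))   # False: not exactly one tenth

d = Decimal(0)
for _ in range(10):
    d += Decimal("0.1")   # decimal arithmetic represents 0.1 exactly

print(d == 1)             # True
```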
I ran into an unexpected result in round-tripping Int32.MaxValue into a System.Single:
Int32 i = Int32.MaxValue;
Single s = i;
Int32 c = (Int32)s;
Debug.WriteLine(i); // 2147483647
Debug.WriteLine(c); // -2147483648
I realized that it must be overflowing, since Single doesn't have enough bits in the significand to hold the Int32 value, and it rounds up. When I changed the conv.r4 to conv.r4.ovf in the IL, an OverflowException is thrown. Fair enough...
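The rounding step can be observed without any IL. Below is a minimal Python sketch that pushes a value through a 32-bit float using the struct module; `to_float32` is a helper invented for this example. With a 24-bit significand, the representable values just below 2^31 are 128 apart, so 2147483647 has to round to a neighbour.

```python
import struct

def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float and return it as a Python float."""
    return struct.unpack("f", struct.pack("f", x))[0]

int32_max = 2**31 - 1

print(to_float32(float(int32_max)))               # 2147483648.0
print(to_float32(float(int32_max)) > int32_max)   # True: rounded up past Int32.MaxValue
print(to_float32(2147483520.0))                   # 2147483520.0, the largest float32 below 2**31
```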
However, while I was investigating this issue, I compiled this code in java and ran it and got the following:
int i = Integer.MAX_VALUE;
float s = (float)i;
int c = (int)s;
System.out.println(i); // 2147483647
System.out.println(c); // 2147483647
I don't know much about the JVM, but I wonder how it does this. It seems much less surprising, but how does it retain the extra digit after rounding to 2.14748365E9? Does it keep some kind of internal representation around and then replace it when casting back to int? Or does it just round down to Integer.MAX_VALUE to avoid overflow?
This case is explicitly handled by §5.1.3 of the Java Language Specification:
A narrowing conversion of a floating-point number to an integral type T takes two steps:
In the first step, the floating-point number is converted either to a long, if T is long, or to an int, if T is byte, short, char, or int, as follows:
  If the floating-point number is NaN (§4.2.3), the result of the first step of the conversion is an int or long 0.
  Otherwise, if the floating-point number is not an infinity, the floating-point value is rounded to an integer value V, rounding toward zero using IEEE 754 round-toward-zero mode (§4.2.3). Then there are two cases:
    If T is long, and this integer value can be represented as a long, then the result of the first step is the long value V.
    Otherwise, if this integer value can be represented as an int, then the result of the first step is the int value V.
  Otherwise, one of the following two cases must be true:
    The value must be too small (a negative value of large magnitude or negative infinity), and the result of the first step is the smallest representable value of type int or long.
    The value must be too large (a positive value of large magnitude or positive infinity), and the result of the first step is the largest representable value of type int or long.
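To see how these rules produce the Java output in the question, here is a minimal Python model of the first step for T = int. It is a sketch of the quoted rules, not the JVM's implementation, and the names `to_float32` and `java_style_float_to_int` are invented for this example.

```python
import math
import struct

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def to_float32(x: float) -> float:
    """Round x to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def java_style_float_to_int(x: float) -> int:
    """Model of the quoted narrowing rules for a target type of int."""
    if math.isnan(x):
        return 0                 # NaN converts to 0
    if x >= INT_MAX:
        return INT_MAX           # too large (or +infinity): clamp to the maximum
    if x <= INT_MIN:
        return INT_MIN           # too small (or -infinity): clamp to the minimum
    return math.trunc(x)         # otherwise round toward zero

s = to_float32(float(INT_MAX))        # 2147483648.0 after rounding to float
print(java_style_float_to_int(s))     # 2147483647
```

So no extra digit is retained: the float really holds 2147483648, and the conversion back clamps it to Integer.MAX_VALUE.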