How do I print a half-precision float using printf in the AMD OpenCL SDK?

The Programming Guide has instructions for doubles (%ld) and vector types (e.g. %v4f), but not half-precision floats.

Normally in C, varargs arguments are automatically promoted to larger data types, such as float to double. The OpenCL documentation seems to imply that a similar promotion applies there.
Therefore a simple %f should also work for half-precision floats.
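A minimal kernel sketch of that approach (the kernel name and buffer layout are just for illustration; declaring a half variable assumes the device supports cl_khr_fp16, while the vload_half alternative works even without it):

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void print_half(__global const half *in) {
    // Widen explicitly to float; printf's %f accepts the promoted value.
    half h = in[get_global_id(0)];
    printf("%f\n", (float)h);

    // Alternative that needs no cl_khr_fp16 arithmetic support:
    // vload_half reads a half from memory and returns a float.
    printf("%f\n", vload_half(get_global_id(0), in));
}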

Related

Standard text representation for floating-point numbers

Is there a standard text representation for the floating-point numbers that is supported by the most popular languages?
What is the standard for representing infinities and NaNs?
There isn't a general consensus, unfortunately.
However, there seems to be some convergence on hexadecimal notation for floats. See pg. 57/58 of http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf
The advantage of this notation is that you can precisely represent the value of the float as represented by the machine without worrying about any loss of precision. See this page for examples: https://www.exploringbinary.com/hexadecimal-floating-point-constants/
Note that NaN and infinity values are not supported by hexadecimal floats. There seems to be no general consensus on how to write these; most languages don't even allow writing them as constants, so you resort to expressions such as 0/0 or 1/0 instead.
Since you tagged this question with serialization, I'd recommend simply serializing the bit pattern you have for the float value. This costs 8 characters for single precision and 16 characters for double precision (64 and 128 bits of text respectively, assuming 8 bits per character). Perhaps not the most efficient, but it ensures you can encode all possible values and transmit them precisely.
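A short C sketch of both options (the %a conversion prints the exact hexadecimal form, and memcpy extracts the raw bits for the bit-pattern encoding):

#include <inttypes.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 0.1f;

    // Hexadecimal floating-point notation: exact and round-trippable
    // (it can be read back with strtof/strtod or scanf's %a).
    printf("%a\n", f);                 // prints 0x1.99999ap-4

    // Bit-pattern serialization: copy out the raw 32 bits and emit
    // them as 8 hexadecimal characters.
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    printf("%08" PRIX32 "\n", bits);
    return 0;
}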

Explaining the different types in Metal and SIMD

When working with Metal, I find there's a bewildering number of types and it's not always clear to me which type I should be using in which context.
In Apple's Metal Shading Language Specification, there's a pretty clear table of which types are supported within a Metal shader file. However, there's plenty of sample code available that seems to use additional types that are part of SIMD. On the macOS (Objective-C) side of things, the Metal types are not available but the SIMD ones are, and I'm not sure which ones I'm supposed to be using.
For example:
In the Metal Spec, there's float2 that is described as a "vector" data type representing two floating components.
On the app side, the following all seem to be used or represented in some capacity:
float2, which is typedef ::simd_float2 float2 in vector_types.h
Noted: "In C or Objective-C, this type is available as simd_float2."
vector_float2, which is typedef simd_float2 vector_float2
Noted: "This type is deprecated; you should use simd_float2 or simd::float2 instead"
simd_float2, which is typedef __attribute__((__ext_vector_type__(2))) float simd_float2
::simd_float2 and simd::float2?
A similar situation exists for matrix types:
matrix_float4x4, simd_float4x4, ::simd_float4x4 and float4x4.
Could someone please shed some light on why there are so many typedefs with seemingly overlapping functionality? If you were writing a new application today (2018) in Objective-C / Objective-C++, which type should you use to represent two floating values (x/y) and which type for matrix transforms that can be shared between app code and Metal?
The types with vector_ and matrix_ prefixes have been deprecated in favor of those with the simd_ prefix, so the general guidance (using float4 as an example) would be:
In C code, use the simd_float4 type. (You have to include the prefix unless you provide your own typedef, since C doesn't have namespaces.)
Same for Objective-C.
In C++ code, use the simd::float4 type, which you can shorten to float4 by using namespace simd;.
Same for Objective-C++.
In Metal code, use the float4 type, since float4 is a fundamental type in the Metal Shading Language [1].
In Swift code, use the float4 type, since the simd_ types are typealiased to shorter names.
Update: In Swift 5, float4 and related types have been deprecated in favor of SIMD4<Float> and related types.
These types are all fundamentally equivalent, and all have the same size and alignment characteristics so you can use them across languages. That is, in fact, one of the design goals of the simd framework.
I'll leave a discussion of packed types to another day, since you didn't ask.
[1] Metal is an unusual case since it defines float4 in the global namespace, then imports it into the metal namespace, which is also exported as the simd namespace. It additionally aliases float4 as vector_float4. So, you can use any of the above names for this vector type (except simd_float4). Prefer float4.
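As a concrete illustration, here's a minimal app-side sketch (the header, struct, and field names are hypothetical): the simd types give the struct the same size and alignment as a matching struct declared with float4x4/float2 in a .metal file, so it can be copied into a buffer as-is.

// ShaderTypes.h -- shared by the C / Objective-C app code.
#include <simd/simd.h>

typedef struct {
    simd_float4x4 modelViewProjection;  // matrix transform shared with the shader
    simd_float2   position;             // two floating values (x/y)
} Uniforms;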
which type should you use to represent two floating values (x/y)
If you can avoid it, don't use a single SIMD vector to represent a single geometry x,y vector if you're using CPU SIMD.
CPU SIMD works best when you have many of the same thing in each SIMD vector, because the data is actually stored in 16-byte or 32-byte vector registers, where "vertical" operations between two vectors are cheap (packed add or multiply) but "horizontal" operations can mostly only be done with a shuffle plus a vertical operation.
For example a vector of 4 x values and another vector of 4 y values lets you do 4 dot-products or 4 cross-products in parallel with no shuffling, so the overall throughput is significantly more dot-products per clock cycle than if you had a vector of [x1, y1, x2, y2].
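A minimal C sketch of that struct-of-arrays layout using SSE intrinsics (the function and parameter names are just for illustration):

#include <immintrin.h>

// Four 2D dot products at once: x and y components live in separate
// arrays, so the whole computation is "vertical" -- no shuffles needed.
void dot2_x4(const float *x1, const float *y1,
             const float *x2, const float *y2, float *out) {
    __m128 xs = _mm_mul_ps(_mm_loadu_ps(x1), _mm_loadu_ps(x2)); // x1*x2
    __m128 ys = _mm_mul_ps(_mm_loadu_ps(y1), _mm_loadu_ps(y2)); // y1*y2
    _mm_storeu_ps(out, _mm_add_ps(xs, ys));                     // 4 results
}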
See https://stackoverflow.com/tags/sse/info, and especially these slides: SIMD at Insomniac Games (GDC 2015) for more about planning your data layout and program design for doing many similar operations in parallel instead of trying to accelerate single operations.
The one exception to this rule is if you're only adding / subtracting to translate coordinates, because that's still purely a vertical operation even with an array-of-structs. And thus fine for CPU short-vector SIMD based on 16-byte vectors. (e.g. the 2nd element in one vector only interacts with the 2nd element in another vector, so no shuffling is needed.)
GPU SIMD is different, and I think has no problem with interleaved data. I'm not a GPU expert.
(I don't use Objective C or Metal, so I can't help you with the details of their type names, just what the underlying CPU hardware is good at. That's basically the same for x86 SSE/AVX, ARM NEON / AArch64 SIMD, or PowerPC Altivec. Horizontal operations are slower.)

gmock: Testing two float vectors

I am trying to write a test for a vector.
For STL containers, I tried:
EXPECT_THAT(float_vec1, ElementsAreArray(float_vec2));
However, I need to allow for a margin of error.
Is there an ElementsAreArray equivalent of FloatNear(a_float, max_abs_error)?
Yes, I've used the Pointwise container matcher: you give it a matcher and an expected container (any STL container; it's also compatible with fixed-size C-style arrays).
EXPECT_THAT(float_vec1, Pointwise(matcher, float_vec2))
For the matcher, you can use FloatEq(), which uses ULP-based float comparison.
EXPECT_THAT(float_vec1, Pointwise(FloatEq(), float_vec2))
However, I've found it easier to use FloatNear(float max_abs_error) to define my own floating-point tolerance, like you want.
const float ferr = 1e-5f;
EXPECT_THAT(float_vec1, Pointwise(FloatNear(ferr), float_vec2));

OpenCL: Type conversion overhead

What is the cost of casting a variable to a different type in OpenCL?
Example: I want to take the dot product of 2 int3 vectors (AFAIK dot() isn't overloaded for int3), so instead of implementing dot() myself in an unvectorized way, I want to vectorize the code using the native dot() for float3: first I convert the 2 vectors to float3, then I cast the result back to int.
Which of the two functions, foo and bar, is less time consuming (and why)?
inline int foo(int3 a, int3 b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

inline int bar(int3 a, int3 b) {
    return (int)dot(convert_float3(a), convert_float3(b));
}
As has been suggested in the comments, measuring is going to be the most useful tool in practice, and the cost of individual instructions is heavily dependent on hardware architecture, but also the compiler.
Nevertheless, a comparison to other operations is useful, and at least AMD publishes a list of the instruction throughput for their devices in this section of their OpenCL optimisation guide, and this includes float-to-int and int-to-float conversion.
In your particular case, I strongly suspect your "vectorising" attempts will have detrimental effects. Most modern GPUs aren't SIMD processors in the CPU SIMD sense. The threads run in lock-step, but each thread operates on scalars. A "horizontal" operation like a dot product may not be particularly efficient even if the GPU does use per-thread SIMD.
If you can limit the range of each of your integers to 24 bits, a series of mad24() and mul24() calls will most likely be fastest. But again - measure. Try the different options on a range of hardware, and run them lots of times, applying basic stats to make sure you aren't just seeing random variation/overhead.
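A minimal sketch of that 24-bit variant (only correct if every component fits in 24 bits, as noted above):

// mul24()/mad24() use only the low 24 bits of their integer operands,
// which many GPUs can multiply faster than full 32-bit integers.
inline int dot3_i24(int3 a, int3 b) {
    return mad24(a.z, b.z, mad24(a.y, b.y, mul24(a.x, b.x)));
}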
A separate thing to note with regard to integer-to-float conversions is that such conversions are often "free" when you sample as floats from an image object containing integers.

Why is CGFloat float on 32 bit and double on 64 bit?

From "CoreGraphics/CGBase.h":
#if defined(__LP64__) && __LP64__
# define CGFLOAT_TYPE double
# define CGFLOAT_IS_DOUBLE 1
# define CGFLOAT_MIN DBL_MIN
# define CGFLOAT_MAX DBL_MAX
#else
# define CGFLOAT_TYPE float
# define CGFLOAT_IS_DOUBLE 0
# define CGFLOAT_MIN FLT_MIN
# define CGFLOAT_MAX FLT_MAX
#endif
Why did Apple do this? What's the advantage?
I can seem to think of downsides only. Please enlighten me.
Apple explicitly says they did it "to provide a wider range and accuracy for graphical quantities." You can debate whether the wider range and accuracy have been really helpful in practice, but Apple is clear on what they were thinking.
It's worth remembering, BTW, that CGFloat was added in OS X 10.5, long before iPhones (and certainly long before 64-bit iPhones). Going 64-bit is more obviously beneficial on "big memory" machines like Macs, and Apple created these architecture-dependent types to make it easier to transition between the "old" and "new" worlds.
I think it's interesting that Swift brought over NSInteger as the default Int type (i.e. Int is architecture-specific), but made Float and Double architecture-independent; there is no equivalent of CGFloat in the language. I read this as a tacit acknowledgement that CGFloat wasn't the greatest idea.
Also note that NEON only supports single-precision floating-point math; double-precision math has to be done on the VFP. (Not that NEON was a consideration when CGFloat was invented.)
It's a performance thing.
On a 32-bit CPU, a single precision, 32-bit float can be stored in a single register, and moved around quickly and efficiently, because it's the same size as an architecture-native pointer.
On a 64-bit CPU architecture, a 64-bit IEEE double has the same advantage of being the same size as a native pointer/register/etc.
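A quick C sketch that makes the size relationship visible (assuming a macOS build where the CoreGraphics headers are available):

#include <CoreGraphics/CGBase.h>
#include <stdio.h>

int main(void) {
    // On a 64-bit build CGFloat is double (8 bytes, matching the pointer
    // size); on a 32-bit build it is float (4 bytes).
    printf("sizeof(CGFloat) = %zu, sizeof(void *) = %zu, CGFLOAT_IS_DOUBLE = %d\n",
           sizeof(CGFloat), sizeof(void *), CGFLOAT_IS_DOUBLE);
    return 0;
}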