DirectX does not allow double precision math in VB.net

I have a large VB.net application that does FEM structural analysis, and it requires double precision math. The application also uses DirectX for the graphics. I now know that DirectX intentionally sets the floating-point unit (FPU) to single precision by default when it starts. That is a big problem. I need to figure out how to start DirectX while preserving double precision. Currently I start DirectX with the following:
Dev = New Device(0, DeviceType.Hardware, Panel2,
CreateFlags.SoftwareVertexProcessing, pParams)
I have read that using CreateFlags.FpuPreserve, as shown below, will preserve double precision, but when I try this DirectX does not start.
Dev = New Device(0, DeviceType.Hardware, Panel2, CreateFlags.FpuPreserve, pParams)
Can anybody tell me how to start DirectX from within VB.net and preserve double precision?

First, to get your code to work, you still need to specify software vertex processing. In VB, this is done as:
Dev = New Device(0, DeviceType.Hardware, Panel2, CreateFlags.SoftwareVertexProcessing Or CreateFlags.FpuPreserve, pParams)
Once you've fixed that, you'll find it still doesn't do what you want. FpuPreserve corresponds to D3DCREATE_FPU_PRESERVE. This affects the floating-point precision calculations on the CPU, not the GPU.
To get double-precision GPU calculations, you first need to use the Direct3D 11 API or higher (it looks like you are using Direct3D 9). Even then, double-precision support is an optional feature; you have to query ID3D11Device::CheckFeatureSupport with D3D11_FEATURE_DOUBLES to see if it exists.
In SharpDX it would be device.CheckFeatureSupport(Feature.ShaderDoubles), where device is of type Direct3D11.Device, or GraphicsDeviceFeatures.HasDoublePrecision if you are using SharpDX.Toolkit.Graphics.
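For reference, the underlying Direct3D 11 query looks roughly like this from C (a sketch only: it assumes an ID3D11Device has already been created elsewhere, the project links against d3d11, and error handling is minimal):
    /* Sketch: query double-precision shader support through the Direct3D 11 C API.
       Assumes "device" (ID3D11Device *) was created earlier with D3D11CreateDevice. */
    #define COBJMACROS
    #include <d3d11.h>

    static int supports_shader_doubles(ID3D11Device *device)
    {
        D3D11_FEATURE_DATA_DOUBLES doubles = { 0 };
        HRESULT hr = ID3D11Device_CheckFeatureSupport(
            device, D3D11_FEATURE_DOUBLES, &doubles, sizeof(doubles));
        return SUCCEEDED(hr) && doubles.DoublePrecisionFloatShaderOps;
    }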

Yes, that did it, thanks! The call needed "Or CreateFlags.FpuPreserve" added to it as shown above, the Or being critical. Double precision is then maintained in CPU/FPU calculations, which is exactly what I need. I don't know what precision is being used in the graphics calculations, but I don't care about that.

Related

Objective-C - Why 64-bit means different variable types

I have heard that people use types such as NSInteger or CGFloat rather than int or float because of something to do with 64-bit systems. I still don't understand why that is necessary, even though I do it throughout my own code. Basically, why would I need a larger integer for a 64-bit system?
People also say that it is not necessary in iOS at the moment, although it may become necessary in the future with 64-bit iPhones and such.
It is all explained here:
Introduction to 64-Bit Transition Guide
In the section Major 64-Bit Changes:
Data Type Size and Alignment
OS X uses two data models: ILP32 (in which integers, long integers, and pointers are 32-bit quantities) and LP64 (in which integers are 32-bit quantities, and long integers and pointers are 64-bit quantities). Other types are equivalent to their 32-bit counterparts (except for size_t and a few others that are defined based on the size of long integers or pointers).
While almost all UNIX and Linux implementations use LP64, other operating systems use various data models. Windows, for example, uses LLP64, in which long long variables and pointers are 64-bit quantities, while long integers are 32-bit quantities. Cray, by contrast, uses ILP64, in which int variables are also 64-bit quantities.
In OS X, the default alignment used for data structure layout is natural alignment (with a few exceptions noted below). Natural alignment means that data elements within a structure are aligned at intervals corresponding to the width of the underlying data type. For example, an int variable, which is 4 bytes wide, would be aligned on a 4-byte boundary.
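To see which data model and alignment rules a given platform uses, a small C program along these lines (just an illustration) is enough:
    /* Prints the sizes that distinguish ILP32 from LP64, plus a quick look at
       natural alignment. On LP64 (64-bit OS X/Linux), long and pointers report 8;
       on ILP32, and for long on Windows' LLP64, they stay at 4. */
    #include <stdio.h>
    #include <stddef.h>

    struct sample {
        char c;   /* 1 byte */
        int  i;   /* naturally aligned to 4 bytes, so padding is inserted before it */
    };

    int main(void)
    {
        printf("int:     %zu bytes\n", sizeof(int));
        printf("long:    %zu bytes\n", sizeof(long));
        printf("pointer: %zu bytes\n", sizeof(void *));
        printf("offset of i in struct sample: %zu\n", offsetof(struct sample, i));
        return 0;
    }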
There is a lot more that you can read in this document. It is very well written. I strongly recommend it to you.
Essentially it boils down to this: If you use CGFloat/NSInteger/etc, Apple can make backwards-incompatible changes and you can mostly update your app by just recompiling your code. You really don't want to be going through your app, checking every use of int and double.
What backwards-incompatible changes? Plenty!
M68K to PowerPC
64-bit PowerPC
PowerPC to x86
x86-64 (I'm not sure if this came before or after iOS.)
iOS
CGFloat means "the floating-point type that CoreGraphics uses": double on OS X and float on iOS. If you use CGFloat, your code will work on both platforms without unnecessarily losing performance (on iOS) or precision (on OS X).
NSInteger and NSUInteger are less clear-cut, but they're used approximately where you might use ssize_t or size_t in standard C. int or unsigned int simply isn't big enough on 64-bit OS X, where you might have a list with more than ~2 billion items. (The unsignedness doesn't increase it to 4 billion due to the way NSNotFound is defined.)
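The SDK headers do this with plain preprocessor conditionals; simplified (and paraphrased, not the exact Apple source), they look roughly like this:
    /* Roughly how CGFloat and NSInteger are defined (simplified paraphrase of the
       SDK headers, not the exact source). __LP64__ is set by the compiler when
       building for a 64-bit data model. */
    #if defined(__LP64__) && __LP64__
    typedef double        CGFloat;
    typedef long          NSInteger;
    typedef unsigned long NSUInteger;
    #else
    typedef float         CGFloat;
    typedef int           NSInteger;
    typedef unsigned int  NSUInteger;
    #endif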

iPhone/iPad double precision math

The accepted answer to this Community Wiki question, What are best practices that you use when writing Objective-C and Cocoa?, says that iPhones can't do double precision math (or rather, they can, but only emulated in software). Unfortunately, the 2009 link it provides as a reference, Double vs float on the iPhone, directly contradicts that statement. Both are old, having been written when the 3GS was just coming out.
So what's the story with the latest ARMv7 architecture? Do I need to worry about the 'thumb' compiler flag the second link references? Can I safely use double variables? I am having nasty flashbacks to the dark days of 386SXs and DXs and 'math coprocessors.' Tell me it's 2012 and we've moved on.
In all of the current iPhones, there is no reason double precision shouldn't be supported. They use the Cortex-A8 (or newer) architecture with a VFP (vector floating-point) coprocessor, which supports IEEE float and double calculations in hardware.
http://www.arm.com/products/processors/technologies/vector-floating-point.php
So yes, you can safely use doubles and it shouldn't be software emulated on the iPhones.
Although this is done in hardware, double arithmetic may still take more cycles than single-precision (float) arithmetic. Whichever you use, check that the precision is appropriate for your application.
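If you'd rather measure than guess on a particular device, a crude timing loop like the following gives a first impression (a sketch only; the volatile accumulators keep the compiler from deleting the loops, and real benchmarking needs more care):
    /* Crude float-vs-double timing sketch; only a rough indicator. */
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        enum { N = 10 * 1000 * 1000 };
        volatile float  fsum = 0.0f;
        volatile double dsum = 0.0;

        clock_t t0 = clock();
        for (int i = 0; i < N; i++) fsum += 1.0f / (float)(i + 1);
        clock_t t1 = clock();
        for (int i = 0; i < N; i++) dsum += 1.0 / (double)(i + 1);
        clock_t t2 = clock();

        printf("float:  %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("double: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }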
As a side note, if the processor supports the NEON instruction set, single-precision (float) math may be calculated faster than with the VFP coprocessor; NEON on these cores does not handle double precision.
http://pandorawiki.org/Floating_Point_Optimization#Compiler_Support
EDIT: Though VFP and NEON are optional extensions to the ARM core, most Cortex-A8 implementations have them, and all of the ones used in the iPhone and iPad do as well.

Why are C variable sizes implementation specific?

Wouldn't it make more sense (for example) for an int to always be 4 bytes?
How do I ensure my C programs are cross platform if variable sizes are implementation specific?
The types' sizes aren't fixed by C because C code needs to be able to compile on embedded systems as well as on your average x86 processor and on future processors.
You can include stdint.h and then use types like:
int32_t (32 bit integer type)
and
uint32_t (32 bit unsigned integer type)
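A minimal sketch of using them; the fixed-width types have the same size on every platform that provides them, unlike plain int or long:
    /* Minimal example of fixed-width types from <stdint.h>; the PRId32/PRIX32
       macros from <inttypes.h> pick the matching printf format specifiers. */
    #include <stdint.h>
    #include <inttypes.h>
    #include <stdio.h>

    int main(void)
    {
        int32_t  counter = -123456;
        uint32_t mask    = 0xFF00FF00u;

        printf("counter = %" PRId32 ", mask = 0x%" PRIX32 "\n", counter, mask);
        return 0;
    }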
C is often used to write low-level code that's specific to the CPU architecture it runs on. The size of an int, or of a pointer type, is supposed to map to the native types supported by the processor. On a 32-bit CPU, 32-bit ints make sense, but they won't fit on the smaller CPUs which were common in the early 1970s, or on the microcomputers which followed a decade later. Nowadays, if your processor has native support for 64-bit integer types, why would you want to be limited to 32-bit ints?
Of course, this makes writing truly portable programs more challenging, as you've suggested. The best way to ensure that you don't build accidental dependencies on a particular architecture's types into your programs is to port them to a variety of architectures early on, and to build and test them on all those architectures often. Over time, you'll become familiar with the portability issues, and you'll tend to write your code more carefully.
Becoming aware that not all processors have integer types the same width as their pointer types, or that not all computers use twos-complement arithmetic for signed integers, will help you to recognize these assumptions in your own code.
You need to check the size of int in your implementation. Don't assume it is always 4 bytes. Use
sizeof(int)
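If some piece of code genuinely depends on a 4-byte int, you can also turn that assumption into a compile-time check so the build fails loudly where it doesn't hold (C11 sketch):
    /* C11: make the "int is 4 bytes" assumption explicit instead of a silent
       portability bug; compilation fails on platforms where it is false. */
    _Static_assert(sizeof(int) == 4, "this code assumes a 4-byte int");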

Arbitrary Precision Arithmetic (Bignum) for 16-bit processor

I'm developing an application for a 16-bit embedded device (80251 microcontroller), and I need arbitrary precision arithmetic. Does anyone know of a library that works for the 8051 or 80251?
GMP doesn't explicitly support the 8051, and I'm wary of the problems I could run into on a 16-bit device.
Thanks
Try this one. Or, give us an idea of what you're trying to do with it; understanding the workload would help a lot. TTMath looks promising. Or, there are approximately a zillion of them listed in the Wikipedia article.
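If a full library turns out to be too heavy for the part, the core primitive is small enough to hand-roll. Here is a hedged sketch of multi-word addition on 16-bit limbs, just to illustrate the idea (not a drop-in replacement for GMP or TTMath):
    /* Sketch of arbitrary-precision addition using 16-bit limbs, the kind of
       primitive a bignum library is built on. Numbers are little-endian arrays
       of uint16_t; the return value is the final carry out. */
    #include <stdint.h>
    #include <stddef.h>

    uint16_t bignum_add(uint16_t *result, const uint16_t *a, const uint16_t *b,
                        size_t limbs)
    {
        uint32_t carry = 0;
        for (size_t i = 0; i < limbs; i++) {
            uint32_t sum = (uint32_t)a[i] + b[i] + carry;
            result[i] = (uint16_t)sum;   /* low 16 bits */
            carry     = sum >> 16;       /* anything above becomes the carry */
        }
        return (uint16_t)carry;
    }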

Optimizing for ARM: Why different CPUs affects different algorithms differently (and drastically)

I was doing some benchmarks of code performance on Windows Mobile devices, and noticed that some algorithms were doing significantly better on some hosts and significantly worse on others, even after taking the difference in clock speeds into account.
The statistics for reference (all results are generated from the same binary, compiled by Visual Studio 2005 targeting ARMv4):
Intel XScale PXA270
Algorithm A: 22642 ms
Algorithm B: 29271 ms
ARM1136EJ-S core (embedded in a MSM7201A chip)
Algorithm A: 24874 ms
Algorithm B: 29504 ms
ARM926EJ-S core (embedded in an OMAP 850 chip)
Algorithm A: 70215 ms
Algorithm B: 31652 ms (!)
I checked out floating point as a possible cause, and while algorithm B does use floating-point code, it does not use it in the inner loop, and none of the cores seem to have an FPU.
So my question is: what mechanism may be causing this difference, preferably with suggestions on how to fix/avoid the bottleneck in question?
Thanks in advance.
One possible cause is that the 926 has a shorter pipeline (5 stages vs. 8 for the 1136, IIRC), so branch mispredictions are less costly on the 926.
That said, there are a lot of architectural differences between those processors, too many to say for sure why you see this effect without knowing something about the instructions that you're actually executing.
Clock speed is only one factor. Bus width and latency are as big a factor, if not bigger. Cache is a factor. The speed of the media the program runs from matters too, if it runs from media rather than from memory.
Is this test using any shared libraries at any point, or is it all internal code? Fetching shared libraries from media will vary from platform to platform (even if it is, say, the same SD card).
Is this the same algorithm compiled separately for each platform, or the same binary? You can and will see some compiler-induced variation as well; 50% faster or slower can easily come from the same compiler on the same platform just by varying compiler settings. If possible you want to execute the same binary and ensure that no shared libraries are used in the loop under test. If it is not the same binary, disassemble the loop under test for each platform and ensure that there are no variations other than register selection.
From the data you have presented, it's difficult to pinpoint the exact problem, but we can share some prior experience:
Cache settings (check whether all the processors have the same cache configuration). You need to check both the D-cache and the I-cache.
For analysis, break down your code further, not just by algorithm but at the block level, and try to understand which block causes the bottleneck. Once you find that block, disassemble it and inspect the assembly. It may help.
It looks like the problem is in the cache settings or something memory-related (maybe I-cache "overflow"). Pipeline stalls and branch mispredictions usually produce less significant differences.
You can try to count some basic operations executed in each algorithm, for example:
number of "easy" arithmetic/bitwise ops (+ - | ^ &) and shifts by a constant
number of shifts by a variable
number of multiplications
number of "hard" arithmetic operations (divisions, floating-point ops)
number of aligned 32-bit memory reads
number of 8-bit (byte) memory reads (these are slower than 32-bit reads)
number of aligned 32-bit memory writes
number of 8-bit (byte) memory writes
number of branches
something else, don't remember more :)
That will give you an idea of which operations make the 926 so much slower. After that you can check the suspicious blocks by making their use of those operations more or less intensive, and you'll get the answer.
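One low-tech way to gather those counts is to instrument the block under test with counters during a profiling build (a sketch with made-up names; strip the macro out, or define it as empty, for the real timing runs, since the counting itself skews the measurement):
    /* Global counters bumped by a macro so you can tally how often each class
       of operation runs. checksum() is a stand-in for the block under test. */
    #include <stdio.h>

    static unsigned long g_mul, g_div, g_byte_load, g_word_load;
    #define COUNT(counter) ((counter)++)

    static int checksum(const unsigned char *buf, int len)
    {
        int sum = 0;
        for (int i = 0; i < len; i++) {
            COUNT(g_byte_load);          /* 8-bit memory read */
            sum += buf[i] * 31;          /* multiplication */
            COUNT(g_mul);
        }
        return sum;
    }

    int main(void)
    {
        unsigned char data[64] = { 0 };
        (void)checksum(data, sizeof(data));
        printf("mul=%lu div=%lu byte loads=%lu word loads=%lu\n",
               g_mul, g_div, g_byte_load, g_word_load);
        return 0;
    }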
Furthermore, it's much better to enable assembly listing generation in VS and use that (rather than your high-level source code) as the basis for the investigation.
P.S.: maybe the problem is in the OS/software/firmware? Did you test on a clean system? Is the OS the same on all devices?