I mean in the current implementation of clang or the gcc version.
C++ and Java guys always tell me that exceptions do not cost any performance unless they are thrown. Is the same true for Objective-C?
Short Answer
Only in 64-bit OS X and iOS.
They're not entirely free. To be more precise, the model is optimized to minimize costs during regular execution (moving consequences elsewhere).
Detailed Answer
On 32-bit OS X and iOS, exceptions have runtime costs even when they are not thrown. These architectures do not use Zero Cost Exceptions.
In 64-bit OS X, ObjC moved over to borrow C++'s "Zero Cost Exceptions". Zero Cost Exceptions have very low execution overhead unless thrown; they effectively trade execution cost for binary size. This was one of the primary reasons they were not initially used in iOS: enabling C++ exceptions and RTTI can increase the binary size by more than 50%. Of course, I would expect those numbers to be far lower in pure ObjC, simply because there is less to execute when unwinding.
In arm64, the exception model was changed from setjmp/longjmp to Itanium-derived Zero Cost Exceptions (judging by the generated assembly).
However, idiomatic ObjC programs are not written or prepared to recover from exceptions, so you should reserve their use for situations you do not intend to recover from (if you decide to use them at all). More details are in the Clang manual on ARC, and in other sections of the referenced page.
According to the 2007 release notes for the Objective-C runtime in Mac OS X v10.5, Apple rewrote the 64-bit implementation of Objective-C exceptions to provide "zero-cost" try blocks and interoperability with C++.
Apparently, these "zero-cost" try blocks incur no time penalty when entering a try, unlike their 32-bit counterpart, which must call setjmp() and other functions. Throwing an exception, on the other hand, is "much more expensive".
This is the only bit of information I can find in Apple's release notes, so I have to assume it still applies in today's runtimes; as such, 32-bit exceptions = expensive, 64-bit exceptions = "zero-cost".
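Since the 64-bit runtime borrows C++'s unwinding machinery, the cost model is the same one a plain C++ try block has. Here is a minimal sketch (ordinary C++, not ObjC-specific) annotating where the cost actually sits:

#include <stdexcept>
#include <cstdio>

int parse(int input)
{
    try {                                     // zero-cost model: entering the try block
        if (input < 0)                        // emits no setjmp and no runtime call;
            throw std::runtime_error("neg");  // the price is paid only if this throw
        return input * 2;                     // executes and the unwind tables are walked
    } catch (const std::exception& e) {
        std::puts(e.what());
        return 0;
    }
}

On the 32-bit setjmp/longjmp model, by contrast, the try itself already costs a setjmp call and some bookkeeping.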
Related
I read about just-in-time compilation (JIT), and as I understand it, there are two approaches to this: an interpreter and a JIT, both of which process the bytecode at runtime.
Why not just compile all the bytecode to machine code up front, and only then start the process, with no further need for an interpreter?
Another reason for late JIT compiling has to do with optimization: at run time the VM can detect more (or different) patterns to optimize than the compiler ever could at compile time. JIT pre-compiling at startup would always have to be static, and the compiler could have done the same already; by analysing the actual run-time behaviour, the VM may have more information about possible optimizations and may therefore produce better results.
For example, the VM can detect that a single piece of code is actually run a million times at run-time and perform appropriate optimizations which the compiler may have no information about, not unlike the branch prediction that's done at runtime in modern CPUs.
More information can be found in the Wikipedia article on "Adaptive optimization".
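As a rough illustration (a hypothetical example, not tied to any particular VM), here is the kind of specialization an adaptive JIT can perform after observing that a virtual call in a hot loop always hits the same target, something a static compiler cannot safely assume:

struct Shape { virtual double area() const = 0; virtual ~Shape() {} };
struct Circle : Shape {
    double r;
    explicit Circle(double radius) : r(radius) {}
    double area() const { return 3.14159 * r * r; }
};

// What the bytecode logically does: a virtual call on every iteration.
double totalArea(const Shape* s, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += s->area();                 // dynamic dispatch each time
    return sum;
}

// What an adaptive JIT can effectively rewrite it into at run time:
// a guarded, devirtualized and inlined loop.
double totalAreaSpecialized(const Shape* s, int n)
{
    const Circle* c = dynamic_cast<const Circle*>(s);
    if (!c) return totalArea(s, n);       // guard: fall back if the assumption breaks
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += 3.14159 * c->r * c->r;     // no virtual dispatch, body inlined
    return sum;
}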
Simple: because it takes time to precompile everything to machine code, and users don't want to wait for the application to start. Remember, the precompilation would have to perform a lot of optimizations, which takes time.
The server version of the JVM is more aggressive in precompiling and optimizing code upfront, because server-side code tends to be executed more often and for a longer period before the process is shut down.
However, one solution (for .NET) is a tool called NGen, which performs the precompilation upfront so that it isn't needed after that point. You only have to run it once.
Not all VMs include an interpreter. For instance, Chrome and the CLR (.NET) always compile to machine code before running. However, they have multiple levels of optimization to reduce the startup time.
I found a link showing how runtime recompilation can optimize performance and save extra CPU cycles; a small before/after sketch of these transformations follows the list below.
Inline expansion: to decrease the cost of procedure calls.
Removing redundant loads: when two pieces of compiled code contain duplicate loads, the redundant one can be removed and further optimised by recompiling at run time.
Copy propagation
Eliminating dead code
Here is another link for the same explanation given above.
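As a small, hand-written illustration of what those passes amount to (hypothetical code, shown as a before/after transformation rather than anything a specific VM emits):

// Before optimization.
static int square(int x) { return x * x; }

int compute(int a)
{
    int b = a;            // candidate for copy propagation
    int unused = b + 42;  // dead code: 'unused' is never read
    return square(b);     // candidate for inline expansion
}

// What the recompiled code is effectively reduced to after inlining
// 'square', propagating 'a' into 'b' and dropping the dead statement:
int compute_optimized(int a)
{
    return a * a;
}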
For one of my programs I needed a scripting language to dynamically change the world (unit AI, world generation, etc.), so I wrote a compiler for a rather basic language (simple objects without inheritance, 1D arrays, 32-bit ints/floats, strings) which also uses reference counting for garbage collection. The compiler outputs stack-based bytecode.
My problem now is that my VM isn't efficient enough (it actually runs 15-30 times slower than unoptimised C). It's a really simple VM which implements decoding with a giant switch-case block.
The VM code looks like this:
switch (*ip++) {
case ADD:
    ...
    break;
case SUB:
    ...
    break;
}
So my question is whether it is possible to recompile my scripts to x86 assembly and execute them at runtime (I think that's what JIT compilers do). I googled a lot but I didn't find any code samples showing, for example, how to send x86 code to the processor. If anyone has links to tutorials that explain how to build better VMs I would be very happy.
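There is no special instruction for "sending x86 code to the processor": a JIT writes the machine-code bytes into memory that is mapped executable and then calls into it through a function pointer. A minimal POSIX/x86-64 sketch (the byte sequence below encodes "mov eax, 42; ret" and is only a stand-in for code you would generate from your bytecode):

#include <sys/mman.h>
#include <cstring>
#include <cstdio>

int main()
{
    // Machine code for: mov eax, 42 ; ret
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    // Map a page that is readable, writable and executable.
    void* mem = mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;

    std::memcpy(mem, code, sizeof code);

    // Jump into the generated code through a function pointer.
    int (*fn)() = reinterpret_cast<int (*)()>(mem);
    std::printf("%d\n", fn());   // prints 42

    munmap(mem, 4096);
    return 0;
}

A real JIT would normally map the page writable first, emit the code, then mprotect() it to read+execute (many systems refuse pages that are writable and executable at once), and would of course translate your bytecode into the byte stream instead of hard-coding it.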
I have a C project which was previously built with CodeSourcery's GNU toolchain. Recently it was converted to use RealView's armcc compiler, but the performance we are getting with the RealView tools is very poor compared to when it is compiled with the GNU tools. Shouldn't it be the opposite, i.e. shouldn't it give better performance when compiled with RealView's tools? What am I missing here? How can I improve the performance with RealView's tools?
Also, I have noticed that if I run the binary produced by the RealView tools with Lauterbach it crashes, but if I run it using RealView ICE it runs fine.
UPDATE 1
Realview Command line:
armcc -c --diag_style=ide --depend_format=unix_escaped --no_depend_system_headers --no_unaligned_access --c99 --arm_only --debug --gnu --cpu=ARM1136J-S --fpu=SoftVFP --apcs=/nointerwork -O3 -Otime
GNU GCC command line:
arm-none-eabi-gcc -mcpu=arm1136jf-s -mlittle-endian -msoft-float -O3 -Wall
I am using Realview Tools version 4.1 and GCC version 4.4.1
UPDATE 2
The Lauterbach issue has been solved. It was caused by semihosting, as the semihosting SWI was not being handled in the Lauterbach environment. Retargeting the C library to avoid semihosting did the trick, and now my program runs successfully with Lauterbach as well as RealView ICE. But the performance issue remains.
Since you have optimisations on, and the code crashes in some environments, it may be that your code relies on undefined behaviour or has some other latent error. Such behaviour can change with optimisation, or even break altogether.
I suggest that you try both toolchains without optimisation, make sure that the warning level is set high, and fix all the warnings. GCC is far better than armcc at error checking, so it serves as a reasonable static analysis check. If the code builds clean it is more likely to work and may be easier for the optimiser to handle.
Have you tried removing '--no_unaligned_access'? ARM11s can typically do unaligned accesses (if enabled in the startup code), and forcing the compiler/library not to use them may be slowing down your code.
The current version of RVCT says of '--fpu=SoftVFP':
In previous releases of RVCT, if you specified --fpu=softvfp and a CPU with implicit VFP hardware, the linker chose a library that implemented the software floating-point calls using VFP instructions. This is no longer the case. If you require this legacy behavior, use --fpu=softvfp+vfp.
This suggests to me that if you have an old version of RVCT, the behaviour will be to use software floating point regardless of the presence of hardware floating point, while in the GNU version -msoft-float will use hardware floating-point instructions when an FPU is available.
So what version of RVCT are you using?
Either way, I suggest that you remove the --fpu option, since the compiler will make an appropriate implicit selection based on the --cpu option. You also need to correct the CPU selection: your RVCT option says --cpu=ARM1136J-S, not ARM1136JF-S as you told GCC. This will no doubt prevent the compiler from generating VFP instructions, since you have told it there is no VFP.
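Under those assumptions, the RVCT invocation would look something like this (a sketch only; keep the rest of your project-specific options):

armcc -c --cpu=ARM1136JF-S --c99 --arm_only -O3 -Otime ...

With the correct core name the compiler knows VFP hardware is present, and without an explicit --fpu option it can pick the matching floating-point model itself.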
The same source code can produce dramatically different binaries due to factors like: different compilers (llvm vs gcc, gcc 4 vs gcc 3, etc.); different versions of the same compiler; different compiler options with the same compiler; optimization settings (on either compiler); and whether you compile for release or debug (whatever terms you want to use, the binaries are quite different). When going embedded, you add the complication of a bootloader or ROM monitor (debugger) and things like that, plus the host-side tools that talk to the ROM monitor or compiled-in debugger. ARM's compilers, despite being far better than gcc, were long infected with the assumption that the binaries would always be run on top of their ROM monitor. I seem to recall that by the time RVCT became their primary compiler that assumption was on its way out, but I have not really used their tools since then.
The bottom line is that there are a handful of major factors that affect the differences between binaries, and they can and will lead to a different experience. Assuming that you will get the same performance or results is a bad assumption; the expectation is that the results will differ. Likewise, within the same environment, you should be able to create binaries that give dramatically different performance results, all from the same source code.
Do you have compiler optimizations turned on in your CodeSourcery build, but not in the Realview build?
The GCC manual says:
-fobjc-direct-dispatch
Allow fast jumps to the message dispatcher. On Darwin this is accomplished via the comm page.
Can I assume this flag eliminates dynamic dispatch? How does it work?
I believe it should be as fast as a C function call if it is linked directly.
No, the dynamic dispatch is still there (calls still route through objc_msgSend), and this option currently makes no difference on x86(-64).
From http://developer.apple.com/legacy/mac/library/documentation/DeveloperTools/gcc-3.3/gcc/Objective_002dC-Dialect-Options.html:
For some functions (such as objc_msgSend) called very frequently by Objective-C programs, special entry points exist in high memory that may be jumped to directly (e.g., via the "bla" instruction on the PowerPC) for improved performance. The -fobjc-direct-dispatch option will cause such jumps to be generated. This option is only available in conjunction with the NeXT runtime; furthermore, programs built with the -fobjc-direct-dispatch option will only run on Mac OS X 10.4 (Tiger) or later systems.
I am working on a lock-free structure with the g++ compiler. It seems that with the -O1 switch, g++ will change the execution order of my code. How can I forbid g++'s optimization on certain parts of my code while keeping the optimization for the rest? I know I can split it into two files and link them, but that looks ugly.
If you find that gcc changes the order of execution in your code, you should consider using a memory barrier. Just don't assume that volatile variables will protect you from that issue. They only make sure that, within a single thread, the behavior is what the language guarantees, and that variables are always read from their memory location to account for changes "invisible" to the executing code (e.g. changes to a variable made by a signal handler).
GCC has supported OpenMP since version 4.2. You can use it to create a memory barrier with a special #pragma directive.
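For illustration (hypothetical shared variables, and the OpenMP variant assumes you compile with -fopenmp), a full barrier between a data write and a flag write could look like this:

// Hypothetical publisher: make sure 'data' is visible before 'ready'.
int data = 0;
volatile int ready = 0;

void publish(int value)
{
    data = value;
#pragma omp flush              // OpenMP memory flush (requires -fopenmp)
    // __sync_synchronize();   // GCC built-in full barrier, an alternative
    ready = 1;
}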
A very good insight into lock-free code is this PDF by Herb Sutter and Andrei Alexandrescu: C++ and the Perils of Double-Checked Locking.
You can use the function attribute __attribute__((optimize(0))) to set the optimization level for a single function, or #pragma GCC optimize for a block of code. These are only available from GCC 4.4, though, I think; check your GCC manual. If they aren't supported, separating the source is your only option.
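As a sketch of both forms (GCC 4.4 or later; the function names are made up):

// Disable optimization for one function only.
__attribute__((optimize(0)))
void sensitive_function(void)
{
    /* code whose generated ordering you want left alone */
}

// Or for a region of the file.
#pragma GCC push_options
#pragma GCC optimize ("O0")
void another_sensitive_function(void)
{
    /* ... */
}
#pragma GCC pop_options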
I would also say, though, that if your code fails with optimization turned on, it is most likely that your code is just wrong, especially as you're trying to do something that is fundamentally very difficult. The processor will potentially perform reordering on your code (within the limits of sequential consistency) so any re-ordering that you're getting with GCC could potentially occur anyway.