cpuid + rdtsc and out-of-order execution

cpuid is used as a serializing instruction to prevent out-of-order execution when benchmarking, since the benchmarked instructions might be reordered before rdtsc if rdtsc is used alone. My question is whether it is still possible for the instructions below rdtsc to be reordered in between cpuid and rdtsc. Since rdtsc is not a serializing instruction, can instructions still be freely reordered around it?

Since RDTSC does not depend on any input (it takes no arguments), in principle the out-of-order pipeline will run it as soon as it can. The reason you add a serializing instruction before it is precisely to keep RDTSC from executing earlier than intended.
There is an answer from John McCalpin here; you might find it useful. He explains the out-of-order reordering behaviour for the RDTSCP instruction (which behaves differently from RDTSC), which you may prefer to use instead.
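For reference, here is a minimal sketch of the commonly cited fenced measurement pattern (CPUID before RDTSC at the start, RDTSCP followed by CPUID at the end), written for GCC/Clang on x86-64. The helper names and the measured workload are illustrative, not part of any particular library.

    #include <stdint.h>
    #include <stdio.h>

    /* Start timestamp: CPUID serializes so earlier instructions cannot be
     * reordered past the RDTSC that follows it. */
    static inline uint64_t tsc_start(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("cpuid\n\t"
                             "rdtsc\n\t"
                             "mov %%edx, %0\n\t"
                             "mov %%eax, %1\n\t"
                             : "=r"(hi), "=r"(lo)
                             :
                             : "%rax", "%rbx", "%rcx", "%rdx");
        return ((uint64_t)hi << 32) | lo;
    }

    /* Stop timestamp: RDTSCP waits for preceding instructions to finish,
     * and the trailing CPUID keeps later instructions from being pulled
     * up into the measured region. */
    static inline uint64_t tsc_stop(void)
    {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtscp\n\t"
                             "mov %%edx, %0\n\t"
                             "mov %%eax, %1\n\t"
                             "cpuid\n\t"
                             : "=r"(hi), "=r"(lo)
                             :
                             : "%rax", "%rbx", "%rcx", "%rdx");
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void)
    {
        uint64_t t0 = tsc_start();
        /* ... code under test goes here ... */
        uint64_t t1 = tsc_stop();
        printf("elapsed TSC ticks: %llu\n", (unsigned long long)(t1 - t0));
        return 0;
    }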

Related

Python's support for multi-threading

I heard that Python still has the global interpreter lock issue. As a result, thread execution in Python is not truly parallel.
What are the possible solutions to overcome this problem?
I am using python 2.7.3
For understanding Python's GIL, I would recommend this link: http://www.dabeaz.com/python/UnderstandingGIL.pdf
From the Python wiki:
The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations. Note that potentially blocking or long-running operations, such as I/O, image processing, and NumPy number crunching, happen outside the GIL. Therefore it is only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck.
There are discussions about eliminating the GIL, but I guess it has not been achieved yet. If you really want to achieve multi-threading for your custom code, you can also switch to Java.
Do see if that helps.

How do GPUs handle indirect branches

My understanding of GPUs is that they handle branches by executing all paths while suspending instances that are not supposed to execute a given path. This works well for if/then/else constructs and loops (instances that have terminated the loop can be suspended until all instances are suspended).
This flat out does not work if the branch is indirect. But modern GPUs (Fermi and beyond for nVidia, not sure when it appeared for AMD, R600?) claim to support indirect branches (function pointers, virtual dispatch, ...).
The question is: what kind of magic is going on in the chip to make this happen?
According to the CUDA programming guide there are some strong restrictions on virtual functions and dynamic dispatching.
See http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#functions for more information. Another interesting article about how code is mapped to the GPU hardware is http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html .
GPUs are like CPUs in regards to indirect branches. They both have an IP (instruction pointer) that points to physical memory. This IP is incremented for each hardware instruction that gets executed. An indirect branch just sets the IP to the new location. How this is done is a little bit more complicated. I will use PTX for Nvidia and GCN Assembly for AMD.
An AMD GCN GPU can have its IP simply set from a register. Example: "s_setpc_b64 s[8:9]", which loads the IP from a pair of scalar registers (s_branch itself takes only an immediate offset). The IP can be set to any value. In fact, on an AMD GPU, it's possible to write to program memory in a kernel and then set the IP to execute it (self-modifying code).
On NVidia's PTX there is no indirect jump. I have been waiting for real hardware indirect branch support since 2009. The most current version of the PTX ISA, 4.3, still does not have indirect branching. In the current PTX ISA manual, http://docs.nvidia.com/cuda/parallel-thread-execution, it still says that "Indirect branch is currently unimplemented".
However, "indirect calls" are supported via jump tables. These are slightly different then indirect branches but do the same thing. I did some testing with jump tables in the past and the performance was not great. I believe the way this works is that the kernel is lunched with a table of already known call locations. Then when it runs across a "call %r10(params)" (something like that) it saves the current IP and then references the jump table by an index and then sets the IP to that address. I'm not 100% sure but its something like that.
Like you said, besides branching, both AMD and NVidia GPUs also allow instructions to be executed but ignored: the hardware executes them but does not write the output. This is another way of handling an if/then/else, as some cores are ignored while others run. This does not really have much to do with branching; it is a trick to avoid time-consuming branches. Some CPUs, like the Intel Itanium, also do this.
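A tiny CPU-side analogue of this "execute but ignore the result" predication, purely for illustration, is a branch-free select where both sides are computed and a mask picks the result:

    /* Branch-free select: evaluate both values, keep one via a mask. */
    int select_branchless(int cond, int a, int b)
    {
        int mask = -(cond != 0);          /* all ones if cond, else zero */
        return (a & mask) | (b & ~mask);
    }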
You can also try searching under these other names: indirect calls, indirect branches, dynamic branching, virtual functions, function pointers or jump tables.
Hope this helps. Sorry I went on so long.

GCC -mthumb against -marm

I am working on performance optimizations of ARM C/C++ code, compiled with GCC. CPU is Tegra 3.
As far as I know, the -mthumb flag means generating the old 16-bit Thumb instructions. In different tests, I see a 10-15% performance increase with -marm over -mthumb.
Is -mthumb used only for compatibility, while -marm is generally better for performance?
I am asking because android-cmake uses -mthumb in Release mode and -marm in Debug. This is very confusing to me.
Thumb is not the older instruction set but in fact the newer one. The current revision is Thumb-2, a mixed 16/32-bit instruction set. The original Thumb-1 instruction set was a compressed version of the ARM instruction set: the CPU would fetch the instruction, decompress it into ARM and then process it. These days (ARMv7 and above), Thumb-2 is preferred for everything but performance-critical or system code. For example, GCC will by default generate Thumb-2 for ARMv7 (like your Tegra 3), as the higher code density provided by the 16/32-bit ISA allows for better icache utilization. But this is something that is very hard to measure in a normal benchmark, because most benchmarks will fit into the L1 icache anyway.
For more information check the Wikipedia site: http://en.wikipedia.org/wiki/ARM_architecture#Thumb
ARM instructions are 32 bits wide, so they have more bits to do more in a single instruction, while Thumb, with only 16 bits, might have to split the same functionality across two instructions. Based on the assumption that non-memory instructions took more or less the same time, fewer instructions meant faster code. There were also some things that just couldn't be done in Thumb code.
The idea was then that ARM would be used for performance-critical functionality, while Thumb (which fits two instructions into a 32-bit word) would be used to minimize the storage space of programs.
As CPU memory caching became more critical, having more instructions in the icache became a bigger determinant of speed than functional density per instruction. This meant that Thumb code became faster than the equivalent ARM code. ARM (the company) therefore created Thumb-2, a variable-length encoding that incorporates most ARM functionality. Thumb-2 should in most cases give you denser as well as faster code due to better caching.
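If you want to mix the two, one possible approach (a sketch only; the per-function target attribute for ARM is accepted only by newer GCC releases, so check your toolchain before relying on it) is to keep the default -mthumb for density and force selected hot functions to ARM state:

    /* Hot routine forced to 32-bit ARM state; the rest of the file keeps
     * the -mthumb default.  Function names are illustrative. */
    __attribute__((target("arm")))
    long hot_sum(const int *v, long n)
    {
        long s = 0;
        for (long i = 0; i < n; ++i)
            s += v[i];
        return s;
    }

    /* Ordinary code compiled as Thumb-2 for better icache density. */
    long cold_sum(const int *v, long n)
    {
        long s = 0;
        for (long i = 0; i < n; ++i)
            s += v[i];
        return s;
    }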

Optimisation, Compilers and Their Effects

(i) If a program is optimised for one CPU class (e.g. multi-core Core i7) by compiling the code on that CPU, will its performance be sub-optimal on CPUs from older generations (e.g. Pentium 4)? Could the optimization prove harmful to performance on other CPUs?
(ii) For optimization, compilers may use x86 extensions (like SSE4) which are not available on older CPUs. Is there a fall-back to some non-extension-based routine on older CPUs?
(iii) Is the Intel C++ Compiler more optimizing than the Visual C++ Compiler or GCC?
(iv) Will a truly multi-core threaded application perform efficiently on older CPUs (like Pentium III or 4)?
Compiling on a platform does not mean optimizing for that platform (maybe it's just bad wording in your question).
In all compilers I've used, optimizing for platform X does not affect the instruction set, only how it is used, e.g. optimizing for i7 does not enable SSE2 instructions.
Also, optimizers in most cases avoid "pessimizing" non-optimized platforms, e.g. when optimizing for i7, typically a small improvement on i7 will not be chosen if it means a major hit for another common platform.
It also depends on the performance differences between the instruction sets - my impression is that they've become much smaller in the last decade (but I haven't delved too deep lately - I might be wrong for the latest generations). Also consider that optimizations make a notable difference only in a few places.
To illustrate possible options for an optimizer, consider the following methods to implement a switch statement:
a sequence of if (x==c) goto label tests
range check and jump table
binary search
combination of the above
the "best" algorithm depends on the relative cost of comparisons, jumps by fixed offsets and jumps to an address read from memory. They don't differ much on modern platforms, but even small differences can create a preference for one or other implementation.
It is probably true that optimising code for execution on CPU X will make that code less optimal on CPU Y than the same code optimised for execution on CPU Y. Probably.
Probably not.
Impossible to generalise. You have to test your code and come to your own conclusions.
Probably not.
For every argument about why X should be faster than Y under some set of conditions (choice of compiler, choice of CPU, choice of optimisation flags for compilation) some clever SOer will find a counter-argument, for every example a counter-example. When the rubber meets the road the only recourse you have is to test and measure. If you want to know whether compiler X is 'better' than compiler Y first define what you mean by better, then run a lot of experiments, then analyse the results.
I) If you did not tell the compiler which CPU type to favor, the odds are that it will be slightly sub-optimal on all CPUs. On the other hand, if you let the compiler know to optimize for your specific type of CPU, then it can definitely be sub-optimal on other CPU types.
II) No (for Intel and MS at least). If you tell the compiler to compile with SSE4, it will feel safe using SSE4 anywhere in the code without testing. It becomes your responsibility to ensure that your platform is capable of executing SSE4 instructions, otherwise your program will crash. You might want to compile two libraries and load the proper one (a sketch of such a runtime check follows this answer). An alternative to compiling for SSE4 (or any other instruction set) is to use intrinsics; these will check internally for the best-performing set of instructions (at the cost of a slight overhead). Note that I am not talking about instruction intrinsics here (those are specific to an instruction set), but intrinsic functions.
III) That is a whole other discussion in itself. It changes with every version, and may be different for different programs. So the only solution here is to test. Just a note, though: Intel compilers are known not to compile well for running on anything other than Intel (e.g. intrinsic functions may not recognize the instruction set of an AMD or Via CPU).
IV) If we ignore the on-die efficiencies of newer CPUs and the obvious architecture differences, then yes, it may perform as well on an older CPU. Multi-core processing is not dependent per se on the CPU type. But the performance is VERY dependent on the machine architecture (e.g. memory bandwidth, NUMA, chip-to-chip bus) and on differences in the multi-core communication (e.g. cache coherency, bus locking mechanism, shared cache). All this makes it impossible to compare newer and older CPU efficiencies in MP, but that is not what you are asking, I believe. So on the whole, an MP program made for newer CPUs should not use the MP aspects of older CPUs less efficiently. Or in other words, just tweaking the MP aspects of a program specifically for an older CPU will not do much. Obviously you could rewrite your algorithm to use a specific CPU more efficiently (e.g. a shared cache may permit you to use an algorithm that exchanges more data between working threads, but this algorithm will die on a system with no shared cache, full bus locks and poor memory latency/bandwidth), but it involves a lot more than just MP-related tweaks.
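To make point II concrete, here is a minimal sketch of a runtime fall-back (GCC/Clang on x86; the function names are made up, and the SSE4.1 path would live in a separate translation unit compiled with -msse4.1):

    #include <stddef.h>

    int sum_scalar(const int *v, size_t n);   /* built without -msse4.1 */
    int sum_sse41(const int *v, size_t n);    /* built with -msse4.1    */

    /* Pick the implementation based on what the CPU actually supports. */
    int sum_dispatch(const int *v, size_t n)
    {
        if (__builtin_cpu_supports("sse4.1"))
            return sum_sse41(v, n);
        return sum_scalar(v, n);
    }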
(1) Not only is it possible, but it has been documented on pretty much every generation of x86 processor. Go back to the 8088 and work your way forward, every generation. Clock for clock, the newer processor was slower for the then-current mainstream applications and operating systems (including Linux). The 32-to-64-bit transition is not helping, and more cores with lower clock speeds are making it even worse. And this is true going backward as well, for the same reason.
(2) Bank on your binaries failing or crashing. Sometimes you get lucky; most of the time you don't. There are new instructions, yes, and supporting them would probably mean trapping on an undefined instruction and running a software emulation of that instruction, which would be horribly slow, and the lack of demand for it means it is probably not well done or just not there. Optimization can use new instructions, but more than that, the bulk of the optimization I am guessing you are talking about has to do with reordering the instructions so that the various pipelines do not stall. So if you arrange them to be fast on one generation of processor, they will be slower on another, because in the x86 family the cores change too much. AMD had a good run there for a while, as they would make the same code just run faster instead of trying to invent new processors that would eventually be faster once the software caught up. That is no longer true; both AMD and Intel are struggling to just keep chips running without crashing.
(3) Generally, yes. For example, gcc is a horrible compiler: one size fits all fits no one well, and it can never and will never be any good at optimizing. For example, gcc 4.x code is slower than gcc 3.x code for the same processor (yes, all of this is subjective; it all depends on the specific application being compiled). The in-house compilers I have used were leaps and bounds ahead of the cheap or free ones (I am not limiting myself to x86 here). Are they worth the price, though? That is the question.
In general, because of the horrible new programming languages and gobs of memory, storage and layers of caching, software engineering skills are at an all-time low. This means the pool of engineers capable of making a good compiler, much less a good optimizing compiler, decreases with time, and this has been going on for at least 10 years. So even the in-house compilers are degrading with time, or their employers simply have their engineers work on and contribute to the open-source tools instead of keeping an in-house tool. Also, the tools the hardware engineers use are degrading for the same reason, so we now have processors that we hope will just run without crashing, rather than ones we try to optimize for. There are so many bugs and chip variations that most of the compiler work is avoiding the bugs. Bottom line: gcc has single-handedly destroyed the compiler world.
(4) See (2) above. Don't bank on it. The operating system you want to run this on will likely not install on the older processor anyway, saving you the pain. For the same reason that binaries optimized for your Pentium III ran slower on your Pentium 4, and vice versa, code written to work well on multi-core processors will run slower on single-core processors than if you had optimized the same application for a single-core processor.
The root of the problem is that the x86 instruction set is dreadful. So many far superior instruction sets have come along that do not require hardware tricks to make them faster every generation. But the Wintel machine created two monopolies, and the others couldn't penetrate the market. My friends keep reminding me that these x86 machines are microcoded, so that you really don't see the instruction set inside. Which angers me even more: the horrible ISA is just an interpretation layer. It is kind of like using Java. The problems you have outlined in your questions will continue as long as Intel stays on top; if the replacement does not become the monopoly, then we will be stuck forever in the Java model, where you are on one side or the other of a common denominator: either you emulate the common platform on your specific hardware, or you are writing apps and compiling to the common platform.

Is "the optimized delay" a myth or is it real?

From time to time you hear stories that are meant to illustrate how good someone is at something, and sometimes you hear about the guy who is so into code optimization that he optimizes his delay loop.
Since this really sounds like a strange thing to do (it's much better to start a timer interrupt than an optimized busy-wait), and since nobody ever tends to tell you the name of the optimizing hacker, it has left me wondering whether it is an urban myth or whether it is real.
What do you say, reality or fiction?
Thanks
Johan
Update: It sounds like ShuggyCoUk was on to something; I wonder if we can find an example.
Update: Just a little clarification, this question is about the "delay" function itself and how it is implemented, not how and where you call it.
And what its purpose was, and how that system became better.
Update: It's no myth, those guys seem to exist
Thanks
ShuggyCoUk
This has more than a kernel of truth about it...
A spin wait can be much better than a signal-based interrupt or a yield:
You trade some throughput for much reduced latency. Often this is vitally important within an OS itself.
You allow yourself the freedom to do operations not possible within an interrupt handler, memory allocation for example.
You can get considerably finer-grained control of the interval waited, since you can essentially measure the cycle count.
However, spin waits are tricky to get right.
If you can, you should use proper idle instructions, which:
can power down parts of the core, improving power usage/heat dissipation and even allowing other cores to go faster;
on Hyper-Threading-based CPUs allow the other logical thread to use the full CPU pipeline while you spin.
An instruction you might think was a no-op could cause the CPU to execute instructions out of order via the superscalar execution units; the resulting code may show unforeseen out-of-order artefacts which force the CPU to apply a great deal of effort in terms of stalls and memory barriers, which are unwanted.
This is why, in most cases, you let someone else write the spin-wait loop for you (a minimal sketch appears at the end of this answer).
In Linux there is the cpu_relax macro
on arm this is barrier()
on x86 this is rep_nop()
In Windows there is YieldProcessor
Accessible in .Net via Thread.SpinWait
OS X eschews providing a standard implementation unless you are in the kernel
see this document and note that it encourages the use only of lck_spin_t
As to some citations of using PAUSE for spin waits:
PostgreSQL
Linux
See also the note that this is better on non-P4 processors as well, due to reduced power use.
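For completeness, a minimal bounded spin-wait sketch in C (GCC/Clang on x86). _mm_pause() maps to the PAUSE instruction discussed above; the spin bound and the fall-back to sched_yield() are illustrative choices, not any particular library's policy:

    #include <immintrin.h>   /* _mm_pause() -> PAUSE                      */
    #include <sched.h>       /* sched_yield()                             */
    #include <stdatomic.h>

    /* Spin until *flag becomes non-zero.  The bound of 10000 iterations
     * before yielding is an arbitrary, illustrative choice. */
    void spin_until_set(atomic_int *flag)
    {
        int spins = 0;
        while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
            _mm_pause();                 /* spin-loop hint to the CPU     */
            if (++spins > 10000) {
                sched_yield();           /* give the core back to the OS  */
                spins = 0;
            }
        }
    }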
The version I've always heard is of a group of hardware programmers who developed a special instruction that optimised the idle (not busy) loop of their operating system. This is mentioned in Kernighan & Pike's book The Practice of Programming, but even there they admit it may be an urban myth.
I've heard stories of programmers who intentionally put in long delay loops early in projects and removed them later as "optimizations" to impress management. Never figured out if the stories were apocryphal or not.