Best-case instruction throughput on ARM NEON

What is the best-case instruction throughput for a compute-bound algorithm coded in ARM-NEON?
For example, if I have a simple algorithm based on a large number of 8-bit->8-bit operations, what is the fastest possible execution speed (measured in 8-bit operations per cycle) that could be sustained, assuming full latency hiding of any memory I/O?
I am initially interested in Cortex-A8, but if you also have data for different processors, please note the differences.

As nobar mentioned, this will vary depending on the micro-architecture (Samsung, Apple, Qualcomm, etc.). But basically, on a stock A8 implementation, NEON is a 64-bit architecture that takes one or two 64-bit operands and gives a 64-bit result. So without any pipeline (data dependency) stalls or I/O stalls, an integer pipeline can do eight 8-bit operations per cycle in SIMD fashion. The best case on stock ARM processors that are single-issue for ALU/multiply operations is therefore probably "8".
You can look at the ARM architecture reference for an idea of how long various instructions take on stock ARM A8 processors. If you aren't familiar with the nomenclature, "D" registers are 64 bits wide, "Q" registers are double-wide 128-bit registers, and instructions can treat the data in the registers as 8-, 16-, or 32-bit elements.
A nice overview of the stock A8 architecture is TI's A8 NEON Architecture page.
As for the differences between processors, a lot of ARM implementers don't make their architecture details known except to extremely powerful customers, so noting the differences is fairly difficult. But as Stephen Canon notes below, the newer, higher-end A15-class cores will probably double the performance for some types of instructions, and the lower-power ones will probably halve it for some types of instructions.

Most integer operations on Cortex-A8's NEON unit are executed 128 bits at a time, not 64 bits. You can see the throughput in the TRM, found here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/index.html Some notable exceptions include multiplications, shifts by register value, and bit selects. But if you think about it, if there weren't 128-bit integer operations there'd be a lot less reason to use these instructions, since Cortex-A8 can already execute two 32-bit scalar integer operations in parallel.
Sadly, Cortex-A8 and A9 were the last ARM cores to include public documentation of execution performance. I haven't done extensive testing, but I think A15 can execute a 128-bit and a 64-bit NEON operation in parallel (not sure what restrictions there are). And from what I've heard in passing (this is totally untested), both Cortex-A5 and A7 have 64-bit NEON execution. A5 is further limited by only having 32-bit NEON load/store throughput (while A8 actually has 128-bit, and A9 and A7 have 64-bit).
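The two answers above disagree on the execution width (64 bits versus 128 bits), and the peak figure follows directly from whichever width applies. A minimal sketch of that arithmetic, assuming one SIMD instruction issued per cycle and no stalls (the function name and the `issue_width` parameter are illustrative, not taken from any ARM document):

```python
def peak_ops_per_cycle(datapath_bits: int, element_bits: int, issue_width: int = 1) -> int:
    """Upper bound on element operations completed per cycle.

    Assumes `issue_width` SIMD instructions start every cycle and that
    nothing ever stalls, which real code will rarely achieve.
    """
    if datapath_bits % element_bits != 0:
        raise ValueError("element width must divide the datapath width")
    return (datapath_bits // element_bits) * issue_width


# 64-bit execution, as assumed in the first answer
print(peak_ops_per_cycle(64, 8))    # 8
# 128-bit execution, as stated for most Cortex-A8 integer operations
print(peak_ops_per_cycle(128, 8))   # 16
```

Under the 128-bit reading, the best case for 8-bit operations on Cortex-A8 would be 16 per cycle rather than 8, except for the instruction classes listed as exceptions.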

Related

Skills needed in 8-bit, 16-bit, 32-bit

Are there any specific skillsets required with 8-bit, 16-bit and 32-bit processing for embedded developers?
Yes, there are specific skills expected, and there are differences between 8-bit and 32-bit processors. (Ignoring 16-bit, since there are so few of them available.)
8-bit processors and tools are vastly different from the 32-bit variants (even excluding Linux-based systems):
Processor architecture
Memory availability
Peripheral complexity
An 8051 is a strange beast, and plopping your average CS graduate in front of one and asking them to make a product is asking for something that only mostly works. Its multiple memory spaces, lack of stack, constrained register file, and constrained memory really make "modern" computer science difficult.
Even an AVR, which is less of a strange beast, still has constraints that a 32-bit processor just doesn't have, particularly memory.
And all of these are very different from writing code on an embedded Linux platform.
In general, processors and microcontrollers using a 32-bit architecture tend to be more complex and are used in more complex applications. As such, someone with only 8-bit device experience may not possess the skills or experience necessary for more complex projects.
So it is not specifically the bit-width that is the issue; it is used simply as a shorthand or proxy for the complexity of systems. It is a very crude measure in any event, since architectures differ widely even within a bit-width classification; AVR, PIC and x51, for example, are very different, as are 68K, ARM and x86. Even within the ARM family, a Cortex-M device is very different from an A-class device.
Beware of any job spec that uses such broad skill classifications - something for you to challenge perhaps in the interview.

Are modern GPUs considered to be RISC based or CISC based?

I'm trying to figure out if modern GPUs have a reduced instruction set, or a complex instruction set.
Wikipedia says that it's not the size of the instruction set, rather how many cycles it takes to complete an instruction.
In RISC processors, each instruction can be completed in one cycle.
In CISC processors, it takes several cycles to complete some instructions.
I'm trying to figure out what the case is for modern GPUs.
If you mean Nvidia, then it's clearly RISC: most of its GPUs don't even have integer division and modulo operations in hardware. Only shifts, bitwise operations and three arithmetic operations (addition, subtraction, multiplication) are used to implement those two. I can't find an example, but this question (modular arithmetic on the gpu) shows that mod uses a procedure that implements a fairly sophisticated algorithm (about 50 instructions or even more).
Even PTX, the language of NVVM (the Nvidia virtual machine), has more operations, some of which are broken down into a bunch of simpler operations anyway after conversion to one of the native languages (there are different versions of these languages because of the nature of GPUs and their generations/families, but they are all just called SASS).
You can see all the available operations here, along with a description of each, though the descriptions are very short and not very clear (especially if you don't have a background in machine-level programming, such as knowing that "scaled" means 1 left-shifted by the operand, just as in x86's "FSCALE" or "scale factor"):
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#instruction-set-ref
If you mean AMDGPU, then there are a lot of instructions and it's not so clear, because some sources say that they switched from VLIW to something else just when the Southern Islands GPUs were released.
RISC instruction set: the load/store unit is independent of the other units, so specific instructions are used for loading and storing.
CISC instruction set: the load/store unit is embedded in the instruction execution routine. The instruction is therefore more complex than a RISC instruction, because besides its operation a CISC instruction also performs the load and store stages, and this requires more transistor logic per instruction.
The goal of CISC was to take common coding patterns and accelerate them in hardware. You see this in the constant extensions to the base architecture. See Intel's MMX and SSE, and AMD's 3DNow!, etc. https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions This also makes for good marketing, as you need to upgrade to the new processor to accelerate the newest common tasks, and keeps coders busy constantly translating their code patterns to the new extensions.
The goal of RISC was the opposite. It tried to perform few base functions as fast as possible. The coder then needs to continue to break down their common coding tasks to those simple instructions (although high-level programming languages and code packages/libraries accomplish this for you). RISC continues to survive as the architecture for ARM processors. See: https://en.wikipedia.org/wiki/Reduced_instruction_set_computer
I note that GPUs are similar to the RISC philosophy, in that the goal is to perform as many relatively simple computations as fast as possible. The move toward deep learning created a need for training millions of relatively simple parameters, hence the move back toward a highly parallel, relatively simple architecture. Having both philosophies implemented inside your computer is the best of both worlds.

Performance Differences between evaluation boards [closed]

Our company is the proud owner of an STM32F4 evaluation board (Cortex-M4F).
We have received another evaluation board, based on an ARM7TDMI.
Before starting the migration to the ARM7 evaluation board, we want to know whether the hardware is powerful enough for us, so we won't waste time discovering it later.
Our project uses many DSP algorithms (that take advantage of the FPU), makes heavy use of SDIO, and needs around 1 megabyte of memory.
So I was thinking of running the following tests on both evaluation boards to see the performance differences between them:
Math: addition, subtraction, division, multiplication, abs and sqrtf, run in a loop (only floating-point numbers will be used).
SDIO: read/write a 2-kilobyte buffer in a loop.
Memory: read/write to the external and internal RAM in a loop.
In your opinion, will these results give us any indication of the performance differences, and of what to expect from the "real" project?
Thanks
Michael
I would advise against any new design based on ARM7; it is a legacy ARM architecture. You should check the vendor's part status and planned obsolescence for any part you intend to design in. No vendor is releasing new designs based on ARM7.
I would also suggest that for DSP algorithms, the DSP features of the Cortex-M4 are more important than its floating point. The ARM Cortex-M CMSIS includes a DSP library that takes advantage of this. Either way fixed-point DSP algorithms will be far more efficient than using floating point.
Cortex-M is a far more efficient design than ARM7, achieving 1.2 DMIPS per MHz compared to less than 1.0 DMIPS per MHz. That, coupled with DSP instructions, floating point, and separate buses for on-chip flash, RAM and peripherals, makes most code significantly faster on Cortex-M.
The Cortex-M architecture defines the SysTick timer and the interrupt controller, whereas on ARM7 these are defined by the chip vendor and vary between vendors, making porting of code between them more difficult.
STM32F4xx parts run at up to 180 MHz; most ARM7 parts are 60 MHz or less.
Performing a comparison using floating point is almost pointless. Floating point hardware will easily outperform software floating point necessary on ARM7 by a factor of 5 to 10 at least. Unless your application can cope with that drop in performance, it is unsuited to ARM7. However, most applications do not need floating point. Integer or fixed point algorithms can run around 5 times faster than software floating point, so compete with hardware floating point. Remember also that the Cortex-M4 FPU is single precision only.
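To make the fixed-point alternative concrete, here is a minimal sketch of Q15 arithmetic, a common 16-bit fixed-point format in which 1.0 is represented as 32768. The helper names are made up for this illustration and are not CMSIS functions; on a real target this would be written in C and would map onto integer multiply and shift instructions.

```python
Q15_ONE = 1 << 15  # scale factor: 1.0 is represented as 32768


def saturate_q15(value: int) -> int:
    """Clamp an integer to the signed 16-bit Q15 range."""
    return max(-Q15_ONE, min(Q15_ONE - 1, value))


def to_q15(x: float) -> int:
    """Convert a float in roughly [-1, 1) to a saturated Q15 integer."""
    return saturate_q15(int(round(x * Q15_ONE)))


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values with rounding and saturation.

    The raw product is a Q30 value; adding 2**14 rounds to nearest
    before the shift by 15 brings it back to Q15.
    """
    return saturate_q15((a * b + (1 << 14)) >> 15)


def from_q15(q: int) -> float:
    """Convert a Q15 integer back to a float for inspection."""
    return q / Q15_ONE


half = to_q15(0.5)
print(from_q15(q15_mul(half, half)))  # 0.25
```

The only operations involved are an integer multiply, an add and a shift, which is why such code does not depend on floating-point hardware.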
It would be more reasonable to be comparing Cortex-M3 with Cortex-M4 to test the sensitivity of your application to lack of hardware FP and DSP support.
SDIO performance will be limited by the SDIO interface and the SD card itself (which vary widely in performance even at the same "speed rating") - the load imposed on the processor itself will be very low, or it will spend most of its time waiting for data if your application busy-waits rather than doing something useful while waiting on the SD card. The use of DMA transfers can make the CPU load more-or-less negligible.
The following diagram illustrates how ARM7 is positioned compared with Cortex-M4. The latter has both higher performance and greater capability. At the same clock frequency, Cortex-M4 sits between ARM9 and ARM11 on the performance scale.
I do not think that you need to perform any benchmark tests comparing ARM7 and Cortex M4 since the broad performance figures are already available. What you could perhaps do is measure the CPU load of your existing application on its current platform. If it is low (perhaps < 20%) and it spends most of its time idle, then ARM7 might be feasible. Of course if your application is not running on an RTOS or scheduler with an idle task, then measuring true CPU load might be difficult.
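As a rough illustration of that reasoning, the sketch below turns an idle-time measurement into a CPU load and scales it by the clock and DMIPS/MHz figures quoted earlier (180 MHz at 1.2 DMIPS/MHz versus 60 MHz at an optimistic 1.0 DMIPS/MHz). The tick counts are invented, and the scaling ignores the loss of the FPU and DSP instructions, so the real figure on ARM7 would be worse.

```python
def cpu_load(idle_ticks: int, total_ticks: int) -> float:
    """Fraction of time the CPU was busy, from idle-task tick counts."""
    if total_ticks <= 0:
        raise ValueError("total_ticks must be positive")
    return 1.0 - idle_ticks / total_ticks


def projected_load(load: float,
                   src_mhz: float, src_dmips_per_mhz: float,
                   dst_mhz: float, dst_dmips_per_mhz: float) -> float:
    """Scale a measured load by the ratio of raw DMIPS ratings.

    This is a crude first-order estimate: it assumes the workload
    scales with DMIPS and uses no hardware the target lacks.
    """
    return load * (src_mhz * src_dmips_per_mhz) / (dst_mhz * dst_dmips_per_mhz)


# Hypothetical measurement: idle for 800 of 1000 ticks on the STM32F4.
measured = cpu_load(idle_ticks=800, total_ticks=1000)
estimate = projected_load(measured, 180, 1.2, 60, 1.0)
print(f"measured load {measured:.0%}, projected ARM7 load {estimate:.0%}")
```

With these assumed inputs, a 20% load on the Cortex-M4 projects to roughly 72% on the ARM7 before any floating-point penalty is counted, which is why a low measured load is the precondition for ARM7 being feasible at all.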
I would have thought that the M4F would be a far more capable part than the venerable 7TDMI processor. I have not used an ARM7 with floating point coprocessor and would expect that as you are wanting to do floating point DSP the M4F would be more suited to your application.
Having the floating point in hardware will speed your processing and may allow power savings to be made by slowing the processor clock.
I would be reluctant to start a new design based on a version of the ARM that is at least 10 years old.

Optimisation, Compilers and Its Effects

(i) If a program is optimised for one CPU class (e.g. a multi-core Core i7) by compiling the code on that CPU, will its performance be sub-optimal on CPUs from older generations (e.g. Pentium 4)? That is, can optimizing prove harmful for performance on other CPUs?
(ii) For optimization, compilers may use x86 extensions (like SSE 4) which are not available in older CPUs. Is there a fall-back to some routine that doesn't use the extensions on older CPUs?
(iii) Is the Intel C++ Compiler more optimizing than the Visual C++ Compiler or GCC?
(iv) Will a truly multi-core threaded application perform efficiently on older CPUs (like Pentium III or 4)?
Compiling on a platform does not mean optimizing for this platform. (maybe it's just bad wording in your question.)
In all compilers I've used, optimizing for platform X does not affect the instruction set, only how it is used, e.g. optimizing for i7 does not enable SSE2 instructions.
Also, optimizers in most cases avoid "pessimizing" non-optimized platforms, e.g. when optimizing for i7, typically a small improvement on i7 will not be chosen if it means a major hit for another common platform.
It also depends on the performance differences in the instruction sets. My impression is that they've become much smaller in the last decade (but I haven't delved too deep lately, so I might be wrong for the latest generations). Also consider that optimizations make a notable difference only in a few places.
To illustrate possible options for an optimizer, consider the following methods to implement a switch statement:
sequence if (x==c) goto label
range check and jump table
binary search
combination of the above
The "best" algorithm depends on the relative cost of comparisons, jumps by fixed offsets and jumps to an address read from memory. They don't differ much on modern platforms, but even small differences can create a preference for one or the other implementation.
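The strategies listed above can be sketched as follows. This only models the control flow a compiler might emit, with strings standing in for jump targets; the case values and function names are made up for the example.

```python
from bisect import bisect_left

CASES = [1, 2, 3, 4, 10, 20, 300]          # case constants, sorted
LABELS = {c: f"case_{c}" for c in CASES}   # stand-ins for jump targets
DEFAULT = "default"

# The dense run 1..4 is the part that suits a jump table.
TABLE_BASE = 1
JUMP_TABLE = ["case_1", "case_2", "case_3", "case_4"]


def dispatch_compare_chain(x: int) -> str:
    """Sequence of `if (x == c) goto label` tests: one compare per case."""
    for c in CASES:
        if x == c:
            return LABELS[c]
    return DEFAULT


def dispatch_jump_table(x: int) -> str:
    """Range check, then an indexed jump. Covers only the dense cases."""
    index = x - TABLE_BASE
    if 0 <= index < len(JUMP_TABLE):
        return JUMP_TABLE[index]
    return DEFAULT


def dispatch_binary_search(x: int) -> str:
    """Binary search over the sorted case constants."""
    i = bisect_left(CASES, x)
    if i < len(CASES) and CASES[i] == x:
        return LABELS[x]
    return DEFAULT


def dispatch_combined(x: int) -> str:
    """Jump table for the dense run, binary search for the sparse rest."""
    index = x - TABLE_BASE
    if 0 <= index < len(JUMP_TABLE):
        return JUMP_TABLE[index]
    return dispatch_binary_search(x)


print(dispatch_combined(3), dispatch_combined(300), dispatch_combined(7))
```

The compare chain costs up to one comparison per case, the table costs one range check plus a load-and-jump, and the binary search costs about log2(n) comparisons, which is where the relative costs mentioned above come into play.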
(i) It is probably true that optimising code for execution on CPU X will make that code less optimal on CPU Y than the same code optimised for execution on CPU Y. Probably.
(ii) Probably not.
(iii) Impossible to generalise. You have to test your code and come to your own conclusions.
(iv) Probably not.
For every argument about why X should be faster than Y under some set of conditions (choice of compiler, choice of CPU, choice of optimisation flags for compilation) some clever SOer will find a counter-argument, for every example a counter-example. When the rubber meets the road the only recourse you have is to test and measure. If you want to know whether compiler X is 'better' than compiler Y first define what you mean by better, then run a lot of experiments, then analyse the results.
I) If you did not tell the compiler which CPU type to favor, the odds are that it will be slightly sub-optimal on all CPUs. On the other hand, if you let the compiler know to optimize for your specific type of CPU, then it can definitely be sub-optimal on other CPU types.
II) No (for Intel and MS at least). If you tell the compiler to compile with SSE4, it will feel safe using SSE4 anywhere in the code without testing. It becomes your responsibility to ensure that your platform is capable of executing SSE4 instructions, otherwise your program will crash. You might want to compile two libraries and load the proper one. An alternative to compiling for SSE4 (or any other instruction set) is to use intrinsics; these will check internally for the best-performing set of instructions (at the cost of a slight overhead). Note that I am not talking about instruction intrinsics here (those are specific to an instruction set), but intrinsic functions.
III) That is a whole other discussion in itself. It changes with every version, and may be different for different programs. So the only solution here is to test. Just a note though: Intel compilers are known not to compile well for running on anything other than Intel (e.g. intrinsic functions may not recognize the instruction set of an AMD or Via CPU).
IV) If we ignore the on-die efficiencies of newer CPUs and the obvious architecture differences, then yes, it may perform as well on an older CPU. Multi-core processing is not dependent per se on the CPU type. But the performance is VERY dependent on the machine architecture (e.g. memory bandwidth, NUMA, chip-to-chip bus) and on differences in the multi-core communication (e.g. cache coherency, bus locking mechanism, shared cache). All this makes it impossible to compare newer and older CPU efficiencies in MP, but that is not what you are asking, I believe. So on the whole, an MP program made for newer CPUs should not use the MP aspects of older CPUs any less efficiently. In other words, just tweaking the MP aspects of a program specifically for an older CPU will not do much. Obviously you could rewrite your algorithm to use a specific CPU more efficiently (e.g. a shared cache may permit an algorithm that exchanges more data between worker threads, but that algorithm will die on a system with no shared cache, full bus locks and poor memory latency/bandwidth), but that involves a lot more than just MP-related tweaks.
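For point II above, the "compile two libraries and load the proper one" approach amounts to choosing an implementation once at startup, based on what the CPU reports. A minimal sketch of that selection logic, assuming the feature set has already been obtained by some platform-specific means (for example a CPUID query); both implementations here are plain Python stand-ins, and the feature name "sse4_1" is only a label:

```python
from typing import Callable, Iterable, Sequence


def sum_squares_baseline(values: Sequence[int]) -> int:
    """Stand-in for the build compiled without any extensions."""
    return sum(v * v for v in values)


def sum_squares_sse4(values: Sequence[int]) -> int:
    """Stand-in for the build compiled with SSE4 enabled.

    In a real program this would live in a separately compiled library
    that must never be called on a CPU lacking SSE4.
    """
    return sum(v * v for v in values)


def select_impl(cpu_features: Iterable[str]) -> Callable[[Sequence[int]], int]:
    """Pick the fastest implementation the CPU can actually run."""
    if "sse4_1" in set(cpu_features):
        return sum_squares_sse4
    return sum_squares_baseline


# Hypothetical feature sets for an older and a newer CPU.
old_cpu = {"mmx", "sse", "sse2"}
new_cpu = {"mmx", "sse", "sse2", "sse4_1"}

print(select_impl(old_cpu).__name__)  # sum_squares_baseline
print(select_impl(new_cpu).__name__)  # sum_squares_sse4
```

The check happens once, outside the hot loop, so the fall-back costs nothing per call beyond an indirect function call.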
(1) Not only is it possible but it has been documented on pretty much every generation of x86 processor. Go back to the 8088 and work your way forward, every generation. Clock for clock the newer processor was slower for the current mainstream applications and operating systems (including Linux). The 32 to 64 bit transition is not helping, more cores and less clock speed is making it even worse. And this is true going backward as well for the same reason.
(2) Bank on your binaries failing or crashing. Sometimes you get lucky, most of the time you don't. There are new instructions, yes, and supporting them would probably mean trapping the undefined instruction and emulating it in software, which would be horribly slow; the lack of demand for it means it is probably not well done or just not there. Optimization can use new instructions, but more than that, the bulk of the optimization that I am guessing you are talking about has to do with reordering the instructions so that the various pipelines do not stall. If you arrange them to be fast on one generation of processor, they will be slower on another, because in the x86 family the cores change too much. AMD had a good run there for a while, as they would make the same code just run faster instead of trying to invent new processors that would eventually be faster when the software caught up. That is no longer true; both AMD and Intel are struggling to just keep chips running without crashing.
(3) Generally, yes. For example, gcc is a horrible compiler; one size fits all fits no one well, and it can never and will never be any good at optimizing. For example, gcc 4.x code is slower than gcc 3.x code for the same processor (yes, all of this is subjective; it all depends on the specific application being compiled). The in-house compilers I have used were leaps and bounds ahead of the cheap or free ones (I am not limiting myself to x86 here). Are they worth the price though? That is the question.
In general, because of the horrible new programming languages and gobs of memory, storage and layers of caching, software engineering skills are at an all-time low. This means the pool of engineers capable of making a good compiler, much less a good optimizing compiler, decreases with time; this has been going on for at least 10 years. So even the in-house compilers are degrading with time, or companies just have their employees work on and contribute to the open source tools instead of having an in-house tool. The tools the hardware engineers use are also degrading for the same reason, so we now have processors that we hope will just run without crashing, rather than processors we try to optimize for. There are so many bugs and chip variations that most of the compiler work is avoiding the bugs. Bottom line, gcc has singlehandedly destroyed the compiler world.
(4) See (2) above. Don't bank on it. Your operating system that you want to run this on will likely not install on the older processor anyway, saving you the pain. For the same reason that the binaries optimized for your pentium III ran slower on your Pentium 4 and vice versa. Code written to work well on multi core processors will run slower on single core processors than if you had optimized the same application for a single core processor.
The root of the problem is that the x86 instruction set is dreadful. So many far superior instruction sets have come along that do not require hardware tricks to make them faster every generation. But the wintel machine created two monopolies and the others couldn't penetrate the market. My friends keep reminding me that these x86 machines are microcoded, such that you really don't see the instruction set inside. It angers me even more that the horrible ISA is just an interpretation layer. It is kind of like using Java. The problems you have outlined in your questions will continue so long as Intel stays on top; if the replacement does not become the monopoly, then we will be stuck forever in the Java model where you are on one side or the other of a common denominator: either you emulate the common platform on your specific hardware, or you are writing apps and compiling to the common platform.

Optimizing for ARM: Why different CPUs affects different algorithms differently (and drastically)

I was doing some benchmarks for the performance of code on Windows Mobile devices, and noticed that some algorithms were doing significantly better on some hosts and significantly worse on others, even after taking into account the difference in clock speeds.
The statistics for reference (all results are generated from the same binary, compiled by Visual Studio 2005 targeting ARMv4):
Intel XScale PXA270
Algorithm A: 22642 ms
Algorithm B: 29271 ms
ARM1136EJ-S core (embedded in a MSM7201A chip)
Algorithm A: 24874 ms
Algorithm B: 29504 ms
ARM926EJ-S core (embedded in an OMAP 850 chip)
Algorithm A: 70215 ms
Algorithm B: 31652 ms (!)
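One way to factor out clock speed and look only at how each core treats the two algorithms is to compare the ratio of A to B per core. A small sketch using the timings listed above (the dictionary layout and function name are just for this illustration):

```python
TIMINGS_MS = {
    "PXA270":      {"A": 22642, "B": 29271},
    "ARM1136EJ-S": {"A": 24874, "B": 29504},
    "ARM926EJ-S":  {"A": 70215, "B": 31652},
}


def a_over_b(timings: dict) -> dict:
    """Ratio of algorithm A's time to algorithm B's time for each core.

    Both timings on a core share that core's clock speed, so the ratio
    isolates how differently the core handles the two algorithms.
    """
    return {core: t["A"] / t["B"] for core, t in timings.items()}


for core, ratio in a_over_b(TIMINGS_MS).items():
    print(f"{core}: A/B = {ratio:.2f}")
```

The ratio comes out at roughly 0.77 and 0.84 on the first two cores but about 2.22 on the ARM926EJ-S, so the anomaly is specific to algorithm A on that core rather than a general slowdown.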
I checked out floating point as a possible cause, and while algorithm B does use floating point code, it does not use it from the inner loop, and none of the cores seem to have a FPU.
So my question is: what mechanism may be causing this difference? Preferably with suggestions on how to fix or avoid the bottleneck in question.
Thanks in advance.
One possible cause is that the 926 has a shorter pipeline (5 stages vs. 8 stages for the 1136, iirc), so branch mispredictions are less costly on the 926.
That said, there are a lot of architectural differences between those processors, too many to say for sure why you see this effect without knowing something about the instructions that you're actually executing.
Clock speed is only one factor. Bus width and latency are as big, if not bigger, factors. Cache is a factor. So is the speed of the media the program is run from, if it is run from media and not from memory.
Is this test using any shared libraries at any point, or is it all internal code? Fetching shared libraries from media is something that will vary from platform to platform (even if it is, say, the same SD card).
Is this the same algorithm compiled separately for each platform, or the same binary? You can and will see some compiler-induced variation as well. 50% faster or slower can easily come from the same compiler on the same platform just by varying compiler settings. If possible, you want to execute the same binary and ensure that no shared libraries are used in the loop under test. If it is not the same binary, disassemble the loop under test for each platform and ensure that there are no variations other than register selection.
From the data you have presented, it's difficult to pinpoint the exact problem, but we can share some prior experience:
Cache settings (check whether all the processors have the same cache settings).
You need to check both D-Cache and I-Cache.
For analysis:
Break down your code further, not just by algorithm but at a block level, and try to identify the block that causes the bottleneck. After you find that block, disassemble it and check the assembly. It may help.
It looks like the problem is in the cache settings or something memory-related (maybe I-Cache "overflow").
Pipeline stalls and branch mispredictions usually give less significant differences.
You can try to count some basic operations, executed in each algorithm, for example:
number of "easy" arithmetical/bitwise ops (+-|^&) and shifts by constant
number of shifts by variable
number of multiplications
number of "hard" arithmetics operations (divides, floating point ops)
number of aligned memory reads (32bit)
number of byte memory reads (8bit) (it's slower than 32bit)
number of aligned memory writes (32bit)
number of byte memory writes (8bit)
number of branches
something else, don't remember more :)
This will tell you which kinds of operations make the 926 so much slower. After this you can check the suspicious blocks by making them use those operations more or less intensively, and you'll get the answer.
Furthermore, it's much better to enable assembly listing generation in VS and use the listing (rather than your high-level source code) as the basis for your research.
p.s.: maybe the problem is in the OS/software/firmware? Did you test on a clean system? Is the OS the same on all devices?
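Counting operations by category from an assembly listing can be automated with a few lines of script. The sketch below assumes a simplified listing format (one instruction per line, labels ending in ":", comments after ";") and a deliberately small, hand-picked mnemonic table; a real listing from the compiler will need a fuller table and handling of condition-code suffixes. It also counts static occurrences only, so instructions inside hot loops still need to be weighted by how often they run.

```python
from collections import Counter

# Hypothetical, incomplete mapping from ARM mnemonics to categories.
CATEGORIES = {
    "alu": {"add", "adds", "sub", "subs", "and", "orr", "eor", "mov", "cmp"},
    "multiply": {"mul", "mla"},
    "word_load": {"ldr"},
    "byte_load": {"ldrb"},
    "word_store": {"str"},
    "byte_store": {"strb"},
    "branch": {"b", "bl", "beq", "bne"},
}


def count_ops(listing: str) -> Counter:
    """Count instructions per category in a simplified assembly listing."""
    counts = Counter()
    for line in listing.splitlines():
        code = line.split(";", 1)[0].strip()   # drop trailing comments
        if not code or code.endswith(":"):     # skip blanks and labels
            continue
        mnemonic = code.split()[0].lower()
        for category, names in CATEGORIES.items():
            if mnemonic in names:
                counts[category] += 1
                break
        else:
            counts["other"] += 1
    return counts


SAMPLE = """
loop:
    ldrb r2, [r0], #1   ; byte load
    add  r3, r3, r2
    strb r3, [r1], #1   ; byte store
    subs r4, r4, #1
    bne  loop
"""

print(dict(count_ops(SAMPLE)))
```

Running the same count over the inner loops of algorithm A and algorithm B gives a quick profile of which operation types dominate each one.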