Is There An Exhaustive Profiler? - optimization

Relatively often I see people benchmark/profile (or advise someone else to benchmark/profile) a specific piece of code in one specific circumstance on one specific CPU in one specific computer; and then (potentially falsely) assume that this result applies to code in different circumstances (e.g. other logical CPUs in the same core under different loads) on a wide variety of very different CPUs (e.g. "all 64-bit 80x86") in a wide variety of different computers (e.g. with different RAM timings, etc).
What I'm looking for is a kind of profiler that is able to generate profiling results for many CPUs under many conditions (primarily through interpreting the code rather than direct measurement); and then combine all of the results using weighting factors (where weighting factors represent how much the user cares about each measured case) to create a result that is actually useful and not misleading.
Is there any profiling tool that fits this description?

I think there is no universal performance forecasting tool published in Internet; but there may be some inside CPU vendors to optimize next microarchitectures.
There is valgrind binary instrumentation platform with callgrind/cachegrind (slow) simple model profilers. Callgrind counts basic block executions in model like 1 instruction is something like 1 cpu clock; cachegrind additionally instruments models memory accesses with some 2 level cache model and also may model simple branch predictor. Both tools has no knowledge/models of wide decode/execution/retire capabilities of modern OOO CPUs from Vendor 1 and Vendor 2 of "all 64-bit 80x86"-compatible CPUs (and OOO cpus are similar in basic OOO capabilities and performance).
There were several open-source projects of OOO CPU simulators (from slow to very slow) like: MARSSx86 (http://marss86.org/, 2012) based on PTLsim (http://www.facom.ufms.br/~ricardo/Courses/CompArchII/Tools/PTLSim/PTLsimManual.pdf, 2007), or Sniper Multi-Core Simulator (with help of Graphite framework). (There is also DRAMSim/DRAMSim2 memory simulator which is needed for accurate system simulation and it is used in several other simulator projects; it can be optionally used in RISC-V Rocket-Chip simulator)
You may be interested in some (very-very slow - tens KIPS) cycle-accurate simulator / microarchitecture simulator, but there are not too many open source variants of them. There are some commercial simulators (for example in ARM world - ARM Cycle Models / CPAKs; ARC nSIM, ...); or simplescalar.com. There are also in-house proprietary simulators (no access to them for us).
The only public approximation to microarchitecture/cycle simulator is Vendor 1's IACA: https://software.intel.com/en-us/articles/intel-architecture-code-analyzer (inexact partial model of OOO port planning for short code sequences like inner loops without any memory hierarchy modeling). And there is other tool "SDE" from Vendor 1 to estimate/debug some future CPU instruction extensions with older CPUs and PIN binary rewriting tool: https://software.intel.com/en-us/articles/intel-software-development-emulator.

Related

Virtual memory beyond page tables

I am working on a research project to develop an OS for a many-core(1000+) chip. we are looking into implementing a virtual memory type system for memory permissions (read/write/execute) that would allow memory to be safely shared across cores.
basically we want a system that would allow us to mark a 'page' as being readable by some subset of cores writeable by another...etc. we are not going to be doing address translation (at least at this point) but we need a way to efficiently set and query permissions. it is going to be a software filled datastructure with a simple TLB style cache.
Our intuition is that simply replicating page tables for each core will be too expensive (in terms of memory usage).
what datastructures would be efficient for this kind of problem?
thanks
Have you looked at how common multi-core (2-12 core) CPU's address this problem?
Do you know where/when/why/how the solution that is used in these common multi-core CPU's -- will not scale to a 1,000+ cores?
In other words -- can you quantify what's wrong with the existing solution, which is working, and has been working, with common CPU's whose core count <= 12 ?
If you know that -- then the answer is closer than you think, because it just requires understanding how AMD/Intel solved the problem on a lesser scale -- and what's needed to make their solution work on a greater one (Maybe more memory for tables, algorithm tweaks, etc.)
Look at AMD's/Intel's data structures -- then build a software simulator for 1,000+ cores with those data structures, and see where/when/why and how your simulation fails -- if it fails...
Ideally build your simulator with a user-selectable number of cores, then TEST, TEST, TEST with different amounts of cores -- working your way up, noting bottlenecks along the way.
Your simulator should work EXACTLY as well as AMD (if you're using AMD data structures) or Intel (if you're using Intel data structures) -- at the same core count as one of their chips... because it should prove that THEY (AMD/Intel) are doing what they're doing correctly (because they are), and because that will help prove that your simulation program is doing it's simulation correctly -- at a specific number of cores.
Wishing you luck!

Which kinds of low level facilities aren't typically supported on multi-core machines?

I'm looking at some optimized, low level, cross platform, concurrency code designed to run on multi-core machines, and want to check some of its assumptions.
Support for hardware optimizations of some kinds aren't, probably, supported on multi core designs (for example, Out of Order Execution support [wikipedia] seems like a good candidate - it takes a lot of surface area to implement, and can be a power hog). Does anyone have a list of other such facilities - ones typically available on single or small number of core machines, but typically left out from machines with larger number of cores on them?
Today, multicore machines are warmed-over die shrinks of uniprocessors. You could almost imagine sawing a 4-core die into 4 1-core dice. I exaggerate only a little bit.
In future, multicore machines will be more thoughtfully designed for energy efficiency and area efficiency. You may see the same ISA, but with different mixes of resources (more or fewer numbers of duplicated functional units), and even with some sharing of resources between cores (e.g. AMD Bulldozer). And, as you say, backing off from the complexity and energy overhead of no-holds-barred out-of-order execution. This will most likely be perceived as different instruction-per-clock (IPC) differences (more or less performance) on the same instruction set architecture.
Also as vendors have to juggle a hypothetical portfolio of big out-of-order serial performance optimized cores and small in-order or less-out-of-order (OoO) and narrower, more energy efficient "throughput" cores, they will be challenged to keep these different implementations in sync with the evolutions of their ISAs. Some cores may support new instructions, new state, new coprocessors, virtualization, security, etc. earlier than others. This leads to a challenge of coding to the common denominator while also lighting up the new facilities for better perf or energy efficiency (or whatever) on those cores that have the new capabilities.
So to answer your specific question, all the traditional computer architecture techniques for trading gates for expressive-power, or performance, or energy efficiency may be rethought and selectively removed in future small throughput-oriented cores.
Hardware multithreading
Aggressive OoO -> humble OoO or even in-order execution
High degrees of microarchitectural speculation
Fancy branch predictors
Big TLBs
Fancy memory prefetchers
Deep pipelines
Wide issue / many copies of functional units
Big caches, wide buses to caches
...
But it goes both ways. It may also be that the new small throughput-optimized energy-optimized cores have new features not present in the older OoO cores. For example, the Larrabee New Instructions (LRBni) (http://www.drdobbs.com/high-performance-computing/216402188) were proposed for a machine with dozens of simpler cores. As another example, the small cores may turn to hardware multithreading to afford better memory latency tolerance to compensate for smaller private caches.
Also, having lots of small energy frugal cores means you may be willing to dedicate and therefore customize some of the cores to optimize performance for particular valuable workloads. For example, the Tensilica custom processors and tools anticipate that some of your small cores will have additional instructions and custom problem-specific datapaths (accelerating an inner loop of video decoding, for example). So in these cases the little core may (counter-intuitively) have much better performance than the much larger core.
Makes sense?
Happy hacking!

Optimisation , Compilers and Its Effects

(i) If a Program is optimised for one CPU class (e.g. Multi-Core Core i7)
by compiling the Code on the same , then will its performance
be at sub-optimal level on other CPUs from older generations (e.g. Pentium 4)
... Optimizing may prove harmful for performance on other CPUs..?
(ii)For optimization, compilers may use x86 extensions (like SSE 4) which are
not available in older CPUs.... so ,Is there a fall-back to some non-extensions
based routine on older CPUs..?
(iii)Is Intel C++ Compiler is more optimizing than Visual C++ Compiler or GCC..
(iv) Will a truly Multi-Core Threaded application will perform effeciently on a
older CPUs (like Pentium III or 4)..?
Compiling on a platform does not mean optimizing for this platform. (maybe it's just bad wording in your question.)
In all compilers I've used, optimizing for platform X does not affect the instruction set, only how it is used, e.g. optimizing for i7 does not enable SSE2 instructions.
Also, optimizers in most cases avoid "pessimizing" non-optimized platforms, e.g. when optimizing for i7, typically a small improvement on i7 will not not be chosen if it means a major hit for another common platform.
It also depends in the performance differences in the instruction sets - my impression is that they've become much less in the last decade (but I haven't delved to deep lately - might be wrong for the latest generations). Also consider that optimizations make a notable difference only in few places.
To illustrate possible options for an optimizer, consider the following methods to implement a switch statement:
sequence if (x==c) goto label
range check and jump table
binary search
combination of the above
the "best" algorithm depends on the relative cost of comparisons, jumps by fixed offsets and jumps to an address read from memory. They don't differ much on modern platforms, but even small differences can create a preference for one or other implementation.
It is probably true that optimising code for execution on CPU X will make that code less optimal on CPU Y than the same code optimised for execution on CPU Y. Probably.
Probably not.
Impossible to generalise. You have to test your code and come to your own conclusions.
Probably not.
For every argument about why X should be faster than Y under some set of conditions (choice of compiler, choice of CPU, choice of optimisation flags for compilation) some clever SOer will find a counter-argument, for every example a counter-example. When the rubber meets the road the only recourse you have is to test and measure. If you want to know whether compiler X is 'better' than compiler Y first define what you mean by better, then run a lot of experiments, then analyse the results.
I) If you did not tell the compiler which CPU type to favor, the odds are that it will be slightly sub-optimal on all CPUs. On the other hand, if you let the compiler know to optimize for your specific type of CPU, then it can definitely be sub-optimal on other CPU types.
II) No (for Intel and MS at least). If you tell the compiler to compile with SSE4, it will feel safe using SSE4 anywhere in the code without testing. It becomes your responsibility to ensure that your platform is capable of executing SSE4 instructions, otherwise your program will crash. You might want to compile two libraries and load the proper one. An alternative to compiling for SSE4 (or any other instruction set) is to use intrinsics, these will check internally for the best performing set of instructions (at the cost of a slight overhead). Note that I am not talking about instruction instrinsics here (those are specific to an instruction set), but intrinsic functions.
III) That is a whole other discussion in itself. It changes with every version, and may be different for different programs. So the only solution here is to test. Just a note though; Intel compilers are known not to compile well for running on anything other than Intel (e.g.: intrinsic functions may not recognize the instruction set of a AMD or Via CPU).
IV) If we ignore the on-die efficiencies of newer CPUs and the obvious architecture differences, then yes it may perform as well on older CPU. Multi-Core processing is not dependent per se on the CPU type. But the performance is VERY dependent on the machine architecture (e.g.: memory bandwidth, NUMA, chip-to-chip bus), and differences in the Multi-Core communication (e.g.: cache coherency, bus locking mechanism, shared cache). All this makes it impossible to compare newer and older CPU efficiencies in MP, but that is not what you are asking I believe. So on the whole, a MP program made for newer CPUs, should not be using less efficiently the MP aspects of older CPUs. Or in other words, just tweaking the MP aspects of a program specifically for an older CPU will not do much. Obviously you could rewrite your algorithm to more efficiently use a specific CPU (e.g.: A shared cache may permit you to use an algorithm that exchanges more data between working threads, but this algo will die on a system with no shared cache, full bus lock and low memory latency/bandwidth), but it involves a lot more than just MP related tweaks.
(1) Not only is it possible but it has been documented on pretty much every generation of x86 processor. Go back to the 8088 and work your way forward, every generation. Clock for clock the newer processor was slower for the current mainstream applications and operating systems (including Linux). The 32 to 64 bit transition is not helping, more cores and less clock speed is making it even worse. And this is true going backward as well for the same reason.
(2) Bank on your binaries failing or crashing. Sometimes you get lucky, most of the time you dont. There are new instructions yes, and to support them would probably mean trap for an undefined instruction and have a software emulation of that instruction which would be horribly slow and the lack of demand for it means it is probably not well done or just not there. Optimization can use new instructions but more than that the bulk of the optimization that I am guessing you are talking about has to do with reordering the instructions so that the various pipelines do not stall. So you arrange them to be fast on one generation processor they will be slower on another because in the x86 family the cores change too much. AMD had a good run there for a while as they would make the same code just run faster instead of trying to invent new processors that eventually would be faster when the software caught up. No longer true both amd and intel are struggling to just keep chips running without crashing.
(3) Generally, yes. For example gcc is a horrible compiler, one size fits all fits no one well, it can never and will never be any good at optimizing. For example gcc 4.x code is slower on gcc 3.x code for the same processor (yes all of this is subjective, it all depends on the specific application being compiled). The in house compilers I have used were leaps and bounds ahead of the cheap or free ones (I am not limiting myself to x86 here). Are they worth the price though? That is the question.
In general because of the horrible new programming languages and gobs of memory, storage, layers of caching, software engineering skills are at an all time low. Which means the pool of engineers capable of making a good compiler much less a good optimizing compiler decreases with time, this has been going on for at least 10 years. So even the in house compilers are degrading with time, or they just have their employees to work on and contribute to the open source tools instead having an in house tool. Also the tools the hardware engineers use are degrading for the same reason, so we now have processors that we hope to just run without crashing and not so much try to optimize for. There are so many bugs and chip variations that most of the compiler work is avoiding the bugs. Bottom line, gcc has singlehandedly destroyed the compiler world.
(4) See (2) above. Don't bank on it. Your operating system that you want to run this on will likely not install on the older processor anyway, saving you the pain. For the same reason that the binaries optimized for your pentium III ran slower on your Pentium 4 and vice versa. Code written to work well on multi core processors will run slower on single core processors than if you had optimized the same application for a single core processor.
The root of the problem is the x86 instruction set is dreadful. So many far superior instructions sets have come along that do not require hardware tricks to make them faster every generation. But the wintel machine created two monopolies and the others couldnt penetrate the market. My friends keep reminding me that these x86 machines are microcoded such that you really dont see the instruction set inside. Which angers me even more that the horrible isa is just an interpretation layer. It is kinda like using Java. The problems you have outlined in your questions will continue so long as intel stays on top, if the replacement does not become the monopoly then we will be stuck forever in the Java model where you are one side or the other of a common denominator, either you emulate the common platform on your specific hardware, or you are writing apps and compiling to the common platform.

What's the best description for "embedded hardware system"?

When I hear that, I always think about an mobile device. But why is the hardware "embedded" there? Isn't the whole device the hardware? Why is a personal computer no embedded hardware system?
In today's world embedded simply refers to a system with one or more of the following traits:
Single purpose (ie, not a general purpose computer, like your desktop)
Firmware rather than software - still software, but not as easily updated (flash, etc)
Hardware and software are designed together as a unit
Different, perhaps more rigorous testing as software updates are not desired
Real time computing
Memory integrated on the CPU
Microcontroller rather than microprocessor
Expected high reliability (you shouldn't have to reboot your dishwasher or microwave)
If it runs a program, but doesn't look like a computer, it's an embedded system.
That's my standard answer for friends and family. There's too many different types of embedded systems to get more specific.
I worked in the "embedded" area for a while and we considered anything that we had to write custom code for the hardware to be embedded.
If you have to work around the memory structure, write custom device drivers and anything that sits "directly on the metal" is generally "embedded".
If you're debugging it via a serial port - it's embedded.
It is called "embedded" because the computer is embedded as part of a larger device.
There is a very wide range of embedded systems.
At the low end are 8-pin PICs, for example there is a 12F629 in these diode lights. These costs cents and have very little memory.
The NXT by LEGO contains two controllers, a relatively big AT91SAM7S256 with a 32-bit ARM core, 256KB of flash ROM and 64KB of RAM, and a smaller 8-bit ATmega48 with 4KB of flash.
Currently I'm working on embedded systems for trains, these typically have a PowerPC with some hundreds of MHz clock, on the order of a hundred MB of RAM, run VxWorks or Linux and are connected by Ethernet.
I think there are still more powerful embedded systems for telecommunications, but I haven't worked on these.
As per Wikipedia:
An embedded system is a
special-purpose computer system designed to perform one or a few
dedicated functions, often with
real-time computing constraints. It is
usually embedded as part of a complete
device including hardware and
mechanical parts. In contrast, a
general-purpose computer, such as a
personal computer, can do many
different tasks depending on
programming.
Embedded systems are designed to do some specific task, rather than be a
general-purpose computer for multiple
tasks. Some also have real-time
performance constraints that must be
met, for reasons such as safety and
usability; others may have low or no
performance requirements, allowing the
system hardware to be simplified to
reduce costs.
Embedded systems are not always standalone devices. Many embedded
systems consist of small, computerized
parts within a larger device that
serves a more general purpose. For
example, the Gibson Robot Guitar
features an embedded system for tuning
the strings, but the overall purpose
of the Robot Guitar is, of course, to
play music.[2] Similarly, an embedded
system in an automobile provides a
specific function as a subsystem of
the car itself.
The program instructions written for embedded systems are referred to as
firmware, and are stored in read-only
memory or Flash memory chips. They run
with limited computer hardware
resources: little memory, small or
non-existent keyboard and/or screen.
From personal experience, if it's "headless" (i.e. doesn't have an output device like a VDU and relies on something like LED's), if there is a serial port used mainly for debugging and logging and if you often use a logic analyser for debugging, it's embedded.
"Embedded" has become a very diverse term.
I've seen and worked on designs that:
Simply toggled discrete I/O (including LEDs) at fixed intervals
Drivers for hardware solutions (e.g. webcams, wireless com)
Acted as communications translators for board-level I/O (SPI<->I2C<->Rs232<->USB)
[ insert multitude of appliances here ]
Human-controlled electronics (calculator-esque, phone-esque)
System level devices to coordinate actions of other devices.
I also like Dour-High-Arch's comment above:
"Another important difference is that embedded apps may run for years without intervention..."
"Embedded system" is a very broad term and I don't think that it is easy to have a single definition. The word "embedded" actually refers to an industry and not to a "hardware system". The description of embedded systems has changed over the years and it is definitely going to change in the future too.
In early days one would say the embedded systems were only programmed in assembly, but now C is common place and perhaps in the future other languages are used as well. CPUs are getting bigger and bigger, external memories are used all the time and they are many devices considered to be embedded that are not dedicated to a single task, applications can be added to them and the software is easily updated. Watches, gadgets, house appliances, automotive devices, PLCs, motor controllers, weather stations, system monitoring devices are all considered embedded. It is difficult to singe define them all.

CPU Cards for Parallel Computation?

I remember reading some time ago that there were cpu cards for systems to add additional processing power to do mass parallelization. Anyone have any experience on this and any resources to get looking into the hardware and software aspects of the project? Is this technology inferior to a traditional cluster? Is it more power conscious?
There are two cool options. one is the use of GPU's as Mitch mentions. The other is to get a PS/3, which has a multicore Cell processor.
You can also set up multiple inexpensive motherboard PCs and run Linux and Beowulf.
GPGPU is probably the most practical option for an enthusiast. However, DSPs are another option, such as those made by Texas Instruments, Freescale, Analog Devices, and NXP Semiconductors. Granted, most of those are probably targeted more towards industrial users, but you might look into the Storm-1 line of DSPs, some of which are supposed to go for as low as $60 a piece.
Another option for data parallelism are Physics Processing Units like the Nvidia (formerly Ageia) PhysX. The most obvious use of these coprocessors are for games, but they're also used for scientific modeling, cryptography, and other vector processing applications.
ClearSpeed Attached Processors are another possibility. These are basically SIMD co-processors designed for HPC applications, so they might be out of your price range, but I'm just guessing here.
All of these suggestions are based around data parallelism since I think that's the area with the most untapped potential. A lot of currently CPU-intensive applications could be performed much faster at much lower clock rates (and using less power) by simply taking advantage of vector processing and more specialized SIMD instruction sets.
Really, most computer users don't need more than an Intel Atom processor for the majority of their casual computing needs: e-mail, browsing the web, and playing music/video. And for the other 10% of computing tasks that actually do require lots of processing power, a general-purpose scalar processor typically isn't the best tool for the job anyway.
Even most people who do have serious processing needs only need it for a narrow range of applications; a physicist doesn't need a PC capable of playing the latest FPS; a sound engineer doesn't need to do scientific modeling or perform statistical analysis; and a graphic designer doesn't need to do digital signal processing. Domain-specific vector processors with highly specialized instruction sets (like modern GPUs for gaming) would be able to handle these tasks much more efficiently than a high power general-purpose CPU.
Cluster computing is no doubt very useful for a lot of high end industrial applications like nuclear research, but I think vector processing has much more practical uses for the average person.
Have you looked at the various GPU Computing options. Nvidia (and probably others) are offering personal supercomputers based around utilising the power of graphics cards.
OpenCL - is an industry wide standard for doing HPC computing across different vendors and processor types, single-core, multi-core, graphics cards, cell, etc... see http://en.wikipedia.org/wiki/OpenCL.
The idea is that using a simple code base you can use all spare processing capacity on the machine regardless of type of processor.
Apple has implemented this standard in its next version Mac OS X. There will also be offerings from nVIDIA, ATI, Intel etc.
Mercury Computing offers a Cell Accelerator Board, it's a PCIe card that has a Cell processor, and runs Yellow Dog Linux, or Mercury's flavor of YDL. Fixstars offers a more powerful Cell PCIe board called the GigaAccel. I called up Mercury, they said their board is about $5000 USD, without software. I'd guess the GigaAccel is up to twice as expensive.
I found one of the Mercury boards used, but it didn't come with a power cable, so I haven't been able to use it yet, sadly.