Which processors support "Fast Short REP CMPSB and SCASB" - optimization

Intel's "Optimization Reference Manual" mentions a new cpu feature "Fast Short REP CMPSB and SCASB" that could speed up string operations:
REP CMPSB and SCASB performance is enhanced. The enhancement applies to string lengths between 1
and 128 bytes long. When the Fast Short REP CMPSB and SCASB feature is enabled, REP CMPSB and REP
SCASB performance is flat 15 cycles per operation, for all strings 1-128 byte long whose two source operands reside in the processor first level cache.
Support for fast short REP CMPSB and SCASB is enumerated by the CPUID feature flag:
CPUID.07H.01H:EAX.FAST_SHORT_REP_CMPSB_SCASB[bit 12] = 1.
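For reference, here is a minimal sketch of checking that flag at runtime with GCC/Clang's cpuid.h helper (the leaf, sub-leaf, and bit position are the ones quoted above):

    #include <stdio.h>
    #include <cpuid.h>  /* GCC/Clang wrapper around the CPUID instruction */

    int main(void) {
        unsigned int eax, ebx, ecx, edx;
        /* Leaf 07H, sub-leaf 01H: structured extended feature flags */
        if (!__get_cpuid_count(0x07, 0x01, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 07H not available");
            return 1;
        }
        /* EAX bit 12 = FAST_SHORT_REP_CMPSB_SCASB */
        puts((eax >> 12) & 1 ? "Fast Short REP CMPSB/SCASB: supported"
                             : "Fast Short REP CMPSB/SCASB: not supported");
        return 0;
    }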
The "Fast Short REP MOVSB" feature description, by contrast, explicitly mentions which generation introduced support:
Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced
But I could not find any information about which CPU generation started supporting "Fast Short REP CMPSB".

A CPUID dump for the Core i5-12500 (which only has performance cores, no efficiency cores) shows support for this feature.
Dumps for the 1350P and 1365U also show support.
Interestingly, I did not see it in any of the other 13x00 CPUs.
InstLatX64 on Twitter also pointed me to the "Intel® 64 and IA-32 Architectures Software Developer’s Manual", which says the following:
Fast Short REP CMPSB, fast short REP SCASB
4th generation Intel® Xeon® Scalable Processor Family based on Sapphire Rapids microarchitecture

What is the difference between the gem5 CPU models and which one is more accurate for my simulation?

When running a simulation in gem5, I can select a CPU with fs.py --cpu-type.
This option can also show a list of all CPU types if I pass an invalid CPU type to fs.py --cpu-type.
What is the difference between those CPU types and which one should I choose for my experiment?
Question inspired by: https://www.mail-archive.com/gem5-users@gem5.org/msg16976.html
An overview of the CPU types can be found at: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-cpu-types
In summary:
simplistic CPUs (derived from BaseSimpleCPU): for example AtomicSimpleCPU (the default one). They have no CPU pipeline and are therefore completely unrealistic. However, they also run much faster. Therefore, they are mostly useful to boot Linux fast, then checkpoint and switch to a more detailed CPU.
Within the simple CPUs we can notably distinguish:
AtomicSimpleCPU: memory requests finish immediately
TimingSimpleCPU: memory requests actually take time to go through to the memory system and return. Since there is no CPU pipeline however, the simulated CPU stalls on every memory request waiting for a response.
An alternative to those is to use KVM CPUs to speed up boot if host and guest ISA are the same, although as of 2019, KVM is less stable as it is harder to implement and debug.
in-order CPUs: derived from the generic MinorCPU by parametrization, Minor stands for In Order:
for ARM: HPI is made by ARM and models a "(2017) modern in-order Armv8-A implementation". This is your best in-order ARM bet.
out-of-order CPUs, derived from the generic DerivO3CPU by parametrization, O3 stands for Out Of Order:
for ARM: there are no models specifically published by ARM as of 2019. The only specific O3 model available is ex5_big for an A15, but you would have to verify its authors' claims on how well it models the real A15 core.
If none of those are accurate enough for your purposes, you could try to create your own in-order/out-of-order models by parametrizing MinorCPU / DerivO3CPU like HPI and ex5_big do, although this could be hard to get right, as there isn't generally enough public information on non-free CPUs to do this without experiments or reverse engineering.
The other thing you will want to think about is the memory system model. There are basically two choices: classical vs Ruby, and within Ruby, several options are available, see also: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-ruby-build

Designing a Computer for Spatially-Explicit Modeling in NetLogo

I have done various searches and have yet to find a forum or article that discusses how to approach building a modeling computer for use with NetLogo. I was hoping to start such a discussion. Since the memory usage of NetLogo is proportional to the size of the world and the number of simulations run in parallel with BehaviorSpace, it seems reasonable that a formula should exist relating hardware requirements to NetLogo's demands.
As an example, I am planning to run a metapopulation model in a landscape approximately 12 km x 12 km, corresponding to a NetLogo world of 12,000x12,000 at a pixel size of 1, for a 1-meter resolution (relevant for the animal's movement behavior). An earlier post described a large world (How to model a very large world in NetLogo?) and linked a discussion of potential ways to avoid needing such large worlds (http://netlogo-users.18673.x6.nabble.com/Re-Rumors-of-Relogo-td4869241.html#a4869247). Another post described a world of 3147x5141 on a Linux computer with 64GB of RAM (http://netlogo-users.18673.x6.nabble.com/Java-OutofMemory-errors-for-large-NetLogo-worlds-suggestions-requested-td5002799.html). Clearly, the capability of computers to run large NetLogo worlds is becoming increasingly important.
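For a rough sense of scale, a back-of-the-envelope estimate (the ~100 bytes per patch used here is only an assumption for illustration; actual usage depends on the number of patch variables and the NetLogo version):

    12,000 x 12,000 = 144,000,000 patches
    144,000,000 patches x ~100 bytes/patch ≈ 14 GB per world (before turtles, links, and JVM overhead)
    Running N BehaviorSpace runs in parallel multiplies that by N.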
Presumably, the "best" solution for researchers at universities with access to Windows-based machines would be to run 16GB to 64GB of RAM with a six- or eight-core processor such as the Intel Xeon capable of hyperthreading for running multiple simulations in parallel with BehaviorSpace. As an example, I used SELES (Fall & Fall 2001) on a machine with a 6-core Xeon processor with hyperthreading enabled and 8GB of RAM to run 12,000 replicates of a model with a 1-meter resolution raster map of 1580x1580. This used the computer to its full capacity and it took about a month to run the simulations.
So - if I were to run 12,000 replicates of a 12,000x12,000 world in NetLogo, what would be the "best" option for a computer? Without reaching for the latest and greatest processing power out there, I would presume the most fiscally reasonable option to be a server board with dual Xeon processors (likely 8-core Ivy Bridge) and 64GB of RAM. Would this be a sufficient design, or are there cheaper (or more expensive) alternatives for modeling at this scale? And do there exist "guidelines" for processor/RAM combinations to cope with NetLogo's increasing memory demand as world size and the number of parallel simulations grow?

Best-case instruction throughput on ARM NEON

What is the best-case instruction throughput for a compute-bound algorithm coded in ARM-NEON?
For example, if I have a simple algorithm based on a large number of 8-bit->8-bit operations, what is the fastest possible execution speed (measured in 8-bit operations per cycle) that could be sustained if we assume full latency hiding of any memory I/O?
I am initially interested in Cortex-A8, but if you also have data for different processors, please note the differences.
As nobar mentioned, this will vary depending on the micro-architecture (Samsung/Apple/Qualcomm, etc.). But basically, in the stock A8 implementation, NEON is a 64-bit architecture that takes two (or one) 64-bit operands and produces a 64-bit result. So, without any pipeline (data-dependency) stalls or I/O stalls, an integer pipeline can do eight 8-bit operations per cycle in SIMD fashion. The best case on stock ARM processors that are single-issue for ALU/multiply operations is therefore probably "8".
You can look at the ARM architecture reference for an idea of how long various instructions take on stock ARM A8 processors. If you aren't familiar with the nomenclature, "D" registers are 64 bits wide, "Q" registers are double-wide 128-bit registers, and instructions can treat the data in the registers as 8-, 16-, or 32-bit formats.
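To make the lane counts concrete, here is a minimal sketch using GCC/Clang NEON intrinsics (each intrinsic maps to a single NEON instruction; the per-cycle throughput still depends on the specific core, as discussed):

    #include <arm_neon.h>

    /* One D-register add: 8 x 8-bit lanes processed by a single instruction */
    uint8x8_t add_bytes_d(uint8x8_t a, uint8x8_t b) {
        return vadd_u8(a, b);
    }

    /* One Q-register add: 16 x 8-bit lanes processed by a single instruction */
    uint8x16_t add_bytes_q(uint8x16_t a, uint8x16_t b) {
        return vaddq_u8(a, b);
    }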
A nice overview of a stock A8 architecture is via TI's A8 NEON Architecture page.
Specifically about the differences between processors: a lot of ARM implementers don't make their microarchitecture details known except to extremely powerful customers, so noting the differences is fairly difficult. But as Stephen Canon notes below, the newer, higher-end A15-ish cores will probably double the performance for some types of instructions, and the lower-power ones will probably halve it for some types of instructions.
Most integer operations on Cortex-A8's NEON unit are executed 128 bits at a time, not 64. You can see the throughput in the TRM, found here: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0344k/index.html Some notable exceptions include multiplications, shifts by register value, and bit selects. But if you think about it, if there weren't 128-bit integer operations there'd be a lot less reason to use these instructions, since Cortex-A8 can already execute two 32-bit scalar integer operations in parallel.
Sadly, Cortex-A8 and A9 were the last ARM cores to include public documentation of execution performance. I haven't done extensive testing, but I think A15 can execute a 128-bit and a 64-bit NEON operation in parallel (not sure what restrictions there are). And from what I've heard in passing - this is totally untested - both Cortex-A5 and A7 have 64-bit NEON execution. A5 is further limited by only having 32-bit NEON load/store throughput (while A8 actually has 128-bit, and A9 and A7 have 64-bit).

CPU Cards for Parallel Computation?

I remember reading some time ago that there were CPU cards that could be added to a system for extra processing power and mass parallelization. Does anyone have experience with this, and any resources for looking into the hardware and software aspects of such a project? Is this technology inferior to a traditional cluster? Is it more power conscious?
There are two cool options. One is the use of GPUs, as Mitch mentions. The other is to get a PS3, which has a multicore Cell processor.
You can also set up multiple inexpensive motherboard PCs and run Linux and Beowulf.
GPGPU is probably the most practical option for an enthusiast. However, DSPs are another option, such as those made by Texas Instruments, Freescale, Analog Devices, and NXP Semiconductors. Granted, most of those are probably targeted more towards industrial users, but you might look into the Storm-1 line of DSPs, some of which are supposed to go for as low as $60 apiece.
Another option for data parallelism is Physics Processing Units like the Nvidia (formerly Ageia) PhysX. The most obvious use of these coprocessors is for games, but they're also used for scientific modeling, cryptography, and other vector-processing applications.
ClearSpeed Attached Processors are another possibility. These are basically SIMD co-processors designed for HPC applications, so they might be out of your price range, but I'm just guessing here.
All of these suggestions are based around data parallelism since I think that's the area with the most untapped potential. A lot of currently CPU-intensive applications could be performed much faster at much lower clock rates (and using less power) by simply taking advantage of vector processing and more specialized SIMD instruction sets.
Really, most computer users don't need more than an Intel Atom processor for the majority of their casual computing needs: e-mail, browsing the web, and playing music/video. And for the other 10% of computing tasks that actually do require lots of processing power, a general-purpose scalar processor typically isn't the best tool for the job anyway.
Even most people who do have serious processing needs only need it for a narrow range of applications; a physicist doesn't need a PC capable of playing the latest FPS; a sound engineer doesn't need to do scientific modeling or perform statistical analysis; and a graphic designer doesn't need to do digital signal processing. Domain-specific vector processors with highly specialized instruction sets (like modern GPUs for gaming) would be able to handle these tasks much more efficiently than a high power general-purpose CPU.
Cluster computing is no doubt very useful for a lot of high end industrial applications like nuclear research, but I think vector processing has much more practical uses for the average person.
Have you looked at the various GPU computing options? Nvidia (and probably others) are offering personal supercomputers based around utilising the power of graphics cards.
OpenCL is an industry-wide standard for doing HPC computing across different vendors and processor types - single-core, multi-core, graphics cards, Cell, etc.; see http://en.wikipedia.org/wiki/OpenCL.
The idea is that, using a single code base, you can use all the spare processing capacity on the machine regardless of the type of processor.
Apple has implemented this standard in the next version of Mac OS X. There will also be offerings from nVIDIA, ATI, Intel, etc.
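As a small illustration of that idea, here is a minimal sketch in C using the standard OpenCL host API that simply enumerates whatever devices (CPUs, GPUs, accelerators) are exposed on the machine:

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void) {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; p++) {
            cl_device_id devices[16];
            cl_uint num_devices = 0;
            /* Ask for every device type the platform exposes */
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);

            for (cl_uint d = 0; d < num_devices; d++) {
                char name[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
                printf("Platform %u, device %u: %s\n", p, d, name);
            }
        }
        return 0;
    }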
Mercury Computing offers a Cell Accelerator Board; it's a PCIe card that has a Cell processor and runs Yellow Dog Linux, or Mercury's flavor of YDL. Fixstars offers a more powerful Cell PCIe board called the GigaAccel. I called up Mercury, and they said their board is about $5000 USD without software. I'd guess the GigaAccel is up to twice as expensive.
I found one of the Mercury boards used, but it didn't come with a power cable, so I haven't been able to use it yet, sadly.

What makes a modern commodity cluster?

What would be the most cost-effective way of implementing a terabyte distributed memory cache using commodity hardware these days? And what would class as a piece of commodity hardware?
Commodity hardware is considered hardware that
Is off the shelf (nothing custom)
Is available in a substantially similar version from many manufacturers.
There are many motherboards that can hold 8 or 16 GB of RAM. Fewer server motherboards can hold 32 or even 64 GB.
But they still fit the definition of commodity, and can therefore be made into very large clusters for a very large sum of money.
Note, however, that for many access patterns a striped RAID HD array isn't much slower than a gigabit Ethernet link, so a RAM cluster might not be a significant improvement (except in latency), depending on how you're actually using it.
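Rough numbers behind that comparison (assuming on the order of 100 MB/s of sequential throughput per spinning disk, which was typical at the time):

    1 Gbit/s Ethernet ≈ 125 MB/s at best
    2 striped disks x ~100 MB/s ≈ 200 MB/s sequential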
-Adam