Why does higher core count lead to higher CPI? - hardware

I'm looking at a chart that shows that, in reality, increasing the core count on a CPI usually results in higher CPI for most instructions, as well as that it usually increases the total amount of instructions the program executes. Why is this happening?
From my understanding CPI should increase only when increasing the clock frequency, so the CPI increase doesn't make much sense to me.

What chart? What factors are they holding constant while increasing core count? Perhaps total transistor budget, so each core has to be simpler to have more cores?
Making a single core larger has diminishing returns, but building more cores has linear returns for embarrassingly parallel problems; hence Xeon Phi having lots of simple cores, and GPUs being very simple pipelines.
But CPUs that also care about single-thread performance / latency (instead of just throughput) will push into those diminishing returns and build wider cores. Many problems that we run on CPUs are not trivial to parallelize, so lots of weak cores is worse than fewer faster cores. For a given problem size, the more threads you have the more of its total time the thread will be communicating with other threads (and maybe waiting for data from them).
If you do keep each core identical when adding more cores, their CPI generally stays the same when running the same code. e.g. SPECint_rate scales nearly linearly with number of cores for current Intel/AMD CPUs (which do scale up by adding more of the same cores).
So that must not be what your chart is talking about. You'll need to clarify the question if you want a more specific answer.
You don't get perfectly linear scaling because cores do compete with each other for memory bandwidth, and space in the shared last-level cache. (Although most current designs scale up the size of last-level cache with the number of cores. e.g. AMD Zen has clusters of 4 cores sharing 8MiB of L3 that's private to those cores. Intel uses a large shared L3 that has a slice of L3 with each core, so the L3 per core is about the same.)
But more cores also means a more complex interconnect to wire them all together and to the memory controllers. Intel many-core Xeon CPUs notably have worse single-thread bandwidth than quad-core "client" chips of the same microarchitecture, even though the cores are the same in both. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

Related

Does low GPU utilization indicate bad fit for GPU acceleration?

I'm running some GPU-accelerated PyTorch code and training it against a custom dataset, but while monitoring the state of my workstation during the process, I see GPU usage along the following lines:
I have never written my own GPU primitives, but I have a long history of doing low-level optimizations for CPU-intensive workloads and my experience there makes me concerned that while pytorch/torchvision are offloading the work to the GPU, it may not be an ideal workload for GPU acceleration.
When optimizing CPU-bound code, the goal is to try and get the CPU to perform as much (meaningful) work as possible in a unit of time: a supposedly CPU-bound task that shows only 20% CPU utilization (of a single core or of all cores, depending on whether the task is parallelizable or not) is a task that is not being performed efficiently because the CPU is sitting idle when ideally it would be working towards your goal. Low CPU usage means that something other than number crunching is taking up your wall clock time, whether it's inefficient locking, heavy context switching, pipeline flushes, locking IO in the main loop, etc. which prevents the workload from properly saturating the CPU.
When looking at the GPU utilization in the chart above, and again speaking as a complete novice when it comes to GPU utilization, it strikes me that the GPU usage is extremely low and appears to be limited by the rate at which data is being copied into the GPU memory. Is this assumption correct? I would expect to see a spike in copy (to GPU) followed by an extended period of calculations/transforms, followed by a brief copy (back from the GPU), repeated ad infinitum.
I notice that despite the low (albeit constant) copy utilization, the GPU memory is constantly peaking at the 8GB limit. Can I assume the workload is being limited by the low GPU memory available (i.e. not maxing out the copy bandwidth because there's only so much that can be copied)?
Does that mean this is a workload better suited for the CPU (in this particular case with this RTX 2080 and in general with any card)?

GPU execution "flow" vs. CPU

On a general purpose CPU parallel processing is performed splitting calculation / problem into sub-problems, distributing them and running them in parallel on a number of cores on one or several sockets / servers.
What is the execution "flow" on a GPU from loading data to sending back results to the CPU ? What are key differences between execution on a GPU and execution on a CPU ?
Should we see a GPU as a "kind of CPU with a higher (huge) number smaller cores" or are there additionnal differences in nature ?
The fundamental difference in parallel processing between a CPU and a GPU is that CPUs are MIMD (Multi-Instruction-Multi-Data), while GPUs are SIMD (Single-Instruction-Multi-Data). In a multicore CPU, each core fetches its instructions and data independently of the others, whereas in a GPU there is only one instruction stream for a group of cores (typically 32 or 64). While there is only one instruction stream for the 32/64 cores, each of them is working on different data elements (typically located together in memory; more below). Such SIMD execution means that the GPU cores operate in a lock-stepped fashion.
For the above mentioned reason, a GPU can't be viewed as a "kind of CPU with a higher (huge) number smaller cores".
In order to support SIMD execution (also sometimes called wide-execution), we need wide fetch of input data. For a 32-wide execution, we fetch a contiguous 4B x 32 block = 128B that is consumed (typically) entirely by a 32-wide pipe. Contrast this to a MIMD multicore, where each of 32 CPU cores would fetch a separate instruction and then load from 32 different cachelines. This SIMD nature of (wide-) instruction/data fetch results in huge power savings compared to MIMD. As a result, for the same power budget, we can put more cores on a GPU (=> more HW parallelism) than a multicore-CPU.
The SIMD nature of GPUs is driven by applications that do exactly the same operation over very many input elements (e.g.; Image processing where we apply a filter on every pixel of say a 1024x768 image) so that wide instruction/data fetch works well. At the same time, applications where each core's computation is different (e.g., take if() when input data is zero, or else() if input data is 1) or each core needs to fetch data from a different page fail to take advantage of the SIMD nature of GPUs.
A partially related fact is that GPUs support applications (e.g., images/videos) that are streaming (almost zero data reuse) and have massive data-parallelism. Streaming means that we don't need huge caches like CPUs, and massive data-parallelism almost entirely cuts the need for HW coherence mechanisms.

How to allocate more CPU and RAM to SUMO (Simulation of Urban Mobility)

I have downloaded and unzipped sumo-win64-0.32.0 and running sumo.exe this on a powerful machine (64GB ram, Xeon CPU E5-1650 v4 3.6GHz) for about 140k trips, 108k edges, and 25k vehicles types which are departed in the first 30 min of simulation. I have noticed that my CPU is utilized only 30% and Memory only 38%, Is there any way to increase the speed by forcing sumo to use more CPU and ram, or possibly run in parallel? From "Can SUMO be run in parallel (on multiple cores or computers)?
The simulation itself always runs on a single core."
it appears that parallel processing is not possible t, but what about dedicating more CPU and ram?
Windows usually shows the CPU utilization such that 100% means all cores are used, so 30% is probably already more than one core and there is no way of increasing that with a single threaded application as sumo. Also if your scenario fits within RAM completely there is no point of increasing that. You might want to try one of the several parallelization approaches SUMO has but none of them got further than some toy examples (and none is in the official distribution) and the speed improvements are sometimes only marginal. Probably the best you can do is to do some profiling and find the performance bottlenecks and/or send your results to the developers.

Distributed Z3 and best hardware for each node

I'm thinking on starting a cluster of servers which will be running exclusively Z3 to solve SMT formulas.
Is there any way to clusterize several servers to join computational power and solve SMT formulas in a distributed fashion?
What are the recommendation characteristics of an system that will be running Z3 to be as fast as possible (regarding to hardware)?
Thank you!!
SAT/SMT solvers are usually very heavy on memory due to low cache hits. Therefore you can't run many processes on a CPU, otherwise they soon start degrading the performance of each other (i.e., running one process per core is not a good idea if you want to benchmark).
I can't give any specific recomendation, but I would choose CPUs that have fewer cores (say 4) and high memory bandwidth. These days CPUs have a fixed TDP and the fewer the CPUs the more powerful each one is -- and less contention for the memory.
Also you want to stick with little-endian architectures. At the moment, Z3 doesn't play well with big-endian archs (such as many ARMs, MIPS, SPARCs, etc). Moreover, for what I've seen, 64 bits usually helps.

Coprocessor accelerators compared to GPUs

Are coprocessors like Intel Xeon-Phi supposed to be utilized much like the GPUs, so that one should offload a large amount of blocks executing a single kernel, so that only the overall throughput the coprocessor handles results in a speed up, OR offloading independent threads (tasks) will increase the efficiency as well?
The Xeon Phi requires a large degree of both functional parallelism (different threads) and vector parallelism (SIMD). Since the cores are essentially enhanced Pentium processors, serial code runs slowly. This will change somewhat with the next generation as it'll use faster and more modern cores. The current Xeon Phi also suffers from the I/O bottleneck as does any coprocessor, having to communicate over a PCIe bus.
So though you could offload a kernel to every processor and exploit the 512-bit vectorization (similar to a GPGPU), you can also separate your code into many different functional blocks (i.e. different codes/kernels) and run them on different sets of Intel Xeon Phi cores. Again, the different blocks of code must also exploit the 512-bit SIMD vectors.
The Xeon Phi also operates as a native processor, so you can access other resources by mounting NFS directory trees, communication between cards and other processors in the cluster using TCP/IP, using MPI, etc. Note that this is not 'offload' but native execution. But the PCIe bus is still a significant bottle neck limiting I/O.
To summarize,
You can us an offload model similar to that used by GPGPUs,
The Xeon Phi itself also can support functional parallelism (more than one kernel) but each kernel must also exploit the 512-bit SIMD.
You can also write native code and use MPI, treating the Xeon Phi as a conventional (non-offload) node (always remembering the PCIe I/O bottleneck)