X86 DerivO3CPU CPI Issue - gem5

I used two_levels.py (the script from the tutorial) to run a simulation with the DerivO3CPU model. The simulation result shows a CPI value (cycles per instruction) lower than 1, which I don't think makes sense for a single-core system.
The only two things I modified in that script are the CPU type (TimingSimpleCPU -> DerivO3CPU) and the program being run (HelloWorld -> a benchmark program).
Did I set up DerivO3CPU incorrectly, or am I misunderstanding CPI?
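For reference, the two changes look roughly like this (a sketch from memory of the tutorial config, not the exact script; the benchmark path is a placeholder):

    # Excerpt sketch of the two modifications (from memory, not the exact
    # tutorial script). The benchmark path below is a placeholder.
    system.cpu = DerivO3CPU()          # was: system.cpu = TimingSimpleCPU()

    binary = 'path/to/my_benchmark'    # was: the x86 HelloWorld test binary
    process = Process()
    process.cmd = [binary]
    system.cpu.workload = process
    system.cpu.createThreads()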
Thank you very much for your help!

Related

Program Execution Optimization

I am writing a program for Parabolic Time Price Systems based on the book by J. Welles Wilder Jr.
I am halfway through the program, which currently runs with an execution time of 122 microseconds. This is way above the benchmark limit. What I am looking for are a few views and tips on whether I should
write a kernel-space program to achieve the same, implementing it through drivers.
[Really keen on this method] Is it possible? If yes, how and where should I start looking into passing instructions to a graphics driver to perform the steps and calculations? (I read this in a blog somewhere.)
Thanks in advance.
---> Programming in C
What makes a GPU very fast is the fact that it can run around ~2000 threads (depending on the card) asynchronously.
If your code can be divided into threads, then doing the calculations on the GPGPU might improve your performance, since average CPU speed is 50-100 GFLOPS while an average GPU reaches around 1500 GFLOPS when used correctly.
You might also want to consider the difficulties of maintaining GPGPU code. If you have an NVIDIA GPU, I suggest you check out 'Managed CUDA', since it includes a debugger and a GPU profiler, which makes it practical to work with.
TL;DR: use GPGPU only for code that can run asynchronously, and preferably use 'Managed CUDA' if possible.

How does machine code communicate with processor?

Let's take Python as an example. If I am not mistaken, when you program in it, the computer first "translates" the code to C. Then again, from C to assembly. Assembly is written in machine code. (This is just a vague idea that I have about this, so correct me if I am wrong.) But what is machine code written in, or, more exactly, how does the processor process its instructions, how does it "find out" what to do?
If I am not mistaken, when you program in it, the computer first "translates" the code to C.
No, it doesn't. C is nothing special except that it's the most widespread programming language used for system programming.
The Python interpreter translates the Python code into so-called P-Code, which is executed by a virtual machine. This virtual machine is the actual interpreter: it reads P-Code, and every blip of P-Code makes the interpreter execute a predefined code path. This is not unlike how native binary machine code controls a CPU. A more modern approach is to translate the P-Code into native machine code (just-in-time compilation).
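You can see that P-Code for yourself with the standard dis module; the exact opcode names vary between Python versions:

    # Show the bytecode ("P-Code") that the CPython virtual machine interprets.
    import dis

    def add(a, b):
        return a + b

    dis.dis(add)
    # Prints something like LOAD_FAST a / LOAD_FAST b / BINARY_ADD /
    # RETURN_VALUE (the names differ slightly between Python versions).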
The CPython interpreter itself is written in C and has been compiled into a native binary. Basically, a native binary is just a long series of numbers (opcodes), where each number designates a certain operation. Some opcodes tell the machine that a defined number of values following them are not opcodes but parameters.
The CPU itself contains a so-called instruction decoder, which reads the native binary number by number, and for each opcode it reads, it gives power to the circuit of the CPU that implements that particular opcode. There are opcodes that address memory, opcodes that load data from memory into registers, and so on.
how does the processor process its instructions, how does it "find out" what to do?
For every opcode, which is just a binary pattern, there is a dedicated circuit on the CPU. If the pattern of the opcode matches the "switch" that enables this opcode, its circuit processes it.
Here's a WikiBook about it:
http://en.wikibooks.org/wiki/Microprocessor_Design
A few years ago some guy built a whole, working computer from simple-function logic and memory ICs, i.e. with no microcontroller or similar involved. The whole project, called "Big Mess o' Wires", was more or less a CPU built from scratch. The only thing nerdier would have been building the thing from single transistors (which actually wouldn't have been that much more difficult). He also provides a simulator which allows you to see how the CPU works internally, decoding each instruction and executing it: Big Mess o' Wires Simulator
EDIT: Since I originally wrote this answer, building a fully fledged CPU from modern, discrete components has been done. For your consideration, a MOS 6502 (the CPU that powered the Apple II, Commodore C64, NES, BBC Micro and many more) built from discretes: https://monster6502.com/
Machine-code does not "communicate with the processor".
Rather, the processor "knows how to evaluate" machine-code. In the [widespread] Von Neumann architecture this machine-code (the program) can be thought of as an indexable array where each cell contains a machine-code instruction (or data, but let's ignore that for now).
The CPU "looks" at the current instruction (often identified by the PC or Program Counter) and decides what to do (this can either be done directly with transistors/"bare-metal", or it be translated to even lower-level code): this is known as the fetch-decode-execute cycle.
When the instructions are executed side-effects occur such as setting a control flag, putting a value in a register, or jumping to a different index (changing the PC) in the program, etc. See this simple overview of a CPU which covers the above a little bit better.
It is the evaluation of each instruction -- as it is encountered -- and the interaction of side-effects that results in the operation of a traditional processor.
(Of course, modern CPUs are very complex and do lots of neat tricky things!)
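As a very rough software sketch of that basic cycle (nothing like real hardware; the three-instruction machine below is made up purely for illustration):

    # Toy fetch-decode-execute loop for a made-up machine.
    # Opcodes: 1 = load immediate into register, 2 = add two registers,
    # 3 = jump to an index (change the PC), 0 = halt.
    program = [1, 0, 5,    # r0 = 5
               1, 1, 7,    # r1 = 7
               2, 0, 1,    # r0 = r0 + r1
               0]          # halt
    regs = [0, 0]
    pc = 0
    while True:
        opcode = program[pc]                      # fetch
        if opcode == 0:                           # decode + execute
            break
        elif opcode == 1:                         # load immediate
            regs[program[pc + 1]] = program[pc + 2]
            pc += 3
        elif opcode == 2:                         # add: side effect on a register
            regs[program[pc + 1]] += regs[program[pc + 2]]
            pc += 3
        elif opcode == 3:                         # jump: side effect on the PC
            pc = program[pc + 1]
    print(regs)   # [12, 7]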
That's called microcode. It's the code in the CPU that reads machine-code instructions and translates them into low-level data flow.
When the CPU encounters, for example, the add instruction, the microcode describes how it should get the two values, feed them to the ALU to do the calculation, and where to put the result.
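As a loose software analogy (not real microcode), that one add instruction expands into several smaller internal steps:

    # Software analogy: one "add r1, r2, r3" machine instruction expanded
    # into the smaller internal steps that microcode describes.
    regs = {"r1": 0, "r2": 5, "r3": 7}

    def micro_add(dst, src_a, src_b):
        alu_in_a = regs[src_a]            # route operand A to the ALU
        alu_in_b = regs[src_b]            # route operand B to the ALU
        alu_out = alu_in_a + alu_in_b     # the ALU performs the addition
        regs[dst] = alu_out               # write the result back to a register

    micro_add("r1", "r2", "r3")           # regs["r1"] is now 12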
Electricity. Circuits, memory, and logic gates.
Also, I believe Python is usually interpreted, not compiled through C → assembly → machine code.

Slow Parallel programming - MPI, VB.NET and FORTRAN

I'm working on parallelizing a software which simulates transport and flow process in the unsaturated soil zone. The software consists of a VB.NET user interface, and a FORTRAN DLL kernel to do the calculations.
I parallelized the software using the MPI.NET package in the VB.NET part. When the program is started with a number of processes, all of them except the master process go into a wait function, while the master process takes care of the interaction of the software with the user. When all the data required for the simulation has been entered, the master process enters the FORTRAN DLL and calls the other processes. These jump to the starting point of the function in the DLL, and together all the processes solve a linear system of equations about 10-20 times (the original partial differential equation is nonlinear, hence these iterations to gain accuracy in the solution). When the solution is computed, all the processes go back to VB.NET. This is done for all the timesteps of the simulation. When all steps are computed, the master process continues with the user interaction, while the other processes go back into the wait function until they are called again by the master process.
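To make the structure clearer, the control flow is roughly the following (sketched in Python with mpi4py purely for readability; the real code is VB.NET with MPI.NET calling into the FORTRAN DLL, and all names below are placeholders):

    # Rough sketch of the control flow described above. Python/mpi4py is
    # used only for illustration; the names are placeholders.
    from mpi4py import MPI

    NUM_TIMESTEPS = 3                      # placeholder

    def solve_linear_system_in_dll():      # placeholder for the PETSc call in the DLL
        pass

    comm = MPI.COMM_WORLD
    if comm.Get_rank() == 0:
        # Master: user interaction, then drive the simulation.
        for step in range(NUM_TIMESTEPS):
            comm.bcast("solve", root=0)    # wake the waiting processes
            solve_linear_system_in_dll()   # master joins the parallel solve
        comm.bcast("quit", root=0)
    else:
        # Workers: wait until called by the master, then join the DLL solve.
        while True:
            command = comm.bcast(None, root=0)
            if command == "quit":
                break
            solve_linear_system_in_dll()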
The thing is that this program runs much slower than the original, sequential version of it. Now there might be a number of reasons for this. I used the PETSc library in the FORTRAN DLL to solve the system of equations, and I think I have configured it quite well. My question is whether there could be a point or two in the architecture I described that could cause a significant slowdown if not handled correctly. I'm not sure, for example, whether the repeated calls into the DLL function can cost a lot of time.
My system is an Intel Xeon 3470 processor with 8 GB RAM. The systems I tried to solve had up to 120,000 unknowns, which I know is at the very lower bound of what should be calculated in parallel, but at least with the 120,000-unknown matrix I would have expected better performance than I measured.
Thanks in advance for your thoughts,
Martin
I would say that 120,000 degrees of freedom and 10-20 iterations is not that large a problem. Million degree of freedom problems were done when I did finite element analysis for a living, and that was 16 years ago.
Is it possible to solve it using an in-memory solver, without parallelization, with 8GB of RAM? That would certainly be your benchmark. Is that what you're comparing your parallel results to?
Are the parallel processes running on different processors or different machines? Parallelization doesn't buy you anything if everything is done on a single processor. You have to context-switch and time-slice processes, and there's overhead associated with MPI to communicate between processes. I would expect a parallel solution on a single processor to run more slowly than a single-threaded, in-memory solution.
If you have multiple processes, then I'd say it's a matter of tuning. I'd plot performance versus number of parallel processes. If there's a speedup, you should find that it improves with more processes until you reach a saturation point, beyond which the overhead is greater than the benefit.
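Something like the following is enough to collect that curve (assuming the solver can be launched from the command line via mpiexec; the executable name is just a placeholder):

    # Collect runtime versus process count; "solver.exe" is a placeholder.
    import subprocess
    import time

    for n in (1, 2, 3, 4):
        start = time.perf_counter()
        subprocess.run(["mpiexec", "-n", str(n), "solver.exe"], check=True)
        print(f"{n} processes: {time.perf_counter() - start:.1f} s")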
If you have multiple cores, when you run your program sequentially, can you see that only one or a few processors are utilized?
If the load in the sequential case is high and evenly distributed over all cores then IMHO there is no need to parallelize your program.
My system has a Xeon 3470, which is a quad-core processor, so the computations are all done on these 4 cores on one machine. I don't run the program with more than 4 processes, of course. The old solver that the software had was sequential, and it still runs faster than the parallel version. When I plot the number of processes against runtime, I see that runtime even increases a little for smaller models - but that is to be expected because of the communication overhead.
In both the sequential and the parallel case all 4 cores are utilized, and the load balance between them is acceptable.
Like I said, I know that the models I've tested so far are not ideal for judging parallel performance. I was just wondering whether, besides the communication overhead due to MPI, there could be another point that could lead to the slowdown of the program.

Creating a heater application

This might seem weird, but I'm interested in creating an electric heater out of my computer, that is, programming an application that heats up my PC, and I need some help.
I currently have an application that runs infinite loops on the GPU (using a little shader) and on the CPU cores. However, I'm also interested in getting the RAM going, as well as the various output ports. For the RAM heating, do I just allocate memory and start randomly accessing and writing to it using all 8 cores?
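For the RAM part, this is roughly what I have in mind (a rough, untested sketch; buffer size and process count are arbitrary):

    # Rough sketch: one process per core allocates a large buffer and keeps
    # writing to random locations in it.
    import multiprocessing as mp
    import random

    def hammer_ram(megabytes=512):
        buf = bytearray(megabytes * 1024 * 1024)
        n = len(buf)
        while True:
            i = random.randrange(n)
            buf[i] = (buf[i] + 1) & 0xFF      # touch a random byte

    if __name__ == "__main__":
        for _ in range(mp.cpu_count()):
            mp.Process(target=hammer_ram, daemon=True).start()
        input("Heating... press Enter to stop.\n")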
And what about triggering the CD-ROM, floppy, etc. - how do I do that?
How about a heater with a purpose? Just run World Community Grid: create tons of heat while making your computer do valuable computations for science. It runs the processors wide open, is stable, and isn't just wasting cycles.
Have a look at How to stress test a computer. If you're interested in making your own, try searching for open-source stress-test software that you could modify to your liking.
Use FurMark together with LinX/Prime95. Max out your settings. Make sure you have a strong enough PSU.
There's a torture test option for the CPU & RAM in Prime95 that looks like what you want. As for the GPU, there is FurMark, which achieves the same kind of stress.
The heat from the other components will likely not be relevant (unless you have something really specific like a PhysX card) if you stress your CPU and GPU enough, IMHO.

Optimizing for ARM: Why different CPUs affect different algorithms differently (and drastically)

I was doing some benchmarks of code performance on Windows Mobile devices, and noticed that some algorithms were doing significantly better on some hosts and significantly worse on others, even after taking the difference in clock speeds into account.
The statistics for reference (all results are generated from the same binary, compiled by Visual Studio 2005 targeting ARMv4):
Intel XScale PXA270
Algorithm A: 22642 ms
Algorithm B: 29271 ms
ARM1136EJ-S core (embedded in a MSM7201A chip)
Algorithm A: 24874 ms
Algorithm B: 29504 ms
ARM926EJ-S core (embedded in an OMAP 850 chip)
Algorithm A: 70215 ms
Algorithm B: 31652 ms (!)
I checked out floating point as a possible cause; while algorithm B does use floating-point code, it does not use it in the inner loop, and none of the cores seem to have an FPU.
So my question is: what mechanism may be causing this difference, preferably with suggestions on how to fix/avoid the bottleneck in question.
Thanks in advance.
One possible cause is that the 926 has a shorter pipeline (5 stages vs. 8 for the 1136, IIRC), so branch mispredictions are less costly on the 926.
That said, there are a lot of architectural differences between those processors, too many to say for sure why you see this effect without knowing something about the instructions that you're actually executing.
Clock speed is only one factor. Bus width and latency are as big a factor, if not bigger. Cache is a factor. So is the speed of the media the program is run from, if it is run from media and not memory.
Is this test using any shared libraries at any point, or is it all internal code? Fetching shared libraries from media will vary from platform to platform (even if it is, say, the same SD card).
Is this the same algorithm compiled separately for each platform, or the same binary? You can and will see some compiler-induced variation as well. 50% faster or slower can easily come from the same compiler on the same platform just by varying compiler settings. If possible you want to execute the same binary, and ensure that no shared libraries are used in the loop under test. If it is not the same binary, disassemble the loop under test for each platform and ensure that there are no variations other than register selection.
From the data you have presented, it's difficult to pinpoint the exact problem, but we can share some prior experience.
Cache settings: check whether all the processors have the same cache configuration. You need to check both the D-cache and the I-cache.
For analysis, break down your code further, not just per algorithm but at a block level, and try to understand which block causes the bottleneck. Once you find that block, try to disassemble its source code and check the assembly. It may help.
It looks like the problem is in the cache settings or something memory-related (maybe I-cache "overflow").
Pipeline stalls and branch mispredictions usually cause less significant differences.
You can try to count some basic operations executed in each algorithm, for example:
number of "easy" arithmetic/bitwise ops (+ - | ^ &) and shifts by a constant
number of shifts by a variable
number of multiplications
number of "hard" arithmetic operations (divisions, floating-point ops)
number of aligned memory reads (32-bit)
number of byte memory reads (8-bit) (slower than 32-bit)
number of aligned memory writes (32-bit)
number of byte memory writes (8-bit)
number of branches
something else, I don't remember more :)
This will give you an idea of which operations the 926 handles much more slowly. After that, you can check the suspicious blocks by making the use of those operations more or less intensive, and you'll get the answer.
Furthermore, it's much better to enable assembly listing generation in VS and use that (rather than your high-level source code) as the basis for this research - for example, with a quick tally like the sketch below.
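A quick-and-dirty way to do such a tally over the generated listing might look like this (the mnemonic grouping and the file name are rough placeholders, not an exhaustive classification):

    # Rough tally of instruction classes from a generated ARM assembly
    # listing. Mnemonic groups and file name are placeholders.
    import re
    from collections import Counter

    CLASSES = {
        "multiply": ("mul", "mla", "smull", "umull"),
        "load":     ("ldr", "ldrb", "ldrh", "ldm"),
        "store":    ("str", "strb", "strh", "stm"),
        "branch":   ("b", "bl", "bx", "beq", "bne", "blt", "bgt"),
    }

    counts = Counter()
    with open("algorithm_b.asm") as listing:
        for line in listing:
            match = re.match(r"\s*([a-z]+)", line, re.IGNORECASE)
            if not match:
                continue
            mnemonic = match.group(1).lower()
            for name, mnemonics in CLASSES.items():
                if mnemonic in mnemonics:
                    counts[name] += 1
    print(counts)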
P.S.: maybe the problem is in the OS/software/firmware? Did you test on a clean system? Is the OS the same on all devices?