CPU pipeline stalls visualization - optimization

I saw a few graphs and tables showing how well the CPU instructions are interleaved. For example:
time →                          total: 7
1  B = a + b    ● ● ●
2  C = c + d      ● ● ●
3  A = B * C        ○ ○ ● ● ●
which I got from Playing with the CPU pipeline.
My question is twofold: how do you find the stalls in the first place, and how do you visualize them in a readable way? I mean, what software is used to look at and optimize code at such a level?

Short answer: in most cases, no software is used to "look at the stalls". Stalls are predictable and can be found without even touching a computer. You know when they will happen, and you can draw them however you like.
Full story:
First of all, you have to understand pipelining.
Each "action" that has to be taken to process an instruction is executed on separate hardware (separate parts of your CPU). If there are 7 steps, you could be handling 7 instructions at the same time.
(source: http://www.phatcode.net/res/260/files/html/CPUArchitecturea3.html)
This image visualizes this pipelining. You can see multiple instructions shifting through the CPU. As soon as instruction 1's opcode is retrieved, it doesn't need the opcode hardware anymore. Instruction 2's opcode can now be retrieved. The same goes for all the other stages.
The important thing to notice here is that the values for instruction 2 are loaded before instruction 1 has finished. This is possible if the values of instruction 2 do not depend on instruction 1. If they do depend on instruction 1, instruction 2 needs to be stalled: it waits in place. Instead of at T5, its values will be retrieved at T6. At that point instruction 1 has stored its result, so instruction 2 can proceed.
This is what you see with instructions 1 and 2: they're independent, so the next instruction can be executed without any stalls. However, 3 depends on 1 and 2, which means it has to wait until both results are stored.
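As a rough C sketch of the same idea (the variable names mirror your diagram; D = e + f is an extra, made-up independent statement showing how a stall slot can be filled with useful work):

int pipeline_example(int a, int b, int c, int d, int e, int f)
{
    int B = a + b;   /* instruction 1: no dependencies, starts immediately      */
    int C = c + d;   /* instruction 2: independent of 1, overlaps with it       */
    int D = e + f;   /* extra independent work that can fill one stall slot     */
    int A = B * C;   /* instruction 3: must wait until B and C have been stored */
    return A + D;
}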
To answer your question now:
How did we know? We saw it, without using any tool. How did we visualize it? The same way we would visualize any other data: you can choose whatever form you like, as long as it's easy to understand.
Please note that this is a simplified answer, in order to keep it understandable. Pipelining and processor optimizations are far more advanced in modern computers. For example, there are (conditional) jumps, which can cause instructions 2, 3 and 4 to be skipped, and all of a sudden another instruction has to be loaded into the pipeline because of the jump. You can find a lot about this (both simplified and advanced) when searching for pipelining.
More detailed information on this topic:
http://www.phatcode.net/res/260/files/html/CPUArchitecturea3.html section 4.8.2. (This is what I found while googling to refresh my memory, but it looks like pretty good information)

Related

How is pipelining implemented? Can we read the firmware of a modern microprocessor?

My question has two related parts.
First, most modern microprocessors implement pipelining and other means to execute code faster. How do they implement it? I mean, is it done in firmware or something else?
Second, if it is firmware, is it possible for me to read that firmware and look at the code?
Apologies if this is a stupid question; I have little idea of how microprocessors work.
Pipelining in processor design is a hardware concept: the idea that a stream of instructions can execute faster by exploiting a bit of parallelism in the flow of processing an instruction and by breaking up critical paths in logic. In hardware, for a given design (technically its implementation), you can only "run" it so fast; i.e. it takes some time for signals to propagate through all the logic. The longest time it could take in the worst case is the critical path and defines a maximum time (or frequency) the design can run at (this is where maximum clock speed comes from).
Now, processing an instruction in the simplest processor can be broken into three big parts: fetching the instruction from memory (i.e. fetch), decoding the instruction into its parts (decode), and actually executing the instruction (execute). For every instruction, it is fetched, decoded and executed; then the next instruction, then the next, and so on.
The hardware for each of these stages has a critical path, i.e. a maximum time it can take in the worst case (Tmax_fetch for the fetch stage, Tmax_decode for decode, Tmax_exec for execute). So, for a processor without pipelining (or single-cycle), the critical path for the full processor would be the sum of these stage critical paths (this isn't necessarily true in real designs, but we will use it as a simplified example): Tmax_inst = Tmax_fetch + Tmax_decode + Tmax_exec. So, to run through four instructions, it would take 4 * Tmax_inst = 4 * Tmax_fetch + 4 * Tmax_decode + 4 * Tmax_exec.
Pipelining allows us to break up these critical paths using hardware registers (not unlike the programmer's registers; r2 in ARM is an example), but these registers are invisible to the firmware. Now, instead of Tmax_inst being the sum of the stages, it's just three times the largest of the stages, Tmax_inst = 3 * Tmax_stage = 3 * max(Tmax_fetch, Tmax_decode, Tmax_exec), since the processor has to "wait" for the slowest stage to finish in the worst case. The processor is now slower for a single instruction, but due to the pipeline we can do each of these stages independently as long as there isn't a dependency between the instructions being processed in each stage (like a branch instruction, where the fetch stage can't run until the branch is executed). So, for four independent instructions, the processor will only take Tmax_stage * (3 + 4 - 1), as the pipeline allows the first instruction to be fetched, then decoded at the same time the second instruction is fetched, and so on.
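To make the arithmetic concrete, here is a small C sketch using made-up stage delays (the numbers are purely illustrative, not from any real processor):

#include <stdio.h>

int main(void)
{
    /* Hypothetical worst-case stage delays in nanoseconds (made-up numbers). */
    double t_fetch = 2.0, t_decode = 1.0, t_exec = 3.0;
    int stages = 3, instructions = 4;

    /* Single-cycle: every instruction pays the sum of all three stages. */
    double t_single = instructions * (t_fetch + t_decode + t_exec);   /* 4 * 6 = 24 ns */

    /* Pipelined: the clock period is set by the slowest stage. */
    double t_stage = t_exec;                     /* max(t_fetch, t_decode, t_exec) */
    double t_pipelined = t_stage * (stages + instructions - 1);       /* 3 * 6 = 18 ns */

    printf("single-cycle: %.0f ns, pipelined: %.0f ns\n", t_single, t_pipelined);
    return 0;
}

With these made-up numbers the four instructions take 24 ns without pipelining and 18 ns with it, and the gap grows as more independent instructions are fed in.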
This should hopefully help explain pipelining better, but to answer your questions directly:
It's a hardware design concept, so implemented in hardware, not firmware
As it's a hardware concept, there is no firmware code to read.

PITest: JavaLaunchHelper is implemented in both

Recently I started using PITest for mutation testing. After building my project with Maven, when I run the command mvn org.pitest:pitest-maven:mutationCoverage I get this error a bunch of times:
-stderr : objc[2787]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be ustderr : sed. Which one is undefined.
Sometimes the error is followed by
PIT >> WARNING : Slave exited abnormally due to MEMORY_ERROR
or PIT >> WARNING : Slave exited abnormally due to TIMED_OUT
I use OS X version 10.10.4 and Java 8 (jdk1.8.0_74).
Any fix/ work-around for this?
Don't worry about this;
-stderr : objc[2787]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/bin/java and /Library/Java/JavaVirtualMachines/jdk1.8.0_74.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will be ustderr : sed. Which one is undefined.
This is just informational: there are two implementations of JavaLaunchHelper, and the message tells you that one of the two will use the stderr output stream, but it is undetermined which one. It is a known issue; see also this question.
The other two are a result of what PIT is doing: it modifies the byte code, and it may happen that this not only affects the output of an operation (detected by a test) but actually affects the runtime behavior, for example if the boundaries of a loop get changed in such a way that the loop runs endlessly. PIT is capable of detecting this and prints out an error. Mutations detected by either a memory error or a timeout error can be considered "killed", but you should check each of them individually as they could be false positives, too.
PIT >> WARNING : Slave exited abnormally due to MEMORY_ERROR
means the modified code produces more or larger objects, so the forked JVM runs out of memory. Imagine a loop like this:
while (a < b) {
    list.add(new Object());
    a++;
}
Now imagine the a++ gets altered to a--. The loop may eventually end, but it's more likely that you run out of memory before that.
From the documentation
A memory error might occur as a result of a mutation that increases the amount of memory used by the system, or may be the result of the additional memory overhead required to repeatedly run your tests in the presence of mutations. If you see a large number of memory errors consider configuring more heap and permgen space for the tests.
The timeout issue is similar: the reason could be either that you run into an infinite loop, or that the system thinks you run into an infinite loop, i.e. when the system is too slow to compute the altered code. If you experience a lot of timeouts you should consider increasing the timeout value. But be careful, as this may impact the overall execution time.
From the FAQ
Timeouts when running mutation tests are caused by one of two things:
1. A mutation that causes an infinite loop
2. PIT thinking an infinite loop has occurred but being wrong
In order to detect infinite loops PIT measures the normal execution time of each test without any mutations present. When the test is run in the presence of a mutation PIT checks that the test doesn’t run for any longer than
normal time * x + y
Unfortunately the real world is more complex than this.
Test times can vary due to the order in which the tests are run. The first test in a class may have an execution time much higher than the others, as the JVM will need to load the classes required for that test. This can be particularly pronounced in code that uses XML binding frameworks such as JAXB, where classloading may take several seconds.
When PIT runs the tests against a mutation, the order of the tests will be different. Tests that previously took milliseconds may now take seconds as they now carry the overhead of classloading. PIT may therefore incorrectly flag the mutation as causing an infinite loop.
A fix for this issue may be developed in a future version of PIT. In the meantime, if you encounter a large number of timeouts, try increasing y in the equation above to a large value with --timeoutConst (timeoutConstant in Maven).
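To make the formula concrete with made-up numbers: if a test normally runs in 200 ms, and x = 1.25 and y = 4000 ms (hypothetical values; check the PIT documentation for the actual defaults of your version), a mutated run is only flagged as a timeout once it exceeds 200 * 1.25 + 4000 = 4250 ms. Raising y with timeoutConstant therefore gives slow first runs (e.g. ones with heavy classloading) considerably more headroom.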

Manage multiple instances of a process automatically

I have a program that takes about 1 second to run; it takes a file as input and produces another file as output. The problem is that I have to be able to process about 30 files a second. The files to process will be available in a queue (implemented over memcached) and don't have to be processed exactly in order, so basically an instance of the program checks out a file to process and does so. I could use a process manager that automatically launches instances of the program when system resources are available.
At the simple end, "system resources" will simply mean "up to two processes at a time", but if I move to a different machine this could be 2 or 10 or 100 or whatever. I could use a utility to handle this, at least. And at the complex end, I would like to bring up another process whenever CPU is available, since these machines will be dedicated. CPU time seems to be the constraining resource; the program isn't memory intensive.
What tool can accomplish this sort of process management?
Storm - Without knowing more details, I would suggest Backtype Storm. But it would probably mean a total rewrite of your current code. :-)
More details in the Tutorial, but it basically takes tuples of work and distributes them through a topology of worker nodes. A "spout" emits work into the topology and a "bolt" is a step/task in the graph where some bit of work takes place. When a bolt finishes its work, it emits the same/a new tuple back into the topology. Bolts can do work in parallel or in series.
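If the simple end of your requirement ("up to N processes at a time") is all you need and you would rather not adopt a framework, a plain fork()/waitpid() loop already does the job. The following is a rough POSIX C sketch, not Storm: ./worker is a stand-in for your 1-second program, which is assumed to check out its own file from the memcached queue.

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WORKERS 2            /* raise to 10, 100, ... on a bigger machine */

int main(void)
{
    int running = 0;

    for (;;) {
        /* Top up the pool until MAX_WORKERS instances are running. */
        while (running < MAX_WORKERS) {
            pid_t pid = fork();
            if (pid == 0) {
                /* Child: run one instance; it checks out a file, processes it, exits. */
                execl("./worker", "worker", (char *)NULL);
                _exit(127);      /* only reached if exec fails */
            } else if (pid > 0) {
                running++;
            } else {
                perror("fork");
                sleep(1);        /* back off and retry */
                break;
            }
        }

        /* Block until any child exits, then refill the pool. */
        if (wait(NULL) > 0)
            running--;
    }
}

Tools like GNU parallel can give you the same "at most N at a time" behaviour from the shell, while Storm additionally buys you distribution across machines.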

What exactly is a dual-issue processor?

I came across several references to the concept of a dual-issue processor (I hope this even makes sense in a sentence). I can't find any explanation of what exactly dual issue is. Google gives me links to microcontroller specifications, but the concept isn't explained anywhere. Here's an example of such a reference. Am I looking in the wrong place? A brief paragraph on what it is would be very helpful.
Dual issue means that each clock cycle the processor can move two instructions from one stage of the pipeline to the next stage. Where this happens depends on the processor and the company's terminology: it can mean that two instructions are moved from a decode queue to a reordering queue (Intel calls this issue), or it could mean moving instructions (or micro-operations or something) from a reordering queue to an execution port (AFAIK IBM calls this issue, while Intel calls it dispatch).
But broadly speaking, it usually means you can sustain executing two instructions per cycle.
Since you tagged this ARM, I think they're using Intel's terminology. Cortex-A8 and Cortex-A9 can, each cycle, fetch two instructions (more in Thumb-2), decode two instructions, and "issue" two instructions. On Cortex-A8 there's no out of order execution, although I can't remember if there's still a decode queue that you issue to - if not you'd go straight from decoding instructions to inserting them into two execution pipelines. On Cortex-A9 there's an issue queue, so the decoded instructions are issued there - then the instructions are dispatched at up to 4 per cycle to the execution pipelines.

How to check the Heap and Stack RAM consistency on an embedded system

I'm working on a project using a LEON2 Processor (Sparc V8).
The processor uses 8 MB of RAM that needs to be consistency-checked during the self-test of my boot code.
My issue is that my boot code obviously uses a small part of the RAM for its heap/BSS/stack, which I cannot modify without crashing my application.
My RAM test is very simple: write a certain value to every RAM address, then read them back to make sure the RAM chip can be addressed.
This method can be used for most of the available RAM, but how can I safely check the consistency of the remaining RAM?
Generally a RAM test that needs to test every single byte will be done as one of the first things that happens when the processor starts. Often the only other thing that's done before it is the hardware initialization that needs to happen for the RAM test to be able to access RAM.
It'll usually be done in assembly language with interrupts disabled, partly because that's about the only way you can ensure that no RAM is used.
If you want to perform a RAM test after that point, you still need to do it pretty early in the system start-up. You could maybe do it in two passes: one where any variables/stack/whatever the test needs for its own purposes are in low RAM while the test tests high RAM, and then a second run with its data in high RAM while it tests low RAM.
Another note: verifying that you read back a certain value you wrote is a simple test that may be better than nothing, but it can miss certain types of common failures (particularly common with external RAM: missing or cross-soldered address lines).
You can find more detailed information about basic RAM tests here:
Jack Ganssle, "Testing RAM in Embedded Systems"
Michael Barr, "Fast Accurate Memory Test Suite"
As I am programming a safety-relevant device, I have to do a full RAM test during operation.
I split the test into two tests:
Addressing test:
You write unique values to the addresses reached by each address line, and after all values are written, the values are read back and compared to the expected values. This test detects short-circuits (or stuck-at-low/high faults) of address lines (meaning you want to write 0x55 to address 0xFF40, but due to a short-circuit the value is stored at 0xFF80 instead; you cannot detect this with test 2).
Pattern test:
You save e.g. the first 4 bytes of RAM in the CPU's registers, and then for those cells you first clear them, write 0x55, verify, write 0xAA, verify, and restore the saved content (you can use other patterns, of course), and so on. The reason you have to use the registers is that a variable used for this purpose would itself be destroyed by the test. A rough sketch follows below.
You can even test your stack with this test.
In our project, we test 4 cells at a time, and we have to run this test repeatedly until the whole RAM has been tested.
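Here is a rough C sketch of the pattern part, for RAM that the test code itself does not live in (testing the cells under your own stack and variables still has to be done from CPU registers, i.e. in assembly, exactly as described above). RAM_START and RAM_SIZE are placeholders for your memory map:

#include <stdint.h>

#define RAM_START ((volatile uint8_t *)0x40000000u)   /* placeholder base address */
#define RAM_SIZE  (8u * 1024u * 1024u)                /* 8 MB, as in the question */

/* Returns 0 on success, nonzero at the first failing cell. */
static int ram_pattern_test(volatile uint8_t *base, uint32_t size)
{
    static const uint8_t patterns[] = { 0x00u, 0x55u, 0xAAu, 0xFFu };

    for (uint32_t i = 0; i < size; i++) {
        uint8_t saved = base[i];                  /* preserve current content */

        for (uint32_t p = 0; p < sizeof patterns; p++) {
            base[i] = patterns[p];
            if (base[i] != patterns[p]) {         /* read back and verify */
                base[i] = saved;
                return 1;
            }
        }
        base[i] = saved;                          /* restore original content */
    }
    return 0;
}

Note that this per-cell pattern test on its own does not catch shorted or stuck address lines; that is exactly what the addressing test above is for.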
I hope that helped a bit.
If you do your testing before the C runtime environment is up, you can trash the heap and BSS areas without any problems.
Generally the stack does not get used much during runtime setup, so you may be able to trash it with no ill effect. Just check your system.
If you need to use the stack during testing, or need to preserve it, just move it to an already-tested area by adjusting the stack pointer. Afterwards, restore the old stack and continue.
There is no easy way of doing this once you have entered your runtime environment.