What is the best way to measure time in the gem5 simulation environment?

I am running a small matrix multiplication program in the gem5 simulation environment and want to measure the execution time of the program. The program is in Fortran, and I use cpu_time before and after the matrix multiplication routine to get the time. Is there a better way to measure time in the gem5 environment?

The standard way of measuring stats for a given binary using gem5 in Full System mode is to provide an rcS script via the --script parameter:
./build/ARM/gem5.fast ... your_options... --script=./script.rcS
Your script should contain m5ops to reset and dump stats as required. An example script.rcS:
m5 resetstats
/bin/yourbinary
m5 dumpstats
Then from the stats.txt you can take the execution time (sim_seconds) or whatever stat you require. If you're using Syscall Emulation mode, you can directly check stats.txt without the need for an rcS script.
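For example, assuming the default m5out output directory:
grep sim_seconds m5out/stats.txt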

You can also add resetstats / dumpstats magic assembly instructions directly inside your benchmarks as shown at: How to count the number of CPU clock cycles between the start and end of a benchmark in gem5? E.g. in aarch64:
/* resetstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1")
/* dumpstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1")
You then likely want to look at system.cpu.numCycles, which shows how many CPU cycles passed.
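For example, a minimal C wrapper (a sketch; matmul is a hypothetical stand-in for your region of interest) so that the dumped stats cover only the code you care about:
/* resetstats/dumpstats wrappers for aarch64, using the encodings above */
static inline void m5_resetstats(void) {
    __asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1");
}
static inline void m5_dumpstats(void) {
    __asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1");
}

void run_benchmark(void) {
    m5_resetstats();
    matmul();        /* hypothetical region of interest */
    m5_dumpstats();
}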

You can of course look into different stat files depending on your build, but I think the easiest way is to prefix your simulation command with time:
time ./build/ARM/gem5.fast ... your_options... --script=./script.rcS ...

Related

Is TensorRT "floating-point 16" precision mode non-deterministic on Jetson TX2?

I'm using TensorRT FP16 precision mode to optimize my deep-learning model, and I use this optimized model on a Jetson TX2. While testing the model, I have observed that the TensorRT inference engine is not deterministic: my optimized model gives FPS values that vary between 40 and 120 for the same input images.
I started to think that the source of the non-determinism is floating-point operations when I saw this comment about CUDA:
"If your code uses floating-point atomics, results may differ from run
to run because floating-point operations are generally not
associative, and the order in which data enters a computation (e.g. a
sum) is non-deterministic when atomics are used."
Does the precision mode (FP16, FP32 or INT8) affect the determinism of TensorRT? Or is it something else?
Do you have any thoughts?
Best regards.
I solved the problem by replacing the clock() function that I was using to measure latencies. clock() measures CPU time, but what I want to measure is real (wall-clock) time. Now I am using std::chrono to measure the latencies, and the inference results are latency-deterministic.
This was the wrong one (clock()):
#include <stdio.h>
#include <time.h>

int main()
{
    clock_t t = clock();
    inferenceEngine(); // Do the inference
    t = clock() - t;
    printf("It took me %ld clicks (%f seconds).\n", (long)t, ((float)t) / CLOCKS_PER_SEC);
    return 0;
}
Use CUDA events like this (cudaEvent):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
inferenceEngine(); // Do the inference
cudaEventRecord(stop);
cudaEventSynchronize(stop); // wait until the stop event has completed before reading the time
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
Use chrono like this: (std::chrono)
#include <iostream>
#include <chrono>
#include <ctime>
int main()
{
    auto start = std::chrono::system_clock::now();
    inferenceEngine(); // Do the inference
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    std::time_t end_time = std::chrono::system_clock::to_time_t(end);
    std::cout << "finished computation at " << std::ctime(&end_time)
              << "elapsed time: " << elapsed_seconds.count() << "s\n";
}
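To see why clock() misled the measurements, here is a minimal contrast of the two clocks (a sketch; the sleep is a stand-in for asynchronous GPU work that burns no CPU time on this thread):
#include <chrono>
#include <cstdio>
#include <ctime>
#include <thread>

int main()
{
    std::clock_t c0 = std::clock();
    auto t0 = std::chrono::steady_clock::now();
    // stand-in for work that does not burn CPU time on this thread
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    std::clock_t c1 = std::clock();
    auto t1 = std::chrono::steady_clock::now();
    std::printf("clock():     %.3f s of CPU time\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
    std::printf("std::chrono: %.3f s of wall time\n", std::chrono::duration<double>(t1 - t0).count());
    // clock() reports near zero here, while chrono reports ~0.5 s
}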

How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?

I'm interested in all of the following cases:
full system userland benchmark. Maybe the m5 guest tool has a way to do it?
bare metal benchmark. When gem5 exits it dumps the stats automatically, so the main question is how to skip the cycles for the bootloader and go straight to the benchmark itself.
Is there a way besides modifying the benchmark source with instrumentation instructions? How to write those instrumentation instructions in detail?
syscall emulation benchmark. I think gem5 just outputs the stats.txt at the end of the run, and then you can just grep system.cpu.numCycles, but I have to confirm it; currently blocked on: How to solve "FATAL: kernel too old" when running gem5 in syscall emulation SE mode?
I want to use this to learn:
learn how CPUs work
how to optimize assembly code or compiler settings to run optimally on a given CPU
m5 tool
A good approximation is to run, ideally from a shell script that is the /init program:
m5 resetstats
run-benchmark
m5 dumpstats
Then on host:
grep -E '^system.cpu.numCycles ' m5out/stats.txt
Gives something like:
system.cpu.numCycles 33942872680 # number of cpu cycles simulated
Note that if you replay from a m5 checkpoint with a different CPU, e.g.:
--restore-with-cpu=HPI --caches
then you need to grep for a different identifier:
grep -E '^system.switch_cpus.numCycles ' m5out/stats.txt
resetstats zeroes out the cumulative stats, and dumpstats dumps what has been collected during the benchmark.
This is not perfect, since there is some time between the exec syscall for m5 resetstats finishing and the benchmark starting, but if the benchmark runs long enough this shouldn't matter.
http://arm.ecs.soton.ac.uk/wp-content/uploads/2016/10/gem5_tutorial.pdf also proposes a few more heuristics:
#!/bin/sh
# Wait for system to calm down
sleep 10
# Take a checkpoint in 100000 ns
m5 checkpoint 100000
# Reset the stats
m5 resetstats
run-benchmark
# Exit the simulation
m5 exit
m5 exit also works, since gem5 dumps the stats when it finishes.
Instrumentation instructions
Sometimes it seems inevitable that you have to modify the benchmark source a bit with those instructions in order to:
skip initialization and go directly to steady state
evaluate individual main loop runs
You can of course deduce those instructions from the gem5 m5 tool code, but here are some very easy-to-reuse one-line copy-pastes for arm and aarch64, e.g. for aarch64:
/* resetstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1")
/* dumpstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1")
The m5 tool uses the same mechanism under the hood, but by adding the instructions directly into the source we avoid the syscall, and are therefore more precise and representative (at the cost of more manual work).
To ensure that the assembly is not reordered around your ROI by the compiler, however, you might want to use the techniques mentioned at: Enforcing statement order in C++
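One common technique (a sketch, not from the original answer) is to add a "memory" clobber to the magic instructions themselves, which acts as a compiler barrier and stops memory accesses from being moved across them:
/* resetstats with a "memory" clobber acting as a compiler barrier */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);"
                      : : : "x0", "x1", "memory");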
Address monitoring
Another technique that can be used is to monitor addresses of interest instead of adding magic instructions to the source.
E.g., if you know that a benchmark starts with PC == 0x400, it should be possible to do something when that address is hit.
To find the addresses of interest, you would have to use, for example, readelf or gdb or tracing, and if running full system on top of Linux, ensure that ASLR is turned off.
This technique would be the least intrusive one, but the setup is harder, and to be honest I haven't done it yet. One day, one day.

cuda uncorrectable ECC error encountered

My environment is
Windows 7 x64
Matlab 2012a x64
Cuda SDK 4.2
Tesla C2050 GPU
I am having trouble figuring out why my GPU is crashing with the "uncorrectable ECC error encountered". This error only occurs when I use 512 threads or more. I can't post the kernel, but I will try to describe what it does.
In general, the kernel takes a number of parameters and produces 2 complex matrices defined by the thread size M and another number N, so the returned matrices will be of size MxN. A typical configuration is 512x512, but each number is independent and can vary up or down. The kernel works when the numbers are 256x256.
Each thread (kernel) extracts a 999-element vector out of a 2D array based on the thread id, i.e. size 999xM, then cycles through the rows (0 .. N-1) of the output matrices for calculation. A number of intermediate parameters are calculated, using only pow, sin and cos among the + - * / operators. To calculate one of the output matrices, an additional loop needs to be executed to sum up the contribution of the 999-element vector that was extracted earlier. This loop does some intermediate calculations to determine a range of values that will allow contribution. The contribution is then scaled by a factor determined by the cos and sin values of a calculated fractional value. This is where it crashes: if I stick in a constant value of 1.0 or any other for that matter, the kernel executes without trouble; however, when only one of the calls (cos or sin) is included, the kernel crashes.
Some pseudocode follows:
kernel()
{
    /* Extract 999-element vector from 2D array 999xM - one vector per thread. */
    for (int i = 0; i < 999; i++)
    {
        .....
    }
    /* Cycle through the 2nd dimension of the output matrices */
    for (int j = 0; j < N; j++)
    {
        /* Calculate some intermediate variables */

        /* Calculate the real and imaginary components of the first output matrix */
        /* real = cos(value), imaginary = sin(value) */

        /* Construct the first output matrix from some intermediate variables
           and the real and imaginary components */

        /* Calculate some more intermediate variables */

        /* Cycle through the extracted vector (0 .. 998) */
        for (int k = 0; k < 999; k++)
        {
            /* Calculate some more intermediate variables */

            /* Determine the range of allowed values that contribute to the
               second output matrix. */

            /* Calculate the real and imaginary components of the second output matrix */
            /* real = cos(value), imaginary = sin(value) */
            /* This is where it crashes, unless real and imaginary are constant values (1.0) */

            /* Sum up the contributions of the extracted vector to the second output matrix */
        }
        /* Construct the second output matrix from some intermediate variables
           and the real and imaginary components */
    }
}
I thought this could be due to a register limit, but the occupancy calculator indicates that this is not the case; I'm using fewer than 32,768 registers with 512 threads. Can anyone give any suggestions as to what the cause of this could be?
Here is the ptxas info:
ptxas info : Compiling entry function '_Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_' for 'sm_20'
ptxas info : Function properties for _Z40KerneliidddddPKdS0_S0_S0_iiiiiiiiiPdS1_S1_S1_S1_S1_S1_S1_S1_S1_
8056 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Function properties for __internal_trig_reduction_slowpathd
40 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 53 registers, 232 bytes cmem[0], 144 bytes cmem[2], 28 bytes cmem[16]
tmpxft_00001d70_00000000-3_MexFunciton.cudafe1.cpp
"Uncorrectable ECC error" usually refers to a hardware failure. ECC is Error Correcting Code, a means to detect and correct errors in bits stored in RAM. A stray cosmic ray can disrupt one bit stored in RAM every once in a great while, but "uncorrectable ECC error" indicates that several bits are coming out of RAM storage "wrong" - too many for the ECC to recover the original bit values.
This could mean that you have a bad or marginal RAM cell in your GPU device memory.
Marginal circuits of any kind may not fail 100%, but are more likely to fail under the stress of heavy use - and associated rise in temperature.
There are diagnostic utilities floating around to stress-test all the RAM banks of your PC to confirm or pinpoint which chip is failing, but I don't know of an analog for testing the device RAM banks of the GPU.
If you have access to another machine with a GPU of similar capability, try running your app on that machine to see how it behaves. If you don't get the ECC error on the second machine, this confirms that the problem is almost certainly in the hardware of the first machine. If you get the same ECC error on the second machine, then ignore everything I've written here and continue looking for your software bug. Unless your code is actually causing hardware damage, the chances of two machines having the same hardware failure are extremely small.

Are hard-coded math operations performed at compile time or run time in Objective-C in Xcode?

If I write a line of code for a math operation, such as:
x = 109.0f*768.0f/320.0f;
Is the result (261.6f) computed at compile time or run time? In other words, does Xcode's optimization recognize that the result of a hard-coded math operation will always be the same, and thus can be pre-computed while compiling?
It is computed at compile time, at least using Xcode targeting iOS. This function:
float test() {
float x = 109.0f*768.0f/320.0f;
return x;
}
compiles to these three instructions:
movw r0, #52429
movt r0, #17282
bx lr
Computing the value at compile time isn't required by the C standard. In fact, if you set the FENV_ACCESS pragma, there are cases where the compiler is forbidden from computing it at compile time. Turning on FENV_ACCESS didn't affect the generated instructions in this test case.
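If you want the folding guaranteed rather than merely permitted, constexpr (C++, also usable from Objective-C++) forces compile-time evaluation; a minimal sketch:
// constexpr requires the initializer to be a constant expression,
// so the arithmetic cannot be deferred to run time
constexpr float x = 109.0f * 768.0f / 320.0f;
static_assert(x == 261.6f, "folded at compile time");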

add vs mul (IA32-Assembly)

I know that add is faster than mul.
I want to know how to use add instead of mul in the following code in order to make it more efficient.
Sample code:
mov eax, [ebp + 8] #eax = x1
mov ecx, [ebp + 12] #ecx = x2
mov edx, [ebp + 16] #edx = y1
mov ebx, [ebp + 20] #ebx = y2
sub eax,ecx #eax = x1-x2
sub edx,ebx #edx = y1-y2
mul edx #eax = (x1-x2)*(y1-y2)
add is faster than mul, but if you want to multiply two general values, mul is far faster than any loop of iterated add operations.
You can't seriously use add to make that code go faster than it will with mul. If you needed to multiply by some small constant value (such as 2), then maybe you could use add to speed things up. But for the general case - no.
If you are multiplying two values that you don't know in advance, it is effectively impossible to beat the multiply instruction in x86 assembler.
If you know the value of one of the operands in advance, you may be able to beat the multiply instruction by using a small number of adds. This works particularly well when the known operand is small and has only a few bits set in its binary representation. To multiply an unknown value x by a known value consisting of 2^p + 2^q + ... + 2^r, you simply add x*2^p + x*2^q + ... + x*2^r for the set bits p, q, ... and r. This is easily accomplished in assembler by left shifting and adding:
; x in EDX
; product to EAX
xor eax,eax
shl edx,r ; x*2^r (the sequence assumes p > q > r)
add eax,edx
shl edx,q-r ; x*2^q
add eax,edx
shl edx,p-q ; x*2^p
add eax,edx
The key problem with this is that it takes at least 4 clocks, assuming a superscalar CPU constrained by register dependencies. Multiply typically takes 10 or fewer clocks on modern CPUs, and if this sequence gets longer than that in time you might as well do a multiply.
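The same decomposition in portable code (a C++ sketch, not from the original answer); the constant 21 = 2^4 + 2^2 + 2^0 is just an example:
#include <cstdint>

// multiply by 21 using only shifts and adds
inline uint32_t mul21(uint32_t x) {
    return (x << 4) + (x << 2) + x;
}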
To multiply by 9:
mov eax,edx ; same effect as xor eax,eax/shl edx 1/add eax,edx
shl edx,3 ; x*2^3
add eax,edx
This beats multiply; should only take 2 clocks.
What is less well known is the use of the LEA (load effective address) instruction to accomplish a fast multiply-by-small-constant. LEA takes only a single clock worst case, and its execution can often be overlapped with other instructions by superscalar CPUs.
LEA is essentially "add two values, one with a small constant multiplier". It computes t = 2^k * x + y for k = 0, 1, 2, 3 (see the Intel reference manual) with t, x and y being any registers. If x == y, you can get 1, 2, 3, 4, 5, 8 and 9 times x, but using x and y as separate registers allows intermediate results to be combined and moved to other registers (e.g., to t), and this turns out to be remarkably handy.
Using it, you can accomplish a multiply by 9 using a single instruction:
lea eax,[edx*8+edx] ; takes 1 clock
Using LEA carefully, you can multiply by a variety of peculiar constants in a small number of cycles:
lea eax,[edx*4+edx] ; 5 * edx
lea eax,[eax*2+edx] ; 11 * edx
lea eax,[eax*4] ; 44 * edx
To do this, you have to decompose your constant multiplier into various factors/sums involving 1, 2, 3, 4, 5, 8 and 9. It is remarkable how many small constants you can do this for while still using only 3-4 instructions.
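Note that modern compilers already perform this decomposition when the multiplier is a literal; a quick sketch to inspect with any x86-64 compiler at -O2:
#include <cstdint>

// typically compiles to: lea eax, [rdi + rdi*8]   (no imul)
uint32_t times9(uint32_t x) { return x * 9; }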
If you allow the use of other typically single-clock instructions (e.g., SHL/SUB/NEG/MOV) you can multiply by some constant values that pure LEA can't do as efficiently by itself. To multiply by 31:
lea eax,[4*edx]     ; 4*edx
lea eax,[8*eax]     ; 32*edx
sub eax,edx         ; 31*edx ; 3 clocks
The corresponding LEA sequence is longer:
lea eax,[edx*4+edx]
lea eax,[edx*2+eax] ; eax*7
lea eax,[eax*2+edx] ; eax*15
lea eax,[eax*2+edx] ; eax*31 ; 4 clocks
Figuring out these sequences is a bit tricky, but you can set up an organized attack. Since LEA, SHL, SUB, NEG and MOV are all single-clock instructions worst case, and zero clocks if they have no dependences on other instructions, you can compute the execution cost of any such sequence. This means you can implement a dynamic programming algorithm to generate the best possible sequence of such instructions. This is only useful if the clock count is smaller than that of the integer multiply for your particular CPU (I use 5 clocks as a rule of thumb), and it doesn't use up all the registers, or at least doesn't use up registers that are already busy (avoiding any spills).
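A sketch of such a search (simplified to one accumulator register plus the original value x in a second register, with every modeled instruction costing one clock; the names and the operation set are illustrative, and because the initial copy of x into the accumulator is modeled as free, the counts are optimistic lower bounds):
#include <cstdint>
#include <cstdio>
#include <queue>
#include <unordered_map>
#include <vector>

// breadth-first search over reachable multipliers: returns the minimum
// number of single-clock instructions needed to multiply by 'target'
int min_ops(int64_t target, int64_t limit = 1 << 20)
{
    std::unordered_map<int64_t, int> cost{{1, 0}};  // start from x itself
    std::queue<int64_t> q;
    q.push(1);
    while (!q.empty()) {
        int64_t m = q.front(); q.pop();
        if (m == target) return cost[m];
        std::vector<int64_t> next;
        for (int64_t f : {2, 3, 4, 5, 8, 9}) next.push_back(f * m);  // lea eax,[eax*k] / [eax*k+eax]
        for (int64_t f : {2, 4, 8}) next.push_back(f * m + 1);       // lea eax,[eax*f+edx]
        next.push_back(m + 1);                                       // add eax,edx
        next.push_back(m - 1);                                       // sub eax,edx
        for (int k = 1; k <= 10; ++k) next.push_back(m << k);        // shl eax,k
        for (int64_t n : next)
            if (n > 0 && n <= limit && !cost.count(n)) {
                cost[n] = cost[m] + 1;
                q.push(n);
            }
    }
    return -1;
}

int main()
{
    for (int t : {9, 31, 44, 1280})
        std::printf("x%d needs %d single-clock instructions\n", t, min_ops(t));
}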
I've actually built this into our PARLANSE compiler, and it is very effective for computing offsets into arrays of structures A[i], where the size of the structure element in A is a known constant. A clever person would possibly cache the answer so it doesn't have to be recomputed each time the same constant multiplier occurs; I didn't actually do that because the time to generate such sequences is less than you'd expect. It is mildly interesting to print out the sequences of instructions needed to multiply by all constants from 1 to 10000: most of them can be done in 5-6 instructions worst case. As a consequence, the PARLANSE compiler hardly ever uses an actual multiply when indexing even the nastiest arrays of nested structures.
Unless your multiplications are fairly simplistic, the add most likely won't outperform a mul. Having said that, you can use add to do multiplications:
Multiply by 2:
add eax,eax ; x2
Multiply by 4:
add eax,eax ; x2
add eax,eax ; x4
Multiply by 8:
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
They work nicely for powers of two. I'm not saying they're faster; they were certainly necessary in the days before fancy multiplication instructions. That's from someone whose soul was forged in the hell-fires that were the Mostek 6502, Zilog Z80 and RCA 1802 :-)
You can even multiply by non-powers by simply storing interim results:
Multiply by 9:
push ebx ; preserve
push eax ; save for later
add eax,eax ; x2
add eax,eax ; x4
add eax,eax ; x8
pop ebx ; get original eax into ebx
add eax,ebx ; x9
pop ebx ; recover original ebx
I generally suggest that you write your code primarily for readability and only worry about performance when you need it. However, if you're working in assembler, you may well already be at that point. But I'm not sure my "solution" is really applicable to your situation, since you have an arbitrary multiplicand.
You should, however, always profile your code in the target environment to ensure that what you're doing is actually faster. Assembler doesn't change that aspect of optimisation at all.
If you really want to see some more general purpose assembler for using add to do multiplication, here's a routine that will take two unsigned values in ax and bx and return the product in ax. It will not handle overflow elegantly.
START: MOV AX, 0007 ; Load up registers
MOV BX, 0005
CALL MULT ; Call multiply function.
HLT ; Stop.
MULT: PUSH BX ; Preserve BX, CX, DX.
PUSH CX
PUSH DX
XOR CX,CX ; CX is the accumulator.
CMP BX, 0 ; If multiplying by zero, just stop.
JZ FIN
MORE: PUSH BX ; Xfer BX to DX for bit check.
POP DX
AND DX, 0001 ; Is lowest bit 1?
JZ NOADD ; No, do not add.
ADD CX,AX
NOADD: SHL AX,1 ; Shift AX left (double).
SHR BX,1 ; Shift BX right (integer halve, next bit).
JNZ MORE ; Keep going until no more bits in BX.
FIN: PUSH CX ; Xfer product from CX to AX.
POP AX
POP DX ; Restore registers and return.
POP CX
POP BX
RET
It relies on the fact that 123 multiplied by 456 is identical to:
123 x 6
+ 1230 x 5
+ 12300 x 4
which is the same way you were taught multiplication back in grade/primary school. It's easier with binary since you're only ever multiplying by zero or one (in other words, either adding or not adding).
It's pretty old-school x86 (8086, from a DEBUG session - I can't believe they still actually include that thing in XP) since that was about the last time I coded directly in assembler. There's something to be said for high level languages :-)
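For reference, the same algorithm in portable C++ (a sketch of the routine above; overflow is not handled, as noted):
#include <cstdint>

// shift-and-add multiply: add the doubled multiplicand for each set bit
uint16_t mult(uint16_t ax, uint16_t bx)
{
    uint16_t cx = 0;            // CX is the accumulator
    while (bx) {
        if (bx & 1) cx += ax;   // lowest bit 1? add the contribution
        ax <<= 1;               // SHL AX,1 - double the multiplicand
        bx >>= 1;               // SHR BX,1 - next bit
    }
    return cx;
}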
When it comes to assembly instructions, the speed of executing any instruction is measured in clock cycles. A mul instruction always takes more clock cycles than an add, but if you execute add in a loop to emulate a multiplication, the overall clock cycles will be far more than a single mul instruction. You can have a look at the following URLs, which list the clock cycles of a single add/mul instruction; that way you can do the math on which one will be faster.
http://home.comcast.net/~fbui/intel_a.html#add
http://home.comcast.net/~fbui/intel_m.html#mul
My recommendation is to use the mul instruction rather than putting add in a loop; the latter is a very inefficient solution.
I'd have to echo the responses you already have - for a general multiply you're best off using MUL; after all, that's what it's there for!
In some specific cases, where you know you'll be multiplying by a specific fixed value each time (for example, when working out a pixel index in a bitmap), you can consider breaking the multiply down into a (small) handful of SHLs and ADDs - e.g.:
1280 x 1024 display - each line on the display is 1280 pixels.
1280 = 1024 + 256 = 2^10 + 2^8
y * 1280 = y * (2^10) + y * (2^8)
         = ADD (SHL y, 10), (SHL y, 8)
...given that graphics processing is likely to need to be speedy, such an approach may save you precious clock cycles.
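As code, the decomposition looks like this (a small sketch, not from the original answer):
// y*1280 = y*1024 + y*256, i.e. two shifts and an add
inline int row_offset(int y) {
    return (y << 10) + (y << 8);
}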