I have seen many posts about using the clock() function to determine the amount of elapsed time in a program with code looking something like:
start_time = clock();
//code to be timed
.
.
.
end_time = clock();
elapsed_time = (end_time - start_time) / CLOCKS_PER_SEC;
The value of CLOCKS_PER_SEC is almost surely not the actual number of clock ticks per second so I am a bit wary of the result. Without worrying about threading and I/O, is the output of the clock() function being scaled in some way so that this divison produces the correct wall clock time?
The answer to your question is yes.
clock() in this case refers to a wallclock rather than a CPU clock so it could be misleading at first glance. For all the machines and compilers I've seen, it returns the time in milliseconds since I've never seen a case where CLOCKS_PER_SEC isn't 1000. So the precision of clock() is limited to milliseconds and the accuracy is usually slightly less.
If you're interested in the actual cycles, this can be hard to obtain.
The rdtsc instruction will let you access the number "pseudo"-cycles from when the CPU was booted. On older systems (like Intel Core 2), this number is usually the same as the actual CPU frequency. But on newer systems, it isn't.
To get a more accurate timer than clock(), you will need to use the hardware performance counters - which is specific to the OS. These are internally implemented using the 'rdtsc' instruction from the last paragraph.
Related
Does x86(64 too) processor optimise away the multiplication if one of the operands of multiplication happens to be 1.0?
PS:I do not mean compiler optimising a constant multiplication of 1.0.
That's not something I've seen mentioned in docs about instruction latencies or microarchitectures of Intel or AMD CPUs.
I suspect it doesn't happen, because variable latency would interfere with pipelined execution units. (multiple results coming out of the same execution unit in the same clock cycle = extra complexity). Also, there are probably other bits of logic (uop scheduling / queueing, result forwarding networks) that are designed around every uop having known latency. (except for special cases like division / sqrt).
IIRC, one analyst, maybe Agner Fog or David Kanter, suggested that some uops might have been possible to implement with 2 cycle latency, but instead take 3 cycles to match the other uops that their execution port can handle. So constant latency for operations appears to be a big deal for Intel CPU designs, to the extent that it was worth making an operation slower.
Note that we're only talking about latency here. If your multiply isn't part of a loop-carried dependency chain, or you have enough independent multiplies, you can keep the multiplier(s) going with one operation per clock.
Haswell CPUs can sustain a throughput of 2 FP vector multiplies per clock. (256b vectors of 4 doubles or 8 floats). Latency = 5 clock cycles for the result to be ready, regardless of input. Or 1 vector integer multiply per clock. (The vector multiply ALU is on port 0. The vector FP multipliers are on port 0 and port 1).
Avoid multiplying when you can, it leads to long dependency chains. (Usually this comes up for integer multiplies to calculate loop indices. Compilers do a lot better when you write your loop to increment the counter by 16, instead of multiplying i++ by 16 as an array index.)
Can anybody spot what is wrong with the code below. It is supposed to average the frame interval (dt) for the previous TIME_STEPS number of frames.
I'm using Box2d and cocos2d, although I don't think the cocos2d bit is very relevent.
-(void) update: (ccTime) dt
{
float32 timeStep;
const int32 velocityIterations = 8;
const int32 positionIterations = 3;
// Average the previous TIME_STEPS time steps
for (int i = 0; i < TIME_STEPS; i++)
{
timeStep += previous_time_steps[i];
}
timeStep = timeStep/TIME_STEPS;
// step the world
[GB2Engine sharedInstance].world->Step(timeStep, velocityIterations, positionIterations);
for (int i = 0; i < TIME_STEPS - 1; i++)
{
previous_time_steps[i] = previous_time_steps[i+1];
}
previous_time_steps[TIME_STEPS - 1] = dt;
}
The previous_time_steps array is initially filled with whatever the animation interval is set too.
This doesn't do what I would expect it too. On devices with a low frame rate it speeds up the simulation and on devices with a high frame rate it slows it down. I'm sure it's something stupid I'm over looking.
I know box2D likes to work with fixed times steps but I really don't have a choice. My game runs at a very variable frame rate on the various devices so a fixed time stop just won't work. The game runs at an average of 40 fps, but on some of the crappier devices like the first gen iPad it runs at barely 30 frames per second. The third gen ipad runs it at 50/60 frames per second.
I'm open to suggestion on other ways of dealing with this problem too. Any advice would be appreciated.
Something else unusual I should note that somebody might have some insight into is the fact that running any debug optimisations on the build has a huge effect on the above. The frame rate isn't changed much when debug optimisations are set to -Os vs -O0. But when the debut optimisations are set to -Os the physics simulation runs much faster than -O0 when the above code is active. If I just use dt as the interval instead of the above code then the debug optimisations make no difference.
I'm totally confused by that.
On devices with a low frame rate it speeds up the simulation and on
devices with a high frame rate it slows it down.
That's what using a variable time step is all about. If you only get 10 fps the physics engine will iterate the world faster because the delta time is larger.
PS: If you do any kind of performance tests like these, run them with the release build. That also ensures that (most) logging is disabled and code optimizations are on. It's possible that you simply experience much greater impact on performance from debugging code on older devices.
Also, what value is TIME_STEPS? It shouldn't be more than 10, maybe 20 at most. The alternative to averaging is to use delta time directly, but if delta time is greater than a certain threshold (30 fps) switch to using a fixed delta time (cap it). Because variable time step below 30 fps can get really ugly, it's probably better in such cases to allow the physics engine to slow down with the framerate or else the game will become harder if not unplayable at lower fps.
Shaders allow if(x>a && x<b) but my understanding is logical tests are slow. Is this worth optimizing along the lines:
mid = (a+b)*.5;
half = (b-a)*.5;
if(abs(x - mid) < half){...}
It seems a lot more code but from the days of CPU ASM, such tricks were sometimes worth doing.
If half and mid can be calculated once on the CPU and passed in as shader parameters, then we replace: if(x>a && x<b) with: if(abs(x - mid) < half) - is this worth bothering with?
In my tests I see a little improvement pulling half and mid into pre-calculated shader params, but I've only one GPU to test on so I'm after a "in general" answer, and if there are cases it will be dramatically better or worse.
The precalculating part of the shader params could be irrelevant (especially on directx), as the compiler is often smart enough to figure out the expression is constant.
Question is how abs is implemented. What's your target GPU? Mobile,mid,highend?
I'm trying to evaluate the maximum physical rate (Nyquist performance limit) of the A/Ds integrated on board various PIC microcontrollers.
However, to do the calculation requires parameters that I'm not finding explicitly stated in the datasheets, specifically Tacq, Fosc, TAD, and divisor parameters.
I've proceeded by making some assumptions but would be helpful to have a sanity check -- am I doing the maximum physical rate calculations correctly?
For illustration purposes only, I've taken the simplest possible PIC10F220 that has an ADC. This is to focus specifically on the interpretation of Tacq, Fosc, TAD, and divisor parameters, and not to suggest that any practical functionality could be implemented on this very basic chip. (This is to Clifford's points in the comments below.)
Calculation:
Nyquist Performance Analysis of PIC10F220
- Runs at clock speed of 8MHz.
- Has an instruction cycle of 0.5us [4 clock steps per instruction]
So:
- Get Tacq = 6.06 us [acquisition time for ADC, assuming chip temp. = 50*C]
[from datasheet p34]
- Set Fosc := 8MHz [? should this be internal clock speed ?]
- Set divisor := 4 [? assuming this is 4 from 4 clock steps per CPU instruction ?]
- This gives TAD = 0.5us [TAD = 1/(Fosc/divisor) ]
- Get conversion time is 13*TAD [from datasheet p31]
- This gives conversion time 6.5 us
- So ADC duration is 12.56 us [? Tacq + 13*TAD]
Assuming 10 instructions for a simple load/store/threshold done in real-time before the next sample (this is just a stub -- the point is the rest of the calculation):
- This adds another 5 us [0.5 us per instruction]
- To give total ADC and handling time of 17.56 us [ 12.56us + 1us + 4us ]
- before the sampling loop repeats [? Again Tacq ? + 13*TAD + handling ]
- If this is correct, then the max sampling rate is 56.9 ksps [ 1/ total time ]
- So the Nyquist frequency for this sampling rate is 28 kHz. [1/2 sampling rate]
Which means the (theoretical) performance of this system --- chip's A/D with the hypothetical real-time handling code --- is for signals that are bandlimited to 28 kHz.
Is this a correct assignment / interpretation of the data sheet in obtaining Tacq, Fosc, TAD, and divisor parameters and using them to obtain the maximum physical rate, or Nyquist performance limit, of this chip?
Thanks,
You're not going to be able to do much processing in 8 instructions, but assuming you're just doing something simple like storing the incoming samples to a buffer, or detecting a threshold, then your analysis looks good.
The actual chips I'm considering for the design are the dsPIC33FJ128MC804 (with 16b A/D) or dsPIC30F3014 (with 12b A/D).
That is an important distinction; the dsPIC ADC supports ping-pong DMA transfers of multiple channels simultaneously, so can minimise the effective software overhead per sample. That makes the calculation a somewhat different one. You need to determine from the sample rate and the DMA buffer size the time between sample buffer interrupts; that is how much processing time you have to deal with each buffer. If you are using Microchip's DSP library, it gives precise cycle time formulae for each algorithm, and block processing is considerably more efficient that sample-by-sample processing.
My last project was on a dsPIC33 with two channels sampled at 48KHz and 32word sample buffers (giving 667us to process each pair of buffers). The software processing was therefore entirely independent of the sampling since by using DMA they take place simultaneously.
How can you keep track of time in a simple embedded system, given that you need a fixed-point representation of the time in seconds, and that your time between ticks is not precisely expressable in that fixed-point format? How do you avoid cumulative errors in those circumstances.
This question is a reaction to this article on slashdot.
0.1 seconds cannot be neatly expressed as a binary fixed-point number, just as 1/3 cannot be neatly expressed as a decimal fixed-point number. Any binary fixed-point representation has a small error. For example, if there are 8 binary bits after the point (ie using an integer value scaled by 256), 0.1 times 256 is 25.6, which will be rounded to either 25 or 26, resulting in an error in the order of -2.3% or +1.6% respectively. Adding more binary bits after the point reduces the scale of this error, but cannot eliminate it.
With repeated addition, the error gradually accumulates.
How can this be avoided?
One approach is not to try to compute the time by repeated addition of this 0.1 seconds constant, but to keep a simple integer clock-tick count. This tick count can be converted to a fixed-point time in seconds as needed, usually using a multiplication followed by a division. Given sufficient bits in the intermediate representations, this approach allows for any rational scaling, and doesn't accumulate errors.
For example, if the current tick count is 1024, we can get the current time (in fixed point with 8 bits after the point) by multiplying that by 256, then dividing by 10 - or equivalently, by multiplying by 128 then dividing by 5. Either way, there is an error (the remainder in the division), but the error is bounded since the remainder is always less than 5. There is no cumulative error.
Another approach might be useful in contexts where integer multiplication and division is considered too costly (which should be getting pretty rare these days). It borrows an idea from Bresenhams line drawing algorithm. You keep the current time in fixed point (rather than a tick count), but you also keep an error term. When the error term grows too large, you apply a correction to the time value, thus preventing the error from accumulating.
In the 8-bits-after-the-point example, the representation of 0.1 seconds is 25 (256/10) with an error term (remainder) of 6. At each step, we add 6 to our error accumulator. Based on this so far, the first two steps are...
Clock Seconds Error
----- ------- -----
25 0.0977 6
50 0.1953 12
At the second step, the error value has overflowed - exceeded 10. Therefore, we increment the clock and subtract 10 from the error. This happens every time the error value reaches 10 or higher.
Therefore, the actual sequence is...
Clock Seconds Error Overflowed?
----- ------- ----- -----------
25 0.0977 6
51 0.1992 2 Yes
76 0.2969 8
102 0.3984 4 Yes
There is almost always an error (the clock is precisely correct only when the error value is zero), but the error is bounded by a small constant. There is no cumulative error in the clock value.
A hardware-only solution is to arrange for the hardware clock ticks to run very slightly fast - precisely fast enough to compensate for cumulative losses caused by the rounding-down of the repeatedly added tick-duration value. That is, adjust the hardware clock tick speed so that the fixed-point tick-duration value is precisely correct.
This only works if there is only one fixed-point format used for the clock.
Why not have 0.1 sec counter and every ten times increment your seconds counter, and wrap the 0.1 counter back to 0?
In this particular instance, I would have simply kept the time count in tenths of a seconds (or milliseconds, or whatever time scale is appropriate for the application). I do this all the time in small systems or control systems.
So a time value of 100 hours would be stored as 3_600_000 ticks - zero error (other than error that might be introduced by hardware).
The problems that are introduced by this simple technique are:
you need to account for the larger numbers. For example, you may have to use a 64-bit counter rather than a 32-bit counter
all your calculations need to be aware of the units used - this is the area that is most likely going to cause problems. I try to help with this problem by using time counters with a uniform unit. For example, this particular counter needs only 10 ticks per second, but another counter might need millisecond precision. In that case, I'd consider making both counters millisecond precision so they use the same units even though one doesn't really need that precision.
I've also had to play some other tricks this with timers that aren't 'regular'. For example, I worked on a device that required a data acquisition to occur 300 times a second. The hardware timer fired once a millisecond. There's no way to scale the millisecond timer to get exactly 1/300th of a second units. So We had to have logic that would perform the data acquisition on every 3, 3, and 4 ticks to keep the acquisition from drifting.
If you need to deal with hardware time error, then you need more than one time source and use them together to keep the overall time in sync. Depending on your needs this can be simple or pretty complex.
Something I've seen implemented in the past: the increment value can't be expressed precisely in the fixed-point format, but it can be expressed as a fraction. (This is similar to the "keep track of an error value" solution.)
Actually in this case the problem was slightly different, but conceptually similar—the problem wasn't a fixed-point representation as such, but deriving a timer from a clock source that wasn't a perfect multiple. We had a hardware clock that ticks at 32,768 Hz (common for a watch crystal based low-power timer). We wanted a millisecond timer from it.
The millisecond timer should increment every 32.768 hardware ticks. The first approximation is to increment every 33 hardware ticks, for a nominal 0.7% error. But, noting that 0.768 is 768/1000, or 96/125, you can do this:
Keep a variable for "fractional" value. Start it on 0.
wait for the hardware timer to count 32.
While true:
increment the millisecond timer.
Add 96 to the "fractional" value.
If the "fractional" value is >= 125, subtract 125 from it and wait for the hardware timer to count 33.
Otherwise (the "fractional" value is < 125), wait for the hardware timer to count 32.
There will be some short term "jitter" on the millisecond counter (32 vs 33 hardware ticks) but the long-term average will be 32.768 hardware ticks.