Is CLOCK_THREAD_CPUTIME_ID the same as pthread_getcpuclockid(pthread_self(),..)? - cpu-time

In terms of obtaining per-thread CPU time using Posix clock_gettime(): is there any difference between using the clock ID obtained from pthread_getcpuclockid(pthread_self(),..) or using CLOCK_THREAD_CPUTIME_ID?

Per the man page:
NOTES
When thread refers to the calling thread, this function returns an
identifier that refers to the same clock manipulated by
clock_gettime(2) and clock_settime(2) when given the clock ID
CLOCK_THREAD_CPUTIME_ID.

Related

What is the Difference between Aging and Healing in AUTOSAR DEM?

Event Aging
The process of aging resets status bit 3 – ConfirmedDTC when a sufficient amount of time
has elapsed so that the cause for the error entry is assumedly not relevant anymore. This
is often used as a trigger to also clear stored snapshots or extended data from the event
memory.
But I don't get the healing process. I couldn't find anything about it.
Aged counter
Aging Counter The Dem module provides the ability to remove a specific event from the event memory, if its fault conditions are not fulfilled for a certain period of time (operation cycles). This process is called as "aging" or "unlearning". The usage of this feature requires the maintaining of an additional NVRAM block
Healing counter
Available both in positive direction, counting up from 0 (healing not started), latching at 255;
and in reverse counting down from the healing threshold (healing not started) to 0. The
counter is incremented resp. decremented as soon as the healing conditions are fulfilled (at
the end of a ‘passed’ tested operation cycle without failed result), irrespective of the status
of the ‘ConfirmedDTC ‘ or ‘WarningIndicatorRequested’ status bit.
The up-counting data element corresponds to ‘Cycles Tested Since Last Failed’.
Both data elements are also calculated for events without indicator.
I found the following Diagram in AUTOSAR documentation, now It's clear
According to AUTOSAR DEM SWS Document :
Healing of diagnostic events
The Dem module provides the ability to activate and deactivate indicators per event
stored in the event memory. The process of deactivation is defined as healing of a
diagnostic event.
Aging of diagnostic events
The Dem module provides the ability to remove a specific event from the event memory, if its fault conditions are not fulfilled for a certain period of time (operation cycles).This process is called as "aging" or "unlearning".
Few points to notice ( According to my point of view ) :
1 - Each one of them has a separate counter and a separate threshold, When the counting value meets the provided threshold, corresponding action is being taken.
2 - Normally, healing comes before aging.
3 - Aging resets the confirmedDTC bit in the status byte of the DTC. Healing just means we have an operation cycle in which the event status byte never had the testFailedThisOperationCycle bit set before.

A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy with the following code (ignoring all sorts of wrappers and concept checks etc) with the simple loop:
for (; __first != __last; ++__result, ++__first)
*__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
T* __restrict__ destination,
const T* __restrict__ source,
size_t length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is sizeof(T) is 1 or 2, and we have have each lane write a single element, the entire warp would write less than 128B, wasting some of the memory transaction. Instead, we should have each thread place 2 or 4 input elements in a register, and write that
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements up to the multiple of 32B, and the last elements (after the last such multiple,) differently.
Slack due to output address mis-alignment - The output is written in transactions of upto 128B (or is it just 32B?); we thus have to handle the first elements up to the multiple of this number, and the last elements (after the last such multiple,) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

measuring time between two rising edges in beaglebone

I am reading sensor output as square wave(0-5 volt) via oscilloscope. Now I want to measure frequency of one period with Beaglebone. So I should measure the time between two rising edges. However, I don't have any experience with working Beaglebone. Can you give some advices or sample codes about measuring time between rising edges?
How deterministic do you need this to be? If you can tolerate some inaccuracy, you can probably do it on the main Linux OS; if you want to be fancy pants, this seems like a potential use case for the BBB's PRU's (which I unfortunately haven't used so take this with substantial amounts of salt). I would expect you'd be able to write PRU code that just sits with an infinite outerloop and then inside that loop, start looping until it sees the pin shows 0, then starts looping until the pin shows 1 (this is the first rising edge), then starts counting until either the pin shows 0 again (this would then be the falling edge) or another loop to the next rising edge... either way, you could take the counter value and you should be able to directly convert that into time (the PRU is states as having fixed frequency for each instruction, and is a 200Mhz (50ns/instruction). Assuming your loop is something like
#starting with pin low
inner loop 1:
registerX = loadPin
increment counter
jump if zero registerX to inner loop 1
# pin is now high
inner loop 2:
registerX = loadPin
increment counter
jump if one registerX to inner loop 2
# pin is now low again
That should take 3 instructions per counter increment, so you can get the time as 3 * counter * 50 ns.
As suggested by Foon in his answer, the PRUs are a good fit for this task (although depending on your requirements it may be fine to use the ARM processor and standard GPIO). Please note that (as far as I know) both the regular GPIOs and the PRU inputs are based on 3.3V logic, and connecting a 5V signal might fry your board! You will need an additional component or circuit to convert from 5V to 3.3V.
I've written a basic example that measures timing between rising edges on the header pin P8.15 for my own purpose of measuring an engine's rpm. If you decide to use it, you should check the timing results against a known reference. It's about right but I haven't checked it carefully at all. It is implemented using PRU assembly and uses the pypruss python module to simplify interfacing.

How can I (reasonably) precisely perform an action every N milliseconds?

I have a machine which uses an NTP client to sync up to internet time so it's system clock should be fairly accurate.
I've got an application which I'm developing which logs data in real time, processes it and then passes it on. What I'd like to do now is output that data every N milliseconds aligned with the system clock. So for example if I wanted to do 20ms intervals, my oututs ought to be something like this:
13:15:05:000
13:15:05:020
13:15:05:040
13:15:05:060
I've seen suggestions for using the stopwatch class, but that only measures time spans as opposed to looking for specific time stamps. The code to do this is running in it's own thread, so should be a problem if I need to do some relatively blocking calls.
Any suggestions on how to achieve this to a reasonable (close to or better than 1ms precision would be nice) would be very gratefully received.
Don't know how well it plays with C++/CLR but you probably want to look at multimedia timers,
Windows isn't really real-time but this is as close as it gets
You can get a pretty accurate time stamp out of timeGetTime() when you reduce the time period. You'll just need some work to get its return value converted to a clock time. This sample C# code shows the approach:
using System;
using System.Runtime.InteropServices;
class Program {
static void Main(string[] args) {
timeBeginPeriod(1);
uint tick0 = timeGetTime();
var startDate = DateTime.Now;
uint tick1 = tick0;
for (int ix = 0; ix < 20; ++ix) {
uint tick2 = 0;
do { // Burn 20 msec
tick2 = timeGetTime();
} while (tick2 - tick1 < 20);
var currDate = startDate.Add(new TimeSpan((tick2 - tick0) * 10000));
Console.WriteLine(currDate.ToString("HH:mm:ss:ffff"));
tick1 = tick2;
}
timeEndPeriod(1);
Console.ReadLine();
}
[DllImport("winmm.dll")]
private static extern int timeBeginPeriod(int period);
[DllImport("winmm.dll")]
private static extern int timeEndPeriod(int period);
[DllImport("winmm.dll")]
private static extern uint timeGetTime();
}
On second thought, this is just measurement. To get an action performed periodically, you'll have to use timeSetEvent(). As long as you use timeBeginPeriod(), you can get the callback period pretty close to 1 msec. One nicety is that it will automatically compensate when the previous callback was late for any reason.
Your best bet is using inline assembly and writing this chunk of code as a device driver.
That way:
You have control over instruction count
Your application will have execution priority
Ultimately you can't guarantee what you want because the operating system has to honour requests from other processes to run, meaning that something else can always be busy at exactly the moment that you want your process to be running. But you can improve matters using timeBeginPeriod to make it more likely that your process can be switched to in a timely manner, and perhaps being cunning with how you wait between iterations - eg. sleeping for most but not all of the time and then using a busy-loop for the remainder.
Try doing this in two threads. In one thread, use something like this to query a high-precision timer in a loop. When you detect a timestamp that aligns to (or is reasonably close to) a 20ms boundary, send a signal to your log output thread along with the timestamp to use. Your log output thread would simply wait for a signal, then grab the passed-in timestamp and output whatever is needed. Keeping the two in separate threads will make sure that your log output thread doesn't interfere with the timer (this is essentially emulating a hardware timer interrupt, which would be the way I would do it on an embedded platform).
CreateWaitableTimer/SetWaitableTimer and a high-priority thread should be accurate to about 1ms. I don't know why the millisecond field in your example output has four digits, the max value is 999 (since 1000 ms = 1 second).
Since as you said, this doesn't have to be perfect, there are some thing that can be done.
As far as I know, there doesn't exist a timer that syncs with a specific time. So you will have to compute your next time and schedule the timer for that specific time. If your timer only has delta support, then that is easily computed but adds more error since the you could easily be kicked off the CPU between the time you compute your delta and the time the timer is entered into the kernel.
As already pointed out, Windows is not a real time OS. So you must assume that even if you schedule a timer to got off at ":0010", your code might not even execute until well after that time (for example, ":0540"). As long as you properly handle those issues, things will be "ok".
20ms is approximately the length of a time slice on Windows. There is no way to hit 1ms kind of timings in windows reliably without some sort of RT add on like Intime. In windows proper I think your options are WaitForSingleObject, SleepEx, and a busy loop.

If passing a negative number to taskDelay function in vxworks, what happens?

Noted that the parameter of taskDelay is of type int, which means the number could be negative. Just wondering how the function is going to react when passing a negative number.
Most functions would validate the input, and just return early/return 0/set the parameter in question to a default value.
I presume there's no critical need to do this in production, and you probably have some code lying around that you could test with.... why not give it a go?
The documentation doesn't address it, and the only error codes they do define doesn't cover this case. The most correct answer therefore is that the results are undefined.
See the VxWorks / Tornado II FAQ for this gem, however:
taskDelay(-1) shows another bug in
the vxWorks timer/tick code. It has
the (side) effect of setting vxTicks
to zero. This corrupts the localtime
(and probably other things). In fact
taskDelay(x) will have the same effect
if vxTicks + x >= 0x100000000. If the
system clock rate is 100Hz this
happens after about 500 days (because
vxTicks wraps). At faster clock rates
it will happen sooner. Anyone trying
for several years uptime?
Oh there is an undocumented upper
limit on the clock rate. At rates
above 4294 select() will fail to
convert its 'usec' time into the
correct number of ticks. (From: David
Laight, dsl#tadpole.co.uk)
Assuming this bug is old, I would hope that it would either return an error or do the same thing as taskDelay(0), which puts your task at the end of the ready queue.
The task delay tick will be VIRTUALLY 10,9,..,1,0 for taskDelay(10).
The task delay tick will be VIRTUALLY -10,-11,...,-2147483648,2147483647,...,1,0 for taskDelay(-10).