How to use perf_event_open in sampling mode to read the branch stack?

I am using perf_event_open() in sampling mode to sample the branch stack, but the call fails and I don't know why.
attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK
If I don't set PERF_SAMPLE_BRANCH_STACK in attr.sample_type, everything works fine.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

static int perf_event_open(struct perf_event_attr *attr,
                           pid_t pid, int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
    pid_t pid = 0;

    // create a perf fd
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(struct perf_event_attr));
    attr.size = sizeof(struct perf_event_attr);
    // disabled at init time
    attr.disabled = 1;
    // which event to count
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
    // how many events before a sample is taken
    attr.sample_period = 1000000;
    // what to record in each sample
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
    // notify on every overflow
    attr.wakeup_events = 1;
    attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY_RETURN;

    // open perf fd
    int perf_fd = perf_event_open(&attr, pid, -1, -1, 0);
    if (perf_fd < 0)
    {
        perror("perf_event_open() failed!");
        return errno;
    }
    // ... (rest of the program omitted in the question)
    return 0;
}
The call fails with the error: Operation not supported.

I can think of three reasons why that error would occur in your case:
You're running the code on an IBM POWER processor. On these processors PERF_SAMPLE_BRANCH_STACK is supported and some of the branch filters are supported in hardware, but PERF_SAMPLE_BRANCH_ANY_RETURN is not supported on any of the current POWER processors. You said that the code works fine after removing PERF_SAMPLE_BRANCH_STACK, but that doesn't tell us whether the problem comes from PERF_SAMPLE_BRANCH_STACK or from PERF_SAMPLE_BRANCH_ANY_RETURN.
You're running the code on a hypervisor (e.g., KVM). Most hypervisors (if not all) don't virtualize branch sampling. Yet the host processor may actually support branch sampling and maybe even the ANY_RETURN filter.
The processor doesn't support the branch sampling feature at all. This includes Intel processors older than the Pentium 4.
Not all Intel processors support the ANY_RETURN filter in hardware. This filter is supported starting with Core2. However, on Intel processors, for branch filters that are not supported in the hardware, Linux provides software filtering, so PERF_SAMPLE_BRANCH_ANY_RETURN should still work on these processors.
There could be other reasons that I have missed.
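One way to narrow it down (a minimal sketch, untested): open the same event again, but with branch_sample_type set to PERF_SAMPLE_BRANCH_ANY, i.e. without the return filter. If that succeeds, the ANY_RETURN filter is the culprit; if it still fails with EOPNOTSUPP, branch stack sampling as a whole is unavailable in your environment.

// Sketch: probe whether the failure comes from the ANY_RETURN filter or
// from branch stack sampling itself. Reuses the attr and pid set up in the
// question's code.
attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY;   // no return filter
int probe_fd = perf_event_open(&attr, pid, -1, -1, 0);
if (probe_fd < 0)
    perror("branch stack sampling unsupported even without the filter");
else
    fprintf(stderr, "branch stack works; PERF_SAMPLE_BRANCH_ANY_RETURN is the problem\n");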

error : Operation not supported
The perf_event_open() manual page says about this error:
EOPNOTSUPP
Returned if an event requiring a specific hardware feature is requested but there is no hardware support. This includes requesting low-skid events if not supported, branch tracing if it is not available, sampling if no PMU interrupt is available, and branch stacks for software events.
And about PERF_SAMPLE_BRANCH_STACK it says:
PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
This provides a record of recent branches, as provided by CPU branch sampling hardware (such as Intel Last Branch Record). Not all hardware supports this feature.
So it looks like your hardware doesn't support this.
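For what it's worth, once the event does open (on hardware that supports it), the title's question of how to actually read the branch stack is answered by the sample layout in perf_event_open(2): each PERF_RECORD_SAMPLE in the mmap'd ring buffer then carries the IP followed by the branch entries, roughly like this (a layout sketch, not a complete ring-buffer reader):

#include <linux/perf_event.h>
#include <stdint.h>

// Layout of one PERF_RECORD_SAMPLE for
// sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK
// (see perf_event_open(2)).
struct branch_sample {
    struct perf_event_header header;  // header.type == PERF_RECORD_SAMPLE
    uint64_t ip;                      // PERF_SAMPLE_IP
    uint64_t bnr;                     // number of branch entries that follow
    struct perf_branch_entry lbr[];   // bnr entries: .from, .to, .mispred, ...
};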


Is there a way to synchronize custom interrupt signals with AXI master transactions in Vitis HLS?

I have been unable to find an answer, possibly because I could not put specific enough nomenclature on the processes involved.
I use Vitis HLS to synthesize designs where one call of the main function is one clock cycle long, pipelined of course. This works fine for almost all of our cases. Where this is not possible (i.e. for components where we need to guarantee certain latencies / pipelining depths) I use Verilog.
The goal is to transfer data via DMA to a Zynq-7000's memory and THEN issue an interrupt to let the PS know that the DMA transfer is finished.
Suppose I have a Vitis HLS project, where the PS can initiate a DMA transfer of uint32s using (a rising edge on a signal in) an s_axilite interface to my component, like in the code below:
#include <cstdint>

void Example
(
    uint32_t *dmaRegion,
    bool &intrSig,
    volatile bool writeNow
)
{
#pragma HLS PIPELINE II=1
#pragma HLS INLINE RECURSIVE
#pragma HLS INTERFACE s_axilite port=return bundle=registers
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS INTERFACE m_axi port=dmaRegion offset=slave bundle=x
#pragma HLS INTERFACE s_axilite port=dmaRegion bundle=registers
#pragma HLS INTERFACE ap_none port=dmaRegion
#pragma HLS INTERFACE s_axilite port=writeNow bundle=registers
#pragma HLS INTERFACE ap_none port=writeNow
#pragma HLS INTERFACE ap_none port=intrSig

    static bool lastWriteNow { false };
    static uint32_t Ctr { 0 };

    bool intr = false;
    if (!lastWriteNow && writeNow)
    {
        Ctr++;
        dmaRegion[10] = Ctr;
        intr = true;
    }
    intrSig = intr;
    lastWriteNow = writeNow;
}
Now, this seems to work fine and cause a 1-clock-cycle-pulse interrupt as long as WREADY is driven high by the Zynq (and through a SmartConnect to my component) and I have found some examples where this is done this way. Also, the PS grabs the correct data from the DDR memory (L2 Data Cache has been disabled for this memory region) directly after the interrupt.
However, what will happen if, for example, more AXI masters try to drive the SmartConnect and cause congestion, effectively making WREADY go low for this component? In my tests, where I drove the WREADY signal of the AXI SmartConnect master interface to a constant zero to simulate (permanent) congestion, the interrupt signal (and WVALID) was driven to a permanent high, which would mean... what? That the HLS design blocked inside the if clause? I do not quite get it, as it seems to me that this would contradict the II=1 constraint (which Vitis HLS reports as satisfied).
In a way it makes sense, of course, since WVALID must go high when data is available and must stay high until WREADY is high as well. But why the interrupt line goes (and stays) high even though the transaction has not finished yet evades me.
Is this at all possible with any guarantees about the m_axi interface, or will I have to find other solutions?
Any hint and information (especially background information about that behaviour) is very much appreciated.
Edit:
For example, this works fine:
but this causes the interrupt to stay high forever:
Of course, the transaction cannot finish. But it seems I have no way of unblocking the design so long as the AXI bus is congested.
Vitis Scheduler view
When I compile your code and look at the schedule view, this is the result:
What I understand is that there is a phi node (a term borrowed from LLVM), which means that the value of intrSig can't be set before the AXI4 write response has finished. Since this is then converted into RTL, the signal must always carry some value, and if it goes high while there is congestion on the AXI4 bus, it will stay high until the AXI transaction has finished.
HLS craziness
I tried to look into the generated HDL, without much luck. I only got an intuition, which I'll try to share:
The red wires are the ones that eventually drive the intrSig signal. The flip flop is driven to 1 through the SET port, and to 0 by the RST port.
Long way to intrSig from this FF, but it eventually gets there:
The SET signal is driven by combinatorial logic using writeNow:
And lastly, WREADY takes a long path, but it feeds into the pipeline chain of registers that eventually drives intrSig.
Is this proof of what is happening? Unfortunately no, but there are some hints that the outcome of the m_axi transaction stops the interrupt pipeline from advancing.
Some debugging hints
I don't know whether forcing WREADY low actually simulates congestion; an AXI write transaction starts with the AWVALID/AWREADY handshake, and I would expect a congested interconnect to stop accepting transactions already at the address phase.
Also, I would instantiate your IP alone and attach the AXI Verification IP (AXI VIP) that Xilinx provides in Vivado; it is programmed in SystemVerilog, can generate the stimulus you want, and records all the traffic. You will also be able to look at all the waveforms and pinpoint where your issues are.
You can have your IP write into one of these AXI VIPs configured in slave mode, or you can write to a BRAM.
I'll leave here some documentation.

How to best convert legacy polling embedded firmware architecture into an event driven one?

I've got a family of embedded products running a typical main-loop based firmware with 150k+ lines of code. A load of complex, timing-critical features is realized by a combination of hardware interrupt handlers, timer polling and protothreads (think co-routines). Under the hood, protothreads are still polling and are "only" syntactic sugar to mimic pseudo-parallel scheduling of multiple threads (endless loops). I add bugfixes and extensions to the firmware all the time. There are about 30k devices of about 7 slightly different hardware types and versions out in the field.
For a new product-family member I need to integrate an external FreeRTOS-based project; this applies to the new product only, while all older products still need to get further features and improvements.
In order not to have to port the whole complex legacy firmware to FreeRTOS, with all the risk of breaking perfectly fine products, I plan to let the old firmware run inside a single FreeRTOS task. On the older products the firmware shall still run without FreeRTOS. Inside the FreeRTOS task the legacy firmware would consume all available processor time, because its underlying implementation scheme is polling based. Since it uses protothreads (timer and hardware polling behind the scenes) and polls a free-running processor counter register, I hope that I can convert the polling into an event-driven behavior.
Here are two examples:
// first example: do something every 100 ms
if (GET_TICK_COUNT() - start > MS(100))
{
    start = GET_TICK_COUNT();
    // do something every 100 ms
}
// second example: wait for hardware event
setup_hardware();
PT_WAIT_UNTIL(hardware_ready(), pt);
// hardware is ready, do something else
So I got the impression that I can convert these two programming patterns (e.g. through macro magic and FreeRTOS functionality) into an underlying event-based scheme, for example along the lines of the sketch below.
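For illustration, a hypothetical (untested) sketch of the kind of macro redirection I have in mind for the tick-count pattern; the build switch LEGACY_ON_FREERTOS and the mapping are my own invention, and the PT_WAIT_UNTIL case is clearly harder because it must not block the whole task:

#include "FreeRTOS.h"
#include "task.h"

// Hypothetical build switch: on the new product the legacy timing macros are
// redirected to the FreeRTOS tick; the old products keep their original
// definitions and keep running the plain main loop.
#ifdef LEGACY_ON_FREERTOS
  #define GET_TICK_COUNT()  xTaskGetTickCount()
  #define MS(ms)            pdMS_TO_TICKS(ms)
#endif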
So now my question: has anybody done such a thing already? Are there patterns and/or best practices to follow?
[Update]
Thanks for the detailed responses. Let me comment on some details: my need is to combine a multithreading-simulating legacy firmware (using the co-routine implementation protothreads) with a FreeRTOS-based project composed of a couple of interacting FreeRTOS tasks. The idea is to let the complete old firmware run in its own RTOS task beside the other, new tasks. I'm aware of RTOS principles and patterns (pre-emption, resource sharing, blocking operations, signals, semaphores, mutexes, mailboxes, task priorities and so forth), and I have planned to base the interaction of the old and new parts exactly on these mechanisms.
What I'm asking for is: 1) ideas how to convert the legacy firmware (150k+ LOC) in a semi-automated way, so that the busy-waiting / polling schemes I presented above either use the new mechanisms when run inside the RTOS task, or just work the old way when built and run as the current main-loop kind of firmware. A complete rewrite / full port of the legacy code is not an option. 2) More ideas how to teach the old firmware implementation, which is used to having the full CPU at its disposal, to behave nicely inside its new prison of an RTOS task and not simply consume all available CPU cycles (when given the highest priority) or introduce large real-time latencies when not run at the highest RTOS priority.
I guess nobody here has already done such a special task, so I will just have to do the hard work and solve the described issues one after another.
In an RTOS you create and run tasks. If you are not running more than one task, then there is little advantage in using an RTOS.
I don't use FreeRTOS (but have done), but the following applies to any RTOS and is pseudo-code rather than FreeRTOS-API specific - many details such as task priorities and stack allocation are deliberately missing.
First, in most simple RTOSes, including FreeRTOS, main() is used for hardware initialisation, task creation, and scheduler start:
int main( void )
{
    // Necessary h/w & kernel initialisation
    initHardware() ;
    initKernel() ;

    // Create tasks
    createTask( task1 ) ;
    createTask( task2 ) ;

    // Start scheduling
    schedulerStart() ;

    // schedulerStart should not normally return
    return 0 ;
}
Now let us assume that your first example is implemented in task1. A typical RTOS will have both timer and delay functions. The simplest to use is a delay, and this is suitable when the periodic processing is guaranteed to take less than one OS tick period:
void task1()
{
    // do something every 100 ms
    for(;;)
    {
        delay( 100 ) ; // assuming 1ms tick period
        // something
        ...
    }
}
If the something takes more than 1ms in this case, it will not be executed every 100ms, but 100ms plus the something execution time, which may itself be variable or non-deterministic leading to undesirable timing jitter. In that case you should use a timer:
void task1()
{
    // do something every 100 ms
    TIMER timer = createTimer( 100 ) ; // assuming 1ms tick period
    for(;;)
    {
        timerWait( timer ) ;
        // something
        ...
    }
}
That way something can take up to 100ms and will still be executed accurately and deterministically every 100ms.
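In FreeRTOS specifically, this periodic pattern is usually written with vTaskDelayUntil(), which computes each wake-up from the previous one rather than from "now" (a sketch, not tested):

#include "FreeRTOS.h"
#include "task.h"

void task1( void *pvParameters )
{
    TickType_t xLastWake = xTaskGetTickCount() ;

    // do something every 100 ms, without accumulating drift
    for( ;; )
    {
        vTaskDelayUntil( &xLastWake, pdMS_TO_TICKS( 100 ) ) ;
        // something (may take up to ~100 ms without shifting the period)
    }
}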
Now to your second example; that is a little more complex. If nothing useful can happen until the hardware is initialised, then you may as well use your existing pattern in main() before starting the scheduler. However as a generalisation, waiting for something in a different context (task or interrupt) to occur is done using a synchronisation primitive such as a semaphore or task event flag (not all RTOS have task event flags). So in a simple case in main() you might create a semaphore:
createSemaphore( hardware_ready ) ;
Then in the context performing the process that must complete:
// Init hardware
...
// Tell waiting task hardware ready
semaphoreGive( hardware_ready ) ;
Then in some task that will wait for the hardware to be ready:
void task2()
{
    // wait for hardware ready
    semaphoreTake( hardware_ready ) ;

    // do something else
    for(;;)
    {
        // This loop must block if any lower-priority task
        // is to run. Equal-priority tasks may run if round-robin
        // scheduling is implemented.
        ...
    }
}
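For reference, the same hand-shake with actual FreeRTOS calls might look roughly like this (a sketch, not tested):

#include "FreeRTOS.h"
#include "semphr.h"
#include "task.h"

static SemaphoreHandle_t hardware_ready ;

// In main(), before starting the scheduler:
//     hardware_ready = xSemaphoreCreateBinary() ;
// In the context that initialises the hardware:
//     xSemaphoreGive( hardware_ready ) ;

void task2( void *pvParameters )
{
    // wait (blocking) until the hardware is ready
    xSemaphoreTake( hardware_ready, portMAX_DELAY ) ;

    // do something else
    for( ;; )
    {
        // ...
    }
}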
You are facing two big gotchas...
Since the old code is implemented using protothreads (coroutines), there is never any asynchronous resource contention between them. If you split them into FreeRTOS tasks, there will be preemptive task switches; these switches may occur at places the protothreads were not expecting, leaving data or other resources in an inconsistent state.
If you convert any of your protothreads' PT_WAIT calls into real waits in FreeRTOS, the call will really block. But the protothreads assume that other protothreads continue while they're blocked.
So, #1 implies you cannot just convert protothreads to tasks, and #2 implies you must convert protothreads to tasks (if you use FreeRTOS blocking primitives, like xEventGroupWaitBits()).
The most straightforward approach will be to put all your protothreads in one task, and continue polling within that task.
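A minimal sketch of that approach; legacy_superloop_iteration() is a hypothetical name for one pass over the existing protothreads/polling code:

#include "FreeRTOS.h"
#include "task.h"

// Hypothetical: one pass over all protothreads and polling code of the
// legacy firmware, exactly as one iteration of the old main loop.
extern void legacy_superloop_iteration( void ) ;

static void vLegacyTask( void *pvParameters )
{
    for( ;; )
    {
        legacy_superloop_iteration() ;
        // Give up the CPU for at least one tick so that lower-priority
        // FreeRTOS tasks are not starved by the polling loop.
        vTaskDelay( 1 ) ;
    }
}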

Why doesn't the MPC5554 board require an RTOS? Does it come with a built-in OS?

I was looking at the reference manual for the MPC5554 board, and there was no mention of any operating system (kernel) being used. Applications can be run on this board without any external OS.
I understand that an RTOS does memory management and task scheduling, so are these functions done by the MPC5554's built-in firmware?
There are vendors of RTOSes for this board, so I wonder in what applications they would be needed.
Is an RTOS meant to be just another abstraction above the board-level implementation?
And if we put an RTOS on top, wouldn't that conflict with the built-in OS?
There is no built-in OS - why would you assume that?
Many embedded applications run bare-metal with no OS (RTOS or otherwise), but in any event the choice of RTOS is a developer decision not a board manufacturer's decision.
An RTOS fundamentally provides scheduling, synchronisation, inter-process communication and timing services. Memory management may be provided for devices with an MMU, but that is not a given. A bare-metal application can establish a C run-time environment and boot to main() with no scheduling or IPC etc. In most simple RTOSes the system boots to main(), where the RTOS is initialised and started, rather than the OS starting main() as would happen in a GPOS.
A board manufacturer may provide a board support package for one or more specific RTOSes, but equally the BSP (or HAL or driver library) may comprise bare-metal or RTOS-independent device drivers only. Typically it is for the developer to integrate an RTOS, device drivers and middleware (such as filesystems and networking) etc., and these may come from a single vendor or from multiple vendors. You have to understand that many (or perhaps most) developers will be designing their own boards around such a microcontroller rather than using COTS hardware, so there can be no one-size-fits-all solution; embedded development tends instead to be a kit-of-parts approach.
Clifford already hit the nail on the head. Depending on the complexity of the task you wish to accomplish, you may want to reconsider what exactly it is you need. If your only interest in an RTOS is based on the "real-time" part, then it may well be easier (and cheaper) to create your own little interrupt-driven bare metal application. Generally speaking, I find that if performance constraints are your main concern, then less abstraction is better.
Based on the way you phrased your question, I'm going to assume that you are looking at an eval board or dev kit with the MPC5554 as its micro, in which case you might already have some basic startup code that does things like setup the memory controller and a few peripherals (most dev kits or IDEs come with some amount of sample code you can reuse).
A simple application might do the following things:
initialize the runtime environment (MMU, INTC, FMPLL)
initialize the peripheral devices you wish to use, ex. ADC, GPIO, SPI etc. (this can get pretty complex if you have any external peripherals that require the EBI)
initialize the eMIOS to generate a timed interrupt with high priority (ie. your main processing task loop)
The way I've seen this typically work is that once all of the above initialization is complete, your main application thread runs into an infinite loop that you can use as a background task to do garbage collection or some non-time-critical fault detection. Then, in the ISR for the timed interrupt you created, you implement a basic scheduler to do your driver-level processing (ex. trigger/read ADC, read/write GPIOs, initiate any IPC transactions etc.). An extremely basic concept might look like this:
void fastTaskISR()
{
    static uint8 frameCount = 1;
    ...
    switch(frameCount)
    {
        case 1:
            //Task 1
            break;
        case 2:
            //Task 2
            break;
        case 3:
            //Task 3
            break;
        case 4:
            //Task 4
            break;
        case 5:
            //Task 5
            break;
        default:
            //Default case for robustness. Error.
            break;
    }
    frameCount++;
    if(frameCount > 5)
    {
        frameCount = 1;
    }
    ...
}
Optionally, you can use one of the scheduled tasks to generate a software interrupt which can then be used to run some slower tasks or more complex control logic. Where I work this is a tried and true formula: an eMIOS driven "fast task" (usually between 100us and 1ms period) + a software interrupt driven "normal task" which runs the higher-level control logic, often generated from a Simulink model. Needless to say, we do a lot of reuse of the BSP and driver level code.
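To make the split concrete, here is a hypothetical sketch; TRIGGER_SW_INT() stands in for whatever write raises a software-settable INTC interrupt on your part (the exact register differs between devices, so check the MPC5554 reference manual), and the names are examples only:

#include <stdint.h>

// Hypothetical placeholder for the INTC software-interrupt set register write.
#define TRIGGER_SW_INT()   do { /* INTC software interrupt set */ } while (0)

// Called from the eMIOS-driven "fast task" once per frame: every Nth frame
// it kicks the lower-priority software interrupt that runs the "normal task".
void scheduleNormalTask(void)
{
    static uint8_t slowDivider = 0;
    if (++slowDivider >= 10)      // e.g. 10 fast frames per normal frame
    {
        slowDivider = 0;
        TRIGGER_SW_INT();
    }
}

// Software-interrupt handler: the slower, higher-level control logic,
// preemptible by the fast task because it runs at lower priority.
void normalTaskISR(void)
{
    // control logic here
}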

How to code ARM interrupt functions in C

I am using arm-none-eabi-gcc toolchain, v 4.8.2, on LinuxMint 17.2 64b.
I am, at hobbyist level, trying to play with a TM4C123G board and its usual features (coding various blinkies, uart things...) but always trying to remain as close to the metal as possible without using other libraries (eg CMSIS...) whenever possible. Also no IDE (CCS, Keil...), just Linux terminal windows, the board and I... All that mostly for education purpose.
The issue : I am stuck trying to implement the usual interrupt functions like :
EnableInt (clearing bit 0, the I bit, of the special register PRIMASK):
CPSIE I
WaitForInt :
WFI
DisableInt :
CPSID I
Eg, I added this function to my .c file for EnableInt :
void EnableInt(void)
{
    __asm(" cpsie i\n");
}
... this compiles but the execution does not seem to work properly (in the simplest blinky.c version, I cannot get any LED action once I have called EnableInt() in the C code). The blinky.c code can be found here.
What would be the proper way to write these interrupt routines in a .c file (ideally without using other libraries, but just setting/clearing bits of the appropriate registers...)?
EDIT : removed the bx lr instructions - but EnableInt() does not seem to work any better - still looking for a solution.
EDIT2 : Actually the function EnableInt(), defined as above, is now working. My SysTick_Handler was mapped incorrectly to the Interrupt Vector table in the startup file (while my original problem was the bx lr instructions which I removed in Edit1).
The ARM Cortex-M4 CPU which your Tiva MCU incorporates basically does not require the software environment to take special action on entry to or exit from an interrupt handler. The only requirement is to use the AAPCS calling standard, which should be the default with gcc when compiling for this CPU.
The CPU is supported by some tightly coupled "core" peripherals provided by ARM. These are standard for most (if not all) Cortex-M3/4 MCUs. MCU vendors can configure some features, but the basic operation is always the same.
To simplify software development, ARM has introduced the CMSIS software standard. This consists at least of some header files which unify access to the core peripherals and the use of special CPU instructions. Among those are intrinsics to manipulate the special CPU registers like PRIMASK, BASEPRI, FAULTMASK, etc. Another header provides definitions of the core peripherals and functions to manipulate some of them where a simple register access is not sufficient.
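For example, with the CMSIS headers on the include path, the three routines from the question map directly onto standard intrinsics; without CMSIS the same thing can be written as GCC inline assembly (a sketch, assuming arm-none-eabi-gcc; the device header name is an example):

#include "TM4C123GH6PM.h"   // example device header; pulls in the CMSIS core and intrinsics

void EnableInt(void)  { __enable_irq();  }   // CPSIE I - clear PRIMASK
void DisableInt(void) { __disable_irq(); }   // CPSID I - set PRIMASK
void WaitForInt(void) { __WFI();         }   // WFI

// Equivalent without CMSIS, using GCC extended asm:
static inline void EnableIntNoCmsis(void)
{
    __asm volatile ("cpsie i" ::: "memory");
}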
So, one of these peripherals supports the CPU for interrupt handling: the NVIC (Nested Vectored Interrupt Controller). It prioritises interrupts against each other and provides the interrupt vector to the CPU, which uses this vector to fetch the address of the interrupt handler.
The NVIC also includes enable bits for all interrupt sources. So, to have an interrupt processed by the CPU, on a typical MCU you have to enable the interrupt in two or three locations (a combined sketch follows this list):
PRIMASK/BASEPRI in the CPU: the last line of defense. These are the global interrupt gates. PRIMASK is similar to the interrupt-enable bit in the status register of smaller CPUs; BASEPRI is part of interrupt-priority resolution (just ignore it for the beginning).
NVIC interrupt-enable bit for each peripheral interrupt source. E.g Timer, UART, SPI, etc. Many peripherals have multiple internal sources tied to this NVIC-line. (e.g UART rx and tx interrupt).
The interrupt-enable bits in the peripheral itself. E.g. UART rx-interrupt, tx interrupt, rxerror interrupt, etc.
Some peripherals might not have internal bits, so the last one might be missing.
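Putting the three levels together for, say, a UART receive interrupt, the enable sequence might look like this sketch (UART0_IRQn, the priority value and the device header name are examples; the real names come from the vendor's CMSIS device header, and the peripheral-level bit is device specific):

#include "TM4C123GH6PM.h"   // example device header; provides UART0_IRQn and the CMSIS core

static void EnableUartRxInterrupt(void)
{
    // 1. Peripheral level: set the rx interrupt-enable bit in the UART
    //    itself (register and bit names are device specific, not shown here).

    // 2. NVIC level: prioritise and enable the UART's interrupt line.
    NVIC_SetPriority(UART0_IRQn, 5);
    NVIC_EnableIRQ(UART0_IRQn);

    // 3. CPU level: open the global gate (clear PRIMASK).
    __enable_irq();
}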
To get things working, you should read the Reference Manual (Family Guide, or similar); there is often also some "programming the Cortex-M4" howto (e.g. ST has one for the STM32 series). You should also get the documents from ARM (they are available for free download).
Finally you need the CMSIS headers from your MCU vendor (TI here). These should be tailored for your MCU. You might have to provide some #defines.
And, yes, this is quite some stuff to read. But imo it is worth the effort. Alternatively you might start with a book. There are some out there which might help you get the whole picture first (it is really hard to get that from the individual documents - yet possible).

OpenCL assembly optimization for "testing carry flag after adding"

In my OpenCL kernel, I find this:
error += y;
++y;
error += y;
// The following test may be implemented in assembly language in
// most machines by testing the carry flag after adding 'y' to
// the value of 'error' in the previous step, since 'error'
// nominally has a negative value.
if (error >= 0)
{
    error -= x;
    --x;
    error -= x;
}
Obviously, those operations could easily be optimized using some nifty assembly instructions. How can I optimize this code in OpenCL?
You don't. The OpenCL compiler decides what to do with the code, depending on the target hardware and the optimization settings, which can be set as pragmas or as parameters when building the kernel. If it is smart enough, it'll use the nifty assembly instructions for the platform on which the kernel is to be run. If not, well, it won't.
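What you can influence from the host side are the build options passed when the kernel is compiled, for example (a sketch; program and device are the already-created cl_program and cl_device_id, and error handling is omitted):

#include <CL/cl.h>

// Build with explicit optimisation options; whether the backend ends up using
// a carry-flag trick is still entirely up to the OpenCL compiler.
static cl_int build_with_options(cl_program program, cl_device_id device)
{
    const char *options = "-cl-mad-enable -cl-fast-relaxed-math";
    return clBuildProgram(program, 1, &device, options, NULL, NULL);
}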
You have to keep in mind that OpenCL is a general framework applicable to many devices, not just your standard consumer-grade processor, so going "under the hood" is not really possible due to differences in instruction sets (i.e. OpenCL is meant to be portable; if you start writing x86 opcodes in your kernel, how is it going to run on a graphics card, for instance?).
If you need absolute maximum performance on a specific device, you shouldn't be using OpenCL, IMHO.