I'm trying to build a quadcopter flight controller from scratch using an ESP32, BNO055 IMU, and Turnigy TGY-i6 Transmitter/Receiver. So far I've been able to get everything to work together and give me reasonable data using PID control loops, filters, and various libraries. Previously I was using an Arduino Nano in place of the ESP32, because it was what I had at the time and it seemed to work well enough.
After doing some (very rough) flight testing, I came to the conclusion that the code was not executing fast enough on the Nano to sustainably keep the quadcopter in the air. I decided to measure the loop time by recording timestamps t1 and t2 at the beginning and end of the loop and taking the difference, which gives the time it takes to execute one iteration of the code. With the Nano, it took roughly 80,000 microseconds, with little deviation.
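For reference, that measurement looks roughly like this (a minimal sketch around micros(); the actual flight code is omitted and the names are just illustrative):

unsigned long t1, t2;

void setup() {
  Serial.begin(115200);
}

void loop() {
  t1 = micros();            // timestamp at the start of the control loop

  // ... read the IMU, run the PID loops, write the ESC outputs ...

  t2 = micros();            // timestamp at the end of the loop
  Serial.println(t2 - t1);  // execution time of one iteration, in microseconds
}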
Seeing this, I bought the ESP32, knowing that it has a much faster CPU clock speed and better overall performance than the Nano. After getting the code to work on the ESP32 with a few changes, I ran the speed test again and got the same 80 ms. I was a bit confused at first, so I tried to isolate the problem by taking out chunks of the code to see if the loop time would change - and it did not. After reading about some specific inefficiencies of the Arduino IDE (such as digitalWrite()), I kept removing specific parts of the code, but no difference showed up as I kept measuring the loop time. It stayed at roughly 80,000 microseconds no matter what changes I made.
This leads me to believe there is something quite important in my code that I'm unaware of which is causing it to run so slowly. Ideally, I would like to lower the loop time to at most 10 ms so that the quadcopter can auto-level with little to no oscillation.
I've used Arduino for a few years now, but I am by no means an expert, so any help with optimizing the code and/or solving this odd problem would be very much appreciated. Thank you.
Note: I have a suspicion that it may be something related to the IMU (BNO055) because that is one part of the hardware and software I know very little about.
These are the libraries I'm using:
#include "ESC.h" //ESC
#include <ESP32Servo.h> //ESC
#include <Wire.h>
#include <Adafruit_Sensor.h> //IMU
#include <Adafruit_BNO055.h> //IMU
#include <utility/imumaths.h> //IMU
#include <PID_v1.h> //PID
Here is the rest of the code.
After reviewing the code and segmenting it further, I discovered the problem was coming from the pulseIn() functions I use to read the PPM signal from the receiver.
From the documentation, the function works like this:
Reads a pulse (either HIGH or LOW) on a pin. For example, if value is HIGH, pulseIn() waits for the pin to go from LOW to HIGH, starts timing, then waits for the pin to go LOW and stops timing. Returns the length of the pulse in microseconds or gives up and returns 0 if no complete pulse was received within the timeout. Link
Looking back, this clearly causes a delay in the code.
Thank you to dimich for putting me on the right path to solve the problem.
Edit: How you can solve this:
Rather than use pulseIn(), you can use interrupts to solve the problem. I did this by attaching an interrupt to the pin the PPM signal comes in on and listening for a CHANGE on it. You then check whether the pin is HIGH or LOW and either start or stop your timer. This avoids blocking delays in the program and still gives a clean reading of the signal.
Note: Some receivers output a PPM sum on one pin rather than individual channels on separate pins. On my receiver I use 4 different channels and 4 different GPIO pins, one for each axis plus throttle.
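For a single channel, the interrupt version looks roughly like this (a minimal sketch; the GPIO number and names are just examples, not my exact code):

volatile unsigned long riseTime = 0;
volatile unsigned long pulseWidth = 0;       // last measured pulse width, in microseconds

void IRAM_ATTR onReceiverChange() {
  if (digitalRead(34) == HIGH) {
    riseTime = micros();                     // pin went HIGH: start timing
  } else {
    pulseWidth = micros() - riseTime;        // pin went LOW: stop timing
  }
}

void setup() {
  pinMode(34, INPUT);
  attachInterrupt(digitalPinToInterrupt(34), onReceiverChange, CHANGE);
}

void loop() {
  // Use pulseWidth here (roughly 1000-2000 us per channel) instead of calling pulseIn().
}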
I have an STM32F4 running FreeRTOS, with LwIP as the networking library. I want to know how many cycles the STM32F4 needs, so I use the DWT to measure that. When I ping the STM32F4 it shows me around 3000 cycles, but after 3-5 pings it shows around 6000 cycles, and after that it shows 3000 cycles again. This happens repeatedly. Why does this happen? I'm just curious.
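For context, using the DWT cycle counter typically looks something like this (a minimal sketch built on the standard CMSIS definitions; not necessarily the exact code in question):

#include "stm32f4xx.h"                               // CMSIS device header: provides DWT, CoreDebug

static void dwt_init(void)
{
    CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  // enable the trace/DWT block
    DWT->CYCCNT = 0;                                 // reset the cycle counter
    DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            // start counting core clock cycles
}

static uint32_t measure_cycles(void (*fn)(void))
{
    uint32_t start = DWT->CYCCNT;
    fn();                                            // the code being measured
    return DWT->CYCCNT - start;                      // elapsed core clock cycles
}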
Regards
I don't fully understand what you are asking because you are using 'cycles' without defining what you mean, or even what you are measuring, and provide no context information at all. However, if you are using FreeRTOS then you can use FreeRTOS+Trace to visualise the execution pattern of your application.
I am using an MSP430F5418 with IAR Embedded Workbench 5.10.
A graphical LCD (ST7565R) is connected to the MSP430 over SPI.
The MSP430 is the SPI master, using 8-bit, MSB-first mode clocked from SMCLK.
Normally we have to check the busy bit before transferring a byte using SPI, right?
But for my case, even if I send data continuously without checking the busy bit, it works fine and I can view the display data correctly.
Can anybody explain why it works?
Is there any need to check the busy bit, or is it safe not to?
Thank you,
Your software is probably slow enough that the SPI transaction completes every time. If you can verify that that is the case, and always will be the case, then you can argue not to add even more code to do the check. Removing the code that does the check might speed up your routine just enough to be too fast for the SPI interface and cause collisions.
In general you should make sure one thing finishes before another starts, and the way you make sure can be a hardware feature, analysis, or experiment. If the hardware has the feature and you somehow determine you don't need the check, it is still a good idea to do a performance test with and without the check. If the performance is not critical, or there isn't much difference, it is probably safer to leave the check in: somewhere down the road, even if your code is heavily commented with warnings, a compiler or code change might be just enough to make it stop working without the check.
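For reference, the check being discussed usually looks something like this on the F5xx USCI (a minimal sketch assuming the display hangs off USCI_B0; verify the register and bit names against the msp430f5418 header):

#include <msp430.h>

// Send one byte over SPI with the busy/ready checks in place.
void spi_send_byte(unsigned char data)
{
    while (!(UCB0IFG & UCTXIFG));   // wait until the TX buffer can accept a new byte
    UCB0TXBUF = data;               // load the byte; the USCI starts shifting it out
    while (UCB0STAT & UCBUSY);      // optionally wait until the bus is completely idle
}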
I have an embedded device (Technologic TS-7800) that advertises real-time capabilities, but says nothing about 'hard' or 'soft'. While I wait for a response from the manufacturer, I figured it wouldn't hurt to test the system myself.
What are some established procedures to determine the 'hardness' of a particular device with respect to real time/deterministic behavior (latency and jitter)?
Being at college, I have access to some pretty neat hardware (good oscilloscopes and signal generators), so I don't think I'll run into any issues in terms of testing equipment, just expertise.
With that kind of equipment, it ought to be fairly easy to sync the o-scope to a steady clock, produce a spike each time the real-time system produces an output, and see how much that spike varies from center. The less the variation, the greater the hardness.
To clarify Bob's answer a bit:
1. Use the signal generator to generate a pulse at some varying frequency; a random distribution across some range would be best.
2. Use the signal generator's trigger signal to start the scope.
3. The RTOS has to respond, do its thing, and send an output pulse.
4. Feed the RTOS output into input 2 of the scope.
5. Set the scope to persist/collect mode.
6. Set the scope to start on A, stop on B, if you can.
7. In an ideal world, get it to measure the distribution for you. A LeCroy would.
8. Start with a much slower trace than you would expect; you need to be able to see slow outliers.
You'll be able to see the distribution.
Assuming a normal distribution, the standard deviation of the response-time variation is the SOFTNESS.
(This won't really happen in practice, but if you don't get outliers it is reasonably useful.)
If there are outliers of large latency, then the RTOS is NOT very hard: it does not meet deadlines well and is unsuitable for hard real-time work.
Many RTOS-like things have a good left edge to the curve, sloping down like a 1/f curve.
That's indicative of combined jitters. The thing to look out for is spikes of slow response at the right end of the scope. Keep repeating the experiment with faster traces if there are no outliers, to get a good image of the slope. That should be good for some speculative conclusions in your paper.
If, for your application, a delta of say 1 µs is okay and you measure 0.5 µs, it's all cool.
Anyway, you can publish the results (perhaps even in the academic-publishing sense, but certainly on the web).
Link from this Question to the paper when you've written it.
Hard real-time has more to do with how your software works than with the hardware on its own. When asking whether something is hard real-time, the question must be applied to the complete system (hardware, RTOS, and application). This means hard vs. soft real-time is a system design issue.
Under loading that exceeds the specification, even a hard real-time system will fail (hopefully with proper failure indication), while a soft real-time system with low loading could give hard real-time results. How much processing must happen on time, and how much pre-/post-processing can be performed, is the real key to hard vs. soft real-time.
In some real-time applications some data loss is not a failure; it just needs to stay below a certain level, which is again a system criterion.
You can generate inputs to the board and have a small application count them and check at what level data is going to be lost. But that gives you a rating specific to that system running that application. As soon as you start doing more processing your computational load increases and you now have a different hard real-time limit.
This board, running a bare-bones scheduler, will give great, predictable hard real-time performance for most tasks.
Running a full RTOS with a heavy computational load, you will probably only get soft real-time.
Edit after comment
The easiest and most efficient way I have used to measure my software's performance (assuming you use a scheduler) is to use a free-running hardware timer on the board and time-stamp the start and end of my cycle, or, if you run a full RTOS, to time-stamp your acquisition and transition. Save your maximum time and keep a running average of the values over a second. If your average is around 50% (of the available cycle time) and your maximum is within 20% of your average, you are OK. If not, it is time to refactor your application. As your application grows, the cycle time will grow, so you can monitor the effect of all your software changes on your cycle time.
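As a rough illustration of that first approach (a sketch only; read_timer_us() and do_cycle_work() are placeholders for the board's free-running timer and the application work, not real APIs):

#include <stdint.h>

extern uint32_t read_timer_us(void);   // placeholder: read the free-running hardware timer
extern void do_cycle_work(void);       // placeholder: acquisition, processing, output

static uint32_t max_us;                // worst cycle time seen so far
static uint64_t sum_us;                // accumulator for the running average
static uint32_t samples;

void run_one_cycle(void)
{
    uint32_t start = read_timer_us();

    do_cycle_work();

    uint32_t elapsed = read_timer_us() - start;
    if (elapsed > max_us) max_us = elapsed;
    sum_us += elapsed;
    samples++;

    // Once per second, report sum_us / samples as the average cycle time,
    // compare max_us against it, then reset the accumulators.
}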
Another way is to use a hardware timer to generate a cyclical interrupt. If you are in time, reset the interrupt. If you miss the deadline, have the interrupt handler signal a failure. This, however, will only warn you once your application is already taking too long, but it relies on hardware and interrupts, so you can't miss it.
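A sketch of this second, watchdog-style approach (the timer setup itself is hardware-specific and omitted; timer_isr() is assumed to be hooked to the cyclical hardware timer interrupt, and signal_deadline_miss() is a placeholder):

#include <stdint.h>

extern void do_cycle_work(void);          // placeholder for the real-time work
extern void signal_deadline_miss(void);   // placeholder: log, flag, or assert an error output

static volatile uint8_t cycle_done;       // set by the main loop when it finishes in time

void timer_isr(void)                      // fires once per deadline period
{
    if (!cycle_done)
        signal_deadline_miss();           // the loop did not finish before the deadline
    cycle_done = 0;                       // re-arm the check for the next period
}

void main_loop_iteration(void)
{
    do_cycle_work();
    cycle_done = 1;                       // "reset the interrupt": deadline met this period
}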
These solutions also eliminate the need to hook up a scope to monitor the output, since the timing information can be displayed in any kind of terminal by a background task. If it is easy to monitor, you will monitor it regularly, so you catch timing problems as soon as they are introduced rather than at the end.
Hope this helps
I have the same board here at work. It's a slightly-modified 2.6 Kernel, I believe... not the real-time version.
I don't know that I've read anything in the docs yet that indicates that it is meant for strict RTOS work.
I think that this is not a hard real-time device, since it runs no RTOS.
I understand being a geek, but using an oscilloscope to test a computer with Ethernet/USB/other digital ports and a HUGE internal state (RAM) is both ineffective and unreliable.
Instead of watching waveforms, you can connect any PC to the output port and run a proper statistical analysis.
The established procedure (if the input signal is analog by nature) is to test the system against several characteristic inputs - traditionally spikes, step functions, and sine waves of different frequencies - and measure the phase shift and variance for each input type. The worst case is then used in the specifications of the system.
Again, if you are using standard ports, you can easily generate those on a PC. If the input is truly analog, a separate DAC or simply a good sound card would be needed.
Now, that won't say anything about the OS being real-time - it could be running vanilla Linux or even Win CE and still produce good, stable results in those tests if the hardware is fast enough.
So, you need to simulate heavy and varying loads on the processor, memory, and all ports, let it heat up and eat memory for a few hours, and then repeat the tests. If the latency stays constant, it's hard real-time. If it varies but, under any load and input signal type, does not increase above an acceptable limit, it's soft. Otherwise, it's advertisement.
P.S.: The implication is that even for critical systems you don't actually need hard real-time if the hardware is fast enough.
Currently I am testing some RTL using ncverilog, and it is very... very slow. I have heard that if we use some kind of FPGA board, things will be faster. Is that true?
You're talking about two different things.
NCVerilog is a simulation tool while an FPGA board is real hardware. So, there will be differences. Real hardware will be generally faster but with a simulator, you can have all sorts of debugging fun. Trying to probe a specific signal is just a matter of adding a line to the testbench. Also, you can easily make changes to the simulated model instead of having to redesign the FPGA board.
If you run simulation on a sufficiently powerful machine, you can sometimes approximate real-world performance (assuming that the FPGA is a slow one).
All in all, you should do both. Use a simulator to do your basic development and evaluation. Move onto your FPGA hardware once your design is sufficiently well defined.
We've had the same issues with simulation speed too. However, we stick with simulations for the majority of our verification. Each sim checks a specific function and is much quicker than a system-level sim. We've also made them self-checking, which makes them useful as regression tests (unit tests).
For long system tests on real-world signals that take too much time to simulate, we move these to the FPGA if we can. We need to manually re-check all these testcases again after code changes, so it can be slow in its own way.
Sometimes though, FPGAing a design is just not feasible. Sometimes full designs are too large to fit into an FPGA, or the clock rate is too high. But remember that you don't necessarily have to FPGA your entire design, it may be enough to get the important block you're interested in and check this out fully.
You can trace activity on signals in a running FPGA design using "embedded logic analyzer" software tools like Altera SignalTap or Xilinx ChipScope. Before synthesizing/mapping your RTL to the device, you would use these tools to attach soft probes to the signals you want to watch. You can set triggers so that a signal's values only get logged under certain conditions. Then you generate the bitfile and program the device with JTAG. The logic analyzer communicates with your PC over JTAG and logs activity on your probes, which you can then analyze.
It's a bit complicated to set up, as these tools are not especially easy to use, but you will get results much faster than with RTL simulation.
What kind of RTL are you testing? If you use FPGA boards, then you can compile your code, provided you have the right tool for the right FPGA. Since FPGAs are reprogrammable, you can of course test your code on the board and have the target (FPGA) execute your code (RTL).
But it is no longer a simulation; it is a test, with given hardware, at a given clock speed.
And you don't get nice results on the screen; you need to use physical probes and a scope. Plus, you don't get to see how the internals of your code are working.
Verilog or VHDL simulation is sort of like running code in a debugger, while FPGA testing is more like debugging with printf. The big difference is that when simulating, your CPU has to simulate the behaviour of all the logic gates that result from your code. On the FPGA there is no simulation; you just 'run' the code, so it is much faster, but you have less information.
You should use simulation for very small components, and then test your whole design on an FPGA.
You're probably asking about hardware simulation accelerators.
Here is one of them: GateRocket.