I have an application for Zynq MPSoC (Vitis 2020.2) written in C++ using FreeRTOS V10.3.0. The application runs very well if it stops at a breakpoint once. If I disable all breakpoints, the program behaves buggily. What might be the problem?
How many ways could this happen?! It is a real-time operating system, presumably then also a real-time application. If you stop the processor you affect the timing. Without knowing the hardware, the software, where you are placing the breakpoint and the bugs that arise when free-running, it is not possible to answer your specific question. I.e. you need to debug it - there is no generic explanation as to why the intrusive action of stopping the processor "fixes" your system.
You clearly have buggy code that is affected by timing. Stopping the code does not necessarily stop peripherals and it certainly does not stop the outside world that your system interacts with. For example, when you stop on a breakpoint, the world continues and interrupts become pending (possibly several). When you resume execution, all those pending interrupts are handled and in turn issue events that make different tasks ready-to-run, so the execution path and thread scheduling order are likely to differ considerably from those of a free run.
Ultimately you are asking the wrong question; the breakpoint is not magically "fixing" your code, rather it is significantly changing the way that it runs such that some existing bug (or bugs) is hidden or avoided. The bug is still there, so the question would better focus on finding the bug than on "magical thinking".
The bugs could be at any level, but most likely are design issues with inappropriate task partitioning, priority assignment, IPC, task synchronisation or resource protection. Generally probably rather too broad to deal with in a single SO question.
Sorry, I cannot post proprietary code here. Basically, it's a Mac GUI application. The code was not designed to make use of asynchronous processing. Everything is processed on the main thread, and it's impossible to change the design overnight. Therefore, I'd rather not go with the dispatch_async(…) solution.
The context of the problem is: I have a time-consuming task that runs on the main thread. While the task is being processed, I try to update/redraw a progress bar (NSProgressIndicator) based on the task's completion percentage (from 0% to 100%). However, because the task runs on the main thread, the main thread is blocked, and any update/redraw event in the event queue has to wait until the main thread has a chance to look at it, so the progress bar is not updated/redrawn at all during the task execution.
The solution I'm thinking about is to create another app (with an .exe file) that handles the progress bar drawing. From the main app, I'll create another process and have that process execute the other app. The task's completion percentage can be sent from the main app to the other app by using the Boost inter-process message queue.
I'm hoping to hear about both advantages and disadvantages of this solution, so any thoughts will be much appreciated!
You can do that from a thread in the same process as well. Interprocess message queues still work, though any threadsafe solution would suffice.
In general, it can be worth running some non-trivial tasks out-of-process. The kernel-level process-isolation has benefits that threads can never have:
memory space separation (security)
privilege separation (the other process can potentially run in a different security context)
Therefore when dealing with untrusted inputs or unreliable third-party library code you can gain stability guarantees for the main process.
However for your purposes it sounds like severe overkill.
It might be a stupid question where the answer is "no", but is that theoretically possible? If not, I wonder why not.
And I don't know what for...
There are several different types of "interrupt handlers".
The first, the hardware IRQ handlers, are modified when the OS loads drivers and such.
The second, the software interrupt handlers, are used to call OS level services in modern OSes.
Those are the ones that have hardware support (either across the entire computer, or within the processor).
A third kind, without hardware support, are "signal handlers" (in UNIX), which are basically OS-level and relate to OS events.
The common concept between them is that the responses are programmable. The idea is that you know how you want your software/OS to respond to them, so you add the code necessary to service them. In that sense, they are "modifiable at runtime".
But there are rules as to what to do in these things. Primarily, you don't want to waste too much time handling them, because whatever you do with them prevents other interrupts (of the same or lower priority) from occurring while you're processing them. (For example, you don't want to be in the middle of handling one interrupt and get another interrupt for the same thing before you finish handling the first one, because an interrupt handler can do things that would otherwise require a lock (loading and incrementing the current or last pointers on a ring queue, for example) and would clobber the state if it re-entered.)
So, typically interrupt handlers do the least of what they need to do, and set a flag for the software to recognize that processing on that needs to be done once it gets back out of interrupt mode.
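As a rough illustration of "do the least, then let normal code finish the job", here is a minimal sketch of a single-producer, single-consumer ring queue in C. The names rx_interrupt_handler, rx_poll and RING_SIZE are made up for this example, and the handler is an ordinary function here; on real hardware it would be hooked to an interrupt vector and, depending on the CPU, might also need memory barriers.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8u  /* power of two so indices can wrap with a mask */

static volatile uint8_t  ring[RING_SIZE];
static volatile uint32_t head;  /* written only by the handler   */
static volatile uint32_t tail;  /* written only by the main loop */

/* Hypothetical handler: store the byte and get out as fast as possible. */
void rx_interrupt_handler(uint8_t byte)
{
    uint32_t next = (head + 1u) & (RING_SIZE - 1u);
    if (next != tail) {          /* queue full: drop the byte */
        ring[head] = byte;
        head = next;
    }
}

/* Called from the main loop, outside interrupt context.
 * Returns true and stores one byte in *out if data was waiting. */
bool rx_poll(uint8_t *out)
{
    if (tail == head) {
        return false;            /* nothing pending */
    }
    *out = ring[tail];
    tail = (tail + 1u) & (RING_SIZE - 1u);
    return true;
}
```

Because each index has exactly one writer, the handler never needs a lock, which is the property that re-entering the handler would break. One slot is kept empty to tell "full" from "empty", so the queue holds RING_SIZE - 1 bytes.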
Historically, DOS and other non-protected OSes allowed software to modify the interrupt tables willy-nilly. This worked out okay when people who understood how interrupts were supposed to work were programming them, but it was also easy to completely screw over the state of the system with them. This is why modern, protected OSes don't typically allow user software to modify the interrupt tables. (If you're running in kernel mode as a driver, you can do it, but it's still really not a good idea.)
But, UNIX allows for user software to change its process's signal handlers. This is typically done to allow (for example) SIGHUP to tell Apache to reload its configuration files.
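For what it's worth, a process can install its own SIGHUP handler with the POSIX sigaction call. The sketch below is not Apache's code; install_sighup_handler and take_reload_request are hypothetical helper names, and the actual "reload the configuration" step is left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* sig_atomic_t is the type the C standard allows a handler to write. */
static volatile sig_atomic_t reload_requested = 0;

static void on_sighup(int signo)
{
    (void)signo;
    reload_requested = 1;   /* only set a flag; the real work happens later */
}

/* Replace this process's SIGHUP handler. Returns 0 on success, -1 on error. */
int install_sighup_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGHUP, &sa, NULL);
}

/* Main-loop check: returns true once for each batch of pending requests. */
bool take_reload_request(void)
{
    if (!reload_requested) {
        return false;
    }
    reload_requested = 0;
    return true;
}
```

The main loop would call take_reload_request() periodically and re-read its configuration when it returns true. Only this process is affected; nothing in the system-wide interrupt table changes.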
Modifying the interrupt table that the OS uses modifies that table for all software running on the system. This is generally not something that a user running a secure OS would particularly want, if they wanted to retain security of their system.
From the Cortex-R reference manual, probably not Cortex-R specific
Asynchronous abort masking
The nature of asynchronous aborts means that they can occur while the processor is handling a different abort. If an asynchronous abort generates a new exception in such a situation, the r14_abt and SPSR_abt values are overwritten. If this occurs before the data is pushed to the stack in memory, the state information about the first abort is lost. To prevent this from happening, the CPSR contains a mask bit, the A-bit, to indicate that an asynchronous abort cannot be accepted. When the A-bit is set, any asynchronous abort that occurs is held pending by the processor until the A-bit is cleared, when the exception is actually taken. The A-bit is automatically set when abort, IRQ or FIQ exceptions are taken, and on reset. You must only clear the A-bit in an abort handler after the state information has either been stacked to memory, or is no longer required.
My question is, if I have the A bit masked since reset how can I know if an asynchronous abort is pending? Can pending external aborts be cleared without unmasking the A bit and taking the exception? Or more generally, is there advice on clearing the A bit after a reset?
Apparently something in my current boot chain has a pending external abort (but only after a hard power on). I would like to enable the external aborts, but it seems rather cumbersome to special case the first external abort in the exception code.
On a system that implements the security extensions, the Interrupt Status Register, ISR, can tell you if there's an external abort pending. Sadly this doesn't help much if you're on R4 which doesn't implement them.
Otherwise, there's nothing that I can see in the architecture to identify or deal with an abort short of taking the exception as you say. This doesn't really surprise me - in general an external abort that can be safely ignored is very much a special case.
If the bug in the system can't be fixed (is the bootloader probing devices in the wrong order, or similar?) then a workaround, however cumbersome, is the order of the day - if there's some reasonably straightforward way to tell a cold boot from a warm reset I can imagine a pretty trivial self-contained shim to handle it so the main code never needs to know.
I have an iOS app with a really nasty bug: an operation in my NSOperationQueue will for some reason hang and not finish executing, so additional operations are queued up but never execute. This in turn leads to the app not being able to perform critical functions. I have not yet been able to identify any pattern other than that it occurs on one of my co-worker's devices every week or so. Running the app from Xcode at that point does not help, as killing and relaunching the app resolves the issue for the time being. I've tried attaching the debugger to the running process and I seem to be able to see log data, but any breakpoints I add are not registering. I've added a breadcrumb trail of NSLogs to try to pinpoint where it's hanging, but this has not yet led to a resolution.
I originally described the bug in another question, which has yet to get a clear answer, I'm guessing because of the lack of info I'm able to provide about this issue.
A friend once told me that it's possible to save the entire memory stack of an app at a given moment in some form and reload that exact state of memory onto a process on a different device. Does anyone know how I can achieve that? If that's possible, the next time someone encounters that bug I can save that exact state of memory and replicate it to test all my theories of possible solutions. Or is there a different approach to tackling this? As an interim measure, do you think it would make sense to forcefully make the app crash when the app enters this state so actual users would be less confused? I have mixed feelings about this, but the user will have to kill the app from the multitask dock anyway in order to use the app again. I can check the operation queue count or create some kind of timeout code for this until I actually nail this bug.
This sounds like a deadlock caused by a very rare race condition. You also mentioned using a maxConcurrentOperationCount of 2. This means that either:
1. some operation is blocking the operation queue and waiting for main to release some lock, while main is waiting for the operation to finish
2. two operations are waiting on each other to release some lock
Option 1 seems very unlikely, since the queue allows 2 concurrent operations and both would have to be blocked for it to stall completely, unless you are using some system functions that have concurrency issues and block your whole queue instead of just one thread.
In this case my first attempt to debug would be to connect the debugger and pause execution. After that you can look at the stack traces for all threads. You should be able to find the 2 threads that are made by your operation queue, after which I would review the responsible functions to find code that might possibly wait on some lock. Make sure to take system functions into consideration.
Well, it's quite hard to solve bugs that don't crash the App but just hang a thread. If you can't find the bug by going through your code step by step and checking for possible deadlocks or race conditions, I would suggest implementing some logging.
Write your log to disk every time you add a log entry. That's not the most efficient way, but if you give a build with logging enabled to your co-worker you can pull the log from his iPhone when things go wrong. Even while the App is still running.
Make sure you log every step you take including the values of important variables around the code that you suspect of breaking the App. This way you can see what the App is doing and what the state of the App is.
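To make the "write to disk on every entry" idea concrete, here is a minimal sketch in plain C (which an iOS app can call into). The helpers log_open and log_line are hypothetical names, and the timestamp format is just an example. Note that fflush only hands the data to the operating system; if you need it forced onto storage you would add something like fsync on top, which is not shown here.

```c
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

static FILE *log_file;

/* Open (or create) the log in append mode. Returns 0 on success, -1 on error. */
int log_open(const char *path)
{
    log_file = fopen(path, "a");
    return log_file ? 0 : -1;
}

/* Append one timestamped, printf-style entry and flush it immediately,
 * so the entry is not lost in a stdio buffer if the thread hangs. */
void log_line(const char *fmt, ...)
{
    if (!log_file) {
        return;
    }
    fprintf(log_file, "[%ld] ", (long)time(NULL));

    va_list args;
    va_start(args, fmt);
    vfprintf(log_file, fmt, args);
    va_end(args);

    fputc('\n', log_file);
    fflush(log_file);
}
```

A call such as log_line("operation %d started, queue count %d", id, count) at the start and end of each operation gives you the last thing that ran before the hang. If several threads log at once, you would also want a lock around the body of log_line.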
Hope this helps a bit. I don't know about restoring the state of memory of an App, so I can't help you with that.
Note: if the App were crashing on the bug you could use some other tools, but if I understand correctly that's not the case here, is it?
I read the question describing the bug and I would try to log to disk what the currently running operations are doing. It seems the operations will hang once in a while and there is a bug in there. If you can log what methods are called while running the operation this will show you what function call will hang the App and you can start looking in there.
You didn't say this but I presume the bug occurs while a human operator is working with the app? Maybe you should add an automated mode to this app, where the app simulates the same operations that users normally do, using randomized times for starting different actions. Then you can leave the app running unattended on all your devices and increase the chances of seeing the problem.
Also, since the problem appears related to the NSOperationQueue, maybe you should subclass it so that you can add logging to the more interesting methods. For example, each time an operation is added you should log the state of the queue, since you suspect that sometimes it is getting suspended.
Also, I suggested this on your other question as well, you may want to setup an observer to get notified if the queue ever goes into a suspended state.
Good luck.
Checking assumptions here, since that never hurts: do you actually have evidence that your background threads are hanging? From what you report, the observed behavior is that the tasks you're putting in your background thread are not achieving the outcome that you expected. That doesn't necessarily indicate that the thread has hung—it might just indicate that the particular conditions meant that the thread closed due to all tasks being completed, without the tasks achieving what you wanted them to.
Addition: Given your answer in the comments, it seems to me the next step then is to use logging when an item begins to be executed in the queue so that you can identify which items it is that lead to the queue becoming blocked. Best guess is that it is a certain class of items or certain characteristics of the items if they are all of a certain class. Log enough as the first step of executing each item that you'll have a reasonable characterization of the item, and then once you get a real device that has entered this state, check the logs and see just what conditions are leading to this problem. That should enable you to reliably reproduce the problem on a device during debugging or in the simulator, to then nail it.
In other words—I would focus your attention on identifying the problematic operations first, rather than trying to identify the particular line of code where things are stalling.
In my case, start had to be overridden instead of main.
When in doubt consult https://developer.apple.com/documentation/foundation/nsoperation#1661262?language=objc for discrepancies with your implementation
When things go badly awry in embedded systems I tend to write an error to a special log file in flash and then reboot (there's not much option if, say, you run out of memory).
I realize even that can go wrong, so I try to minimize the risk (by not allocating any memory during the final write, and boosting the write process's priority).
But that relies on someone retrieving the log file. Now I was considering sending a message over the intertubes to report the error before rebooting.
On second thoughts, of course, it would be better to send that message after reboot, but it did get me to thinking...
What sort of things ought I be doing if I discover an irrecoverable error, and how can I do them as safely as possible in a system which is in an unstable state?
One strategy is to use a section of RAM that is not initialised during power-on/reboot. That can be used to store data that survives a reboot, and then when your app restarts, early on in the code it can check that memory and see if it contains any useful data. If it does, then write it to a log, or send it over a comms channel.
How to reserve a section of RAM that is non-initialised is platform-dependent, and depends if you're running a full-blown OS (Linux) that manages RAM initialisation or not. If you're on a small system where RAM initialisation is done by the C start-up code, then your compiler probably has a way to put data (a file-scope variable) in a different section (besides the usual e.g. .bss) which is not initialised by the C start-up code.
If the data is not initialised, then it will probably contain random data at power-up. To determine whether it contains random data or valid data, use a hash, e.g. CRC-32, to determine its validity. If your processor has a way to tell you if you're in a reboot vs a power-up reset, then you should also use that to decide that the data is invalid after a power-up.
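A minimal sketch of that validity check in C follows. Everything named here (crash_record_t, crash_record_store, crash_record_take, the magic value) is made up for illustration. On a real target the crash_record variable would be placed in a section that the start-up code skips; with GCC that is often done with a section attribute plus linker script support, but the exact mechanism is toolchain-specific, so here it is an ordinary static to keep the example runnable on a PC.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC 0xC0DEDEADu   /* arbitrary marker chosen for this example */

typedef struct {
    uint32_t magic;
    char     message[56];
    uint32_t crc;                 /* CRC-32 over magic + message */
} crash_record_t;

/* Assumption: on target this lives in RAM the C start-up code does not clear. */
static crash_record_t crash_record;

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_compute(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Call just before rebooting: save a short message with its checksum. */
void crash_record_store(const char *msg)
{
    size_t n = strlen(msg);
    if (n > sizeof crash_record.message - 1) {
        n = sizeof crash_record.message - 1;
    }
    memset(crash_record.message, 0, sizeof crash_record.message);
    memcpy(crash_record.message, msg, n);
    crash_record.magic = CRASH_MAGIC;
    crash_record.crc = crc32_compute(&crash_record, offsetof(crash_record_t, crc));
}

/* Call early after start-up: returns true and copies the message out if a
 * valid record survived the reset. The record is invalidated either way. */
bool crash_record_take(char *out, size_t out_len)
{
    bool valid = crash_record.magic == CRASH_MAGIC &&
                 crash_record.crc ==
                     crc32_compute(&crash_record, offsetof(crash_record_t, crc));
    if (valid && out_len > 0) {
        strncpy(out, crash_record.message, out_len - 1);
        out[out_len - 1] = '\0';
    }
    crash_record.magic = 0;       /* do not report the same crash twice */
    return valid;
}
```

Random power-up contents will almost never carry both the marker and a matching CRC, so crash_record_take returns false on a cold boot. If the processor reports the reset cause, that check can be added in front of this one as described above.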
There is no single answer to this. I would start with a Watchdog timer. This reboots the system if things go terribly awry.
Something else to consider - what is not in a log file is also important. If you have routine updates from various tasks/actions logged then you can learn from what is missing.
Finally, in the case that things go bad and you are still running: enter a critical section, turn off as much of the OS as possible, shut down peripherals, log as much state info as possible, then reboot!
The one thing you want to make sure you do is to not corrupt data that might legitimately be in flash, so if you try to write information in a crash situation you need to do so carefully and with the knowledge that the system might be in a very bad state, so anything you do needs to be done in a way that doesn't make things worse.
Generally, when I detect a crash state I try to spit information out a serial port. A UART driver that's accessible from a crashed state is usually pretty simple - it just needs to be a simple polling driver that writes characters to the transmit data register when the busy bit is clear - a crash handler generally doesn't need to play nice with multitasking, so polling is fine. And it generally doesn't need to worry about incoming data; or at least not needing to worry about incoming data in a fashion that can't be handled by polling. In fact, a crash handler generally cannot expect that multitasking and interrupt handling will be working since the system is screwed up.
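A polling transmit routine of that kind can be as small as the sketch below. The register layout (uart_regs_t, a busy flag in bit 0 of a status register) is hypothetical; the real offsets and bit meanings have to come from your part's datasheet, and on target the pointer would be the peripheral's base address.

```c
#include <stdint.h>

/* Hypothetical register block, for illustration only. */
typedef struct {
    volatile uint32_t status;   /* bit 0 is set while the transmitter is busy */
    volatile uint32_t tx_data;  /* writing a byte here starts transmission    */
} uart_regs_t;

#define UART_STATUS_TX_BUSY (1u << 0)

/* Busy-wait until the transmitter is free, then send one character.
 * No interrupts, no buffers, no OS calls: usable from a crash handler. */
void crash_putc(uart_regs_t *uart, char c)
{
    while (uart->status & UART_STATUS_TX_BUSY) {
        /* spin */
    }
    uart->tx_data = (uint8_t)c;
}

/* Send a NUL-terminated string. */
void crash_puts(uart_regs_t *uart, const char *s)
{
    while (*s) {
        crash_putc(uart, *s++);
    }
}

/* Send a 32-bit value as 8 hex digits, e.g. for register dumps. */
void crash_put_hex32(uart_regs_t *uart, uint32_t value)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int shift = 28; shift >= 0; shift -= 4) {
        crash_putc(uart, digits[(value >> shift) & 0xFu]);
    }
}
```

None of these functions allocate memory or touch shared state, so they keep working even when the heap or the scheduler is what broke.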
I try to have it write the register file, a portion of the stack and any important OS data structures (the current task control block or something) that might be available and interesting. A watchdog timer usually is responsible for resetting the system in this state, so the crash handler might not have the opportunity to write everything, so dump the most important stuff first (do not have the crash handler kick the watchdog - you don't want to have some bug mistakenly prevent the watchdog from resetting the system).
Of course this is most useful in a development setup, since when the device is released it might not have anything attached to the serial port. If you want to be able to capture these kinds of crash dumps after release, then they need to get written somewhere appropriate (like maybe a reserved section of flash - just make sure it's not part of the normal data/file system area unless you're sure it can't corrupt that data). Of course you'd need to have something examine that area at boot so it can be detected and sent somewhere useful or there's no point, unless you might get units back post-mortem and can hook them up to a debugging setup that can look at the data.
I think the most well-known example of proper exception handling is a missile self-destruction. The exception was caused by arithmetic overflow in software. There obviously was a lot of tracing/recording media involved, because the root cause is known. It was discovered and debugged.
So, every embedded design must include 2 features: recording media like your log file and graceful halt, like disabling all timers/interrupts, shutting all ports and sitting in infinite loop or in case of a missile - self-destruction.
Writing messages to flash before reboot in embedded systems is often a bad idea. As you point out, no one is going to read the message, and if the problem is not transient you wear out the flash.
When the system is in an inconsistent state, there is almost nothing you can do reliably and the best thing to do is to restart the system as quickly as possible so that you can recover from transient failures (timing, special external events, etc.). In some systems I have written a trap handler that uses some reserved memory so that it can set up the serial port and then emit a stack dump and register contents without requiring extra stack space or clobbering registers.
A simple restart with a dump like that is reasonable because if the problem is transient the restart will resolve the problem and you want to keep it simple and let the device continue. If the problem is not transient you are not going to make forward progress anyway and someone can come along and connect a diagnostic device.
Very interesting paper on failures and recovery: WHY DO COMPUTERS STOP AND WHAT CAN BE DONE ABOUT IT?
For a very simple system, do you have a pin you can wiggle? For example, when you start up configure it to have high output, if things go way south (i.e. watchdog reset pending) then set it to low.
Have you ever considered using a garbage collector ?
And I'm not joking.
If you do dynamic allocation at runtime in embedded systems,
why not reserve a mark buffer and mark and sweep when the excrement hits the rotating air blower.
You've probably got the malloc (or whatever) implementation's source, right ?
If you don't have library sources for your embedded system forget I ever suggested it, but tell the rest of us what equipment it is in so we can avoid ever using it. Yikes (how do you debug without library sources?).
If your system is already dead.... who cares how long it takes. It obviously isn't critical that it be running this instant;
if it was you couldn't risk "dying" like this anyway?