Get info about what resource is blocking task in FreeRTOS - embedded

I have a FreeRTOS (v9.0 in case it matters) based embedded product. This product has multiple tasks, which interact with each other using multiple semaphores, mutexes, queues, and other task-blocking resources. Unfortunately, I have a seldom-occurring bug which causes one task to permanently block on some resource (perhaps a deadly embrace?).
My efforts to trap it so far have been fruitless. However, I can attach a debugger to the running target after the problem happens, and pause the processor. Since I have each task's handle as a global variable, I was hoping to extract some useful information about which resource the task is blocked by. However, a handle is nothing but a pointer, and I can't figure out how to get useful information from that.
Does anyone have any ideas on how I can find out which task-blocking resource is holding off the task?
UPDATE: It seems to me that, since I know which task is getting stuck, I should be able to look at its stack and extract some useful information from that. Unfortunately, I'm not sure how to get access to its current stack pointer, nor how deep into the stack I'd have to go to make sense of what's there.
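For reference, here is a minimal sketch of what the official API can report from just a handle, assuming configUSE_TRACE_FACILITY and INCLUDE_eTaskGetState are set to 1 in FreeRTOSConfig.h; it shows the task's state and stack base, though not the specific object it is blocked on:

#include <stdio.h>
#include "FreeRTOS.h"
#include "task.h"

/* Print what the kernel knows about a task, given only its handle.
   Needs configUSE_TRACE_FACILITY == 1 (for vTaskGetInfo) and
   INCLUDE_eTaskGetState == 1 in FreeRTOSConfig.h. */
void vDumpTaskInfo(TaskHandle_t xTask)
{
    TaskStatus_t xStatus;

    /* pdTRUE also computes the stack high-water mark (slower). */
    vTaskGetInfo(xTask, &xStatus, pdTRUE, eInvalid);

    printf("%s: state=%d prio=%u stack base=%p high water=%u words\r\n",
           xStatus.pcTaskName,
           (int)xStatus.eCurrentState,      /* eBlocked, eSuspended, ... */
           (unsigned)xStatus.uxCurrentPriority,
           (void *)xStatus.pxStackBase,
           (unsigned)xStatus.usStackHighWaterMark);
}

Beyond that, a TaskHandle_t is really a pointer to the kernel's private TCB_t defined inside tasks.c, so a debugger that has symbols for tasks.c can usually display pxTopOfStack and the rest of the TCB just by casting the handle in a watch window, which at least gives a stack pointer to start from.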

Related

Can I run a DLL in a separate thread?

I have a program I'm writing in vb.net that has ballooned into the most complicated thing I've ever written. Because of some complex math and image rendering that's happening constantly I've been delving into multithreading for the first time to improve overall performance. Things have honestly been running really smoothly, but we've just added more functionality that's causing me some trouble.
The new functionality comes from a pair of DLLs that are each processing a video stream from a USB camera and looking for moving objects. When I start my program I initiate the DLLs and they start viewing the cameras and processing the videos. I then periodically ping them to see if they have detected anything. This is how I start and stop them:
Declare Function StartLeftCameraDetection Lib "DetectorLibLeft.dll" Alias "StartCameraDetection" () As Integer
Declare Function StopLeftCameraDetection Lib "DetectorLibLeft.dll" Alias "StopCameraDetection" () As Integer
When I need to check if they've found any objects I use several functions like this:
Declare Function LeftDetectedObjectLeft Lib "DetectorLibLeft.dll" Alias "DetectedObjectLeft" () As Integer
All of that works really well. The problem is, I've started to notice some significant lag in my UI and I'm thinking it may be coming from the DLLs. Forgive my ignorance on this, but as I said I'm new to using multiple threads (and incorporating DLLs too, if I'm honest). It seems to me that when I start a DLL it runs its background tasks on my main thread and just waits for me to ping it for information. Is that the case? If so, is it possible to have the DLL running on a separate thread so it doesn't affect my UI?
I've tried a few different things but I can't seem to address the lag. I moved the code that pings the DLL and processes whatever information it gets into a separate thread, but that hasn't made any difference. I also tried calling StartLeftCameraDetection from a separate thread, but that didn't seem to help either. Again, I'm guessing that's because the real culprit is the DLL itself running these constant background tasks on my main thread no matter what thread I actually call its functions from.
Thanks in advance for any help you might be able to offer!
There's a lot to grok when it comes to threading, but I'll try to write a concise summary that hits the high points with enough details to cover what you need to know.
Multi-threaded synchronization is hard, so you should try to avoid it as much as possible. That doesn't mean avoiding multi-threading at all, it just means avoiding doing much more than sending a self-contained task off to a thread to run to completion and getting the results back when it's done.
Recognizing that multi-threaded synchronization is hard, it's even worse when it involves UI elements. So in .NET, the design is that any access to UI elements will only occur through one thread, typically referred to as the UI thread. If you are not explicitly writing multi-threaded code, then all of your code runs on the UI thread. And, while your code is running, the UI is blocked.
This also extends to external routines that you run through Declare Function. It's not really accurate to say that they are doing anything with "background tasks on the main thread"; if they are doing anything with "background tasks", they are almost certainly implementing their own threading. More likely, they aren't doing any task breakdown at all, and all of their work is being done on whichever thread you use to call them: the UI thread if you're not doing anything else.
If the work being done in these routines is CPU-bound, then it would definitely make sense to push it off onto a worker thread. Based on your comments on what you already tried:
I moved the code that pings the DLL and processes whatever information it gets into a separate thread, but that hasn't made any difference. I also tried calling StartLeftCameraDetection from a separate thread but that didn't seem to help either.
I think the most likely problem is that you're blocking in the UI thread waiting for a result from the background thread.
The best way to avoid this depends on exactly what the routines are doing and how they produce results. If they do some sort of extended process and return everything in function results, then I would suggest that using Await would work well. This will basically return control to the UI until the operation finishes, then resume whatever the rest of the calling routine was going to do.
Note that if you do this, the user will have full interaction with the UI, and you should react accordingly. You might need to disable some (or all) operations until it's done.
There are a lot of resources on Async and Await. I'd particularly recommend reading Stephen Cleary's blog articles to get a better understanding of how they work and potential pitfalls that you might encounter.

Snakemake: Job preemption can interrupt running jobs on clusters; how can I make sure the job is not considered failed?

I'm using Snakemake on a cluster, and I don't know how best to handle the fact that some jobs can be preempted.
On the cluster I use, it is possible to get more power by accessing the resources of other teams, but with the risk of being preempted, which means that the running job is stopped and rescheduled. It will be launched again as soon as a resource is available. This is especially advantageous when you have a lot of quick jobs to run. Unfortunately, I get the impression that Snakemake does not support this properly.
In the example given in the help on the cluster-status feature for Slurm, there is no PREEMPTED in the running_status list (running_status=["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]), which may lead Snakemake to consider a preempted job as failed. Not a big deal, I've added PREEMPTED to this list, but I am led to believe that Snakemake did not consider this scenario.
More annoyingly, even when running Snakemake with the --rerun-incomplete option, when the job is interrupted by the preemption, then restarted, I get the following error:
IncompleteFilesException:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with
snakemake --cleanup-metadata <filenames>
To re-generate the files rerun your command with the --rerun-incomplete flag.
I would expect the interrupted job to restart from scratch.
For now, the only solution I have found is to stop using other teams' resources to avoid having my jobs preempted, but I am losing computing power.
How do you use Snakemake in a context where your jobs can be preempted? Anyone see a solution so I don't get the IncompleteFilesException anymore?
Thanks in advance
Snakemake has a restart feature, which can be used to let jobs be resubmitted automatically. However, there is indeed no special handling for preemption currently. You are also right: I was not even aware that something like that exists on slurm. A PR in that direction would be welcome, of course. Basically, one would need to extend the status script handling to recognize this and, in that case, restart the job.
Thanks for reporting these, I see two separate issues here:
Handling of the PREEMPTED status returned by slurm.
The IncompleteFilesException suggesting you use --rerun-incomplete when that is exactly what you are doing.
1. PREEMPTED status handling
I have no experience in using slurm, so I cannot comment if the script example in the docs that you are linking to will work for slurm. Especially the expression in output = str(subprocess.check_output(expression)) might have to be adjusted to slurm in some way. Maybe there's someone around here who also uses slurm and has found a working solution in the past?
But otherwise, adding PREEMPTED to the running_status list should be exactly what you want to do (assuming that that is exactly the tag returned by expression).
If this has to be adapted to slurm and you manage to generate a working status.py script, it might be worth adding this to the docs via a pull request onto this file, so that other slurm users don't have to reinvent the solution.
2. IncompleteFilesException with --rerun-incomplete flag
From the general description, this sounds a bit like a bug, but without any details I cannot be sure. Maybe it's worth describing this in some more detail while filing an issue in the snakemake repo, either by simply providing more details, or even by providing a minimal example to reproduce this behavior.

Debugging an intermittently stuck NSOperationQueue

I have an iOS app with a really nasty bug: an operation in my NSOperationQueue will for some reason hang and not finish executing, so other additional operations are being queued up but still not executing. This in turn leads to the app not being able to perform critical functions. I have not yet been able to identify any pattern other than that it occurs on one of my co-workers' devices every week or so. Running the app from Xcode at that point does not help, as killing and relaunching the app resolves the issue for the time being. I've tried attaching the debugger to a running process and I seem to be able to see log data, but any breakpoints I add are not registering. I've added a breadcrumb trail of NSLogs to try to pinpoint where it's hanging, but this has not yet led to a resolution.
I originally described the bug in another question, which is yet to have a clear answer, I'm guessing because of the lack of info I'm able to provide around this issue.
A friend once told me that it's possible to save the entire memory stack of an app at a given moment in some form and reload that exact state of memory onto a process on a different device. Does anyone know how I can achieve that? If that's possible, the next time someone encounters that bug I can save that exact state of memory and replicate it to test all my theories of possible solutions. Or is there a different approach to tackling this? As an interim measure, do you think it would make sense to forcefully make the app crash when it enters this state, so actual users would be less confused? I have mixed feelings about this, but the user will have to kill the app from the multitask dock anyway in order to use the app again. I can check the operation queue count or create some kind of timeout code for this until I actually nail this bug.
This sounds like a deadlock on a very rare race condition. You also mentioned using a maxConcurrentOperationCount of 2. This means that either:
1. some operation is blocking the operation queue and waiting for main to release some lock, while main is waiting for the operation to finish, or
2. two operations are waiting on each other to release some lock.
Scenario 1 seems very unlikely, as the queue allows 2 concurrent operations, so both would have to be completely blocked, unless you are using some system functions that have concurrency issues and block your whole queue instead of just one thread.
In this case my first attempt to debug would be to connect the debugger and pause execution. After that you can look at the stack traces for all threads. You should be able to find the 2 threads that are made by your operation queue, after which I would review the responsible functions to find code that might possibly wait on some lock. Make sure to take system functions into consideration.
Well, it's quite hard to solve bugs that don't crash the App but just hang a thread. If you can't find the bug by looking at your code step by step, checking if there are any possible deadlocks or race conditions, I would suggest implementing some logging.
Write your log to disk every time you add a log entry. That's not the most efficient way, but if you give a build with logging enabled to your co-worker, you can pull the log from his iPhone when things go wrong, even while the App is still running.
Make sure you log every step you take including the values of important variables around the code that you suspect of breaking the App. This way you can see what the App is doing and what the state of the App is.
Hope this helps a bit. I don't know about restoring the state of memory of an App, so I can't help you with that.
Note: if the App were crashing on the bug you could use some other tools, but if I understand correctly that's not the case here, is it?
I read the question describing the bug and I would try to log to disk what the currently running operations are doing. It seems the operations will hang once in a while and there is a bug in there. If you can log what methods are called while running the operation, this will show you which function call hangs the App and you can start looking in there.
You didn't say this but I presume the bug occurs while a human operator is working with the app? Maybe you should add an automated mode to this app, where the app simulates the same operations that users normally do, using randomized times for starting different actions. Then you can leave the app running unattended on all your devices and increase the chances of seeing the problem.
Also, since the problem appears related to the NSOperationQueue, maybe you should subclass it so that you can add logging to the more interesting methods. For example, each time an operation is added you should log the state of the queue, since you suspect that sometimes it is getting suspended.
Also, I suggested this on your other question as well, you may want to setup an observer to get notified if the queue ever goes into a suspended state.
Good luck.
Checking assumptions here, since that never hurts: do you actually have evidence that your background threads are hanging? From what you report, the observed behavior is that the tasks you're putting in your background thread are not achieving the outcome that you expected. That doesn't necessarily indicate that the thread has hung—it might just indicate that the particular conditions meant that the thread closed due to all tasks being completed, without the tasks achieving what you wanted them to.
Addition: Given your answer in the comments, it seems to me the next step then is to use logging when an item begins to be executed in the queue so that you can identify which items it is that lead to the queue becoming blocked. Best guess is that it is a certain class of items or certain characteristics of the items if they are all of a certain class. Log enough as the first step of executing each item that you'll have a reasonable characterization of the item, and then once you get a real device that has entered this state, check the logs and see just what conditions are leading to this problem. That should enable you to reliably reproduce the problem on a device during debugging or in the simulator, to then nail it.
In other words—I would focus your attention on identifying the problematic operations first, rather than trying to identify the particular line of code where things are stalling.
In my case, start instead of main had to be overridden.
When in doubt, consult https://developer.apple.com/documentation/foundation/nsoperation#1661262?language=objc for discrepancies with your implementation.

Testing fault tolerant code

I'm currently working on a server application where we have agreed to try and maintain a certain level of service. The level of service we want to guarantee is: if a request is accepted by the server and the server sends an acknowledgement to the client, we want to guarantee that the request will happen, even if the server crashes. As requests can be long running and the acknowledgement time needs to be short, we implement this by persisting the request, then sending an acknowledgement to the client, then carrying out the various actions to fulfill the request. As actions are carried out they too are persisted, so the server knows the state of a request on start-up, and there are also various reconciliation mechanisms with external systems to check the accuracy of our logs.
This all seems to work fairly well, but we have difficulty saying this with any conviction as we find it very difficult to test our fault-tolerant code. So far we've come up with two strategies, but neither is entirely satisfactory:
Have an external process watch the server code and then try and kill it off at what the external process thinks is an appropriate point in the test
Add code to the application that will cause it to crash at certain known critical points
My problem with the first strategy is that the external process cannot know the exact state of the application, so we cannot be sure we're hitting the most problematic points in the code. My problem with the second strategy, although it gives more control over where the fault takes place, is that I do not like having code to inject faults within my application, even with optional compilation etc. I fear it would be too easy to overlook a fault injection point and have it slip into a production environment.
I think there are three ways to deal with this. If available, I would suggest a comprehensive set of integration tests for these various pieces of code, using dependency injection or factory objects to produce broken actions during these integrations.
Secondly, running the application with random kill -9's and disabling network interfaces may be a good way to test these things.
I would also suggest testing file system failure. How you would do that depends on your OS, on Solaris or FreeBSD I would create a zfs file system in a file, and then rm the file while the application is running.
If you are using database code, then I would suggest testing failure of the database as well.
Another alternative to dependency injection, and probably the solution I would use, is interceptors: you can enable crash-test interceptors in your code, and these would know the state of the application and introduce the failures listed above at the correct time, or any others you may want to create. It would not require changes to your existing code, just some additional code to wrap it.
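If the server happens to be written in C or C++ and linked with GNU binutils, one purely illustrative way to build such interceptors without touching the application source is the linker's --wrap option; write() and the fail-every-Nth-call policy below are just examples of what could be wrapped and how:

/* fault_wrap.c - compiled into test builds only and linked with
 *   -Wl,--wrap=write
 * so every call to write() goes through __wrap_write(), which can either
 * forward to the real implementation via __real_write() or inject a
 * failure. Hypothetical example; wrap whichever calls matter to you. */
#include <errno.h>
#include <unistd.h>

ssize_t __real_write(int fd, const void *buf, size_t count);

static unsigned long call_count;
static const unsigned long fail_every = 1000;   /* tune per test run */

ssize_t __wrap_write(int fd, const void *buf, size_t count)
{
    if (++call_count % fail_every == 0) {
        errno = EIO;                /* pretend the disk misbehaved */
        return -1;
    }
    return __real_write(fd, buf, count);
}

Because the wrapping is decided at link time, production builds simply omit the extra object file and the flag, so none of the fault-injection code can reach deployment.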
A possible answer to the first point is to multiply experiments with your external process so that the probability of hitting problematic parts of the code is increased. Then you can analyze the core dump file to determine where the code actually crashed.
Another way is to increase observability and/or controllability by stubbing library or kernel calls, i.e., without modifying your application code.
You can find some resources on Fault Injection page of Wikipedia, in particular in Software Implemented Fault Injection section.
Your concern about fault injection is not a fundamental concern. You merely need a foolproof way to prevent such code ending up in deployment. One way to do so is to design your fault injector as a debugger, i.e. the faults are injected by a process external to your process. This already provides a level of isolation. Furthermore, most OSes provide some kind of access control which prevents debugging unless specifically enabled. In the most primitive form it's limited to root; on other operating systems it requires a specific "debug privilege". Naturally, on production nobody will have that, and thus your fault injector cannot even run on production.
Practically, the fault injector can set breakpoints at specific addresses, i.e. at a function or even a line of code. You can then react to that, e.g. by terminating the process after a certain breakpoint has been hit three times.
I was just about to write the same as Justin :)
The component I would suggest replacing during testing could be the logging component (if you have one; if not, I'd strongly suggest implementing one...). It's relatively easy to replace it with code that generates errors, and the logger usually gets enough information to know the current application state.
It also seems feasible to make sure that the testing code doesn't go into production. I would discourage conditional compilation, though, and rather go with some configuration file to select the logging component.
Using "random" kills might help to detect errors but is not well suited for systematic testing because of its non-determinism. Therefore I wouldn't use it for automatic tests.

Embedded systems : last gasp before reboot

When things go badly awry in embedded systems I tend to write an error to a special log file in flash and then reboot (there's not much option if, say, you run out of memory).
I realize even that can go wrong, so I try to minimize it (by not allocating any memory during the final write, and boosting the write process's priority).
But that relies on someone retrieving the log file. Now I was considering sending a message over the intertubes to report the error before rebooting.
On second thoughts, of course, it would be better to send that message after reboot, but it did get me to thinking...
What sort of things ought I be doing if I discover an irrecoverable error, and how can I do them as safely as possible in a system which is in an unstable state?
One strategy is to use a section of RAM that is not initialised during power-on/reboot. That can be used to store data that survives a reboot, and then when your app restarts, early on in the code it can check that memory and see if it contains any useful data. If it does, then write it to a log, or send it over a comms channel.
How to reserve a section of RAM that is non-initialised is platform-dependent, and depends on whether you're running a full-blown OS (Linux) that manages RAM initialisation or not. If you're on a small system where RAM initialisation is done by the C start-up code, then your compiler probably has a way to put data (a file-scope variable) in a different section (besides the usual ones, e.g. .bss) which is not initialised by the C start-up code.
If the data is not initialised, then it will probably contain random data at power-up. To determine whether it contains random data or valid data, use a hash, e.g. CRC-32, to determine its validity. If your processor has a way to tell you if you're in a reboot vs a power-up reset, then you should also use that to decide that the data is invalid after a power-up.
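A minimal sketch of that idea for a bare-metal GCC target; the .noinit section name and the attribute syntax are assumptions that depend on your linker script and toolchain, and the CRC-32 is the plain bitwise form:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t magic;            /* quick "was this ever written?" check */
    char     message[60];      /* last-gasp text from the previous run */
    uint32_t crc;              /* CRC-32 over everything above it */
} crash_record_t;

/* Placed in a section the C start-up code does not zero (assumption:
   the linker script provides .noinit). */
static crash_record_t crash_record __attribute__((section(".noinit")));

static uint32_t crc32(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    while (len--) {
        crc ^= *p++;
        for (int i = 0; i < 8; i++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return ~crc;
}

/* Call from the fatal-error path, just before rebooting. */
void crash_record_write(const char *msg)
{
    memset(&crash_record, 0, sizeof crash_record);
    crash_record.magic = 0xDEADC0DEu;
    strncpy(crash_record.message, msg, sizeof crash_record.message - 1);
    crash_record.crc = crc32(&crash_record, offsetof(crash_record_t, crc));
}

/* Call early after boot; returns the saved message, or NULL if the RAM
   contents are junk (first power-up) or already consumed. */
const char *crash_record_read(void)
{
    if (crash_record.magic != 0xDEADC0DEu ||
        crash_record.crc != crc32(&crash_record, offsetof(crash_record_t, crc))) {
        return NULL;
    }
    crash_record.magic = 0u;   /* consume it so it is reported only once */
    return crash_record.message;
}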
There is no single answer to this. I would start with a Watchdog timer. This reboots the system if things go terribly awry.
Something else to consider - what is not in a log file is also important. If you have routine updates from various tasks/actions logged then you can learn from what is missing.
Finally, in the case that things go bad and you are still running: enter a critical section, turn off as much of the OS as possible, shut down peripherals, log as much state info as possible, then reboot!
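A rough sketch of that ordering, assuming a FreeRTOS plus CMSIS (e.g. STM32) environment; taskDISABLE_INTERRUPTS() and NVIC_SystemReset() are the standard FreeRTOS and CMSIS names, while the two helper functions are placeholders for board-specific code:

#include "FreeRTOS.h"
#include "task.h"
#include "stm32f4xx.h"   /* or your part's CMSIS device header: provides NVIC_SystemReset() */

void shut_down_peripherals(void);             /* placeholder: board specific */
void log_state_to_flash(const char *reason);  /* placeholder: your logger    */

void fatal_error(const char *reason)
{
    /* 1. Critical section: nothing else runs from here on. */
    taskDISABLE_INTERRUPTS();

    /* 2. Quiesce the hardware so a half-finished DMA or IRQ cannot
          corrupt anything while the log is written. */
    shut_down_peripherals();

    /* 3. Record as much state as can be done safely. */
    log_state_to_flash(reason);

    /* 4. Reboot. */
    NVIC_SystemReset();
}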
The one thing you want to make sure of is that you do not corrupt data that might legitimately be in flash, so if you try to write information in a crash situation you need to do so carefully and with the knowledge that the system might be in a very bad state; anything you do needs to be done in a way that doesn't make things worse.
Generally, when I detect a crash state I try to spit information out a serial port. A UART driver that's accessible from a crashed state is usually pretty simple - it just needs to be a simple polling driver that writes characters to the transmit data register when the busy bit is clear - a crash handler generally doesn't need to play nice with multitasking, so polling is fine. And it generally doesn't need to worry about incoming data, or at least not in a fashion that can't be handled by polling. In fact, a crash handler generally cannot expect that multitasking and interrupt handling will be working, since the system is screwed up.
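To make that "simple polling driver" concrete, here is a sketch in C; the register addresses and the TX-empty bit are placeholders for whatever your part's reference manual actually defines:

#include <stdint.h>

/* Placeholder register definitions - substitute your MCU's actual
   UART base address and bit layout. */
#define UART_SR   (*(volatile uint32_t *)0x40011000u)  /* status register */
#define UART_DR   (*(volatile uint32_t *)0x40011004u)  /* data register   */
#define UART_TXE  (1u << 7)                            /* TX empty bit    */

/* Polled character output: safe to call with interrupts disabled and
   the scheduler dead, because it never blocks on the OS. */
void crash_putc(char c)
{
    while ((UART_SR & UART_TXE) == 0u) {
        /* spin until the transmit data register is empty */
    }
    UART_DR = (uint32_t)c;
}

void crash_puts(const char *s)
{
    while (*s != '\0') {
        if (*s == '\n') {
            crash_putc('\r');
        }
        crash_putc(*s++);
    }
}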
I try to have it write the register file, a portion of the stack and any important OS data structures (the current task control block or something) that might be available and interesting. A watchdog timer usually is responsible for resetting the system in this state, so the crash handler might not have the opportunity to write everything, so dump the most important stuff first (do not have the crash handler kick the watchdog - you don't want to have some bug mistakenly prevent the watchdog from resetting the system).
Of course this is most useful in a development setup, since when the device is released it might not have anything attached to the serial port. If you want to be able to capture these kinds of crash dumps after release, then they need to get written somewhere appropriate (like maybe a reserved section of flash - just make sure it's not part of the normal data/file system area unless you're sure it can't corrupt that data). Of course you'd need to have something examine that area at boot so it can be detected and sent somewhere useful or there's no point, unless you might get units back post-mortem and can hook them up to a debugging setup that can look at the data.
I think the most well-known example of proper exception handling is a missile self-destruction. The exception was caused by an arithmetic overflow in software. There obviously was a lot of tracing/recording media involved, because the root cause is known: it was discovered and debugged.
So, every embedded design must include 2 features: recording media like your log file, and a graceful halt, like disabling all timers/interrupts, shutting all ports and sitting in an infinite loop, or, in the case of a missile, self-destruction.
Writing messages to flash before reboot in embedded systems is often a bad idea. As you point out, no one is going to read the message, and if the problem is not transient you wear out the flash.
When the system is in an inconsistent state, there is almost nothing you can do reliably, and the best thing to do is to restart the system as quickly as possible so that you can recover from transient failures (timing, special external events, etc.). In some systems I have written a trap handler that uses some reserved memory so that it can set up the serial port and then emit a stack dump and register contents without requiring extra stack space or clobbering registers.
A simple restart with a dump like that is reasonable because if the problem is transient the restart will resolve the problem and you want to keep it simple and let the device continue. If the problem is not transient you are not going to make forward progress anyway and someone can come along and connect a diagnostic device.
Very interesting paper on failures and recovery: WHY DO COMPUTERS STOP AND WHAT CAN BE DONE ABOUT IT?
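On a Cortex-M target, a trap handler along those lines often follows the pattern sketched below; the stacked frame layout is the standard Cortex-M exception frame, HardFault_Handler is the usual CMSIS vector name, crash_putc()/crash_puts() stand in for polled serial output like the sketch a couple of answers up, and the GCC naked-function/inline-assembly syntax is an assumption:

#include <stdint.h>

void crash_putc(char c);        /* polled output, never blocks on the OS */
void crash_puts(const char *s);

static void crash_put_hex(uint32_t value)
{
    static const char hex[] = "0123456789ABCDEF";
    for (int shift = 28; shift >= 0; shift -= 4) {
        crash_putc(hex[(value >> shift) & 0xFu]);
    }
}

/* C-level handler: pxFrame points at the registers the core pushed
   automatically on exception entry (r0-r3, r12, lr, pc, xpsr). */
void hard_fault_c_handler(const uint32_t *pxFrame)
{
    crash_puts("\r\nHARD FAULT\r\n pc  = "); crash_put_hex(pxFrame[6]);
    crash_puts("\r\n lr  = ");               crash_put_hex(pxFrame[5]);
    crash_puts("\r\n psr = ");               crash_put_hex(pxFrame[7]);
    crash_puts("\r\n r0  = ");               crash_put_hex(pxFrame[0]);
    crash_puts("\r\n");

    for (;;) {
        /* do not kick the watchdog; let it reset the system */
    }
}

/* Naked wrapper: work out which stack was active without touching it,
   then hand its address to the C handler (GCC syntax assumed). */
__attribute__((naked)) void HardFault_Handler(void)
{
    __asm volatile(
        "tst lr, #4             \n"   /* MSP or PSP in use?          */
        "ite eq                 \n"
        "mrseq r0, msp          \n"
        "mrsne r0, psp          \n"
        "b hard_fault_c_handler \n"
    );
}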
For a very simple system, do you have a pin you can wiggle? For example, when you start up, configure it to output high; if things go way south (i.e. watchdog reset pending), then set it low.
Have you ever considered using a garbage collector?
And I'm not joking.
If you do dynamic allocation at runtime in embedded systems, why not reserve a mark buffer and mark-and-sweep when the excrement hits the rotating air blower?
You've probably got the malloc (or whatever) implementation's source, right?
If you don't have library sources for your embedded system, forget I ever suggested it, but tell the rest of us what equipment it is in so we can avoid ever using it. Yikes (how do you debug without library sources?).
If your system is already dead... who cares how long it takes? It obviously isn't critical that it be running this instant; if it were, you couldn't risk "dying" like this anyway.