Debugging an intermittently stuck NSOperationQueue - objective-c

I have an iOS app with a really nasty bug: an operation in my NSOperationQueue will for some reason hang and not finish executing, so additional operations are queued up but never execute. This in turn leads to the app not being able to perform critical functions. I have not yet been able to identify any pattern other than that it occurs on one of my co-worker's devices every week or so. Running the app from Xcode at that point does not help, as killing and relaunching the app resolves the issue for the time being. I've tried attaching the debugger to a running process and I seem to be able to see log data, but any breakpoints I add are not registering. I've added a breadcrumb trail of NSLogs to try to pinpoint where it's hanging, but this has not yet led to a resolution.
I originally described the bug in another question, which has yet to receive a clear answer, I'm guessing because of the lack of info I'm able to provide around this issue.
A friend once told me that it's possible to save the entire memory stack of an app at a given moment in some form and reload that exact state of memory onto a process on a different device. Does anyone know how I can achieve that? If that's possible, the next time someone encounters that bug I can save that exact state of memory and replicate it to test all my theories of possible solutions. Or is there a different approach to tackling this? As an interim measure, do you think it would make sense to forcefully make the app crash when it enters this state, so actual users would be less confused? I have mixed feelings about this, but the user will have to kill the app from the multitask dock anyway in order to use the app again. I can check the operation queue count or create some kind of timeout code for this until I actually nail this bug.

This sounds like a deadlock caused by a very rare race condition. You also mentioned using a maxConcurrentOperationCount of 2. This means that either:
1. some operation is blocking the operation queue and waiting for main to release some lock, while main is waiting for the operation to finish
2. two operations are waiting on each other to release some lock
Option 1 seems very unlikely, since it takes two blocked operations to stall the queue completely, unless you are using some system functions that have concurrency issues and block your whole queue instead of just one thread.
In this case my first attempt to debug would be to connect the debugger and pause execution. After that you can look at the stack traces for all threads. You should be able to find the 2 threads that are created by your operation queue, after which I would review the responsible functions to find code that might possibly wait on some lock. Make sure to take system functions into consideration.

Well, it's quite hard to solve bugs that don't crash the App but just hang a thread. If you can't find the bug by going through your code step by step and checking for possible deadlocks or race conditions, I would suggest implementing some logging.
Write your log to disk every time you add a log entry. That's not the most memory-efficient way, but if you give a build with logging enabled to your co-worker you can pull the log from his iPhone when things go wrong, even while the App is still running.
Make sure you log every step you take including the values of important variables around the code that you suspect of breaking the App. This way you can see what the App is doing and what the state of the App is.
Hope this helps a bit. I don't know about restoring the state of memory of an App, so I can't help you with that.
Note: if the App were crashing on the bug you could use some other tools, but if I understand correctly that's not the case here, is it?
I read the question describing the bug and I would try to log to disk what the currently running operations are doing. It seems the operations will hang once in a while and there is a bug in there. If you can log what methods are called while running the operation this will show you what function call will hang the App and you can start looking in there.

You didn't say this but I presume the bug occurs while a human operator is working with the app? Maybe you should add an automated mode to this app, where the app simulates the same operations that users normally do, using randomized times for starting different actions. Then you can leave the app running unattended on all your devices and increase the chances of seeing the problem.
Also, since the problem appears related to the NSOperationQueue, maybe you should subclass it so that you can add logging to the more interesting methods. For example, each time an operation is added you should log the state of the queue, since you suspect that sometimes it is getting suspended.
Also, as I suggested on your other question, you may want to set up an observer to get notified if the queue ever goes into a suspended state.
Good luck.

Checking assumptions here, since that never hurts: do you actually have evidence that your background threads are hanging? From what you report, the observed behavior is that the tasks you're putting in your background thread are not achieving the outcome that you expected. That doesn't necessarily indicate that the thread has hung—it might just indicate that the particular conditions meant that the thread closed due to all tasks being completed, without the tasks achieving what you wanted them to.
Addition: Given your answer in the comments, it seems to me the next step then is to use logging when an item begins to be executed in the queue so that you can identify which items it is that lead to the queue becoming blocked. Best guess is that it is a certain class of items or certain characteristics of the items if they are all of a certain class. Log enough as the first step of executing each item that you'll have a reasonable characterization of the item, and then once you get a real device that has entered this state, check the logs and see just what conditions are leading to this problem. That should enable you to reliably reproduce the problem on a device during debugging or in the simulator, to then nail it.
In other words—I would focus your attention on identifying the problematic operations first, rather than trying to identify the particular line of code where things are stalling.

In my case, start had to be overridden instead of main.
When in doubt consult https://developer.apple.com/documentation/foundation/nsoperation#1661262?language=objc for discrepancies with your implementation

Related

Can I run a DLL in a separate thread?

I have a program I'm writing in vb.net that has ballooned into the most complicated thing I've ever written. Because of some complex math and image rendering that's happening constantly I've been delving into multithreading for the first time to improve overall performance. Things have honestly been running really smoothly, but we've just added more functionality that's causing me some trouble.
The new functionality comes from a pair of DLLs that are each processing a video stream from a USB camera and looking for moving objects. When I start my program I initiate the DLLs and they start viewing the cameras and processing the videos. I then periodically ping them to see if they have detected anything. This is how I start and stop them:
Declare Function StartLeftCameraDetection Lib "DetectorLibLeft.dll" Alias "StartCameraDetection" () As Integer
Declare Function StopLeftCameraDetection Lib "DetectorLibLeft.dll" Alias "StopCameraDetection" () As Integer
When I need to check if they've found any objects I use several functions like this:
Declare Function LeftDetectedObjectLeft Lib "DetectorLibLeft.dll" Alias "DetectedObjectLeft" () As Integer
All of that works really well. The problem is, I've started to notice some significant lag in my UI and I'm thinking it may be coming from the DLLs. Forgive my ignorance on this, but as I said I'm new to using multiple threads (and incorporating DLLs too, if I'm honest). It seems to me that when I start a DLL it is running its background tasks on my main thread and just waiting for me to ping it for information. Is that the case? If so, is it possible to have the DLL running on a separate thread so it doesn't affect my UI?
I've tried a few different things but I can't seem to address the lag. I moved the code that pings the DLL and processes whatever information it gets into a separate thread, but that hasn't made any difference. I also tried calling StartLeftCameraDetection from a separate thread but that didn't seem to help either. Again, I'm guessing that's because the real culprit is the DLL itself running these constant background tasks on my main thread no matter what thread I actually call its functions from.
Thanks in advance for any help you might be able to offer!
There's a lot to grok when it comes to threading, but I'll try to write a concise summary that hits the high points with enough details to cover what you need to know.
Multi-threaded synchronization is hard, so you should try to avoid it as much as possible. That doesn't mean avoiding multi-threading at all, it just means avoiding doing much more than sending a self-contained task off to a thread to run to completion and getting the results back when it's done.
Recognizing that multi-threaded synchronization is hard, it's even worse when it involves UI elements. So in .NET, the design is that any access to UI elements will only occur through one thread, typically referred to as the UI thread. If you are not explicitly writing multi-threaded code, then all of your code runs on the UI thread. And, while your code is running, the UI is blocked.
This also extends to external routines that you run through Declare Function. It's not really accurate to say that they are doing anything with "background tasks on the main thread"; if they are doing anything with "background tasks", they are almost certainly implementing their own threading. More likely, they aren't doing any task breakdown at all, and all of their work is being done on whichever thread you use to call them, which is the UI thread if you're not doing anything else.
If the work being done in these routines is CPU-bound, then it would definitely make sense to push it off onto a worker thread. Based on your comments on what you already tried:
"I moved the code that pings the DLL and processes whatever information it gets into a separate thread, but that hasn't made any difference. I also tried calling StartLeftCameraDetection from a separate thread but that didn't seem to help either."
I think the most likely problem is that you're blocking in the UI thread waiting for a result from the background thread.
The best way to avoid this depends on exactly what the routines are doing and how they produce results. If they do some sort of extended process and return everything in function results, then I would suggest that using Await would work well. This will basically return control to the UI until the operation finishes, then resume whatever the rest of the calling routine was going to do.
Note that if you do this, the user will have full interaction with the UI, and you should react accordingly. You might need to disable some (or all) operations until it's done.
There are a lot of resources on Async and Await. I'd particularly recommend reading Stephen Cleary's blog articles to get a better understanding of how they work and potential pitfalls that you might encounter.

Why does FreeRTOS need to stop once at a breakpoint to run well?

I have an application for Zynq MPSoC (Vitis 2020.2) written in C++ using FreeRTOS V10.3.0. This application runs very well if it stops at a breakpoint once. If I disable all breakpoints, the program runs buggy. What might be the problem?
How many ways could this happen?! It is a real-time operating system, so presumably it is also a real-time application. If you stop the processor you affect the timing. Without knowing the hardware, the software, where you are placing the breakpoint and the bugs that arise when free-running, it is not possible to answer your specific question. I.e. you need to debug it; there is no generic explanation as to why the intrusive action of stopping the processor "fixes" your system.
You clearly have buggy code that is affected by timing. Stopping the code does not necessarily stop peripherals and it certainly does not stop the outside world that your system interacts with. For example when you stop on a breakpoint, the world continues, interrupts become pending (possibly several), so that when you resume execution, the execution path and thread scheduling order is likely to differ considerably from that when it is free-run as all those pending interrupts are handled and in turn issue events that cause different tasks to become pending ready-to-run and then run in a different order to that which would otherwise occur.
Ultimately you are asking the wrong question; the breakpoint is not magically "fixing" your code, rather it is significantly changing the way that it runs such that some existing bug (or bugs) is hidden or avoided. The bug is still there, so the question would better focus on finding the bug than "magic thinking".
The bugs could be at any level, but most likely are design issues with inappropriate task partitioning, priority assignment, IPC, task synchronisation or resource protection. Generally probably rather too broad to deal with in a single SO question.

Get info about what resource is blocking task in FreeRTOS

I have a FreeRTOS (v9.0 in case it matters) based embedded product. This product has multiple tasks, which interact with each other using multiple semaphores, mutexes, queues, and other task blocking resources. Unfortunately, I have a seldom occurring bug which causes one task to permanently block on some resource (perhaps a deadly embrace?).
My efforts to trap it so far have been fruitless. However, I can attach a debugger to the running target after the problem happens, and pause the processor. Since I have each task's handle as a global variable, I was hoping to extract some useful information about which resource the task is blocked by. However, a handle is nothing but a pointer, and I can't figure out how to get useful information from that.
Does anyone have any ideas on how I can find out which task blocking resource is holding off the task?
UPDATE: It seems to me that, since I know the task which is getting stuck, I should be able to look at its stack and extract some useful information about that. Unfortunately, I'm not sure how to get access to the current stack pointer, nor how deep into the stack I'd have to go to make sense of what's on there.

Ensembles: when to use MagicalRecord's saveWithCompletion vs saveAndWait

I have an existing app (uses MagicalRecord) that I am trying to incorporate Ensembles. I have come across several places in my app where I save using MR_saveToPersistentStoreWithCompletion. I noticed in the Ensembles MagicalRecord example that it uses MR_saveToPersistentStoreAndWait.
I know what the difference is between the two; the question is: with Ensembles, should I always use MR_saveToPersistentStoreAndWait? If not, what are the circumstances that I should use MR_saveToPersistentStoreWithCompletion?
The main thing to be aware of is that using the completion block involves an asynchronous save in the background, and once that save completes, Ensembles has to capture the changes from the notification that is fired.
In general, this is not a problem, but when terminating or going to the background, it is important to give Ensembles a chance to finish saving what it observes in the notification. You should thus use the save-and-wait variation in that case, to make sure the store is fully saved BEFORE you use the processPendingChanges... method on your ensemble. If you instead use the non-blocking method, you can't be sure the save is finished when you ask Ensembles to process pending changes, so there is a risk that it will not complete before the app is terminated.
There is a more exotic complication with saving in the background that involves creating objects with the same global identifier on different devices, but it will only affect a small number of apps. You can read more about that case in the Ensembles book.

Embedded systems : last gasp before reboot

When things go badly awry in embedded systems I tend to write an error to a special log file in flash and then reboot (there's not much option if, say, you run out of memory).
I realize even that can go wrong, so I try to minimize it (by not allocating any memory during the final write, and boosting the write process's priority).
But that relies on someone retrieving the log file. Now I was considering sending a message over the intertubes to report the error before rebooting.
On second thoughts, of course, it would be better to send that message after reboot, but it did get me to thinking...
What sort of things ought I be doing if I discover an irrecoverable error, and how can I do them as safely as possible in a system which is in an unstable state?
One strategy is to use a section of RAM that is not initialised during power-on/reboot. That can be used to store data that survives a reboot, and then when your app restarts, early on in the code it can check that memory and see if it contains any useful data. If it does, then write it to a log, or send it over a comms channel.
How to reserve a section of RAM that is non-initialised is platform-dependent, and depends if you're running a full-blown OS (Linux) that manages RAM initialisation or not. If you're on a small system where RAM initialisation is done by the C start-up code, then your compiler probably has a way to put data (a file-scope variable) in a different section (besides the usual e.g. .bss) which is not initialised by the C start-up code.
If the data is not initialised, then it will probably contain random data at power-up. To determine whether it contains random data or valid data, use a hash, e.g. CRC-32, to determine its validity. If your processor has a way to tell you if you're in a reboot vs a power-up reset, then you should also use that to decide that the data is invalid after a power-up.
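As a rough illustration of the validity check described above, here is a minimal sketch in C. The record layout, the `CRASH_MAGIC` value and all function names are hypothetical, and the variable is declared as an ordinary global so the snippet builds on a host; on a real target it would need whatever compiler-specific section attribute and linker-script entry keep it out of the start-up initialisation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CRASH_MAGIC       0xC0FFEE01u  /* hypothetical marker value */
#define CRASH_PAYLOAD_MAX 64u          /* hypothetical payload size */

struct crash_record {
    uint32_t magic;                      /* quick sanity marker */
    uint32_t length;                     /* number of valid payload bytes */
    uint8_t  payload[CRASH_PAYLOAD_MAX]; /* error text or state snapshot */
    uint32_t crc;                        /* CRC-32 of everything above */
};

/* The CRC covers the raw bytes before `crc`, so there must be no padding. */
static_assert(offsetof(struct crash_record, crc) == 8u + CRASH_PAYLOAD_MAX,
              "unexpected padding inside crash_record");

/* Assumption: on a real target this object lives in a non-initialised section. */
struct crash_record g_crash_record;

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320); slow but table-free. */
uint32_t crc32_calc(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

/* Called just before the reboot: fill the record and seal it with a CRC. */
void crash_record_store(struct crash_record *rec, const void *data, size_t len)
{
    if (len > CRASH_PAYLOAD_MAX) {
        len = CRASH_PAYLOAD_MAX;         /* truncate rather than overflow */
    }
    rec->magic  = CRASH_MAGIC;
    rec->length = (uint32_t)len;
    memset(rec->payload, 0, sizeof rec->payload);
    memcpy(rec->payload, data, len);
    rec->crc = crc32_calc((const uint8_t *)rec,
                          offsetof(struct crash_record, crc));
}

/* Called early after start-up. `power_on_reset` is whatever the reset-cause
 * register reports, if the processor has one; pass 0 otherwise. */
int crash_record_is_valid(const struct crash_record *rec, int power_on_reset)
{
    if (power_on_reset) {
        return 0;                        /* RAM content is random after power-up */
    }
    if (rec->magic != CRASH_MAGIC || rec->length > CRASH_PAYLOAD_MAX) {
        return 0;
    }
    return rec->crc == crc32_calc((const uint8_t *)rec,
                                  offsetof(struct crash_record, crc));
}

/* Called once the record has been logged or sent, so it is not reported twice. */
void crash_record_clear(struct crash_record *rec)
{
    memset(rec, 0, sizeof *rec);
}
```

At start-up the application would call `crash_record_is_valid(&g_crash_record, ...)`, forward the payload to the log or comms channel if it returns non-zero, and then call `crash_record_clear`.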
There is no single answer to this. I would start with a Watchdog timer. This reboots the system if things go terribly awry.
Something else to consider - what is not in a log file is also important. If you have routine updates from various tasks/actions logged then you can learn from what is missing.
Finally, in the case that things go bad and you are still running: enter a critical section, turn off as much of the OS as possible, shut down peripherals, log as much state info as possible, then reboot!
The one thing you want to make sure you do is to not corrupt data that might legitimately be in flash. If you try to write information in a crash situation, you need to do so carefully and with the knowledge that the system might be in a very bad state, so anything you do needs to be done in a way that doesn't make things worse.
Generally, when I detect a crash state I try to spit information out a serial port. A UART driver that's accessible from a crashed state is usually pretty simple - it just needs to be a simple polling driver that writes characters to the transmit data register when the busy bit is clear - a crash handler generally doesn't need to play nice with multitasking, so polling is fine. And it generally doesn't need to worry about incoming data; or at least not needing to worry about incoming data in a fashion that can't be handled by polling. In fact, a crash handler generally cannot expect that multitasking and interrupt handling will be working since the system is screwed up.
I try to have it write the register file, a portion of the stack and any important OS data structures (the current task control block or something) that might be available and interesting. A watchdog timer usually is responsible for resetting the system in this state, so the crash handler might not have the opportunity to write everything, so dump the most important stuff first (do not have the crash handler kick the watchdog - you don't want to have some bug mistakenly prevent the watchdog from resetting the system).
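The polling driver described above can be very small. The following is only a sketch: the register layout, the busy-bit position and the function names are hypothetical and would have to be replaced with the real UART's definitions from the device's reference manual.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical memory-mapped UART: one status and one transmit register. */
typedef struct {
    volatile uint32_t status;   /* bit 0 set while the transmitter is busy */
    volatile uint32_t tx_data;  /* writing a byte here starts transmission */
} uart_regs_t;

static_assert(sizeof(uart_regs_t) == 8, "layout must match the assumed register map");

#define UART_STATUS_TX_BUSY 0x1u

/* Send one character by polling; no interrupts, no buffers, no OS calls. */
void crash_putc(uart_regs_t *uart, char c)
{
    /* Deliberately no timeout and no watchdog kick: if the UART is stuck,
     * the watchdog is expected to reset the system. */
    while (uart->status & UART_STATUS_TX_BUSY) {
        /* spin until the transmitter is free */
    }
    uart->tx_data = (uint8_t)c;
}

/* Send a NUL-terminated string. */
void crash_puts(uart_regs_t *uart, const char *s)
{
    while (*s != '\0') {
        crash_putc(uart, *s++);
    }
}

/* Send a 32-bit value as eight hex digits, e.g. for register dumps. */
void crash_put_hex32(uart_regs_t *uart, uint32_t value)
{
    static const char digits[] = "0123456789ABCDEF";
    for (int shift = 28; shift >= 0; shift -= 4) {
        crash_putc(uart, digits[(value >> shift) & 0xFu]);
    }
}
```

A crash handler would call these with a pointer to the UART's base address, writing the most important values first in case the watchdog fires before the dump is complete.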
Of course this is most useful in a development setup, since when the device is released it might not have anything attached to the serial port. If you want to be able to capture these kinds of crash dumps after release, then they need to get written somewhere appropriate (like maybe a reserved section of flash - just make sure it's not part of the normal data/file system area unless you're sure it can't corrupt that data). Of course you'd need to have something examine that area at boot so it can be detected and sent somewhere useful or there's no point, unless you might get units back post-mortem and can hook them up to a debugging setup that can look at the data.
I think the most well-known example of proper exception handling is a missile self-destruction. The exception was caused by arithmetic overflow in software. There obviously was a lot of tracing/recording media involved because the root cause is known. It was discovered and debugged.
So, every embedded design must include 2 features: recording media like your log file and graceful halt, like disabling all timers/interrupts, shutting all ports and sitting in infinite loop or in case of a missile - self-destruction.
Writing messages to flash before reboot in embedded systems is often a bad idea. As you point out, no one is going to read the message, and if the problem is not transient you wear out the flash.
When the system is in an inconsistent state, there is almost nothing you can do reliably, and the best thing to do is to restart the system as quickly as possible so that you can recover from transient failures (timing, special external events, etc.). In some systems I have written a trap handler that uses some reserved memory so that it can set up the serial port and then emit a stack dump and register contents without requiring extra stack space or clobbering registers.
A simple restart with a dump like that is reasonable because if the problem is transient the restart will resolve the problem and you want to keep it simple and let the device continue. If the problem is not transient you are not going to make forward progress anyway and someone can come along and connect a diagnostic device.
Very interesting paper on failures and recovery: WHY DO COMPUTERS STOP AND WHAT CAN BE DONE ABOUT IT?
For a very simple system, do you have a pin you can wiggle? For example, when you start up configure it to have high output, if things go way south (i.e. watchdog reset pending) then set it to low.
Have you ever considered using a garbage collector ?
And I'm not joking.
If you do dynamic allocation at runtime in embedded systems,
why not reserve a mark buffer and mark and sweep when the excrement hits the rotating air blower.
You've probably got the malloc (or whatever) implementation's source, right ?
If you don't have library sources for your embedded system forget I ever suggested it, but tell the rest of us what equipment it is in so we can avoid ever using it. Yikes (how do you debug without library sources?).
If your system is already dead... who cares how long it takes? It obviously isn't critical that it be running this instant; if it was, you couldn't risk "dying" like this anyway.