Snakemake: job preemption can interrupt running jobs on clusters; how can I make sure that the task is not considered failed?

I'm using Snakemake on a cluster, and I don't know how best to handle the fact that some jobs can be preempted.
On the cluster I use, it is possible to get more computing power by accessing other teams' resources, but at the risk of being preempted: the running job is stopped and rescheduled, to be launched again as soon as a resource becomes available. This is especially advantageous when you have a lot of quick jobs to run. Unfortunately, I get the impression that Snakemake does not support this properly.
In the example given in the documentation for the cluster-status feature with Slurm, there is no PREEMPTED in the running_status list (running_status=["PENDING", "CONFIGURING", "COMPLETING", "RUNNING", "SUSPENDED"]), which can lead Snakemake to consider a preempted job as failed. Not a big deal: I added PREEMPTED to this list, but it leads me to believe that Snakemake did not anticipate this scenario.
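Concretely, this is the only change I made compared to the example in the docs:

    # PREEMPTED added to the states that Snakemake should treat as still running
    running_status = ["PENDING", "CONFIGURING", "COMPLETING", "RUNNING",
                      "SUSPENDED", "PREEMPTED"]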
More annoyingly, even when running Snakemake with the --rerun-incomplete option, if a job is interrupted by preemption and then restarted, I get the following error:
IncompleteFilesException:
The files below seem to be incomplete. If you are sure that certain files are not incomplete, mark them as complete with
snakemake --cleanup-metadata <filenames>
To re-generate the files rerun your command with the --rerun-incomplete flag.
I would expect the interrupted job to restart from scratch.
For now, the only solution I have found is to stop using other teams' resources to avoid having my jobs preempted, but I am losing computing power.
How do you use Snakemake in a context where your jobs can be preempted? Does anyone see a solution so that I no longer get the IncompleteFilesException?
Thanks in advance

Snakemake has a restart feature, which can be used to let jobs be resubmitted automatically. However, you are right that there is currently no special handling for preemption; I was not even aware that something like that exists on Slurm. A PR in that direction would of course be welcome. Basically, one would need to extend the status script handling to recognize this case and restart the job.
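For reference, the restart feature is exposed through the --restart-times flag; a rough sketch of how it could be combined with a cluster status script (the sbatch options are just a placeholder for whatever you normally pass):

    # Sketch only: resubmit failed jobs up to 3 times
    snakemake --jobs 100 \
              --cluster "sbatch <your usual options>" \
              --cluster-status ./status.py \
              --restart-times 3 \
              --rerun-incomplete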

Thanks for reporting these, I see two separate issues here:
Handling of the PREEMPTED status returned by Slurm.
The IncompleteFilesException suggesting you use --rerun-incomplete when that is exactly what you are doing.
1. PREEMPTED status handling
I have no experience with Slurm, so I cannot say whether the script example in the docs you are linking to will work for it. In particular, the expression in output = str(subprocess.check_output(expression)) might have to be adjusted to Slurm in some way. Maybe there's someone around here who also uses Slurm and has found a working solution in the past?
Otherwise, adding PREEMPTED to the running_status list should be exactly what you want to do (assuming that this is exactly the tag returned by expression).
If this has to be adapted to Slurm and you manage to put together a working status.py script, it might be worth adding it to the docs via a pull request against this file, so that other Slurm users don't have to reinvent the solution.
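To make this a bit more concrete, here is a rough, untested sketch of what such a status.py might look like for Slurm, using sacct to query the job state (the exact sacct invocation and the returned state strings are assumptions, so adapt it to your site):

    #!/usr/bin/env python3
    # Rough sketch of a Slurm status script for --cluster-status.
    # Assumes sacct is available on the submit host.
    import subprocess
    import sys

    jobid = sys.argv[1]  # Snakemake passes the external job id as the first argument

    output = subprocess.check_output(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"]
    ).decode()

    # sacct may report several job steps; use the first reported state and
    # strip qualifiers such as "CANCELLED by <uid>".
    state = output.strip().split("\n")[0].split(" ")[0] if output.strip() else ""

    running_status = ["PENDING", "CONFIGURING", "COMPLETING", "RUNNING",
                      "SUSPENDED", "PREEMPTED"]  # PREEMPTED counted as still running

    if state in running_status:
        print("running")
    elif state.startswith("COMPLETED"):
        print("success")
    else:
        print("failed")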
2. IncompleteFilesException with --rerun-incomplete flag
From the general description, this sounds like a bug, but without any details I cannot be sure. It might be worth filing an issue in the snakemake repo, either simply describing the behavior in more detail or, even better, providing a minimal example to reproduce it.
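If it helps, a minimal example could be as simple as a single long-running rule that you preempt by hand (e.g. with scancel) while it runs and then resubmit with --rerun-incomplete; the Snakefile below is only a hypothetical illustration of that idea:

    # Hypothetical minimal Snakefile: submit it to the cluster, preempt or
    # cancel the running job, then rerun with --rerun-incomplete and check
    # whether the IncompleteFilesException still appears.
    rule slow:
        output:
            "result.txt"
        shell:
            "sleep 600 && touch {output}"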

Related

Get info about what resource is blocking task in FreeRTOS

I have a FreeRTOS (v9.0, in case it matters) based embedded product. This product has multiple tasks, which interact with each other using multiple semaphores, mutexes, queues, and other task-blocking resources. Unfortunately, I have a seldom-occurring bug which causes one task to permanently block on some resource (perhaps a deadly embrace?).
My efforts to trap it so far have been fruitless. However, I can attach a debugger to the running target after the problem happens and pause the processor. Since I have each task's handle as a global variable, I was hoping to extract some useful information about which resource the task is blocked by. However, a handle is nothing but a pointer, and I can't figure out how to get useful information from that.
Does anyone have any ideas on how I can find out which task blocking resource is holding off the task?
UPDATE: It seems to me that, since I know the task which is getting stuck, I should be able to look at its stack and extract some useful information about that. Unfortunately, I'm not sure how to get access to the current stack pointer, nor how deep into the stack I'd have to go to make sense of what's on there.

RSpec errors when run en masse, but not individually

Unfortunately I don't have a specific question (or clues), but was hoping someone could point me in the right direction.
When I run all of my tests (rspec spec), I am getting two tests that fail specifically related to Delayed Job.
When I run this spec file in isolation (rspec ./spec/controllers/xxx_controller_spec.rb), all the tests pass. Is this a common problem? What should I be looking for?
Thanks!
You are already mentioning it: isolation might be the solution. Usually I would guess that you have things in the database that are being changed and not cleaned up properly (or rather, that are not mocked properly).
In this case, though, I would suggest that because the system is under quite a high workload, the delayed jobs are not being worked off fast enough. This is the challenge with all asynchronous tasks that need to be tested: you must not let the system actually run the delayed jobs, but mock the calls and just make sure that the delayed jobs have been received.
Sadly, with no examples, I can hardly point out the missing mocks. But make sure that all calls to delay_jobs and similar receive the correct data without actually creating and running those jobs - your specs will be faster, too. Make sure you isolate the function under test and do not call external dependencies.

Debugging an intermittently stuck NSOperationQueue

I have an iOS app with a really nasty bug: an operation in my NSOperationQueue will for some reason hang and not finish executing, so additional operations keep being queued up but never execute. This in turn leads to the app not being able to perform critical functions. I have not yet been able to identify any pattern other than that it occurs on one of my co-workers' devices every week or so. Running the app from Xcode at that point does not help, as killing and relaunching the app resolves the issue for the time being. I've tried attaching the debugger to a running process and I seem to be able to see log data, but any breakpoints I add are not registering. I've added a breadcrumb trail of NSLogs to try to pinpoint where it's hanging, but this has not yet led to a resolution.
I originally described the bug in another question which is yet to have a clear answer I'm guessing because of the lack of info I'm able to provide around this issue.
A friend once told me that it's possible to save the entire memory stack of an app at a given moment in some form and reload that exact state of memory onto a process on a different device. Does anyone know how I can achieve that? If that's possible, the next time someone encounters the bug I can save that exact state of memory and replicate it to test all my theories of possible solutions. Or is there a different approach to tackling this? As an interim measure, do you think it would make sense to forcefully make the app crash when it enters this state, so actual users would be less confused? I have mixed feelings about this, but the user will have to kill the app from the multitasking dock anyway in order to use it again. I can check the operation queue count or create some kind of timeout code for this until I actually nail this bug.
This sounds like a deadlock caused by a very rare race condition. You also mentioned using a maxConcurrentOperationCount of 2. This means that either:
1. some operation is blocking the operation queue and waiting for the main thread to release some lock, while the main thread is waiting for that operation to finish, or
2. two operations are waiting on each other to release some lock.
Case 1 seems very unlikely, as the queue should require 2 concurrent operations to be blocked before it is completely stuck, unless you are using some system functions that have concurrency issues and block your whole queue instead of just one thread.
In this case, my first attempt to debug would be to connect the debugger and pause execution. After that you can look at the stack traces for all threads. You should be able to find the 2 threads that are created by your operation queue, after which I would review the responsible functions to find code that might possibly be waiting on some lock. Make sure to take system functions into consideration.
Well, it's quite hard to solve bugs that don't crash the app but just hang a thread. If you can't find the bug by going through your code step by step and checking for possible deadlocks or race conditions, I would suggest implementing some logging.
Write your log to disk every time you add a log entry. That's not the most efficient way, but if you give a build with logging enabled to your co-worker, you can pull the log from his iPhone when things go wrong, even while the app is still running.
Make sure you log every step you take, including the values of important variables around the code that you suspect of breaking the app. This way you can see what the app is doing and what its state is.
Hope this helps a bit. I don't know about restoring the state of memory of an app, so I can't help you with that.
Note: if the app were crashing on the bug you could use some other tools, but if I understand correctly that's not the case here, is it?
I read the question describing the bug, and I would try to log to disk what the currently running operations are doing. It seems the operations hang once in a while, and there is a bug in there. If you can log what methods are called while running the operation, this will show you which function call hangs the app, and you can start looking there.
You didn't say this but I presume the bug occurs while a human operator is working with the app? Maybe you should add an automated mode to this app, where the app simulates the same operations that users normally do, using randomized times for starting different actions. Then you can leave the app running unattended on all your devices and increase the chances of seeing the problem.
Also, since the problem appears related to the NSOperationQueue, maybe you should subclass it so that you can add logging to the more interesting methods. For example, each time an operation is added you should log the state of the queue, since you suspect that sometimes it is getting suspended.
Also, as I suggested on your other question, you may want to set up an observer to get notified if the queue ever goes into a suspended state.
Good luck.
Checking assumptions here, since that never hurts: do you actually have evidence that your background threads are hanging? From what you report, the observed behavior is that the tasks you're putting in your background thread are not achieving the outcome that you expected. That doesn't necessarily indicate that the thread has hung—it might just indicate that the particular conditions meant that the thread closed due to all tasks being completed, without the tasks achieving what you wanted them to.
Addition: Given your answer in the comments, it seems to me the next step is to add logging when an item begins executing in the queue, so that you can identify which items lead to the queue becoming blocked. My best guess is that it is a certain class of items, or items with certain characteristics if they are all of the same class. Log enough as the first step of executing each item that you'll have a reasonable characterization of it; then, once you get a real device that has entered this state, check the logs and see just what conditions are leading to the problem. That should enable you to reliably reproduce the problem on a device during debugging or in the simulator, and then nail it.
In other words—I would focus your attention on identifying the problematic operations first, rather than trying to identify the particular line of code where things are stalling.
In my case, start instead of main had to be overridden.
When in doubt, consult https://developer.apple.com/documentation/foundation/nsoperation#1661262?language=objc for discrepancies with your implementation.

How stable is RabbitMQ in production (using DRBD and Pacemaker)?

I'm looking for experience with RabbitMQ, especially in an HA configuration using Pacemaker and DRBD, as recommended here: http://www.rabbitmq.com/pacemaker.html
The DRBD part in particular makes me nervous, so I'm hoping someone here has real-world experience to share.
It works most of the time. However, you'll have to pay special attention to fencing (split brain) when dealing with DRBD. On a production system it's always a pain to have to fix this kind of issue manually.
We failed to run RabbitMQ in a master/slave setup (multi-state RA); we thought it would enhance availability. We're back to a single instance now. If anyone else has experience with several RabbitMQ instances running concurrently and backing a master entity, it would be great if you shared it!
I find that the lack of tools to debug Pacemaker when there are issues is a big hurdle to deploying to live systems. It's not always clear what Pacemaker is "thinking" or doing; hb_report is unfortunately not sufficient.
Hope this helps,
D.
We tried a master/slave configuration as well; however, it became difficult to keep all instances up to date with no downtime. And trust me, you want to update RabbitMQ: there are always bugs popping up, either in RabbitMQ itself or in Erlang.
We've been getting about 100 crashes per year without any meaningful explanation in the logs. The error log just has a generic "error while starting" in it, and that's pretty much it. Sometimes it won't start after the crash, and most of those times the only solution is to delete all the persistent messages from all instances so that the queue state is synchronized across the cluster. Other times it would crash immediately after launching, and only after multiple repeated attempts would it properly load. Meaning there is no added reliability whatsoever when using master/slave; at least there was none in our case. (RabbitMQ 3.5.3, Erlang 18.0)
It works for production, but only if you keep a copy of the messages somewhere, in the logs or in a database, from where they can be quickly recovered after a major crash.

How do you solve a problem that is unreproducible, random and changes are not immediately testable?

Thought I would throw this one out there and see what other people's experiences have been like.
I'm experiencing an issue with a system at work where it stops processing jobs in a queue and 'jams' so to speak. Once the services are restarted the software processes the queue and everything returns to normal.
In my experience so far, I cannot for the life of me figure out what is causing these stoppages. That, and I cannot reproduce the stoppage myself. The queue fails at all different intervals, sometimes running for a month straight, other times failing as close together as twice in 1 day. I have since involved two different vendors and various colleagues within the department and everyone is stumped, and has been for several months.
Since I started, we've isolated the processing to a single server and cranked up the logging, which we've sent to the vendors. Neither has any idea what the problem is.
We've updated a few settings here and there and upgraded client and server pieces, but we have no idea whether the things we are doing are contributing to an overall solution.
So I have a problem that appears to be unreproducible, random and untestable.
Has anyone been involved with any similar situations?
What are some of the ways to solve a situation like this?
Any shared input or experiences would be great.
Cheers,
EDIT: Cranked up the logging, updated all of the components to the latest version, and made sure proper anti-virus exclusions were in place, and so far it has not jammed in over a month!
Use a logging framework that can be turned on in production. You might have too much logging initially, but it should help narrow down the problem; as you get closer, you can narrow the scope of the logging and at the same time increase the verbosity (is that a word?) of the remaining log statements.
In addition to the logging pointed out by Kelly, there is the possibility of a deadlock taking place, since things seem to stop. One option, if this is a Java application, is to use jconsole and connect to the JVM instance. jconsole has a detect-deadlock option which can provide very valuable information when the hang occurs.
If this is not a Java application but perhaps a .NET application, you could make use of this technique.