Snakemake: how to best discern between the various types of Node failures in cluster-mode? - snakemake

In Snakemake, as far as I know, we can only adapt job resources dynamically based on the number of attempts a job has made. When trying to re-adjust resources after a failure, it would be useful to distinguish between the following types of cluster failure:
Program error
Transient node failure
Out of memory
Timeout
The last three cases, in particular, are exposed to the SLURM user via distinct job completion status codes, but Snakemake's interface to the status script merges all types of failure into a single "failed" status.
Is there any way to do this, or is it a planned feature? Keeping a list of previous failure reasons, rather than just the attempt count, would be most useful.
e.g. goal:
rule foo:
    resources:
        mem_gb=lambda wildcards, attempts: 100 + (20 * attempts.OOM),
        time_s=lambda wildcards, attempts: 3600 + (3600 * attempts.TIMEOUT)
    ...
The cluster I have access to has heterogeneous machines where each node is configured with various walltime and memory limits, and it would minimize scheduling times if I didn't have to conservatively bump all resources at once.
Possible workaround: I thought of keeping track of that extra info between the job status script and the cluster submission script (e.g. keeping a history of status codes for each jobid). Is the attempt number available to the cluster submission and cluster status commands?
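A minimal sketch of that workaround (my own assumption of how it could look, not an official interface): the cluster-status script queries sacct for the precise SLURM state and appends it to a history file keyed by jobid, which a submit wrapper could read back. The history file name and layout are hypothetical.

#!/usr/bin/env python3
# Cluster-status script sketch: print a Snakemake-compatible status and
# record the SLURM-reported failure reason per jobid.
# "failure_history.json" is a hypothetical side-channel, not a Snakemake API.
import json
import subprocess
import sys

jobid = sys.argv[1]
out = subprocess.run(
    ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
    capture_output=True, text=True,
).stdout
state = out.splitlines()[0].strip() if out.strip() else "PENDING"

if state.startswith(("PENDING", "RUNNING", "COMPLETING", "SUSPENDED")):
    print("running")
elif state.startswith("COMPLETED"):
    print("success")
else:
    # OUT_OF_MEMORY, TIMEOUT, NODE_FAIL, FAILED, CANCELLED, ...
    # Keep the precise reason so a submit wrapper can bump only the
    # resource that actually ran out on the next attempt.
    try:
        with open("failure_history.json") as fh:
            history = json.load(fh)
    except FileNotFoundError:
        history = {}
    history.setdefault(jobid, []).append(state)
    with open("failure_history.json", "w") as fh:
        json.dump(history, fh)
    print("failed")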

I think it would be best to handle such cluster-specific functionality in a profile, e.g. in the SLURM profile. When an error is detected, the status script could silently resubmit with updated resources based on what SLURM reports. The Snakefile shouldn't have to be cluttered with such platform details.

Related

Prevent one failed subtask failing all tasks in Flyte

I have a dynamic_task which kicks off a number of python_tasks. However, as soon as one of the python_tasks fails, the other ones that are still running would fail as well. Is this by design? Is there a way to change this behavior so that other tasks can still complete without failing?
This is by design, as a means to save resources, but it is configurable. Presumably, dynamic tasks are related to each other, and downstream tasks will need the output of all of them. So if one fails, the default behavior is to fail the rest.
If you'd like to change this, create your dynamic task with a float as this argument in the decorator: https://github.com/lyft/flytekit/blob/d4cfedc4c580f08bf904e6e474a0b948a4608737/flytekit/common/tasks/sdk_dynamic.py#L84
The idea is that partial failures are not tolerated within a data passing DAG. If some node fails, then by definition the data is partial.
But for dynamic array tasks, Flyte makes a special provision (via the Array tasks plugin) that allows users to provide a ratio of acceptable successful tasks.
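For illustration, a hedged sketch against the old lyft/flytekit SDK: the decorator argument appears (from the linked line) to be allowed_failure_ratio, but verify the name and signature against your flytekit version; the task bodies are placeholders.

from flytekit.sdk.tasks import dynamic_task, inputs, outputs, python_task
from flytekit.sdk.types import Types

@inputs(item=Types.String)
@outputs(result=Types.String)
@python_task
def process_item(wf_params, item, result):
    result.set(item.upper())  # placeholder work

@inputs(items=[Types.String])
@dynamic_task(allowed_failure_ratio=0.2)  # tolerate up to 20% failed sub-tasks
def fan_out(wf_params, items):
    for item in items:
        yield process_item(item=item)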

Is it possible to request more time to a running job in SLURM?

I know it's possible on a queued job to change directives via scontrol, for example
scontrol update jobid=111111 TimeLimit=08:00:00
This only works in some cases, depending on the administrative configuration of the Slurm instance (I'm not an admin), so this post does not answer my question.
What I'm looking for is a way to ask SLURM to add more time to a job, if resources are available, even if it's already running. Sort of like a nested job request.
In particular, for a running job that was initiated on the fly with srun.
In https://slurm.schedmd.com/scontrol.html, it is clearly written under TimeLimit:
Only the Slurm administrator or root can increase job's TimeLimit.
So I fear what you want is not possible.
And it makes sense: the scheduler looks at job time limits to decide which jobs to launch, and some short jobs can benefit from backfilling to start before longer jobs, so it would be a real mess if users were allowed to change a job's length while it is running. Indeed, how would you define "when resources are available"? A node can be IDLE for some time because Slurm knows it will soon need it for a large job.
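For completeness: an administrator can extend a running job, and per the scontrol man page TimeLimit accepts relative increments prefixed with '+', e.g.:

# must be run by root or a Slurm administrator
scontrol update JobId=111111 TimeLimit=+02:00:00

So the practical route for a plain user is to ask the admins, or to resubmit with checkpointing.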

Is it possible to limit number of oozie workflows running at the same time?

This is not clear to me from the docs. Here's our scenario, and why we need this, as succinctly as I can put it:
We have 60 coordinators running, launching workflows usually hourly, some of which have sub-workflows (some multiple in parallel). This works out to around 40 workflows running at any given time. However, when the cluster is under load or some underlying service is slow (e.g. Impala or HBase), workflows run longer than usual and back up, so we can end up with 80+ workflows (including sub-workflows) running.
This sometimes results in ALL workflows hanging indefinitely, because we have only enough memory and cores allocated to this pool for Oozie to start the launcher jobs (i.e. oozie:launcher:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ), but not their corresponding actions (i.e. oozie:action:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ).
We could simply allocate enough resources to the pool to accommodate for these spikes, but that would be a massive waste (hundreds of cores and GBs that other pools/tenants could never use).
So I'm trying to enforce a limit on the number of workflows running, even if that means some will sometimes run behind. BTW, all our coordinators are configured with execution=LAST_ONLY, and any delayed workflow will simply catch up fully on the next run. We are on CDH 5.13 with Oozie 4.1; pools are set up with the DRF scheduler.
Thanks in advance for your ideas.
AFAIK there is no configuration parameter that lets you control the number of workflows running at a given time.
If your coordinators are scheduled to run in approximately the same time window, you could consider collapsing them into a single coordinator/workflow and using the fork/join control nodes to control the degree of parallelism. You can then distribute your actions across K queues in your workflow, which ensures that you will never have more than K actions running at the same time, limiting the load on the cluster.
We use a script to automatically generate the fork queues inside the workflow and distribute the actions (of course this only works for actions that can run in parallel, i.e. with no data dependencies, etc.).
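A minimal sketch of that fork/join layout with K=2 queues (action names, schema versions, and shell commands are placeholders to check against your Oozie version):

<workflow-app name="limited-parallelism-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="split"/>
    <fork name="split">
        <path start="queue-1"/>
        <path start="queue-2"/>
    </fork>
    <!-- each queue is the head of a sequential chain of actions;
         only one action per queue runs at a time -->
    <action name="queue-1">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>run_job_a.sh</exec>
        </shell>
        <ok to="join-queues"/>
        <error to="fail"/>
    </action>
    <action name="queue-2">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>run_job_b.sh</exec>
        </shell>
        <ok to="join-queues"/>
        <error to="fail"/>
    </action>
    <join name="join-queues" to="end"/>
    <kill name="fail"><message>A queue failed</message></kill>
    <end name="end"/>
</workflow-app>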
Hope this helps

Ensuring one job per node for a specific rule

Hello and thank you for reviewing this question!
I'm working on an SGE cluster with 16 available worker nodes. Each has 32 cores.
I have a rule defining a process of which only one instance may run per worker node. This means I could in theory run 16 jobs at a time. It's fine if other things are happening on each worker node; there just can't be two jobs from this specific rule running at the same time. Is there a way to ensure this?
I have tried setting memory resources, but setting, for example,
resources:
    mem_mb=10000
and running
snakemake --resources mem_mb=10000
will only allow one job to run at a time across the whole cluster, not one per node. Is there a way to set each individual node's memory limit? Or some other way to achieve one job per node for only a specific rule?
Thank you,
Eric
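Not from this thread, but one workaround can be sketched under the stated assumptions (32-core nodes, and a per-node parallel environment such as smp defined on the SGE side): request a whole node's worth of cores for that rule, so the scheduler can place at most one such job per node.

rule one_per_node:
    threads: 32   # claim all 32 cores, so SGE can fit only one such job per node
    resources:
        mem_mb=10000
    shell:
        "my_command --threads {threads}"   # hypothetical command

# submit with something like (the PE name is site-specific):
# snakemake --jobs 16 --cluster "qsub -pe smp {threads}"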

What is the practice for scheduling multiple inter-dependent SQL Server Agent jobs?

The way my team currently schedules jobs is through the SQL Server Job Agent. Many of these jobs have dependencies on other internal servers which in turn have their own SQL Server Jobs that need to be run to keep their data up to date.
This has created dependencies in the start time and length of each of our SQL Server jobs. Job A might depend on Job B finishing, so we schedule Job B an estimated amount of time in advance of Job A. This whole process is very subjective and does not scale as we add more jobs and servers, which create more dependencies.
I would love to get out of the business of subjectively scheduling these jobs and hoping that the dominos fall in the right order. I am wondering what the accepted practices for scheduling SQL Server jobs are. Do people use SSIS to chain jobs together? Is there tooling already built into the SQL Server Job Agent to handle this?
What is the accepted way to handle the scheduling of multiple SQL Server jobs with dependencies on each other?
I have used Control-M before to schedule multiple inter-dependent jobs in different environments. Control-M generally works by using batch files (from what I remember) to execute SSIS packages.
We had a complicated environment hosting two data warehouses side by side (one international and one US local). There were jobs that were dependent on other jobs, and those jobs on others, and so on, but by using Control-M we could easily manage the dependencies (it has a really nice and intuitive GUI). Another tool that comes to mind is Tidal Scheduler.
There is no set standard for job scheduling, but I think it's safe to say that job schedules depend entirely on what an organization needs. For example, Finance jobs might depend on Sales, and Sales on Inventory, and so on. But the point is, if you need job inter-dependency, using a third-party tool such as Control-M is a safe bet. It can control jobs in different environments and give you a real sense of company-wide job control.
We too had the requirement to manage dependencies between multiple agent jobs - after looking at various 3rd party tools and discounting them for various reasons (mainly down to the internal constraints relating to the use of 3rd party software) we decided to create our own solution.
The solution centres around a configuration database that holds details about processes (jobs) that need to run and how they are grouped (batches), along with the dependencies between processes.
Summary of configuration tables used:
Batch - high-level definition of a group of related processes; includes metadata such as max concurrent processes, current batch instance, etc.
Process - metadata relating to a process (job), such as name, max wait time, earliest run time, status (enabled/disabled), batch (which batch the process belongs to), process job name, etc.
Batch Instance - the active instance of a given batch
Process Instance - active instances of processes for a given batch
Process Dependency - dependency matrix
Batch Instance Status - lookup for batch instance status
Process Instance Status - lookup for process instance status
Each batch has two control jobs - START BATCH and UPDATE BATCH. The first starts all processes that belong to the batch, and the second is the last to run in any given batch and updates the outcome statuses.
Each process has an agent job associated with it that gets executed by the START BATCH job - processes have a capped concurrency (defined in the batch configuration) so processes are started up to a max of x at a time and then START BATCH waits until a free slot becomes available before starting the next process.
The process agent job steps call a templated SSIS package that deals with the actual ETL work and with the decision making around whether the process needs to run and has to wait for dependencies etc.
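For a concrete picture, here is a minimal T-SQL sketch of the dependency matrix and the readiness check it enables; table and column names are my own guesses, not the actual schema.

-- hypothetical, cut-down versions of the configuration tables
CREATE TABLE Process (
    ProcessId   INT PRIMARY KEY,
    ProcessName SYSNAME NOT NULL   -- name of the agent job to start
);

CREATE TABLE ProcessDependency (
    ProcessId          INT NOT NULL REFERENCES Process(ProcessId),
    DependsOnProcessId INT NOT NULL REFERENCES Process(ProcessId),
    PRIMARY KEY (ProcessId, DependsOnProcessId)
);

CREATE TABLE ProcessInstance (
    ProcessId INT NOT NULL REFERENCES Process(ProcessId),
    Status    VARCHAR(20) NOT NULL   -- Pending / Running / Succeeded / Failed
);

-- a process may start once every one of its dependencies has succeeded
SELECT p.ProcessId, p.ProcessName
FROM Process AS p
WHERE NOT EXISTS (
    SELECT 1
    FROM ProcessDependency AS d
    LEFT JOIN ProcessInstance AS i
           ON i.ProcessId = d.DependsOnProcessId
    WHERE d.ProcessId = p.ProcessId
      AND (i.Status IS NULL OR i.Status <> 'Succeeded')
);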
We are currently looking to move to a Service Broker solution for greater flexibility and control.
Anyway, this is probably too much detail and not enough example, so the VS2010 project is available on request.
I'm not sure how much this will help, but we ended up creating an email solution for scheduling.
We built an email reader that accesses an Exchange mailbox. As jobs finish, they send an email to the mail reader to start another job. The other nice part is that most applications have email notifications built in, so there really isn't much custom programming involved.
We originally built it just to handle data files coming in from lots of other partners; it was much easier to give them an email address than to set them up with an FTP site, etc.
The mail reader app now has grown to include basic filtering, time of day scheduling, use of semaphores to prevent concurrent jobs, etc. It really works great.