Is it possible to request more time to a running job in SLURM? - jobs

I know it's possible on a queued job to change directives via scontrol, for example
scontrol update jobid=111111 TimeLimit=08:00:00
This only works in some cases, depending on the administrative configuration of the slurm instance (I'm not an admin). Thus this post does not answer my question.
What I'm looking for is a way to ask SLURM to add more time to a running job, if resources are available, and even if it's already running. Sort of like a nested job request.
Particularly a running job that was initiated with srun on-the-fly.

In https://slurm.schedmd.com/scontrol.html, it is clearly written under TimeLimit:
Only the Slurm administrator or root can increase job's TimeLimit.
So I fear what you want is not possible.
An it makes sense, since the scheduler looks at job time to decide which jobs to launch and some short jobs can benefit from back-filling to start before longer jobs, it would be really a mess if users where allowed to change the job length while running. Indeed, how to define "when resource are available"? Some node can be IDLE for some time because slurm knows that it will need it soon for a large job

Related

Spring Batch restart crashed jobs

Hi spring batch users,
regarding the documentation http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions by using
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
...
jobExecution.setStatus(FAILED);
jobExecution.setEndTime(new Date());
jobRepository.update(jobExecution);
jobOperator.restart(jobExecution.getId());
But this seems to be very inconvenient.
1) I have to do this before other (new) jobs could be started.
2) I have to handle multiple instances of running servers so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution to register a "start up clean jobs listener". This will still not fix the problems originated by the multi server environment because spring batch does not know if the JobExecution marked by STARTED is not running on an other instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than you application throwing a caught Exception. The reason for this is that you've effectively pulled the carpet out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain and the job will still look STARTED.
"OK, fine" you say, "but if I start another execution, I want it to find that STARTED execution, and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't up the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted by accident. As coded, the framework will start to spin up, see your running execution and fail with the following message A job execution for this job is already running. I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event the Linux terminal kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill you job, you can leverage several other standard patterns for stopping jobs:
Stop via throw Exception
Stop via JobOperator.stop()

Autosys JIL ignoring success conditions

I hope someone can point me in the right direction or shed some light on the issue I'm having. We have Autosys 11.3.5 running in Windows environment.
I have several jobs setup to launch on a remote NAS server.
I needed JOB_1 in particular to only run if another completed successfully.
Seems straight forward enough. In UI there's a section to specify Condition such as: s(job_name) as I have done and I'm assuming that ONLY if the job with name job_name succeeds that my initial job should run.
No matter what I do, when I make the second job fail on purpose (whether manually setting its status to FAILURE) or changing some of its parameters so that its natural run time causes it to fail. The other job that I run afterwards seems to ignore the condition altogether and complete successfully each time.
I've triple checked the job names (in fact I copy and pasted it from the JIL definition of the job so there are no typos), but it's still being ignored.
Any help in figuring out how to make one job only run if another did not fail (and NOT to run if it DID fail) would be appreciated.
If both the jobs are scheduled and become active together, then this should not happen.
The way i think is, you must be force starting the other job while the first is failed. If thats the case, then conditions will not work.
You need to let both the jobs start as per schedule, or at least the other job start as per schedule while the first one is failed. In that case the other job will stay in AC state until the first one is SU.
Let me know if this is not the case, i will have to rephrase another solution then.

Hangfire 1.3.4 - deleted jobs stuck in queue

We are running hangfire single threaded using BackgroundJobServerOptions.WorkerCount = 1 (because we have a requiement for ordered processing).
Most of the time this is no problem, but occasionally a job gets stuck for entirely expected reasons (eg, the actual code it is running goes into an infintite loop), but because we are running single threaded this prevents other jobs in the queue from starting.
In order to try and work around this, we delete the job, but then it stays on the queue, blocking any other job from starting:
The only way I have found to resolve this is to drop and recreate the hangfire DB which is obviously not great.
Why does deleting a running job in hangfire not also remove it from the queue? Is this weird delete behavior a bug which to be fixed in a later version, or is this behavior by design because we're running single threaded?
If this is by design then how do you cancel a processing job in a way which removes it from the queue?
Well it seems that this behavior is by design.
If the IIS app pool worker is recycled, Hangfire will start processing the next task immediately. However, without this restart Hangfire will "hang" indefinitely.
An issue was raised on github about this, however it has not been solved yet:
https://github.com/HangfireIO/Hangfire/issues/80
With no way to cancel or manually "fail" a job, this makes hangfire a lot less useful in a single threaded scenario.
Update: this has been partially or fully addressed in some later version of Hangfire.

What is the practice for scheduling multiple inter-dependent SQL Server Agent jobs?

The way my team currently schedules jobs is through the SQL Server Job Agent. Many of these jobs have dependencies on other internal servers which in turn have their own SQL Server Jobs that need to be run to keep their data up to date.
This has created dependencies in the start time and length of each of our SQL Server Jobs. Job A might depend on Job B finishing, so we schedule Job B a certain estimated time in advance to Job A. All of this process is very subjective and not scalable, as we add more jobs and servers which create more dependencies.
I would love to get out of the business of subjectively scheduling these jobs and hoping that the dominos fall in the right order. I am wondering what the accepted practices for scheduling SQL Server jobs are. Do people use SSIS to chain jobs together? Is there tooling already built into the SQL Server Job Agent to handle this?
What is the accepted way to handle the scheduling of multiple SQL Server jobs with dependencies on each other?
I have used Control-M before to schedule multiple inter-dependent jobs in different environment. Control-M generally works by using batch files (from what I remember) to execute SSIS packages.
We had a complicated environment hosting 2 data warehouses side by side (1 International and 1 US Local). There were jobs that were dependent on other jobs and those jobs on others and so on, but by using Control-M we could easily decide on the dependency (It has a really nice and intuitive GUI). Other tool that comes to my mind is Tidal Scheduler.
There is no set standard for job scheduling, but I think its safe to say that job schedules depend entirely on what an organization needs. For example Finance jobs might be dependent on Sales and Sales on Inventory and so on. But the point is, if you need to have job inter dependency, using a third party software such as Control-M is a safe bet. It can control jobs on different environments and give you real sense of the company wide job control.
We too had the requirement to manage dependencies between multiple agent jobs - after looking at various 3rd party tools and discounting them for various reasons (mainly down to the internal constraints relating to the use of 3rd party software) we decided to create our own solution.
The solution centres around a configuration database that holds details about processes (jobs) that need to run and how they are grouped (batches), along with the dependencies between processes.
Summary of configuration tables used:
Batch - highlevel definition of a group of related processes, includes metadata such as max concurrent processes, and current batch instance etc.
Process - meta data relating to a process (job) such as name, max wait time, earliest run time, status (enabled / disabled), batch (what batch the process belongs to), process job name etc.
Batch Instance - the active instance of a given batch
Process Instance - active instances of processes for a given batch
Process Dependency - dependency matrix
Batch Instance Status - lookup for batch instance status
Process Instance Status - loolup for process instance status
Each batch has 2 control jobs - START BATCH and UPDATE BATCH. The 1st deals with starting all processes that belong to it and the 2nd is the last to run in any given batch and deals with updating the outcome statuses.
Each process has an agent job associated with it that gets executed by the START BATCH job - processes have a capped concurrency (defined in the batch configuration) so processes are started up to a max of x at a time and then START BATCH waits until a free slot becomes available before starting the next process.
The process agent job steps call a templated SSIS package that deals with the actual ETL work and with the decision making around whether the process needs to run and has to wait for dependencies etc.
We are currently looking to move to a Service Broker solution for greater flexibility and control.
Anyway, probably too much detail and not enough example here so VS2010 project available on request.
I'm not sure how much this will help, but we ended up creating an email solution for scheduling.
We built an email reader that accesses an exchange mailbox. As jobs finish, they send an email to the mail reader to start another job. The other nice part, is that most applications have email notifications built in, so there really isn't much in the way of custom programming.
We really only built it in the first place to handle data files coming in from lots of other partners. It was much easier to give them an email address rather than setting them up with an ftp site, etc.
The mail reader app now has grown to include basic filtering, time of day scheduling, use of semaphores to prevent concurrent jobs, etc. It really works great.

Select to recycle worker processes after a specific period of inactivity

Can anyone confirm that this statement "Select to recycle worker processes after a specific period of inactivity" in this Microsoft help file is wrong and should in fact not have the "of inactivity" at the end of it?
Yes, that seems wrong. As far as I'm aware, this option just recycles the processes regardless of whether they are idle or not.
This article seems to confirm that too.
The statement you quoted is correct. It's a way to allow you to free up resources that aren't being used.
When it is necessary to conserve
system resources by terminating unused
worker processes, you can configure a
worker process to gracefully close
after a specified period of time. You
can use this feature to better manage
the resources when the processing load
is heavy, when identified applications
consistently fall into an idle state,
or when new processing space is not
available. You can also start
additional worker processes to replace
a worker process that is finished.
http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/83b35271-c93c-49f4-b923-7fdca6fae1cf.mspx?mfr=true