I hope someone can point me in the right direction or shed some light on the issue I'm having. We have Autosys 11.3.5 running in a Windows environment.
I have several jobs set up to launch on a remote NAS server.
I needed JOB_1 in particular to run only if another job completed successfully.
Seems straightforward enough. In the UI there's a section to specify a Condition such as s(job_name), which I have done, and my assumption is that my job should run ONLY if the job named job_name succeeds.
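For reference, here is roughly what the dependent job's JIL looks like (the machine name, command path, and the upstream job name JOB_2 are placeholders):

insert_job: JOB_1   job_type: cmd
machine: nas_server
command: C:\jobs\run_my_task.bat
condition: s(JOB_2)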
No matter what I do, when I make the second job fail on purpose (whether by manually setting its status to FAILURE or by changing some of its parameters so that its normal run causes it to fail), the job that I run afterwards seems to ignore the condition altogether and completes successfully every time.
I've triple-checked the job names (in fact I copied and pasted them from the JIL definition of the job, so there are no typos), but the condition is still being ignored.
Any help in figuring out how to make one job run only if another did not fail (and NOT run if it DID fail) would be appreciated.
If both jobs are scheduled and become active together, then this should not happen.
My guess is that you are force-starting the other job while the first one is failed. If that's the case, conditions will not be evaluated.
You need to let both jobs start as per their schedule, or at least let the other job start as scheduled while the first one is failed. In that case the other job will stay in the AC (activated) state until the first one goes to SU (success).
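One way to sanity-check this (assuming the standard Autosys client commands are available, and using JOB_2 as a stand-in for the upstream job): force-start only the upstream job, leave the dependent one to its condition, and watch it with autorep:

REM re-run only the upstream job; do not force-start the dependent one
sendevent -E FORCE_STARTJOB -J JOB_2
REM the dependent job should sit in AC until the upstream job reaches SU
autorep -J JOB_1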
Let me know if this is not the case, and I will suggest another solution.
I'm in a situation where I have a server running SQL Server 2012 with roughly two hundred scheduled jobs (all are SSIS package executions). I'm facing a directive from management that I need to run some custom software to create a bug report ticket whenever a job fails. Right now I'm relying on half the jobs notifying an operator on failure, while the other half do something like "go to step X: send failure email" for each step on failure, where step X is some SQL that queries the DB and sends out an email saying which job failed at which step.
So what I'm looking for is some universal solution where every job does the same thing when it fails (in this case, runs some program that creates a bug-tracking ticket). I am trying to avoid manually going into every single job and adding a new step at the end, with all previous steps changed to "go to step Y on failure", where step Y is the thing that creates the bug report.
My first thought was to create a new job that queries the execution history tables, looks for unhandled failures, and then creates the bug reports itself. However, I already made the mistake of presenting this idea to the manager and was told it's not a viable solution because it's "reactive and not proactive" and doesn't create tickets in real time. I should know better than to brainstorm with non-programming management, but it's too late, so that option is off the table and I haven't been able to uncover any other methods.
Any suggestions?
I'm proposing this as an answer, though it's not a technical solution. Present the possible solutions and let the manager decide:
Update all the Agent Jobs - This will take a lot of time and every job will need to be tested, which will also take a lot of time. I'd guess 2-8 weeks depending on how it's done.
Create an error handler job that monitors the logs and creates tickets based on those errors. This has two drawbacks: it is not "real-time" (as desired by the manager), and something will need to be put in place to ensure errors are only reported once. It has the upside of being a single change to manage, and it can be made near real-time if it runs every minute.
A third option, which would be more of a preliminary step, is to create an error report based on the logs. This will help to understand the quantity and types of failures, which may help to shape the ultimate solution: do we want all these tickets, can they be broken up into different categories, do we want tickets for errors that are self-healing (e.g. connection errors that have built-in retries)? A sketch of such a report is below.
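Assuming these are ordinary SQL Server Agent jobs, a minimal sketch of that report against the standard msdb tables might look like this (the 24-hour window is just an example):

-- failed job steps in the last day; run_status = 0 means the step failed,
-- and step_id = 0 rows are the overall job outcome
SELECT  j.name AS job_name,
        h.step_id,
        h.step_name,
        msdb.dbo.agent_datetime(h.run_date, h.run_time) AS run_datetime,
        h.message
FROM    msdb.dbo.sysjobhistory AS h
JOIN    msdb.dbo.sysjobs AS j ON j.job_id = h.job_id
WHERE   h.run_status = 0
  AND   msdb.dbo.agent_datetime(h.run_date, h.run_time) > DATEADD(DAY, -1, GETDATE())
ORDER BY run_datetime DESC;

The same query, run every minute and checked against failures that already have tickets, is essentially option 2.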
I know it's possible on a queued job to change directives via scontrol, for example
scontrol update jobid=111111 TimeLimit=08:00:00
This only works in some cases, depending on the administrative configuration of the Slurm instance (I'm not an admin), so this post does not answer my question.
What I'm looking for is a way to ask SLURM to add more time to a running job, if resources are available, and even if it's already running. Sort of like a nested job request.
Particularly a running job that was initiated with srun on-the-fly.
In https://slurm.schedmd.com/scontrol.html, it is clearly written under TimeLimit:
Only the Slurm administrator or root can increase job's TimeLimit.
So I fear what you want is not possible.
And it makes sense: the scheduler looks at job time limits to decide which jobs to launch, and some short jobs can benefit from backfilling to start before longer jobs, so it would be a real mess if users were allowed to change a job's length while it is running. Indeed, how would you define "when resources are available"? A node can sit IDLE for some time because Slurm knows that it will need it soon for a large job.
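To illustrate with the example from the question (job id and times are placeholders), a regular user can lower a running job's limit, but any attempt to raise it is rejected unless it comes from root or the Slurm administrator:

scontrol show job 111111 | grep TimeLimit        # check the current limit
scontrol update JobId=111111 TimeLimit=04:00:00  # allowed for the job owner: lower than the current limit
scontrol update JobId=111111 TimeLimit=12:00:00  # rejected for a normal user: only root/the Slurm admin can increase it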
Hi Spring Batch users,
Regarding the documentation at http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320:
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions with something like this:
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
for (JobExecution jobExecution : jobExecutions) {
    // ... decide whether this execution is really stale ...
    jobExecution.setStatus(BatchStatus.FAILED);
    jobExecution.setEndTime(new Date());
    jobRepository.update(jobExecution);
    jobOperator.restart(jobExecution.getId());
}
But this seems to be very inconvenient.
1) I have to do this before any other (new) jobs can be started.
2) I have to handle multiple running server instances, so findRunningJobExecutions alone will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a way to register a "clean up stale jobs on start-up" listener. That would still not fix the problems caused by the multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is actually running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently from your application throwing a caught Exception. The reason for this is that you've effectively pulled the rug out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database will remain and the job will still look STARTED.
"OK, fine," you say, "but if I start another execution, I want it to find that STARTED execution and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that has failed but couldn't update the database. The framework here correctly errs on the side of caution and prevents you from starting a job that already appears to be running, and this is a GOOD thing.
Why? Because let's assume your job was actually still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution, and fail with the message "A job execution for this job is already running". I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
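For illustration, this is roughly what that second launch attempt looks like in code (job, jobParameters, and the launcher are whatever you normally launch with):

try {
    jobLauncher.run(job, jobParameters);
} catch (JobExecutionAlreadyRunningException e) {
    // the repository still shows a STARTED execution for these parameters,
    // so the framework refuses to start a second one
    System.err.println("Refused to launch: " + e.getMessage());
} catch (JobExecutionException e) {
    // other launch failures (restart rules, invalid parameters, ...)
    e.printStackTrace();
}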
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event your job is killed from the Linux command line or dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs:
Stop via throwing an Exception
Stop via JobOperator.stop()
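As a sketch of the JobOperator route (the job name is a placeholder, and handling of NoSuchJobException/JobExecutionNotRunningException is omitted):

// ask every execution the repository believes is running to stop gracefully
Set<Long> runningIds = jobOperator.getRunningExecutions("myJob");
for (Long executionId : runningIds) {
    // flips the status to STOPPING; the job stops at the next chunk/step boundary
    jobOperator.stop(executionId);
}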
Is there a plugin, or can I somehow configure Jenkins, so that a job (which is triggered by 3 other jobs) queues until a specified time and only then executes the whole queue?
Our case is this:
we have tests run for 3 branches
each of the 3 build jobs for those branches triggers the same smoke-test-job that runs immediately
each of the 3 build jobs for those branches triggers the same complete-test-job
points 1. and 2. work perfectly fine.
The complete-test-job should queue the tests all day long and just execute them in the evening or at night (starting from a defined time like 6 pm), so that the tests are run at night and during the day the job is silent.
Triggering the complete-test-job at a specified time with only the newest version is not an option. We absolutely need the trigger from the upstream build job (because of the promotion plugin, and because we do not want to re-run versions that have already been run).
That seems a rather strange request. Why queue a build if you don't want it now... And if you want a build later, then you shouldn't be triggering it now.
You can use Jenkins Exclusion plugin. Have your test jobs use a certain resource. Make another job whose task is to "hold" the resource during the day. While the resource is in use, the test jobs won't run.
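An untested sketch of what the "hold" job could do, assuming it takes the shared resource through the plugin's configuration and is triggered every morning (the 18:00 release time is an assumption):

#!/bin/bash
# single build step of the hold job: keep the shared resource busy until 18:00
now=$(date +%s)
release=$(date -d '18:00' +%s)
if [ "$release" -gt "$now" ]; then
    sleep $(( release - now ))
fi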
Problem with this: you are going to kill your executors by having queued non-executing jobs, and there won't be free executors for other jobs.
Haven't tried it myself, but this sounds like a solution to your problem.
I need to execute a cron job based on whether or not the cron job that ran before it was at least partially successful, so I am trying to set up conditions for the run. Sometimes these conditions would be on the local machine and sometimes on a remote machine.
Is there a way to do this?
In that case you can share a common file: the first job writes its status to it, and the second cron job reads it and decides on that basis whether it should proceed or not.
In the case of a remote machine, you can share that file over the network.
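A minimal sketch of that idea as crontab entries (paths and times are just examples); for the remote case, put the status file on a share or NFS mount that both machines can see:

# first job: write its exit status to a shared status file
0 1 * * * /usr/local/bin/job1.sh; echo $? > /var/tmp/job1.status

# second job: run only if the first one reported success (exit status 0)
0 2 * * * [ "$(cat /var/tmp/job1.status 2>/dev/null)" = "0" ] && /usr/local/bin/job2.sh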