I am working with Mule Runtime 4.1.5, and a batch job handles the work of synchronizing data.
If the batch job completes normally, the log looks like this:
Created instance 'dc97a040-009e-11ec-a7bf-00155d801499' for batch job 'sendFlow_Job'
splitAndLoad: Starting loading phase for instance 'dc97a040-009e-11ec-a7bf-00155d801499' of job 'sendFlow_Job'
Finished loading phase for instance dc97a040-009e-11ec-a7bf-00155d801499 of job sendFlow_Job. 1 records were loaded
Started execution of instance 'dc97a040-009e-11ec-a7bf-00155d801499' of job 'sendFlow_Job'
batch step customer log ....
Finished execution for instance 'dc97a040-009e-11ec-a7bf-00155d801499' of job 'sendFlow_Job'. Total Records processed: 1. Successful records: 1. Failed Records: 0
=================end=======================
The log in question is as follows:
Created instance 'dc97a040-009e-11ec-a7bf-00155d801499' for batch job 'sendFlow_Job'
splitAndLoad: Starting loading phase for instance 'dc97a040-009e-11ec-a7bf-00155d801499' of job 'sendFlow_Job'
Finished loading phase for instance dc97a040-009e-11ec-a7bf-00155d801499 of job sendFlow_Job. 1 records were loaded
Started execution of instance 'dc97a040-009e-11ec-a7bf-00155d801499' of job 'sendFlow_Job'
=================end===================
As you can clearly see, the log shows that the batch job only completed the first (loading) phase. After that, it is as if the batch job never existed: there is no further log output and no errors are thrown. And the target database confirms that the data was indeed not synchronized.
I reproduced the problem in my local environment: if I use kill -9 to kill the process while a batch step is executing, the process restarts, and after that all batch jobs have this problem.
I found the queue files used by the batch job in the .mule folder. They have names similar to BSQ-batch-job-flow-name-dc97a040-009e-11ec-a7bf-00155d801499-XXX
Under normal circumstances, each batch job creates three BSQ files and deletes them when the job completes.
In my case, the BSQ files are created but never deleted.
I looked up some posts that suggested deleting the .mule folder and restarting. In the production environment I cannot predict when the problem will occur, and deleting the .mule folder does not completely solve the problem of batch jobs not being executed.
Is anyone proficient with Mule batch jobs? Can you give me some suggestions? Thanks.
You should not delete the .mule directory. There is other information in there unrelated to batch that would be lost, like clustering configurations, persistent object stores, and other applications' batches and queues. It may be OK to delete it inside the Studio embedded runtime, because that is just your development environment and you are probably not losing production data, but in any case just deleting information is not a solution.
There are too many possible causes to identify the right one, and you should provide a lot more information. My first recommendation is to ensure your Mule 4.1.5 has the latest cumulative patch so that all known issues are resolved. Note that Mule 4.1.5 was released almost 3 years ago. If at all possible, migrate to the latest Mule 4.3.0 with the latest cumulative patches. It should be more stable and performant than 4.1.5.
Related
I know it's possible on a queued job to change directives via scontrol, for example
scontrol update jobid=111111 TimeLimit=08:00:00
This only works in some cases, depending on the administrative configuration of the slurm instance (I'm not an admin). Thus this post does not answer my question.
What I'm looking for is a way to ask SLURM to add more time to a running job, if resources are available, and even if it's already running. Sort of like a nested job request.
Particularly a running job that was initiated with srun on-the-fly.
In https://slurm.schedmd.com/scontrol.html, it is clearly written under TimeLimit:
Only the Slurm administrator or root can increase job's TimeLimit.
So I fear what you want is not possible.
And it makes sense: since the scheduler looks at job time limits to decide which jobs to launch, and some short jobs can benefit from backfilling to start before longer jobs, it would be a real mess if users were allowed to change the job length while it is running. Indeed, how would you define "when resources are available"? A node can sit IDLE for some time because Slurm knows it will need it soon for a large job.
Initially the package used the package deployment model (SSIS 2008) and exported data from a local database to local CSV files in parallel.
I've converted it to the project deployment model, and the same parallelism still exists, but now it calls a child package (utilizing 26 threads) through an Execute Package Task (earlier it was through an Execute Process Task) with Execute Out-of-Process enabled, in order to utilize the resources.
The child package picks a random customer out of the 15K customers and exports its related data from a view to the CSV file.
The customers are placed in a table, and all the threads read that table; a mutex is applied over it using TABLOCKX, so whichever thread gets write access first picks up a customer and changes its load status to 'Progress'. The other threads waiting for write access then follow the same process.
This process is repeated in each thread for all the customers using a "For Loop" container.
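To make the locking step concrete, it boils down to something like the following (shown as a plain JDBC sketch rather than the actual Execute SQL Task; the CustomerLoad / CustomerId / LoadStatus names are made up for illustration):
import java.sql.*;

public class ClaimCustomer {
    // Returns the claimed customer id, or null if nothing is left to load.
    static Integer claimNext(Connection conn) throws SQLException {
        conn.setAutoCommit(false);
        try (Statement st = conn.createStatement()) {
            // TABLOCKX takes an exclusive table lock that is held until commit,
            // so only one thread at a time can pick a customer.
            ResultSet rs = st.executeQuery(
                "SELECT TOP 1 CustomerId FROM CustomerLoad WITH (TABLOCKX) " +
                "WHERE LoadStatus = 'Pending'");
            if (!rs.next()) {
                conn.rollback();
                return null; // nothing left to load
            }
            int customerId = rs.getInt(1);
            st.executeUpdate("UPDATE CustomerLoad SET LoadStatus = 'Progress' " +
                "WHERE CustomerId = " + customerId);
            conn.commit(); // releases the lock; the waiting threads repeat the same steps
            return customerId;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}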
The exports run well and quickly, but surprisingly the package hangs for several minutes at the 576th execution, on a random customer. I've tried to reproduce this several times, and it hangs at the same point.
Your help on this is very much appreciated!!
PS: The issue is not there in the earlier version of my package
There is a bug in SSIS 2012 due to which my migrated package hangs.
An SSIS package that executes multiple child packages all at once creates a deadlock in the internal catalog tables, so launching a child package from many parallel threads at exactly the same time should be avoided. If needed, start them with a few milliseconds of delay between launches (> 100 ms).
Adding a delay resolved the problem. I hope this bug will be fixed by Microsoft in later versions of SSIS.
Hi Spring Batch users,
Regarding the documentation at http://docs.spring.io/spring-batch/reference/htmlsingle/#d5e1320:
"If the process died ("kill -9" or server failure) the job is, of course, not running, but the JobRepository has no way of knowing because no-one told it before the process died."
I try to find and restart the stale job executions by using:
Set<JobExecution> jobExecutions = jobExplorer.findRunningJobExecutions(jobName);
for (JobExecution jobExecution : jobExecutions) {
    jobExecution.setStatus(BatchStatus.FAILED);
    jobExecution.setEndTime(new Date());
    jobRepository.update(jobExecution);
    jobOperator.restart(jobExecution.getId());
}
But this seems to be very inconvenient.
1) I have to do this before other (new) jobs can be started.
2) I have to handle multiple running server instances, so findRunningJobExecutions will not do the trick.
You can find other questions regarding this topic:
https://jira.spring.io/browse/BATCH-2433?jql=project%20%3D%20BATCH%20AND%20status%20%3D%20Open%20ORDER%20BY%20priority%20DESC
Spring Batch after JVM crash
I would love to see a solution that lets me register a "start-up clean jobs listener". This still would not fix the problems originating from the multi-server environment, because Spring Batch does not know whether a JobExecution marked as STARTED is actually running on another instance.
Thanks for any advice
Alex
Your job cannot and should not recover "automatically" from a kill -9 scenario. A kill -9 is treated very differently than your application throwing a caught Exception. The reason for this is that you've effectively pulled the rug out from under the application without giving it a chance to reach a synchronization point with the database to commit any necessary information to the ExecutionContext or update the job/step status(es). Therefore, the last status touchpoint with the database remains, and the job will still look STARTED.
"OK, fine," you say, "but if I start another execution, I want it to find that STARTED execution and pick up where it left off." The problem here is that there is no clean way for the application to distinguish a job that is ACTUALLY RUNNING from one that failed but couldn't update the database. The framework correctly errs on the side of caution here and prevents you from starting a job that already appears to be running, and this is a GOOD thing.
Why? Because let's assume your job really was still running and you restarted it by accident. As coded, the framework will start to spin up, see your running execution, and fail with the message "A job execution for this job is already running". I can't tell you how many times we've been saved by this because someone accidentally launched a job twice!
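For reference, on the launching side that protection surfaces roughly like this (just a sketch; jobLauncher, job, params and log are assumed to be wired up elsewhere, and the names are illustrative):
try {
    jobLauncher.run(job, params);
} catch (JobExecutionAlreadyRunningException e) {
    // Spring Batch found a JobExecution for this job still marked STARTED
    // and refused to launch a second one.
    log.warn("A job execution for this job is already running", e);
} catch (JobExecutionException e) {
    // Other launch failures (restart not allowed, invalid parameters, ...).
    log.error("Could not launch job", e);
}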
If you were to implement the listener you suggest, the 2nd execution would instead be allowed to start and you'd have 2 different JVMs repeating the same work, possibly writing to the same files/tables and causing a huge data mess that could be impossible to clean up.
Trust me, in the event the Linux terminal kills your job or your job dies because the connection to the database has been severed, you WANT human eyes on those execution states before you attempt a restart.
Finally, on the off chance you actually wanted to kill your job, you can leverage several other standard patterns for stopping jobs:
Stop by throwing an exception
Stop via JobOperator.stop()
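For example, a graceful stop requested from outside the job could look something like this (again only a sketch, assuming jobOperator and jobExplorer are injected and jobName identifies your job):
Set<JobExecution> running = jobExplorer.findRunningJobExecutions(jobName);
for (JobExecution execution : running) {
    try {
        // Sets the execution status to STOPPING; the job stops at the next chunk boundary.
        jobOperator.stop(execution.getId());
    } catch (NoSuchJobExecutionException | JobExecutionNotRunningException e) {
        // The execution finished (or disappeared) between the lookup and the stop call.
    }
}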
Is there a plugin, or can I configure Jenkins somehow, so that a job (which is triggered by 3 other jobs) queues up until a specified time and only then executes the whole queue?
Our case is this:
we have tests run for 3 branches
1. each of the 3 build jobs for those branches triggers the same smoke-test-job, which runs immediately
2. each of the 3 build jobs for those branches triggers the same complete-test-job
Points 1. and 2. work perfectly fine.
The complete-test-job should queue the tests all day long and just execute them in the evening or at night (starting from a defined time like 6 pm), so that the tests are run at night and during the day the job is silent.
It is not an option to trigger the complete-test-job at a specified time with the newest version. We absolutely need the trigger from the upstream build job (because of the promotion plugin, and we do not want to re-run versions that have already been run).
That seems a rather strange request. Why queue a build if you don't want it now... And if you want a build later, then you shouldn't be triggering it now.
You can use Jenkins Exclusion plugin. Have your test jobs use a certain resource. Make another job whose task is to "hold" the resource during the day. While the resource is in use, the test jobs won't run.
Problem with this: you are going to kill your executors by having queued non-executing jobs, and there won't be free executors for other jobs.
Haven't tried it myself, but this sounds like a solution to your problem.
The way my team currently schedules jobs is through the SQL Server Job Agent. Many of these jobs have dependencies on other internal servers which in turn have their own SQL Server Jobs that need to be run to keep their data up to date.
This has created dependencies on the start time and length of each of our SQL Server jobs. Job A might depend on Job B finishing, so we schedule Job B an estimated amount of time ahead of Job A. This whole process is very subjective and not scalable as we add more jobs and servers, which create more dependencies.
I would love to get out of the business of subjectively scheduling these jobs and hoping that the dominos fall in the right order. I am wondering what the accepted practices for scheduling SQL Server jobs are. Do people use SSIS to chain jobs together? Is there tooling already built into the SQL Server Job Agent to handle this?
What is the accepted way to handle the scheduling of multiple SQL Server jobs with dependencies on each other?
I have used Control-M before to schedule multiple inter-dependent jobs in different environments. Control-M generally works by using batch files (from what I remember) to execute SSIS packages.
We had a complicated environment hosting 2 data warehouses side by side (1 international and 1 US local). There were jobs that were dependent on other jobs, and those jobs on others, and so on, but by using Control-M we could easily define the dependencies (it has a really nice and intuitive GUI). Another tool that comes to mind is Tidal Scheduler.
There is no set standard for job scheduling, but I think it's safe to say that job schedules depend entirely on what an organization needs. For example, Finance jobs might be dependent on Sales, and Sales on Inventory, and so on. But the point is, if you need job inter-dependency, using a third-party tool such as Control-M is a safe bet. It can control jobs in different environments and give you a real sense of company-wide job control.
We too had the requirement to manage dependencies between multiple agent jobs - after looking at various 3rd party tools and discounting them for various reasons (mainly down to the internal constraints relating to the use of 3rd party software) we decided to create our own solution.
The solution centres around a configuration database that holds details about processes (jobs) that need to run and how they are grouped (batches), along with the dependencies between processes.
Summary of configuration tables used:
Batch - high-level definition of a group of related processes; includes metadata such as max concurrent processes, current batch instance, etc.
Process - metadata relating to a process (job), such as name, max wait time, earliest run time, status (enabled / disabled), batch (which batch the process belongs to), process job name, etc.
Batch Instance - the active instance of a given batch
Process Instance - active instances of processes for a given batch
Process Dependency - dependency matrix
Batch Instance Status - lookup for batch instance status
Process Instance Status - lookup for process instance status
Each batch has 2 control jobs - START BATCH and UPDATE BATCH. The 1st deals with starting all processes that belong to it and the 2nd is the last to run in any given batch and deals with updating the outcome statuses.
Each process has an agent job associated with it that gets executed by the START BATCH job - processes have a capped concurrency (defined in the batch configuration) so processes are started up to a max of x at a time and then START BATCH waits until a free slot becomes available before starting the next process.
The process agent job steps call a templated SSIS package that deals with the actual ETL work and with the decision making around whether the process needs to run and has to wait for dependencies etc.
We are currently looking to move to a Service Broker solution for greater flexibility and control.
Anyway, probably too much detail and not enough example here so VS2010 project available on request.
I'm not sure how much this will help, but we ended up creating an email solution for scheduling.
We built an email reader that accesses an Exchange mailbox. As jobs finish, they send an email to the mail reader to start another job. The other nice part is that most applications have email notifications built in, so there really isn't much in the way of custom programming.
We really only built it in the first place to handle data files coming in from lots of other partners. It was much easier to give them an email address than to set them up with an FTP site, etc.
The mail reader app now has grown to include basic filtering, time of day scheduling, use of semaphores to prevent concurrent jobs, etc. It really works great.
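If it helps, the core of such a mail reader can be sketched roughly like this (purely illustrative, not our actual code; it assumes the mailbox is reachable over IMAP, and the host, credentials and "JOB DONE:" subject convention are placeholders):
import java.util.Properties;
import javax.mail.*;
import javax.mail.search.FlagTerm;

public class MailTriggerReader {
    public static void main(String[] args) throws Exception {
        // Connect to the mailbox over IMAP (placeholder host/credentials).
        Session session = Session.getInstance(new Properties());
        Store store = session.getStore("imaps");
        store.connect("mail.example.com", "scheduler@example.com", "secret");
        Folder inbox = store.getFolder("INBOX");
        inbox.open(Folder.READ_WRITE);

        // Only look at messages that have not been processed yet.
        Message[] unread = inbox.search(new FlagTerm(new Flags(Flags.Flag.SEEN), false));
        for (Message message : unread) {
            String subject = message.getSubject();
            if (subject != null && subject.startsWith("JOB DONE:")) {
                String finishedJob = subject.substring("JOB DONE:".length()).trim();
                startDependentJob(finishedJob);
                message.setFlag(Flags.Flag.SEEN, true); // mark as handled
            }
        }
        inbox.close(false);
        store.close();
    }

    private static void startDependentJob(String finishedJob) {
        // Placeholder for whatever starts the next job (e.g. msdb.dbo.sp_start_job or a REST call).
        System.out.println("Would start the job(s) that depend on: " + finishedJob);
    }
}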