Spring Batch: graceful job termination from within the job

After launching a job, there are occasions in the "before job" phase where we want to terminate the job gracefully (i.e. not run the job at all, but without raising an exception either). The current way of doing this appears to be invoking jobExecution.stop(). However, this results in a JobInterruptedException, which in turn results in a logger.error invocation.
Is there a better programmatic alternative (without manual intervention)?

You may want to read Section 5.3.3, Configuring for Stop, and Section 5.3.4, Programmatic Flow Decisions.
Just introduce an end element for your first step, based on a condition:
The 'end' element instructs a Job to stop with a BatchStatus of COMPLETED.

I solved the problem by adding a boolean flag, executeTheJob, to my "before job" listener, which I set to false when I don't want to execute the job.
Then I handle that in my firstStep with this configuration:
    <step id="firstStep" >
        <tasklet ref="myFirstTasklet"/>
        <stop on="STOPPED" restart="firstStep" />
        <next on="COMPLETED" to="nextStep"/>
    </step>
And at the beginning of my first tasklet I have this:
    if (!executeTheJob) {
        contribution.setExitStatus(ExitStatus.STOPPED);
    }
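The flow above can be pictured with a small language-neutral model, written here in Python. This is only a sketch of the idea (a flag checked by the first step, plus a transition table keyed on exit status); it is not the Spring Batch API, and the names run_job, STEPS, TRANSITIONS and execute_the_job are hypothetical.

```python
from enum import Enum


class ExitStatus(Enum):
    COMPLETED = "COMPLETED"
    STOPPED = "STOPPED"


def run_job(steps, transitions, first_step, context):
    """Run steps by following per-step transitions on their exit status.

    Returns the names of the executed steps and the last exit status.
    """
    executed = []
    current = first_step
    status = ExitStatus.COMPLETED
    while current is not None:
        status = steps[current](context)
        executed.append(current)
        if status is ExitStatus.STOPPED:
            break  # graceful stop: no exception, later steps are skipped
        # No matching transition means the job has reached its end.
        current = transitions.get((current, status))
    return executed, status


def first_step(context):
    # The flag is assumed to have been set earlier by a "before job" hook.
    if not context["execute_the_job"]:
        return ExitStatus.STOPPED
    return ExitStatus.COMPLETED


def next_step(context):
    return ExitStatus.COMPLETED


STEPS = {"firstStep": first_step, "nextStep": next_step}
TRANSITIONS = {("firstStep", ExitStatus.COMPLETED): "nextStep"}
```

Calling run_job(STEPS, TRANSITIONS, "firstStep", {"execute_the_job": False}) runs only firstStep and reports STOPPED, without raising anything.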

The stop() instruction takes effect only if the transaction commits successfully.
If all your chunks roll back, your job doesn't stop.
I made this workaround:
Create a ChunkListener and, in the method afterChunkError(ChunkContext chunkCtx), put:
    StepExecution stepExecution = chunkCtx.getStepContext().getStepExecution();
    JobExecution jobExecution = jobExplorer.getJobExecution(stepExecution.getJobExecutionId());
    if (jobExecution.getStatus().equals(BatchStatus.STOPPING)) {
        stepExecution.setTerminateOnly();
    }
This will force a "controlled" stop.

Instead of invoking stop() on the job execution, try signalling it via the JobOperator, as shown in Stopping a Job.

Related

Quartz.NET does not execute nor raise error for a job

Using Quartz.NET 3.0.6, a "malformed" job detail definition was passed to the scheduler; the job was not executed and no error was raised.
The job detail passed one parameter as a bool (ignoreHeaderRow) instead of a string (ignoreHeaderRow.ToString()). Changing the parameter to a string fixed the issue and the job was executed.
    IJobDetail job = JobBuilder.Create<ImportJob>()
        .WithIdentity("Immediate" + DateTime.UtcNow.ToFileTime(), GROUP_NAME)
        .UsingJobData("InfolinxSession", JsonConvert.SerializeObject(session))
        .UsingJobData("unprintable", unprintable.ToString())
        .UsingJobData("ignoreHeaderRow", ignoreHeaderRow.ToString())
        .Build();
    QuartzScheduler.ScheduleJob(job);
Is there a way to catch this scenario?
Quartz.NET logs all execution errors when a job throws an exception. You can enable logging (the LibLog abstraction hooks into NLog, log4net and Serilog), watch the logs, and set up alerts with a modern log aggregation system.
Another option is to attach a scheduler listener that listens for scheduler errors and then performs some action on error, such as a Slack notification or whatever suits your needs.

Sidekiq stop one single, running job

So I need to stop a running job in Sidekiq (3.1.2) programmatically, not a scheduled one. I read the API documentation but didn't find anything about cancelling running jobs. Is this possible with Sidekiq?
If this is not directly possible, my idea was to work around it by raising an exception in the job when I send the signal, then deleting the job from the retry set. This is clearly not optimal, though.
Thanks in advance
Correct, the only way to stop a job is for the job to stop itself. Your application must implement that logic.
https://github.com/mperham/sidekiq/wiki/FAQ#how-do-i-cancel-a-sidekiq-job
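The idea of a job stopping itself can be sketched in a language-neutral way, shown here in Python rather than Ruby. The sketch assumes the job can check a shared cancellation flag between units of work; the in-memory set stands in for a shared store such as Redis, and the names cancel, is_cancelled and perform are hypothetical, not Sidekiq API.

```python
import threading

# Stand-in for a shared store (for example Redis) that every worker can read.
_cancelled_jobs = set()
_lock = threading.Lock()


def cancel(job_id):
    """Mark a job as cancelled; the job itself decides when to act on it."""
    with _lock:
        _cancelled_jobs.add(job_id)


def is_cancelled(job_id):
    with _lock:
        return job_id in _cancelled_jobs


def perform(job_id, items, process):
    """Process items one at a time, checking the flag before each item.

    Returns the number of items that were processed.
    """
    done = 0
    for item in items:
        if is_cancelled(job_id):
            break  # stop cooperatively at a safe point
        process(item)
        done += 1
    return done
```

The job only stops at the points where it checks the flag, so the check should sit between units of work that are safe to leave unfinished.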
If you know the long-running job's thread ID, it's possible to terminate it from another task:
    class ThreadLightly
      include Sidekiq::Worker

      def perform(tid)
        puts "I'm %s, and I'll be terminating TID: %s..." % [self.class, tid]
        Thread.list.each do |t|
          # tid arrives as a string, so compare against the stringified object_id
          if t.object_id.to_s == tid
            puts "Goodbye %s!" % t
            t.exit
          end
        end
      end
    end
You can trigger it from the sidekiq_pusher:
bundle exec ./pusher.rb ThreadLightly $YOURJOBSTHREADID
You'll need to log Thread.current.object_id from each job, since the UI doesn't show it. Also, if you run distributed Sidekiq processes, you'll need to rerun this task until it lands on the same instance.

How do I debug a Delayed::Worker.work_off that doesn't return success or failure

I am testing my Delayed::Job using Rspec.
In my rspec_controller:
    it "queues up delayed job and fires" do
      setup
      expect {
        post :create, {:job => valid_attributes}
      }.to change(Delayed::Job, :count).by(2)
      Delayed::Worker.new.work_off.should == [2,0]
    end
Delayed::Job.count passes as expected, but Delayed::Worker.new.work_off returns as [0,0], indicating there are 0 successes and 0 failures when there are 2 jobs.
How should I debug to find out why work_off doesn't fire the jobs?
Edit: The 2 jobs that are supposed to run have their run_at set in the future. Does work_off fire off jobs that are not meant to run immediately?
Although this is an older question, there is one parameter that isn't well documented. Try using
    Delayed::Worker.new(quiet: false).work_off
to debug the result of your background jobs. This could help you find out whether the fact that they're supposed to run in the future is interfering with the assertion itself.
EDIT: Don't forget to remove the "quiet: false" when you're done, otherwise your tests will always output the results of the background jobs.
The construct
Delayed::Worker.new.work_off
immediately processes everything that is in the DJ queue, and in the same thread as the caller (it doesn't spawn a separate worker thread). But this doesn't explain why you're not getting [2, 0] for a result.
To answer your original question 'How should I debug to find out why work_off doesn't fire the jobs?', I suggest you use the callback hooks to trace the lifecycle of the jobs. Add a comment if you need to be shown how to do that... :)

In celery, how to ensure tasks are retried when worker crashes

First of all please don't consider this question as a duplicate of this question
I have set up an environment that uses Celery with Redis as both broker and result backend. My question is: how can I make sure that when the Celery workers crash, all the scheduled tasks are retried once the worker is back up?
I have seen advice to use CELERY_ACKS_LATE = True, so that the broker redelivers the tasks until it gets an ACK, but in my case it's not working. Whenever I schedule a task, it immediately goes to the worker, which holds it until the scheduled time of execution. Let me give an example:
I am scheduling a task like this: res = test_task.apply_async(countdown=600), but immediately in the Celery worker logs I can see something like: Got task from broker: test_task[a137c44e-b08e-4569-8677-f84070873fc0] eta:[2013-01-...]. Now when I kill the Celery worker, these scheduled tasks are lost. My settings:
BROKER_URL = "redis://localhost:6379/0"
CELERY_ALWAYS_EAGER = False
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"
CELERY_ACKS_LATE = True
Apparently this is how celery behaves.
When a worker is abruptly killed (but the dispatching process isn't), the message will be considered 'failed' even though you have acks_late=True.
The motivation (to my understanding) is that if the consumer was killed by the OS due to running out of memory, there is no point in redelivering the same task.
You may see the exact issue here: https://github.com/celery/celery/issues/1628
I actually disagree with this behaviour. IMO it would make more sense not to acknowledge.
I've had this issue when using some open-source C libraries that ran totally amok and crashed my worker ungracefully, without throwing an exception. Whatever the reason, one can simply wrap the content of a task in a child process and check its status in the parent.
    n = os.fork()
    if n > 0:  # inside the parent process
        status = os.wait()  # wait until the child terminates; returns (pid, status)
        print("Exit status word of the child process:", status[1])
        if status[1] > 0:  # non-zero: killed by a signal or exited with an error
            # here one can do whatever they want, like restart or raise an exception
            self.retry(exc=SomeException(), countdown=2 ** self.request.retries)
    else:  # here comes the actual task content with its respective return
        return myResult  # make sure the child and the parent do not both return
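A self-contained sketch of the same fork-and-check idea, outside of Celery, might look like the following. It assumes a POSIX system (os.fork is not available on Windows); run_in_child is a hypothetical helper name, and wiring the result into a task (for example calling self.retry when the child was killed) is left out.

```python
import os


def run_in_child(work):
    """Run work() in a forked child process and report how the child ended.

    Returns ("exited", exit_code) or ("signaled", signal_number).
    """
    pid = os.fork()
    if pid == 0:
        # Child: do the work, then leave immediately with os._exit so the
        # child never returns into the caller's code.
        try:
            work()
            os._exit(0)
        except BaseException:
            os._exit(1)
    # Parent: wait for this specific child and decode its status word.
    _, status = os.waitpid(pid, 0)
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    return ("exited", os.WEXITSTATUS(status))
```

Decoding the status with os.WIFSIGNALED and os.WTERMSIG separates "killed by a signal" from "exited with an error code", which the raw status word mixes together. Because the child leaves through os._exit, any result it computes has to be passed back explicitly (for example through a pipe or a file), which this sketch does not do.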

Erlang finish or kill process

I have an Erlang application. In this application I run a process with spawn(?MODULE, my_foo, [my_param1, my_param2, my_param3]).
And my_foo:
    my_foo(my_param1, my_param2, my_param3) ->
        ...
        some code here
        ...
        ok.
When I open etop, I see that this my_foo/3 function has the status proc_lib:sync_wait/2.
Then I tried putting exit(self(), normal) at the end of my function, but I see the same behaviour: proc_lib:sync_wait/2 in etop.
How can I kill or exit the process correctly?
Thank you.
Note that exit(Pid, Reason) and exit(Reason) do NOT do the same thing if Pid is the process itself. exit/1 tells the current process to exit - from the inside if you like - while exit/2 sends an exit signal to the process, even if the process is itself. So when you do exit(self(), normal) you are actually sending the normal exit signal to yourself, which is ignored.
In this case putting the exit call at the end of the function should not make any difference as the process automatically dies (with reason normal) when the function with which it was started ends. It seems like the process is suspended somewhere before that.
proc_lib:sync_wait/2 is called inside proc_lib:start/start_link, where it sits and waits for the spawned process to call proc_lib:init_ack/1,2, which supplies the return value for start. It would appear that your process does not call init_ack.
Based on the limited information that you give in the question I would suspect that your process hasn't finished running yet.
Normally you don't need to add exit/2 to your process. It will exit automatically when the function has finished running.
You probably have a long running call in some code here that has not finished running. I recommend that you add logging information and see where you are stuck.