When running a sequence job in DataStage, I get a timeout error, code=-14 [Timed out while waiting for an event]

When running a sequence job in DataStage, I occasionally get an error like this:
Seq_CHECK_JV_OMS_DATA..JobControl (#Pjob_CHECK_JV_OMS_DATA): Controller problem: Error calling DSRunJob(Pjob_CHECK_JV_OMS_DATA.Seq_CHECK_JV_OMS_DATA), code=-14
[Timed out while waiting for an event]
Why would this happen? I'm not running these job instances concurrently; I execute them one by one (it is serialized).
And I get another problem too: the job instances under the sequence job controller often get stuck, and that status lasts forever unless I clear the status file.
This is driving me crazy. Could anyone help? Thanks very much!

The problem is that your server is overloaded: when you try to execute the job Pjob_CHECK_JV_OMS_DATA.Seq_CHECK_JV_OMS_DATA, DSRunJob waits for it to start, and after 60 seconds it gives up and aborts with this error. Do you have many instances of the job Pjob_CHECK_JV_OMS_DATA running at the same time?
Try waiting a few seconds before starting this job.

Related

JitterBit Run Only One Instance of an Operation at a Time

I ran into an issue where I had long-running JitterBit operations that were scheduled. I had them scheduled close together, since I needed to keep data flowing, but when they took longer than expected I would wind up with multiple instances of the operation set running at the same time. This was killing my performance.
I'll put the fix in the answer below.
To resolve this issue I added an additional Script Operation at the beginning of my operation set (with the schedule running on this operation). This script simply checks whether one of the operations in this set is already running. If not, it starts the next operation; if anything is running, it exits and waits until the next scheduled instance.
This is a sample of my script. This one assumes that there were originally two operations in this operation set.
<trans>
    $isInQueue = GetOperationQueue("<TAG>Operations/OperationToCheck01</TAG>");
    $isInQueue2 = GetOperationQueue("<TAG>Operations/OperationToCheck02</TAG>");
    $isRunning = $isInQueue[0][1];
    $isRunning2 = $isInQueue2[0][1];
    if(($isRunning == 1 && $isRunning != Null()) || ($isRunning2 == 1 && $isRunning2 != Null()),
        WriteToOperationLog("Skip for now: " + $isRunning + " / " + $isRunning2);,
        WriteToOperationLog("Nothing is running - starting operation chain.");
        RunOperation("<TAG>Operations/OperationToCheck01</TAG>");
    );
</trans>

Sidekiq stop one single, running job

So I need to stop a running job in Sidekiq (3.1.2) programmatically, not a scheduled one. I read the API documentation but didn't really find anything about cancelling running jobs. Is this possible with Sidekiq?
If this is not directly possible, my idea was to circumvent it by raising an exception in the job when I send the signal and then deleting the job from the retry set. This is clearly not optimal, though.
Thanks in advance
Correct, the only way to stop a job is for the job to stop itself. Your application must implement that logic.
https://github.com/mperham/sidekiq/wiki/FAQ#how-do-i-cancel-a-sidekiq-job
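The pattern described there is essentially: have the job check a cancellation flag between units of work and stop itself when the flag is set. Below is a minimal sketch of that idea, assuming a Redis flag keyed by the job's jid (the class name, key name and TTL are illustrative, not part of Sidekiq's API):

class CancellableWorker
  include Sidekiq::Worker

  def perform(*args)
    100.times do |i|
      return if cancelled?   # stop ourselves if someone has set the flag
      # ... do one small chunk of the real work here ...
    end
  end

  def cancelled?
    # redis-rb of that era returns a boolean from exists; newer versions offer exists?
    Sidekiq.redis { |conn| conn.exists("cancelled-#{jid}") }
  end

  # Call this from anywhere (console, controller, another job) with the job's jid
  def self.cancel!(jid)
    Sidekiq.redis { |conn| conn.setex("cancelled-#{jid}", 86_400, 1) }
  end
end

The caller only needs the jid returned by perform_async in order to flip the flag.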
If you know the long-running job's thread ID, it's possible to terminate it from another task:
class ThreadLightly
  include Sidekiq::Worker

  def perform(tid)
    puts "I'm %s, and I'll be terminating TID: %s..." % [self.class, tid]
    Thread.list.each do |t|
      if t.object_id.to_s == tid
        puts "Goodbye %s!" % t
        t.exit
      end
    end
  end
end
You can trigger it from the sidekiq_pusher:
bundle exec ./pusher.rb ThreadLightly $YOURJOBSTHREADID
You'll need to log the Thread.current.object_id from each job since the UI doesn't show it. Also, if you run distributed Sidekiq processes, you'll need to run this task repeatedly until it lands on the same instance.
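For example, each long-running worker could write its own thread id to the log when it starts; a small sketch (the worker name and message format are made up):

class LongRunningWorker
  include Sidekiq::Worker

  def perform(*args)
    # record the TID so it can later be handed to ThreadLightly
    Sidekiq.logger.info "jid=#{jid} tid=#{Thread.current.object_id}"
    # ... the actual long-running work ...
  end
end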

Determine actual errors from a load job

Using the Java SDK I am creating a load job for just a single record with a fairly complicated schema. When monitoring the status of the load job, it takes a surprisingly long time (perhaps due to working out the schema), but then says:
11:21:06.975 [main] INFO xxx.GoogleBigQuery - Job status (21694ms) create_scans_1384744805079_172221126: DONE
11:24:50.618 [main] ERROR xxx.GoogleBigQuery - Job create_scans_1384744805079_172221126 caused error (invalid) with message
Too many errors encountered. Limit is: 0.
11:24:50.810 [main] ERROR xxx.GoogleBigQuery - {
"message" : "Too many errors encountered. Limit is: 0.",
"reason" : "invalid"
}
BTW - how do I tell the job that it can have more than zero errors using Java?
This load job does not appear in the list of recent jobs in the console, and as far as I can see, none of the Java objects contains any more details about the actual errors encountered. So how can I programmatically find out what is going wrong? All I can find is:
if (err != null) {
    log.error("Job {} caused error ({}) with message\n{}", jobID, err.getReason(), err.getMessage());
    try {
        log.error(err.toPrettyString());
    }
...
In general I am having a difficult time finding good documentation for some of these things and am working it out by trial and error and short snippets of code found here and in older groups. If there is a better source of information than the getting started guides, I would appreciate any pointers to it. The Javadoc does not really help and I cannot find any complete examples of loading, querying, testing for errors, cataloging errors and so on.
This job is submitted via a NEWLINE_DELIMITED_JSON record, supplied to the job via:
InputStream dummy = getClass().getResourceAsStream("/googlebigquery/xxx.record");
final InputStreamContent jsonIn = new InputStreamContent("application/octet-stream", dummy);
createTableJob = bigQuery.jobs().insert(projectId, loadJob, jsonIn).execute();
My authentication and so on seems to work correctly, as separate Java code to list the projects and the datasets in the project works correctly. So I just need help working out what the actual error is - does it not like the schema (I have records nested within records, for instance), or does it think there is an error in the data I am submitting?
Thanks in advance for any help. The job number cited above is an actual failed load job if that helps any Google staffers who might read this.
It sounds like you have a couple of questions, so I'll try to address them all.
First, the way to get the status of the job that failed is to call jobs().get(projectId, jobId), which returns a job object whose errorResult object has the error that caused the job to fail (e.g. "too many errors"). The errors list contains all of the errors on the job, which should tell you which lines hit errors.
Note that if you have the job id, it may be easier to use bq to look up the job -- you can run bq show <job_id> to get the job error information. If you add --format=prettyjson, it will print out all of the information in the job.
Another hint: consider supplying your own job id when you create the job -- then even if there is an error starting the job (i.e. the insert() call fails, perhaps due to a network error), you can still look up the job to see what actually happened.
To tell BigQuery that some errors are allowed during import, you can use the maxBadRecords setting in the load job. See https://developers.google.com/resources/api-libraries/documentation/bigquery/v2/java/latest/com/google/api/services/bigquery/model/JobConfigurationLoad.html#getMaxBadRecords().
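Putting those pieces together, here is a rough Java sketch (not the poster's code); it reuses the bigQuery, projectId, loadJob and log objects from the question, and jobId stands in for the id string of the failed job:

import com.google.api.services.bigquery.model.ErrorProto;
import com.google.api.services.bigquery.model.Job;
import com.google.api.services.bigquery.model.JobReference;
import java.util.List;

// Look up the failed job and dump every error it recorded
Job failed = bigQuery.jobs().get(projectId, jobId).execute();
ErrorProto fatal = failed.getStatus().getErrorResult();    // the error that failed the job
log.error("Job failed: {} ({})", fatal.getMessage(), fatal.getReason());
List<ErrorProto> errors = failed.getStatus().getErrors();  // all errors, including per-line ones
if (errors != null) {
    for (ErrorProto e : errors) {
        log.error("  at {}: {} ({})", e.getLocation(), e.getMessage(), e.getReason());
    }
}

// Supplying your own job id up front makes the job easy to look up later
loadJob.setJobReference(new JobReference()
    .setProjectId(projectId)
    .setJobId("create_scans_" + System.currentTimeMillis()));

// Tolerate up to 10 bad records instead of the default 0
loadJob.getConfiguration().getLoad().setMaxBadRecords(10);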

How do I debug a Delayed::Worker.work_off that doesn't return success or failure

I am testing my Delayed::Job using Rspec.
In my RSpec controller spec:
it "queues up delayed job and fires" do
setup
expect {
post :create, {:job => valid_attributes}
}.to change(Delayed::Job, :count).by(2)
Delayed::Worker.new.work_off.should == [2,0]
end
Delayed::Job.count passes as expected, but Delayed::Worker.new.work_off returns [0, 0], indicating 0 successes and 0 failures when there are 2 jobs.
How should I debug to find out why work_off doesn't fire the jobs?
Edit: The 2 jobs that are supposed to run have their run_at set in the future. Does work_off fire jobs that are not meant to be immediate?
Although this is an older question, there's one parameter that isn't well documented. Try using
Delayed::Worker.new(quiet: false).work_off
to debug the result of your background jobs; this could help you find out whether the fact that they're supposed to run in the future is interfering with the assertion itself.
EDIT: Don't forget to remove quiet: false when you're done, otherwise your tests will always output the results of the background jobs.
The construct
Delayed::Worker.new.work_off
immediately processes everything that is in the DJ queue, and in the same thread as the caller (it doesn't spawn a separate worker thread). But this doesn't explain why you're not getting [2, 0] for a result.
To answer your original question 'How should I debug to find out why work_off doesn't fire the jobs?', I suggest you use the callback hooks to trace the lifecycle of the jobs. Add a comment if you need to be shown how to do that... :)
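For reference, delayed_job exposes lifecycle hooks (before, after, success, error, failure) on custom job classes; a minimal sketch, with an invented job class and log messages:

class ImportJob
  def perform
    # the actual work goes here
  end

  # delayed_job calls these around the job's lifecycle
  def before(job)
    Rails.logger.info "DJ #{job.id}: starting (run_at was #{job.run_at})"
  end

  def after(job)
    Rails.logger.info "DJ #{job.id}: finished"
  end

  def error(job, exception)
    Rails.logger.error "DJ #{job.id}: #{exception.class} - #{exception.message}"
  end

  def failure(job)
    Rails.logger.error "DJ #{job.id}: permanently failed"
  end
end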

In celery, how to ensure tasks are retried when worker crashes

First of all, please don't consider this question a duplicate of this question.
I have set up an environment which uses Celery with Redis as the broker and result backend. My question is: how can I make sure that when the Celery workers crash, all the scheduled tasks are retried once the worker is back up?
I have seen advice on using CELERY_ACKS_LATE = True, so that the broker will re-deliver the tasks until it gets an ACK, but in my case it's not working. Whenever I schedule a task it immediately goes to a worker, which holds it until the scheduled time of execution. Let me give an example:
I am scheduling a task like this: res=test_task.apply_async(countdown=600), but immediately in the celery worker logs I can see something like: Got task from broker: test_task[a137c44e-b08e-4569-8677-f84070873fc0] eta:[2013-01-...]. Now when I kill the celery worker, these scheduled tasks are lost. My settings:
BROKER_URL = "redis://localhost:6379/0"
CELERY_ALWAYS_EAGER = False
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"
CELERY_ACKS_LATE = True
Apparently this is how celery behaves.
When a worker is abruptly killed (but the dispatching process isn't), the message will be considered 'failed' even though you have acks_late=True.
The motivation (to my understanding) is that if the consumer was killed by the OS due to an out-of-memory condition, there is no point in redelivering the same task.
You may see the exact issue here: https://github.com/celery/celery/issues/1628
I actually disagree with this behaviour. IMO it would make more sense not to acknowledge.
I've had this issue, where I was using some open-source C libraries that went totally amok and crashed my worker ungracefully without raising an exception. To guard against that, one can simply wrap the content of a task in a child process and check its status in the parent:
# (inside a bound task, e.g. @app.task(bind=True); requires "import os" at module level)
n = os.fork()
if n > 0:  # inside the parent process
    status = os.wait()  # wait until the child terminates; returns (pid, exit status)
    print("Signal number that killed the child process:", status[1])
    if status[1] > 0:  # the child ended with something other than a graceful exit
        # here one can do whatever they want, e.g. retry or raise an exception
        self.retry(exc=SomeException(), countdown=2 ** self.request.retries)
else:  # here comes the actual task content, with its respective return
    return myResult  # make sure there are no returns in both child and parent at the same time
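One caveat on the fork-based workaround above: it relies on os.fork() and therefore assumes a POSIX platform (e.g. the usual prefork worker on Linux); on Windows a subprocess-based approach would be needed instead.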