job_CeZvrPO2l8l4m6kGBmBwBpoVqTU
Around 15 hours now - I had quite a few of these jobs complete, but three are still "stuck".
There was a server deadlock. The problem has been found and fixed, and the server has been restarted. Your job should start progressing again soon.
I'm using Celery with Django over Redis.
Some of my tasks are quite long, taking about 1 hour. I'm aware that this is suboptimal, and preferably I should use shorter tasks, but this is what I've got...
Sometimes the task/worker crashes. This can happen for various reasons that don't really matter here: the worker process died, a network problem, a spot instance getting preempted, the worker getting killed by the OOM killer, or any other unexpected cause that I can't "catch" and handle.
I want to make sure the task will be tried again as fast as possible.
I can use acks_late, but the problem is that this task has a very long timeout (about 90 minutes), which means that if the task starts and the worker crashes after 2 minutes, I will wait another 88 minutes before the task goes back to the queue and starts executing again on another worker.
I'm wondering if there is another solution that will notice the worker has "disappeared" and put the task back in the queue sooner?
You could give task_reject_on_worker_lost a try... It is a tricky one, but have a look...
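For anyone who wants to try it, here is a minimal sketch of what that combination could look like, assuming the default Redis transport (the app name, URL, task and timeout values are placeholders, not anything from the question):

```python
# minimal sketch -- names, URL and numbers are illustrative placeholders
from celery import Celery

app = Celery("myproj", broker="redis://localhost:6379/0")

app.conf.update(
    task_acks_late=True,               # acknowledge only after the task finishes
    task_reject_on_worker_lost=True,   # requeue if the pool process dies mid-task
    # If the whole machine disappears (e.g. a preempted spot instance), Redis only
    # redelivers unacknowledged messages after the visibility timeout, so keep it
    # as short as your longest task allows.
    broker_transport_options={"visibility_timeout": 2 * 60 * 60},
)

@app.task(bind=True)
def long_running_task(self, item_id):
    ...  # ~1 hour of work
```

Note that task_reject_on_worker_lost only helps when the pool child process is lost while the main worker process stays alive; if the entire node goes away, you are still waiting on the broker's visibility timeout. The docs also warn that enabling it can cause message loops (a task that reliably kills its worker will be retried forever), which is presumably why it is described as tricky.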
I'm running into a weird issue with Pentaho 7.1. I run a job that I created and it runs perfectly and quickly the first time I run it.
The job is an ETL Job consisting of a Start widget, 7 Transformations running in a sequence, and a Success widget.
I'm confused as to why the job runs once, and when I try to run it again it says "Spoon - Starting job..." and then the job just hangs.
If I delete the job and I create a brand new one, I am then able to run the job once and I am stuck again with the job no longer able to run after that. I don't understand why the job keeps hanging after it gets executed once, and it is then 100% broken after a Successful run...
I turned up the logging in Pentaho 7.1 Spoon, and it shows this continuously...
2018/08/14 17:30:24 - Job1 - Triggering heartbeat signal for ported_jobs at every 10 seconds
2018/08/14 17:30:34 - Job1 - Triggering heartbeat signal for ported_jobs at every 10 seconds
2018/08/14 17:30:44 - Job1 - Triggering heartbeat signal for ported_jobs at every 10 seconds
2018/08/14 17:30:54 - Job1 - Triggering heartbeat signal for ported_jobs at every 10 seconds
I can't seem to put my finger on why this is happening.
Any help is appreciated
Probable answer: Check that your transformations are not opening the same database for input and output. A quick check may be to run the first transformation directly (without the job) and see if it locks.
In my case this happened because the database server the job updates was slow to respond, probably due to high CPU and RAM usage. I increased the RAM and CPU of the db server, and now my job runs okay.
I have a SQL Agent Job that shows as idle in the Activity Monitor but the time duration keeps increasing.
The job seems to have stopped as I've tried stopping it manually and SQL advises the job isn't running.
SysJobActivity doesn't have a stop_execution_date for the job
The job has 5 steps and the last step didn't complete - the server rebooted during the execution of this step.
Is the job OK to leave in its current state? The duration will just keep increasing forever.
Thanks
Job History
Activity Monitor
Job History - Updated
After the comments and viewing the screenshots I think I know what is going on:
The job duration is reported by what is in the msdb..sysjobhistory
The reboot during the job caused a problem (perhaps power to the box was cut so it couldn't log properly?), so the job never really failed or finished and was never properly recorded in sysjobhistory.
It's not showing up in sp_who which means it is NOT running
I suspect it's probably OK to leave the job just 'running' forever. But I would suggest clearing that up so some other poor DBA isn't scratching his head. You could:
Manually edit msdb..sysjobhistory, which is scary and I wouldn't recommend.
Start and then stop the job; I bet it will then report OK.
Delete the job and history via the GUI and remake it (script it out first!)
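Before doing any of those, it's easy to confirm what state msdb is actually in. The sketch below is purely for illustration, using Python and pyodbc (the connection string and job name are placeholders; the same queries can be run directly in SSMS): the first query shows the dangling sysjobactivity row described in the question, and an empty result from the second confirms that no job step is really executing.

```python
# illustrative sketch only -- connection string and job name are placeholders
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=msdb;"
    "Trusted_Connection=yes;"
)
cur = conn.cursor()

# The dangling activity row: started, but never marked as stopped.
cur.execute("""
    SELECT j.name, a.start_execution_date, a.stop_execution_date
    FROM msdb.dbo.sysjobactivity AS a
    JOIN msdb.dbo.sysjobs AS j ON j.job_id = a.job_id
    WHERE j.name = ?
      AND a.start_execution_date IS NOT NULL
      AND a.stop_execution_date IS NULL
""", "MyJob")
print(cur.fetchall())

# Cross-check that no Agent job step is actually executing right now
# (the equivalent of the sp_who check in the answer).
cur.execute("""
    SELECT session_id, program_name, status
    FROM sys.dm_exec_sessions
    WHERE program_name LIKE 'SQLAgent - TSQL JobStep%'
""")
print(cur.fetchall())
```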
I use the hirefire gem with Delayed Job 3 on the Heroku cedar stack and it is working pretty well in terms of hiring/firing, but the performance of the job execution is terrible. Firing up the background job and seeing the results in the UI takes about 5-8 seconds locally and about 25-30 seconds (!) on Heroku.
Processing time of the jobs is about the same locally and deployed, but hiring workers (scaling up, starting, ...) seems to take a lot of time(?).
Is this a common issue? Is there a solution (rake tasks, etc.)?
Thanks a lot.
Best, Phil
It's down to the fact that your worker isn't running all the time but spinning up for each individual job. The lag is the code start-up time.
If you have a full time dyno the jobs should process almost instantaneously.
I have a job that is supposed to run every day at 11 AM and 8 PM. About two weeks ago, it stopped respecting the schedule. The "fix" that I found was to start the job manually; the job would then respect the schedule again for a while, but eventually the issue reappears.
The big problem is that there is no error message whatsoever. If the job fails, I am supposed to get a notification email, which I do not. There are no errors in the SQL Server Agent logs or the job history. In the job history, I can clearly see that the job skipped the schedule, since there are no entries. It looks like it did not even start, as if the scheduled run time had never arrived.
The schedule is set to run every day and there is no limit on how long it is supposed to run. SQL Server Agent is set to restart automatically if it stops unexpectedly.
Did anyone get this problem before?
Check the user that is used to run the job. Maybe that user's password has expired or the account itself is no longer active.
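If you want a quick way to check that, here's a small sketch (Python with pyodbc, purely for illustration; the server and job name are placeholders). It lists the job's owner login, whether that login is disabled, and, for SQL logins, whether its password has expired:

```python
# illustrative sketch only -- server and job name are placeholders
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=msdb;"
    "Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.execute("""
    SELECT j.name,
           p.name AS owner_login,
           p.is_disabled,
           LOGINPROPERTY(p.name, 'IsExpired') AS password_expired  -- NULL for Windows logins
    FROM msdb.dbo.sysjobs AS j
    JOIN sys.server_principals AS p ON p.sid = j.owner_sid
    WHERE j.name = ?
""", "MyScheduledJob")
print(cur.fetchone())
```

(If the job's steps run under a proxy account, the credential behind that proxy is worth the same check.)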