I noticed that when I kill a Resque worker while it's processing something, it won't leave a failed job. The job will simply be gone.
Thus, the job will never be finished, and jobs play an important role in my application.
This only happens when I kill a worker. If my job raises an exception, I can retry it later.
Is it possible to avoid this behavior?
Thanks.
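For context, a hedged sketch of Resque's documented worker signals (<worker_pid> is a placeholder); only a graceful signal lets the in-flight job finish:
# Graceful: wait for the child to finish the current job, then exit
kill -QUIT <worker_pid>
# TERM/INT: immediately kill the child, then exit
kill -TERM <worker_pid>
# SIGKILL gives Resque no chance to record anything, so the in-flight job is simply lost
kill -9 <worker_pid>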
I start Flink (bin/start-cluster.sh) on a single machine and submit a job through the Flink web UI.
If there is something wrong with the job, such as the sink MySQL table not existing or a wrong keyBy field, not only does the job fail and I have to cancel the failed task, but after cancelling, the TaskManager seems to be "killed": it disappears from the Flink web UI.
Are there solutions for fault tolerance (the TaskManager being killed by a failing job)?
Is the only way to run Flink on YARN?
A task failure should never cause a TaskManager to be killed. Please check the TaskManager logs for any exceptions.
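If it helps, a hedged sketch for inspecting a standalone TaskManager's log (the installation path is a placeholder, and the log file pattern varies by Flink version):
# Run on the machine hosting the TaskManager
cd /path/to/flink
# Log names look like flink-<user>-taskmanager-*.log or flink-<user>-taskexecutor-*.log depending on the version
grep -iE "error|exception" log/flink-*-task*.log | tail -n 50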
I'm running a Spark Streaming application on YARN in cluster mode, and I'm trying to implement a graceful shutdown so that when the application is killed it will finish the execution of the current micro batch before stopping.
Following some tutorials I have configured spark.streaming.stopGracefullyOnShutdown to true and I've added the following code to my application:
sys.ShutdownHookThread {
  log.info("Gracefully stopping Spark Streaming Application")
  // stop the SparkContext too, and stop gracefully
  ssc.stop(true, true)
  log.info("Application stopped")
}
However, when I kill the application with
yarn application -kill application_1454432703118_3558
the micro batch executed at that moment is not completed.
In the driver log I see the first line printed ("Gracefully stopping Spark Streaming Application") but not the last one ("Application stopped").
ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
INFO streaming.MySparkJob: Gracefully stopping Spark Streaming Application
INFO scheduler.JobGenerator: Stopping JobGenerator gracefully
INFO scheduler.JobGenerator: Waiting for all received blocks to be consumed for job generation
INFO scheduler.JobGenerator: Waited for all received blocks to be consumed for job generation
INFO streaming.StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook
In the executor logs I see the following error:
ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.6.21:49767 disassociated! Shutting down.
INFO storage.DiskBlockManager: Shutdown hook called
WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver@192.168.6.21:49767] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
INFO util.ShutdownHookManager: Shutdown hook called
I think the problem is related to how YARN sends the kill signal to the application. Any idea on how I can make the application stop gracefully?
You should go to the executors page to see where your driver is running (on which node). SSH to that node and do the following:
ps -ef | grep 'app_name'
(replace app_name with your class name/app name). It will list a couple of processes. Look at them; some will be children of others. Pick the ID of the parent-most process and send it a SIGTERM:
kill <pid>
After some time you'll see that your application has terminated gracefully.
Also, now you don't need to add those hooks for shutdown.
Use the spark.streaming.stopGracefullyOnShutdown config to help shut down gracefully.
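A hedged sketch of passing that config at submit time via spark-submit's --conf flag (the class and jar names below are placeholders):
spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.streaming.stopGracefullyOnShutdown=true \
  --class streaming.MySparkJob my-spark-job.jar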
You can stop a Spark Streaming application by invoking ssc.stop when a custom condition is triggered, instead of using awaitTermination, as the following (PySpark) sketch shows; the marker file path is a placeholder:
import os
import time

ssc.start()
while True:
    time.sleep(10)  # poll every 10 seconds
    if os.path.exists("/tmp/stop_spark_job"):  # placeholder marker file used as the stop condition
        ssc.stop(True, True)  # also stop the SparkContext, and stop gracefully
        break
We have an application configured for high availability.
Of the two nodes, one (say NN1) is made active and the other one's (say NN2) NameNode process is killed. So now NN1 is active.
Now we submit a MapReduce job, and the logs keep saying
"Application submission is not finished, submitted application application_someid is still in NEW_SAVING".
This happens for about 17 minutes, and then the job gets executed successfully.
So this means the fail-over has happened and NN1 is active. But why does it take so long?
The YARN NodeManager logs say:
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: . Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Can somebody please explain why this is happening?
Thanks in advance.
I don't know the cause of this problem, but restarting the YARN service helped me solve it.
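For diagnosis, a hedged sketch using the yarn rmadmin CLI to check ResourceManager HA state (rm1 and rm2 are the HA service IDs from yarn-site.xml and are placeholders):
# Check which ResourceManager is active
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2
# Force a manual failover if the standby did not take over (use with care when automatic failover is enabled)
yarn rmadmin -transitionToActive --forcemanual rm2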
I ran a mono-service with
mono-service2 -l:lockfile process.exe
It started the service and it was all fine, but I had to change something in the source, so I recompiled and deployed it. I killed the service by running
kill -9 <pid>
Now I try to run the service again, but it doesn't start at all. What is the problem here?
When Mono starts a service, it creates a lock in /tmp based on the program name or the given parameter. You should stop the service by sending SIGTERM, not SIGKILL; if you did so, the lock would be deleted. Now you should manually delete the lock. Read details here.
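A hedged sketch of the recovery (the lock path is assumed from the -l:lockfile argument above; adjust if Mono placed it under /tmp instead):
# Graceful stop next time; SIGTERM lets Mono remove the lock file
kill -TERM <service_pid>
# After a kill -9, remove the stale lock before restarting
rm lockfile
mono-service2 -l:lockfile process.exe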
I have a worker doing some processing 24/7. However, sometimes the code crashes and it needs to be restarted (even if I catch the exception, I have to restart the worker in order for it to work).
What do you do when this happens, or am I doing something wrong and this shouldn't happen at all? Do your dynos/workers crash, or is it just me?
Thanks.
Heroku is supposed to restart a worker every time it crashes. As far as I know, you don't have to select or configure anything. Whatever is in your jobs:work task will be executed as soon as it fails.
In the event that you are heavily dependent on background jobs in your web app, you could create a rake task that finds the last record to be updated and executes a background job to update it. Or perhaps automate the rake task to find the rest of the records that need updating since the last crash; a sketch of invoking such a task is below.
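A hedged sketch of running that kind of task on Heroku (the task name jobs:recover is hypothetical and would have to be defined in your app):
# 'jobs:recover' is a hypothetical rake task that re-enqueues records left unprocessed since the last crash
heroku run rake jobs:recover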
Alternatively, you can force a worker restart manually as indicated in this article (using delayed_job):
heroku workers 0;
heroku workers 1;
Or perhaps you can restart a specific worker by doing (mentioned in this article):
heroku restart worker.1
By the way, try the 1.9 stack. Make sure your app is 1.9.2 compatible before doing so. Hopefully crashes are less frequent there:
heroku stack:migrate bamboo-mri-1.9.2
In the event that such issues still arise, it's best to contact Heroku support. They are very responsive at what they do.
Latest command to restart a specific heroku web worker (2014):
heroku ps:restart web.1
(tested on Cedar stack)
At times, for instance in the case of DB crashes, the worker may not restart automatically. You would need to do this:
heroku restart web.1
It worked for me.