How to stop gracefully a Spark Streaming application on YARN? - hadoop-yarn

I'm running a Spark Streaming application on YARN in cluster mode and I'm trying to implement a gracefully shutdown so that when the application is killed it will finish the execution of the current micro batch before stopping.
Following some tutorials I have configured spark.streaming.stopGracefullyOnShutdown to true and I've added the following code to my application:
sys.ShutdownHookThread {
log.info("Gracefully stopping Spark Streaming Application")
ssc.stop(true, true)
log.info("Application stopped")
}
However when I kill the application with
yarn application -kill application_1454432703118_3558
the micro batch executed at that moment is not completed.
In the driver I see the first line of log printed ("Gracefully stopping Spark Streaming Application") but not the last one ("Application stopped").
ERROR yarn.ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
INFO streaming.MySparkJob: Gracefully stopping Spark Streaming Application
INFO scheduler.JobGenerator: Stopping JobGenerator gracefully
INFO scheduler.JobGenerator: Waiting for all received blocks to be consumed for job generation
INFO scheduler.JobGenerator: Waited for all received blocks to be consumed for job generation
INFO streaming.StreamingContext: Invoking stop(stopGracefully=true) from shutdown hook
In the executors log I see the following error:
ERROR executor.CoarseGrainedExecutorBackend: Driver 192.168.6.21:49767 disassociated! Shutting down.
INFO storage.DiskBlockManager: Shutdown hook called
WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkDriver#192.168.6.21:49767] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
INFO util.ShutdownHookManager: Shutdown hook called
I think the problem is related to how YARN send the kill signal the application. Any idea on how can I make the application stop gracefully?

you should go to the executors page to see where your driver is running ( on which node). ssh to that node and do the following:
ps -ef | grep 'app_name'
(replace app_name with your classname/appname). it will list couple of processes. Look at the process, some will be child of the other. Pick the id of the parent-most process and send a SIGTERM
kill pid
after some time you'll see that your application has terminated gracefully.
Also now you don't need to add those hooks for shutdown.
use spark.streaming.stopGracefullyOnShutdown config to help shutdown gracefully

You can stop spark streaming application by invoking ssc.stop when a customized condition is triggered instead of using awaitTermination. As the following pseudocode shows:
ssc.start()
while True:
time.sleep(10s)
if some_file_exist:
ssc.stop(True, True)

Related

Botium jobs stucked at ready

All my Test Project are stuck on "Ready".
I tried restarting containers and even "I'm Botium" is stuck on Ready without passed/failed results.
Job Log is here: https://pastebin.com/851LPCBS
In the Docker logs is:
2019-10-04T08:25:30.046Z botium-box-worker sending heartbeat ...
2019-10-04T08:25:30.049Z botium-box-server-agents agent.heartbeat: {"title":"heartbeat from agent b0c5b43c0f82 for group Default Group","name":"b0c5b43c0f82","group":"Default Group"}
2019-10-04T08:25:35.559Z botium-box-server-index WARNING: a socket timeout ocurred. You should increase the BOTIUMBOX_API_TIMEOUT environment variable.
2019-10-04T08:25:35.598Z botium-box-server-index WARNING: a socket timeout ocurred. You should increase the BOTIUMBOX_API_TIMEOUT environment variable.

taskmanager be killed when canceling failed task

I start flink(bin/start-cluster.sh)on a single machine, and submit a job by flink web UI.
If there are something wrong with the job, such as sink mysql table does not exist or wrong keyby field, not only this job failure, I have to cancel failed task, but after cancelling ,the taskmanager seems like be "killed", it disappears in flink web ui.
Are there solutions for fault tolerance(taskmanager be killed by failure job) ?
The only way is to run flink on yarn?
A task failure should never cause a TaskManager to be killed. Please check the TaskManager logs for any exceptions.

Monit cannot start/stop service

Monit cannot start/stop service,
If i stop the service, just stop monitoring the service in Monit.
Attached the log and config for reference.
#Monitor vsftpd#
check process vsftpd
matching vsftpd
start program = "/usr/sbin/vsftpd start"
stop program = "/usr/sbin/vsftpd stop"
if failed port 21 protocol ftp then restart
The log states: "stop on user request". The process is stopped and monitoring is disabled, since monitoring a stopped (= non existing) process makes no sense.
If you Restart service (by cli or web) it should print info: 'test' restart on user request to the log and call the stop program and continue with the start program (if no dedicated restart program is provided).
In fact one problem can arise: if the stop scripts fails to create the expected state (=NOT(check process matching vsftpd)), the start program is not called. So if there is a task running that matches vsftpd, monit will not call the start program. So it's always better to use a PID file for monitoring where possible.
Finally - and since not knowing what system/versions you are on, an assumption: The vsftpd binary on my system is really only the daemon. It is not supporting any options. All arguments are configuration files as stated in the man page. So supplying start and stop only tries to create new daemons loading start and stop file. -- If this is true, the one problem described above applies, since your vsftpd is never stopped.

Redis - can't [BG]SAVE after one failure

I have a message in the Redis log saying
BeginForkOperation: system error caught. error code=0x00000000, message=Forked process did not respond in a timely mannager.: unknown error
Can't save in background: fork: Invalid argument
This message appeared in the log more than a week ago, and since then, each time I'm trying to run "BGSAVE" or "SAVE", it throws the "err background save already in progress" error..
I can't see another redis-server process except the main process in the task manager, nor in "client list" command.
I'm using the Redis on windows project (constraints:)
Any ideas how to tell the Redis there's no background save process and force it to save the data into the disk, before it crashes?

Why doesn't a mono service restart after a kill -9?

I ran a mono-service with
mono-service2 -l:lockfile process.exe
It started the service and it was all fine but I had to change something in source. So I recompiled and deployed it. I killed the service by running
kill -9 <pid>
Now I tried to run the service again. But it doesn't start at all. What is the problem here ?
When mono starts a service, it creates a lock in /tmp based on the program name or given parameter. You should stop the service by sending the SIGTERM not SIGKILL signal - if you did so, the lock would be deleted. Now you should manually delete the lock. Read details here.