What does the status "CG" mean in SLURM?

On a SLURM cluster one can use squeue to get information about jobs on the system.
I know that "R" means running; and "PD" meaning pending, but what is "CG"?
I understand it to be "canceling" or "failing" from experience, but does "CG" apply when a job successfully closes? What is the G?

"CG" stands for "completing" and it happens to a job that cannot be terminated, probably because of an I/O operation.
More detailed info in the Slurm Troubleshooting Guide

I found this in the squeue documentation:
state
    Job state, extended form: PENDING, RUNNING, STOPPED, SUSPENDED,
    CANCELLED, COMPLETING, COMPLETED, CONFIGURING, FAILED, TIMEOUT,
    PREEMPTED, NODE_FAIL, REVOKED and SPECIAL_EXIT. See the JOB STATE
    CODES section below for more information. (Valid for jobs only)
statecompact
    Job state, compact form: PD (pending), R (running), CA (cancelled),
    CF (configuring), CG (completing), CD (completed), F (failed), TO
    (timeout), NF (node failure), RV (revoked) and SE (special exit
    state). See the JOB STATE CODES section below for more information.
    (Valid for jobs only)
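If you want to see which jobs are currently sitting in the completing state, squeue can filter by state. Below is a minimal sketch that wraps such a call in Python's subprocess; the exact format string is just one reasonable choice, not something taken from the question or answers above.

import subprocess

# Ask squeue for only the jobs in the COMPLETING (CG) state,
# printing job id, extended state and job name for each one.
result = subprocess.run(
    ["squeue", "--states=COMPLETING", "-o", "%i %T %j"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)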

Related

Spark execution occasionally gets stuck at mapPartitions at Exchange.scala:44

I am running a Spark job on a two node standalone cluster (v 1.0.1).
Spark execution often gets stuck at the task mapPartitions at Exchange.scala:44.
This happens at the final stage of my job in a call to saveAsTextFile (as I expect from Spark's lazy execution).
It is hard to diagnose the problem because I never experience it in local mode with local IO paths, and occasionally the job on the cluster does complete as expected with the correct output (same output as with local mode).
This seems possibly related to a read from S3 (of a ~170 MB file) immediately prior, as I see the following logging in the console:
DEBUG NativeS3FileSystem - getFileStatus returning 'file' for key '[PATH_REMOVED].avro'
INFO FileInputFormat - Total input paths to process : 1
DEBUG FileInputFormat - Total # of splits: 3
...
INFO DAGScheduler - Submitting 3 missing tasks from Stage 32 (MapPartitionsRDD[96] at mapPartitions at Exchange.scala:44)
DEBUG DAGScheduler - New pending tasks: Set(ShuffleMapTask(32, 0), ShuffleMapTask(32, 1), ShuffleMapTask(32, 2))
The last logging I see before the task apparently hangs/gets stuck is:
INFO NativeS3FileSystem: Opening key '[PATH_REMOVED].avro' for reading at position '67108864'
Has anyone else experienced non-deterministic problems related to reading from S3 in Spark?

How to get process scheduler history in Solaris?

I would like to know if there is a way to get the process scheduler history on the Solaris operating system. The output should have the following details.
user : user name who invoked the process
name : name of the process / command used to invoke the process
loc : location or path of the binary
pid : process id
event: event happened to the process (init, suspend or end)
time : time the event happened
date : date the event happened
I'm also interested to hear whether any such thing is available for other operating systems.
You might implement that with a DTrace script leveraging the proc provider (proc:::exec-success, proc:::exit and proc:::signal-handle).
Your event list looks dubious; it should probably be at least "start, suspend, resume and exit".
You want the audit feature of Solaris: see man audit and the associated utilities, auditconfig etc.

Autosys: Concept of Kick Start Attribute and how to use

I have a daily (09:00 AM) box containing 10 jobs. All child jobs are scheduled to run sequentially.
On Monday, jobs 1, 2 and 3 completed and job 4 failed. Because of this, the downstream is stalled and the box runs indefinitely (until some action is taken manually).
But the requirement is to run this box again on Tuesday at 09:00 AM. I have heard of a kick_start attribute to kick off the box at its next scheduled time irrespective of the last run's status.
Can someone explain this kick_start attribute? Also, please suggest any other way to schedule this box daily.
TIA
Never heard of the kick_start attribute and could not find it in the R11.3.5 reference guide.
I would look at box_terminator: y, which will fail the box if a job in it fails, and job_terminator: y, which will terminate and fail a job if the box it is in fails.
box_criteria is another attribute that may help, as you can define what success or failure looks like. For example, if you don't care whether job4 fails, define box_criteria: s(job3).
Of course, that only sets your box to FA, where it will run the next time its starting conditions are met. It does nothing to run the downstream for the current run.
Have fun and test, test, test.

In celery, how to ensure tasks are retried when worker crashes

First of all please don't consider this question as a duplicate of this question
I have set up an environment which uses Celery with Redis as the broker and result_backend. My question is: how can I make sure that when the Celery workers crash, all the scheduled tasks are retried once the Celery worker is back up?
I have seen advice to use CELERY_ACKS_LATE = True, so that the broker will redeliver the tasks until it gets an ACK, but in my case it's not working. Whenever I schedule a task, it immediately goes to the worker, which persists it until the scheduled time of execution. Let me give an example:
I am scheduling a task like this: res=test_task.apply_async(countdown=600), but immediately in the Celery worker logs I can see something like: Got task from broker: test_task[a137c44e-b08e-4569-8677-f84070873fc0] eta:[2013-01-...]. Now when I kill the Celery worker, these scheduled tasks are lost. My settings:
BROKER_URL = "redis://localhost:6379/0"
CELERY_ALWAYS_EAGER = False
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"
CELERY_ACKS_LATE = True
Apparently this is how celery behaves.
When a worker is abruptly killed (but the dispatching process isn't), the message will be considered 'failed' even though you have acks_late=True.
The motivation (to my understanding) is that if the consumer was killed by the OS due to out-of-memory, there is no point in redelivering the same task.
You may see the exact issue here: https://github.com/celery/celery/issues/1628
I actually disagree with this behaviour. IMO it would make more sense not to acknowledge.
I've had this issue, where I was using some open-source C libraries that went totally amok and crashed my worker ungracefully without throwing an exception. Whatever the cause of the crash, one can simply wrap the content of a task in a child process and check its status in the parent.
n = os.fork()
if n > 0:  # inside the parent process
    _, status = os.wait()  # wait until the child terminates
    print("Exit status word of the child process:", status)
    if status > 0:  # the child exited with an error or was killed by a signal
        # here one can do whatever they want, like retry or throw an Exception
        self.retry(exc=SomeException(), countdown=2 ** self.request.retries)
else:  # here comes the actual task content with its respective return
    return myResult  # make sure parent and child do not both return a result
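For completeness, here is a minimal sketch of how such a fork wrapper could sit inside a bound Celery task. The app name, broker URL, task name and the do_unreliable_work helper are my own illustrative assumptions, not part of the answer above.

import os
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # hypothetical app/broker

def do_unreliable_work(payload):
    # stand-in for the crash-prone C library call
    return payload

@app.task(bind=True, max_retries=5)
def unreliable_task(self, payload):
    n = os.fork()
    if n == 0:
        # child: run the risky code and exit 0 on success, non-zero on failure
        try:
            do_unreliable_work(payload)
            os._exit(0)
        except Exception:
            os._exit(1)
    # parent: wait for the child and retry the task if it died abnormally
    _, status = os.waitpid(n, 0)
    if status != 0:
        raise self.retry(countdown=2 ** self.request.retries)
    return "ok"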

Redis queue with claim expire

I have a queue interface I want to implement in Redis. The trick is that each worker can claim an item for N seconds; after that, it's presumed the worker has crashed and the item needs to be claimable again. It's the worker's responsibility to remove the item when finished. How would you do this in Redis? I am using phpredis, but that's kind of irrelevant.
To realize a simple queue in Redis that can be used to resubmit crashed jobs, I'd try something like this:
1 list "up_for_grabs"
1 list "being_worked_on"
auto expiring locks
a worker trying to grab a job would do something like this:
timeout = 3600
# wrap this in a transaction so our cleanup won't kill the task
# Move the job away from the queue so nobody else tries to claim it
job = RPOPLPUSH(up_for_grabs, being_worked_on)
# Set a lock and expire it; the value tells us when that job will time out. This can be arbitrary though
SETEX('lock:' + job, timeout, Time.now + timeout)
# our application logic
do_work(job)
# Remove the finished item from the queue.
LREM being_worked_on -1 job
# Delete the item's lock. If it crashes here, the expire will take care of it
DEL('lock:' + job)
And every now and then, we could just grab our list and check that all jobs that are in there actually have a lock.
If we find any jobs that DON'T have a lock, this means it expired and our worker probably crashed.
In this case we would resubmit.
This would be the pseudo code for that:
loop do
  items = LRANGE(being_worked_on, 0, -1)
  items.each do |job|
    if !(EXISTS("lock:" + job))
      puts "We found a job that didn't have a lock, resubmitting"
      LREM being_worked_on -1 job
      LPUSH(up_for_grabs, job)
    end
  end
  sleep 60
end
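Here is a minimal Python sketch of the same idea using redis-py; the list names and the 3600-second timeout come from the answer above, while the client setup, the do_work placeholder and the function names are my own assumptions.

import time
import redis

r = redis.Redis()   # assumes a local Redis instance
TIMEOUT = 3600      # seconds a worker may hold a claim

def do_work(job):
    pass  # placeholder for the real worker logic

def claim_and_work():
    # atomically move one job from the queue into the in-progress list
    job = r.rpoplpush("up_for_grabs", "being_worked_on")
    if job is None:
        return
    # the lock expires after TIMEOUT; its value records when the claim times out
    r.setex(b"lock:" + job, TIMEOUT, int(time.time()) + TIMEOUT)
    do_work(job)
    r.lrem("being_worked_on", -1, job)  # remove the finished item
    r.delete(b"lock:" + job)            # drop the lock; EXPIRE covers a crash here

def resubmit_expired():
    # every now and then, resubmit any in-progress job whose lock has expired
    for job in r.lrange("being_worked_on", 0, -1):
        if not r.exists(b"lock:" + job):
            r.lrem("being_worked_on", -1, job)
            r.lpush("up_for_grabs", job)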
You can set up a standard synchronized locking scheme in Redis with SETNX. Basically, you use SETNX to create a lock that everyone tries to acquire. To release the lock, you can DEL it, and you can also set an EXPIRE so the lock is eventually released even if its holder crashes. There are other considerations here, but nothing out of the ordinary in setting up locking and critical sections in a distributed application.
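A small sketch of that locking pattern with redis-py, assuming a hypothetical lock key and a 30-second expiry (a modern Redis server lets you combine SETNX and EXPIRE into a single SET call with the nx and ex options):

import redis

r = redis.Redis()

def acquire_lock(name, ttl=30):
    # SET key value NX EX ttl: create the lock only if it does not already
    # exist, with an expiry so a crashed holder releases it automatically
    return r.set(b"lock:" + name, b"1", nx=True, ex=ttl) is True

def release_lock(name):
    r.delete(b"lock:" + name)

if acquire_lock(b"my-critical-section"):
    try:
        pass  # critical section goes here
    finally:
        release_lock(b"my-critical-section")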