Slurm: automatically shutdown job with inactive GPUs - gpu

Is it possible to configure Slurm in such a way that it automatically releases/shuts down a job with low-usage / inactive GPU(s)? So far the only option I find is to have a global job time limit but nothing smarter than that.

Related

Prevent Celery From Grabbing One More Task Than Concur

In Celery using RabbitMQ, I have distributed workers running very long tasks on individual ec2 instances.
What is happening is that my concurrency is set to 2, -Ofair is enabled, and task_acks_late = True worker_prefetch_multiplier = 1 are set, but the Celery worker runs the 2 tasks in parallel, but then grabs a third task and doesn't run it. This leaves other workers with no tasks to run.
What i would like to happen is for the workers to only grab jobs when they can perform work on them. Allowing other workers that are free to grab the tasks and perform them.
Does anyone know how to achieve the result that I'm looking for? Attached below is an example of my concurrency being 2, and there being three jobs on the worker, where one is not yet acknowledged. I would like for there to be only two tasks there, and the other remain on the server until another worker can start them.

How to make Dataproc detect Python-Hive connection as a Yarn Job?

I launch a Dataproc cluster and serve Hive on it. Remotely from any machine I use Pyhive or PyODBC to connect to Hive and do things. It's not just one query. It can be a long session with intermittent queries. (The query itself has issues; will ask separately.)
Even during one single, active query, the operation does not show as a "Job" (I guess it's Yarn) on the dashboard. In contrast, when I "submit" tasks via Pyspark, they show up as "Jobs".
Besides the lack of task visibility, I also suspect that, w/o a Job, the cluster may not reliably detect a Python client is "connected" to it, hence the cluster's auto-delete might kick in prematurely.
Is there a way to "register" a Job to companion my Python session, and cancel/delete the job at times of my choosing? For my case, it is a "dummy", "nominal" job that does nothing.
Or maybe there's a more proper way to let Yarn detect my Python client's connection and create a job for it?
Thanks.
This is not supported right now, you need to submit jobs via Dataproc Jobs API to make them visible on jobs UI page and to be taken into account by cluster TTL feature.
If you can not use Dataproc Jobs API to execute your actual jobs, then you can submit a dummy Pig job that sleeps for desired time (5 hours in the example below) to prevent cluster deletion by max idle time feature:
gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
--execute="sh sleep $((5 * 60 * 60))"

Is it possible to request more time to a running job in SLURM?

I know it's possible on a queued job to change directives via scontrol, for example
scontrol update jobid=111111 TimeLimit=08:00:00
This only works in some cases, depending on the administrative configuration of the slurm instance (I'm not an admin). Thus this post does not answer my question.
What I'm looking for is a way to ask SLURM to add more time to a running job, if resources are available, and even if it's already running. Sort of like a nested job request.
Particularly a running job that was initiated with srun on-the-fly.
In https://slurm.schedmd.com/scontrol.html, it is clearly written under TimeLimit:
Only the Slurm administrator or root can increase job's TimeLimit.
So I fear what you want is not possible.
An it makes sense, since the scheduler looks at job time to decide which jobs to launch and some short jobs can benefit from back-filling to start before longer jobs, it would be really a mess if users where allowed to change the job length while running. Indeed, how to define "when resource are available"? Some node can be IDLE for some time because slurm knows that it will need it soon for a large job

How to prevent ironworker from enqueuing tasks of workers that are still running?

I have this worker whose runtime greatly varies from 10 seconds to up to an hour. I want to run this worker every five minutes. This is fine as long as the job finishes within five minutes. However, If the job takes longer Iron.io keeps enqueuing the same task over and over and a bunch of tasks of the same type accumulate while the worker is running.
Furthermore, it is crucial that the task may not run concurrently, so max concurrency for this worker is set to one.
So my question is: Is there a way to prevent Iron.io from enqueuing tasks of workers that are still running?
Answering my own question.
According to Iron.io support it is not possible to prevent IronWorker from enqueuing tasks of workers that are still running. For cases like mine it is better to have master workers that do the scheduling, i.e. creating/enqueuing tasks from script via one of the client libraries.
The best option would be to enqueue new task from the worker's code. For example, your task is running for 10 sec - 1 hour and enqueues itself at the end (last line of code). This will prevent the tasks from accumulating while the worker is running.

How to detect APScheduler's running jobs?

I have some recurring jobs run frequently or last for a while.
It seems that Scheduler().get_jobs() will only return the list of scheduled jobs that are not currently running, so I cannot determine if a job with certain id do not exists or is actually running.
How may I test if a job is running or not in this situation?
(I set up those jobs not the usual way, (because I need them to run in a random interval, not fixed interval), they are jobs that execute only once, but will add a job with the same id by the end of their execution, and they will stop doing so when reaching a certain threshold.)
APScheduler does not filter the list of jobs for get_jobs() in any way. If you need random scheduling, why not implement that in a custom trigger instead of constantly readding the job?