Hello and thank you for reviewing this question!
I'm working on an SGE cluster with 16 available worker nodes. Each has 32 cores.
I have a rule that defines a process which must run with only one instance per worker node, which means I could in theory run 16 of these jobs at a time. It's fine if other things are happening on each worker node - there just can't be two jobs from this specific rule running on the same node at the same time. Is there a way to ensure this?
I have tried setting memory resources. But setting, for example,

    resources:
        mem_mb=10000

and running

    snakemake --resources mem_mb=10000

will only allow one job to run at a time across the whole cluster, not one job per node. Is there a way to set each individual node's memory limit? Or is there some other way to achieve one job per node for only this specific rule?
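For reference, here is a minimal sketch of what the rule currently looks like (the rule name, file names and command are just placeholders):

    rule heavy_step:                              # placeholder rule name
        input:
            "data/{sample}.in"
        output:
            "results/{sample}.out"
        resources:
            mem_mb=10000                          # the per-job memory claim mentioned above
        shell:
            "heavy_tool {input} > {output}"       # placeholder command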
Thank you,
Eric
Related
I know it's possible to change the directives of a queued job via scontrol, for example
scontrol update jobid=111111 TimeLimit=08:00:00
This only works in some cases, depending on the administrative configuration of the Slurm instance (I'm not an admin), so that post does not answer my question.
What I'm looking for is a way to ask SLURM to add more time to a job, if resources are available, even if it is already running; sort of like a nested job request.
In particular, for a running job that was started on the fly with srun.
In https://slurm.schedmd.com/scontrol.html, it is clearly written under TimeLimit:
Only the Slurm administrator or root can increase job's TimeLimit.
So I fear what you want is not possible.
And it makes sense: since the scheduler looks at job time to decide which jobs to launch, and some short jobs can benefit from back-filling to start before longer jobs, it would be a real mess if users were allowed to change a job's length while it is running. Indeed, how would you define "when resources are available"? A node can sit IDLE for some time because Slurm knows it will need it soon for a large job.
In Snakemake, as far as I know, we can only adapt job resources dynamically based on the number of attempts a job has made. When re-adjusting resources after a failure, it would be useful to distinguish between the following types of cluster failures:
Program error
Transient node failure
Out of memory
Timeout
The last three cases, in particular, are exposed to the SLURM user via different job completion status codes, but the Snakemake interface to the status script merges all types of failures into a single "failed" status.
Is there any way to do this? Or is it a planned feature? Keeping a list of previous failure reasons, instead of just the attempt count, would be most useful.
e.g. the goal:

    rule foo:
        resources:
            mem_gb=lambda wildcards, attempts: 100 + (20 * attempts.OOM),
            time_s=lambda wildcards, attempts: 3600 + (3600 * attempts.TIMEOUT)
        ...
The cluster I have access to has heterogeneous machines where each node is configured with various walltime and memory limits, and it would minimize scheduling times if I didn't have to conservatively bump all resources at once.
Possible workaround: I thought of passing that extra info between the job status script and the cluster submission script (e.g. keeping a history of status codes for each jobid). Is the attempt number available to the cluster submission and cluster status commands?
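To illustrate the workaround, here is a minimal sketch of a --cluster-status script that records the SLURM state per jobid in a side file (the sacct call and the history-file name are just assumptions on my part, not an existing profile feature):

    #!/usr/bin/env python3
    # Sketch of a --cluster-status script: report status to Snakemake and
    # keep a per-jobid history of SLURM failure states in a side file.
    import json, subprocess, sys

    jobid = sys.argv[1]
    HISTORY = "slurm_failure_history.json"   # made-up path

    # Ask SLURM for the job state (e.g. COMPLETED, OUT_OF_MEMORY, TIMEOUT, FAILED).
    out = subprocess.run(
        ["sacct", "-j", jobid, "--format=State", "--noheader", "--parsable2"],
        capture_output=True, text=True,
    ).stdout
    state = out.splitlines()[0].strip() if out.strip() else "PENDING"

    if state.startswith(("PENDING", "RUNNING", "SUSPENDED", "COMPLETING")):
        print("running")
    elif state.startswith("COMPLETED"):
        print("success")
    else:
        # Remember why this jobid failed so the submission side could, in
        # principle, bump only the relevant resource on the next attempt.
        try:
            with open(HISTORY) as fh:
                history = json.load(fh)
        except FileNotFoundError:
            history = {}
        history.setdefault(jobid, []).append(state)
        with open(HISTORY, "w") as fh:
            json.dump(history, fh)
        print("failed")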
I think it would be best to handle such cluster-specific functionality in a profile, e.g. in the Slurm profile. When an error is detected, the status script could simply resubmit silently with updated resources based on what Slurm reports. I don't think the Snakefile should be cluttered with such platform details.
This is not clear to me from the docs. Here's our scenario and why we need to cap the number of running workflows, as succinctly as I can put it:
We have 60 coordinators running, launching workflows usually hourly, some of which have sub-workflows (some with multiple in parallel). This works out to around 40 workflows running at any given time. However, when the cluster is under load or some underlying service is slow (e.g. Impala or HBase), workflows run longer than usual and back up, so we can end up with 80+ workflows (including sub-workflows) running.
This sometimes results in ALL workflows hanging indefinitely, because the pool has only enough memory and cores allocated for Oozie to start the launcher jobs (e.g. oozie:launcher:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ), but not their corresponding actions (e.g. oozie:action:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ).
We could simply allocate enough resources to the pool to accommodate these spikes, but that would be a massive waste (hundreds of cores and GBs of memory that other pools/tenants could never use).
So I'm trying to enforce some limit on the number of workflows running, even if that means some will fall behind sometimes. BTW, all our coordinators are configured with execution=LAST_ONLY, so any delayed workflow will simply catch up fully on the next run. We are on CDH 5.13 with Oozie 4.1; pools are set up with the DRF scheduler.
Thanks in advance for your ideas.
AFAIK there is no configuration parameter that lets you control the number of workflows running at a given time.
If your coordinators are scheduled to run in approximately the same time window, you could consider collapsing them into a single coordinator/workflow and using fork/join control nodes to control the degree of parallelism. You can then distribute your actions across K queues in your workflow, which ensures that no more than K actions run at the same time, limiting the load on the cluster.
We use a script to automatically generate the fork queues inside the workflow and distribute the actions (of course this only applies to actions that can run in parallel, i.e. that have no data dependencies, etc.).
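To make the idea concrete, here is a rough sketch (in Python; the action names and K are made up) of the distribution step: assign independent actions round-robin to K queues, where each queue then becomes one sequential path of the fork:

    # Rough sketch: spread independent Oozie actions over K queues so that,
    # once each queue becomes one fork path, at most K actions run in parallel.
    actions = ["sqoop-import-a", "sqoop-import-b", "hive-load-c",
               "hive-load-d", "spark-agg-e"]      # placeholder action names
    K = 2                                         # desired degree of parallelism

    queues = [[] for _ in range(K)]
    for i, action in enumerate(actions):
        queues[i % K].append(action)              # round-robin assignment

    for n, queue in enumerate(queues):
        # Actions inside a queue are chained sequentially in the generated
        # workflow, so only K actions are ever active at the same time.
        print(f"fork path {n}: " + " -> ".join(queue))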
Hope this helps
So, I've been given a code base which uses daemons, daemons-rails and delayed_job to trigger a number of *_ctl files in /lib/daemons/, but here's the problem:
If two people perform an action that starts the daemons doing some heavy lifting, then whoever clicks second has to wait for the first to complete. Not good. We need to start multiple daemons on each queue.
Ideally what I want to do is read a config file like this:
    default:
      queues: default ordering
      num_workers: 10
    hours:
      queues: slow_admin_tasks
      num_workers: 2
    minutes:
      queues: minute
      num_workers: 2
This would mean that 10 daemon processes are started to listen to the default and ordering queues, 2 for slow_admin_tasks, and so on.
How would I go about defining multiple daemons like this? It looks like the change might belong in one of these places:
/lib/daemons/*_ctl
/lib/daemons/*.rb
/lib/daemons/daemons
I thought it might be a change to the daemons-rails rake tasks, but they just hit the daemons file.
Has anyone looked into scaling daemons-rails in this way? Where can I get more information?
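For concreteness, here is a rough sketch (in Python, not tied to daemons-rails; the config file name and worker command are placeholders) of the kind of launcher I have in mind: read the config above and spawn num_workers processes per group, each pinned to that group's queues:

    # Rough sketch of a launcher: one worker process per configured slot.
    # "worker_command" is a placeholder for whatever actually consumes the queues.
    import subprocess
    import yaml   # assumes PyYAML is available

    with open("workers.yml") as fh:       # the config shown above (made-up file name)
        config = yaml.safe_load(fh)

    processes = []
    for group, settings in config.items():
        for _ in range(settings["num_workers"]):
            # e.g. 10 processes for "default ordering", 2 for "slow_admin_tasks", ...
            cmd = ["worker_command", "--queues", settings["queues"]]
            processes.append(subprocess.Popen(cmd))

    for p in processes:
        p.wait()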
I suggest you try Foreman.
Take a look at this discussion.
Foreman can help manage multiple processes that your Rails app depends upon when running in development. You can find a tutorial regarding Foreman on RailsCasts. There's a video tutorial + some source code examples.
I also suggest you take a look at Sidekiq.
Sidekiq allows you to move jobs into the background for asynchronous processing. It uses threads instead of forks so it is much more efficient with memory compared to Resque. You can find a tutorial here.
I also suggest you take a look at Resque.
Resque is a Redis-backed Ruby library for creating background jobs, placing them on multiple queues, and processing them later. You can find a tutorial here.
I'm running some tasks via django-celery (with RabbitMQ as the broker); the tasks are time-consuming and CPU-intensive.
I have 2 worker EC2 instances (one is a small and the other is a high-CPU medium).
I've set the small instance to run 1 concurrent task and the medium to run 4. This works well for me. But occasionally, in the Celery monitor, I see that the small instance is working on a task while 2 or 3 more tasks are in the "RECEIVED" state (assigned to the small instance), and the medium instance is not doing anything. Ideally I'd like the medium instance to have preference over the small one, but at the very least, if the small instance is at its concurrency limit, the task should go to the medium one. It seems the small instance is biting off more than it can chew, allocating tasks to itself that it can't start at the moment.
Is there a way to make workers accept only the tasks they can start at that moment?
Screenshot: http://dl.dropbox.com/u/361747/task-state.png . The worker whose name starts with domU is the small instance; the one starting with ip is the medium one.
You can use the CELERYD_PREFETCH_MULTIPLIER option to control how many tasks each worker prefetches. In your case, CELERYD_PREFETCH_MULTIPLIER=1 will help distribute tasks evenly.
http://ask.github.com/celery/configuration.html#celeryd-prefetch-multiplier
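For example, a minimal sketch of how that setting might be applied in the Django settings used by django-celery (the setting name is the one from the configuration docs linked above; the surrounding file is just your existing settings.py):

    # settings.py (django-celery) -- illustrative sketch.
    # With a multiplier of 1, the prefetch limit equals the worker's
    # concurrency, so the small instance stops reserving extra RECEIVED
    # tasks that the idle medium instance could pick up instead.
    CELERYD_PREFETCH_MULTIPLIER = 1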