Set Slurm to distribute jobs across nodes in Nextflow

I am running a nextflow pipeline on a 3-node cluster.
When I run the pipeline through Slurm, it creates a high number of jobs, which I limit using the executor.queueSize = X directive.
However, what Slurm does is saturate node 1, then saturate node 2, and only then start sending jobs to node 3.
I'd like it to distribute the job list more evenly.
I've tried a number of slurm commands, including
--spread-job
--ntasks-per-core=5
--distribution=cyclic
-m cyclic=1
--distribution=plane=5
But none of them does what I want, which is simply to assign 1 job to N1, then 1 to N2, then 1 to N3, then 1 to N1 again, and so on.
Any ideas please?
Thanks in advance for your help.
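
For context, the relevant part of my nextflow.config looks roughly like this (the queue size value is just an example):

process.executor = 'slurm'
executor.queueSize = 10   // max number of jobs submitted to Slurm at any one time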

As a user, you do not decide how your independent jobs are allocated with respect to one another. The --spread-job and --distribution=cyclic options decide how the allocation for a single job is built, and how tasks are mapped onto that allocation.
To obtain the behaviour you want, the cluster must be configured with SelectTypeParameters=CR_LLN.
This option leads to fragmented resources and makes it more difficult to schedule large jobs, so it is often not the default choice for clusters.
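For reference, a minimal sketch of the relevant slurm.conf lines (this is set by the cluster administrator, not per job; the partition and node names are placeholders):

# Cluster-wide: among the nodes that fit a job, pick the least-loaded one first
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_LLN

# Or, if only one partition should behave this way:
PartitionName=batch Nodes=node[1-3] LLN=YES Default=YES

With least-loaded-node placement, successive single-node jobs should land on your three nodes in roughly the round-robin pattern you describe.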

Related

Snakemake: Combine cluster profile with resources: attempt

Let's say I have a rule that for 90% of my data needs 1h, but occasionally needs 3h. In a busy cluster environment, however, I do not want to submit all jobs with a time limit of 3h to be safe, as this would slow down the scheduling of my jobs.
Hence, I played around with the attempt variable:
resources:
    # Increase time limit in factors of 1h, if the job fails due to time limit.
    time = lambda wildcards, input, threads, attempt: int(60 * int(attempt))
(one could be even smarter and use powers of 2 to amortize better...).
But this approach forces me to put the base time (1h) directly into the rule. How can I combine this approach with cluster profiles, where the base time is in some cluster_config.yaml file?
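To make the question concrete, something like this sketch is what I'm after (the YAML layout and the rule name are made up):

import yaml

# Hypothetical: load the same cluster_config.yaml that the cluster profile uses.
with open("cluster_config.yaml") as fh:
    cluster_config = yaml.safe_load(fh)

rule some_rule:
    input: "data/{sample}.txt"
    output: "results/{sample}.txt"
    resources:
        # base time (minutes) comes from the config file, scaled by attempt
        time = lambda wildcards, attempt: int(cluster_config["some_rule"]["time"]) * attempt
    shell: "..."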
Thanks and so long
Lucas

Snakemake 200000 job submission

I have 200000 fasta sequences. I am using GATK to call variants and have created a wildcard for every sequence. Now I would like to submit 200000 jobs using snakemake. Will this cause a problem for the cluster? Is there a way to submit jobs in sets of 10-20?
First off, it might take some time to calculate the DAG, although I have been told that DAG calculation has recently been greatly improved. In any case, it might be wise to split the work up into batches.
Most clusters won't allow you to submit more than X jobs at the same time, usually in the range of 100-1000. I believe the documentation is not fully correct on this, but when using --cluster, the --jobs argument controls the number of jobs submitted at the same time, so by using snakemake --jobs 20 --cluster "myclustercommand" you should be able to control this. Note that this controls the number of submitted jobs, not running jobs: it might be that all your jobs end up sitting in the queue. So it's probably best to check with your cluster administrator what the maximum number of submitted jobs is, and get as close to that number as possible.
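For example (the sbatch options are placeholders for whatever your site requires):

snakemake --jobs 20 --cluster "sbatch --cpus-per-task={threads}"

This keeps at most 20 jobs submitted to the scheduler at any one time while snakemake works through all 200000.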

Ideal value for Kafka Connect Distributed tasks.max configuration setting?

I am looking to productionize and deploy my Kafka Connect application. However, I have two questions about the tasks.max setting, which is required and of high importance, but for which the details of what value to actually set are vague.
If I have a topic with n partitions that I wish to consume data from and write to some sink (in my case, I am writing to S3), what should I set tasks.max to? Should I set it to n? Should I set it to 2n? Intuitively it seems that I'd want to set the value to n and that's what I've been doing.
What if I change my Kafka topic and increase the number of partitions on it? If I have set tasks.max to n, will I have to pause my Kafka connector and increase it? If I have set a value of 2n, will my connector automatically increase its parallelism?
In a Kafka Connect sink, the tasks are essentially consumer threads that receive partitions to read from. If you have 10 partitions and tasks.max set to 5, each task will receive 2 partitions to read from and will track the offsets. If you have configured tasks.max to a number above the partition count, Connect will launch a number of tasks equal to the partition count of the topics it's reading.
If you change the partition count of the topic, you'll have to relaunch your connector; if tasks.max is still greater than the partition count, Connect will start one task per partition.
Edit: just discovered ConnectorContext: https://kafka.apache.org/0100/javadoc/org/apache/kafka/connect/connector/ConnectorContext.html
The connector will have to be written to include this but it looks like Connect has the ability to reconfigure a connector if there's a topic change (partitions added/removed).
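For illustration, a minimal S3 sink configuration might look like this (the property names follow the Confluent S3 sink connector; the connector name, topic, bucket, and region are placeholders):

name=my-s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
# with a 10-partition topic this gives one partition per task;
# anything above the partition count does no useful work
tasks.max=10
topics=my-topic
s3.bucket.name=my-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=1000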
We had a problem with the distribution of the workload between our Kafka Connect (5.1.2) instances, caused by tasks.max being set higher than the number of partitions.
In our case, there were 10 Kafka Connect tasks and 3 partitions of the topic to be consumed. 3 of those 10 tasks were assigned to the 3 partitions of the topic and the other 7 were not assigned to any partition (which is expected), but Kafka Connect was distributing the tasks evenly across instances without considering their workload. So we were ending up with a task distribution where some instances stayed idle (because their tasks were not assigned to any partition) while others were doing more work than the rest.
To work around the issue, we set tasks.max equal to the number of partitions of our topics.
It was really unexpected for us to see that Kafka Connect does not consider the tasks' assignments while rebalancing. Also, I couldn't find any documentation for the tasks.max setting.

How to make monit start processes in order?

In the monit config file, we have a list of processes we expect monit to check for. Each one looks like:
check process process_name_here
    with pidfile /path/to/file.pid
    start program = "/bin/bash ..."
    stop program = "/bin/bash ..."
    if totalmem is greater than X MB for Y cycles then alert
    if N restarts within X cycles then alert
    group group_name
Since we have about 30-40 processes in this list that we monitor, I have two questions:
1) If we restart the services (kill them all), can we have monit start all processes at the same time, instead of the way it's done now (sequentially, one by one)?
2) Can we specify the order in which we would like the processes to start? How is the order determined? Is it the order that they appear in the conf file? Is it by process name? Anything else? This is especially important if #1 above is not possible...
You can use the depends on syntax. I use this for custom Varnish builds.
For example, take process a, process b, and process c. Process a needs to start first, followed by b and then c.
Your first process won't depend on anything. In your check for process b, you'll want:
depends on a
Then in your process c check, you'll want:
depends on b
This should make sure that the processes are started in the correct order. Let me know if this works for you.
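Putting it together, a minimal sketch (the pidfile paths and start/stop scripts are placeholders):

check process a
    with pidfile /var/run/a.pid
    start program = "/bin/bash /usr/local/bin/start-a.sh"
    stop program = "/bin/bash /usr/local/bin/stop-a.sh"

check process b
    with pidfile /var/run/b.pid
    start program = "/bin/bash /usr/local/bin/start-b.sh"
    stop program = "/bin/bash /usr/local/bin/stop-b.sh"
    depends on a

check process c
    with pidfile /var/run/c.pid
    start program = "/bin/bash /usr/local/bin/start-c.sh"
    stop program = "/bin/bash /usr/local/bin/stop-c.sh"
    depends on b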
Going only by documentation, there is nothing related to point one other than the fact that monit runs single-threaded.
As for point two, under "SERVICE POLL TIME":
Checks are performed in the same order as they are written in the .monitrc file, except if dependencies are setup between services, in which case the services hierarchy may alternate the order of the checks.
Note that if you have an include string that matches multiple files, they are included in no specific order. If you require a specific order, you should use DEPENDS where possible.

How do I do the Delayed::Job equivalent of Process#waitall?

I have a large task that proceeds in several major steps: Step A must complete before Step B can be started, etc. But each major step can be divided up across multiple processes, in my case, using Delayed::Job.
The question: Is there a simple technique for starting Step B only after all the processes have completed working on Step A?
Note 1: I don't know a priori how many external workers have been spun up, so keeping a reference count of completed workers won't help.
Note 2: I'd prefer not to create a worker whose sole job is to busy wait for the other jobs to complete. Heroku workers cost money!
Note 3: I've considered having each worker examine the Delayed::Job queue in the after callback to decide if it's the last one working on Step A, in which case it could initiate Step B. This could work, but seems potentially fraught with gotchas. (In the absence of better answers, this is the approach I'm going with.)
I think it really depends on the specifics of what you are doing, but you could set priority levels such that any jobs from Step A run first. Depending on your setup, that might be enough. From the GitHub page:
By default all jobs are scheduled with priority = 0, which is top priority. You can change this by setting Delayed::Worker.default_priority to something else. Lower numbers have higher priority.
So if you set Step A to run at priority = 0, and Step B to run at priority = 100, nothing in Step B will run until Step A is complete.
There are some cases where this will be problematic -- in particular, if you have a lot of jobs and are running a lot of workers, you will probably have some workers running Step B before the work in Step A is finished. Ideally in this setup, Step B has some sort of check to see if it can run or not, as in the sketch below.
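A minimal sketch of the idea (the job classes are made up; the guard assumes the ActiveRecord backend, where pending job handlers can be inspected):

# Enqueue all of Step A at top priority, Step B at a lower one.
items.each { |item| Delayed::Job.enqueue(StepAJob.new(item), priority: 0) }
items.each { |item| Delayed::Job.enqueue(StepBJob.new(item), priority: 100) }

class StepBJob < Struct.new(:item)
  def perform
    if Delayed::Job.where("handler LIKE ?", "%StepAJob%").exists?
      # Step A work is still pending or running: requeue ourselves for later.
      Delayed::Job.enqueue(StepBJob.new(item), priority: 100, run_at: 1.minute.from_now)
    else
      # ... actual Step B work on item ...
    end
  end
end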