Snakemake 200000 job submission

I have 200000 fasta sequences. I am using GATK to call variants and have created a wildcard for every sequence. Now I would like to submit 200000 jobs with Snakemake. Will this cause a problem for the cluster? Is there a way to submit the jobs in sets of 10-20?

First off, it might take some time to calculate the DAG, though I have been told DAG calculation has recently been greatly improved. Either way, it might be wise to split the work into batches.
Most clusters won't allow you to submit more than X jobs at the same time, usually somewhere in the range of 100-1000. I believe the documentation is not fully correct here, but when using --cluster, the --jobs argument controls the number of jobs submitted at the same time, so with snakemake --jobs 20 --cluster "myclustercommand" you should be able to control this. Note that this limits the number of submitted jobs, not the number of running jobs; it may be that all your submitted jobs just sit in the queue. It's probably best to check with your cluster administrator, ask what the maximum number of submitted jobs is, and get as close to that number as you can.
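
For reference, here is a minimal Snakefile sketch of that per-sequence setup (the directory layout, rule names, and the command line are assumptions, not taken from the question). Invoking it with snakemake --jobs 20 --cluster "sbatch" would keep at most 20 jobs submitted to the scheduler at any one time:

# One variant-calling job per input sequence, driven by a wildcard.
SEQS, = glob_wildcards("fasta/{seq}.fasta")

rule all:
    input:
        expand("vcf/{seq}.vcf", seq=SEQS)

rule call_variants:
    input:
        "fasta/{seq}.fasta"
    output:
        "vcf/{seq}.vcf"
    shell:
        "your_gatk_command {input} > {output}"  # placeholder for the real GATK call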

Related

How changing the number of workers will affect the Glue job

I have 2,200,000 records to process in a Glue job, which is leading to a timeout: the timeout is set to 2 days by default and the number of workers is 10. Will increasing the number of workers help the Glue job run faster?
Increasing the number of workers will help the job run faster if your job has transformations that can run in parallel, since you are allocating more executor nodes.
2,200,000 records isn't that much, though, and you should check whether something is wrong with the code if it takes more than 2 days.
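
If it helps, worker settings can also be overridden per run rather than by editing the job definition; a hedged boto3 sketch (the job name and the numbers are made up):

import boto3

glue = boto3.client("glue")

# Start a run with more workers than the job's default of 10.
# WorkerType and NumberOfWorkers here override the job definition for this run only.
response = glue.start_job_run(
    JobName="my-etl-job",      # assumption: your Glue job's name
    WorkerType="G.1X",
    NumberOfWorkers=40,
    Timeout=720,               # minutes, well under the 2-day default
)
print(response["JobRunId"])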

Snakemake: Combine cluster profile with resources: attempt

Let's say I have a rule that needs 1h for 90% of my data, but occasionally needs 3h. In a busy cluster environment, however, I do not want to submit all jobs with a time limit of 3h to be safe, as this would slow down the scheduling of my jobs.
Hence, I played around with the attempt variable:
resources:
    # Increase the time limit in multiples of 1h if the job fails due to the time limit.
    time = lambda wildcards, input, threads, attempt: int(60 * int(attempt))
(one could be even smarter and use powers of 2 to amortize better...).
But this approach forces me to put the base time (1h) directly into the rule. How can I combine this approach with cluster profiles, where the base time is in some cluster_config.yaml file?
Thanks and so long
Lucas
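
As an illustration only of combining the two (a hedged sketch: the YAML layout, rule, and file names below are assumptions, not Snakemake's built-in cluster-config handling), one can load the per-rule base time from the YAML file and multiply it by attempt:

# cluster_config.yaml is assumed to look like:
#   my_rule:
#     time: 60    # base time limit in minutes

import yaml

with open("cluster_config.yaml") as fh:
    CLUSTER = yaml.safe_load(fh)

rule my_rule:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}.out"
    resources:
        # Scale the configured base time by the attempt number (1, 2, 3, ...).
        time = lambda wildcards, input, threads, attempt: int(CLUSTER["my_rule"]["time"]) * attempt
    shell:
        "some_command {input} > {output}"  # placeholder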

Check the current position in a Redis list of some list element

I have a simple job queue on Redis where new jobs are pushed with RPUSH and consumed with BLPOP. The jobs are stringified JSON objects that have an id field among other things (the json string is parsed by the workers).
Each job takes some time to do, so there can be a meaningful wait time. I'd like to be able to find a job's current position in the queue, so that I can give an update to whatever is waiting on that job. That is, be able to do something like "your current position is 300... 250... 200... 100... 10... your job is now being processed".
It can be assumed that the list may grow long but never too long, i.e. possibly 1000 entries but not 1 million.
After looking through the docs a bit, it seems like this is maybe easier said than done. A possible naive solution seems to be to just loop through the list until the element is found. Are there any performance issues with calling LINDEX a couple hundred times at a time like that?
Would appreciate any suggestions on other ways this can be done (or confirmation that LINDEX is the only way). The whole structure (even the usage of a list, or addition of some helper map/list) can be changed if needed, only requirement is that it run on Redis.
You can use a sorted set and a counter to more elegantly solve the problem.
Push a job
Call INCR counter to get a counter value.
Use the counter as the score of the job and call ZADD jobs counter job-name.
Pop a job
Call BZPOPMIN jobs to get the first unprocessed job.
Get job position
Call ZRANK jobs job-name to get the rank of the job, i.e. the current position of the job in the queue.
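
A hedged sketch of that scheme with redis-py (the key names and helper functions are made up; position lookup requires the exact serialized job string used at push time):

import json
import redis

r = redis.Redis()

def push_job(job):
    # INCR gives a monotonically increasing counter; use it as the job's score.
    score = r.incr("jobs:counter")
    r.zadd("jobs", {json.dumps(job): score})
    return score

def pop_job(timeout=0):
    # BZPOPMIN blocks until a job is available and returns the lowest-scored member.
    key, member, score = r.bzpopmin("jobs", timeout=timeout)
    return json.loads(member)

def job_position(job):
    # ZRANK gives the 0-based position in the queue, or None if the job was already popped.
    return r.zrank("jobs", json.dumps(job))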

How to reduce time allotted for a batch of HITs?

Today I created a small batch of 20 categorization HITs named "Grammatical or Ungrammatical" using the web UI. Can you tell me the easiest way to manage this batch so that I can reduce its allotted time from 1 hour to 15 minutes and also remove the Masters qualification? This is a very simple task that's set to auto-approve within 1 hour, and I am fine with that. I just need to make it more lucrative for people to attempt it at the penny rate.
You need to register a new HITType with the relevant properties (reduced time and no masters qualification) and then perform a ChangeHITTypeOfHIT operation on all of the HITs in the batch.
API documentation here: http://docs.aws.amazon.com/AWSMechTurk/latest/AWSMturkAPI/ApiReference_ChangeHITTypeOfHITOperation.html
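
With the current boto3 MTurk client, the same two steps look roughly like this (a hedged sketch: the description, keywords, and HIT selection are assumptions, and update_hit_type_of_hit is the newer name for the ChangeHITTypeOfHIT operation):

import boto3

mturk = boto3.client("mturk")

# 1. Register a new HIT type: 15-minute assignment duration, 1-hour auto-approval,
#    penny reward, and no qualification requirements (so no Masters).
new_type = mturk.create_hit_type(
    Title="Grammatical or Ungrammatical",
    Description="Decide whether a sentence is grammatical.",  # assumption
    Keywords="categorization,grammar",                         # assumption
    Reward="0.01",
    AssignmentDurationInSeconds=15 * 60,
    AutoApprovalDelayInSeconds=60 * 60,
    QualificationRequirements=[],
)

# 2. Move every HIT in the batch over to the new HIT type
#    (here naively over all HITs; filter to your batch as needed).
for hit in mturk.list_hits()["HITs"]:
    mturk.update_hit_type_of_hit(HITId=hit["HITId"], HITTypeId=new_type["HITTypeId"])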

How do I do the Delayed::Job equivalent of Process#waitall?

I have a large task that proceeds in several major steps: Step A must complete before Step B can be started, etc. But each major step can be divided up across multiple processes, in my case, using Delayed::Job.
The question: Is there a simple technique for starting Step B only after all the processes have completed working on Step A?
Note 1: I don't know a priori how many external workers have been spun up, so keeping a reference count of completed workers won't help.
Note 2: I'd prefer not to create a worker whose sole job is to busy wait for the other jobs to complete. Heroku workers cost money!
Note 3: I've considered having each worker examine the Delayed::Job queue in the after callback to decide if it's the last one working on Step A, in which case it could initiate Step B. This could work, but seems potentially fraught with gotchas. (In the absence of better answers, this is the approach I'm going with.)
I think it really depends on the specifics of what you are doing, but you could set priority levels such that any jobs from Step A run first. Depending on your setup, that might be enough. From the GitHub page:
By default all jobs are scheduled with priority = 0, which is top priority. You can change this by setting Delayed::Worker.default_priority to something else. Lower numbers have higher priority.
So if you set Step A to run at priority = 0, and Step B to run at priority = 100, nothing in Step B will run until Step A is complete.
There are some cases where this will be problematic: in particular, if you have a lot of jobs and are running a lot of workers, you will probably have some workers picking up Step B before the work in Step A is finished. Ideally, in this setup, Step B has some sort of check to see whether it can run yet.