Snakemake: Is it possible to only display "Job counts" using --dryrun?

How can I make Snakemake display only the Job counts section on a dry run? When performing a real run, that's the first information Snakemake outputs before starting the jobs.
Currently, the way I get job counts is to run Snakemake without the -n flag and immediately cancel it (^C), but that's far from ideal.
Letting the dry run complete will output the Job counts at the end, but that's not feasible for pipelines with hundreds or thousands of jobs.
Desired output:
$ snakemake -n --someflag
Job counts:
count jobs
504 BMO
1 all
504 fit_nbinoms
517 motifs_in_peaks
503 motifs_outside_peaks
2029
$

The -q flag does this:
--quiet, -q    Do not output any progress or rule information.
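So, keeping the question's --someflag placeholder, combining the dry-run and quiet flags should print just the job counts:

$ snakemake -n -q --someflag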

Related

"scancel: error: Invalid job id Submitted batch job" with --cluster-cancel from snakemake

I am running snakemake using this command:
snakemake --profile slurm -j 1 --cores 1 --cluster-cancel "scancel"
which writes this to standard out:
Submitted job 224 with external jobid 'Submitted batch job 54174212'.
but after I cancel the run with ctrl + c, I get the following error:
scancel: error: Invalid job id Submitted batch job 54174212
What I would guess is that the jobid is 'Submitted batch job 54174212'
and snakemake tries to run scancel 'Submitted batch job 54174212' instead of the expected scancel 54174212. If this is the case, how do I change the jobid to something that works with scancel?
Your suspicion is probably correct: Snakemake probably tries to cancel the wrong job ID (Submitted batch job 54174212).
Check the Slurm profile you invoke Snakemake with (standard location: ~/.config/snakemake/slurm/config.yaml): does it contain the --parsable flag for sbatch?
Forgetting to include that flag is a mistake I have made before; adding it solved the problem for me.
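For reference, a minimal sketch of what the relevant profile lines might look like (the partition argument is just an illustrative placeholder; the part that matters here is --parsable):

cluster: "sbatch --parsable --partition={resources.partition}"
cluster-cancel: "scancel"

With --parsable, sbatch prints only the numeric job ID instead of the full "Submitted batch job 54174212" line, so scancel receives an ID it can actually use.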

Can Snakemake parallelize the same rule both within and across nodes?

I have a somewhat basic question about Snakemake parallelization when using cluster execution: can jobs from the same rule be parallelized both within a node and across multiple nodes at the same time?
Let's say for example that I have 100 bwa mem jobs and my cluster has nodes with 40 cores each. Could I run 4 bwa mem per node, each using 10 threads, and then have Snakemake submit 25 separate jobs? Essentially, I want to parallelize both within and across nodes for the same rule.
Here is my current snakefile:
SAMPLES, = glob_wildcards("fastqs/{id}.1.fq.gz")
print(SAMPLES)

rule all:
    input:
        expand("results/{sample}.bam", sample=SAMPLES)

rule bwa:
    resources:
        time="4:00:00",
        partition="short-40core"
    input:
        ref="/path/to/reference/genome.fa",
        fwd="fastqs/{sample}.1.fq.gz",
        rev="fastqs/{sample}.2.fq.gz"
    output:
        bam="results/{sample}.bam"
    log:
        "results/logs/bwa/{sample}.log"
    params:
        threads=10
    shell:
        "bwa mem -t {params.threads} {input.ref} {input.fwd} {input.rev} 2> {log} | samtools view -bS - > {output.bam}"
I've run this with the following command:
snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25
With this setup, I get 25 jobs submitted, each to a different node. However, only one bwa mem process (using 10 threads) is run per node.
Is there some straightforward way to modify this so that I could get 4 different bwa mem jobs (each using 10 threads) to run on each node?
Thanks!
Dave
Edit 07/28/22:
In addition to Troy's suggestion below, I found a straightforward way of accomplishing what I was trying to do by simply following the job grouping documentation.
Specifically, I did the following when executing my Snakemake pipeline:
snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25 --groups bwa=group0 --group-components group0=4 --rerun-incomplete --cores 40
By specifying a group ("group0") for the bwa rule and setting "--group-components group0=4", I was able to group the jobs such that 4 bwa runs are occurring on each node.
You can try job grouping but note that resources are typically summed together when submitting group jobs like this. Usually that's not what is desired, but in your case it seems to be correct.
Instead you can make a group job with another rule that does the grouping for you in batches of 4.
rule bwa_mem:
    group: 'bwa_batch'
    output: '{sample}.bam'
    ...

def bwa_mem_batch_input(wildcards):
    # for wildcards.i, pick 4 bwa_mem outputs to put in this group
    i = int(wildcards.i)
    return expand('{sample}.bam', sample=SAMPLES[i*4:i*4+4])

rule bwa_mem_batch:
    input: bwa_mem_batch_input
    output: touch('flag_{i}')  # could be temp() too
    group: 'bwa_batch'
The consuming rule must request flag_{i} for every i in 0..len(SAMPLES)//4. With cluster integration, each Slurm job then gets one bwa_mem_batch job and 4 bwa_mem jobs, with the resources of a single bwa_mem job. This is useful for batching multiple short jobs together into a single, longer-running cluster job.
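A minimal sketch of such a consuming rule, assuming SAMPLES is defined as in the question (the ceiling division also covers a final partial batch):

rule all:
    input:
        expand('flag_{i}', i=range((len(SAMPLES) + 3) // 4))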
As a final point, this may do what you want, but I don't think it will help you get around QOS or other job quotas: you are using the same amount of CPU hours either way. You may also wait in the queue longer because the scheduler can't find 40 free threads to give you at once, whereas it could have given you a few 10-thread jobs right away. Instead, consider refining your resource values to get better efficiency, which may get your jobs running earlier.
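As one hypothetical refinement along those lines: the question's Snakefile declares the thread count under params:, which Snakemake does not account for when scheduling. Moving it to the threads: directive lets Snakemake (and --cores) see it; a sketch of the change:

rule bwa:
    threads: 10
    ...
    shell:
        "bwa mem -t {threads} {input.ref} {input.fwd} {input.rev} 2> {log} | samtools view -bS - > {output.bam}"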

Is there a way to get a nice error report summary when running many jobs on DRMAA cluster?

I need to run a snakemake pipeline on a DRMAA cluster with a total number of >2000 jobs. When some of the jobs have failed, I would like to receive in the end an easy readable summary report, where only the failed jobs are listed instead of the whole job summary as given in the log.
Is there a way to achieve this without parsing the log file by myself?
These are the (incomplete) cluster options:
jobs: 200
latency-wait: 5
keep-going: True
rerun-incomplete: True
restart-times: 2
I am not sure there is another way than parsing the log file yourself, but I've done that several times with grep and I am happy with the results:
grep -B 3 -A 3 error .snakemake/log/[TIME].snakemake.log
Of course, replace the [TIME] placeholder with whichever run you want to check.
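A variation on the same idea, assuming the failed jobs appear in the log with Snakemake's usual "Error in rule ..." wording, so only failures are listed:

grep -A 5 'Error in rule' .snakemake/log/[TIME].snakemake.log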

BigQuery: make a quick backup with many tables

Currently I'm copying tables with something like this:
#!/bin/sh
export SOURCE_DATASET="BQPROJECTID:BQSOURCEDATASET"
export DEST_PREFIX="TARGETBQPROJECTID:TARGETBQDATASET._YOUR_PREFIX"
for f in `bq ls -n TOTAL_NUMBER_OF_TABLES $SOURCE_DATASET | grep TABLE | awk '{print $1}'`
do
    export CLONE_CMD="bq --nosync cp $SOURCE_DATASET.$f $DEST_PREFIX$f"
    echo $CLONE_CMD
    echo `$CLONE_CMD`
done
(script from here), but it takes ~20 min (because of ~600 tables). Maybe there is another way (preferably faster) to make a backup?
As a suggestion, you could use scheduled queries to set up recurring queries in BigQuery. With this option you can run your backups on a daily, weekly, monthly, or custom schedule, leaving them for nights or weekends.
But remember: the time a backup takes will depend on the size of your tables.
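For illustration, a hypothetical per-table scheduled query created with the bq CLI might look like this (all project, dataset, and table names are placeholders reused from the question; check the current bq mk --transfer_config flags against the BigQuery documentation):

bq mk --transfer_config \
  --data_source=scheduled_query \
  --target_dataset=TARGETBQDATASET \
  --display_name='nightly mytable backup' \
  --schedule='every 24 hours' \
  --params='{"query":"SELECT * FROM `BQPROJECTID.BQSOURCEDATASET.mytable`","destination_table_name_template":"_YOUR_PREFIXmytable","write_disposition":"WRITE_TRUNCATE"}'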
Well, since you mentioned that scheduled queries are not an option for you, another thing you can try is running your cp commands in the background. Your for loop waits for each copy to finish before starting the next one; running multiple processes in the background instead gives better performance. I wrote a simple script to test this, and it works. First I tested without background processes:
#!/bin/bash
start_global=$(date +'%s')
for ((i=0; i<100; i++))
do
    start=$(date +'%s')
    bq --location=US cp -a -f -n [SOURCE_PROJECT_ID]:[DATASET].[TABLE] [TARGET_PROJECT_ID]:[DATASET].[TABLE]
    echo "It took $(($(date +'%s') - $start)) seconds for iteration number $i"
done
echo "It took $(($(date +'%s') - $start_global)) seconds for the entire process"
It took around 5 seconds per copied table (approx. 160 MB each), so the whole process took more or less 10 minutes. I then modified the script to use background processes:
#!/bin/bash
start_global=$(date +'%s')
pids=()
for ((i=0; i<100; i++))
do
    bq --location=US cp -a -f -n [SOURCE_PROJECT_ID]:[DATASET].[TABLE] [TARGET_PROJECT_ID]:[DATASET].[TABLE] &
    pids+=($!)  # collect each background process id
done
if wait "${pids[@]}"  # wait for all copies, not just the last one
then
    echo "Processes terminated successfully"
else
    echo "Error"
fi
echo "It took $(($(date +'%s') - $start_global)) seconds for the entire process"
This way, the entire execution took me only 3 minutes.
You can adapt this idea to your implementation; just keep the quotas for copy jobs in mind.

Sun Grid Engine resubmit job stuck in 'Rq' state

I have what I hope is a pretty simple question, but I'm not super familiar with Sun Grid Engine, so I've been having trouble finding the answer. I am currently submitting jobs to a grid using a bash submission script that generates a command and then executes it. I have read online that if a Sun Grid Engine job exits with a code of 99, it gets re-submitted to the grid. I have successfully written my bash script to do this:
[code to generate command, stores in $command]
$command
STATUS=$?
if [[ $STATUS -ne 0 ]]; then
    exit 99
fi
exit 0
When I submit this job to the grid with a command that I know has a non-zero exit status, the job does indeed appear to be resubmitted, however the scheduler never sends it to another host, instead it just remains stuck in the queue with the status "Rq":
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2150015 0.55500 GridJob.sh my_user Rq 04/08/2013 17:49:00 1
I have a feeling that this is something simple in the config options for the queue, but I haven't been able to find anything googling. I've tried submitting this job with the qsub -r y option, but that doesn't seem to change anything.
Thanks!
Rescheduled jobs will only get run in queues that have their rerun attribute (FALSE by default) set to TRUE, so check your queue configuration (qconf -mq myqueue). Without this, your job remains in the rescheduled-pending state indefinitely because it has nowhere to go.
IIRC, submitting jobs with qsub -r yes only qualifies them for automatic rescheduling in the event of an exec node crash; exiting with status 99 should trigger a reschedule regardless.
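In case it helps, the rerun change would look something like this (myqueue is a placeholder queue name; qconf -mq opens the queue configuration in an editor):

qconf -mq myqueue
# then, in the editor that opens, change:
#   rerun    FALSE
# to:
#   rerun    TRUE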