Unexperienced, self-tought "coder" here, so please be understanding :]
I am trying to learn and use Snakemake to construct pipeline for my analysis. Unfortunatly, I am unable to run multiple instances of a single job/rule at the same time. My workstation is not a computing cluster, so I cannot use this option. I looked for an answer for hours, but either there is non, or I am not knowledgable enough to understand it.
So: is there a way to run multiple instances of a single job/rule simultaneously?
If You would like a concrete example:
Lets say I want to analyze a set of 4 .fastq files using fastqc tool. So I input a command:
time snakemake -j 32
and thus run my code, which is:
SAMPLES, = glob_wildcards("{x}.fastq.gz")
rule Raw_Fastqc:
input:
expand("{x}.fastq.gz", x=SAMPLES)
output:
expand("./{x}_fastqc.zip", x=SAMPLES),
expand("./{x}_fastqc.html", x=SAMPLES)
shell:
"fastqc {input}"
I would expect snakemake to run as many instances of fastqc as possible on 32 threads (so easily all of my 4 input files at once). In reality. this command takes about 12 minutes to finish. Meanwhile, utilizing GNU parallel from inside snakemake
shell:
"parallel fastqc ::: {input}"
I get results in 3 minutes. Clearly there is some untapped potential here.
Thanks!
If I am not wrong, fastqc works on each fastq file separately, and therefore your implementation doesn't take advantage of parallelization feature of snakemake. This can be done by defining the targets as shown below using rule all.
from pathlib import Path
SAMPLES = [Path(f).name.replace('.fastq.gz', '') for f in glob_wildcards("{x}.fastq.gz") ]
rule all:
input:
expand("./{sample_name}_fastqc.{ext}",
sample_name=SAMPLES, ext=['zip', 'html'])
rule Raw_Fastqc:
input:
"{x}.fastq.gz", x=SAMPLES
output:
"./{x}_fastqc.zip", x=SAMPLES,
"./{x}_fastqc.html", x=SAMPLES
shell:
"fastqc {input}"
To add to JeeYem's answer above, you can also define the number of resources to reserve for each job using the 'threads' property of each rule, as so:
rule Raw_Fastqc:
input:
"{x}.fastq.gz", x=SAMPLES
output:
"./{x}_fastqc.zip", x=SAMPLES,
"./{x}_fastqc.html", x=SAMPLES
threads: 4
shell:
"fastqc --threads {threads} {input}"
Because fastqc itself can use multiple threads per task, you might even get additional speedups over the parallel implementation.
Snakemake will then automatically allocate as many jobs as can fit within the total threads provided by the top-level call:
snakemake -j 32, for example, would execute up to 8 instances of the Raw_Fastqc rule.
Related
I have a somewhat basic question about Snakemake parallelization when using cluster execution: can jobs from the same rule be parallelized both within a node and across multiple nodes at the same time?
Let's say for example that I have 100 bwa mem jobs and my cluster has nodes with 40 cores each. Could I run 4 bwa mem per node, each using 10 threads, and then have Snakemake submit 25 separate jobs? Essentially, I want to parallelize both within and across nodes for the same rule.
Here is my current snakefile:
SAMPLES, = glob_wildcards("fastqs/{id}.1.fq.gz")
print(SAMPLES)
rule all:
input:
expand("results/{sample}.bam", sample=SAMPLES)
rule bwa:
resources:
time="4:00:00",
partition="short-40core"
input:
ref="/path/to/reference/genome.fa",
fwd="fastqs/{sample}.1.fq.gz",
rev="fastqs/{sample}.2.fq.gz"
output:
bam="results/{sample}.bam"
log:
"results/logs/bwa/{sample}.log"
params:
threads=10
shell:
"bwa mem -t {params.threads} {input.ref} {input.fwd} {input.rev} 2> {log} | samtools view -bS - > {output.bam}"
I've run this with the following command:
snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25
With this setup, I get 25 jobs submitted, each to a different node. However, only one bwa mem process (using 10 threads) is run per node.
Is there some straightforward way to modify this so that I could get 4 different bwa mem jobs (each using 10 threads) to run on each node?
Thanks!
Dave
Edit 07/28/22:
In addition to Troy's suggestion below, I found a straightforward way of accomplishing what I was trying to do by simply following the job grouping documentation.
Specifically, I did the following when executing my Snakemake pipeline:
snakemake --cluster "sbatch --partition={resources.partition}" -s bwa_slurm_snakefile --jobs 25 --groups bwa=group0 --group-components group0=4 --rerun-incomplete --cores 40
By specifying a group ("group0") for the bwa rule and setting "--group-components group0=4", I was able to group the jobs such that 4 bwa runs are occurring on each node.
You can try job grouping but note that resources are typically summed together when submitting group jobs like this. Usually that's not what is desired, but in your case it seems to be correct.
Instead you can make a group job with another rule that does the grouping for you in batches of 4.
rule bwa_mem:
group: 'bwa_batch'
output: '{sample}.bam'
...
def bwa_mem_batch(wildcards):
# for wildcard.i, pick 4 bwa_mem outputs to put in this group
return expand('{sample}.bam', sample=SAMPLES[i*4:i*4+4])
rule bwa_mem_batch:
input: bwa_mem_batch_input
output: touch('flag_{i}') # could be temp too
group 'bwa_batch'
The consuming rule must request flag_{i} for i in {0..len(SAMPLES)//4}. With cluster integration, each slurm job gets 1 bwa_mem_batch job and 4 bwa_mem jobs with resources for a single bwa_mem job. This is useful for batching together multiple jobs to increase the runtime.
As a final point, this may do what you want, but I don't think it will help you get around QOS or other job quotas. You are using the same amount of CPU hours either way. You may be waiting in the queue longer because the scheduler can't find 40 threads to give you at once, where it could have given you a few 10 thread jobs. Instead, consider refining your resource values to get better efficiency, which may get your jobs run earlier.
I would like to learn how snakemake handles following situations, and what is the best practice avoid collisions/corruptions.
rule something:
input:
expand("/path/to/out-{asd}.txt", asd=LIST)
output:
"/path/to/merged.txt"
shell:
"cat {input} >> {output}"
With snakemake -j10 the command will try to append to the same file simultaneously, and I could not figure out if this could lead to possible corruptions or if this is already handled.
Also, how are more complicated cases handled e.g. where it is not only cat but a return value of another process based on input value being appended to the same file? Is the best practice first writing them to individual files then catting them together?
rule get_merged_total_distinct:
input:
expand("{dataset_id}/merge_libraries/{tomerge}_merged_rmd.bam",dataset_id=config["dataset_id"],tomerge=list(TOMERGE.keys())),
output:
"{dataset_id}/merge_libraries/merged_total_distinct.csv"
params:
with_dups="{dataset_id}/merge_libraries/{tomerge}_merged.bam"
shell:
"""
RCT=$(samtools view -#4 -c -F1 -F4 -q 30 {params.with_dups})
RCD=$(samtools view -#4 -c -F1 -F4 -q 30 {input})
printf "{wildcards.tomerge},${{RCT}},${{RCD}}\n" >> {output}
"""
or cases where an external script is being called to print the result to a single output file?
input:
expand("infile/{x}",...) # expanded as above
output:
"results/all.txt"
shell:
"""
bash script.sh {params.x} {input} {params.y} >> {output}
"""
With your example, the shell directive will expand to
cat /path/to/out-SAMPLE1.txt /path/to/out-SAMPLE2.txt [...] >> /path/to/merged.txt
where SAMPLE1, etc, comes from the LIST. In this case, there is no collision, corruption, or race conditions. One thread will run that command as if you typed it on your shell and all inputs will get cated to the output. Since snakemake is pull based, once the output exists that rule will only run again if the inputs change at which points the new inputs will be added to the old due to using >>. As such, I would recommend using > so the old contents are removed; rules should be deterministic where possible.
Now, if you had done something like
rule something:
input:
"/path/to/out-{asd}.txt"
output:
touch("/path/to/merged-{asd}.txt")
params:
output="/path/to/merged.txt"
shell:
"cat {input} >> {params.output}"
# then invoke
snakemake -j10 /path/to/merged-{a..z}.txt
Things are more messy. Snakemake will launch all 10 jobs and output to the single merged.txt. Note that file is now a parameter and we are targeting some dummy files. This will behave as if you had 10 different shells and executed the commands
cat /path/to/out-a.txt >> /path/to/merged.txt
# ...
cat /path/to/out-z.txt >> /path/to/merged.txt
all at once. The output will have a random order and lines may be interleaved or interrupted.
As some guidance
Try to make outputs deterministic. Given the same inputs you should always produce the same outputs. If possible, set random seeds and enforce input ordering. In the second example, you have no idea what the output will be.
Don't use the append operator. This follows from the first point. If the output already exists and needs to be updated, start from scratch.
If you need to append a bunch of outputs, say log files or to create a summary, do so in a separate rule. This again follows from the first point, but it's the only reason I can think of to use append.
Hope that helps. Otherwise you can comment or edit with a more realistic example of what you are worried about.
I'm a newbie in Snakemake and on StackOverflow. Don't hesitate to tell me if something is unclear or if you want any other detail.
I have written a workflow permitting to convert .BCL Illumina Base Calls files to demultiplexed .FASTQ files and to generate QC report (FastQC files). This workflow is composed of :
Subworkflow "convert_bcl_to_fastq" It creates FASTQ files in a directory named Fastq from BCL files. It must be executed before the main workflow, this is why I have chosen to use a subworkflow since my second rule depends on the generation of these FASTQ files which I don't know the names in advance. A fake file "convert_bcl_to_fastq.done" is created as an output in order to know when this subworkflow ran as espected.
Rule "generate_fastqc" It takes the FASTQ files generated thanks to the subworkflow and creates FASTQC files in a directory named FastQC.
Problem
When I try to run my workflow, I don't have any error but my workflow does not behave as expected. I only get the Subworkflow to be ran and then, the main workflow but only the Rule "all" is executed. My Rule "generate_fastqc" is not executed at all. I would like to know where I could possibly have been wrong ?
Here is what I get :
Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Job counts:
count jobs
1 convert_bcl_to_fastq
1
[...]
Processing completed with 0 errors and 1 warnings.
Touching output file convert_bcl_to_fastq.done.
Finished job 0.
1 of 1 steps (100%) done
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T171952.799414.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1
localrule all:
input: /path/to/my/working/directory/conversion/convert_bcl_to_fastq.done
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
And when all of my FASTQ files are generated, if I run again my workflow, this time it will execute the Rule "generate_fastqc".
Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Nothing to be done.
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T174337.605716.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
95 generate_fastqc
96
I wanted my workflow to execute itself entirely by running rule "generate_fastqc" just after the completion of the subworkflow execution but I am actually forced to execute my workflow 2 times. I thought that this workflow would work since all the files needed in the second part of the workflow will be generated thanks to the subworkflow... Do you have any idea of where I could have been wrong ?
My Code
Here is my Snakefile for the main workflow :
subworkflow convert_bcl_to_fastq:
workdir: WDIR + "conversion/"
snakefile: WDIR + "conversion/Snakefile"
SAMPLES, = glob_wildcards(FASTQ_DIR + "{sample}_R1_001.fastq.gz")
rule all:
input:
convert_bcl_to_fastq("convert_bcl_to_fastq.done"),
expand(FASTQC_DIR + "{sample}_R1_001_fastqc.html", sample=SAMPLES),
expand(FASTQC_DIR + "{sample}_R2_001_fastqc.html", sample=SAMPLES)
rule generate_fastqc:
output:
FASTQC_DIR + "{sample}_R1_001_fastqc.html",
FASTQC_DIR + "{sample}_R2_001_fastqc.html",
temp(FASTQC_DIR + "{sample}_R1_001_fastqc.zip"),
temp(FASTQC_DIR + "{sample}_R2_001_fastqc.zip")
shell:
"mkdir -p "+ FASTQC_DIR +" | " #Creates a FastQC directory if it is missing
"fastqc --outdir "+ FASTQC_DIR +" "+ FASTQ_DIR +"{wildcards.sample}_R1_001.fastq.gz "+ FASTQ_DIR + " {wildcards.sample}_R2_001.fastq.gz &" #Generates FASTQC files for each sample at a time
Here is my Snakefile for the subworkflow "convert_bcl_to_fastq" :
rule all:
input:
"convert_bcl_to_fastq.done"
rule convert_bcl_to_fastq:
output:
touch("convert_bcl_to_fastq.done")
shell:
"mkdir -p "+ FASTQ_DIR +" | " #Creates a Fastq directory if it is missing
"bcl2fastq --no-lane-splitting --runfolder-dir "+ INPUT_DIR +" --output-dir "+ FASTQ_DIR #Demultiplexes and Converts BCL files to FASTQ files
Thank you in advance for your help !
The documentation about subworkflows currently states:
When executing, snakemake first tries to create (or update, if necessary)
"test.txt" (and all other possibly mentioned dependencies) by executing the subworkflow.
Then the current workflow is executed.
In your case, the only dependency declared is "convert_bcl_to_fastq.done", which Snakemake happily produces the first time.
Snakemake usually does a one-pass parsing, and the main workflow has not been told to look for sample-files from the subworkflow. Since sample-files do not exist yet during the first execution, the main workflow gets no match in the expand() statements. No match, no work to be done :-)
When you run the main workflow the second time, it finds sample-matches in the expand() of rule all: and produces them.
Side note 1: Be happy to have noticed this. Using your code, if you actually had done changes that mandated re-run of the subworkflow, Snakemake would find an old "convert_bcl_to_fastq.done" and not re-execute the subworkflow.
Side note 2: If you want to make Snakemake be less 'one-pass' it has a rule-keyword checkpoint that can be used to re-evaluate what needs to be done as consequences of rule-execution. In your case, the checkpoint would have been rule convert_bcl_to_fastq . That would mandate the rules to be in the same logical snakefile (with include permitting multiple files though)
I have a rule that runs a tool on multiple samples (some fail), I use -k option to proceed with remaining samples. However, for my next step I need to check the output of the first rule and create a single text summary file. I can not get the next rule to execute after the first rule.
I have tried various things, including rule fastQC_post having the output with wildcards. But then if I use this as input for the next rule I can't have one output file. If I use expand in the input of rule checkQC with the {sample} as the list of all samples determined at initiation this breaks as not all samples successfully reached the fastQC stage.
I really just want to be able to create the post_fastqc_reports folder in the fastQC_post rule and use this as input for my checkQC rule. OR be able to force it to run checkQC after fastQC_post has finished, but again checkpoints doesn't work as some of the jobs for fastQC_post fail.
I would like something like below: (this does not work as the output directory does not use the wildcard)
Surely there is an easier way to force rule order?
rule fastQC_post:
"""
runs FastQC on the trimmed reads
"""
input:
projectDir+batch+"_trimmed_reads/{sample}_trimmed.fq.gz"
output: directory(projectDir+batch+"post_fastqc_reports/")
log:
projectDir+"logs/{sample}_trimmed_fastqc.log"
params:
p = fastqcParams,
shell:
"""
/home/apps/pipelines/FastQC/CURRENT {params.p} -o {output} {input}
"""
rule checkQC:
input: rules.parse_sampleFile.output[0],rules.parse_sampleFile.output[1], rules.parse_sampleFile.output[2], directory(projectDir+batch+"post_fastqc_reports/")
output:
projectDir+"summaries/"+batch+"_summary_tg.txt"
, projectDir+"summaries/"+batch+"_listForFastqc.txt"
, projectDir+"summaries/"+batch+"_trimmingResults.txt"
, projectDir+"summaries/"+batch+"_summary_fq.txt"
log:
projectDir+"logs/"+stamp+"_checkQC.log"
shell:
"""
python python_scripts/fastqc_checks.py --input_file {log} --output {output[1]} {batchCmd}
python python_scripts/trimGalore_checks.py --list_file {input[0]} --single {input[1]} --pair {input[2]} --log {log} --output {output[0]} --trimDir {trimDir} --sampleFile \
{input[3]} {batchCmd}
"""
With the above I get the error that not all output and log contain same wildcard as input for the fastqc_post rule.
I just want to be able to run my checkQC rule after my fastqc_post rule (regardless of failures of jobs in the fastqc_post rule)
I want to do alignment using star and I use proxy file for star the alignment.
Without a proxy file star-align run also without reference. So if I gave as input constrain of the alignment process the presence of database.done the alignment process can start.
How can manage this situation?
rule star_index:
input:
config['references']['transcriptome_fasta']
output:
genome=config['references']['starindex_dir'],
tp=touch("database.done")
shell:
'STAR --limitGenomeGenerateRAM 54760833024 --runMode genomeGenerate --genomeDir {output.genome} --genomeFastaFiles {input}'
rule star_map:
input:
dt="trim/{sample}/",
forward_paired="trim/{sample}/{sample}_forward_paired.fq.gz",
reverse_paired="trim/{sample}/{sample}_reverse_paired.fq.gz",
forward_unpaired="trim/{sample}/{sample}_forward_unpaired.fq.gz",
reverse_unpaired="trim/{sample}/{sample}_reverse_unpaired.fq.gz",
t1p="database.done",
output:
out1="ALIGN/{sample}/Aligned.sortedByCoord.out.bam",
out2="ALIGN/{sample}/",
# out2=touch("Star.align.done")
params:
genomedir = config['references']['basepath'],
sample="mitico",
platform_unit=config['platform'],
cente=config['center']
threads: 12
log: "ALIGN/log/{params.sample}_star.log"
shell:
'mkdir -p ALIGN/;STAR --runMode alignReads --genomeDir {params.genomedir} '
r' --outSAMattrRGline ID:{params.sample} SM:{params.sample} PL:{config[platform]} PU:{params.platform_unit} CN:{params.cente} '
'--readFilesIn {input.forward_paired} {input.reverse_paired} \
--readFilesCommand zcat
--outWigType wiggle \
--outWigStrand Stranded --runThreadN {threads} --outFileNamePrefix {output.out2} 2> {log} '
How can start a module only after all the previous function have finished.
I mean.Here i create the index then I trim ll my data and then I staart the alignment. I want after finishis all this sstep for all the sample start a new function like run fastqc. How can decode this in snakemake?
thanks so much for patience help
Without any mention of the genome as a required input for "star_map", I believe the rule is starting too early.
Try moving the genome reference from being a "Parameter" to being an "Input" requirement for star_map. Snakemake doesn't wait for parameters, only inputs. All reference genomes should be listed as inputs. In fact, all required files should be listed as input requirements. Param's are just for mostly convenience; ad-hoc strings and things on the fly.
I'm not entirely sure as to the connectivity across your files, some of these references are to a YAML file you have not provided, so I cannot guarantee the code will work.
rule star_map:
input:
dt="trim/{sample}/",
forward_paired="trim/{sample}/{sample}_forward_paired.fq.gz",
reverse_paired="trim/{sample}/{sample}_reverse_paired.fq.gz",
forward_unpaired="trim/{sample}/{sample}_forward_unpaired.fq.gz",
reverse_unpaired="trim/{sample}/{sample}_reverse_unpaired.fq.gz",
# Including the gnome as a required input, so Snakemake knows to wait for it too.
genomedir = config['references']['basepath'],
output:
out1="ALIGN/{sample}/Aligned.sortedByCoord.out.bam",
out2="ALIGN/{sample}/",
Snakemake doesn't check what files your shell commands are touching and modifying. Snakemake only knows to coordinate the files described in the "input" and "output" directives.