Erroneous MissingOutputException errors on Google Cloud/Kubernetes - snakemake

Executing snakemake with --kubernetes on GCP, I am running into erroneous MissingOutputException errors. Looking at the logs, it seems that the jobs ended successfully and the output files were successfully uploaded to the bucket. The reported missing files appear to be intact and look as expected. Unfortunately I have not been able to reliably reproduce this issue, so it's difficult to determine what the cause may be. I have tried increasing --latency-wait to 900, which did not help.
Would appreciate any insight on how Snakemake determines what files may be missing as that seems to be the best place to start. Digging through the source code myself I could not quite figure it out.
Edit 2/23/22, adding example rule:
rule dedup:
    input:
        get_bams_for_dedup
    output:
        dedupBam = config['output'] + "{Organism}/{refGenome}/" + config['bamDir'] + "{sample}" + config['bam_suffix'],
        dedupBai = config['output'] + "{Organism}/{refGenome}/" + config['bamDir'] + "{sample}" + "_final.bam.bai",
    conda:
        "../envs/sambamba.yml"
    resources:
        threads = res_config['dedup']['threads'],
        mem_mb = lambda wildcards, attempt: attempt * res_config['dedup']['mem']
    log:
        "logs/{Organism}/dedup/{refGenome}_{sample}.txt"
    benchmark:
        "benchmarks/{Organism}/dedup/{refGenome}_{sample}.txt"
    shell:
        "sambamba markdup -t {threads} {input} {output.dedupBam} 2> {log}"
This issue also leads to an IncompleteFilesException when trying to restart the workflow, which doesn't make sense: when Snakemake runs on Kubernetes, the output file is uploaded to the bucket when the job finishes, so if the output files are in the bucket, the job must have completed successfully.
There seems to be something going on with how Snakemake determines whether an output file in the bucket is 'incomplete'. I imagine it may have to do with the timestamp of the file versus the timestamp of when the Kubernetes job that creates it was submitted, but I'm not sure. I would appreciate feedback.

Related

Snakemake is unable to match wildcard although it's defined and even suggested

I am still very confused about the wildcards concept despite reading the full docs and a few examples, so maybe someone can shed light on this weird behaviour. It might be a bug but it's such a basic example that I am pretty sure I am doing or understanding something wrong.
Here is my Snakefile which should generate a bunch of files defined in a dictionary where the location of the files is stored (those can be served by all kinds of data providers like iRODS, XRootD etc., but it's not important now).
import os

some_files = {
    "foo": "some_location/foo",
    "bar": "another_location/bar",
    "baz": "yet_another_loc/baz"
}

rule all:
    input: ["raw/" + os.path.basename(f) for f in some_files.keys()]

rule generate_files:
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {some_files[wildcards.fname]} > {output}"
As you can see, I need to use a similar "trick" which was proposed in my previous question (Array of values as input in Snakemake workflows) to force the recognition of the files by adding a rule and listing those (in rule all), which works nicely.
The rule generate_files should then generate (retrieve) those by using the corresponding URL and protocol defined in some_files. For the sake of simplicity, it's now just echoing the origin into the output file.
To achieve this, I thought I could simply use wildcards.fname in the shell section, but when I run the workflow, I get:
░ tamasgal#silentbox-(2):PhD/snakemake  master ●●● snakemake took 16s
░ 08:47:35 > snakemake -c1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
all                   1              1              1
generate_files        3              1              1
total                 4              1              1
Select jobs to execute...
[Fri Feb 18 08:47:38 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'wildcards.fname' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
If I use fname (and not wildcards.fname), Snakemake proposes to use wildcards.fname, which again, does not work. Here is the output when running with fname in output:
[Fri Feb 18 08:47:48 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'fname' is unknown in this context. Did you mean 'wildcards.fname'?
Why is this happening? The output of the workflow clearly shows that wildcards: fname=bar, so it exists and is defined. Is this a bug?
Hm, you may have to try and get at some_files[wildcards.fname] outside of the shell part? It looks to me like it can tell what the wildcard is supposed to be for the output to be raw/bar, but it can't handle using it to access the dict in the shell part. It seems like this could be handled with an input function to me.
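To illustrate why the lookup fails, here is a minimal sketch in plain Python, outside Snakemake. It assumes that Snakemake's shell formatting follows the same field syntax as Python's str.format, which the error message suggests but which I have not checked in the source. The Wildcards class is a hypothetical stand-in, not Snakemake's real object. In a format field, the text inside [...] is taken as a literal string key, not evaluated as an expression:

```python
some_files = {"foo": "some_location/foo", "bar": "another_location/bar"}


class Wildcards:
    """Hypothetical stand-in for Snakemake's wildcards object."""

    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)


wildcards = Wildcards(fname="bar")

# Attribute access is supported inside a format field.
attr_result = "{wildcards.fname}".format(wildcards=wildcards)

# Inside [...] the text is used as the literal key "wildcards.fname",
# so the dictionary lookup fails even though wildcards.fname is defined.
try:
    "{some_files[wildcards.fname]}".format(some_files=some_files, wildcards=wildcards)
    lookup_error = None
except KeyError as exc:
    lookup_error = exc.args[0]

# Resolving the lookup in Python first and formatting the result works.
resolved = "echo grabbed file from {src}".format(src=some_files[wildcards.fname])

print(attr_result)
print(lookup_error)
print(resolved)
```

So the wildcard itself is fine; it is the dictionary lookup that has to happen in Python code (an input function or a params lambda) rather than inside the braces.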
Off the top of my head:
rule generate_files:
    input:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {input.some_file} > {output}"
EDIT: if it fails because the file isn't local so Snakemake can't find it, you may supply the path to it as a parameter instead:
rule generate_files:
    params:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {params.some_file} > {output}"

Snakemake running Subworkflow but not the Rest of my workflow (goes directly to rule All)

I'm a newbie in Snakemake and on StackOverflow. Don't hesitate to tell me if something is unclear or if you want any other detail.
I have written a workflow that converts Illumina BCL base call files to demultiplexed FASTQ files and generates a QC report (FastQC files). This workflow is composed of:
Subworkflow "convert_bcl_to_fastq": It creates FASTQ files in a directory named Fastq from BCL files. It must be executed before the main workflow, which is why I chose to use a subworkflow: my second rule depends on these FASTQ files, whose names I don't know in advance. A dummy file "convert_bcl_to_fastq.done" is created as output so that I know the subworkflow ran as expected.
Rule "generate_fastqc": It takes the FASTQ files generated by the subworkflow and creates FastQC files in a directory named FastQC.
Problem
When I run my workflow, I don't get any error, but it does not behave as expected. The subworkflow runs, and then the main workflow runs, but only the rule "all" is executed. My rule "generate_fastqc" is not executed at all. Where could I have gone wrong?
Here is what I get :
Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Job counts:
count jobs
1 convert_bcl_to_fastq
1
[...]
Processing completed with 0 errors and 1 warnings.
Touching output file convert_bcl_to_fastq.done.
Finished job 0.
1 of 1 steps (100%) done
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T171952.799414.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1
localrule all:
input: /path/to/my/working/directory/conversion/convert_bcl_to_fastq.done
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
And once all of my FASTQ files have been generated, if I run my workflow again, this time it executes the rule "generate_fastqc".
Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Nothing to be done.
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T174337.605716.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
95 generate_fastqc
96
I wanted my workflow to run in one go, with rule "generate_fastqc" starting right after the subworkflow completes, but I am actually forced to run my workflow twice. I thought this workflow would work since all the files needed in the second part are generated by the subworkflow. Do you have any idea where I went wrong?
My Code
Here is my Snakefile for the main workflow :
subworkflow convert_bcl_to_fastq:
    workdir: WDIR + "conversion/"
    snakefile: WDIR + "conversion/Snakefile"

SAMPLES, = glob_wildcards(FASTQ_DIR + "{sample}_R1_001.fastq.gz")

rule all:
    input:
        convert_bcl_to_fastq("convert_bcl_to_fastq.done"),
        expand(FASTQC_DIR + "{sample}_R1_001_fastqc.html", sample=SAMPLES),
        expand(FASTQC_DIR + "{sample}_R2_001_fastqc.html", sample=SAMPLES)

rule generate_fastqc:
    output:
        FASTQC_DIR + "{sample}_R1_001_fastqc.html",
        FASTQC_DIR + "{sample}_R2_001_fastqc.html",
        temp(FASTQC_DIR + "{sample}_R1_001_fastqc.zip"),
        temp(FASTQC_DIR + "{sample}_R2_001_fastqc.zip")
    shell:
        "mkdir -p "+ FASTQC_DIR +" | " #Creates a FastQC directory if it is missing
        "fastqc --outdir "+ FASTQC_DIR +" "+ FASTQ_DIR +"{wildcards.sample}_R1_001.fastq.gz "+ FASTQ_DIR + " {wildcards.sample}_R2_001.fastq.gz &" #Generates FASTQC files for each sample at a time
Here is my Snakefile for the subworkflow "convert_bcl_to_fastq" :
rule all:
    input:
        "convert_bcl_to_fastq.done"

rule convert_bcl_to_fastq:
    output:
        touch("convert_bcl_to_fastq.done")
    shell:
        "mkdir -p "+ FASTQ_DIR +" | " #Creates a Fastq directory if it is missing
        "bcl2fastq --no-lane-splitting --runfolder-dir "+ INPUT_DIR +" --output-dir "+ FASTQ_DIR #Demultiplexes and Converts BCL files to FASTQ files
Thank you in advance for your help !
The documentation about subworkflows currently states:
When executing, snakemake first tries to create (or update, if necessary)
"test.txt" (and all other possibly mentioned dependencies) by executing the subworkflow.
Then the current workflow is executed.
In your case, the only dependency declared is "convert_bcl_to_fastq.done", which Snakemake happily produces the first time.
Snakemake usually does a one-pass parsing, and the main workflow has not been told to look for sample-files from the subworkflow. Since sample-files do not exist yet during the first execution, the main workflow gets no match in the expand() statements. No match, no work to be done :-)
When you run the main workflow the second time, it finds sample-matches in the expand() of rule all: and produces them.
Side note 1: Be happy to have noticed this. With your code, if you had made changes that required re-running the subworkflow, Snakemake would find the old "convert_bcl_to_fastq.done" and not re-execute the subworkflow.
Side note 2: If you want Snakemake to be less 'one-pass', it has the rule keyword checkpoint, which makes it re-evaluate what needs to be done after that rule has executed. In your case, the checkpoint would be rule convert_bcl_to_fastq. That would require the rules to be in the same logical Snakefile (although include lets you split it across multiple files).
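The one-pass behaviour can be illustrated outside Snakemake with a small plain-Python sketch. The helper list_samples and the file names "A" and "B" are hypothetical; the first call plays the role of glob_wildcards() being evaluated while the Snakefile is parsed, the second plays the role of a lookup that is deferred until the conversion step has finished (in Snakemake, that deferred lookup would typically live in an input function that calls checkpoints.convert_bcl_to_fastq.get()):

```python
import glob
import os
import tempfile


def list_samples(fastq_dir):
    """Return sample names for files named <sample>_R1_001.fastq.gz."""
    suffix = "_R1_001.fastq.gz"
    paths = glob.glob(os.path.join(fastq_dir, "*" + suffix))
    return sorted(os.path.basename(p)[: -len(suffix)] for p in paths)


with tempfile.TemporaryDirectory() as fastq_dir:
    # "Parse time": the conversion step has not run yet, so nothing matches.
    eager_samples = list_samples(fastq_dir)

    # Stand-in for the conversion step creating FASTQ files.
    for name in ("A", "B"):
        open(os.path.join(fastq_dir, name + "_R1_001.fastq.gz"), "w").close()

    # Deferred lookup: evaluated only after the files exist.
    deferred_samples = list_samples(fastq_dir)

print(eager_samples)
print(deferred_samples)
```

The eager list is empty, which is why the first run schedules no generate_fastqc jobs, while the deferred list contains both samples.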

Running parallel instances of a single job/rule on Snakemake

Inexperienced, self-taught "coder" here, so please be understanding :]
I am trying to learn and use Snakemake to construct a pipeline for my analysis. Unfortunately, I am unable to run multiple instances of a single job/rule at the same time. My workstation is not a computing cluster, so I cannot use that option. I looked for an answer for hours, but either there is none, or I am not knowledgeable enough to understand it.
So: is there a way to run multiple instances of a single job/rule simultaneously?
If you would like a concrete example:
Let's say I want to analyze a set of 4 .fastq files using the fastqc tool. So I run the command:
time snakemake -j 32
and thus run my code, which is:
SAMPLES, = glob_wildcards("{x}.fastq.gz")

rule Raw_Fastqc:
    input:
        expand("{x}.fastq.gz", x=SAMPLES)
    output:
        expand("./{x}_fastqc.zip", x=SAMPLES),
        expand("./{x}_fastqc.html", x=SAMPLES)
    shell:
        "fastqc {input}"
I would expect snakemake to run as many instances of fastqc as possible on 32 threads (so easily all 4 of my input files at once). In reality, this command takes about 12 minutes to finish. Meanwhile, using GNU parallel from inside snakemake
    shell:
        "parallel fastqc ::: {input}"
I get results in 3 minutes. Clearly there is some untapped potential here.
Thanks!
If I am not wrong, fastqc works on each fastq file separately. Because your rule lists every sample in its input and output via expand(), Snakemake sees a single job that processes all files, so it cannot take advantage of Snakemake's parallelization. Instead, write the rule for one sample using a wildcard, and request all the targets in rule all:
SAMPLES, = glob_wildcards("{x}.fastq.gz")

rule all:
    input:
        expand("./{sample_name}_fastqc.{ext}",
               sample_name=SAMPLES, ext=['zip', 'html'])

rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "./{x}_fastqc.zip",
        "./{x}_fastqc.html"
    shell:
        "fastqc {input}"
To add to JeeYem's answer above, you can also define the number of threads to reserve for each job using the 'threads' directive of each rule, like so:
rule Raw_Fastqc:
    input:
        "{x}.fastq.gz"
    output:
        "./{x}_fastqc.zip",
        "./{x}_fastqc.html"
    threads: 4
    shell:
        "fastqc --threads {threads} {input}"
Because fastqc itself can use multiple threads per task, you might even get additional speedups over the parallel implementation.
Snakemake will then automatically allocate as many jobs as can fit within the total threads provided by the top-level call:
snakemake -j 32, for example, would execute up to 8 instances of the Raw_Fastqc rule.
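The arithmetic behind that number can be sketched as follows. This is a simplified model, not Snakemake's scheduler: the function name max_concurrent_jobs is made up for illustration, it only considers one rule, and it ignores any other resource limits. It does account for the "Rules claiming more threads will be scaled down" behaviour by capping a job's threads at the available cores:

```python
def max_concurrent_jobs(cores, threads_per_job):
    """Upper bound on simultaneous jobs of one rule, given -j/--cores.

    Simplified model: a job never gets more threads than the cores
    provided, and jobs are packed until the cores are used up.
    """
    effective_threads = min(threads_per_job, cores)
    return cores // effective_threads


print(max_concurrent_jobs(32, 4))  # the example from the answer
print(max_concurrent_jobs(32, 1))  # default of one thread per job
print(max_concurrent_jobs(2, 4))   # threads scaled down to 2 cores
```

With only 4 input files, of course, no more than 4 of those slots would actually be used.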

Proxy file on snakemake code

I want to do alignment using STAR, and I use a proxy (flag) file to control when the alignment starts.
Without the proxy file, the STAR alignment runs even when the reference index has not been built. So if I give the presence of database.done as an input constraint for the alignment rule, the alignment can start only once it exists.
How can I manage this situation?
rule star_index:
    input:
        config['references']['transcriptome_fasta']
    output:
        genome=config['references']['starindex_dir'],
        tp=touch("database.done")
    shell:
        'STAR --limitGenomeGenerateRAM 54760833024 --runMode genomeGenerate --genomeDir {output.genome} --genomeFastaFiles {input}'

rule star_map:
    input:
        dt="trim/{sample}/",
        forward_paired="trim/{sample}/{sample}_forward_paired.fq.gz",
        reverse_paired="trim/{sample}/{sample}_reverse_paired.fq.gz",
        forward_unpaired="trim/{sample}/{sample}_forward_unpaired.fq.gz",
        reverse_unpaired="trim/{sample}/{sample}_reverse_unpaired.fq.gz",
        t1p="database.done",
    output:
        out1="ALIGN/{sample}/Aligned.sortedByCoord.out.bam",
        out2="ALIGN/{sample}/",
        # out2=touch("Star.align.done")
    params:
        genomedir = config['references']['basepath'],
        sample="mitico",
        platform_unit=config['platform'],
        cente=config['center']
    threads: 12
    log: "ALIGN/log/{params.sample}_star.log"
    shell:
        'mkdir -p ALIGN/;STAR --runMode alignReads --genomeDir {params.genomedir} '
        r' --outSAMattrRGline ID:{params.sample} SM:{params.sample} PL:{config[platform]} PU:{params.platform_unit} CN:{params.cente} '
        '--readFilesIn {input.forward_paired} {input.reverse_paired} \
        --readFilesCommand zcat
        --outWigType wiggle \
        --outWigStrand Stranded --runThreadN {threads} --outFileNamePrefix {output.out2} 2> {log} '
How can I start a step only after all the previous steps have finished?
I mean: here I create the index, then I trim all my data, and then I start the alignment. After all these steps have finished for all the samples, I want to start a new step, such as running fastqc. How can I encode this in Snakemake?
Thanks so much for your patience and help.
Without any mention of the genome as a required input for "star_map", I believe the rule is starting too early.
Try moving the genome reference from being a "Parameter" to being an "Input" requirement for star_map. Snakemake doesn't wait for parameters, only inputs. All reference genomes should be listed as inputs. In fact, all required files should be listed as input requirements. Params are mostly just for convenience: ad-hoc strings and values computed on the fly.
I'm not entirely sure as to the connectivity across your files, some of these references are to a YAML file you have not provided, so I cannot guarantee the code will work.
rule star_map:
    input:
        dt="trim/{sample}/",
        forward_paired="trim/{sample}/{sample}_forward_paired.fq.gz",
        reverse_paired="trim/{sample}/{sample}_reverse_paired.fq.gz",
        forward_unpaired="trim/{sample}/{sample}_forward_unpaired.fq.gz",
        reverse_unpaired="trim/{sample}/{sample}_reverse_unpaired.fq.gz",
        # Including the genome as a required input, so Snakemake knows to wait for it too.
        genomedir = config['references']['basepath'],
    output:
        out1="ALIGN/{sample}/Aligned.sortedByCoord.out.bam",
        out2="ALIGN/{sample}/",
Snakemake doesn't check what files your shell commands are touching and modifying. Snakemake only knows to coordinate the files described in the "input" and "output" directives.
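As a toy model of that point (this is not Snakemake's actual scheduler, which works from the rule graph; job_is_ready is a hypothetical helper), a job can be thought of as ready once every declared input exists. Anything that only appears in params is never looked at, so a job can start before the index is built:

```python
import os
import tempfile


def job_is_ready(input_paths):
    """Toy readiness check: a job may start once every declared input exists."""
    return all(os.path.exists(path) for path in input_paths)


with tempfile.TemporaryDirectory() as workdir:
    reads = os.path.join(workdir, "sample_forward_paired.fq.gz")
    index_flag = os.path.join(workdir, "database.done")

    # Trimming has finished, indexing has not.
    open(reads, "w").close()

    # Index only referenced through params: the check never sees it.
    ready_with_params_only = job_is_ready([reads])

    # Index flag declared as an input: the job has to wait.
    ready_before_index = job_is_ready([reads, index_flag])

    # Stand-in for star_index finishing and touching the flag file.
    open(index_flag, "w").close()
    ready_after_index = job_is_ready([reads, index_flag])

print(ready_with_params_only, ready_before_index, ready_after_index)
```

The first check passes too early, the second holds the job back, and the third passes once the flag file exists.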

Hadoop jobs getting poor locality

I have some fairly simple Hadoop streaming jobs that look like this:
yarn jar /usr/lib/hadoop-mapreduce/hadoop-streaming-2.2.0.2.0.6.0-101.jar \
    -files hdfs:///apps/local/count.pl \
    -input /foo/data/bz2 \
    -output /user/me/myoutput \
    -mapper "cut -f4,8 -d," \
    -reducer count.pl \
    -combiner count.pl
The count.pl script is just a simple script that accumulates counts in a hash and prints them out at the end - the details are probably not relevant but I can post it if necessary.
The input is a directory containing 5 files encoded with bz2 compression, roughly the same size as each other, for a total of about 5GB (compressed).
When I look at the running job, it has 45 mappers, but they're all running on one node. The particular node changes from run to run, but always only one node. Therefore I'm achieving poor data locality as data is transferred over the network to this node, and probably achieving poor CPU usage too.
The entire cluster has 9 nodes, all the same basic configuration. The blocks of the data for all 5 files are spread out among the 9 nodes, as reported by the HDFS Name Node web UI.
I'm happy to share any requested info from my configuration, but this is a corporate cluster and I don't want to upload any full config files.
It looks like this previous thread [ why map task always running on a single node ] is relevant but not conclusive.
EDIT: at #jtravaglini's suggestion I tried the following variation and saw the same problem - all 45 map jobs running on a single node:
yarn jar \
    /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.6.0-101.jar \
    wordcount /foo/data/bz2 /user/me/myoutput
At the end of the output of that task in my shell, I see:
Launched map tasks=45
Launched reduce tasks=1
Data-local map tasks=18
Rack-local map tasks=27
which is the number of data-local tasks you'd expect to see on one node just by chance alone.
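A rough back-of-the-envelope check of that "by chance" claim, as a sketch: it assumes a replication factor of 3 (the HDFS default; I have not confirmed the cluster's setting) and that block replicas are spread uniformly and independently over the 9 nodes, which is only an approximation of HDFS placement.

```python
import math

nodes = 9
replication = 3      # assumption: HDFS default replication factor
map_tasks = 45       # from the job counters
observed_local = 18  # "Data-local map tasks" from the job counters

# Chance that a given block has a replica on the one node running the maps.
p_local = replication / nodes

# Expected number of data-local tasks and its binomial standard deviation.
expected_local = map_tasks * replication / nodes
std_dev = math.sqrt(map_tasks * p_local * (1 - p_local))

# How far the observed count is from the expectation, in standard deviations.
z_score = (observed_local - expected_local) / std_dev

print(expected_local, round(std_dev, 2), round(z_score, 2))
```

Under these assumptions the expectation is about 15 data-local tasks with a standard deviation of roughly 3, so 18 is within one standard deviation of what random placement alone would give.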