Snakemake: variable that defines whether the process is a submitted cluster job or the main Snakefile

My current architecture is that at the start of my Snakefile I have a long-running function somefunc which helps decide the "input" to rule all. I realized when running the workflow with Slurm that somefunc is being executed by each job. Is there some variable I can access that tells whether the code is running in a submitted job or in the main process:
if not snakemake.submitted_job:
    config['layout'] = somefunc()
    ...

A solution which I don't really recommend is to make somefunc write the list of inputs to a tmp file so that slurm jobs will read this tmp file rather than reconstructing the list from scratch. The tmp file is created by whatever job is executed first so the long-running part is done only once.
At the end of the workflow delete the tmp file so that later executions will start fresh with new input.
Here's a sketch:
import os

def somefunc():
    try:
        all_output = open('tmp.txt').readlines()
        all_output = [x.strip() for x in all_output]
        print('List of input files read from tmp.txt')
    except FileNotFoundError:
        all_output = ['file1.txt', 'file2.txt'] # Long-running part
        with open('tmp.txt', 'w') as fout:
            for x in all_output:
                fout.write(x + '\n')
        print('List of input files created and written to tmp.txt')
    return all_output
all_output = somefunc()

rule all:
    input:
        all_output,

rule one:
    output:
        all_output,
    shell:
        r"""
        touch {output}
        """

onsuccess:
    os.remove('tmp.txt')

onerror:
    os.remove('tmp.txt')
Since jobs will be submitted in parallel, you should make sure that only one job writes tmp.txt and the others read it. I think the try/except above will do it but I'm not 100% sure. (You probably want a better filename than tmp.txt; see the tempfile module, and the atexit module for exit handlers.)
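For illustration, here is roughly what the atexit variant could look like. This is only a sketch, with a caveat noted in the comments:

import atexit
import os

def _remove_cache():
    # Same placeholder filename as above. Caveat: since every cluster job
    # re-parses the Snakefile, this handler would also run at the end of
    # each job's Python process, so the onsuccess/onerror hooks above are
    # probably the safer place for cleanup.
    if os.path.exists('tmp.txt'):
        os.remove('tmp.txt')

atexit.register(_remove_cache)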

As discussed with @dariober, it seems cleanest to check whether the (hidden) .snakemake directory has locks, since these do not seem to be created until the first rule starts (assuming you are not using the --nolock argument).
import os
locked = len(os.listdir(".snakemake/locks")) > 0
However this results in a problem in my case:
import time
import os

def longfunc():
    time.sleep(10)
    return range(5)

locked = len(os.listdir(".snakemake/locks")) > 0
if not locked:
    info = longfunc()

rule all:
    input:
        expand("test_{sample}", sample=info)

rule test:
    output:
        touch("test_{sample}")
    shell:
        "sleep 1"
Somehow Snakemake lets each job reinterpret the complete Snakefile, with the result that all the jobs complain that 'info is not defined'. For me it was easiest to store the results and load them for each job (pickle.dump and pickle.load).
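A minimal sketch of that pickle approach, assuming the default lock behaviour and a made-up cache filename:

import os
import pickle
import time

def longfunc():
    time.sleep(10)
    return list(range(5))

CACHE = "layout.pkl"  # hypothetical cache filename

locked = os.path.isdir(".snakemake/locks") and len(os.listdir(".snakemake/locks")) > 0
if not locked:
    # Main process: compute once and cache the result.
    info = longfunc()
    with open(CACHE, "wb") as fh:
        pickle.dump(info, fh)
else:
    # Submitted job re-parsing the Snakefile: load the cached result.
    with open(CACHE, "rb") as fh:
        info = pickle.load(fh)

rule all:
    input:
        expand("test_{sample}", sample=info)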

Related

Passing list of filenames to nextflow process

I am a newcomer to Nextflow and I am trying to process multiple files in a workflow. There are more than 300 of these files, so I would rather not paste them into the command line as options. So I have created a file listing the names of the files I need to process, but I am not sure how to pass it into the process. This is what I've tried:
params.SRRs = "srr_ids.txt"
process tmp {
input:
file ids
output:
path "*.txt"
script:
'''
while read id; do
touch ${id}.txt;
echo ${id} > ${id}.txt;
done < $ids
'''
}
workflow {
tmp(params.SRRs)
}
The script is supposed to read in the file srr_ids.txt and create files that have their IDs in them (just testing on a smaller task). The error log says that the id variable is unbound, but I don't understand why. What is the conventional way of passing lots of filenames to a pipeline? Should I write some other process that parses the list?
Maybe there's a typo in your question, but the error is actually that the ids variable is unbound:
Command error:
.command.sh: line 5: ids: unbound variable
The problem is that when you use a single-quote script string, you will not be able to access Nextflow variables in your script block. You can either define your script using a double-quote string and escape your shell variables:
params.SRRs = "srr_ids.txt"
process tmp {
input:
path ids
output:
path "*.txt"
script:
"""
while read id; do
touch "\${id}.txt"
echo "\${id}" > "\${id}.txt"
done < "${ids}"
"""
}
workflow {
SRRs = file(params.SRRs)
tmp(SRRs)
}
Or, use a shell block which uses the exclamation mark ! character as the variable placeholder for Nextflow variables. This makes it possible to use both Nextflow and shell variables in the same piece of code without having to escape each of the shell variables:
params.SRRs = "srr_ids.txt"
process tmp {
input:
path ids
output:
path "*.txt"
shell:
'''
while read id; do
touch "${id}.txt"
echo "${id}" > "${id}.txt"
done < "!{ids}"
'''
}
workflow {
SRRs = file(params.SRRs)
tmp(SRRs)
}
What is the conventional way of passing lots of filenames to a pipeline?
The conventional way, I think, is to actually supply one (or more) glob patterns to the fromPath channel factory method. For example:
params.SRRs = "./path/to/files/SRR*.fastq.gz"
workflow {
Channel
.fromPath( params.SRRs )
.view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [sleepy_bernard] DSL2 - revision: 30020008a7
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910483.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910482.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448795.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448793.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448794.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448792.fastq.gz
If instead you would prefer to pass in a list of filenames, like in your example, use either the splitCsv or the splitText operator to get what you want. For example:
params.SRRs = "srr_ids.txt"
workflow {
Channel
.fromPath( params.SRRs )
.splitText() { it.strip() }
.view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [fervent_ramanujan] DSL2 - revision: 89a1771d50
SRR1448794
SRR1448795
SRR1448792
SRR1448793
SRR1910483
SRR1910482
Should I write some other process that parses the list?
You may not need to. My feeling is that your code might benefit from using the fromSRA factory method, but we don't really have enough details to say one way or the other. If you need to, you could just write a function that returns a channel.

Snakemake variable number of files

I'm in a situation, where I would like to scatter my workflow into a variable number of chunks, which I don't know beforehand. Maybe it is easiest to explain the problem by being concrete:
Someone has handed me FASTQ files demultiplexed using bcl2fastq with the no-lane-splitting option. I would like to split these files according to lane, map each lane individually, and then finally gather everything again. However, I don't know the number of lanes beforehand.
Ideally, I would like a solution like this,
rule split_fastq_file: (...) # results in N FASTQ files
rule map_fastq_file: (...) # do this N times
rule merge_bam_files: (...) # merge the N BAM files
but I am not sure this is possible. The expand function requires me to know the number of lanes, and I can't see how it would be possible to use wildcards for this, either.
I should say that I am rather new to Snakemake, and that I may have completely misunderstood how Snakemake works. It has taken me some time to get used to thinking about things "upside-down" by focusing on output files and then working backwards.
One option is to use checkpoint when splitting the fastqs, so that you can dynamically re-evaluate the DAG at a later point to get the resulting lanes.
Here's an MWE step by step:
Setup and make an example fastq file.
# Requires Python 3.6+ for f-strings, Snakemake 5.4+ for checkpoints
import pathlib
import random

random.seed(1)

rule make_fastq:
    output:
        fastq = touch("input/{sample}.fastq")
Create a random number of lanes between 1 and 9 each with random identifier from 1 to 9. Note that we declare this as a checkpoint, rather than a rule, so that we can later access the result. Also, we declare the output here as a directory specific to the sample, so that we can later glob in it to get the lanes that were created.
checkpoint split_fastq:
    input:
        fastq = rules.make_fastq.output.fastq
    output:
        lane_dir = directory("temp/split_fastq/{sample}")
    run:
        pathlib.Path(output.lane_dir).mkdir(exist_ok=True)
        n_lanes = random.randrange(1, 10)
        lane_numbers = random.sample(range(1, 10), k=n_lanes)
        for lane_number in lane_numbers:
            path = pathlib.Path(output.lane_dir) / f"L00{lane_number}.fastq"
            path.touch()
Do some intermediate processing.
rule map_fastq:
    input:
        fastq = "temp/split_fastq/{sample}/L00{lane_number}.fastq"
    output:
        bam = "temp/map_fastq/{sample}/L00{lane_number}.bam"
    run:
        bam = pathlib.Path(output.bam)
        bam.parent.mkdir(exist_ok=True)
        bam.touch()
To merge all the processed files, we use an input function to access the lanes that were created in split_fastq, so that we can do a dynamic expand on these. We do the expand on the last rule in the chain of intermediate processing steps, in this case map_fastq, so that we ask for the correct inputs.
def get_bams(wildcards):
    lane_dir = checkpoints.split_fastq.get(**wildcards).output[0]
    lane_numbers = glob_wildcards(f"{lane_dir}/L00{{lane_number}}.fastq").lane_number
    bams = expand(rules.map_fastq.output.bam, **wildcards, lane_number=lane_numbers)
    return bams
This input function now gives us easy access to the bam files we wish to merge, however many there are, and whatever they may be called.
rule merge_bam:
    input:
        get_bams
    output:
        bam = "temp/merge_bam/{sample}.bam"
    shell:
        "cat {input} > {output.bam}"
This example runs, and with random.seed(1) happens to create three lanes (L001, L002, and L005).
If you don't want to use checkpoint, I think you could achieve something similar by creating an input function for merge_bam that opens up the original input fastq, scans the read names for lane info, and predicts what the input files ought to be. This seems less robust, however.
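As a rough sketch of that checkpoint-free alternative, assuming standard Illumina read names where the lane is the fourth colon-separated field of the header (paths follow the example above):

def get_bams_from_read_names(wildcards):
    # Scan the original fastq headers for lane numbers instead of using a
    # checkpoint, then predict the bam files that should exist.
    lanes = set()
    with open(f"input/{wildcards.sample}.fastq") as fastq:
        for i, line in enumerate(fastq):
            if i % 4 == 0:  # header lines only
                lanes.add(line.split(":")[3])
    return expand(
        "temp/map_fastq/{sample}/L00{lane_number}.bam",
        sample=wildcards.sample,
        lane_number=sorted(lanes),
    )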

Snakemake running Subworkflow but not the Rest of my workflow (goes directly to rule All)

I'm a newbie in Snakemake and on StackOverflow. Don't hesitate to tell me if something is unclear or if you want any other detail.
I have written a workflow that converts .BCL Illumina Base Calls files to demultiplexed .FASTQ files and generates QC reports (FastQC files). This workflow is composed of:
Subworkflow "convert_bcl_to_fastq" It creates FASTQ files in a directory named Fastq from BCL files. It must be executed before the main workflow, this is why I have chosen to use a subworkflow since my second rule depends on the generation of these FASTQ files which I don't know the names in advance. A fake file "convert_bcl_to_fastq.done" is created as an output in order to know when this subworkflow ran as espected.
Rule "generate_fastqc" It takes the FASTQ files generated thanks to the subworkflow and creates FASTQC files in a directory named FastQC.
Problem
When I try to run my workflow, I don't get any error, but the workflow does not behave as expected. Only the subworkflow is run, and then in the main workflow only the rule "all" is executed; my rule "generate_fastqc" is not executed at all. I would like to know where I could possibly have gone wrong.
Here is what I get :
Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Job counts:
count jobs
1 convert_bcl_to_fastq
1
[...]
Processing completed with 0 errors and 1 warnings.
Touching output file convert_bcl_to_fastq.done.
Finished job 0.
1 of 1 steps (100%) done
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T171952.799414.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1
localrule all:
input: /path/to/my/working/directory/conversion/convert_bcl_to_fastq.done
jobid: 0
Finished job 0.
1 of 1 steps (100%) done
And when all of my FASTQ files have been generated, if I run my workflow again, this time it executes the rule "generate_fastqc":
Building DAG of jobs...
Executing subworkflow convert_bcl_to_fastq.
Building DAG of jobs...
Nothing to be done.
Complete log: /path/to/my/working/directory/conversion/.snakemake/log/2020-03-12T174337.605716.snakemake.log
Executing main workflow.
Using shell: /usr/bin/bash
Provided cores: 40
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
95 generate_fastqc
96
I wanted my workflow to execute entirely in one go, running rule "generate_fastqc" right after the completion of the subworkflow, but I am actually forced to execute my workflow twice. I thought this workflow would work since all the files needed in the second part are generated by the subworkflow... Do you have any idea where I could have gone wrong?
My Code
Here is my Snakefile for the main workflow:
subworkflow convert_bcl_to_fastq:
    workdir: WDIR + "conversion/"
    snakefile: WDIR + "conversion/Snakefile"

SAMPLES, = glob_wildcards(FASTQ_DIR + "{sample}_R1_001.fastq.gz")

rule all:
    input:
        convert_bcl_to_fastq("convert_bcl_to_fastq.done"),
        expand(FASTQC_DIR + "{sample}_R1_001_fastqc.html", sample=SAMPLES),
        expand(FASTQC_DIR + "{sample}_R2_001_fastqc.html", sample=SAMPLES)

rule generate_fastqc:
    output:
        FASTQC_DIR + "{sample}_R1_001_fastqc.html",
        FASTQC_DIR + "{sample}_R2_001_fastqc.html",
        temp(FASTQC_DIR + "{sample}_R1_001_fastqc.zip"),
        temp(FASTQC_DIR + "{sample}_R2_001_fastqc.zip")
    shell:
        "mkdir -p " + FASTQC_DIR + " && "  # Creates a FastQC directory if it is missing
        "fastqc --outdir " + FASTQC_DIR + " " + FASTQ_DIR + "{wildcards.sample}_R1_001.fastq.gz " + FASTQ_DIR + "{wildcards.sample}_R2_001.fastq.gz"  # Generates FASTQC files for each sample
Here is my Snakefile for the subworkflow "convert_bcl_to_fastq":
rule all:
    input:
        "convert_bcl_to_fastq.done"

rule convert_bcl_to_fastq:
    output:
        touch("convert_bcl_to_fastq.done")
    shell:
        "mkdir -p " + FASTQ_DIR + " && "  # Creates a Fastq directory if it is missing
        "bcl2fastq --no-lane-splitting --runfolder-dir " + INPUT_DIR + " --output-dir " + FASTQ_DIR  # Demultiplexes and converts BCL files to FASTQ files
Thank you in advance for your help!
The documentation about subworkflows currently states:
When executing, snakemake first tries to create (or update, if necessary) "test.txt" (and all other possibly mentioned dependencies) by executing the subworkflow. Then the current workflow is executed.
In your case, the only dependency declared is "convert_bcl_to_fastq.done", which Snakemake happily produces the first time.
Snakemake usually does one-pass parsing, and the main workflow has not been told to look for sample files from the subworkflow. Since the sample files do not yet exist during the first execution, the main workflow gets no match in the expand() statements. No match, no work to be done :-)
When you run the main workflow the second time, it finds sample matches in the expand() of rule all and produces them.
Side note 1: Be happy to have noticed this. With your code, if you had actually made changes that mandated a re-run of the subworkflow, Snakemake would find an old "convert_bcl_to_fastq.done" and not re-execute the subworkflow.
Side note 2: If you want to make Snakemake less 'one-pass', it has a rule keyword, checkpoint, that can be used to re-evaluate what needs to be done as a consequence of rule execution. In your case, the checkpoint would be rule convert_bcl_to_fastq. That would require the rules to be in the same logical snakefile (with include permitting multiple files, though).
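For illustration, a hedged sketch of what side note 2 could look like, with the conversion declared as a checkpoint whose output is the Fastq directory (FASTQ_DIR, INPUT_DIR and FASTQC_DIR come from the question; the input function name is made up):

checkpoint convert_bcl_to_fastq:
    output:
        directory(FASTQ_DIR)
    shell:
        "bcl2fastq --no-lane-splitting --runfolder-dir " + INPUT_DIR + " --output-dir {output}"

def fastqc_targets(wildcards):
    # Re-evaluated only after the checkpoint has run, so the glob sees the
    # demultiplexed files.
    fastq_dir = checkpoints.convert_bcl_to_fastq.get().output[0]
    samples = glob_wildcards(fastq_dir + "/{sample}_R1_001.fastq.gz").sample
    return expand(FASTQC_DIR + "{sample}_R{read}_001_fastqc.html",
                  sample=samples, read=[1, 2])

rule all:
    input:
        fastqc_targets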

snakemake: how to implement log directive when using run directive?

Snakemake allows creation of a log for each rule with the log parameter, which specifies the name of the log file. It is relatively straightforward to pipe results from shell output to this log, but I am not able to figure out a way of logging the output of the run directive (i.e. Python code).
One workaround is to save the Python code in a script and then run it from the shell, but I wonder if there is another way?
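For reference, the workaround mentioned above could look something like this (hypothetical script and file names):

rule compute:
    input:
        "data.txt"
    output:
        "result.txt"
    log:
        "logs/compute.log"
    shell:
        # Redirect both stdout and stderr of the external script to the log.
        "python scripts/compute.py {input} {output} > {log} 2>&1"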
I have some rules that use both the log and run directives. In the run directive, I "manually" open and write the log file.
For instance:
# pd is pandas; OPJ is presumably os.path.join. Helper functions and path
# variables come from the author's larger workflow.
rule compute_RPM:
    input:
        counts_table = source_small_RNA_counts,
        summary_table = rules.gather_read_counts_summaries.output.summary_table,
        tags_table = rules.associate_small_type.output.tags_table,
    output:
        RPM_table = OPJ(
            annot_counts_dir,
            "all_{mapped_type}_on_%s" % genome, "{small_type}_RPM.txt"),
    log:
        log = OPJ(log_dir, "compute_RPM_{mapped_type}", "{small_type}.log"),
    benchmark:
        OPJ(log_dir, "compute_RPM_{mapped_type}", "{small_type}_benchmark.txt"),
    run:
        with open(log.log, "w") as logfile:
            logfile.write(f"Reading column counts from {input.counts_table}\n")
            counts_data = pd.read_table(
                input.counts_table,
                index_col="gene")
            logfile.write(f"Reading number of non-structural mappers from {input.summary_table}\n")
            norm = pd.read_table(input.summary_table, index_col=0).loc["non_structural"]
            logfile.write(str(norm))
            logfile.write("Computing counts per million non-structural mappers\n")
            RPM = 1000000 * counts_data / norm
            add_tags_column(RPM, input.tags_table, "small_type").to_csv(output.RPM_table, sep="\t")
For third-party code that writes to stdout, the redirect_stdout context manager may be helpful (found in https://stackoverflow.com/a/40417352/1878788, documented at https://docs.python.org/3/library/contextlib.html#contextlib.redirect_stdout).
Test snakefile, test_run_log.snakefile:
from contextlib import redirect_stdout

rule all:
    input:
        "test_run_log.txt"

rule test_run_log:
    output:
        "test_run_log.txt"
    log:
        "test_run_log.log"
    run:
        with open(log[0], "w") as log_file:
            with redirect_stdout(log_file):
                print(f"Writing result to {output[0]}")
                with open(output[0], "w") as out_file:
                    out_file.write("result\n")
Running it:
$ snakemake -s test_run_log.snakefile
Results:
$ cat test_run_log.log
Writing result to test_run_log.txt
$ cat test_run_log.txt
result
My solution was the following. This is useful both for normal logging and for logging exceptions with a traceback. You can then wrap the logger setup in a function to make it more organized. It's not very pretty, though. It would be much nicer if Snakemake could do it by itself.
import logging

# some stuff

rule logging_test:
    input: 'input.json'
    output: 'output.json'
    log: 'rules_logs/logging_test.log'
    run:
        logger = logging.getLogger('logging_test')
        logger.setLevel(logging.INFO)  # without this, the default WARNING level would suppress .info() messages
        fh = logging.FileHandler(str(log))
        fh.setLevel(logging.INFO)
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        fh.setFormatter(formatter)
        logger.addHandler(fh)
        try:
            logger.info('Starting operation!')
            # do something
            with open(str(output), 'w') as f:
                f.write('success!')
            logger.info('Ended!')
        except Exception as e:
            logger.error(e, exc_info=True)
            raise  # re-raise so that Snakemake still marks the job as failed
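Wrapping the setup in a helper, as suggested, might look like this (the function name is made up):

def setup_logger(name, log_path, level=logging.INFO):
    # Configure a file logger once and return it.
    logger = logging.getLogger(name)
    logger.setLevel(level)
    fh = logging.FileHandler(log_path)
    fh.setLevel(level)
    fh.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
    logger.addHandler(fh)
    return logger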

snakemake STAR module issue and extra question

I discovered that the snakemake STAR module outputs as 'BAM Unsorted'.
Q1: Is there a way to change this to:
--outSAMtype BAM SortedByCoordinate
When I add the option in the 'extra' options I get an error message about a duplicate definition:
EXITING: FATAL INPUT ERROR: duplicate parameter "outSAMtype" in input "Command-Line"
SOLUTION: keep only one definition of input parameters in each input source
Nov 15 09:46:07 ...... FATAL ERROR, exiting
logs/star/se/UY2_S7.log
Should I consider adding a sorting module behind STAR instead?
Q2: How can I take a module from the wrapper repo and make it a local module, allowing me to edit it?
The code:
__author__ = "Johannes Köster"
__copyright__ = "Copyright 2016, Johannes Köster"
__email__ = "koester@jimmy.harvard.edu"
__license__ = "MIT"

import os
from snakemake.shell import shell

extra = snakemake.params.get("extra", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

fq1 = snakemake.input.get("fq1")
assert fq1 is not None, "input-> fq1 is a required input parameter"
fq1 = [snakemake.input.fq1] if isinstance(snakemake.input.fq1, str) else snakemake.input.fq1
fq2 = snakemake.input.get("fq2")
if fq2:
    fq2 = [snakemake.input.fq2] if isinstance(snakemake.input.fq2, str) else snakemake.input.fq2
    assert len(fq1) == len(fq2), "input-> equal number of files required for fq1 and fq2"
input_str_fq1 = ",".join(fq1)
input_str_fq2 = ",".join(fq2) if fq2 is not None else ""
input_str = " ".join([input_str_fq1, input_str_fq2])

if fq1[0].endswith(".gz"):
    readcmd = "--readFilesCommand zcat"
else:
    readcmd = ""

outprefix = os.path.dirname(snakemake.output[0]) + "/"

shell(
    "STAR "
    "{extra} "
    "--runThreadN {snakemake.threads} "
    "--genomeDir {snakemake.params.index} "
    "--readFilesIn {input_str} "
    "{readcmd} "
    "--outSAMtype BAM Unsorted "
    "--outFileNamePrefix {outprefix} "
    "--outStd Log "
    "{log}")
Q1:Is there a way to change this to:
--outSAMtype BAM SortedByCoordinate
I would add another sorting rule after the wrapper, as that is the most 'standardized' way of doing it. You can also use another wrapper for sorting.
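A minimal sketch of such a sorting rule, assuming samtools is available and using the output path of the star_se example further down:

rule sort_bam:
    input:
        "star/{sample}/Aligned.out.bam"
    output:
        "star/{sample}/Aligned.sortedByCoord.out.bam"
    threads: 4
    shell:
        # Sort the unsorted STAR output by coordinate.
        "samtools sort -@ {threads} -o {output} {input}"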
There is an explanation from the author of snakemake for the reason why the default is unsorted and why there is no option for sorted output in the wrapper:
https://bitbucket.org/snakemake/snakemake/issues/440/pre-post-wrapper
Regarding the SAM/BAM issue, I would say any wrapper should always output the optimal file format. Hence, whenever I write a wrapper for a read mapper, I ensure that output is not SAM. Indexing and sorting should not be part of the same wrapper I think, because such a task has a completely different behavior regarding parallelization. Also, you would lose the mapping output if something goes wrong during the sorting or indexing.
Q2: How can I take a module from the wrapper repo and make it a local module, allowing me to edit it?
If you wanted to do this, one way would be to download a local copy of the wrapper. In the shell portion of the downloaded wrapper, change Unsorted to {snakemake.params.outsamtype}. In your Snakefile, change wrapper to script, point it at path/to/downloaded/wrapper, and add the outsamtype parameter:
rule star_se:
    input:
        fq1 = "reads/{sample}_R1.1.fastq"
    output:
        # see STAR manual for additional output files
        "star/{sample}/Aligned.out.bam"
    log:
        "logs/star/{sample}.log"
    params:
        # path to STAR reference genome index
        index = "index",
        # optional parameters
        extra = "",
        outsamtype = "SortedByCoordinate"
    threads: 8
    script:
        "path/to/downloaded/wrapper"
I think a separate rule without a wrapper for sorting, or even writing your own STAR rule, is better. Modifying the wrapper defeats the whole purpose of it.