What are the different methods used for naming Snakemake pipeline output files that depend on multiple variables?

I wrote a Snakemake pipeline that is intended to be rerun with different variables, provided by the user in a new config file for each run.
config.yml:
param_a: 100 #filter dataset rule1
param_b: 200 #filter sample rule2
param_c: 300 #filter sample again rule3
config2.yml:
param_a: 150 #100->150
param_b: 200
param_c: 300
Snakefile:
rule rule1:
    # dataset is filtered by param_a
    output: "{dataset}_{param_a}/{sample}"

rule rule2:
    # sample is filtered by param_b
    output: "{dataset}_{param_a}/{sample}_{param_b}"

rule rule3:
    # sample is then filtered by param_c
    output: "{dataset}_{param_a}/{sample}_{param_b}_{param_c}"
The aim is to make it possible for the user to rerun the analyses with different options at different steps, without having to rerun everything up to the step where the parameter changed.
When there are too many such parameters, the directory and file names start to get too long, e.g.:
dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase100/mysample.bam
dataset1/sample-minSize200_samtools-F4-F1024-q20_mosdepth-minDepth4-maxDepth100_bedtools-merge-gap200_angsd-minQ20_loci-maxBase200/mysample.bam
Is there any method for easier and more efficient naming, such as automatically creating version names and saving the parameter details to a text file?
I read about the shadow directory feature but I don't think it does what I am looking for.

If you want to be very fancy, you could encode the params into a SHA hash (or similar) and use that for the filename, recording the hash and the parameter values in a table. You just need a function that takes keyword params, translates them to the hash, and uses it for all your rule inputs. If I were you, though, I would use directories instead of flat filenames:
dataset1/sample-minSize200/samtools-F4-F1024-q20/mosdepth-minDepth4-maxDepth100/bedtools-merge-gap200/angsd-minQ20/loci-maxBase100/mysample.bam
That would make it easier to discard an entire parameter set you don't need anymore, and it will make directory listings faster.
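A minimal sketch of the hashing idea, assuming the parameter names from the question's config; param_id, param_sets.tsv and RUN_ID are made-up names for illustration, not an established Snakemake feature:

import hashlib
import json
import os

def param_id(registry="param_sets.tsv", **params):
    # Sort the keys so the same parameter set always yields the same hash
    blob = json.dumps(params, sort_keys=True)
    pid = hashlib.sha1(blob.encode()).hexdigest()[:8]
    line = pid + "\t" + blob + "\n"
    # Record the hash -> parameters mapping, once per parameter set
    if not os.path.exists(registry) or line not in open(registry).read():
        with open(registry, "a") as fh:
            fh.write(line)
    return pid

# In the Snakefile, build one short run ID from the current config
RUN_ID = param_id(**{k: config[k] for k in ("param_a", "param_b", "param_c")})

rule rule3:
    output: "{dataset}_" + RUN_ID + "/{sample}"

Each line of the registry then tells you which parameter values the eight-character ID stands for.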

Snakemake, RNA-seq: How can I execute one subpart of a pipeline or another based on the characteristics of the sample being analysed?

I am using Snakemake to design an RNA-seq data analysis pipeline. While I've managed to do that, I want my pipeline to be as adaptable as possible and able to deal with single-end (SE) data or paired-end (PE) data within the same run of analyses, instead of analysing SE data in one run and PE data in another.
My pipeline is supposed to be designed like this:
dataset download that gives 1 file (SE data) or 2 files (PE data) -->
set of rules A specific to 1 file OR set of rules B specific to 2 files -->
rule that takes 1 or 2 input files and merges it/them
into a single output -->
final set of rules.
Note: all rules of A have 1 input and 1 output, all rules of B have 2 inputs and 2 outputs, and their respective commands look like:
1 input : somecommand -i {input} -o {output}
2 inputs : somecommand -i1 {input1} -i2 {input2} -o1 {output1} -o2 {output2}
Note 2: apart from their differences in inputs/outputs, all rules of sets A and B have the same commands, parameters, etc.
In other words, I want my pipeline to be able to switch between executing rule set A or rule set B depending on the sample, either by giving it information on each sample in a config file at the start (sample 1 is SE, sample 2 is PE... this is known beforehand) or by asking Snakemake to count the number of files after the dataset download and choose the proper next set of rules for each sample. If you see another way to do that, you're welcome to tell me about it.
I thought about using checkpoints, input functions and if/else statements, but I haven't managed to solve my problem with these.
Do you have any hints/advice/ways to make that "switch" happen?
If you know the layout beforehand, then the easiest way would be to store it in some variable, something like this (or alternatively you read this from a config file into a dictionary):
layouts = {"sample1": "paired", "sample2": "single", ... etc}
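If the layout information lives in a sample sheet rather than the config file, a quick way to build that dictionary could be the following (the file name and column names are assumptions):

import pandas as pd

# assumed sample sheet with a 'sample' column and a 'layout' column ("single"/"paired")
samples = pd.read_table("samples.tsv", index_col="sample")
layouts = samples["layout"].to_dict()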
What you can then do is "merge" your rules like this (I am guessing you are talking about trimming and alignment, so that's my example):
ruleorder: B > A

rule A:
    input:
        "{sample}.fastq.gz"
    output:
        "trimmed_{sample}.fastq.gz"
    shell:
        "somecommand -i {input} -o {output}"

rule B:
    input:
        input1="{sample}_R1.fastq.gz",
        input2="{sample}_R2.fastq.gz"
    output:
        output1="trimmed_{sample}_R1.fastq.gz",
        output2="trimmed_{sample}_R2.fastq.gz"
    shell:
        "somecommand -i1 {input.input1} -i2 {input.input2} -o1 {output.output1} -o2 {output.output2}"

def get_fastqs(wildcards):
    output = dict()
    if layouts[wildcards.sample] == "single":
        output["input"] = f"trimmed_{wildcards.sample}.fastq.gz"
    elif layouts[wildcards.sample] == "paired":
        output["input1"] = f"trimmed_{wildcards.sample}_R1.fastq.gz"
        output["input2"] = f"trimmed_{wildcards.sample}_R2.fastq.gz"
    return output

rule alignment:
    input:
        unpack(get_fastqs)
    output:
        "somepath/{sample}.bam"
    shell:
        ...
There is a lot of stuff going on here.
First of all, you need a ruleorder so Snakemake knows how to handle ambiguous cases.
Rules A and B both have to exist (unless you do something hacky with the output files).
The alignment rule needs an input function to determine which input it requires.
Some self-promotion: I made a snakemake pipeline which does many things, including RNA-seq and downloading of samples online and automatically determining their layout (single-end vs paired-end). Please take a look and see if it solves your problem: https://vanheeringen-lab.github.io/seq2science/content/workflows/rna_seq.html
EDIT:
When you say “merging” rules, do you mean rule A, B and alignment ?
That was unclear wording on my part. With "merging" I meant merging the single-end and paired-end logic together, so you can continue with a single rule (e.g. a count table, you name it).
Rule order: why did you choose B > A? To make sure that paired samples don't end up running in the single-end rules?
Exactly! When a rule needs trimmed_sample1_R1.fastq.gz, how would Snakemake know the name of your sample? Is the sample name sample1, or is it sample1_R1? It could be either, and that makes Snakemake complain that it does not know how to resolve this. When you add a ruleorder you tell Snakemake: when it is unclear, resolve in this order.
The command in the alignment rule needs 1 or 2 inputs. I intend to use an if/else in the params directive to choose the inputs. Am I correct to think that? (I think you did that as well in your pipeline.)
Yes, that's the way we solved it. We did it that way because we want every rule to have its own environment. If you do not use a separate conda environment for alignment, then you can do it cleaner/prettier, like so:
rule alignment:
    input:
        unpack(get_fastqs)
    output:
        "somepath/{sample}.bam"
    run:
        if layouts[wildcards.sample] == "single":
            shell("single-end command")
        if layouts[wildcards.sample] == "paired":
            shell("paired-end command")
I feel this option is much clearer than what we did in the seq2science pipeline. However, in seq2science we support many different aligners, and they all have a different conda environment, so the run directive cannot be used.
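For completeness, a rough sketch of the params-style alternative hinted at above, for the case where the rule does need its own conda environment. The helper name aln_inputs, the environment file envs/aligner.yaml and the command line are placeholders, not taken from seq2science:

def aln_inputs(wildcards):
    # render the read arguments for this sample's layout
    if layouts[wildcards.sample] == "single":
        return f"-i trimmed_{wildcards.sample}.fastq.gz"
    return (f"-i1 trimmed_{wildcards.sample}_R1.fastq.gz "
            f"-i2 trimmed_{wildcards.sample}_R2.fastq.gz")

rule alignment:
    input:
        unpack(get_fastqs)              # keeps the DAG dependencies correct
    output:
        "somepath/{sample}.bam"
    params:
        reads=lambda wildcards: aln_inputs(wildcards)
    conda:
        "envs/aligner.yaml"             # placeholder environment file
    shell:
        "somecommand {params.reads} -o {output}"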

Can I add a file to 'rule all:' which is not defined in any rule's output?

A number of commands silently produce extra files that are not defined in the rule's output section.
When I try to make sure these are produced by adding them to 'rule all:', a rerun of the workflow fails because the files are not found in any rule's output list.
Can I add a supplementary file (not present as an {output}) to 'rule all:'?
Thanks
e.g. STAR index generation produces a number of files in a folder defined by command arguments; checking for the presence of the folder does not mean that indexing worked out normally.
Added for clarity: the STAR index example takes 'star_idx_75' as the output argument and makes a folder of that name in which all of the following files are stored (their number may vary depending on the index type).
chrLength.txt
chrName.txt
chrNameLength.txt
chrStart.txt
exonGeTrInfo.tab
exonInfo.tab
geneInfo.tab
Genome
genomeParameters.txt
SA
SAindex
sjdbInfo.txt
sjdbList.fromGTF.out.tab
sjdbList.out.tab
transcriptInfo.tab
What I wanted was to check that they are all present, BUT none of them is used to build the command itself, and if I require them in 'rule all:' a rerun breaks because they are not in any Snakemake {output} definition.
This is why I asked whether I could create 'fake' output variables that are not 'used' for running a command but allow placing the corresponding items in 'rule all:' - am I clearer now :-)
Can I add a supplementary file (not present as {output}) to the 'rule all:'?
I don't think so, at least not without resorting to some convoluted solution. Every file in rule all (or, more precisely, in the first rule) must be listed in the output of some rule.
If you don't want to repeat a long list, why not do something like this?
star_index = ['ref.idx1', 'ref.idx2', ...]

rule all:
    input:
        star_index

rule make_index:
    input:
        ...
    output:
        star_index
    shell:
        ...
It's probably better to list them all in the rule's output, but only use the relevant ones in subsequent rules. You could also look into using directory(), which could possibly fit here.
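A small sketch of that idea for the STAR case, assuming the star_idx_75 folder from the question; the genome/GTF file names are illustrative and the STAR command is abbreviated:

STAR_FILES = ["chrLength.txt", "chrName.txt", "chrNameLength.txt", "chrStart.txt",
              "Genome", "genomeParameters.txt", "SA", "SAindex", "sjdbInfo.txt"]

rule all:
    input:
        expand("star_idx_75/{f}", f=STAR_FILES)

rule star_index:
    input:
        fasta="genome.fa",              # illustrative input names
        gtf="annotation.gtf"
    output:
        # every expected file is declared, so 'rule all' can require them,
        # while downstream rules can keep using just the folder path;
        # alternatively, declare the whole folder with directory("star_idx_75")
        expand("star_idx_75/{f}", f=STAR_FILES)
    shell:
        "STAR --runMode genomeGenerate --genomeDir star_idx_75 "
        "--genomeFastaFiles {input.fasta} --sjdbGTFfile {input.gtf}"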

Manually create snakemake wildcards

I'm struggling to integrate my sample sheet (TSV) into my pipeline. Specifically, I want to define the samples wildcard manually instead of reading it from a path. The reason is that not all samples in a path are supposed to be analysed. Instead, I made a sample sheet that contains the list of samples, the path where to find them, the reference genome, etc.
The sheet looks like this:
name path reference
sample1 path/to/fastq/files mm9
sample2 path/to/fastq/files mm9
I load the sheet in my snakefile:
import pandas as pd

table_samples = pd.read_table(config["samples"], index_col="name")
SAMPLES = table_samples.index.values.tolist()
The first rule is supposed to merge the FASTQ files inside, so it would be nice to do something like this:
rule merge_fastq:
    output: "{sample}/{sample}.fastq.gz"
    params: path = table_samples['path'][{sample}]
    shell: """
        cat {params.path}/*.fastq.gz > {output}
        """
But as written above it won't work, because the sample wildcard is not defined. Is there a way to tell Snakemake that the sample list I defined above (SAMPLES) contains all the samples for which the rules should be executed?
I honestly feel stupid asking this question, but I've already spent a couple of hours searching for a solution and at this point I need to be a bit more time-efficient :D
Thanks!
You just need a target rule listing all the concrete files you want after your rule "merge_fastq":
rule all:
    input: expand("{sample}/{sample}.fastq.gz", sample=SAMPLES)
This rule must be placed above the other rules. Wildcards can only be resolved if you define the concrete files you want at the end of the workflow.
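To also fix the params lookup from the question, the per-sample path can be resolved with a lambda at job time, once the wildcard is known; a minimal sketch, keeping the names from the question:

rule all:
    input: expand("{sample}/{sample}.fastq.gz", sample=SAMPLES)

rule merge_fastq:
    output: "{sample}/{sample}.fastq.gz"
    params:
        # look up the path for this particular sample once the wildcard is known
        path=lambda wildcards: table_samples.loc[wildcards.sample, "path"]
    shell: """
        cat {params.path}/*.fastq.gz > {output}
        """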

How to gather files from subdirectories to run jobs in Snakemake?

I am currently working on a project where I am struggling with this issue.
My current directory structure is
/shared/dir1/file1.bam
/shared/dir2/file2.bam
/shared/dir3/file3.bam
I want to convert various .bam files to fastq in the results directory
results/file1_1.fastq.gz
results/file1_2.fastq.gz
results/file2_1.fastq.gz
results/file2_2.fastq.gz
results/file3_1.fastq.gz
results/file3_2.fastq.gz
I have the following code:
END = ["1", "2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")

rule all:
    input: expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input: "{dir}/{sample}.bam"
    output: left="/results/{sample}_1.fastq", right="/results/{sample}_2.fastq"
    shell: "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"
This outputs the following error:
Wildcards in input files cannot be determined from output files:
'dir'
Any help would be appreciated
You're just missing an assignment for "dir" in the input directive of rule bam_to_fq. In your code, you are trying to get Snakemake to determine "{dir}" from the output of the same rule, because you have it set up as a wildcard. Since it doesn't exist as a variable in your output directive, you received that error.
input:
    "{dir}/{sample}.bam"
output:
    left="/results/{sample}_1.fastq",
    right="/results/{sample}_2.fastq",
Rule of thumb: input and output wildcards must match
rule all:
    input:
        expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input:
        expand("{dir}/{{sample}}.bam", dir=dirs)
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
NOTES
the sample variable in the input directive now requires double braces, because that is how you keep a wildcard as a wildcard inside an expand (expand fills in the other placeholders and leaves it alone).
dir is no longer a wildcard; it is explicitly set to the list of directories determined by the glob_wildcards call and assigned to the variable "dirs", which I assume you create earlier in your script, since the assignment of one of the variables already works in your rule all input ("sample=files").
I like and recommend easily differentiable variable names. I'm not a huge fan of the variable names "dir" and "dirs": they make you prone to pedantic spelling errors. Consider changing them to "dirLIST" and "dir"... or anything, really. I just fear that one day someone will miss an 's' somewhere and it will be frustrating to debug. I'm personally guilty, and thus a slight hypocrite, as I use "sample=samples" in my core Snakefile; it has caused me minor stress, hence this recommendation. It also makes your code easier for others to read.
EDIT 1: adding to the response, as I had initially missed the requirement for key-value matching of dir and sample.
I recommend keeping the path and the sample name in separate variables. Two approaches I can think of:
Keep using glob_wildcards to do a blanket search for all possible variables, then use a Python function to validate which path+file combinations are legitimate.
Drop the usage of glob_wildcards. Propagate the directory name as a wildcard variable, {dir}, throughout your rules, and just set it as a sub-directory of "results". Use pandas to pass known key-value pairs, listed in a file, to the rule all. Initially I suggest generating the key-value pairs file manually, but eventually its generation could just be a rule upstream of the others.
Generalizing bam_to_fq a little bit... utilizing an external config, something like....
from pandas import read_table

rule all:
    input:
        expand("/results/{sample[1][dir]}/{sample[1][file]}_{end}.fastq.gz",
               sample=read_table(config["sampleFILE"], sep=" ").iterrows(), end=['1', '2'])

rule bam_to_fq:
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{dir}/{sample}_1.fastq",
        right="/results/{dir}/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
sampleFILE
dir file
dir1 file1
dir2 file2
dir3 file3
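A related way to keep the dir/file pairs matched while staying with glob_wildcards is to zip the two lists inside a nested expand; a minimal sketch using the paths from the question (whether bam2fastq writes mate files that exactly match the declared outputs depends on its -o convention, so treat the shell line as a placeholder carried over from the question):

END = ["1", "2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")

rule all:
    input:
        # the inner expand pairs each file with the directory it came from (zip);
        # the outer expand then adds the read end to every pair
        expand(expand("/results/{dir}/{file}_{{end}}.fastq", zip, dir=dirs, file=files),
               end=END)

rule bam_to_fq:
    input:
        "/shared/{dir}/{file}.bam"
    output:
        left="/results/{dir}/{file}_1.fastq",
        right="/results/{dir}/{file}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.dir}/{wildcards.file}.fastq {input}"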

Reduce the set of input files dynamically during a snakemake run

This is more of a technical question regarding the capabilities of Snakemake. I was wondering whether it is possible to dynamically alter the set of input samples during a Snakemake run.
The reason why I would like to do so is the following: let's assume a set of sample-associated BAM files. The first rule determines the quality of each sample (based on its BAM file), i.e. all input files are involved.
However, given specified criteria, only a subset of samples is considered valid and should be processed further. So the next step (e.g. gene counting or something else) should only be done for the approved BAM files, as shown in the minimal example below:
configfile: "config.yaml"

rule all:
    input: "results/gene_count.tsv"

rule a:
    input: expand("data/{sample}.bam", sample=config['samples'])
    output: "results/list_of_qual_approved_samples.out"
    shell: '''command'''

rule b:
    input: expand("data/{sample}.bam", sample=config['valid_samples'])
    output: "results/gene_count.tsv"
    shell: '''command'''
In this example, rule a would have to extend the configuration file with a list of valid sample names, which I believe is not possible.
Of course, the straightforward solution would be to have two distinct inputs: 1) all BAM files and 2) a file that lists all valid files. This would boil down to doing the sample selection within the code of the rule:
rule alternative_b:
    input:
        expand("data/{sample}.bam", sample=config['samples']),
        "results/list_of_qual_approved_samples.out"
    output: "results/gene_count.tsv"
    shell: '''command'''
However, do you see a way to set up the rules such that the behaviour of the first example can be achieved?
Many thanks in advance,
Ralf
Another approach, one that does not use "dynamic".
It's not that you do not know how many files you are going to use, but rather that you are only using a subset of the files you start with. Since you are able to generate a "samples.txt" list of all the potential files, I'm going to assume you have a firm starting point.
I did something similar, where I have initial files that I want to process for validity (in my case, improving the quality: sorting, indexing, etc.). I then want to ignore everything except my resultant files.
What I suggest, to avoid creating a secondary list of sample files, is to use a second data directory: data (reBamDIR) and data2 (BamDIR). Into data2 you symlink all the files that are valid. That way, Snakemake can just process EVERYTHING in the data2 directory. It makes moving down the pipeline easier: the pipeline can stop relying on sample lists and just process everything using wildcards (much easier to code). This is possible because when I symlink I also standardize the names. I list the symlinked files in the rule's output so Snakemake knows about them and can build the DAG.
`-- output
|-- bam
| |-- Pfeiffer2.bam -> /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam
| `-- Pfeiffer2.bam.bai -> /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam.bai
|-- fastq
|-- mPile
|-- reBam
| |-- Pfeiffer2_realigned_sorted.bam
| `-- Pfeiffer2_realigned_sorted.bam.bai
In this case, all you need is a return value in your "validator" and a conditional operator that responds to it.
I would argue you already have this somewhere, since you must be using conditionals in your validation step. Instead of using them to write the file name to a txt file, just symlink the file into its finalized location and keep going.
My raw data is in reBamDIR.
The final data I store in BamDIR.
I only symlink the files from this stage of the pipeline over to BamDIR.
There are OTHER files in reBamDIR, but I don't want the rest of my pipeline to see them, so I'm filtering them out.
I'm not exactly sure how to implement the "validator" and your conditional, as I do not know your situation and I'm still learning too. I'm just trying to offer alternative perspectives/approaches.
from time import gmtime, strftime

rule indexBAM:
    input:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"])
    output:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam.bai", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam.bai", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"])
    params:
        bamDIR=config["bamDIR"],
        outputDIR=config["outputDIR"],
        logNAME="indexBAM." + strftime("%Y-%m-%d.%H-%M-%S", gmtime())
    log:
        "log/" + config["reBamDIR"]
    shell:
        "samtools index {input} {output[0]} " \
        " 2> {log}/{params.logNAME}.stderr " \
        "&& ln -fs $(pwd)/{output[0]} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam.bai " \
        "&& ln -fs $(pwd)/{input} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam"
I think I have an answer that could be interesting.
At first I thought it wasn't possible, because Snakemake needs to know the final files at the end, so you can't just separate a set of files without knowing the separation at the beginning.
But then I tried the dynamic function. With dynamic you don't have to know the number of files that will be created by the rule.
So I coded this:
rule all:
    input: "results/gene_count.tsv"

rule a:
    input: expand("data/{sample}.bam", sample=config['samples'])
    output: dynamic("data2/{foo}.bam")
    shell:
        './bloup.sh "{input}"'

rule b:
    input: dynamic("data2/{foo}.bam")
    output: touch("results/gene_count.tsv")
    shell: '''command'''
As in your first example, the Snakefile wants to produce a file named results/gene_count.tsv.
Rule a takes all samples from the configuration file. This rule executes a script that chooses which files to create: I have 4 initial files (geneA, geneB, geneC, geneD) and it only touches two of them (the geneA and geneD files) in a second directory. There is no problem with the dynamic function.
Rule b takes all the dynamic files created by rule a, so you just have to produce results/gene_count.tsv. I just touch it in the example.
Here is the Snakemake log for more information:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 a
1 all
1 b
3
rule a:
input: data/geneA.bam, data/geneB.bam, data/geneC.bam, data/geneD.bam
output: data2/{*}.bam (dynamic)
Subsequent jobs will be added dynamically depending on the output of this rule
./bloup.sh "data/geneA.bam data/geneB.bam data/geneC.bam data/geneD.bam"
Dynamically updating jobs
Updating job b.
1 of 3 steps (33%) done
rule b:
input: data2/geneD.bam, data2/geneA.bam
output: results/gene_count.tsv
command
Touching output file results/gene_count.tsv.
2 of 3 steps (67%) done
localrule all:
input: results/gene_count.tsv
3 of 3 steps (100%) done
This is not exactly an answer to your question, but rather a suggestion for reaching your goal.
I think it's not possible, or at least not trivial, to modify a YAML file during the pipeline run.
Personally, when I run Snakemake workflows I use external files that I call "metadata". They include a config file, but also a tab-separated file containing the list of samples (and possibly additional information about those samples). The config file contains a parameter which is the path to this sample file.
In such a setup, I would recommend having your "rule a" output another tab-separated file containing the selected samples; the path to this file can be included in the config file (even though the file doesn't exist when you start the workflow). Rule b would then take that file as an input.
In your case you could have:
config:
    samples: "/path/to/samples.tab"
    valid_samples: "/path/to/valid_samples.tab"
I don't know if this makes sense for you, since it's based on my own organization. I find it useful because it allows storing more information than just sample names, and if you have hundreds of samples it's much easier to manage!
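A rough sketch of that setup, assuming the samples file holds one sample name per line, rule a writes the approved names (one per line) to the valid_samples file, and rule b reads that list at job time rather than at DAG-construction time; the 'command' placeholders are kept from the question:

configfile: "config.yaml"   # contains the samples and valid_samples paths shown above

# read the full sample list from the metadata file (assumed: one sample name per line)
ALL_SAMPLES = [line.strip() for line in open(config['samples']) if line.strip()]

rule all:
    input: "results/gene_count.tsv"

rule a:
    input: expand("data/{sample}.bam", sample=ALL_SAMPLES)
    output: config['valid_samples']          # e.g. /path/to/valid_samples.tab
    shell: '''command'''                     # writes one approved sample name per line

rule b:
    input:
        bams=expand("data/{sample}.bam", sample=ALL_SAMPLES),
        valid=config['valid_samples']
    output: "results/gene_count.tsv"
    run:
        # the approved subset is only known after rule a has run, so read it here
        with open(input.valid) as fh:
            keep = [line.strip() for line in fh if line.strip()]
        bam_list = " ".join("data/{}.bam".format(s) for s in keep)
        shell("command " + bam_list + " > " + output[0])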