Can I add a file to rule all: which is not defined in output - snakemake

A number of commands silently produce extra files that are not declared in the rule's output section.
When I try to make sure these files are produced by adding them to 'rule all:', a re-run of the workflow fails because the files are not found in any rule's output list.
Can I add a supplementary file (not present as {output}) to the 'rule all:'?
Thanks
E.g., STAR index produces a number of files in a folder defined by command arguments; checking for the presence of the folder does not mean that indexing has completed normally.
Added for clarity: the STAR index example takes 'star_idx_75' as its output argument and creates a folder with that name in which all the following files are stored (their number may vary depending on the index type).
chrLength.txt
chrName.txt
chrNameLength.txt
chrStart.txt
exonGeTrInfo.tab
exonInfo.tab
geneInfo.tab
Genome
genomeParameters.txt
SA
SAindex
sjdbInfo.txt
sjdbList.fromGTF.out.tab
sjdbList.out.tab
transcriptInfo.tab
What I wanted was to check that they are all present, BUT none of them is used to build the command itself, and if I require them in 'rule all:' a rerun breaks because they are not in any snakemake {output} definition.
This is why I asked whether I could create 'fake' output variables that are not 'used' for running a command but allow placing the corresponding items in 'rule all:'. Am I clearer now? :-)

Can I add a supplementary file (not present as {output}) to the 'rule all:'?
I don't think so, at least not without resorting to some convoluted solution. Every file in rule all (or, more precisely, the first rule) must have a rule that lists it in output.
If you don't want to repeat a long list, why not do something like this?
star_index = ['ref.idx1', 'ref.idx2', ...]

rule all:
    input:
        star_index

rule make_index:
    input:
        ...
    output:
        star_index
    shell:
        ...
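A minimal sketch of how such a list could be built without typing every path, assuming the index folder is 'star_idx_75' and using some of the file names from the question. INDEX_DIR, INDEX_FILES and missing_files are hypothetical names introduced here for illustration only.

```python
import os

# Folder passed to STAR as the index output directory (from the question).
INDEX_DIR = "star_idx_75"

# File names STAR is expected to write there; the exact set may vary with
# the index type, so this list is an assumption to adapt to your own run.
INDEX_FILES = [
    "chrLength.txt", "chrName.txt", "chrNameLength.txt", "chrStart.txt",
    "Genome", "genomeParameters.txt", "SA", "SAindex",
]

# Full paths, usable both as input of rule all and as output of make_index.
star_index = [os.path.join(INDEX_DIR, name) for name in INDEX_FILES]


def missing_files(paths):
    """Return the paths that do not exist on disk (hypothetical helper)."""
    return [p for p in paths if not os.path.exists(p)]
```

Because the same list is used in both rules, Snakemake itself will complain if any of the declared files is missing after the indexing job; the helper is only there for a manual check outside the workflow.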

It's probably better to list them all in the rule's output, but only use the relevant ones in subsequent rules. You could also look into using directory() which could possibly fit here.

Related

Snakemake copy from several directories

Snakemake is super-confusing to me. I have files of the form:
indir/type/name_1/run_1/name_1_processed.out
indir/type/name_1/run_2/name_1_processed.out
indir/type/name_2/run_1/name_2_processed.out
indir/type/name_2/run_2/name_2_processed.out
where type, name, and the numbers are variable. I would like to aggregate files such that all files with the same "name" end up in a single dir:
outdir/type/name/name_1-1.out
outdir/type/name/name_1-2.out
outdir/type/name/name_2-1.out
outdir/type/name/name_2-2.out
How do I write a snakemake rule to do this? I first tried the following
rule rename:
    input:
        "indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"
    output:
        "outdir/{type}/{name}/{name}_{nameno}-{runno}.out"
    shell:
        "cp {input} {output}"
# example command: snakemake --cores 1 outdir/type/name/name_1-1.out
This worked, but doing it this way doesn't save me any effort because I have to know what the output files are ahead of time, so basically I'd have to pass all the output files as a list of arguments to snakemake, requiring a bit of shell trickery to get the variables.
So then I tried to use directory (as well as give up on preserving runno).
rule rename2:
    input:
        "indir/{type}/{name}_{nameno}"
    output:
        directory("outdir/{type}/{name}")
    shell:
        """
        for d in {input}/run_*; do
            i=0
            for f in ${{d}}/*processed.out; do
                cp ${{f}} {output}/{wildcards.name}_{wildcards.nameno}-${{i}}.out
            done
            let ++i
        done
        """
This gave me the error, Wildcards in input files cannot be determined from output files: 'nameno'. I get it; {nameno} doesn't exist in output. But I don't want it there in the directory name, only in the filename that gets copied.
Also, if I delete {nameno}, then it complains because it can't find the right input file.
What are the best practices here for what I'm trying to do? Also, how does one wrap their head around the fact that in snakemake, you specify outputs, not inputs? I think this latter fact is what is so confusing.
I guess what you need is the expand function:
rule all:
    input: expand("outdir/{type}/{name}/{name}_{nameno}-{runno}.out",
                  type=TYPES,
                  name=NAMES,
                  nameno=NAME_NUMBERS,
                  runno=RUN_NUMBERS)
TYPES, NAMES, NAME_NUMBERS and RUN_NUMBERS are the lists of all possible values for these parameters. You need to either hardcode them or use the glob_wildcards function to collect them:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS, = glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out")
This, however, would give you duplicates. If that is not desirable, remove the duplicates:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS, = map(set, glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"))
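One thing to keep in mind: by default expand combines every value with every other value, so it may request combinations that have no matching input file. A sketch of the element-wise alternative in plain Python, with made-up list contents standing in for what glob_wildcards returns; in a Snakefile the same pairing should be obtainable by passing zip as the second argument to expand.

```python
# Stand-ins for the four lists returned by glob_wildcards(); the values are
# made up for illustration and stay aligned position by position.
TYPES = ["type", "type", "type", "type"]
NAMES = ["name", "name", "name", "name"]
NAME_NUMBERS = ["1", "1", "2", "2"]
RUN_NUMBERS = ["1", "2", "1", "2"]

PATTERN = "outdir/{type}/{name}/{name}_{nameno}-{runno}.out"

# Pair the lists element-wise so only combinations seen on disk are requested.
targets = [
    PATTERN.format(type=t, name=n, nameno=no, runno=r)
    for t, n, no, r in zip(TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS)
]
```

With this pairing the lists must not be deduplicated, since removing duplicates would break the position-by-position correspondence.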

Manually create snakemake wildcards

I'm struggling to integrate my sample sheet (TSV) into my pipeline. Specifically, I want to define the samples wildcard manually instead of reading it from a path. The reason is that not all samples in a path are supposed to be analysed. Instead, I made a sample sheet that contains the list of samples, the path where to find them, the reference genome, etc.
The sheet looks like this:
name path reference
sample1 path/to/fastq/files mm9
sample2 path/to/fastq/files mm9
I load the sheet in my snakefile:
table_samples = pd.read_table(config["samples"], index_col="name")
SAMPLES = table_samples.index.values.tolist()
The first rule is supposed to merge the FASTQ files inside, so it would be nice to do something like this:
rule merge_fastq:
    output: "{sample}/{sample}.fastq.gz"
    params: path = table_samples['path'][{sample}]
    shell: """
        cat {params.path}/*.fastq.gz > {output}
        """
But as written above it won't work because the sample wildcard is not defined. Is there a way I can say the sample list I defined above (SAMPLES) contains all the samples for which rules should be executed?
I honestly feel stupid asking this question but I've already spent a couple of hours finding/searching a solution and at this point I need to be a bit more time efficient :D
Thanks!
You just need a target rule listing all the concrete files you want after your rule "merge_fastq":
rule all:
    input: expand("{sample}/{sample}.fastq.gz", sample=SAMPLES)
This rule must be placed before all other rules. Wildcards can only be resolved once you have defined the concrete files you want at the end of the workflow.
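The per-sample path lookup in params also needs to be a function of the wildcards rather than a direct index. A minimal sketch of such a function, using the standard csv module and an inline copy of the sheet so it runs on its own; rows and fastq_dir are hypothetical names. With the pandas table from the question, the equivalent would presumably be a lambda such as lambda wildcards: table_samples.loc[wildcards.sample, "path"].

```python
import csv
import io
from types import SimpleNamespace

# Inline copy of the sample sheet from the question (tab-separated).
SHEET = (
    "name\tpath\treference\n"
    "sample1\tpath/to/fastq/files\tmm9\n"
    "sample2\tpath/to/fastq/files\tmm9\n"
)

# Map each sample name to its row so lookups by wildcard are cheap.
rows = {row["name"]: row
        for row in csv.DictReader(io.StringIO(SHEET), delimiter="\t")}
SAMPLES = list(rows)


def fastq_dir(wildcards):
    """Return the FASTQ directory for the sample named by the wildcard."""
    return rows[wildcards.sample]["path"]


# In the Snakefile this would be referenced as: params: path=fastq_dir
# Here a SimpleNamespace stands in for Snakemake's wildcards object.
example = fastq_dir(SimpleNamespace(sample="sample1"))
```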

How to gather files from subdirectories to run jobs in Snakemake?

I am currently working on a project where I am struggling with this issue.
My current directory structure is
/shared/dir1/file1.bam
/shared/dir2/file2.bam
/shared/dir3/file3.bam
I want to convert various .bam files to fastq in the results directory
results/file1_1.fastq.gz
results/file1_2.fastq.gz
results/file2_1.fastq.gz
results/file2_2.fastq.gz
results/file3_1.fastq.gz
results/file3_2.fastq.gz
I have the following code:
END = ["1", "2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")

rule all:
    input: expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input: "{dir}/{sample}.bam"
    output: left="/results/{sample}_1.fastq", right="/results/{sample}_2.fastq"
    shell: "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"
This outputs the following error:
Wildcards in input files cannot be determined from output files:
'dir'
Any help would be appreciated
You're just missing an assignment for "dir" in the input directive of the rule bam_to_fq. In your code, Snakemake tries to determine "{dir}" from the output of the same rule, because you have set it up as a wildcard. Since it does not appear in your output directive, you received an error.
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq",
Rule of thumb: every wildcard used in input must also appear in output.
rule all:
    input:
        expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input:
        expand("{dir}/{{sample}}.bam", dir=dirs)
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
NOTES
The sample wildcard in the input directive now requires double braces, {{sample}}, because that is how a wildcard is escaped inside an expand.
dir is no longer a wildcard; it is explicitly set to the list of directories returned by the glob_wildcards call and assigned to the variable "dirs", which I am assuming you create earlier in your script, since the assignment "sample=files" in your rule all input already succeeds.
I like and recommend easily differentiable variable names. I'm not a huge fan of the variable names "dir" and "dirs", because they make you prone to spelling errors. Consider changing them to "dirLIST" and "dir", or anything really. I just fear one day someone will miss an 's' somewhere and it's going to be frustrating to debug. I'm personally guilty, and thus a slight hypocrite, as I do use "sample=samples" in my core Snakefile. It has caused me minor stress, which is why I make this recommendation. It also makes it easier for others to read your code.
EDIT 1: Adding to my response, as I had initially missed the requirement for key-value matching of the dir and sample.
I recommend keeping the path and the sample name in separate variables. Two approaches I can think of:
Keep using glob_wildcards to make a blanket search for all possible values, and then use a Python function to validate which path+file combinations are legitimate.
Drop the usage of glob_wildcards. Propagate the directory name as a wildcard, {dir}, throughout your rules. Just set it as a sub-directory of "results". Use pandas to pass known key-value pairs listed in a file to the rule all. Initially I suggest generating the key-value pairs file manually, but eventually its generation could just be a rule upstream of the others.
Generalizing bam_to_fq a little bit... utilizing an external config, something like....
from pandas import read_table

rule all:
    input:
        expand("/results/{sample[1][dir]}/{sample[1][file]}_{end}.fastq.gz",
               sample=read_table(config["sampleFILE"], sep=" ").iterrows(),
               end=['1', '2'])

rule bam_to_fq:
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{dir}/{sample}_1.fastq",
        right="/results/{dir}/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
sampleFILE
dir file
dir1 file1
dir2 file2
dir3 file3
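The first approach above (keeping glob_wildcards and resolving the legitimate directory per sample in Python) could be sketched as follows. The lists are stand-ins for what glob_wildcards would return for the layout in the question, DIR_OF and bam_path are hypothetical names, and the sketch assumes sample names are unique across directories.

```python
from types import SimpleNamespace

# Stand-ins for the aligned lists returned by
# glob_wildcards("/shared/{dir}/{file}.bam"); values follow the question.
dirs = ["dir1", "dir2", "dir3"]
files = ["file1", "file2", "file3"]

# Remember which directory each sample was found in
# (assumes every sample name occurs in exactly one directory).
DIR_OF = dict(zip(files, dirs))


def bam_path(wildcards):
    """Input function: resolve the BAM path from the sample wildcard alone."""
    return "/shared/{}/{}.bam".format(DIR_OF[wildcards.sample], wildcards.sample)


# A SimpleNamespace stands in for Snakemake's wildcards object here.
example = bam_path(SimpleNamespace(sample="file1"))
```

In the Snakefile, bam_to_fq would then use input: bam_path, so {dir} no longer has to be a wildcard at all.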

Reduce the set of input files dynamically during a snakemake run

This is more of a technical question regarding the capabilities of snakemake. I was wondering whether it is possible to dynamically alter the set of input samples during a snakemake run.
The reason why I would like to do so is the following: Let's assume a set of sample associated bam files. The first rule determines the quality of each sample (based on the bam file), i.e. all input files are concerned.
However, given specified criteria, only a subset of samples is considered as valid and should be processed further. So the next step (e.g. gene counting or something else) should only be done for the approved bam files, as shown in the minimal example below:
configfile: "config.yaml"

rule all:
    input: "results/gene_count.tsv"

rule a:
    input: expand("data/{sample}.bam", sample=config['samples'])
    output: "results/list_of_qual_approved_samples.out"
    shell: '''command'''

rule b:
    input: expand("data/{sample}.bam", sample=config['valid_samples'])
    output: "results/gene_count.tsv"
    shell: '''command'''
In this example, rule a would extend the configuration file with a list of valid sample names, although I believe this is not possible.
Of course, the straightforward solution would be to have two distinct inputs: 1.) all bam files and 2.) a file that lists all valid files. This would boil down to doing the sample selection within the code of the rule.
rule alternative_b:
    input:
        expand("data/{sample}.bam", sample=config['samples']),
        "results/list_of_qual_approved_samples.out"
    output: "results/gene_count.tsv"
    shell: '''command'''
However, do you see a way to setup the rules such that the behavior of the first example can be achieved?
Many thanks in advance,
Ralf
Another approach, one that does not use "dynamic".
It's not that you do not know how many files you are going to use, but rather, you are only using a sub-set of the files you would be starting with. Since you are able to generate a "samples.txt" list of all the potential files, I'm going to assume you have a firm starting point.
I did something similar, where I have initial files that I want to process for validity, (in my case, I'm increasing the quality~sorting, indexing etc). I then want to ignore everything except my resultant file.
What I suggest, to avoid creating a secondary list of sample files, is to create a second data directory: data (reBamDIR in my setup) holds everything, and data2 (bamDIR) receives symlinks to only the valid files. That way, Snakemake can just process EVERYTHING in the data2 directory. This makes moving down the pipeline easier: the pipeline can stop relying on sample lists and can just process everything using wildcards (much easier to code). This is possible because I standardize the names when I symlink. I list the symlinked files in the rule's output so Snakemake knows about them and can create the DAG.
`-- output
    |-- bam
    |   |-- Pfeiffer2.bam -> /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam
    |   `-- Pfeiffer2.bam.bai -> /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam.bai
    |-- fastq
    |-- mPile
    |-- reBam
    |   |-- Pfeiffer2_realigned_sorted.bam
    |   `-- Pfeiffer2_realigned_sorted.bam.bai
In this case, all you need is a return value in your "validator", and a conditional operator to respond to it.
I would argue you already have this somewhere, since you must be using conditionals in your validation step. Instead of using it to write the file name to a txt file, just symlink the file in a finalized location and keep going.
My raw data is in reBamDIR.
The final data I store in BamDIR.
I only symlink the files from this stage in the pipeline over to bamDIR.
There are OTHER files in reBamDIR, but I don't want the rest of my pipeline to see them, so, I'm filtering them out.
I'm not exactly sure how to implement the "validator" and your conditional, as I do not know your situation, and I'm still learning too. Just trying to offer alternative perspectives//approaches.
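For what it's worth, the "validator plus conditional symlink" idea could be sketched in plain Python roughly like this. link_valid and the is_valid callback are hypothetical, and the toy criterion (non-empty file) is only a placeholder for a real quality check.

```python
import os
import tempfile


def link_valid(src_dir, dst_dir, is_valid):
    """Symlink every .bam in src_dir that passes is_valid into dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    linked = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.abspath(os.path.join(src_dir, name))
        if name.endswith(".bam") and is_valid(src):
            dst = os.path.join(dst_dir, name)
            if os.path.lexists(dst):
                os.remove(dst)  # replace a stale link, like `ln -fs`
            os.symlink(src, dst)
            linked.append(name)
    return linked


# Toy run: non-empty files count as valid (a placeholder criterion).
with tempfile.TemporaryDirectory() as tmp:
    src, dst = os.path.join(tmp, "reBam"), os.path.join(tmp, "bam")
    os.makedirs(src)
    for name, body in [("A.bam", "x"), ("B.bam", "")]:
        with open(os.path.join(src, name), "w") as fh:
            fh.write(body)
    result = link_valid(src, dst, lambda p: os.path.getsize(p) > 0)
```

Downstream rules would then only ever look at the directory containing the links.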
from time import gmtime, strftime

rule indexBAM:
    input:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"])
    output:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam.bai", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam.bai", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"])
    params:
        bamDIR=config["bamDIR"],
        outputDIR=config["outputDIR"],
        logNAME="indexBAM." + strftime("%Y-%m-%d.%H-%M-%S", gmtime())
    log:
        "log/" + config["reBamDIR"]
    shell:
        "samtools index {input} {output[0]} " \
        " 2> {log}/{params.logNAME}.stderr " \
        "&& ln -fs $(pwd)/{output[0]} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam.bai " \
        "&& ln -fs $(pwd)/{input} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam"
I think I have an answer that could be interesting.
At first I thought it wasn't possible, because Snakemake needs to know the final files up front, so you can't split off a subset of files without knowing the split at the beginning.
But then I tried the dynamic function. With dynamic you don't have to know the number of files that will be created by the rule.
So I coded this:
rule all:
    input: "results/gene_count.tsv"

rule a:
    input: expand("data/{sample}.bam", sample=config['samples'])
    output: dynamic("data2/{foo}.bam")
    shell:
        './bloup.sh "{input}"'

rule b:
    input: dynamic("data2/{foo}.bam")
    output: touch("results/gene_count.tsv")
    shell: '''command'''
As in your first example, the Snakefile wants to produce a file named results/gene_count.tsv.
Rule a takes all samples from the configuration file. This rule executes a script that chooses the files to create. I have 4 initial files (geneA, geneB, geneC, geneD) and it only touches two of them for the output (the geneA and geneD files) in a second directory. There is no problem with the dynamic function.
Rule b takes all the dynamic files created by rule a, so you just have to produce results/gene_count.tsv. I just touched it in the example.
Here is the log of Snakemake for more information :
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 a
1 all
1 b
3
rule a:
input: data/geneA.bam, data/geneB.bam, data/geneC.bam, data/geneD.bam
output: data2/{*}.bam (dynamic)
Subsequent jobs will be added dynamically depending on the output of this rule
./bloup.sh "data/geneA.bam data/geneB.bam data/geneC.bam data/geneD.bam"
Dynamically updating jobs
Updating job b.
1 of 3 steps (33%) done
rule b:
input: data2/geneD.bam, data2/geneA.bam
output: results/gene_count.tsv
command
Touching output file results/gene_count.tsv.
2 of 3 steps (67%) done
localrule all:
input: results/gene_count.tsv
3 of 3 steps (100%) done
This is not exactly an answer to your question, but rather a suggestion to reach your goal.
I think it's not possible - or at least not trivial - to modify a yaml file during the pipeline run.
Personally, when I run snakemake workflows I use external files that I call "metadata". They include a configfile, but also a tab-file containing the list of samples (and possibly additional information about said samples). The config file contains a parameter which is the path to this file.
In such a setup, I would recommend having your "rule a" output another tab-file containing the selected samples, and the path to this file could be included in the config file (even though it doesn't exist when you start the workflow). Rule b would take that file as an input.
In your case you could have:
config:
    samples: "/path/to/samples.tab"
    valid_samples: "/path/to/valid_samples.tab"
I don't know if it makes sense, since it's based on my own organization. I think it's useful because it allows storing more information than just sample names, and if you have 100s of samples it's much easier to manage!
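As a rough sketch of how rule b could use such a tab-file from inside its own code (for example a run: block that lists both the BAM files and the tab-file as input), assuming one sample per line with the name in the first tab-separated column. read_valid_samples and select_bams are hypothetical helpers, and the file contents below are made up.

```python
import os
import tempfile


def read_valid_samples(tab_path):
    """Return sample names from the first column, skipping blank lines."""
    with open(tab_path) as fh:
        return [line.split("\t")[0].strip() for line in fh if line.strip()]


def select_bams(bam_paths, valid_samples):
    """Keep only the BAM files whose base name is an approved sample."""
    valid = set(valid_samples)
    return [p for p in bam_paths
            if os.path.splitext(os.path.basename(p))[0] in valid]


# Toy run with a temporary valid_samples.tab (contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    tab = os.path.join(tmp, "valid_samples.tab")
    with open(tab, "w") as fh:
        fh.write("geneA\tpass\ngeneD\tpass\n")
    bams = ["data/gene{}.bam".format(s) for s in "ABCD"]
    selected = select_bams(bams, read_valid_samples(tab))
```

Doing the filtering inside the rule keeps the set of declared inputs fixed, so nothing has to change in the configuration while the workflow is running.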

OCLint rule customization

I am using the OCLint static code analysis tool for Objective-C and want to find out how to customize its rules. The rules are represented by a set of dylib files.
In lieu of passing configuration as arguments (see Jon Boydell's answer), you can also create a YAML file named .oclint in the project directory.
Here's an example file that customizes a few things:
rules:
  - LongLine
disable-rules:
rulePaths:
  - /etc/rules
rule-configurations:
  - key: LONG_LINE
    value: 20
output: filename
report-type: xml
max-priority-1: 10
max-priority-2: 20
max-priority-3: 30
enable-clang-static-analyzer: false
The answer, as with so many things, is that it depends.
If you want to write your own custom rule then you'll need to get down and dirty into writing your own rule, in C++ on top of the existing source code. Check out the oclint-rules/rules directory, size/LongLineRule.cpp is a simple rule to get going with. You'll need to recompile, etc.
If you want to change the parameters of an existing rule you need to add the command line parameter -rc=<rulename>=<value> to the call to oclint. For example, if you want the long lines rule to only activate for lines longer than 150 chars you need to add -rc=LONG_LINE=150.
I don't have the patience to list out all the different parameters you can change. The list of rules is here http://docs.oclint.org/en/dev/rules/index.html and a list of threshold based rules here http://docs.oclint.org/en/dev/customizing/rules.html but there's no list of acceptable values and I don't know whether these two URLs cover all the rules or not. You might have to look into the source code for each rule to work out how it works.
If you're using an Xcode script, you should pass oclint_args like this:
oclint-json-compilation-database oclint_args "-rc LONG_LINE=150" | sed 's/\(.*\.\m\{1,2\}:[0-9]*:[0-9]*:\)/\1 warning:/'
In that sample I'm changing the LONG_LINE rule to 150 chars.