For some time I have been struggling to produce a workflow with many inputs and a single output, like the one shown below. The code works to some extent, but when there are too many input files the concatenate step invariably fails:
rule generate_text:
    input:
        "data/{name}.csv"
    output:
        "text_files/{name}.txt"
    shell:
        "somecommand {input} -o {output}"

rule concatenate_text:
    input:
        expand("text_files/{name}.txt", name=names)
    output:
        "summaries/summary.txt"
    shell:
        "cat {input} > {output}"
I have done some digging and found that this is due to the operating system's limit on the length of a single command line (ARG_MAX). I am working with increasingly large numbers of inputs, so the above solution is not scalable.
Can anybody propose a solution to this issue? I haven't been able to find one online.
Ideally the solution wouldn't be limited to cat or other shell commands, and could be employed within the structure of a rule in cases where --use-conda is used. My current fix uses an onsuccess script, shown below, but this doesn't allow --use-conda or rule-specific conda environments.
One handy thing about the shell command is that you can feed it Snakemake variables, but it's not quite flexible enough for my purposes due to the aforementioned conda issue.
onsuccess:
    shell("cat text_files/*.txt > summaries/summary.txt")
Is it possible to use the same input and output in a rule?
For example,
rule example:
    input:
        "/path/to/my/data"
    output:
        "/path/to/my/data"
    shell:
        "my_command {input}"
I am pulling data from a previous rule, and am trying to move some of its outputs around, and merge files together.
I appreciate any help!
In a nutshell, no. Snakemake builds a DAG (directed acyclic graph) and then resolves the dependencies of each node required by a target. A rule whose output is also its input introduces a cycle.
Anyway, from your description I don't see any reason for this cycle:
I am pulling data from a previous rule, and am trying to move some of
its outputs around, and merge files together.
That can be done in a "normal" way.
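For example, a minimal sketch of the acyclic version (the paths and the merge command are made up): write the merged result to a new path instead of overwriting the inputs:

rule merge:
    input:
        # outputs of the previous rule (names are assumptions)
        a="data/partA/{sample}.txt",
        b="data/partB/{sample}.txt"
    output:
        # a distinct path, so there is no input/output cycle
        "data/merged/{sample}.txt"
    shell:
        "my_command {input.a} {input.b} > {output}"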
I am using Snakemake to design an RNA-seq data analysis pipeline. While I've managed to do that, I want to make my pipeline as adaptable as possible, so that it can deal with single-end (SE) data and paired-end (PE) data within the same run of analyses, instead of analysing SE data in one run and PE data in another.
My pipeline is supposed to be designed like this :
dataset download that gives 1 file (SE data) or 2 files (PE data) -->
set of rules A specific to 1 file OR set of rules B specific to 2 files -->
rule that takes 1 or 2 input files and merges it/them
into a single output -->
final set of rules.
Note: all rules of A have 1 input and 1 output, all rules of B have 2 inputs and 2 outputs, and their respective commands look like:
1 input : somecommand -i {input} -o {output}
2 inputs : somecommand -i1 {input1} -i2 {input2} -o1 {output1} -o2 {output2}
Note 2: apart from their differences in inputs/outputs, the rules of sets A and B have the same commands, parameters, etc.
In other words, I want my pipeline to be able to switch between executing rule set A or rule set B depending on the sample, either by giving it information on each sample in a config file at the start (sample 1 is SE, sample 2 is PE... this is known beforehand), or by asking Snakemake to count the number of files after the dataset download to choose the proper next set of rules for each sample. If you see another way to do that, you're welcome to tell me about it.
I thought about using checkpoints, input functions and if/else statement, but I haven't managed to solve my problem with these.
Do you have any hints/advice/ways to make that "switch" happen?
If you know the layout beforehand, then the easiest way would be to store it in some variable, something like this (or alternatively read it from a config file into a dictionary):
layouts = {"sample1": "paired", "sample2": "single", ... etc}
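For instance, a minimal sketch of the config-file variant (assuming a layouts section in config.yaml that maps sample names to "single" or "paired"):

configfile: "config.yaml"
layouts = config["layouts"]  # e.g. {"sample1": "paired", "sample2": "single"}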
What you can then do is "merge" your rule like this (I am guessing you are talking about trimming and alignment, so that's my example):
ruleorder: B > A

rule A:
    input:
        "{sample}.fastq.gz"
    output:
        "trimmed_{sample}.fastq.gz"
    shell:
        "somecommand -i {input} -o {output}"

rule B:
    input:
        input1="{sample}_R1.fastq.gz",
        input2="{sample}_R2.fastq.gz"
    output:
        output1="trimmed_{sample}_R1.fastq.gz",
        output2="trimmed_{sample}_R2.fastq.gz"
    shell:
        "somecommand -i1 {input.input1} -i2 {input.input2} -o1 {output.output1} -o2 {output.output2}"
def get_fastqs(wildcards):
    output = dict()
    if layouts[wildcards.sample] == "single":
        output["input"] = f"trimmed_{wildcards.sample}.fastq.gz"
    elif layouts[wildcards.sample] == "paired":
        output["input1"] = f"trimmed_{wildcards.sample}_R1.fastq.gz"
        output["input2"] = f"trimmed_{wildcards.sample}_R2.fastq.gz"
    return output
rule alignment:
    input:
        unpack(get_fastqs)
    output:
        "somepath/{sample}.bam"
    shell:
        ...
There is a lot of stuff going on here.
First of all, you need a ruleorder so Snakemake knows how to handle ambiguous cases.
Rules A and B both have to exist (unless you do something hacky with the output files).
The alignment rule needs an input function to determine which input it requires.
Some self-promotion: I made a snakemake pipeline which does many things, including RNA-seq and downloading of samples online and automatically determining their layout (single-end vs paired-end). Please take a look and see if it solves your problem: https://vanheeringen-lab.github.io/seq2science/content/workflows/rna_seq.html
EDIT:
When you say “merging” rules, do you mean rule A, B and alignment?
That was unclear wording on my part. By merging I meant merging the single-end and paired-end logic together, so you can continue with a single rule (e.g. a count table, you name it).
Rule order: why did you choose B > A? To make sure that paired samples don't end up running in the single-end rules?
Exactly! When a rule needs trimmed_sample1_R1.fastq.gz, how would Snakemake know the name of your sample? Is the sample name sample1, or is it sample1_R1? It can be either, and that makes Snakemake complain that it does not know how to resolve this. When you add a ruleorder you tell Snakemake: when it is unclear, resolve in this order.
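To illustrate with the file names from above:

# trimmed_sample1_R1.fastq.gz matches the output patterns of both rules:
#   rule A with sample = "sample1_R1"  (output: trimmed_{sample}.fastq.gz)
#   rule B with sample = "sample1"     (output: trimmed_{sample}_R1.fastq.gz)
# The ruleorder resolves the ambiguity in favour of the paired-end rule:
ruleorder: B > A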
The command in the alignment rule needs 1 or 2 inputs. I intend to use an if/else in params directive to choose the inputs. Am I correct to think that? (I think you did that as well in your pipeline)
Yes, that's the way we solved it. We did it that way because we want every rule to have its own environment. If you do not use a separate conda environment for alignment, then you can do it cleaner/prettier, like so:
rule alignment:
    input:
        unpack(get_fastqs)
    output:
        "somepath/{sample}.bam"
    run:
        if layouts[wildcards.sample] == "single":
            shell("single-end command")
        if layouts[wildcards.sample] == "paired":
            shell("paired-end command")
I feel this option is much clearer than what we did in the seq2science pipeline. However, in seq2science we support many different aligners that each have their own conda environment, so the run directive cannot be used there.
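For completeness, a hedged sketch of the params-based variant (this is not seq2science's actual code; the flags and the environment file path are assumptions). Building the layout-dependent part of the command in params keeps a plain shell directive, so --use-conda still works:

rule alignment:
    input:
        unpack(get_fastqs)
    output:
        "somepath/{sample}.bam"
    params:
        # assemble the input flags depending on the sample's layout
        reads=lambda wildcards, input: (
            "-i {}".format(input.input)
            if layouts[wildcards.sample] == "single"
            else "-i1 {} -i2 {}".format(input.input1, input.input2)
        )
    conda:
        "envs/alignment.yaml"  # assumed rule-specific environment
    shell:
        "somecommand {params.reads} -o {output}"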
In the first step of my process, I am extracting some hourly data from a database. For various reasons, data is sometimes missing for some hours, resulting in missing files. As long as the number of missing files is not too large, I still want to run some of the rules that depend on that data. When running those rules I will check how much data is missing and then decide whether to raise an error.
An example below. The Snakefile:
rule parse_data:
    input:
        "data/1.csv", "data/2.csv", "data/3.csv", "data/4.csv"
    output:
        "result.csv"
    shell:
        "touch {output}"

rule get_data:
    output:
        "data/{id}.csv"
    shell:
        "Rscript get_data.R {output}"
And my get_data.R script:
output <- commandArgs(trailingOnly = TRUE)[1]
if (output == "data/1.csv")
stop("Some error")
writeLines("foo", output)
How do I force the rule parse_data to run even when some of its inputs are missing? I do not want to force running any other rules when input is missing.
One possible solution would be to have get_data.R generate, for example, an empty file when a query fails. However, in practice I am also running snakemake with --restart-times 5, since queries can also fail because of database timeouts. If empty files were created, this mechanism of retrying the queries would no longer work.
You need data-dependent conditional execution.
Use a checkpoint on get_data, and replace parse_data's input with a function that aggregates whatever files do exist; a sketch follows below.
(note that I am a Snakemake newbie and am just learning this myself, I hope this is helpful)
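A minimal sketch of that idea (untested, and the single-directory layout is an assumption): let the checkpoint fetch all hours at once, writing whatever succeeds into data/, and have parse_data aggregate the files that actually exist:

import glob
import os

checkpoint get_data:
    output:
        directory("data")
    shell:
        # assumes get_data.R tolerates missing hours and still exits 0
        "Rscript get_data.R {output}"

def existing_csvs(wildcards):
    # Evaluated only after the checkpoint has finished, so the glob
    # sees exactly the files that were actually produced.
    data_dir = checkpoints.get_data.get().output[0]
    return sorted(glob.glob(os.path.join(data_dir, "*.csv")))

rule parse_data:
    input:
        existing_csvs
    output:
        "result.csv"
    shell:
        "touch {output}"

The check on how much data is missing can then live inside parse_data, which can raise an error if too many hours are absent.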
This is more of a technical question regarding the capabilities of Snakemake: I was wondering whether it is possible to dynamically alter the set of input samples during a Snakemake run.
The reason I would like to do so is the following: assume a set of sample-associated bam files. The first rule determines the quality of each sample (based on its bam file), i.e. all input files are involved.
However, given specified criteria, only a subset of samples is considered valid and should be processed further. So the next step (e.g. gene counting or something else) should only be done for the approved bam files, as shown in the minimal example below:
configfile: "config.yaml"

rule all:
    input: "results/gene_count.tsv"

rule a:
    input: expand("data/{sample}.bam", sample=config['samples'])
    output: "results/list_of_qual_approved_samples.out"
    shell: '''command'''

rule b:
    input: expand("data/{sample}.bam", sample=config['valid_samples'])
    output: "results/gene_count.tsv"
    shell: '''command'''
In this example, rule a would have to extend the configuration file with a list of valid sample names, which I believe is not possible.
Of course, the straightforward solution would be to have two distinct inputs: 1) all bam files and 2) a file that lists all valid files. This boils down to doing the sample selection within the code of the rule.
rule alternative_b:
    input:
        expand("data/{sample}.bam", sample=config['samples']),
        "results/list_of_qual_approved_samples.out"
    output: "results/gene_count.tsv"
    shell: '''command'''
However, do you see a way to setup the rules such that the behavior of the first example can be achieved?
Many thanks in advance,
Ralf
Another approach, one that does not use "dynamic".
It's not that you don't know how many files you are going to use; rather, you are only using a subset of the files you start with. Since you are able to generate a "samples.txt" list of all the potential files, I'm going to assume you have a firm starting point.
I did something similar, where I have initial files that I want to process for validity (in my case, I'm improving the quality: sorting, indexing, etc.). I then want to ignore everything except my resultant file.
What I suggest, to avoid creating a secondary list of sample files, is to create a second data directory, data2 (BamDIR), next to the first, data (reBamDIR). Into data2 you symlink all the files that are valid. That way, Snakemake can just process EVERYTHING in the data2 directory. This makes moving down the pipeline easier: the pipeline stops relying on sample lists and can process everything using wildcards (much easier to code). This is possible because when I symlink I also standardize the names. I list the symlinked files in the rule's output so Snakemake knows about them and can create the DAG.
`-- output
|-- bam
| |-- Pfeiffer2.bam -> /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam
| `-- Pfeiffer2.bam.bai -> /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam.bai
|-- fastq
|-- mPile
|-- reBam
| |-- Pfeiffer2_realigned_sorted.bam
| `-- Pfeiffer2_realigned_sorted.bam.bai
In this case, all you need is a return value in your "validator", and a conditional operator to respond to it.
I would argue you already have this somewhere, since you must be using conditionals in your validation step. Instead of using them to write the file name to a txt file, just symlink the file to its finalized location and keep going.
My raw data is in reBamDIR.
The final data I store in BamDIR.
I only symlink the files from this stage in the pipeline over to bamDIR.
There are OTHER files in reBamDIR, but I don't want the rest of my pipeline to see them, so, I'm filtering them out.
I'm not exactly sure how to implement the "validator" and your conditional, as I do not know your situation, and I'm still learning too. Just trying to offer alternative perspectives/approaches.
from time import gmtime, strftime

rule indexBAM:
    input:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"])
    output:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam.bai", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam.bai", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"])
    params:
        bamDIR=config["bamDIR"],
        outputDIR=config["outputDIR"],
        logNAME="indexBAM." + strftime("%Y-%m-%d.%H-%M-%S", gmtime())
    log:
        "log/" + config["reBamDIR"]
    shell:
        "samtools index {input} {output[0]} "
        " 2> {log}/{params.logNAME}.stderr "
        "&& ln -fs $(pwd)/{output[0]} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam.bai "
        "&& ln -fs $(pwd)/{input} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam"
I think I have an answer that could be interesting.
At first I thought it wasn't possible, because Snakemake needs to know the final files at the end, so you can't just separate a set of files without knowing the separation at the beginning.
But then I tried the dynamic function. With dynamic you don't have to know the number of files that will be created by a rule.
So I coded this :
rule all:
    input: "results/gene_count.tsv"

rule a:
    input: expand("data/{sample}.bam", sample=config['samples'])
    output: dynamic("data2/{foo}.bam")
    shell:
        './bloup.sh "{input}"'

rule b:
    input: dynamic("data2/{foo}.bam")
    output: touch("results/gene_count.tsv")
    shell: '''command'''
Like in your first example, the Snakefile wants to produce a file named results/gene_count.tsv.
Rule a takes all samples from the configuration file. This rule executes a script that chooses the files to create: I have 4 initial files (geneA, geneB, geneC, geneD) and it only touches two of them (the geneA and geneD files) in a second directory. There is no problem with the dynamic function.
Rule b takes all the dynamic files created by rule a, so you just have to produce results/gene_count.tsv; I simply touched it in the example.
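For illustration, a hypothetical stand-in for bloup.sh written in Python (the real script is not shown here, and the quality check is a pure placeholder):

import os
import shutil
import sys

def passes_quality(path):
    # placeholder check: approve files larger than 1 kB (stand-in for real QC)
    return os.path.getsize(path) > 1024

# the rule passes all input paths as one space-separated argument
os.makedirs("data2", exist_ok=True)
for path in sys.argv[1].split():
    if passes_quality(path):
        shutil.copy(path, os.path.join("data2", os.path.basename(path)))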
Here is the log of Snakemake for more information :
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       a
        1       all
        1       b
        3

rule a:
    input: data/geneA.bam, data/geneB.bam, data/geneC.bam, data/geneD.bam
    output: data2/{*}.bam (dynamic)
Subsequent jobs will be added dynamically depending on the output of this rule
./bloup.sh "data/geneA.bam data/geneB.bam data/geneC.bam data/geneD.bam"
Dynamically updating jobs
Updating job b.
1 of 3 steps (33%) done

rule b:
    input: data2/geneD.bam, data2/geneA.bam
    output: results/gene_count.tsv
command
Touching output file results/gene_count.tsv.
2 of 3 steps (67%) done

localrule all:
    input: results/gene_count.tsv
3 of 3 steps (100%) done
This is not exactly an answer to your question, but rather a suggestion for reaching your goal.
I think it's not possible - or at least not trivial - to modify a yaml file during the pipeline run.
Personally, when I run Snakemake workflows I use external files that I call "metadata". They include a config file, but also a tab-separated file containing the list of samples (and possibly additional information about those samples). The config file contains a parameter giving the path to this file.
In such a setup, I would recommend having your rule a output another tab-file containing the selected samples; the path to this file can be included in the config file (even though the file doesn't exist when you start the workflow). Rule b would take that file as an input.
In your case you could have:
config:
    samples: "/path/to/samples.tab"
    valid_samples: "/path/to/valid_samples.tab"
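As an illustration, rule b could then look something like this (a sketch only; the command and its flags are placeholders):

rule b:
    input:
        bams=expand("data/{sample}.bam", sample=config['samples']),
        # written by rule a (rule a's output should be this same path),
        # so listing it here makes rule a run first
        valid=config['valid_samples']
    output:
        "results/gene_count.tsv"
    shell:
        # hypothetical command that reads the approved-sample list itself
        "command --samples {input.valid} --out {output}"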
I don't know if this makes sense for you, since it's based on my own organization. I find it useful because it allows storing more information than just sample names, and if you have hundreds of samples it's much easier to manage!