Split multiple files into multiple parts with Snakemake

I'm building a pipeline that takes a list of files as input (located anywhere on disk), splits all of these files into smaller pieces, and then does some computation on each piece before merging the results. I'm struggling with the first step.
For example:
Input files = A and B
A and B are each split into 10 files: A1, A2, A3, A4… B9, B10.
Some computation is performed on each of the sub-files: results_A1, results_A2… results_B10.
The results are merged according to the input file they came from, so we end up with
results_A_merged and results_B_merged.
The tool that splits the files (seqkit split) takes the number of pieces to split a file into, the file to split, and an output dir, and writes the split files into that output dir with a fixed naming pattern. If the input file is path/to/file_A.fasta, it will output output_dir/file_A.part_001.fasta, output_dir/file_A.part_002.fasta, etc.
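Since the naming pattern is fixed, the expected output names can be computed ahead of time in plain Python. A minimal sketch of such a helper (the function name is mine, purely for illustration):

import os

def seqkit_part_names(input_fasta, parts, out_dir="output_dir"):
    """Predict the file names that `seqkit split --by-part <parts>` will write."""
    stem, ext = os.path.splitext(os.path.basename(input_fasta))
    return [os.path.join(out_dir, f"{stem}.part_{i:03d}{ext}") for i in range(1, parts + 1)]

# seqkit_part_names("path/to/file_A.fasta", 2)
# -> ['output_dir/file_A.part_001.fasta', 'output_dir/file_A.part_002.fasta']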
I managed to do this with a single file as input.
my_files="path/to/file_1.fasta"
my_files_dir=[]
my_files_prefix=[]
my_files_extension=[]
###Store the path to the dir, the file name without extension, and the extension.
for i in my_files:
print(i)
my_files_dir.append(re.search(r'(.*)/(.*)',i).group(1))
my_files_prefix.append(re.search(r'(.*)/(.*)(\.[fna|fa|fasta])',i).group(2))
my_files_extension.append(re.search(r'(.*)/(.*)(\.fna)',i).group(3)) ###FIXME: hard coded shit...
###Create the name of all the splited files
my_temp_fasta=[]
for i in range(1,blast_jobs):
my_temp_fasta.append(my_files_prefix[0]+'.part_%03d'%i+my_files_extension[0])
###Split my file.
rule split_fasta:
input:
my_files
output:
expand('splited_fasta/{tmp_fasta_files}', tmp_fasta_files=my_temp_fasta)
params:
num_sequences=10
shell:
"seqkit split --out-dir splited_fasta --by-part {params.num_sequences} {input}"
But as soon as I try with multiple files, I cannot even manage to split them correctly.
Here is my non-working pipeline, which for the moment has only one rule that tries to split the files.
my_files=["path/to/file_1.fasta", "other/path/to/file_2.fasta"]
my_files_dir=[]
my_files_prefix=[]
my_files_extension=[]
###Store the path to the dir, the file name without extension, and the extension of each files.
for i in my_files:
print(i)
my_files_dir.append(re.search(r'(.*)/(.*)',i).group(1))
my_files_prefix.append(re.search(r'(.*)/(.*)(\.[fna|fa|fasta])',i).group(2))
my_files_extension.append(re.search(r'(.*)/(.*)(\.fna)',i).group(3)) ###FIXME: hard coded shit...
#Store all the files that will be created by the split command.
tmp=[]
my_temp_fasta_dict={}
for j in range(0,len(my_files)):
for i in range(1,10):
tmp.append(my_files_prefix[j]+'.part_%03d'%i+my_files_extension[j])
my_temp_fasta_dict[my_files_prefix[j]] = tmp
tmp=[]
##So I have a (useless...) dictionary, with file name prefix as key, and a list of splited file names as values.
rule split_fasta:
input:
my_files
output:
expand('splited_fasta/{tmp_fasta_files}', tmp_fasta_files=my_temp_fasta_dict.values())
params:
num_sequences=10
shell:
"seqkit split --out-dir splited_fasta --by-part {params.num_sequences} {input}"
This produces a single, wrong command that concatenates all my input files:
seqkit split --out-dir splited_fasta --by-part 5 path/to/file_1.fasta other/path/to/file_2.fasta
instead of running the command twice, once per input file. I just cannot manage to do this, and the worst thing is that it's probably easy...
Thank you in advance for your help.

It is a common mistake for Snakemake users to start reasoning bottom-up (from the input files to the target). Try the top-down approach instead (start with the target, then think about what is needed to build this target, and so on):
rule all:
    input:
        expand("results_{sample}_merged", sample=["A", "B"])

rule merge:
    input:
        expand("output_dir/file_{{sample}}.part_00{n}.fasta", n=range(1, 10))
    output:
        "results_{sample}_merged"

rule split:
    input:
        "{sample}"
    output:
        expand("output_dir/file_{{sample}}.part_00{n}.fasta", n=range(1, 10))
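For the seqkit example from the question, a fleshed-out version of this skeleton might look roughly like the sketch below. The sample-to-path mapping, the part count, and the your_tool/cat commands are placeholders, and it assumes seqkit keeps the input file's base name in the part names:

N_PARTS = 10
SAMPLES = {"A": "path/to/file_A.fasta", "B": "other/path/to/file_B.fasta"}
PARTS = [f"{i:03d}" for i in range(1, N_PARTS + 1)]

rule all:
    input:
        expand("results_{sample}_merged", sample=SAMPLES)

rule split_fasta:
    input:
        lambda wildcards: SAMPLES[wildcards.sample]
    output:
        expand("splited_fasta/file_{{sample}}.part_{part}.fasta", part=PARTS)
    params:
        parts=N_PARTS
    shell:
        "seqkit split --out-dir splited_fasta --by-part {params.parts} {input}"

rule compute:
    input:
        "splited_fasta/file_{sample}.part_{part}.fasta"
    output:
        "results/results_{sample}.part_{part}.out"
    shell:
        "your_tool {input} > {output}"  # placeholder for the real computation

rule merge:
    input:
        expand("results/results_{{sample}}.part_{part}.out", part=PARTS)
    output:
        "results_{sample}_merged"
    shell:
        "cat {input} > {output}"  # placeholder merge; adapt to the real output format

Requesting results_A_merged then pulls in the ten part results for A, which in turn trigger one split_fasta job on path/to/file_A.fasta.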

Related

How to implement splitting of files in snakemake when number of files is known

Context
rule A uses the split command in a shell directive.
The number of files generated by rule A depends on a user-specified value from the config and is thus known in advance.
In this related question the situation differs because the number of output files is unknown, but it refers to the dynamic() keyword, which has apparently been replaced by checkpoints. Is that really the correct way to go in this scenario? There is also something like the scatter-gather (scattergather) directive, but the example is not clear to me.
Code
chunks = config["chunks"]
sample_list = ["S1", "S2"]

rule all:
    input:
        expand("{sample}_chunk_{chunk}_done_something.tsv", sample=sample_list,
               chunk=[f"{i}".zfill(len(str(chunks))-1) for i in range(0, chunks)])

rule A:
    input:
        "input_file_{sample}.tsv"
    output:
        # the user-defined number of chunks, how to specify these?
    params: chunks=chunks
    shell:
        "split -n {params.chunks} --numeric-suffixes=1 --additional-suffix=.tsv {input[0]} some_prefix_{wildcards.sample}_"

rule B:
    input:
        "some_prefix_{sample}_{chunk}.tsv"
    output:
        "{sample}_chunk_{chunk}_done_something.tsv"
    shell:
        "#Do something"
Attempts
I tried using a checkpoint with an input function for rule B and using directory() in rule A. However, using directory() results in SyntaxError in line 253 of MySnakefile: Unexpected keyword directory in rule definition (Snakefile, line 253), and even if that did not throw an error, I don't know how to get chunks into the input function, since it is not a wildcard.
What is the best way to implement this splitting of an input file in Snakemake?
Since the number of chunks is known beforehand, you can set the number of output files in rule A from the chunks parameter using a list comprehension:

rule A:
    ...
    output:
        chunks = ["some_prefix_{{sample}}_{:02d}.tsv".format(x + 1) for x in range(chunks)]

With chunks = 2, this would expand to chunks = ["some_prefix_{sample}_01.tsv", "some_prefix_{sample}_02.tsv"], matching the naming scheme of the split output. The {sample} wildcard will be filled in by Snakemake's standard wildcard replacement.
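Put together, rule A could then look roughly like this (a sketch; it assumes GNU split's default two-character numeric suffixes, which line up with the {:02d} pattern):

chunks = config["chunks"]

rule A:
    input:
        "input_file_{sample}.tsv"
    output:
        ["some_prefix_{{sample}}_{:02d}.tsv".format(x + 1) for x in range(chunks)]
    params:
        chunks=chunks
    shell:
        "split -n {params.chunks} --numeric-suffixes=1 --additional-suffix=.tsv "
        "{input[0]} some_prefix_{wildcards.sample}_"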

How to write a Snakemake rule-all, where expand statements can handle the absence of all particular input files

I want to write a Snakemake pipeline that processes short-read sequencing files, long-read files, or both, depending on which file types are provided as input.
First, my Snakefile calls a shell script that creates a config file listing the names of all short-read files in the input directory under the heading short_reads, and all long-read files under the heading long_reads.
This is followed by my all rule:
rule all:
    input:
        expand("../qc/id/{sample}/fastqc_raw/{sample}_R1_fastqc.html", sample=config["samples_short"]),
        expand("../qc/id/{sample}/nanoplot_raw/NanoPlot-report.html", sample=config["samples_long"])
        ...
However, if one of the file types (long or short reads) is not provided, Snakemake fails with a KeyError.
If I modify the config file so that the heading is still there but contains no sample names, Snakemake tries to build the input with the value None, e.g.
Missing input files for rule nanoplot_raw:
../raw_reads/None_ont.fastq.gz
How can I design rule all so that it can handle only short reads, only long reads, or both sequence types as input?
Thanks for your help!
Does the following work?
if config["samples_short"]:
fastqc_short = expand("../qc/id/{sample}/fastqc_raw/{sample}_R1_fastqc.html", sample=config["samples_short"])
else:
fastqc_short = []
if config["samples_long"]:
nanoplot_long = expand("../qc/id/{sample}/nanoplot_raw/NanoPlot-report.html", sample=config["samples_long"])
else:
nanoplot_long = []
rule all:
input:
fastqc_short,
nanoplot_long,
...
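A more compact variant of the same idea (not from the original answer) uses dict.get with an empty fallback; the `or []` also covers the case where the heading exists but holds no sample names, so its value is None:

rule all:
    input:
        expand("../qc/id/{sample}/fastqc_raw/{sample}_R1_fastqc.html",
               sample=config.get("samples_short") or []),
        expand("../qc/id/{sample}/nanoplot_raw/NanoPlot-report.html",
               sample=config.get("samples_long") or []),
        # further targets go here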

Snakemake copy from several directories

Snakemake is super-confusing to me. I have files of the form:
indir/type/name_1/run_1/name_1_processed.out
indir/type/name_1/run_2/name_1_processed.out
indir/type/name_2/run_1/name_2_processed.out
indir/type/name_2/run_2/name_2_processed.out
where type, name, and the numbers are variable. I would like to aggregate files such that all files with the same "name" end up in a single dir:
outdir/type/name/name_1-1.out
outdir/type/name/name_1-2.out
outdir/type/name/name_2-1.out
outdir/type/name/name_2-2.out
How do I write a Snakemake rule to do this? I first tried the following:
rule rename:
    input:
        "indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"
    output:
        "outdir/{type}/{name}/{name}_{nameno}-{runno}.out"
    shell:
        "cp {input} {output}"

# example command: snakemake --cores 1 outdir/type/name/name_1-1.out
This worked, but doing it this way doesn't save me any effort because I have to know what the output files are ahead of time, so basically I'd have to pass all the output files as a list of arguments to snakemake, requiring a bit of shell trickery to get the variables.
So then I tried to use directory() (and gave up on preserving runno).
rule rename2:
    input:
        "indir/{type}/{name}_{nameno}"
    output:
        directory("outdir/{type}/{name}")
    shell:
        """
        for d in {input}/run_*; do
            i=0
            for f in ${{d}}/*processed.out; do
                cp ${{f}} {output}/{wildcards.name}_{wildcards.nameno}-${{i}}.out
            done
            let ++i
        done
        """
This gave me the error, Wildcards in input files cannot be determined from output files: 'nameno'. I get it; {nameno} doesn't exist in output. But I don't want it there in the directory name, only in the filename that gets copied.
Also, if I delete {nameno}, then it complains because it can't find the right input file.
What are the best practices here for what I'm trying to do? Also, how does one wrap their head around the fact that in Snakemake you specify outputs rather than inputs? I think this is what I find so confusing.
I guess what you need is the expand function:
rule all:
    input:
        expand("outdir/{type}/{name}/{name}_{nameno}-{runno}.out",
               type=TYPES,
               name=NAMES,
               nameno=NAME_NUMBERS,
               runno=RUN_NUMBERS)
TYPES, NAMES, NAME_NUMBERS and RUN_NUMBERS are the lists of all possible values for these wildcards. You either need to hardcode them or use the glob_wildcards function to collect these values:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS, = glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out")
This, however, would give you duplicates. If that is not desirable, remove them:
TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS, = map(set, glob_wildcards("indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"))
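Putting this together with the rename rule from the question, one possible sketch of a complete Snakefile uses expand with zip, so the wildcard values stay paired instead of forming a full cross product (which could request combinations that do not exist on disk):

TYPES, NAMES, NAME_NUMBERS, RUN_NUMBERS = glob_wildcards(
    "indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out")

rule all:
    input:
        expand("outdir/{type}/{name}/{name}_{nameno}-{runno}.out", zip,
               type=TYPES, name=NAMES, nameno=NAME_NUMBERS, runno=RUN_NUMBERS)

rule rename:
    input:
        "indir/{type}/{name}_{nameno}/run_{runno}/{name}_{nameno}_processed.out"
    output:
        "outdir/{type}/{name}/{name}_{nameno}-{runno}.out"
    shell:
        "cp {input} {output}"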

Including unforeseen file names as wildcards in Snakemake

gdc-fastq-splitter splits FASTQ files into read groups. For instance, if 3 different read groups are included in dummy.fq.gz, three FASTQ files will be generated: dummy_readgroup_1.fq.gz, dummy_readgroup_2.fq.gz, dummy_readgroup_3.fq.gz. Given that each original FASTQ file is in a different folder and contains a different number of read groups, the resulting files cannot easily be fed into the following step as wildcards.
Given that I do not know the exact names and number of the resulting files, is there a way to take the output of one rule as wildcards for the next one?
An alternative could be to list all the generated files and provide them as a list in a parallel Snakefile, but I am hoping for a more elegant solution.
This is my first ever question on Stack Overflow, and I have tried to check the existing questions. Please be kind if this question sounds silly or has already been answered :-)
It is not the prettiest, but this is the way it needs to be done:
import random
import glob
from pathlib import Path

SAMPLES = ['dummy', 'dommy']

rule all:
    input:
        [f"do_all_{sample}.out" for sample in SAMPLES]

def aggregate(wildcards):
    checkpoints.fastq_splitter.get(sample=wildcards.sample)
    read_groups = glob_wildcards(f"{wildcards.sample}_{{read_group}}.fastq.gz").read_group
    return [f"bam/{wildcards.sample}_{read_group}.bam" for read_group in read_groups]

rule do_everything:
    input:
        aggregate
    output:
        touch("do_all_{sample}.out")

rule do_sth_splitted:
    input:
        "{sample}_{read_group}.fastq.gz"
    output:
        touch("bam/{sample}_{read_group}.bam")

checkpoint fastq_splitter:
    input:
        "{sample}.fastq.gz"
    output:
        touch("{sample}.done")
    run:
        for i in range(random.randint(1, 5)):
            Path(f'{wildcards.sample}_{i}.fastq.gz').touch()
Before you run it, make sure the sample files exist: touch d{u,o}mmy.fastq.gz.
In checkpoint fastq_splitter we generate a random number of "fastq" files. In rule do_sth_splitted we pretend to align these against a genome, producing a BAM file for each read group. rule do_everything is there to check what the output of checkpoint fastq_splitter is, and its input is only evaluated after fastq_splitter is done. rule all makes sure everything is run for all samples.
Take a look at checkpoints for a proper explanation.

Reduce the set of input files dynamically during a snakemake run

This is more of a technical question regarding the capabilities of Snakemake. I was wondering whether it is possible to dynamically alter the set of input samples during a Snakemake run.
The reason I would like to do so is the following: let's assume a set of sample-associated BAM files. The first rule determines the quality of each sample (based on its BAM file), i.e. all input files are concerned.
However, given specified criteria, only a subset of samples is considered valid and should be processed further. So the next step (e.g. gene counting or something else) should only be done for the approved BAM files, as shown in the minimal example below:
configfile: "config.yaml"
rule all:
input: "results/gene_count.tsv"
rule a:
input: expand( "data/{sample}.bam", sample=config['samples'])
output: "results/list_of_qual_approved_samples.out"
shell: '''command'''
rule b:
input: expand( "data/{sample}.bam", sample=config['valid_samples'])
output: "results/gene_count.tsv"
shell: '''command'''
In this example, rule a would have to extend the configuration file with a list of valid sample names, even though I believe that is not possible.
Of course, the straightforward solution would be to have two distinct inputs: 1) all BAM files and 2) a file that lists all valid files. This would boil down to doing the sample selection within the code of the rule.
rule alternative_b:
    input:
        expand("data/{sample}.bam", sample=config['samples']),
        "results/list_of_qual_approved_samples.out"
    output: "results/gene_count.tsv"
    shell: '''command'''
However, do you see a way to set up the rules such that the behaviour of the first example can be achieved?
Many thanks in advance,
Ralf
Another approach, one that does not use "dynamic".
It's not that you don't know how many files you are going to use, but rather that you are only using a subset of the files you start with. Since you are able to generate a "samples.txt" list of all the potential files, I'm going to assume you have a firm starting point.
I did something similar, where I have initial files that I want to process for validity (in my case, improving quality: sorting, indexing, etc.). I then want to ignore everything except my resulting files.
What I suggest, to avoid creating a secondary list of sample files, is to create a second data directory: data (reBamDIR) and data2 (BamDIR). In data2 you symlink over all the files that are valid. That way, Snakemake can just process EVERYTHING in the data2 directory. This makes moving down the pipeline easier: the pipeline stops relying on sample lists and can just process everything using wildcards (much easier to code). This is possible because when I symlink I also standardize the names. I list the symlinked files in the rule's output so Snakemake knows about them and can create the DAG.
`-- output
    |-- bam
    |   |-- Pfeiffer2.bam -> /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam
    |   `-- Pfeiffer2.bam.bai -> /home/tboyarski/share/projects/tboyarski/gitRepo-LCR-BCCRC/Snakemake/buildArea/output/reBam/Pfeiffer2_realigned_sorted.bam.bai
    |-- fastq
    |-- mPile
    |-- reBam
    |   |-- Pfeiffer2_realigned_sorted.bam
    |   `-- Pfeiffer2_realigned_sorted.bam.bai
In this case, all you need is a return value in your "validator", and a conditional operator to respond to it.
I would argue you already have this somewhere, since you must be using conditionals in your validation step. Instead of using it to write the file name to a txt file, just symlink the file in a finalized location and keep going.
My raw data is in reBamDIR.
The final data I store in BamDIR.
I only symlink the files from this stage in the pipeline over to bamDIR.
There are OTHER files in reBamDIR, but I don't want the rest of my pipeline to see them, so I'm filtering them out.
I'm not exactly sure how to implement the "validator" and your conditional, as I don't know your situation, and I'm still learning too. I'm just trying to offer alternative perspectives/approaches.
from time import gmtime, strftime

rule indexBAM:
    input:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"])
    output:
        expand("{outputDIR}/{reBamDIR}/{{samples}}{fileTAG}.bam.bai", outputDIR=config["outputDIR"], reBamDIR=config["reBamDIR"], fileTAG=config["fileTAG"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam.bai", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"]),
        expand("{outputDIR}/{bamDIR}/{{samples}}.bam", outputDIR=config["outputDIR"], bamDIR=config["bamDIR"])
    params:
        bamDIR=config["bamDIR"],
        outputDIR=config["outputDIR"],
        logNAME="indexBAM." + strftime("%Y-%m-%d.%H-%M-%S", gmtime())
    log:
        "log/" + config["reBamDIR"]
    shell:
        "samtools index {input} {output[0]} " \
        " 2> {log}/{params.logNAME}.stderr " \
        "&& ln -fs $(pwd)/{output[0]} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam.bai " \
        "&& ln -fs $(pwd)/{input} $(pwd)/{params.outputDIR}/{params.bamDIR}/{wildcards.samples}.bam"
I think I have an answer that could be interesting.
At first I thought it wasn't possible, because Snakemake needs to know the final files at the end, so you can't just separate a set of files without knowing the separation at the beginning.
But then I tried the dynamic function. With dynamic you don't have to know the number of files that will be created by the rule.
So I coded this:
rule all:
    input: "results/gene_count.tsv"

rule a:
    input: expand("data/{sample}.bam", sample=config['samples'])
    output: dynamic("data2/{foo}.bam")
    shell:
        './bloup.sh "{input}"'

rule b:
    input: dynamic("data2/{foo}.bam")
    output: touch("results/gene_count.tsv")
    shell: '''command'''
As in your first example, the Snakefile wants to produce a file named results/gene_count.tsv.
Rule a takes all the samples from the configuration file. This rule executes a script that chooses which files to create. I have 4 initial files (geneA, geneB, geneC, geneD) and it only touches two of them as output (the geneA and geneD files) in a second directory. There is no problem with the dynamic function.
Rule b takes all the dynamic files created by rule a. So you just have to produce results/gene_count.tsv; I simply touch it in the example.
Here is the Snakemake log for more information:
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 a
1 all
1 b
3
rule a:
input: data/geneA.bam, data/geneB.bam, data/geneC.bam, data/geneD.bam
output: data2/{*}.bam (dynamic)
Subsequent jobs will be added dynamically depending on the output of this rule
./bloup.sh "data/geneA.bam data/geneB.bam data/geneC.bam data/geneD.bam"
Dynamically updating jobs
Updating job b.
1 of 3 steps (33%) done
rule b:
input: data2/geneD.bam, data2/geneA.bam
output: results/gene_count.tsv
command
Touching output file results/gene_count.tsv.
2 of 3 steps (67%) done
localrule all:
input: results/gene_count.tsv
3 of 3 steps (100%) done
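Note that dynamic() has since been deprecated in favor of checkpoints and is no longer available in recent Snakemake releases. A rough checkpoint-based sketch of the same idea, replacing rule a with a checkpoint and assuming bloup.sh creates data2/ and writes its selected files there, could be:

checkpoint a:
    input:
        expand("data/{sample}.bam", sample=config['samples'])
    output:
        directory("data2")
    shell:
        './bloup.sh "{input}"'

def approved_bams(wildcards):
    # Only evaluated after checkpoint `a` has finished, so data2/ exists by then.
    checkpoints.a.get()
    names = glob_wildcards("data2/{foo}.bam").foo
    return expand("data2/{foo}.bam", foo=names)

rule b:
    input:
        approved_bams
    output:
        touch("results/gene_count.tsv")
    shell: '''command'''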
**This is not exactly an answer to your question, but rather a suggestion to reach your goal.**
I think it's not possible, or at least not trivial, to modify a YAML file during the pipeline run.
Personally, when I run Snakemake workflows I use external files that I call "metadata". They include a config file, but also a tab-file containing the list of samples (and possibly additional information about those samples). The config file contains a parameter that is the path to this file.
In such a setup, I would recommend having your "rule a" output another tab-file containing the selected samples, and the path to this file could be included in the config file (even though it doesn't exist when you start the workflow). Rule b would take that file as an input.
In your case you could have:
config:
    samples: "/path/to/samples.tab"
    valid_samples: "/path/to/valid_samples.tab"
I don't know if it makes sense, since it's based on my own organization. I think it's useful because it allows storing more information than just sample names, and if you have 100s of samples it's much easier to manage!
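As a small sketch of how that could look in the Snakefile (the tab-file layout, one sample name in the first column per line, is my assumption): the sample table listed in the config is read at parse time, and the valid-samples table can be read the same way once rule a has produced it, for example behind an existence check or in a second invocation of the workflow.

import os

configfile: "config.yaml"

def read_sample_table(path):
    # First column of a tab-separated file, one sample per line (assumed layout).
    with open(path) as handle:
        return [line.split("\t")[0].strip() for line in handle if line.strip()]

SAMPLES = read_sample_table(config["samples"])

# Fall back to the full sample list until rule a has written the filtered table.
if os.path.exists(config["valid_samples"]):
    VALID_SAMPLES = read_sample_table(config["valid_samples"])
else:
    VALID_SAMPLES = SAMPLES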