How do Snakemake checkpoints work when I do not want to make a folder?

I have a snakemake file where one rule produces a file from which I would like to extract the header and use the column names as wildcards in my rule all.
The Snakemake guide provides an example where it creates new folders named after the wildcards, but I would like to avoid that if possible, since in some cases it would mean creating 100-200 folders. Any suggestions on how to make it work?
link to snakemake guide:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
import pandas as pd

rule all:
    input:
        final_report = expand('report_{fruit}.txt', fruit= ???)

rule create_file:
    input:
    output:
        fruit = 'fruit_file.csv'
    run:
        ....

rule next:
    input:
        fruit = 'fruit_file.csv'
    output:
        report = 'report_{phenotype}.txt'
    run:
        fruit_file = pd.read_csv(input.fruit, header=0, sep='\t')
        fruits = fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = 'touch report_' + i + '.txt'
            shell(cmd)
This is a simplified workflow, since I am actually using some long script to produce both the pheno_file.csv and the report files.
The pheno_file.csv is tab-separated and could look like this:
FID IID Apple Banana Plum
Mouse Mickey 0 0 1
Mouse Minnie 1 0 1
Duck Donnald 0 1 0
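For reference, reading this table and dropping the first two columns gives exactly the header names meant to become wildcard values (a quick check with plain pandas, using the simplified file name from the code above):

import pandas as pd

fruit_file = pd.read_csv('fruit_file.csv', header=0, sep='\t')
print(fruit_file.columns.tolist()[2:])  # ['Apple', 'Banana', 'Plum']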

I think you are misreading the Snakemake checkpoint example: you only need to create one folder in your case. The example has a wildcard (sample) in the folder name, but that part of the output name is known ahead of time.
checkpoint fruit_reports:
    input:
        fruit = 'fruit_file.csv'
    output:
        report_dir = directory('reports')
    run:
        fruit_file = pd.read_csv(input.fruit, header=0, sep='\t')
        fruits = fruit_file.columns.tolist()[2:]
        shell('mkdir -p {output}')  # make sure the report directory exists before touching files into it
        for i in fruits:
            cmd = f'touch {output}/report_{i}.txt'
            shell(cmd)
Since you do not know all the names (fruits) ahead of time, you cannot include them in the all rule. You need an intermediate rule to bring everything together, for example a final report file:
rule all:
    input: 'report.txt'
Then after the checkpoint:
def aggregate_fruit(wildcards):
    checkpoint_output = checkpoints.fruit_reports.get(**wildcards).output[0]
    return expand("reports/report_{i}.txt",
                  i=glob_wildcards(os.path.join(checkpoint_output, "report_{i}.txt")).i)

rule report:
    input:
        aggregate_fruit
    output:
        "report.txt"
    shell:
        "ls -1 {input} > {output}"


Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline, but when I run it all the files for rule all are reported as missing:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']

def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)

def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)

rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files that are listed by the terminal seem correct, so it seems it is seeing what I want to produce but the shell is not making anything at all.
I want to produce a list of files that have grepped the list of barcodes I have in a file. But I get "Missing input files for rule all:"
There are two issues:
You have an impossible wildcard_constraints defined for {barcodes}
Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now only matches capital letters. Before, it also included end-of-line matching (a trailing $), which was impossible to satisfy for this wildcard because additional text follows it in the filepath.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you'd get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
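A rough illustration of why the old constraint could never match (plain Python re, approximating how Snakemake turns the output pattern plus constraints into a regex; the barcode and sample values are made up):

import re

# made-up example matching the pattern grepped/{barcodes}{sample}_R1_001.plate.fastq
filename = "grepped/ACGGTWell01_R1_001.plate.fastq"

# old constraint: the trailing $ demands end-of-string right after the barcode, but text follows
old = re.compile(r"grepped/(?P<barcodes>[a-zA-Z]+$)(?P<sample>.+)_R1_001\.plate\.fastq")
# new constraints: capitals only for the barcode, sample anchored to "Well..."
new = re.compile(r"grepped/(?P<barcodes>[A-Z]+)(?P<sample>Well.*)_R1_001\.plate\.fastq")

print(old.match(filename))              # None
print(new.match(filename).groupdict())  # {'barcodes': 'ACGGT', 'sample': 'Well01'}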
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean having a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq.
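A sketch of what fastq_grep could look like with that layout (untested; only the output paths change, and the expands in rule all would need the same grepped/{barcodes}/... pattern):

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq",
    output:
        out1 = "grepped/{barcodes}/{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}/{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && "
        "fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"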
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",

refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]

def getsamples():
    import glob
    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples

def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines

rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),

rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This effectively disables free regex matching altogether. In your case, I would do something like:
import re

wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values returned by getbarcodes(), whatever they are, and the same holds for {sample}. In my opinion this is better than trying to anticipate what combinations a wildcard regex may match.
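For example, with made-up barcode values the resulting constraint is just an alternation of the literal strings, so nothing else can match:

import re

barcodes = ["ACGGT", "TTGCA", "AACCG"]            # made-up values for illustration
print('|'.join(re.escape(x) for x in barcodes))   # ACGGT|TTGCA|AACCG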

Using multiple config files in snakemake results in a 'not the same wildcards' error

I had asked an earlier question about running the same snakemake pipeline for multiple datasets, and one of the solutions mentioned was using multiple config files, suggested by @bli. I am trying to implement it, but got an error when I have to read in a file which has the sample information.
error:
SyntaxError:
Not all output, log and benchmark files of rule fastqc contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
File "Snakefile", line 64, in <module>
I have seen this error before, but I cannot figure out why it comes up in this case, when every input and output has sample as a wildcard. Any help is much appreciated!
My Snakefile looks like this:
import os
import pandas as pd
import yaml

configfile: "main_config.yaml"

all_keys = list(config.keys())
print(all_keys)

datasets = config["datasets"]
print(datasets.items())

for p_id, p_info in datasets.items():
    for key in p_info:
        print(key + '------', p_info[key])
    conf_file = p_info["conf"]
    conf_fh = open(conf_file)
    dat_conf = yaml.safe_load(conf_fh)
    output = dat_conf["output_dir"]
    samples = dat_conf["sampletable"]
    R1 = dat_conf["R1"]
    R2 = dat_conf["R2"]
    print(samples)
    print(R1)

SampleTable = pd.read_table(samples, index_col=0)
SAMPLES = list(SampleTable.index)
print(SAMPLES)

PAIRED_END = ('R2' in SampleTable.columns)
FRACTIONS = ['R1']
if PAIRED_END: FRACTIONS += ['R2']

qc = config["qc_only"]

def all_input_reads(qc):
    if config["qc_only"]:
        return expand("{output}/fastqc/{sample}" + config["R1"] + "_fastqc.html", sample=SAMPLES)
    else:
        return expand("{output}/fastqc/{sample}" + config["R1"] + "_fastqc.html", sample=SAMPLES)

rule all:
    input:
        all_input_reads

rule fastqc:
    input:
        unpack(lambda wc: dict(SampleTable.loc[wc.sample]))
    output:
        R1 = "{output}/fastqc/{sample}{R1}" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}{R2}" + "_fastqc.html"
    conda:
        "../envs/fastqc.yaml"
    log:
        "{output}/logs/qc/fastqc_{sample}_unfilt.log"
    shell: "fastqc -o {output}/fastqc {input.R1} {input.R2} >> {log}"
The main config file is:

datasets:
  dat1:
    conf: "config_files/data1_config.yaml"
  dat2:
    conf: "config_files/data2_config.yaml"

qc_only: FALSE
and the individual config files look like this (data1_config.yaml):

# List of files
sampletable: "samples_data1.tsv"
output_dir: "data1"

## Cutadapt
## IMPORTANT ****** If you want to remove primers uncomment line 51 in utils/rules/qc_cutadapt.smk which will allow for primers to be removed
primers:
  # Illumina V3V4 protocol primers
  fwd_primer: "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
  rev_primer: "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"
  fwd_primer_rc: "CTGCWGCCNCCCGTAGGCTGTCTCTTATACACATCTGACGCTGCCGACGA"
  rev_primer_rc: "GGATTAGATACCCBDGTAGTCCTGTCTCTTATACACATCTCCGAGCCCACGAGAC"

R1: "_R1"
R2: "_R2"

maxEE:
  - 2
  - 2
truncQ: 2
Here is your fastqc rule's output with a little formatting:
rule fastqc:
    output:
        R1 = "{output}/fastqc/{sample}{R1}" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}{R2}" + "_fastqc.html"
The R1 and R2 wildcards are synonyms, and there is no way for Snakemake to differentiate them. For example, imagine that the rule all requires this file to be created: output_dir/fastqc/sample_aR1_fastqc.html. Which output should Snakemake assign this file to, output.R1 or output.R2?
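Roughly speaking, Snakemake turns each output pattern into a regex with one group per wildcard (each matching .+ by default), and the example filename satisfies both patterns equally well, which is exactly the ambiguity the error is about (a plain-Python approximation):

import re

# approximation of the two output patterns, one group per wildcard
as_R1 = re.compile(r"(?P<output>.+)/fastqc/(?P<sample>.+)(?P<R1>.+)_fastqc\.html")
as_R2 = re.compile(r"(?P<output>.+)/fastqc/(?P<sample>.+)(?P<R2>.+)_fastqc\.html")

f = "output_dir/fastqc/sample_aR1_fastqc.html"
print(as_R1.match(f).groupdict())  # {'output': 'output_dir', 'sample': 'sample_aR', 'R1': '1'}
print(as_R2.match(f).groupdict())  # matches just as well -> R1 and R2 are interchangeable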
You need to separate these parameters with a non-wildcard part, like this:

rule fastqc:
    output:
        R1 = "{output}/fastqc/{sample}R1" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}R2" + "_fastqc.html"

Interconnected variables in snakemake

Let's say I have sample SAMPLE_A, divided into two files SAMPLE_A_1, SAMPLE_A_2 and associated with barcodes AATT, TTAA, and SAMPLE_B, associated with barcodes CCGG, GGCC, GCGC and divided into 4 files SAMPLE_B_1...SAMPLE_B_4.
I can create getSampleNames() to get [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B,SAMPLE_B] and [1,2,1,2,3,4] and then zip them to get the combination {sample}_{id}. And then I can do the same thing for the barcodes: [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B] and [AATT, TTAA,CCGG, GGCC, GCGC].
SAMPLES_ID, IDs = getSampleNames()
SAMPLES_BC, BCs = getBCs(set(SAMPLES_ID))

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
    shell:
        "isoseq3 refine {input} "

rule split:
    input:
        expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', zip, sample = SAMPLES_ID, id = IDs),
    output:
        expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs),
    shell:
        "python {params.script_dir}/split_cells_bam.py"

rule dedup_split:
    input:
        "{sample}/cells/{barcode}_{sample}/fltnc.bam"
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
    shell:
        "isoseq3 dedup {input} {output.bam} "

rule merge:
    input:
        expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
               zip, sample = SAMPLES_BC, barcode = BCs),
How can I prevent the rule split from being a bottleneck in my pipeline? For now it waits for the refine rule to be done for all samples, while that is not necessary; every sample should run independently, but I can't do that because the set of barcodes is different for each sample. Is there a way to have something like
expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs[SAMPLES_BC]), where each {sample} of SAMPLES_BC is a key in a BCs dictionary? And the same for IDs? I know I can use functions, but then I'm not sure how to propagate the {barcode} through the rules.
Based on your comment, there are a few routes to take which would involve changing your data structure holding the samples, barcodes, and ids. For now, you can just create a rule per sample:
for sample in set(SAMPLES_ID):  # get uniq samples
    # get ids and barcodes for this sample
    ids = [tup[1] for tup in zip(SAMPLES_ID, IDs) if tup[0] == sample]
    bcs = [tup[1] for tup in zip(SAMPLES_BC, BCs) if tup[0] == sample]

    rule:
        name: f'{sample}_split'
        input:
            expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
                   sample = sample, id = ids),
        output:
            expand("{sample}/cells/{barcode}_{sample}/fltnc.bam",
                   sample = sample, barcode = bcs),
        shell:
            "python {params.script_dir}/split_cells_bam.py"
You don't need zip in the expand since the ids and bcs are for a single sample. I don't think this is the best way in general, but it will be the easiest for your current workflow.
Just noticing your shell command, how are you passing the input/output to your script?
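If the script does not already discover its files on its own, one option (sketch only; it assumes split_cells_bam.py can take the paths as arguments and that params.script_dir is defined in the rule) is to pass the expanded input and output paths explicitly:

        shell:
            # hypothetical interface: all input bams followed by all output paths
            "python {params.script_dir}/split_cells_bam.py {input} {output}"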
I found how to use dictionaries through functions, which solved my problem!
The major drawback of this solution is that you have to create a dummy file as the output of the split rule, instead of checking whether each '{sample}/cells/{barcode}_{sample}/fltnc.bam' file is created, so I am still looking for something more elegant...
IDs = getSampleNames()  # {SAMPLE_A: [1, 2], SAMPLE_B: [1, 2, 3, 4]}
SAMPLES = list(IDs.keys())
BCs = getBCs(SAMPLES)  # {SAMPLE_A: [AATT, TTAA], SAMPLE_B: [CCGG, GGCC, GCGC]}

# function linking IDs and SAMPLE
def sample2ids(wildcards):
    return expand('{{sample}}/polyA_trimming/{{sample}}_{id}.fltnc.bam',
                  id = IDs[wildcards.sample])

# function linking BCs and SAMPLE
def sample2bc(wildcards):
    return expand('{{sample}}/cells/{barcode}_{{sample}}/dedup/dedup.bam',
                  barcode = BCs[wildcards.sample])

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',

rule split:
    input:
        sample2ids
    output:
        # cannot use a function here, so I create a dummy file to pipe
        'dummy_file.txt'

rule dedup_split:
    input:
        'dummy_file.txt'
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",

rule merge:
    input:
        sample2bc
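One way to soften that drawback without a checkpoint might be a per-sample flag file instead of the single global dummy, so that each sample's split and downstream dedup can run as soon as that sample's refine outputs exist (a sketch, untested):

rule split:
    input:
        sample2ids
    output:
        # one flag per sample instead of one shared dummy_file.txt
        touch('{sample}/split.done')

rule dedup_split:
    input:
        '{sample}/split.done'
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",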

Snakemake checkpoint and creating an expand list/wildcards based on created output files

Hope someone can guide me in the right direction. See below for a small working example:
import os
from glob import glob

rule all:
    input:
        expand("output/step4/{number}_step4.txt", number=["1", "2", "3", "4"])

checkpoint split_to_fasta:
    input:
        seqfile = "files/seqs.csv"  # just an empty file in this example
    output:
        fastas = directory("output/fasta")
    shell:
        # A python script will create the files below and I don't know them beforehand.
        "mkdir -p {output.fastas} ; "
        "echo test > {output.fastas}/1_LEFT_sample_name_foo.fa ;"
        "echo test > {output.fastas}/1_RIGHT_sample_name_foo.fa ;"
        "echo test > {output.fastas}/2_LEFT_sample_name_spam.fa ;"
        "echo test > {output.fastas}/2_RIGHT_sample_name_bla.fa ;"
        "echo test > {output.fastas}/3_LEFT_sample_name_egg.fa ;"
        "echo test > {output.fastas}/4_RIGHT_sample_name_ham.fa ;"

rule step2:
    input:
        fasta = "output/fasta/{fasta}.fa"
    output:
        step2 = "output/step2/{fasta}_step2.txt",
    shell:
        "cp {input.fasta} {output.step2}"

rule step3:
    input:
        file = rules.step2.output.step2
    output:
        step3 = "output/step3/{fasta}_step3.txt",
    shell:
        "cp {input.file} {output.step3}"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
    ### don't know where to use this line correctly
    ### files = [Path(x).stem.split("_")[0] for x in glob("output/step3/" + f"*_step3.txt")]
    return expand("output/step3/{fasta}_step3.txt", fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta)

def get_id_files(wildcards):
    blast = glob("output/step3/" + f"{wildcards.number}*_step3.txt")
    return sorted(blast)

rule step4:
    input:
        step3files = aggregate_input,
        idfiles = get_id_files
    output:
        step4 = "output/step4/{number}_step4.txt",
    run:
        shell("cat {input.idfiles} > {output.step4}")
Because of rule all, snakemake knows how to "start" the pipeline. I hard-coded the numbers 1, 2, 3 and 4, but in a real situation I don't know these numbers beforehand.
expand("output/step4/{number}_step4.txt", number=["1","2","3","4"])
What I want is to get those numbers based on the output filenames of split_to_fasta, step2 or step3, and then use them as targets for the wildcards. (I can easily get the numbers with glob and split.)
I want to do it with wildcards, as in def get_id_files, because I want to execute the next step in parallel. In other words, the following sets of files need to go into the next step:
[1_LEFT_sample_name_foo.fa, 1_RIGHT_sample_name_foo.fa]
[2_LEFT_sample_name_spam.fa, 2_RIGHT_sample_name_bla.fa]
[3_LEFT_sample_name_egg.fa]
[4_RIGHT_sample_name_ham.fa]
I may need a second checkpoint but am not sure how to implement that.
EDIT (solution):
I was already worried my question was not so clear, so I made another example, see below. This pipeline generates some fake files (end of step 3). From this point I want to continue and process all files with the same id in parallel. The id is the number at the beginning of the filename. I could make a second pipeline that "starts" with step 4 and execute them one after the other, but that sounds like bad practice. I think I need to define a target for the next rule (step 4) but don't know how to do that in this situation. The code to define the target itself is something like:
files = [Path(x).stem.split("_")[0] for x in glob("output/step3/"+ f"*_step3.txt") ]
ids = list(set(files))
expand("output/step4/{number}_step4.txt", number=ids)
The second example (Edited to the solution):
import os
from glob import glob
from pathlib import Path

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
    ids = [Path(x).stem.split("_")[0] for x in glob("output/fasta/" + f"*.fa")]
    return expand("output/step3/{fasta}_step3.txt", fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta) + expand("output/step4/{number}_step4.txt", number=ids)

rule all:
    input:
        aggregate_input,

checkpoint split_to_fasta:
    input:
        seqfile = "files/seqs.csv"
    output:
        fastas = directory("output/fasta")
    shell:
        # A python script will create the files below and I don't know them beforehand.
        # I could get the numbers if needed.
        "mkdir -p {output.fastas} ; "
        "echo test1 > {output.fastas}/1_LEFT_sample_name_foo.fa ;"
        "echo test2 > {output.fastas}/1_RIGHT_sample_name_foo.fa ;"
        "echo test3 > {output.fastas}/2_LEFT_sample_name_spam.fa ;"
        "echo test4 > {output.fastas}/2_RIGHT_sample_name_bla.fa ;"
        "echo test5 > {output.fastas}/3_LEFT_sample_name_egg.fa ;"
        "echo test6 > {output.fastas}/4_RIGHT_sample_name_ham.fa ;"

rule step2:
    input:
        fasta = "output/fasta/{fasta}.fa"
    output:
        step2 = "output/step2/{fasta}_step2.txt",
    shell:
        "cp {input.fasta} {output.step2}"

rule step3:
    input:
        file = rules.step2.output.step2
    output:
        step3 = "output/step3/{fasta}_step3.txt",
    shell:
        "cp {input.file} {output.step3}"

def get_id_files(wildcards):
    # blast = glob("output/step3/" + f"{wildcards.number}*_step3.txt")
    blast = expand(f"output/step3/{wildcards.number}_{{sample}}_step3.txt", sample=glob_wildcards(f"output/fasta/{wildcards.number}_{{sample}}.fa").sample)
    return blast

rule step4:
    input:
        idfiles = get_id_files
    output:
        step4 = "output/step4/{number}_step4.txt",
    run:
        shell("cat {input.idfiles} > {output.step4}")
Replacing your blast line in get_id_files in the second example with
blast = expand(f"output/step3/{wildcards.number}_{{sample}}_step3.txt", sample=glob_wildcards(f"output/fasta/{wildcards.number}_{{sample}}.fa").sample)
This is my way of understanding checkpoints: when the input of a rule (say rule a) comes from a checkpoint, anything upstream of a is hidden from the first evaluation of the DAG; only after the checkpoint has been successfully executed does the second round of evaluation start, this time knowing the outputs of the checkpoint.
So in your case, putting the checkpoint-dependent input in rule all hides step2/3/4 at the 1st evaluation (since these steps are upstream of all). Then the checkpoint gets executed, then the 2nd evaluation happens. At that point you are evaluating a new workflow that knows all outputs of the checkpoint, so you can 1. infer the ids and 2. infer the corresponding step3 outputs from the split_to_fasta outputs.
1st evaluation: Rule all -> checkpoint split_to_fasta (TBD)
2nd evaluation (split_to_fasta executed): Rule all -> checkpoint split_to_fasta -> Rule step_4 -> Rule step_3 -> Rule step_2
get_id_files runs at step_4, when step_3 has not been executed yet; this is why you need to infer the ids from the outputs of split_to_fasta instead of directly looking for the outputs of step 3.
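A hedged variant of get_id_files that asks the checkpoint for its output directory instead of glob-bing a hard-coded path (same idea as the accepted line, just routed through the checkpoint API):

def get_id_files(wildcards):
    # re-evaluated after the checkpoint has run, so the directory listing is complete
    ck_output = checkpoints.split_to_fasta.get().output[0]
    samples = glob_wildcards(os.path.join(ck_output, f"{wildcards.number}_{{sample}}.fa")).sample
    return expand("output/step3/{number}_{sample}_step3.txt",
                  number=wildcards.number, sample=samples)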
If I understand the problem correctly, the following line should be changed:
ids = [Path(x).stem.split("_")[0] for x in glob("output/step3/"+ f"*_step3.txt") ]
Right now it's glob-bing for files in step3 (I presume these files do not yet exist). Instead, the right thing to glob is the output of the rule split_to_fasta, so something like this:
ids = [Path(x).stem.split("_")[0] for x in glob("output/fasta/*.fa")]
And later to use these ids to extract the relevant wildcards and use them in the expand("output/step3/{fasta}_step3.txt", ...).
Sorry this is not a functional example, but the original code is a bit hard to read.

A curious case of snakemake

I have a similar goal as in Snakemake: unknown output/input files after splitting by chromosome, however, as pointed out there, I do know in advance that I have, e.g., 5 chromosomes in my sample.bam file. Using as a toy example:
$ cat sample.bam
chromosome 1
chromosome 2
chromosome 3
chromosome 4
chromosome 5
I wish to "split" this bam file, and then do a bunch of per chromosome downstream jobs on the resulting chromosomes. The simplest solution I could conjure up was:
chromosomes = '1 2 3 4 5'.split()

rule master :
    input :
        expand('sample.REF_{chromosome}.bam',
               chromosome = chromosomes)

rule chromosome :
    output :
        touch('sample.REF_{chromosome}.bam')
    input : 'split.done'

rule split_bam :
    output :
        touch('split.done')
    input : 'sample.bam'
    run :
        print('splitting bam..')
        chromosome = 1
        for line in open(input[0]) :
            outfile = 'sample.REF_{}.bam'.format(chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1
results in empty sample.REF_{chromosome}.bam files. I understand why this happens, and indeed snakemake even warns, e.g.,
Warning: the following output files of rule chromosome were not present when the DAG was created:
{'sample.REF_3.bam'}
Touching output file sample.REF_3.bam.
that is, these files were not in the DAG to begin with, and snakemake touches them with empty versions, erasing what was put there. I guess I am surprised by this behavior, and wonder if there is a good reason for it. Note that this behavior is not limited to snakemake's touch(), since, should I replace touch('sample.REF_{chromosome}.bam') with simply 'sample.REF_{chromosome}.bam' and then have a shell: 'touch {output}', I get the same result. Now, of course, I have found a perfectly acceptable workaround:
chromosomes = '1 2 3 4 5'.split()

rule master :
    input :
        expand('sample.REF_{chromosome}.bam',
               chromosome = chromosomes)

rule chromosome :
    output : 'sample.REF_{chromosome}.bam'
    input : 'split_dir'
    shell : 'mv {input}/{output} {output}'

rule split_bam :
    output :
        temp(directory('split_dir'))
    input : 'sample.bam'
    run :
        print('splitting bam..')
        shell('mkdir {output}')
        chromosome = 1
        for line in open(input[0]) :
            outfile = '{}/sample.REF_{}.bam'.format(output[0], chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1
but I am surprised I have to go through these gymnastics for a seemingly simple task. For this reason, I wonder if there is a better design, or if I am not asking the right question. Any advice/ideas are most welcome.
I think your example is a bit contrived for a couple of reasons. The rule split_bam already produces the final output sample.REF_{chromosome}.bam. Also, the rule master uses the chromosomes taken from the variable chromosomes whereas the rule split_bam iterates through the bam file to get the chromosomes.
My impression is that what you want could be something like:
chromosomes = '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam',
               chromosome = chromosomes)

rule split_bam :
    input:
        'sample.bam'
    output:
        expand('sample.split.{chromosome}.bam', chromosome = chromosomes)
    run:
        print('splitting bam..')
        for chromosome in chromosomes:
            outfile = 'sample.split.{}.bam'.format(chromosome)
            print(chromosome, end = '', file = open(outfile, 'w'))

rule chromosome:
    input:
        'sample.split.{chromosome}.bam'
    output:
        touch('sample.REF_{chromosome}.bam')