I have a goal similar to the one in "Snakemake: unknown output/input files after splitting by chromosome"; however, as pointed out there, I do know in advance that I have, e.g., 5 chromosomes in my sample.bam file. Using this as a toy example:
$ cat sample.bam
chromosome 1
chromosome 2
chromosome 3
chromosome 4
chromosome 5
I wish to "split" this bam file, and then do a bunch of per chromosome downstream jobs on the resulting chromosomes. The simplest solution I could conjure up was:
chromosomes = '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam',
               chromosome = chromosomes)

rule chromosome:
    output:
        touch('sample.REF_{chromosome}.bam')
    input: 'split.done'

rule split_bam:
    output:
        touch('split.done')
    input: 'sample.bam'
    run:
        print('splitting bam..')
        chromosome = 1
        for line in open(input[0]):
            outfile = 'sample.REF_{}.bam'.format(chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1
This results in empty sample.REF_{chromosome}.bam files. I understand why this happens, and indeed Snakemake even warns, e.g.,
Warning: the following output files of rule chromosome were not present when the DAG was created:
{'sample.REF_3.bam'}
Touching output file sample.REF_3.bam.
that is, these files were not in the DAG to begin with, and Snakemake touches them with empty versions, erasing what was put there. I guess I am surprised by this behavior, and wonder if there is a good reason for it. Note that this behavior is not limited to Snakemake's touch(): should I replace touch('sample.REF_{chromosome}.bam') with simply 'sample.REF_{chromosome}.bam' and then add a shell: 'touch {output}', I get the same result. Now, of course, I have found a perfectly acceptable workaround:
chromosomes = '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam',
               chromosome = chromosomes)

rule chromosome:
    output: 'sample.REF_{chromosome}.bam'
    input: 'split_dir'
    shell: 'mv {input}/{output} {output}'

rule split_bam:
    output:
        temp(directory('split_dir'))
    input: 'sample.bam'
    run:
        print('splitting bam..')
        shell('mkdir {output}')
        chromosome = 1
        for line in open(input[0]):
            outfile = '{}/sample.REF_{}.bam'.format(output[0], chromosome)
            print(line, end = '', file = open(outfile, 'w'))
            chromosome += 1
but I am surprised I have to go through these gymnastics for such a seemingly simple task. For this reason, I wonder if there is a better design, or if I am not asking the right question. Any advice/ideas are most welcome.
I think your example is a bit contrived for a couple of reasons. The rule split_bam already produces the final output sample.REF_{chromosome}.bam. Also, the rule master uses the chromosomes taken from the variable chromosomes whereas the rule split_bam iterates through the bam file to get the chromosomes.
My impression is that what you want could be something like:
chromosomes = '1 2 3 4 5'.split()

rule master:
    input:
        expand('sample.REF_{chromosome}.bam',
               chromosome = chromosomes)

rule split_bam:
    input:
        'sample.bam'
    output:
        expand('sample.split.{chromosome}.bam', chromosome = chromosomes)
    run:
        print('splitting bam..')
        for chromosome in chromosomes:
            outfile = 'sample.split.{}.bam'.format(chromosome)
            print(chromosome, end = '', file = open(outfile, 'w'))

rule chromosome:
    input:
        'sample.split.{chromosome}.bam'
    output:
        touch('sample.REF_{chromosome}.bam')
I have a relatively simple Snakemake pipeline, but when I run it I get missing-file errors for rule all:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']

def getsamples():
    import glob
    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)

def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)

rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files listed in the terminal look correct, so Snakemake appears to see what I want to produce, but the shell command never makes anything at all.
I want to produce a set of files grepped for the list of barcodes I have in a file, but I get "Missing input files for rule all:".
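(For reference, the sample extraction in getsamples() behaves like this; the filenames here are made up to match the {sample}_R1_001.fastq pattern:)

```python
# Hypothetical filenames following the {sample}_R1_001.fastq pattern
names = ["WellA1_R1_001.fastq", "WellB2_R2_001.fastq", "WellA1_R2_001.fastq"]

# rsplit('_', 2) splits off the last two underscore-separated fields
# (read direction and the trailing "001"), leaving the sample prefix.
samples = sorted({n.rsplit("_", 2)[0] for n in names})
print(samples)  # ['WellA1', 'WellB2']
```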
There are two issues:
1. You have an impossible wildcard_constraints defined for {barcodes}.
2. Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now matches only capital letters. Before, it also included an end-of-line anchor (the trailing $), which was impossible to match for this wildcard because additional text follows it in the filepath.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you'd get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
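You can see both effects with plain re outside of Snakemake (the barcode ACGGT and sample WellA1 are made-up values; Snakemake internally builds a regex much like this from the output pattern):

```python
import re

# A constraint with a trailing '$' can never match when more text follows
# the wildcard in the filename:
path = "grepped/ACGGTWellA1_R1_001.plate.fastq"

broken = re.compile(r"grepped/(?P<barcodes>[a-zA-Z]+$)(?P<sample>.*)_R1_001\.plate\.fastq")
fixed = re.compile(r"grepped/(?P<barcodes>[A-Z]+)(?P<sample>Well.*)_R1_001\.plate\.fastq")

print(broken.fullmatch(path))                   # None: '$' cannot match mid-string
print(fixed.fullmatch(path).group("barcodes"))  # ACGGT (not ACGGTW: "Well.*" claims the W)
print(fixed.fullmatch(path).group("sample"))    # WellA1
```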
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean having a structure like
grepped/{barcode}/{sample}_R1_001.plate.fastq.
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",


refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]


def getsamples():
    import glob

    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples


def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines


rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),


rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling the regex matching altogether. In your case, I would do something like:
import re

wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values in getbarcodes(), whatever they are, and the same goes for {sample}. In my opinion, this is better than anticipating what combinations the regex for a wildcard may need to cover.
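As a quick check of what that expression builds (with made-up barcode values; note the '.' in the second one, which re.escape neutralizes so it cannot act as a regex wildcard):

```python
import re

# Hypothetical wildcard values, as might come from getbarcodes()
barcodes = ["ACGT", "TTAG.1"]

# The same expression used in the wildcard_constraints above
pattern = "|".join(re.escape(x) for x in barcodes)

print(re.fullmatch(pattern, "ACGT") is not None)    # True: listed value matches
print(re.fullmatch(pattern, "TTAG.1") is not None)  # True: '.' matched literally
print(re.fullmatch(pattern, "TTAGx1"))              # None: '.' was escaped, no regex magic
```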
I hope someone can guide me in the right direction. See below for a small working example:
from glob import glob

rule all:
    input:
        expand("output/step4/{number}_step4.txt", number=["1", "2", "3", "4"])

checkpoint split_to_fasta:
    input:
        seqfile = "files/seqs.csv"  # just an empty file in this example
    output:
        fastas = directory("output/fasta")
    shell:
        # A python script will create the files below and I don't know them beforehand.
        "mkdir -p {output.fastas} ; "
        "echo test > {output.fastas}/1_LEFT_sample_name_foo.fa ;"
        "echo test > {output.fastas}/1_RIGHT_sample_name_foo.fa ;"
        "echo test > {output.fastas}/2_LEFT_sample_name_spam.fa ;"
        "echo test > {output.fastas}/2_RIGHT_sample_name_bla.fa ;"
        "echo test > {output.fastas}/3_LEFT_sample_name_egg.fa ;"
        "echo test > {output.fastas}/4_RIGHT_sample_name_ham.fa ;"

rule step2:
    input:
        fasta = "output/fasta/{fasta}.fa"
    output:
        step2 = "output/step2/{fasta}_step2.txt",
    shell:
        "cp {input.fasta} {output.step2}"

rule step3:
    input:
        file = rules.step2.output.step2
    output:
        step3 = "output/step3/{fasta}_step3.txt",
    shell:
        "cp {input.file} {output.step3}"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
    # don't know where to use this line correctly:
    # files = [Path(x).stem.split("_")[0] for x in glob("output/step3/*_step3.txt")]
    return expand("output/step3/{fasta}_step3.txt",
                  fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta)

def get_id_files(wildcards):
    blast = glob("output/step3/" + f"{wildcards.number}*_step3.txt")
    return sorted(blast)

rule step4:
    input:
        step3files = aggregate_input,
        idfiles = get_id_files
    output:
        step4 = "output/step4/{number}_step4.txt",
    run:
        shell("cat {input.idfiles} > {output.step4}")
Because of rule all, Snakemake knows how to "start" the pipeline. I hard-coded the numbers 1, 2, 3 and 4, but in a real situation I don't know these numbers beforehand.
expand("output/step4/{number}_step4.txt", number=["1","2","3","4"])
What I want is to get those numbers from the output filenames of split_to_fasta, step2 or step3, and then use them as the target for the wildcards. (I can easily get the numbers with glob and split.)
I want to do it with wildcards, as in def get_id_files, because I want to execute the next step in parallel. In other words, the following sets of files need to go into the next step:
[1_LEFT_sample_name_foo.fa, 1_RIGHT_sample_name_foo.fa]
[2_LEFT_sample_name_spam.fa, 2_RIGHT_sample_name_bla.fa]
[3_LEFT_sample_name_egg.fa]
[4_RIGHT_sample_name_ham.fa]
I may need a second checkpoint, but I am not sure how to implement that.
EDIT (solution):
I was already worried my question was not so clear, so I made another example; see below. This pipeline generates some fake files (end of step 3). From this point I want to continue and process all files with the same id in parallel. The id is the number at the beginning of the filename. I could make a second pipeline that "starts" with step 4 and execute the two after each other, but that sounds like bad practice. I think I need to define a target for the next rule (step 4), but I don't know how to do that in this situation. The code to define the target itself is something like:
files = [Path(x).stem.split("_")[0] for x in glob("output/step3/"+ f"*_step3.txt") ]
ids = list(set(files))
expand("output/step4/{number}_step4.txt", number=ids)
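In isolation, that id-extraction logic does what I want (shown here with the filenames from the toy example):

```python
from pathlib import Path

# Filenames as produced by step3 in the example
files = [
    "output/step3/1_LEFT_sample_name_foo_step3.txt",
    "output/step3/1_RIGHT_sample_name_foo_step3.txt",
    "output/step3/2_LEFT_sample_name_spam_step3.txt",
    "output/step3/4_RIGHT_sample_name_ham_step3.txt",
]

# Path.stem drops the '.txt'; the id is the field before the first '_'
ids = sorted(set(Path(x).stem.split("_")[0] for x in files))
print(ids)  # ['1', '2', '4']
```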
The second example (Edited to the solution):
from glob import glob

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
    ids = [Path(x).stem.split("_")[0] for x in glob("output/fasta/*.fa")]
    return (expand("output/step3/{fasta}_step3.txt",
                   fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta)
            + expand("output/step4/{number}_step4.txt", number=ids))

rule all:
    input:
        aggregate_input,

checkpoint split_to_fasta:
    input:
        seqfile = "files/seqs.csv"
    output:
        fastas = directory("output/fasta")
    shell:
        # A python script will create the files below and I don't know them beforehand.
        # I could get the numbers if needed.
        "mkdir -p {output.fastas} ; "
        "echo test1 > {output.fastas}/1_LEFT_sample_name_foo.fa ;"
        "echo test2 > {output.fastas}/1_RIGHT_sample_name_foo.fa ;"
        "echo test3 > {output.fastas}/2_LEFT_sample_name_spam.fa ;"
        "echo test4 > {output.fastas}/2_RIGHT_sample_name_bla.fa ;"
        "echo test5 > {output.fastas}/3_LEFT_sample_name_egg.fa ;"
        "echo test6 > {output.fastas}/4_RIGHT_sample_name_ham.fa ;"

rule step2:
    input:
        fasta = "output/fasta/{fasta}.fa"
    output:
        step2 = "output/step2/{fasta}_step2.txt",
    shell:
        "cp {input.fasta} {output.step2}"

rule step3:
    input:
        file = rules.step2.output.step2
    output:
        step3 = "output/step3/{fasta}_step3.txt",
    shell:
        "cp {input.file} {output.step3}"

def get_id_files(wildcards):
    # blast = glob("output/step3/" + f"{wildcards.number}*_step3.txt")
    blast = expand(f"output/step3/{wildcards.number}_{{sample}}_step3.txt",
                   sample=glob_wildcards(f"output/fasta/{wildcards.number}_{{sample}}.fa").sample)
    return blast

rule step4:
    input:
        idfiles = get_id_files
    output:
        step4 = "output/step4/{number}_step4.txt",
    run:
        shell("cat {input.idfiles} > {output.step4}")
Replace the blast line in get_id_files in your second example with:
blast = expand(f"output/step3/{wildcards.number}_{{sample}}_step3.txt", sample=glob_wildcards(f"output/fasta/{wildcards.number}_{{sample}}.fa").sample)
This is my way of understanding checkpoints: when the input of a rule (say rule a) comes from a checkpoint, everything upstream of a is hidden during the first evaluation of the DAG and only appears once the checkpoint has been successfully executed. The second round of evaluation then starts, knowing the outputs of the checkpoint.
So in your case, feeding the checkpoint into rule all hides step2/3/4 during the 1st evaluation (since these steps are upstream of all). Then the checkpoint gets executed, followed by the 2nd evaluation. At this point you are evaluating a new workflow that knows all outputs of the checkpoint, so you can 1. infer the ids and 2. infer the corresponding step3 outputs from the split_to_fasta outputs.
1st evaluation: rule all -> checkpoint split_to_fasta (TBD)
2nd evaluation (split_to_fasta executed): rule all -> checkpoint split_to_fasta -> rule step4 -> rule step3 -> rule step2
get_id_files runs when the inputs of step4 are determined, at which point step3 has not been executed yet; this is why you need to infer the files from the outputs of split_to_fasta instead of directly globbing the outputs of step3.
If I understand the problem correctly, the following line should be changed:
ids = [Path(x).stem.split("_")[0] for x in glob("output/step3/"+ f"*_step3.txt") ]
Right now it's globbing for files in step3 (I presume these files do not yet exist). Instead, the right thing to glob is the output of the rule split_to_fasta, so something like this:
ids = [Path(x).stem.split("_")[0] for x in glob("output/fasta/*.fa")]
And later use these ids to extract the relevant wildcards and plug them into the expand("output/step3/{fasta}_step3.txt", ...).
Sorry this is not a functional example, but the original code is a bit hard to read.
I have a Snakemake file where one rule produces a file from which I would like to extract the header and use the values as wildcards in my rule all.
The Snakemake guide provides an example that creates new folders named after the wildcards. It would be nice if I could avoid that, since in some cases it would mean creating 100-200 folders. Any suggestions on how to make this work?
link to snakemake guide:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
import pandas as pd

rule all:
    input:
        final_report = expand('report_{fruit}.txt', fruit= ???)

rule create_file:
    input:
    output:
        fruit = 'fruit_file.csv'
    run:
        ....

rule next:
    input:
        fruit = 'fruit_file.csv'
    output:
        report = 'report_{phenotype}.txt'
    run:
        fruit_file = pd.read_csv(input.fruit, header = 0, sep = '\t')
        fruits = fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = 'touch report_' + i + '.txt'
            shell(cmd)
This is a simplified workflow, since I am actually using a long script to produce both the pheno_file.csv and the report files.
The pheno_file.csv is tab-separated and could look like this:
FID IID Apple Banana Plum
Mouse Mickey 0 0 1
Mouse Minnie 1 0 1
Duck Donnald 0 1 0
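(For what it's worth, the header extraction that rule next relies on, fruit_file.columns.tolist()[2:], only needs the first line of the file. Here is the same idea with just the standard library, csv standing in for pd.read_csv:)

```python
import csv
import io

# Inline stand-in for the tab-separated file shown above
data = "FID\tIID\tApple\tBanana\tPlum\nMouse\tMickey\t0\t0\t1\n"

# The first row is the header; skip the FID and IID columns
header = next(csv.reader(io.StringIO(data), delimiter="\t"))
fruits = header[2:]
print(fruits)  # ['Apple', 'Banana', 'Plum']
```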
I think you are misreading the snakemake checkpoint example. You only need to create one folder in your case. They have a wildcard (sample) in the folder name, but that part of the output name is known ahead of time.
checkpoint fruit_reports:
    input:
        fruit = 'fruit_file.csv'
    output:
        report_dir = directory('reports')
    run:
        fruit_file = pd.read_csv(input.fruit, header = 0, sep = '\t')
        fruits = fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = f'touch {output}/report_{i}.txt'
            shell(cmd)
Since you do not know all names (fruits) ahead of time, you cannot include them in the all rule. You need to reference an intermediate rule to bring everything together. Maybe use a final report file:
rule all:
    input: 'report.txt'
Then after the checkpoint:
def aggregate_fruit(wildcards):
    checkpoint_output = checkpoints.fruit_reports.get(**wildcards).output[0]
    return expand("reports/report_{i}.txt",
                  i=glob_wildcards(os.path.join(checkpoint_output, "report_{i}.txt")).i)

rule report:
    input:
        aggregate_fruit
    output:
        "report.txt"
    shell:
        "ls -1 {input} > {output}"
Imagine you have a workflow with a wildcard (wc in the example below) and you want to run it for a large number of different values of that wildcard (e.g. 1000 samples). Typically I would create a rule all that takes as input a function generating the 1000 filenames. But what I've found is that Snakemake will execute rule one 1000 times, then rule two 1000 times. This is problematic if the intermediate file produced by two is very large, because you end up with 1000 huge files.
Instead, what I would like is for Snakemake to produce five_1.txt ... five_1000.txt, ensuring that it actually produces one output of rule all before moving on to the next. This way, temp() deletes one three_{wc}.txt before the next is produced, and you don't end up with a large number of large files.
In a linear workflow, you can use priorities as suggested by @Maarten-vd-Sande. This works because Snakemake looks at the jobs it can perform and picks the highest priority, which in a linear workflow is always the one further down the chain. In a fork, however, this doesn't work: both sides of the fork need the same priority, but then Snakemake just does all of one rule first.
rule one:
    input: "input_{wc}.txt"
    output:
        touch("one_{wc}.txt"),
        touch("two_{wc}.txt")

rule two:
    input: "one_{wc}.txt"
    output: temp(touch("three_{wc}.txt"))

rule three:
    input: "two_{wc}.txt"
    output: touch("four_{wc}.txt")

rule four:
    input:
        "three_{wc}.txt",
        "four_{wc}.txt"
    output: "five_{wc}.txt"
    shell:
        """
        touch {output}
        """
If it's a fork, just make it not a fork: two has to execute before three. Then use @Maarten-vd-Sande's priority solution.
rule one:
    input: "input_{wc}.txt"
    output:
        touch("one_{wc}.txt"),
        touch("two_{wc}.txt")

rule two:
    input: "one_{wc}.txt"
    output: temp(touch("three_{wc}.txt"))
    priority: 1

rule three:
    input:
        "two_{wc}.txt",
        "one_{wc}.txt"
    output: touch("four_{wc}.txt")
    priority: 2

rule four:
    input:
        "three_{wc}.txt",
        "four_{wc}.txt"
    output: touch("five_{wc}.txt")
    priority: 3
I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all parameters and fails.
My original bash script runs a program multiple times for a range of values of the parameter K.
for num in {1..3}
do
    structure.py -K $num --input=fileprefix --output=fileprefix
done
There are multiple input files that start with fileprefix. And there are two main outputs per run, e.g. for K=1 they are fileprefix.1.meanP, fileprefix.1.meanQ. My config and snakemake files are as follows.
Config:
$ cat config.yaml
infile: fileprefix
K:
  - 1
  - 2
  - 3
Snakemake:
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        """
        structure.py -K {params.K} \
        --input=output/{params.prefix} \
        --output=output/{params.prefix}
        """
This was executed with snakemake --cores 3. The problem persists when I only use one thread.
I expected the outputs described above for each value of K, but the run fails with this error:
RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail; structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
File "Snakefile", line 84, in __rule_Structure
File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
When I set K to a single value such as K = ['1'], everything works. So the problem seems to be that {params.K} is being expanded to all values of K when the shell command is executed. I started teaching myself snakemake today, and it works really well, but I'm hitting a brick wall with this.
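As far as I can tell, this can be reproduced in plain Python: when Snakemake substitutes a list-valued params entry into the shell string, the list is rendered space-separated, which is exactly the "-K 1 2 3" in the error. A minimal sketch of that formatting (the command line is illustrative only):

```python
# params.K was set to config['K'], i.e. the whole list [1, 2, 3].
# When substituted into the shell command, a list is joined with spaces,
# producing one command with all parameters instead of three commands.
K = [1, 2, 3]
cmd = "structure.py -K {} --input=output/fileprefix".format(" ".join(map(str, K)))
print(cmd)  # structure.py -K 1 2 3 --input=output/fileprefix
```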
You need to retrieve the argument for -K from the wildcards, not from the config file. The config file will simply return your list of possible values, it is a plain python dictionary.
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile']
    threads: 3
    shell:
        "structure.py -K {wildcards.K} "
        "--input=output/{params.prefix} "
        "--output=output/{params.prefix}"
Note that there are more things to improve here. For example, the rule structure does not define any input file, although it uses one.
There is now an option for parameter space exploration:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration