Snakemake checkpoint: create an expand list/wildcards based on created output files

I hope someone can guide me in the right direction. See below for a small working example:
import os
from glob import glob
from pathlib import Path
rule all:
    input:
        expand("output/step4/{number}_step4.txt", number=["1", "2", "3", "4"])
checkpoint split_to_fasta:
    input:
        seqfile = "files/seqs.csv"  # just an empty file in this example
    output:
        fastas = directory("output/fasta")
    shell:
        # A Python script will create the files below; I don't know their names beforehand.
        "mkdir -p {output.fastas} ; "
        "echo test > {output.fastas}/1_LEFT_sample_name_foo.fa ;"
        "echo test > {output.fastas}/1_RIGHT_sample_name_foo.fa ;"
        "echo test > {output.fastas}/2_LEFT_sample_name_spam.fa ;"
        "echo test > {output.fastas}/2_RIGHT_sample_name_bla.fa ;"
        "echo test > {output.fastas}/3_LEFT_sample_name_egg.fa ;"
        "echo test > {output.fastas}/4_RIGHT_sample_name_ham.fa ;"
rule step2:
    input:
        fasta = "output/fasta/{fasta}.fa"
    output:
        step2 = "output/step2/{fasta}_step2.txt",
    shell:
        "cp {input.fasta} {output.step2}"
rule step3:
    input:
        file = rules.step2.output.step2
    output:
        step3 = "output/step3/{fasta}_step3.txt",
    shell:
        "cp {input.file} {output.step3}"
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
    ### don't know where to use this line correctly
    ### files = [Path(x).stem.split("_")[0] for x in glob("output/step3/" + "*_step3.txt")]
    return expand("output/step3/{fasta}_step3.txt", fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta)
def get_id_files(wildcards):
    blast = glob("output/step3/" + f"{wildcards.number}*_step3.txt")
    return sorted(blast)
rule step4:
    input:
        step3files = aggregate_input,
        idfiles = get_id_files
    output:
        step4 = "output/step4/{number}_step4.txt",
    run:
        shell("cat {input.idfiles} > {output.step4}")
Because of rule all, Snakemake knows how to "start" the pipeline. I hard-coded the numbers 1, 2, 3 and 4, but in a real situation I don't know these numbers beforehand.
expand("output/step4/{number}_step4.txt", number=["1","2","3","4"])
What I want is to get those numbers based on the output filenames of split_to_fasta, step2 or step3. And then use it as a target for wildcards. (I can easily get the numbers with glob and split)
I want to do it with wildcards like in def get_id_files because I want to execute the next step in parallel. In other words, the following sets of files need to go in the next step:
[1_LEFT_sample_name_foo.fa, 1_RIGHT_sample_name_foo.fa]
[2_LEFT_sample_name_spam.fa, 2_RIGHT_sample_name_bla.fa]
[3_LEFT_sample_name_egg.fa]
[4_RIGHT_sample_name_ham.fa]
I may need a second checkpoint, but I am not sure how to implement that.
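The grouping described above can be sketched in plain Python, outside Snakemake. This is only an illustration, assuming the id is always the text before the first underscore of the file name; `group_by_id` is a hypothetical helper name, not something Snakemake provides.

```python
from collections import defaultdict
from pathlib import Path


def group_by_id(paths):
    """Group file paths by the id before the first underscore of the file name."""
    groups = defaultdict(list)
    for path in paths:
        file_id = Path(path).stem.split("_")[0]
        groups[file_id].append(path)
    # Sort ids and files so the result is deterministic.
    return {file_id: sorted(files) for file_id, files in sorted(groups.items())}


# File names taken from the example checkpoint.
fasta_files = [
    "output/fasta/1_LEFT_sample_name_foo.fa",
    "output/fasta/1_RIGHT_sample_name_foo.fa",
    "output/fasta/2_LEFT_sample_name_spam.fa",
    "output/fasta/2_RIGHT_sample_name_bla.fa",
    "output/fasta/3_LEFT_sample_name_egg.fa",
    "output/fasta/4_RIGHT_sample_name_ham.fa",
]

groups = group_by_id(fasta_files)
ids = list(groups)  # candidate values for the {number} wildcard
```

Each key of `groups` corresponds to one step4 job, and its value is the set of files that job would need.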
EDIT (solution):
I was already worried that my question was not clear, so I made another example, see below. This pipeline generates some fake files (end of step 3). From this point I want to continue and process all files with the same id in parallel. The id is the number at the beginning of the filename. I could make a second pipeline that "starts" with step 4 and run the two pipelines one after the other, but that sounds like bad practice. I think I need to define a target for the next rule (step 4), but I don't know how to do that in this situation. The code to define the target itself is something like:
files = [Path(x).stem.split("_")[0] for x in glob("output/step3/"+ f"*_step3.txt") ]
ids = list(set(files))
expand("output/step4/{number}_step4.txt", number=ids)
The second example (Edited to the solution):
import os
from glob import glob
from pathlib import Path
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.split_to_fasta.get(**wildcards).output[0]
    # The ids are inferred from the checkpoint output, not from step3 files.
    ids = sorted(set(Path(x).stem.split("_")[0] for x in glob(os.path.join(checkpoint_output, "*.fa"))))
    return expand("output/step3/{fasta}_step3.txt", fasta=glob_wildcards(os.path.join(checkpoint_output, "{fasta}.fa")).fasta) + expand("output/step4/{number}_step4.txt", number=ids)
rule all:
    input:
        aggregate_input,
checkpoint split_to_fasta:
    input:
        seqfile = "files/seqs.csv"
    output:
        fastas = directory("output/fasta")
    shell:
        # A Python script will create the files below; I don't know their names beforehand.
        # I could get the numbers if needed.
        "mkdir -p {output.fastas} ; "
        "echo test1 > {output.fastas}/1_LEFT_sample_name_foo.fa ;"
        "echo test2 > {output.fastas}/1_RIGHT_sample_name_foo.fa ;"
        "echo test3 > {output.fastas}/2_LEFT_sample_name_spam.fa ;"
        "echo test4 > {output.fastas}/2_RIGHT_sample_name_bla.fa ;"
        "echo test5 > {output.fastas}/3_LEFT_sample_name_egg.fa ;"
        "echo test6 > {output.fastas}/4_RIGHT_sample_name_ham.fa ;"
rule step2:
    input:
        fasta = "output/fasta/{fasta}.fa"
    output:
        step2 = "output/step2/{fasta}_step2.txt",
    shell:
        "cp {input.fasta} {output.step2}"
rule step3:
    input:
        file = rules.step2.output.step2
    output:
        step3 = "output/step3/{fasta}_step3.txt",
    shell:
        "cp {input.file} {output.step3}"
def get_id_files(wildcards):
    # blast = glob("output/step3/" + f"{wildcards.number}*_step3.txt")
    blast = expand(f"output/step3/{wildcards.number}_{{sample}}_step3.txt", sample=glob_wildcards(f"output/fasta/{wildcards.number}_{{sample}}.fa").sample)
    return blast
rule step4:
    input:
        idfiles = get_id_files
    output:
        step4 = "output/step4/{number}_step4.txt",
    run:
        shell("cat {input.idfiles} > {output.step4}")

Replace the blast line in get_id_files in your second example with:
blast = expand(f"output/step3/{wildcards.number}_{{sample}}_step3.txt", sample=glob_wildcards(f"output/fasta/{wildcards.number}_{{sample}}.fa").sample)
This is how I understand checkpoints: when the input of a rule (say rule a) depends on a checkpoint, everything rule a depends on stays hidden during the first evaluation of the DAG. Once the checkpoint has executed successfully, the DAG is evaluated a second time, now with the checkpoint's outputs known.
So in your case, making rule all depend on the checkpoint hides step2/3/4 during the 1st evaluation (since rule all depends on these steps). The checkpoint is then executed, followed by the 2nd evaluation. At that point you are evaluating a new workflow in which all outputs of the checkpoint are known, so you can 1. infer the ids and 2. infer the corresponding step3 outputs from the split_to_fasta outputs.
1st evaluation: rule all -> checkpoint split_to_fasta (TBD)
2nd evaluation (split_to_fasta executed): rule all -> checkpoint split_to_fasta -> rule step4 -> rule step3 -> rule step2
get_id_files is evaluated for step4 before step3 has been executed. This is why you need to infer its result from the outputs of split_to_fasta instead of looking for the outputs of step3 directly.
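The point about inferring step3 paths from the checkpoint output can be tried in plain Python. The sketch below is not Snakemake code: `expected_step3_files` is a hypothetical helper that imitates what the `expand(...)`/`glob_wildcards(...)` line does for one id, and a temporary directory stands in for output/fasta.

```python
import re
import tempfile
from pathlib import Path


def expected_step3_files(fasta_dir, number):
    """Derive the step3 paths for one id from the fasta files of the checkpoint.

    Only fasta_dir is inspected; the step3 files do not need to exist yet.
    """
    pattern = re.compile(rf"{re.escape(number)}_(?P<sample>.+)\.fa")
    samples = []
    for path in sorted(Path(fasta_dir).iterdir()):
        match = pattern.fullmatch(path.name)
        if match:
            samples.append(match.group("sample"))
    return [f"output/step3/{number}_{sample}_step3.txt" for sample in samples]


# Simulate the checkpoint output in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "1_LEFT_sample_name_foo.fa",
        "1_RIGHT_sample_name_foo.fa",
        "2_LEFT_sample_name_spam.fa",
    ):
        (Path(tmp) / name).write_text("test\n")
    step3_for_1 = expected_step3_files(tmp, "1")
    step3_for_2 = expected_step3_files(tmp, "2")
```

No step3 file exists when the function runs, yet the full list of step3 paths for each id is known, which is all Snakemake needs to schedule step3 before step4.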

If I understand the problem correctly, the following line should be changed:
ids = [Path(x).stem.split("_")[0] for x in glob("output/step3/"+ f"*_step3.txt") ]
Right now it is globbing for files in step3 (I presume these files do not exist yet). Instead, the right thing to glob is the output of the rule split_to_fasta, so something like this:
ids = [Path(x).stem.split("_")[0] for x in glob("output/fasta/*.fa") ]
These ids can then be used to extract the relevant wildcards and pass them to expand("output/step3/{fasta}_step3.txt", ...).
Sorry this is not a functional example, but the original code is a bit hard to read.

Related

Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline but when run I get all missing files for rule all:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']
def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)
def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)
rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files that are listed by the terminal seem correct, so it seems it is seeing what I want to produce but the shell is not making anything at all.
I want to produce a list of files that have grepped the list of barcodes I have in a file. But I get "Missing input files for rule all:"
There are two issues:
1. You have an impossible wildcard_constraints defined for {barcodes}.
2. Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now matches capital letters only. Before, it also included end-of-line matching (the trailing $), which could never match for this wildcard because more text follows it in the file path.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you would get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
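The effect of the constraints can be reproduced with Python's `re` module. This is a rough approximation only: it assumes each wildcard behaves like a named regex group inside the output pattern, and the target file name (barcode "ACGGT", sample "Well1") is a hypothetical example.

```python
import re

# Hypothetical target file: barcode "ACGGT" followed by sample "Well1".
target = "grepped/ACGGTWell1_R1_001.plate.fastq"


def match_wildcards(barcodes_regex, sample_regex):
    """Match the target against a pattern with two adjacent wildcards."""
    pattern = (
        rf"grepped/(?P<barcodes>{barcodes_regex})"
        rf"(?P<sample>{sample_regex})_R1_001\.plate\.fastq"
    )
    match = re.fullmatch(pattern, target)
    return match.groupdict() if match else None


# Original constraint: "$" demands end of string, but more text follows.
original = match_wildcards(r"[a-z-A-Z]+$", r".+")

# Constraining only {barcodes}: it greedily takes the "W" of "Well1".
barcodes_only = match_wildcards(r"[A-Z]+", r".+")

# Constraining both: the split lands where it was intended.
both = match_wildcards(r"[A-Z]+", r"Well.*")
```

Here `original` is `None` (no match at all), `barcodes_only` splits the name in the wrong place, and `both` recovers the intended barcode and sample.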
A note of advice:
I usually find it easier to separate wildcards into a directory structure rather than having multiple wildcards in the same filename. In your case that would mean having a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq.
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]
def getsamples():
    import glob
    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples
def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines
rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This effectively replaces open-ended regex matching with an explicit list of allowed values. In your case, I would do something like:
import re
wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values in getbarcodes(), whatever they are, and the same goes for {sample}. In my opinion this is better than anticipating which combination of regexes a wildcard can take.
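A small stand-alone sketch of this idea, using Python's `re` module directly. The barcode and sample lists are made up for illustration; in the Snakefile they would come from getbarcodes() and getsamples().

```python
import re

# Hypothetical values standing in for getbarcodes() and getsamples().
barcodes = ["ACGGT", "TTGCA"]
samples = ["Well1", "Well.2"]

# re.escape makes every value match literally, e.g. "." is not "any character".
barcodes_regex = "|".join(re.escape(x) for x in barcodes)
sample_regex = "|".join(re.escape(x) for x in samples)

pattern = re.compile(rf"(?P<barcodes>{barcodes_regex})(?P<sample>{sample_regex})")

known = pattern.fullmatch("ACGGTWell1")        # both values are in the lists
unknown = pattern.fullmatch("ACGGTWell3")      # "Well3" is not a known sample
not_literal = pattern.fullmatch("TTGCAWellX2") # "." in "Well.2" stays literal
```

Only combinations of listed values match, so there is no ambiguity about where the barcode ends and the sample begins.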

How do Snakemake checkpoints work when I do not want to create a folder?

I have a Snakemake file in which one rule produces a file from which I would like to extract the header and use it as wildcards in my rule all.
The Snakemake guide provides an example that creates new folders named after the wildcards. I would like to avoid that if I can, since in some cases it would need to create 100-200 folders. Any suggestions on how to make it work?
Link to the Snakemake guide:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
import pandas as pd
rule all:
    input:
        final_report = expand('report_{fruit}.txt', fruit= ???)
rule create_file:
    input:
    output:
        fruit = 'fruit_file.csv'
    run:
        ....
rule next:
    input:
        fruit = 'fruit_file.csv'
    output:
        report = 'report_{phenotype}.txt'
    run:
        # input.fruit is a plain Python attribute inside run; braces would create a set.
        fruit_file = pd.read_csv(input.fruit, header = 0, sep = '\t')
        fruits= fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = 'touch report_' + i + '.txt'
            shell(cmd)
This is a simplified workflow, since I am actually using a long script to produce both the input file (fruit_file.csv in this example) and the report files.
The file is tab-separated and could look like this:
FID    IID      Apple  Banana  Plum
Mouse  Mickey   0      0       1
Mouse  Minnie   1      0       1
Duck   Donnald  0      1       0
I think you are misreading the snakemake checkpoint example. You only need to create one folder in your case. They have a wildcard (sample) in the folder name, but that part of the output name is known ahead of time.
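checkpoint fruit_reports:
    input:
        fruit = 'fruit_file.csv'
    output:
        report_dir = directory('reports')
    run:
        # Inside run, input and output are Python objects, so no braces are needed here.
        fruit_file = pd.read_csv(input.fruit, header = 0, sep = '\t')
        fruits= fruit_file.columns.tolist()[2:]
        # Make sure the output directory exists before writing into it.
        shell('mkdir -p {output.report_dir}')
        for i in fruits:
            cmd = f'touch {output.report_dir}/report_{i}.txt'
            shell(cmd)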
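The column extraction used in the checkpoint can be tried outside Snakemake. The sketch below uses the standard `csv` module instead of pandas so that it is self-contained, and it reads the example table from the question from an in-memory string rather than from fruit_file.csv.

```python
import csv
import io

# The example table from the question, tab-separated.
fruit_file = io.StringIO(
    "FID\tIID\tApple\tBanana\tPlum\n"
    "Mouse\tMickey\t0\t0\t1\n"
    "Mouse\tMinnie\t1\t0\t1\n"
    "Duck\tDonnald\t0\t1\t0\n"
)

# Only the header row is needed to learn the fruit names.
header = next(csv.reader(fruit_file, delimiter="\t"))
fruits = header[2:]  # skip the FID and IID columns

# One report per fruit, all inside a single "reports" directory.
reports = [f"reports/report_{fruit}.txt" for fruit in fruits]
```

However many fruit columns the file has, all reports land in the one `reports` directory, so no per-fruit folders are created.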
Since you do not know all names (fruits) ahead of time, you cannot include them in the all rule. You need to reference an intermediate rule to bring everything together. Maybe use a final report file:
rule all:
input: 'report.txt'
Then after the checkpoint:
import os
def aggregate_fruit(wildcards):
    # Asking for the checkpoint output makes Snakemake re-evaluate the DAG after it has run.
    checkpoint_output = checkpoints.fruit_reports.get(**wildcards).output[0]
    return expand("reports/report_{i}.txt",
                  i=glob_wildcards(os.path.join(checkpoint_output, "report_{i}.txt")).i)
rule report:
    input:
        aggregate_fruit
    output:
        "report.txt"
    shell:
        "ls -1 {input} > {output}"

How to pass variable value as input in snakemake?

I want to download FASTQ files from the SRA database by SRR ID using Snakemake. I read the SRR IDs from a file with Python code.
I want to pass the IDs one by one as input. My code is below.
The command I want to run for each ID looks like this:
fastq-dump SRR390728
#SAMPLES = ['SRR390728','SRR400816']
SAMPLES = [line.strip() for line in open("./srrList", 'r')]
rule all:
    input:
        expand("fastq/{sample}.fastq.log",sample=SAMPLES)
rule download_fastq:
    input:
        "{sample}"
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {input} > {output}"
Skip input and just use the wildcard in the shell command. An input must be a file path that either already exists or is created by another rule in the pipeline; neither is true in your case.
rule download_fastq:
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {wildcards.sample} > {output}"

Parallel execution of a snakemake rule with same input and a range of values for a single parameter

I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all parameters and fails.
My original bash script runs a program multiple times for a range of values of the parameter K.
for num in {1..3}
do
    structure.py -K $num --input=fileprefix --output=fileprefix
done
There are multiple input files that start with fileprefix. And there are two main outputs per run, e.g. for K=1 they are fileprefix.1.meanP, fileprefix.1.meanQ. My config and snakemake files are as follows.
Config:
cat config.yaml
infile: fileprefix
K:
- 1
- 2
- 3
Snakemake:
configfile: 'config.yaml'
rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])
rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        """
        structure.py -K {params.K} \
        --input=output/{params.prefix} \
        --output=output/{params.prefix}
        """
This was executed with snakemake --cores 3. The problem persists when I only use one thread.
I expected the outputs described above for each value of K, but the run fails with this error:
RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail; structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
File "Snakefile", line 84, in __rule_Structure
File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
When I set K to a single value such as K = ['1'], everything works. So the problem seems to be that {params.K} is being expanded to all values of K when the shell command is executed. I started teaching myself snakemake today, and it works really well, but I'm hitting a brick wall with this.
You need to retrieve the argument for -K from the wildcards, not from the config file. config is a plain Python dictionary, so config['K'] simply returns the whole list of possible values.
configfile: 'config.yaml'
rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])
rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        # The key must match config.yaml, which defines 'infile'.
        prefix = config['infile']
    threads: 3
    shell:
        # {wildcards.K} is the single K value of the output file this job produces.
        "structure.py -K {wildcards.K} "
        "--input=output/{params.prefix} "
        "--output=output/{params.prefix}"
Note that there are more things to improve here. For example, the rule structure does not define any input file, although it uses one.
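The difference between the list in the config and the per-job wildcard can be illustrated in plain Python. This is a sketch rather than Snakemake code: `itertools.product` stands in for what `expand` does in rule all, and `command_for` is a hypothetical helper that builds the command for one job.

```python
from itertools import product

# Same content as config.yaml in the question.
config = {"infile": "fileprefix", "K": [1, 2, 3]}

# Using the whole list in the command reproduces the failing call from the error message.
all_k = " ".join(str(k) for k in config["K"])
wrong_command = f"structure.py -K {all_k}"

# One target file per (K, ext) combination, as requested in rule all.
targets = [
    f"output/{config['infile']}.{k}.{ext}"
    for k, ext in product(config["K"], ["meanQ", "meanP"])
]


def command_for(k):
    """Build the command for the job whose output files carry this single K."""
    prefix = config["infile"]
    return f"structure.py -K {k} --input=output/{prefix} --output=output/{prefix}"


commands = [command_for(k) for k in config["K"]]
```

Each entry in `commands` corresponds to one independent job, which is what lets Snakemake run the three values of K in parallel.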
There is now also an option for parameter space exploration:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration

Snakemake run rule with wildcarded output of previous rules multiple times

I have multiple studies, and I must make two files (a .notsad and a .txt file) for each of the n studies. After these are created, I must run a command once per chromosome, using the same two input files (.notsad, .txt) for every chromosome within a given study. So:
mycommand.py study1.notsad study1_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
mycommand.py study1.notsad study1_filter.txt chr2.bad.gz --out chr2_filter.bad.gz
...
mycommand.py study2.notsad study2_filter.txt chr1.bad.gz --out chr1_filter.bad.gz
...
However, I'm having trouble getting this to run. I'm getting an error:
WildcardError in line 33 of /scripts/Snakefile:
Wildcards in input files cannot be determined from output files:
'ds_lower'
My rules so far:
import os
import glob
ROOT = "/rootdir/"
ORIGINAL_DATA_FOLDER="original/"
PROCESS_DATA_FOLDER="process/"
ORIGINAL_DATA_SOURCE=ROOT+ORIGINAL_DATA_FOLDER
PROCESS_DATA_SOURCE=ROOT+PROCESS_DATA_FOLDER
DATASETS = [name for name in os.listdir(ORIGINAL_DATA_SOURCE) if os.path.isdir(os.path.join(ORIGINAL_DATA_SOURCE, name))]
LOWERCASE_DATASETS = [dataset.lower() for dataset in DATASETS]
CHROMOSOME = list(range(1,23))
rule all:
    input:
        expand(PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz", ds=DATASETS, chr=CHROMOSOME)
rule run_command:
    input:
        ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.bad.gz", # Matches 22 chroms
        PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt", # But this should be common to all chr runs for this study.
        PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad" # This one as well.
    output:
        PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
    shell:
        # Run command that uses each of the previous files and runs per chromosome
        "mycommand.py {input[2]} {input[1]} {input[0]} --out {output}"
rule write_txt_file:
    input:
        ORIGINAL_DATA_SOURCE+"{ds}/{ds_lower}_info.txt"
    output:
        PROCESS_DATA_SOURCE+"{ds}/{ds_lower}_filter.txt"
    shell:
        "touch {output}"
rule write_notsad_file:
    input:
        ORIGINAL_DATA_SOURCE+"{ds}/_{ds_lower}.sad"
    output:
        PROCESS_DATA_SOURCE+"{ds}/{ds_lower}.notsad"
    shell:
        "touch {output}"
UPDATE
Changing inputs for rule run_command to lambda functions did work.
rule run_command:
    input:
        ORIGINAL_DATA_SOURCE+"{ds}/chr{chr}.bad.gz",
        lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}_filter.txt",
        lambda wildcards: PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}.notsad"
    output:
        PROCESS_DATA_SOURCE+"{ds}/chr{chr}_filtered.gen.gz"
    shell:
        # Run command that uses each of the previous files and runs per chromosome
        "mycommand.py {input[2]} {input[1]} {input[0]} --out {output}"
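Snakemake determines wildcard values by matching the requested output file against the output pattern, so all the wildcards used in input need to be present in output. In rule run_command, the wildcard {ds_lower} is present only in input and not in output, so Snakemake has no way to determine its value.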
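The lambda-based fix from the update works because the lowercase name is computed from {ds}, which does appear in the output. A minimal sketch in plain Python, where `SimpleNamespace` is only a stand-in for Snakemake's wildcards object and the function names `filter_txt` and `notsad` are hypothetical:

```python
from types import SimpleNamespace

PROCESS_DATA_SOURCE = "/rootdir/process/"


def filter_txt(wildcards):
    """Input function: derive the lowercase file name from the {ds} wildcard."""
    return PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}_filter.txt"


def notsad(wildcards):
    """Input function for the .notsad file, derived the same way."""
    return PROCESS_DATA_SOURCE + f"{wildcards.ds}/{wildcards.ds.lower()}.notsad"


# Stand-in for the wildcards matched from an output such as
# /rootdir/process/Study1/chr1_filtered.gen.gz (hypothetical study name).
wildcards = SimpleNamespace(ds="Study1", chr="1")

txt_path = filter_txt(wildcards)
notsad_path = notsad(wildcards)
```

Both paths depend only on `ds`, so they are identical for every chromosome of a study, which is the behaviour the question asked for.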