Snakemake: producing a list of files that are created within the pipeline

This is my first snakemake workflow, so it might be that I'm overcomplicating things.
My workflow takes as input the 'database query' for downloading some files, which is specified in my 'config.yaml'. It means that I do not know the names of the files that will be downloaded before running the pipeline.
# configfile: "config.yaml"
# DATABASE = config["database"]
# database: '("Apis"[Organism] OR Apis[All Fields]) AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])'
DATABASE = '("Apis"[Organism] OR Apis[All Fields]) AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])'
What I want to do is to:
1. Create a genome list: call a database with my query and extract the links to the files (create_genome_list). Here, I use Entrez.
2. Next, download the files using the collected links (download_genome).
3. The files are zipped, so unzip them (unzip_genome).
4. Finally, create a list of all downloaded and unzipped files (make_summary_table). This is where I struggle.
I can run my snakemake on steps 1-3 when I call one of the expected output files with the following:
snakemake -p database/GCA_000184785.2_Aflo_1.1_genomic/GCA_000184785.2_Aflo_1.1_genomic.fna --use-conda
It gives me links to all expected files (5) in folder /temp,
and 1 downloaded and unzipped file: /database/GCA_000184785.2_Aflo_1.1_genomic/GCA_000184785.2_Aflo_1.1_genomic.fna
My snakemake for steps 1-3 looks like this:
rule create_genome_list:
    output: touch("temp/{genome}")
    conda: "entrez_env.yaml"
    message: "Creating the genomes list..."
    shell:
        r"""
        esearch -db assembly -query '{DATABASE}' \
        | esummary \
        | xtract -pattern DocumentSummary -element FtpPath_GenBank \
        | while read -r line ;
        do
            fname=$(echo $line | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/');
            wildcard=$(echo $fname | sed -e 's!.fna.gz!!');
            echo "$line/$fname" > temp/$wildcard;
            #echo $wildcard >> list_of_genomes.txt
        done
        """

rule download_genome:
    output: touch("database/{genome}/{genome}.fna.gz")
    input: "temp/{genome}"
    shell:
        r"""
        GENOME_LINK=$(cat {input})
        GENOME="${{GENOME_LINK##*/}}"
        wget -P ./database/{wildcards.genome}/ $GENOME_LINK
        """

rule unzip_genome:
    output: touch("database/{genome}/{genome}.fna")
    input: "database/{genome}/{genome}.fna.gz"
    shell: "gunzip {input}"
My problem starts when I want to create the final rule, which will wrap up the results of my pipeline. In my real pipeline, I do some additional analyses with downloaded genomes, and at the end, I want to join all partial results obtained per single genome into one table. Here I post a toy example, which I believe reflects my problem the best.
I guess there is some way to extract the genomes' names so I could call them in the final summarising rule's input.
I approached it in an ugly way by listing the files in temp/ and using them in expand() as follows:
GENOMES = os.listdir("temp/")

rule make_summary_table:
    output: "summary_table.txt"
    input: expand("database/{genome}/{genome}.fna", genome = GENOMES)
    shell:
        """
        echo {input} >> {output}
        echo " " >> {output}
        """
But it works only when temp/ exists before running the pipeline, and it produces a summary_table.txt with 5 entries only when I have run steps 1-3 beforehand (otherwise, it produces an empty file).
I am also afraid that in my real pipeline not all genomes will have produced their partial results by the time the last summarising rule is called. But maybe Snakemake handles this somehow (by waiting?) once all the inputs are specified.
-----------------------------EDIT-----------------------------------------
I have tried to implement a checkpoint as a possible solution, as follows:
DATABASE = '("Apis"[Organism] OR Apis[All Fields]) AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])'
rule all:
    input: "summary_table.txt"

checkpoint create_genome_list:
    output: directory("temp/")
    conda: "entrez_env.yaml"
    shell:
        r"""
        esearch -db assembly -query '{DATABASE}' \
        | esummary \
        | xtract -pattern DocumentSummary -element FtpPath_GenBank \
        | while read -r line ;
        do
            fname=$(echo $line | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/');
            wildcard=$(echo $fname | sed -e 's!.fna.gz!!');
            echo "$line/$fname" > temp/$wildcard;
            #echo $wildcard >> list_of_genomes.txt
        done
        """

rule download_genome:
    output: touch("database/{genome}/{genome}.fna.gz")
    input: "temp/{genome}"
    shell:
        r"""
        GENOME_LINK=$(cat {input})
        GENOME="${{GENOME_LINK##*/}}"
        wget -P ./database/{wildcards.genome}/ $GENOME_LINK
        """

rule unzip_genome:
    output: "database/{genome}/{genome}.fna"
    input: "database/{genome}/{genome}.fna.gz"
    shell:
        r"""
        gunzip {input}
        """

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.create_genome_list.get(**wildcards).output[0]
    return expand("database/{genome}/{genome}.fna",
                  i=glob_wildcards(os.path.join(checkpoint_output, "{genome}.fna")).genome)

rule make_summary_table:
    output: "summary_table.txt"
    input: aggregate_input
    shell:
        """
        echo {input} >> {output}
        echo " " >> {output}
        """
But I cannot get past this error: InputFunctionException in line 73 (rule make_summary_table) of ~/snakemake_test/Snakefile: WildcardError: No values given for wildcard 'genome'. Wildcards:

For your updated code to work you need to apply at least the following two fixes:
1. Redefine the rule as a checkpoint (not necessary, see edit note below).
2. Expand all wildcards in the checkpoint-related function. Your expand leaves {genome} unexpanded because it expands i, which is not defined as a wildcard in the pattern, and thus does nothing.
The relevant code lines:
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.create_genome_list.get(**wildcards).output[0]
    return expand(
        "database/{genome}/{genome}.fna",
        genome=glob_wildcards(os.path.join(checkpoint_output, "{genome}.fna")).genome,
    )

rule make_summary_table:
    output:
        "summary_table.txt",
    input:
        aggregate_input,
    shell:
        """
        echo {input} >> {output}
        echo " " >> {output}
        """
Give it a try and let us know if it works!
edit: Sorry, I realised that the correct rule was already converted to a checkpoint and my 1. point is invalid. I've updated the answer above.
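To make the aggregation step easier to reason about, here is a rough sketch in plain Python (not Snakemake) of what an input function like aggregate_input computes: list the marker files that the checkpoint left in its output directory, then substitute each name into the target pattern. The helper name aggregate_paths and the demo file names are hypothetical. The sketch assumes the marker files in temp/ carry the bare genome identifier with no extension, which is how the shell loop in create_genome_list appears to name them; if that holds, a glob pattern ending in .fna would match nothing in temp/, and the pattern passed to glob_wildcards may need to be just "{genome}".

```python
import os
import tempfile


def aggregate_paths(checkpoint_dir, pattern="database/{genome}/{genome}.fna"):
    """Build one target path per marker file found in checkpoint_dir."""
    genomes = sorted(
        name
        for name in os.listdir(checkpoint_dir)
        if os.path.isfile(os.path.join(checkpoint_dir, name))
    )
    # Every {genome} placeholder in the pattern receives the same value.
    return [pattern.format(genome=genome) for genome in genomes]


# Hypothetical demo: a throwaway directory stands in for temp/.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("GCA_000001.1_demo_genomic", "GCA_000002.1_demo_genomic"):
        open(os.path.join(tmp, name), "w").close()
    targets = aggregate_paths(tmp)

print(targets)
```

Because the list is built only after the directory has been populated, this mirrors why the lookup has to happen inside a function that runs after the checkpoint, rather than at the top of the Snakefile.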

Related

How to avoid Snakemake rule from using incomplete output file from other rules

rule rule1:
    output: tsv = "..."
    input: faa = "..."
    shell:
        """
        awk ... > {output.tsv}
        some commands {input.faa} | awk ... >> {output.tsv}
        """

rule rule2:
    output:
        tsv = "..."
    input:
        tsv = rules.rule1.output.tsv,
    shell:
        """
        awk ... {input.tsv} > {output.tsv}
        """
As illustrated above, rule2 takes its input file from rule1.
According to the official docs, since the output file of rule1 is created successfully by the first awk, Snakemake assumes everything worked fine even if the output file is incomplete, because the second awk only appends to that file. Snakemake just ran rule2, which took the incomplete file from rule1. In fact, the second awk command in rule1 was never executed, leaving the output file incomplete.
As far as I can tell, Snakemake detecting the presence of the output file and assuming the rule completed successfully (since your awk commands didn't error) is working as intended.
It's not very easy for me to suggest specific edits since the commands are not complete, but how about creating intermediate files in your rule for the two awk commands and then combining them, so that if one or the other doesn't run, the rule fails? Something like:
rule rule1:
    output: tsv = "...", int1 = temp(".../{sample}_i1.tsv"), int2 = temp(".../{sample}_i2.tsv")
    input: faa = "..."
    shell:
        """
        awk ... > {output.int1}
        awk ... {input.faa} >> {output.int2}
        [some logic to make sure the processing is complete]
        cp {output.int2} {output.tsv}
        """
I wrapped both intermediates in temp() so that snakemake cleans them up after the rule ends.
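A related pattern, sketched here in plain Python rather than as a Snakemake rule, is to write everything to a temporary file and rename it to the final name only once all steps have succeeded. This is a different technique from the intermediate outputs above, and the names write_atomically and failing_chunks are hypothetical. The point it illustrates is the same: the final file should not exist until the work is complete, so a downstream step cannot pick up a partial result.

```python
import os
import tempfile


def write_atomically(final_path, chunks):
    """Write all chunks to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".partial")
    try:
        with os.fdopen(fd, "w") as handle:
            for chunk in chunks:
                handle.write(chunk)
        # Only reached when every chunk was written without error.
        os.replace(tmp_path, final_path)
    except BaseException:
        # Remove the partial file so nothing incomplete is left behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def failing_chunks():
    """Simulate a second processing step that fails part-way through."""
    yield "header\n"
    raise RuntimeError("second step failed")


with tempfile.TemporaryDirectory() as workdir:
    good = os.path.join(workdir, "good.tsv")
    write_atomically(good, ["header\n", "row\n"])
    with open(good) as handle:
        good_content = handle.read()

    bad = os.path.join(workdir, "bad.tsv")
    try:
        write_atomically(bad, failing_chunks())
    except RuntimeError:
        pass
    bad_exists = os.path.exists(bad)
    leftovers = sorted(os.listdir(workdir))

print(good_content, bad_exists, leftovers)
```

In a shell-based rule the analogous move would be to redirect into a scratch file and mv it onto the declared output as the last command.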

Split files in Snakemake

I have a simple question, but I just cannot figure it out myself.
I have a list of inputs (a,b,c). For each input, I need to extract some data (1 to 23):
bcftools view -H a.vcf.gz -r 1 > a_chr1.txt
...
bcftools view -H a.vcf.gz -r 23 > a_chr23.txt
I can do it with a for loop in the Snakemake rule:
IDS=['a','b','c']
chrs=range(1,23)

rule:
    input:
        expand("{id}.vcf.gz", id=IDS)
    output:
        expand("{id}_{chr}.txt", chr=chrs, id=IDS)
    run:
        for i in IDS:
            for c in chrs:
                shell("bcftools view -H {i}.vcf.gz -r {c} > {i}_chr{c}.txt")
However, the for loop does not run in parallel. I need a proper Snakemake way, something like the code below, but it does not work.
IDS=['a','b','c']
chrs=range(1,23)

rule:
    input:
        expand("{id}.vcf.gz", id=IDS)
    output:
        expand("{id}_{chr}.txt", chr=chrs, id=IDS)
    params:
        c=expand("{chr}", chr=chrs)
    shell:
        "bcftools view -H {input} -r {params.c} > {output}"
Could you please help?
You are not taking advantage of Snakemake wildcards here. If you specify an expand in your inputs and outputs, Snakemake will run the rule only once: it tells Snakemake that all VCF files are needed to run the rule and that this rule will produce all the split files. What you need is a rule that can be applied to any single VCF file and produces only one output, split by chromosome.
IDS=['a','b','c']
# range() excludes its upper bound, so 24 is needed to cover regions 1 to 23 as in the question
chrs=range(1,24)

rule all:
    input: expand("{id}_{chr}.txt", chr=chrs, id=IDS)

rule splitByChr:
    input:
        "{id}.vcf.gz"
    output:
        "{id}_{chr}.txt"
    shell:
        "bcftools view -H {input} -r {wildcards.chr} > {output}"
The rule all here will trigger the rule splitByChr as many times as necessary.
Also note that {id} and {chr} in the expand function are not wildcards. They are placeholders for the expand arguments defined.
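To see why rule all ends up requesting one file per (id, chr) pair, here is a small plain-Python stand-in for what expand does with two keyword arguments: it fills the pattern with every combination of the supplied values. The function expand_like is a hypothetical imitation, not Snakemake's implementation, and the order of its results may differ from Snakemake's. It assumes regions 1 to 23, as in the question's bcftools commands; note that range(1, 23) would stop at 22.

```python
from itertools import product

IDS = ["a", "b", "c"]
chrs = range(1, 24)  # 1..23 inclusive; the upper bound of range() is exclusive


def expand_like(pattern, **values):
    """Fill pattern with every combination of the keyword values."""
    keys = list(values)
    combos = product(*(values[key] for key in keys))
    return [pattern.format(**dict(zip(keys, combo))) for combo in combos]


targets = expand_like("{id}_{chr}.txt", chr=chrs, id=IDS)
print(len(targets))
print(targets[:3])
```

Each of these target names is then matched against the "{id}_{chr}.txt" output pattern of splitByChr, which is what lets the jobs be scheduled independently.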

Can I stop a rule in snakefile being parallel executed

I tried to concatenate files created via snakemake workflow as the last rule. To separate and identify the contents of each file, I echo each file name first in the shell as a separation tag (see the code below)
rule cat:
    input:
        expand('Analysis/typing/{sample}_type.txt', sample=samples)
    output:
        'Analysis/typing/Sum_type.txt'
    shell:
        'echo {input} >> {output} && cat {input} >> {output}'
I was looking for a result in this format:
file name of sample 1
content of sample 1
file name of sample 2
content of sample 2
Instead I got this format:
file name of sample 1 file name of sample 2 ...
content of sample 1
content of sample 2
...
It seems Snakemake executes the echo command in parallel first and then executes the cat command. What can I do to get the format I want?
Thanks
This looks more like a shell issue than a Snakemake issue.
If you want the file names and contents to alternate, you can loop over the input files, as follows:
# Just an example:
samples = ["A", "B", "C"]

rule all:
    input:
        'Analysis/typing/Sum_type.txt'

rule cat:
    input:
        expand('Analysis/typing/{sample}_type.txt', sample=samples)
    output:
        'Analysis/typing/Sum_type.txt'
    shell:
        """
        for file in {input}
        do
            echo ${{file}} >> {output}
            cat ${{file}} >> {output}
        done
        """
(Double curly braces avoid the interpretation of the intended shell variable file as a thing that Snakemake/Python should "interpolate" when computing the string it passes to the shell.)
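The brace escaping can be tried outside Snakemake with Python's own str.format, used here only as a stand-in for how the shell string is filled in. The file names are made up for the demonstration. Doubled braces come out as single literal braces, while a single-braced {file} is treated as a field to substitute and fails when no such value is supplied.

```python
# Doubled braces survive formatting as literal single braces.
template = "for file in {input}; do echo ${{file}}; done"
command = template.format(input="A_type.txt B_type.txt")
print(command)

# With single braces, 'file' is looked up as a formatting field instead.
try:
    "echo ${file}".format(input="A_type.txt")
    single_brace_error = None
except KeyError as error:
    single_brace_error = error.args[0]
print(single_brace_error)
```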
The output you get is consistent with the way bash works rather than with snakemake. Anyway, I think the snakemake way of doing it would be a rule to add the filename to the file content of each file and a rule to concatenate the output. E.g. (not checked for errors):
rule cat:
    input:
        'Analysis/typing/{sample}_type.txt',
    output:
        temp('Analysis/typing/{sample}_type.txt.out'),
    shell:
        r"""
        echo {input} > {output}
        cat {input} >> {output}
        """

rule cat_all:
    input:
        expand('Analysis/typing/{sample}_type.txt.out', sample=samples)
    output:
        'Analysis/typing/Sum_type.txt'
    shell:
        r"""
        cat {input} > {output}
        """

Combine shell command lines in snakemake

I would like to combine two command lines into a single one to avoid intermediate files.
workdir: "/path/to/workdir/"

rule all:
    input:
        "my.filtered.vcf.gz"

rule bedtools:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        "/Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -header -wa |"
        "/Tools/bcftools/bcftools annotate -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') > {output.outvcf}"
I am getting an invalid syntax error. I would appreciate it if you could explain how to combine multiple shell lines in Snakemake.
You probably get the invalid syntax error because of the " you use in your shell string here: Description="Gene name">. This closes the string early. You can either escape these quotes or use the """ syntax:
rule bedtools:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        "/Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -header -wa |"
        "/Tools/bcftools/bcftools annotate -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description=\"Gene name\">') > {output.outvcf}"
or
rule bedtools:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        """
        /Tools/bedtools2/bin/bedtools intersect -a {input.invcf} -b {input.bedgz} -header -wa | /Tools/bcftools/bcftools annotate -c CHROM,FROM,TO,GENE -h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description="Gene name">') > {output.outvcf}
        """
Note that you can write multiple lines with """. Example without pipes:
shell:
    """
    bedtools .... {input} > tempFile
    bcftools .... tempFile > tempFile2
    whatever .... tempFile2 > {output}
    """
The unescaped double quotes are the problem, but here is a little more on formatting and pipes.
I prefer the syntax of wrapping each line in " so that the lines can be spaced better:
rule bedtools:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        "/Tools/bedtools2/bin/bedtools "
        "intersect "
        "-a {input.invcf} "
        "-b {input.bedgz} "
        "-header -wa "
        "| /Tools/bcftools/bcftools "
        "annotate "
        "-c CHROM,FROM,TO,GENE "
        "-h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description=\"Gene name\">') "
        "> {output.outvcf}"
I find this makes each argument clearer to see and easier to change by moving lines around. But note that the trailing space on each line is required, and you have to use an explicit newline, \n, if you want a separate command. When the command is printed, the output is nicely formatted. With the """ syntax you have to escape each newline with \ at the end, and spaces at the start of the line are retained in printing.
If you have lots of pipe work to do, check out the pipe flag. You write your first step as a rule and Snakemake produces a named pipe between the rules, submitting them as a group:
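The trailing-space requirement follows from how Python joins adjacent string literals: they are concatenated with nothing in between. A minimal illustration in plain Python, with made-up command fragments:

```python
# Adjacent literals are glued together exactly as written.
missing_space = (
    "bedtools intersect"
    "-a in.vcf.gz"
)
with_space = (
    "bedtools intersect "
    "-a in.vcf.gz"
)
# An explicit newline is needed to start a separate command.
two_commands = (
    "echo first\n"
    "echo second"
)

print(repr(missing_space))
print(repr(with_space))
print(two_commands.splitlines())
```

Without the trailing space, the option is fused onto the previous word and the shell would see a single unknown argument.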
rule bedtools_intersect:
    input:
        invcf="/path/to/my.vcf.gz",
        bedgz="/path/to/my.bed.gz"
    output:
        outvcf=pipe("my.intersected.vcf.gz")
    shell:
        "/Tools/bedtools2/bin/bedtools "
        "intersect "
        "-a {input.invcf} "
        "-b {input.bedgz} "
        "-header -wa "
        "> {output.outvcf}"

rule bcftools_annotate:
    input:
        invcf="my.intersected.vcf.gz"
    output:
        outvcf="my.filtered.vcf.gz"
    shell:
        "/Tools/bcftools/bcftools "
        "annotate "
        "-c CHROM,FROM,TO,GENE "
        "-h <(echo '##INFO=<ID=GENE,Number=1,Type=String,Description=\"Gene name\">') "
        "{input.invcf} "
        "> {output.outvcf}"
The advantage is you can reuse each rule throughout your pipeline to intersect or annotate while avoiding temporary files.

Wildcard SyntaxError in Snakemake with no obvious cause

I keep getting an error about a rule not having the same wildcards in its output rules and I can't figure out what the source of the error might be:
SyntaxError:
Not all output, log and benchmark files of rule bcftools_filter contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
...
rule merge_YRI_GTEx:
    input:
        kg=expand("kg_vcf/1kg_yri_chr{q}.vcf.gz", q=range(1,23)),
        gtex=expand("gtex_vcf/gtex_chr{v}.snps.recode.vcf.gz", v=range(1, 23))
    output:
        "merged/merged_chr{i}.vcf.gz"
    shell:
        "bcftools merge \
        -0 \
        -O z \
        -o {output} \
        {input.kg} \
        {input.gtex}"

rule bcftools_filter:
    input:
        expand("merged/merged_chr{i}.vcf.gz", i=range(1,23))
    output:
        filt="filtered_vcf/merged_filtered_chr{i}.vcf.gz",
        chk=touch(".bcftools_filter.chkpnt")
    threads:
        4
    shell:
        "bcftools filter \
        --include 'AN=1890 && AC > 0' \
        --threads {threads} \
        -O z \
        -o {output.filt} \
        {input}"
...
rule list_merged_filtered_vcfs:
    input:
        ".bcftools_filter.chkpnt"
    output:
        "processed_vcf_list.txt"
    shell:
        "for i in {{1..22}}; do \ "
        "echo \"{config[sprime_dir]}/filtered_vcf/merged_filtered_chr${{i}}.vcf.gz\" >> \
        {output}; done"
The specific line it's complaining about is the one that's just "bcftools filter \ which is even more dumbfounding to me. I've tried giving names to the input wildcard and even scrutinizing the rule which calls bcftools_filter's output as well as the rule which produces bcftools_filter's input to no avail. Not sure what is giving me this error.
I think the error comes from chk=touch(".bcftools_filter.chkpnt") not containing the wildcard {i}.
Apart from that, I'm not sure your rule is very sensible. You are passing bcftools filter a list of input files (from expand(...)), but I don't think bcftools filter accepts more than one input file. Also, your rule will create output files filtered_vcf/merged_filtered_chr{i}.vcf.gz (one for each value of i) using the same list of input files. Are you sure you want expand("merged/merged_chr{i}.vcf.gz", i=range(1,23)) instead of just "merged/merged_chr{i}.vcf.gz", with values for i given somewhere upstream?
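The mismatch can be checked mechanically. The following plain-Python sketch extracts the wildcard names from each output pattern of bcftools_filter and compares them; the helper wildcard_names is hypothetical and uses a simplified pattern that ignores wildcard constraints, so it only approximates Snakemake's own check.

```python
import re


def wildcard_names(pattern):
    """Return the set of simple {name} wildcards found in a path pattern."""
    return set(re.findall(r"\{([A-Za-z_]\w*)\}", pattern))


# Output patterns copied from rule bcftools_filter in the question.
outputs = {
    "filt": "filtered_vcf/merged_filtered_chr{i}.vcf.gz",
    "chk": ".bcftools_filter.chkpnt",
}

names = {key: wildcard_names(pattern) for key, pattern in outputs.items()}
consistent = len({frozenset(found) for found in names.values()}) == 1

print(names)
print(consistent)
```

Here filt carries the wildcard i while chk carries none, so every value of i would write to the same .bcftools_filter.chkpnt file, which is the collision the error message warns about.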