Split files in Snakemake

I have a simple question, but I just cannot figure it out myself.
I have a list of inputs (a,b,c). For each input, I need to extract the data for each chromosome (1 to 23):
bcftools view -H a.vcf.gz -r 1 > a_chr1.txt
...
bcftools view -H a.vcf.gz -r 23 > a_chr23.txt
I can do it with a for loop in the Snakemake rule:

IDS = ['a', 'b', 'c']
chrs = range(1, 24)  # chromosomes 1..23

rule:
    input:
        expand("{id}.vcf.gz", id=IDS)
    output:
        expand("{id}_chr{chr}.txt", chr=chrs, id=IDS)
    run:
        for i in IDS:
            for c in chrs:
                shell("bcftools view -H {i}.vcf.gz -r {c} > {i}_chr{c}.txt")
However, the for loop is not parallelized. I need a proper Snakemake way, something like the code below, but it does not work:
IDS = ['a', 'b', 'c']
chrs = range(1, 24)  # chromosomes 1..23

rule:
    input:
        expand("{id}.vcf.gz", id=IDS)
    output:
        expand("{id}_chr{chr}.txt", chr=chrs, id=IDS)
    params:
        c=expand("{chr}", chr=chrs)
    shell:
        "bcftools view -H {input} -r {params.c} > {output}"
Could you please help?

You are not taking advantage of Snakemake wildcards here. If you use expand in your inputs and outputs, Snakemake will run the rule only once: you are telling it that all vcf files are needed to run the rule and that this single run will produce all the split files. What you need is a rule that can be applied to any one vcf file and produces a single split (per-chromosome) output.
IDS = ['a', 'b', 'c']
chrs = range(1, 24)  # chromosomes 1..23

rule all:
    input:
        expand("{id}_chr{chr}.txt", chr=chrs, id=IDS)

rule splitByChr:
    input:
        "{id}.vcf.gz"
    output:
        "{id}_chr{chr}.txt"
    shell:
        "bcftools view -H {input} -r {wildcards.chr} > {output}"
The rule all here will trigger splitByChr as many times as necessary.
Also note that {id} and {chr} in the expand function are not wildcards; they are placeholders filled in from the expand arguments.
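Because each {id}/{chr} combination is now an independent job, Snakemake can run the splits concurrently; just give it more than one core when invoking it (the core count here is arbitrary):

snakemake --cores 8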

Related

Snakemake, producing list of files that are created within the pipeline

This is my first snakemake workflow, so it might be that I'm overcomplicating things.
My workflow takes as input a 'database query' for downloading some files, which is specified in my config.yaml. This means I do not know the names of the files that will be downloaded before running the pipeline.
# configfile: "config.yaml"
# DATABASE = config["database"]
# database: '("Apis"[Organism] OR Apis[All Fields]) AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])'
DATABASE = '("Apis"[Organism] OR Apis[All Fields]) AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])'
What I want to do is:
1. Create a genome list: call the database with my query and extract the links to the files (create_genome_list). (Here I use Entrez.)
2. Download the files using the collected links (download_genome).
3. The files are zipped, so unzip them (unzip_genome).
4. Finally, create a list of all downloaded and unzipped files... and here I struggle (make_summary_table).
I can run steps 1-3 of my snakemake when I ask for one of the expected output files, like this:
snakemake -p database/GCA_000184785.2_Aflo_1.1_genomic/GCA_000184785.2_Aflo_1.1_genomic.fna --use-conda
It gives me links to all expected files (5) in the folder temp/, and 1 downloaded and unzipped file: database/GCA_000184785.2_Aflo_1.1_genomic/GCA_000184785.2_Aflo_1.1_genomic.fna
My Snakefile for steps 1-3 looks like this:
rule create_genome_list:
    output: touch("temp/{genome}")
    conda: "entrez_env.yaml"
    message: "Creating the genomes list..."
    shell:
        r"""
        esearch -db assembly -query '{DATABASE}' \
        | esummary \
        | xtract -pattern DocumentSummary -element FtpPath_GenBank \
        | while read -r line ;
        do
            fname=$(echo $line | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/');
            wildcard=$(echo $fname | sed -e 's!.fna.gz!!');
            echo "$line/$fname" > temp/$wildcard;
            #echo $wildcard >> list_of_genomes.txt
        done
        """

rule download_genome:
    output: touch("database/{genome}/{genome}.fna.gz")
    input: "temp/{genome}"
    shell:
        r"""
        GENOME_LINK=$(cat {input})
        GENOME="${{GENOME_LINK##*/}}"
        wget -P ./database/{wildcards.genome}/ $GENOME_LINK
        """

rule unzip_genome:
    output: touch("database/{genome}/{genome}.fna")
    input: "database/{genome}/{genome}.fna.gz"
    shell: "gunzip {input}"
My problem starts when I want to create the final rule, which wraps up the results of the pipeline. In my real pipeline I do additional analyses with the downloaded genomes, and at the end I want to join all partial per-genome results into one table. The toy example I post here reflects my problem best, I believe.
I guess there is some way to extract the genomes' names so I could use them in the input of the final summarising rule.
I approached it in an ugly way, by listing the files in temp/ and using them in expand(), as follows:
GENOMES = os.listdir("temp/")

rule make_summary_table:
    output: "summary_table.txt"
    input: expand("database/{genome}/{genome}.fna", genome=GENOMES)
    shell:
        """
        echo {input} >> {output}
        echo " " >> {output}
        """
But this works only when temp/ already exists before running the pipeline, and it only produces a summary_table.txt with 5 entries if I run steps 1-3 beforehand (otherwise it produces an empty file).
I am also afraid that in my real pipeline not all genomes may have produced their partial results by the time the final summarising rule is called. But maybe Snakemake handles this (by waiting?) once all the inputs are specified.
-----------------------------EDIT-----------------------------------------
I have tried to implement a checkpoint as a possible solution, as follows:
DATABASE = '("Apis"[Organism] OR Apis[All Fields]) AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])'
rule all:
input: "summary_table.txt"
checkpoint create_genome_list:
output: directory("temp/")
conda: "entrez_env.yaml"
shell:
r"""
esearch -db assembly -query '{DATABASE}' \
| esummary \
| xtract -pattern DocumentSummary -element FtpPath_GenBank \
| while read -r line ;
do
fname=$(echo $line | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/');
wildcard=$(echo $fname | sed -e 's!.fna.gz!!');
echo "$line/$fname" > temp/$wildcard;
#echo $wildcard >> list_of_genomes.txt
done
"""
rule download_genome:
output: touch("database/{genome}/{genome}.fna.gz")
input: "temp/{genome}"
shell:
r"""
GENOME_LINK=$(cat {input})
GENOME="${{GENOME_LINK##*/}}"
wget -P ./database/{wildcards.genome}/ $GENOME_LINK
"""
rule unzip_genome:
output: "database/{genome}/{genome}.fna"
input: "database/{genome}/{genome}.fna.gz"
shell:
r"""
gunzip {input}
"""
def aggregate_input(wildcards):
checkpoint_output = checkpoints.create_genome_list.get(**wildcards).output[0]
return expand("database/{genome}/{genome}.fna",
i=glob_wildcards(os.path.join(checkpoint_output, "{genome}.fna")).genome)
rule make_summary_table:
output: "summary_table.txt"
input: aggregate_input
shell:
"""
echo {input} >> {output}
echo " " >> {output}
"""
But I cannot overcome this error:
InputFunctionException in line 73 (rule make_summary_table) of ~/snakemake_test/Snakefile:
WildcardError: No values given for wildcard 'genome'. Wildcards:
For your updated code to work you need to apply at least the following fixes:
1. Redefine the rule as a checkpoint (not necessary, see the edit note below).
2. Expand all wildcards in the checkpoint-related input function: your expand leaves {genome} un-expanded, because it expands i, which does not appear in the pattern and thus does nothing.
3. Glob on the actual file names in temp/: your download script strips ".fna.gz", so the files there carry no extension and the pattern "{genome}.fna" matches nothing.
The relevant code lines:

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.create_genome_list.get(**wildcards).output[0]
    # The files written into temp/ are named after the genome with no extension.
    return expand(
        "database/{genome}/{genome}.fna",
        genome=glob_wildcards(os.path.join(checkpoint_output, "{genome}")).genome,
    )

rule make_summary_table:
    output:
        "summary_table.txt",
    input:
        aggregate_input,
    shell:
        """
        echo {input} >> {output}
        echo " " >> {output}
        """
Give it a try and let us know if it works!
edit: Sorry, I realised that the rule in question was already converted to a checkpoint, so my point 1 is invalid. I've updated the answer above.
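With those fixes in place, the whole pipeline should run end to end with something like (the core count is arbitrary):

snakemake --cores 4 --use-conda

Snakemake re-evaluates the DAG once the checkpoint has finished, so make_summary_table will wait until every genome has been downloaded and unzipped.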

How to avoid a Snakemake rule using an incomplete output file from another rule

rule rule1:
    output: tsv = "..."
    input: faa = "..."
    shell:
        """
        awk ... > {output.tsv}
        some commands {input.faa} | awk ... >> {output.tsv}
        """

rule rule2:
    output:
        tsv = "..."
    input:
        tsv = rules.rule1.output.tsv,
    shell:
        """
        awk ... {input.tsv} > {output.tsv}
        """
As illustrated above, rule2 takes its input file from rule1.
According to the official docs, since the output file in rule1 is created successfully by the first awk, Snakemake assumes everything worked fine, even if the output file is incomplete, because the second awk only appends to that file. Snakemake just ran rule2 and took the incomplete file from rule1. In fact, the second awk command in rule1 had not been executed, leaving the output file incomplete.
As far as I can tell, Snakemake detecting the presence of output.tsv and assuming the rule completed successfully (since your awk commands didn't error) is working as intended.
It's not easy for me to suggest specific edits since the commands are not complete, but how about creating intermediate files in your rule for the two awk commands and then combining them, so that if one or the other doesn't run, the rule fails. Something like:
rule rule1:
    output:
        tsv = "...",
        int1 = temp(".../{sample}_i1.tsv"),
        int2 = temp(".../{sample}_i2.tsv")
    input:
        faa = "..."
    shell:
        """
        awk ... > {output.int1}
        awk ... {input.faa} > {output.int2}
        [some logic to make sure the processing is complete]
        cat {output.int1} {output.int2} > {output.tsv}
        """
I wrapped both intermediates in temp() so that snakemake cleans them up after the rule ends.
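If the worry is a command failing silently mid-rule, it may also help to make the shell block fail fast. A minimal sketch using Snakemake's shell.prefix (note that recent Snakemake versions already run shell commands under bash strict mode, so this may be redundant for you):

# At the top of the Snakefile: abort a rule's shell block on the first failing command.
shell.prefix("set -euo pipefail; ")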

Can I stop a rule in a Snakefile from being executed in parallel

I tried to concatenate files created by a Snakemake workflow as the last rule. To separate and identify the contents of each file, I first echo each file's name in the shell as a separator tag (see the code below):
rule cat:
    input:
        expand('Analysis/typing/{sample}_type.txt', sample=samples)
    output:
        'Analysis/typing/Sum_type.txt'
    shell:
        'echo {input} >> {output} && cat {input} >> {output}'
I was expecting the result in this format:

file name of sample 1
content of sample 1
file name of sample 2
content of sample 2

Instead I got this format:

file name of sample 1 file name of sample 2 ...
content of sample 1
content of sample 2
...

It seems Snakemake executes the echo command first for all files and then the cat command. What can I do to get the format I wanted?
Thanks
This looks more like a shell issue than a Snakemake issue: {input} expands to the whole space-separated list of files, so your single echo prints all the names at once.
If you want the file names and contents to alternate, you can use a loop over the input files, as follows:
# Just an example:
samples = ["A", "B", "C"]

rule all:
    input:
        'Analysis/typing/Sum_type.txt'

rule cat:
    input:
        expand('Analysis/typing/{sample}_type.txt', sample=samples)
    output:
        'Analysis/typing/Sum_type.txt'
    shell:
        """
        for file in {input}
        do
            echo ${{file}} >> {output}
            cat ${{file}} >> {output}
        done
        """
(The double curly braces prevent Snakemake/Python from interpolating the intended shell variable file when computing the string it passes to the shell.)
The output you get is consistent with the way bash works rather than with Snakemake. Anyway, I think the Snakemake way of doing this would be one rule that prepends the filename to each file's content, and another rule that concatenates the results. E.g. (not checked for errors):
rule cat:
    input:
        'Analysis/typing/{sample}_type.txt',
    output:
        temp('Analysis/typing/{sample}_type.txt.out'),
    shell:
        r"""
        echo {input} > {output}
        cat {input} >> {output}
        """

rule cat_all:
    input:
        expand('Analysis/typing/{sample}_type.txt.out', sample=samples)
    output:
        'Analysis/typing/Sum_type.txt'
    shell:
        r"""
        cat {input} > {output}
        """

Wildcard SyntaxError in Snakemake with no obvious cause

I keep getting an error about a rule not having the same wildcards in all of its output files, and I can't figure out the source of the error:
SyntaxError:
Not all output, log and benchmark files of rule bcftools_filter contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
...
rule merge_YRI_GTEx:
    input:
        kg=expand("kg_vcf/1kg_yri_chr{q}.vcf.gz", q=range(1,23)),
        gtex=expand("gtex_vcf/gtex_chr{v}.snps.recode.vcf.gz", v=range(1,23))
    output:
        "merged/merged_chr{i}.vcf.gz"
    shell:
        "bcftools merge \
        -0 \
        -O z \
        -o {output} \
        {input.kg} \
        {input.gtex}"

rule bcftools_filter:
    input:
        expand("merged/merged_chr{i}.vcf.gz", i=range(1,23))
    output:
        filt="filtered_vcf/merged_filtered_chr{i}.vcf.gz",
        chk=touch(".bcftools_filter.chkpnt")
    threads:
        4
    shell:
        "bcftools filter \
        --include 'AN=1890 && AC > 0' \
        --threads {threads} \
        -O z \
        -o {output.filt} \
        {input}"
...
rule list_merged_filtered_vcfs:
    input:
        ".bcftools_filter.chkpnt"
    output:
        "processed_vcf_list.txt"
    shell:
        "for i in {{1..22}}; do \
        echo \"{config[sprime_dir]}/filtered_vcf/merged_filtered_chr${{i}}.vcf.gz\" >> \
        {output}; done"
The specific line it complains about is the one that's just "bcftools filter \, which is even more dumbfounding to me. I've tried naming the input wildcard, and I've scrutinized both the rule that consumes bcftools_filter's output and the rule that produces its input, to no avail. I'm not sure what is giving me this error.
I think the error comes from chk=touch(".bcftools_filter.chkpnt") not containing the wildcard {i}.
Apart from that, I'm not sure your rule is very sensible. You are passing bcftools filter a list of input files (from expand(...)), but I don't think bcftools filter accepts more than one input file. Also, your rule will create each output file filtered_vcf/merged_filtered_chr{i}.vcf.gz (one for each value of i) from the same list of input files. Are you sure you want expand("merged/merged_chr{i}.vcf.gz", i=range(1,23)) instead of just "merged/merged_chr{i}.vcf.gz", with the values for i given somewhere upstream?
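For illustration, a sketch of what the per-chromosome version could look like (assuming the 22 values of i are requested by a downstream rule such as a rule all):

rule bcftools_filter:
    input:
        "merged/merged_chr{i}.vcf.gz"
    output:
        filt="filtered_vcf/merged_filtered_chr{i}.vcf.gz"
    threads:
        4
    shell:
        "bcftools filter \
        --include 'AN=1890 && AC > 0' \
        --threads {threads} \
        -O z \
        -o {output.filt} \
        {input}"

With one job per chromosome, every output contains the {i} wildcard, so the touch() marker file (which lacked {i}) is no longer needed; downstream rules can depend on the per-chromosome outputs directly.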

Extract user-specified sequences from the reverse strand of a FASTA file using samtools

I have a list of regions with start and end points.
I used the samtools faidx ref.fa <region> command, which gave me the forward-strand sequence for that region.
In the samtools manual there is an option to extract the reverse strand, but I could not figure out how to use it.
Does anybody know how to run this command for the reverse strand in samtools?
My regions are like:
LG2:124522-124572 (Forward)
LG3:250022-250072 (Reverse)
LG29:4822278-4822318 (Reverse)
LG12:2,595,915-2,596,240 (Forward)
LG16:5,405,500-5,405,828 (Reverse)
As you noticed, samtools faidx has the option --reverse-complement (or -i) to output the sequence from the reverse strand.
As far as I know, samtools does not support a region notation that lets you specify the strand.
A quick solution is to separate your region file into forward and reverse locations and run samtools twice.
The steps below are rather verbose, just so each step is clear. It's fairly straightforward to tidy this up with process substitution in bash, for example; see the one-liner at the end.
# Separate the strand regions.
# Use grep and sed twice, or awk (below).
grep -F '(Forward)' regions.txt | sed 's/ (Forward)//' > forward-regions.txt
grep -F '(Reverse)' regions.txt | sed 's/ (Reverse)//' > reverse-regions.txt

# The same as an awk one-liner (the redirection target is parenthesized so the
# concatenation is parsed portably across awk implementations).
awk '{ strand = ($2 == "(Forward)") ? "forward" : "reverse"; print $1 > (strand "-regions.txt") }' regions.txt

# Run samtools, marking the strand as +/- in the FASTA output.
samtools faidx ref.fa -r forward-regions.txt --mark-strand sign -o forward-sequences.fa
samtools faidx ref.fa -r reverse-regions.txt --mark-strand sign -o reverse-sequences.fa --reverse-complement

# Combine the FASTA output into a single file.
cat forward-sequences.fa reverse-sequences.fa > sequences.fa
rm forward-sequences.fa reverse-sequences.fa
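For example, the forward-strand half as a single command with process substitution, avoiding the intermediate region file (same files and options as above):

samtools faidx ref.fa --mark-strand sign \
    -o forward-sequences.fa \
    -r <(grep -F '(Forward)' regions.txt | sed 's/ (Forward)//')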
Just a note: you may need to update your samtools to a recent version if you run into problems. In my case, samtools v1.2 didn't work and v1.10 worked.