Snakemake: specifying a workdir in YAML config file - config

I specified a workdir in the YAML config file to use with snakemake as follows:
$> cat config.yaml
workdir: "/home/lina/test_output"
id: 1234
$> cat Snakefile
rule all:
    input:
        data_out = expand("cat_out/{id}_times_two.txt", id = config['id'])
rule double_print:
    input:
        data = expand("data/{id}.txt", id = config['id'])
    output:
        data_out = expand("cat_out/{id}_times_two.txt", id = config['id'])
    shell:
        'cat {input.data} {input.data} > {output.data_out}'
$> snakemake --configfile=config.yaml
However, when I ran the command, the output was generated in the directory where the Snakefile resides. The Snakefile did pick up the id parameter from the config file, so the config file itself is being read.
How should I modify the config file or my snakemake command to make sure the output ends up in the workdir I specified?
Thanks much!

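Snakemake does not treat a workdir key in the config file specially; it is just another config value. The working directory has to be set with the workdir directive in the Snakefile, not in the configuration.
You can still take the value from the config, so in the Snakefile, write:
workdir: config['workdir']
Alternatively, the working directory can be given on the command line with the --directory option.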
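To see why the output moved, it can help to look at how a relative output path resolves against two different working directories. This is a plain-Python sketch, not Snakemake code; the helper `resolve_output` and the directory `/home/lina/project` are assumptions made up for illustration.

```python
from pathlib import PurePosixPath

def resolve_output(workdir, relative_output):
    """Return where a relative output path lands for a given working directory."""
    return str(PurePosixPath(workdir) / relative_output)

config = {"workdir": "/home/lina/test_output", "id": 1234}
target = "cat_out/{id}_times_two.txt".format(id=config["id"])

# Without a workdir directive, paths resolve against the directory the
# workflow runs in (hypothetical location of the Snakefile).
default_location = resolve_output("/home/lina/project", target)

# With workdir: config['workdir'], the same relative path resolves here.
configured_location = resolve_output(config["workdir"], target)

print(default_location)
print(configured_location)
```

The rules themselves do not change; only the base directory that their relative paths are resolved against does.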

Related

Snakemake - MissingInputException

My snakemake pipeline containing 31 rules is driving me crazy. It's a mapping and snp calling pipeline that uses BWA and HaplotypeCaller among others. I have created a conda environment for each rule, depending on the program used. My code is quite long and can be seen if needed at this address : https://github.com/ltalignani/SHAVE1
Concretely, when I want to build the DAG, snakemake tells me that the haplotype_caller rule doesn't have the reference genome as input. But it is in the file. Here is the concerned code:
rule haplotype_caller_gvcf:
    # Aim: Call germline SNPs and indels via local re-assembly of haplotypes
    # Use: gatk --java-options '-Xmx{MEM_GB}g' HaplotypeCaller \
    #      -R Homo_sapiens_assembly38.fasta \
    #      -I input.bam \
    #      -O output.g.vcf.gz \
    #      -ERC GVCF # Essential to GenotypeGVCFs: produce genotype likelihoods
    message:
        "HaplotypeCaller calling SNVs and Indels for {wildcards.sample} sample ({wildcards.aligner}-{wildcards.mincov})"
    conda:
        GATK4
    input:
        refpath = REFPATH,
        reference = REFERENCE,
        bam = "results/04_Variants/{sample}_{aligner}_{mincov}X_indel-qual.bam"
    output:
        gvcf="results/04_Variants/haplotypecaller/{sample}_{aligner}_{mincov}X_variant-call.g.vcf"
    log:
        "results/11_Reports/haplotypecaller/{sample}_{aligner}_{mincov}X_variant-call.log" # optional
    resources:
        mem_gb= MEM_GB,
    shell:
        "gatk HaplotypeCaller " # --java-options '-Xmx{resources.mem_gb}g'
        "-R {input.refpath}{input.reference} "
        "-I {input.bam} "
        "-O {output.gvcf} "
        "-ERC GVCF" # Essential to GenotypeGVCFs: produce genotype likelihoods
With the REFPATH and REFERENCE variables defined as follows in the snakefile header:
REFPATH = config["consensus"]["path"] # Path to reference genome
REFERENCE = config["consensus"]["reference"] # Genome reference sequence, in fasta format
And the config file in .yaml is like this:
consensus:
    reference: "GCA_018104305.1_AalbF3_genomic.fasta"
    path: "resources/genomes/" # Path to genome reference
When I ask for the DAG :
snakemake -s workflow/rules/shave.smk --dag | dot -Tpng > test.png
I get this error:
`MissingInputException in line 247 of /Users/loic/snakemake/short-read-alignment-vector-pipeline/workflow/rules/shave.smk:`
Missing input files for rule haplotype_caller_gvcf:
GCA_018104305.1_AalbF3_genomic.fasta
Here is the structure of the Snakemake project:
[image: project directory structure]
I also tried snakemake --lint, but it reported no problems.
I've looked at your GitHub repository, and the folder resources/genomes/ only contains the file GCA_018104305.1_AalbF3_genomic.fasta.fai. Have you tried renaming this file to the expected input name GCA_018104305.1_AalbF3_genomic.fasta, i.e. dropping the .fai extension?
Thanks for your answer,
The fasta is 1.47 GB. That's why it's not in the resources/genomes folder.
The .fai is the fasta index, necessary for some programs like GATK.
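The error message names the bare file name GCA_018104305.1_AalbF3_genomic.fasta, without the resources/genomes/ prefix. One possible reading is that Snakemake treats `reference = REFERENCE` as a path relative to the working directory, because `refpath` and `reference` are declared as two separate inputs. A minimal sketch of one way around this, assuming the two config values are joined into a single path first (the variable name `REF_FASTA` is hypothetical and not part of the original workflow):

```python
import os

# Same values as in the config file shown in the question.
config = {
    "consensus": {
        "reference": "GCA_018104305.1_AalbF3_genomic.fasta",
        "path": "resources/genomes/",
    }
}

REFPATH = config["consensus"]["path"]
REFERENCE = config["consensus"]["reference"]

# Hypothetical helper variable: one full path instead of two separate inputs.
REF_FASTA = os.path.join(REFPATH, REFERENCE)
print(REF_FASTA)
```

In the rule, a single input such as `reference = REF_FASTA` could then replace the `refpath` and `reference` pair, with `-R {input.reference}` in the shell command. The FASTA file still has to exist at that location when the DAG is built.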

Error: no Snakefile found, tried Snakefile, snakefile, workflow/Snakefile, workflow/snakefile

I get this error whenever I run my snakefile.smk:
Error: no Snakefile found, tried Snakefile, snakefile, workflow/Snakefile, workflow/snakefile.
The ls command shows that the file exists in the directory:
Miniconda3-4.7.12.1-Linux-x86_64.sh config.yml miniconda3 nano.save.1 snakemake
Miniconda3-4.7.12.1-MacOSX-x86_64.sh download.r nano.save snakefile.smk work
I am using WSL2 with Ubuntu 20.
The snakefile contents:
sample = ["GSE6955", "GSE67311"]
rule download:
    output:
        "~/{sample}.tar"
    run:
        "~/download.R"
rule extract:
    input:
        "~/{sample}.tar"
    output:
        directory("~/{sample}")
    shell:
        "tar xvf {input}"
Does anyone have an idea how to fix this?
By default, Snakemake looks for a file called Snakefile or snakefile in the working directory (or in a workflow/ subdirectory) if no specific file is given. Your file is called snakefile.smk, which matches none of these names. To run Snakemake with a specific snakefile, pass it with the -s or --snakefile command-line argument, for example: snakemake -s snakefile.smk
I recommend calling snakemake with the -h flag to see all the available options.
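The lookup behind the error can be mimicked in a few lines of Python. This is only an illustration based on the candidate names listed in the error message, not Snakemake's actual implementation; the function `find_default_snakefile` is a made-up name.

```python
import os
import tempfile

# Default names, copied from the error message above.
DEFAULT_CANDIDATES = [
    "Snakefile",
    "snakefile",
    "workflow/Snakefile",
    "workflow/snakefile",
]

def find_default_snakefile(directory):
    """Return the first default candidate that exists in `directory`, else None."""
    for candidate in DEFAULT_CANDIDATES:
        if os.path.isfile(os.path.join(directory, candidate)):
            return candidate
    return None

with tempfile.TemporaryDirectory() as tmp:
    # Only snakefile.smk is present: none of the default names match.
    open(os.path.join(tmp, "snakefile.smk"), "w").close()
    before = find_default_snakefile(tmp)

    # Once a file with a default name exists, it is found.
    open(os.path.join(tmp, "Snakefile"), "w").close()
    after = find_default_snakefile(tmp)

print(before, after)
```

So the two options are either to rename the file to one of the default names or to keep the name and pass it explicitly with -s.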

Snakemake: catch output file whose name cannot be changed

As part of a Snakemake pipeline that I'm building, I have to use a program that does not allow me to specify the file path or name of an output file.
E.g. when running the program in the working directory workdir/ it produces the following output:
workdir/output.txt
My snakemake rule looks something like this:
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell: "somecommand {input} {output}"
So every time the rule NAME runs, I get an additional file output.txt in the snakemake working directory, which is then overwritten if the rule NAME runs multiple times or in parallel.
I'm aware of shadow rules, and adding shadow: "full" allows me to simply ignore the output.txt file. However, I'd like to keep output.txt and save it in the same directory as the outputfile. Is there a way of achieving this, either with the shadow directive or otherwise?
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
How about simply moving it directly afterwards in the shell part (provided somecommand completes successfully)?
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    params:
        output_dir = "path/to/output_dir",
    shell: "somecommand {input} {output} && mv output.txt {params.output_dir}/output.txt"
EDIT: for multiple executions of NAME in parallel, combining with shadow: "full" could work:
rule NAME:
    input: "path/to/inputfile"
    output:
        output_file = "path/to/outputfile",
        output_txt = "path/to/output_dir/output.txt"
    shadow: "full"
    shell: "somecommand {input} {output.output_file} && mv output.txt {output.output_txt}"
That should run each execution of the rule in its own temporary directory. Because the moved output.txt is declared as an output, Snakemake should move it to the real output directory once the rule has finished running.
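The idea behind combining an isolated directory with a final move can be sketched in plain Python. This is an analogy under stated assumptions, not how Snakemake implements shadow: `fake_tool` stands in for somecommand, and `run_isolated` is a hypothetical helper.

```python
import os
import shutil
import tempfile

def fake_tool():
    """Stand-in for `somecommand`: always writes output.txt in the current directory."""
    with open("output.txt", "w") as handle:
        handle.write("result\n")

def run_isolated(tool, destination):
    """Run `tool` in a private scratch directory, then move its output.txt to `destination`."""
    start_dir = os.getcwd()
    destination = os.path.abspath(destination)
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    with tempfile.TemporaryDirectory() as scratch:
        os.chdir(scratch)
        try:
            tool()
            shutil.move("output.txt", destination)
        finally:
            os.chdir(start_dir)  # always restore the caller's working directory
    return destination

# Two runs write the same fixed file name but never see each other's file.
base = tempfile.mkdtemp()
first = run_isolated(fake_tool, os.path.join(base, "run_a", "output.txt"))
second = run_isolated(fake_tool, os.path.join(base, "run_b", "output.txt"))
print(first)
print(second)
```

Note that os.chdir affects the whole process, so this sketch is only safe for sequential calls. Snakemake jobs are isolated from each other, which is what makes the shadow approach usable in parallel.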
"I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME."
I think you are on the right track here. Each shell block is run in a separate process with the working directory inherited from the snakemake process (specified with the --directory argument on the command line). Accordingly, cd commands in one shell block will not affect other jobs from the same rule or other downstream/upstream jobs.
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell:
        """
        input_file=$(realpath "{input}") # get the absolute path, before the `cd`
        base_dir=$(dirname "{output}")
        cd "$base_dir"
        somecommand ...
        """

ChildIOException when creating a subdirectory of a subdirectory

I'm looking for help understanding an error.
I get a ChildIOException when a rule creates a subdirectory inside a directory that was itself created by a previous rule.
Basically, my first rule runs a script that creates a directory with a couple of subdirectories and files. My second rule takes one particular subdirectory and, through another script, creates a new subdirectory next to it in the same parent directory. My third rule takes that new subdirectory and creates yet another subdirectory (with other files) inside it.
I don't understand why my second rule works while the third doesn't.
My workflow is as follows:
configfile: "config.yaml"
dirname = config["dirname"].values()
script_dir = config["script_dir"]
rule all:
    # Contain all output
    input:
        expand(["{dirname}/GFF/","{dirname}/GFF/final_gffs/", "{dirname}/GFF/roary_results/",
                "{dirname}/GFF/roary_results/pangenome_multifastas/"], dirname=dirname)
rule prepa_gff:
    # Transform gbff files to gff through prepare to roary
    input:
        expand("{dirname}/GenBank/",dirname=dirname)
    output:
        gff_dir = directory(expand("{dirname}/GFF/",dirname=dirname)),
        gff_fin = directory(expand("{dirname}/GFF/final_gffs/",dirname=dirname))
    params:
        script_dir = script_dir
    message:
        "Converting gbff files into gff files."
    run:
        for dir in dirname:
            shell("cd {script_dir} && python3 prepare_to_roary.py -i {dir}/GenBank -o {dir}/GFF")
rule roary:
    # Launch roary, with the script itself launching the cluster for operating
    input:
        rules.prepa_gff.output.gff_fin
    output:
        dir = directory(expand("{dirname}/GFF/roary_results/", dirname=dirname)),
    params:
        script_dir = script_dir
    message:
        "Launching roary."
    run:
        for i in input:
            shell("cd {script_dir} && python3 roary_launcher.py -i {i}")
rule cluster_fasta:
    # Launch the script for creating multi-fasta files corresponding to each identified cluster
    input:
        rules.roary.output.dir
    output:
        directory(expand("{dirname}/GFF/roary_results/pangenome_multifastas/", dirname=dirname))
    params:
        script_dir = script_dir
    message:
        "Clustering in multi-fasta format."
    run:
        for i in input:
            shell("cd {script_dir} && python3 pan_genome_maker_T.py -i {i}")
ChildIOException:
File/directory is a child to another output:
('../Sero3/GFF/roary_results', roary)
('../Sero3/GFF/roary_results/pangenome_multifastas', cluster_fasta)
There is no strict order of execution between rules that have no dependencies on each other. Your rule all lists several directories as targets, but they are nested, so only the innermost one is needed.
From Snakemake's point of view, the goal is to create one directory, "{dirname}/GFF/roary_results/pangenome_multifastas/", and the rest is irrelevant. What does the prepare_to_roary.py script do? I don't know, and neither does Snakemake.
Try to rethink your task in terms of the files that your pipeline produces, and disambiguate your intention.
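The condition reported in the error message, one output being a child of another output, can be reproduced with a small path check. This is a sketch of the idea only and is not Snakemake's internal check; `find_nested_outputs` is a hypothetical helper, and the paths are the two from the error message.

```python
from pathlib import PurePosixPath

def find_nested_outputs(outputs_by_rule):
    """Return (parent_rule, child_rule) pairs where one output path lies inside another."""
    conflicts = []
    items = [(rule, PurePosixPath(path)) for rule, path in outputs_by_rule.items()]
    for parent_rule, parent in items:
        for child_rule, child in items:
            # `child.parents` lists every ancestor directory of `child`.
            if parent in child.parents:
                conflicts.append((parent_rule, child_rule))
    return conflicts

outputs = {
    "roary": "../Sero3/GFF/roary_results",
    "cluster_fasta": "../Sero3/GFF/roary_results/pangenome_multifastas",
}
conflicts = find_nested_outputs(outputs)
print(conflicts)
```

Declaring the individual files that each script produces as outputs, rather than whole nested directories, is one way to avoid this kind of overlap.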

snakemake - output one only file from multiple input files in one rule

I'm using snakemake for the first time in order to build a basic pipeline using cutadapt, bwa and GATK (trimming, mapping, calling). I would like to run this pipeline on every fastq file contained in a directory, without having to specify their names in the snakefile or in the config file.
The first two steps (cutadapt and bwa / trimming and mapping) are running fine, but I'm encountering some problems with GATK.
First, I have to generate g.vcf files from bam files. I'm doing this using these rules:
configfile: "config.yaml"
import os
import glob
rule all:
    input:
        "merge_calling.g.vcf"
rule cutadapt:
    input:
        read="data/Raw_reads/{sample}_R1_{run}.fastq.gz",
        read2="data/Raw_reads/{sample}_R2_{run}.fastq.gz"
    output:
        R1=temp("trimmed_reads/{sample}_R1_{run}.fastq.gz"),
        R2=temp("trimmed_reads/{sample}_R2_{run}.fastq.gz")
    threads:
        10
    shell:
        "cutadapt -q {config[Cutadapt][Quality_value]} -m {config[Cutadapt][min_length]} -a {config[Cutadapt][forward_adapter]} -A {config[Cutadapt][reverse_adapter]} -o {output.R1} -p '{output.R2}' {input.read} {input.read2}"
rule bwa_map:
    input:
        genome="data/genome.fasta",
        read=expand("trimmed_reads/{{sample}}_{pair}_{{run}}.fastq.gz", pair=["R1", "R2"])
    output:
        temp("mapped_bam/{sample}_{run}.bam")
    threads:
        10
    params:
        rg="#RG\\tID:{sample}\\tPL:ILLUMINA\\tSM:{sample}"
    shell:
        "bwa mem -t 2 -R '{params.rg}' {input.genome} {input.read} | samtools view -Sb - > {output}"
rule picard_sort:
    input:
        "mapped_bam/{sample}.bam"
    output:
        "sorted_reads/{sample}.bam"
    shell:
        "java -Xmx4g -jar /home/alexandre/picard-tools/picard.jar SortSam I={input} O={output} SO=coordinate VALIDATION_STRINGENCY=SILENT"
rule picard_rmdup:
    input:
        bam="sorted_reads/{sample}.bam"
    output:
        "rmduped_reads/{sample}.bam",
        "picard_stats/{sample}.bam"
    params:
        reads="rmduped_reads/{sample}.bam",
        stats="picard_stats/{sample}.bam",
    shell:
        "java -jar -Xmx2g /home/alexandre/picard-tools/picard.jar MarkDuplicates "
        "I={input.bam} "
        "O='{params.reads}' "
        "VALIDATION_STRINGENCY=SILENT "
        "MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 "
        "REMOVE_DUPLICATES=TRUE "
        "M='{params.stats}'"
rule samtools_index:
    input:
        "rmduped_reads/{sample}.bam"
    output:
        "rmduped_reads/{sample}.bam.bai"
    shell:
        "samtools index {input}"
rule GATK_raw_calling:
    input:
        bam="rmduped_reads/{sample}.bam",
        bai="rmduped_reads/{sample}.bam.bai",
        genome="data/genome.fasta"
    output:
        "Raw_calling/{sample}.g.vcf",
    shell:
        "java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -ploidy 2 --emitRefConfidence GVCF -T HaplotypeCaller -R {input.genome} -I {input.bam} --genotyping_mode DISCOVERY -o {output}"
These rules work fine. For example, if I have the files :
Cla001d_S281_L001_R1_001.fastq.gz
Cla001d_S281_L001_R2_001.fastq.gz
I can create one bam file (Cla001d_S281_L001_001.bam) and from that bam file create a GVCF file (Cla001d_S281_L001_001.g.vcf). I have a lot of samples like this one, and I need to create one GVCF file for each, and then merge these GVCF files into one file. The problem is that I'm unable to pass the list of files to merge to the following rule:
rule GATK_merge:
    input:
        ???
    output:
        "merge_calling.g.vcf"
    shell:
        "java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar "
        "-T CombineGVCFs "
        "-R data/genome.fasta "
        "--variant {input} "
        "-o {output}"
I tried several things in order to do that, but cannot succeed. The problem is the link between the two rules (GATK_raw_calling and GATK_merge that is supposed to merge the output of GATK_raw_calling). I can't output one single file if I'm specifying the output of GATK_raw_calling as the input of the following rule (Wildcards in input files cannot be determined from output files), and I'm unable to make a link between the two rules if I'm not specifying these files as an input...
Is there a way to do that? I think the difficulty is that I'm not defining a list of sample names anywhere.
Thank you in advance for your help.
You can try to generate a list of sample IDs using glob_wildcards on the initial fastq.gz files:
sample_ids, run_ids = glob_wildcards("data/Raw_reads/{sample}_R1_{run}.fastq.gz")
Then, you can use this to expand the input of GATK_merge:
rule GATK_merge:
    input:
        expand("Raw_calling/{sample}_{run}.g.vcf",
               sample=sample_ids, run=run_ids)
If the same run ID always comes with the same sample ID, you will need to zip instead of expanding, in order to avoid non-existent combinations:
rule GATK_merge:
    input:
        ["Raw_calling/{sample}_{run}.g.vcf".format(
            sample=sample_id,
            run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)]
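The difference between the two variants is easiest to see on a small list. The sketch below uses plain Python rather than Snakemake; itertools.product stands in for the default all-combinations behaviour of expand, and the second sample and the run IDs are invented for illustration.

```python
from itertools import product

# The first sample ID follows the file names in the question; the rest is hypothetical.
sample_ids = ["Cla001d_S281_L001", "Cla002d_S282_L002"]
run_ids = ["001", "002"]
pattern = "Raw_calling/{sample}_{run}.g.vcf"

# Every sample combined with every run: includes pairs that never existed.
all_combinations = [
    pattern.format(sample=s, run=r) for s, r in product(sample_ids, run_ids)
]

# Element-wise pairing: only the (sample, run) pairs that were actually seen.
paired = [pattern.format(sample=s, run=r) for s, r in zip(sample_ids, run_ids)]

print(len(all_combinations))
print(paired)
```

In Snakemake itself, the same pairing can also be written as expand("Raw_calling/{sample}_{run}.g.vcf", zip, sample=sample_ids, run=run_ids).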
You can achieve this by using a Python function as the input for your rule, as described in the Snakemake documentation on input functions.
It could look like this, for example:
# Define input files
def gatk_inputs(wildcards):
    files = expand("Raw_calling/{sample}.g.vcf", sample=<samples list>)
    return files
# Rule
rule gatk:
    input: gatk_inputs
    output: <output file name>
    run: ...
Hope this helps.