I'm looking for help in understanding an error.
I get a ChildIOException when I create a subdirectory inside a directory that was itself created as a subdirectory by a previous rule.
Basically, I have a rule that creates a directory with a couple of subdirectories and files through a first script. My second rule then takes one particular subdirectory and, through another script, creates a new directory inside its parent. My third rule takes that new subdirectory and creates yet another one inside it (along with other files).
I don't understand why rule 2 works while the third doesn't.
My workflow is as follows:
configfile: "config.yaml"
dirname = config["dirname"].values()
script_dir = config["script_dir"]
rule all:
# Contain all output
input:
expand(["{dirname}/GFF/","{dirname}/GFF/final_gffs/", "{dirname}/GFF/roary_results/",
"{dirname}/GFF/roary_results/pangenome_multifastas/"], dirname=dirname)
rule prepa_gff:
# Transform gbff files to gff through prepare to roary
input:
expand("{dirname}/GenBank/",dirname=dirname)
output:
gff_dir = directory(expand("{dirname}/GFF/",dirname=dirname)),
gff_fin = directory(expand("{dirname}/GFF/final_gffs/",dirname=dirname))
params:
script_dir = script_dir
message:
"Converting gbff files into gff files."
run:
for dir in dirname:
shell("cd {script_dir} && python3 prepare_to_roary.py -i {dir}/GenBank -o {dir}/GFF")
rule roary:
# Launch roary, with the script itself launching the cluster for operating
input:
rules.prepa_gff.output.gff_fin
output:
dir = directory(expand("{dirname}/GFF/roary_results/", dirname=dirname)),
params:
script_dir = script_dir
message:
"Launching roary."
run:
for i in input:
shell("cd {script_dir} && python3 roary_launcher.py -i {i}")
rule cluster_fasta:
# Launch the script for creating multi-fasta files corresponding to each identified cluster
input:
rules.roary.output.dir
output:
directory(expand("{dirname}/GFF/roary_results/pangenome_multifastas/", dirname=dirname))
params:
script_dir = script_dir
message:
"Clustering in multi-fasta format."
run:
for i in input:
shell("cd {script_dir} && python3 pan_genome_maker_T.py -i {i}")
ChildIOException:
File/directory is a child to another output:
('../Sero3/GFF/roary_results', roary)
('../Sero3/GFF/roary_results/pangenome_multifastas', cluster_fasta)
There is no strict order of execution if the rules have no dependencies. Your rule all: lists four directories as targets, but they are nested, so only the deepest ones are needed.
From Snakemake's point of view, the goal is essentially to create the directory "{dirname}/GFF/roary_results/pangenome_multifastas/"; the rest is implied. What does the prepare_to_roary.py script do? I don't know, and neither does Snakemake.
Try to rethink your task in terms of the files that your pipeline produces, and disambiguate your intention.
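One way to express that (a minimal sketch with made-up flag-file names, assuming plain marker files are an acceptable way to track each step) is to have every rule produce a flag file that is not nested inside any other rule's output, and let the scripts create the directories themselves:

rule all:
    input:
        expand("{dirname}/clusters_done.flag", dirname=dirname)

rule prepa_gff:
    input:
        "{dirname}/GenBank/"
    output:
        # Marker file only; the script itself creates {dirname}/GFF and its contents
        touch("{dirname}/gff_done.flag")
    params:
        script_dir = script_dir
    shell:
        "cd {params.script_dir} && python3 prepare_to_roary.py -i {wildcards.dirname}/GenBank -o {wildcards.dirname}/GFF"

rule roary:
    input:
        "{dirname}/gff_done.flag"
    output:
        touch("{dirname}/roary_done.flag")
    params:
        script_dir = script_dir
    shell:
        "cd {params.script_dir} && python3 roary_launcher.py -i {wildcards.dirname}/GFF/final_gffs"

rule cluster_fasta:
    input:
        "{dirname}/roary_done.flag"
    output:
        touch("{dirname}/clusters_done.flag")
    params:
        script_dir = script_dir
    shell:
        "cd {params.script_dir} && python3 pan_genome_maker_T.py -i {wildcards.dirname}/GFF/roary_results"

Because no declared output is a parent directory of another rule's output, the ChildIOException disappears, and the dependency chain between the three rules is still explicit.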
Related
As part of a Snakemake pipeline that I'm building, I have to use a program that does not allow me to specify the file path or name of an output file.
E.g. when running the program in the working directory workdir/ it produces the following output:
workdir/output.txt
My snakemake rule looks something like this:
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell: "somecommand {input} {output}"
So every time the rule NAME runs, I get an additional file output.txt in the snakemake working directory, which is then overwritten if the rule NAME runs multiple times or in parallel.
I'm aware of shadow rules, and adding shadow: "full" allows me to simply ignore the output.txt file. However, I'd like to keep output.txt and save it in the same directory as the outputfile. Is there a way of achieving this, either with the shadow directive or otherwise?
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
How about simply moving it directly afterwards in the shell part (provided somecommand completes successfully)?
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    params:
        output_dir = "path/to/output_dir",
    shell: "somecommand {input} {output} && mv output.txt {params.output_dir}/output.txt"
EDIT: for multiple executions of NAME in parallel, combining with shadow: "full" could work:
rule NAME:
    input: "path/to/inputfile"
    output:
        output_file = "path/to/outputfile",
        output_txt = "path/to/output_dir/output.txt"
    shadow: "full"
    shell: "somecommand {input} {output.output_file} && mv output.txt {output.output_txt}"
That should run each execution of the rule in its own temporary dir, and by specifying the moved output.txt as an output Snakemake should move it to the real output dir once the rule is done running.
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
I think you are on the right track here. Each shell block is run in a separate process with the working directory inherited from the snakemake process (specified with the --directory argument on the command line). Accordingly, cd commands in one shell block will not affect other jobs from the same rule or other downstream/upstream jobs.
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell:
        """
        input_file=$(realpath "{input}")  # get the absolute path, before the `cd`
        base_dir=$(dirname "{output}")
        cd "$base_dir"
        somecommand ...
        """
I'm trying to get GiRaF (https://github.com/sdwfrost/giraf) running on Kubernetes using Azure Blob Storage - it's not my code, I've just fixed a few errors, wrote a Dockerfile, and a test Snakefile. I want to do repeat runs, so my solution for the local filesystem is here:
# Set number of repeats
N = 2

def repeat_runs():
    files = []
    for i in range(N):
        files.append("run_" + str(i + 1) + "/left-right_report")
    return files

rule all:
    input:
        repeat_runs()

rule giraf:
    input:
        "{r}/in.giraf"
    output:
        "{r}/left-right_report"
    params:
        infile="in.giraf"
    shell:
        "cd {wildcards.r}; giraf {params.infile}"

rule copy_infile:
    input:
        "in.giraf"
    output:
        "{r}/in.giraf"
    shell:
        "cp {input} {output}"
However, I can't change directory like this when using Azure Blob Storage, although I can create and copy files. Has anyone encountered something like this before? GiRaF is actually several subprograms, so it would be more time-consuming to add an argument for the output directory.
I am using Snakemake to submit jobs to the cluster. I am facing a situation where I would like to force a particular rule to run only after all other rules have run - this is because the input files for this job (R script) are not yet ready.
I happened to see this on the Snakemake documentation page where it states one can force rule execution order - https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#flag-files
I have different rules, but for the sake of simplicity I am showing my Snakefile and the last two rules below (rsem_model and tximport_rsem). In my qsub cluster workflow, I want tximport_rsem to execute only after rsem_model has finished; I tried the "touchfile" method but I am not able to get it working.
# Snakefile
rule all:
    input:
        expand("results/fastqc/{sample}_fastqc.zip", sample=samples),
        expand("results/bbduk/{sample}_trimmed.fastq", sample=samples),
        expand("results/bbduk/{sample}_trimmed_fastqc.zip", sample=samples),
        expand("results/bam/{sample}_Aligned.toTranscriptome.out.bam", sample=samples),
        expand("results/bam/{sample}_ReadsPerGene.out.tab", sample=samples),
        expand("results/quant/{sample}.genes.results", sample=samples),
        expand("results/quant/{sample}_diagnostic.pdf", sample=samples),
        expand("results/multiqc/project_QS_STAR_RSEM_trial.html"),
        expand("results/rsem_tximport/RSEM_GeneLevel_Summarization.csv"),
        expand("mytask.done")

rule clean:
    shell: "rm -rf .snakemake/"

include: 'rules/fastqc.smk'
include: 'rules/bbduk.smk'
include: 'rules/fastqc_after.smk'
include: 'rules/star_align.smk'
include: 'rules/rsem_norm.smk'
include: 'rules/rsem_model.smk'
include: 'rules/tximport_rsem.smk'
include: 'rules/multiqc.smk'

rule rsem_model:
    input:
        'results/quant/{sample}.genes.results'
    output:
        'results/quant/{sample}_diagnostic.pdf'
    params:
        plotmodel = config['rsem_plot_model'],
        prefix = 'results/quant/{sample}',
        touchfile = 'mytask.done'
    threads: 16
    priority: 60
    shell:
        """
        touch {params.touchfile}
        {params.plotmodel} {params.prefix} {output}
        """

rule tximport_rsem:
    input: 'mytask.done'
    output:
        'results/rsem_tximport/RSEM_GeneLevel_Summarization.csv'
    priority: 50
    shell: "Rscript scripts/RSEM_tximport.R"
Here is the error I get when I try to do a dry run:
snakemake -np
Building DAG of jobs...
MissingInputException in line 1 of /home/yh6314/rsem/tutorials/QS_Snakemake/rules/tximport_rsem.smk:
Missing input files for rule tximport_rsem:
mytask.done
One important thing to note: if I run this on the head node, I do not need the touch-file workaround and everything works fine.
I would appreciate suggestions and help to figure out a workaround.
Thanks in advance.
Rule tximport_rsem will be executed only after all jobs from rule rsem_model are completed (based on the comments). Hence, the intermediate file mytask.done is unnecessary in this scenario. Using the output files of rule rsem_model for all samples as input to rule tximport_rsem will suffice.
rule rsem_model:
    input:
        'results/quant/{sample}.genes.results'
    output:
        pdf = 'results/quant/{sample}_diagnostic.pdf'
    params:
        plotmodel = config['rsem_plot_model'],
        prefix = 'results/quant/{sample}'
    shell:
        """
        {params.plotmodel} {params.prefix} {output.pdf}
        """

rule tximport_rsem:
    input:
        expand('results/quant/{sample}_diagnostic.pdf', sample=sample_names)
    output:
        'results/rsem_tximport/RSEM_GeneLevel_Summarization.csv'
    shell:
        "Rscript scripts/RSEM_tximport.R"
I specified a workdir in the YAML config file to use with snakemake as follows:
$> cat config.yaml
workdir: "/home/lina/test_output"
id: 1234

$> cat Snakefile
rule all:
    input:
        data_out = expand("cat_out/{id}_times_two.txt", id = config['id'])

rule double_print:
    input:
        data = expand("data/{id}.txt", id = config['id'])
    output:
        data_out = expand("cat_out/{id}_times_two.txt", id = config['id'])
    shell:
        'cat {input.data} {input.data} > {output.data_out}'

$> snakemake --configfile=config.yaml
However, once I ran my snakemake command, the output was generated in the directory where the Snakefile resides. The Snakefile did pick up the id parameter I specified in the config file, so the config file is being read and at least the id parameter is interpreted.
How should I modify the config file or my snakemake command to make sure the output ends up in the workdir I specified?
Thanks much!
You need to add workdir to the snakefile, not the configuration.
But you can set it dynamically, so in the snakefile, write:
workdir: config['workdir']
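For example, a minimal sketch of the Snakefile from the question with that directive added (assuming the same config.yaml as above):

configfile: "config.yaml"

# All relative paths below are now resolved against this directory,
# i.e. /home/lina/test_output from the config file.
workdir: config["workdir"]

rule all:
    input:
        data_out = expand("cat_out/{id}_times_two.txt", id=config["id"])

rule double_print:
    input:
        data = expand("data/{id}.txt", id=config["id"])
    output:
        data_out = expand("cat_out/{id}_times_two.txt", id=config["id"])
    shell:
        "cat {input.data} {input.data} > {output.data_out}"

Note that input paths such as data/{id}.txt are then also looked up relative to the workdir.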
I'm using Snakemake for the first time in order to build a basic pipeline using cutadapt, bwa and GATK (trimming; mapping; calling). I would like to run this pipeline on every fastq file contained in a directory, without having to specify their names in the Snakefile or in the config file.
The first two steps (cutadapt and bwa / trimming and mapping) are running fine, but I'm encountering some problems with GATK.
First, I have to generate g.vcf files from bam files. I'm doing this using these rules:
configfile: "config.yaml"
import os
import glob
rule all:
input:
"merge_calling.g.vcf"
rule cutadapt:
input:
read="data/Raw_reads/{sample}_R1_{run}.fastq.gz",
read2="data/Raw_reads/{sample}_R2_{run}.fastq.gz"
output:
R1=temp("trimmed_reads/{sample}_R1_{run}.fastq.gz"),
R2=temp("trimmed_reads/{sample}_R2_{run}.fastq.gz")
threads:
10
shell:
"cutadapt -q {config[Cutadapt][Quality_value]} -m {config[Cutadapt][min_length]} -a {config[Cutadapt][forward_adapter]} -A {config[Cutadapt][reverse_adapter]} -o {output.R1} -p '{output.R2}' {input.read} {input.read2}"
rule bwa_map:
input:
genome="data/genome.fasta",
read=expand("trimmed_reads/{{sample}}_{pair}_{{run}}.fastq.gz", pair=["R1", "R2"])
output:
temp("mapped_bam/{sample}_{run}.bam")
threads:
10
params:
rg="#RG\\tID:{sample}\\tPL:ILLUMINA\\tSM:{sample}"
shell:
"bwa mem -t 2 -R '{params.rg}' {input.genome} {input.read} | samtools view -Sb - > {output}"
rule picard_sort:
input:
"mapped_bam/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"java -Xmx4g -jar /home/alexandre/picard-tools/picard.jar SortSam I={input} O={output} SO=coordinate VALIDATION_STRINGENCY=SILENT"
rule picard_rmdup:
input:
bam="sorted_reads/{sample}.bam"
output:
"rmduped_reads/{sample}.bam",
"picard_stats/{sample}.bam"
params:
reads="rmduped_reads/{sample}.bam",
stats="picard_stats/{sample}.bam",
shell:
"java -jar -Xmx2g /home/alexandre/picard-tools/picard.jar MarkDuplicates "
"I={input.bam} "
"O='{params.reads}' "
"VALIDATION_STRINGENCY=SILENT "
"MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000 "
"REMOVE_DUPLICATES=TRUE "
"M='{params.stats}'"
rule samtools_index:
input:
"rmduped_reads/{sample}.bam"
output:
"rmduped_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule GATK_raw_calling:
input:
bam="rmduped_reads/{sample}.bam",
bai="rmduped_reads/{sample}.bam.bai",
genome="data/genome.fasta"
output:
"Raw_calling/{sample}.g.vcf",
shell:
"java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar -ploidy 2 --emitRefConfidence GVCF -T HaplotypeCaller -R {input.genome} -I {input.bam} --genotyping_mode DISCOVERY -o {output}"
These rules work fine. For example, if I have the files:
Cla001d_S281_L001_R1_001.fastq.gz
Cla001d_S281_L001_R2_001.fastq.gz
I can create one bam file (Cla001d_S281_L001_001.bam) and, from that bam file, create a GVCF file (Cla001d_S281_L001_001.g.vcf). I have a lot of samples like this one, and I need to create one GVCF file for each and then merge these GVCF files into one file. The problem is that I'm unable to give the list of files to merge to the following rule:
rule GATK_merge:
    input:
        ???
    output:
        "merge_calling.g.vcf"
    shell:
        "java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar "
        "-T CombineGVCFs "
        "-R data/genome.fasta "
        "--variant {input} "
        "-o {output}"
I tried several things to achieve that, but without success. The problem is the link between the two rules (GATK_raw_calling and GATK_merge, which is supposed to merge the outputs of GATK_raw_calling). If I specify the output of GATK_raw_calling as the input of the following rule, I can't produce one single file (Wildcards in input files cannot be determined from output files), and if I don't specify these files as input, I can't link the two rules at all...
Is there a way to do this? The difficulty is that I'm not defining a list of names or anything like that, I think.
Thank you in advance for your help.
You can try to generate a list of sample IDs using glob_wildcards on the initial fastq.gz files:
sample_ids, run_ids = glob_wildcards("data/Raw_reads/{sample}_R1_{run}.fastq.gz")
Then, you can use this to expand the input of GATK_merge:
rule GATK_merge:
    input:
        expand("Raw_calling/{sample}_{run}.g.vcf",
               sample=sample_ids, run=run_ids)
If the same run ID always comes with the same sample ID, you will need to zip instead of expanding, in order to avoid non-existing combinations:
rule GATK_merge:
    input:
        ["Raw_calling/{sample}_{run}.g.vcf".format(
            sample=sample_id,
            run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)]
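As a side note, expand can also take zip as a combinator (assuming a reasonably recent Snakemake version), which is a more compact way to write the zipped version:

rule GATK_merge:
    input:
        # zip pairs sample_ids[i] with run_ids[i] instead of taking the product
        expand("Raw_calling/{sample}_{run}.g.vcf",
               zip, sample=sample_ids, run=run_ids)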
You can achieve this by using a Python function as an input for your rule, as described in the Snakemake documentation here.
It could look like this, for example:
# Define input files
def gatk_inputs(wildcards):
    files = expand("Raw_calling/{sample}.g.vcf", sample=<samples list>)
    return files

# Rule
rule gatk:
    input: gatk_inputs
    output: <output file name>
    run: ...
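For instance, with the sample and run IDs taken from the raw reads via glob_wildcards as in the other answer, the function could be filled in like this; the params trick for building one --variant flag per file is illustrative, not taken from the question:

# Hypothetical sketch: derive IDs from whatever raw reads are present.
sample_ids, run_ids = glob_wildcards("data/Raw_reads/{sample}_R1_{run}.fastq.gz")

def gatk_inputs(wildcards):
    # One GVCF per (sample, run) pair produced by GATK_raw_calling
    return expand("Raw_calling/{sample}_{run}.g.vcf",
                  zip, sample=sample_ids, run=run_ids)

rule GATK_merge:
    input:
        gatk_inputs
    output:
        "merge_calling.g.vcf"
    params:
        # CombineGVCFs expects a separate --variant flag for each GVCF
        variants = lambda wildcards, input: " ".join("--variant " + f for f in input)
    shell:
        "java -Xmx4g -jar /home/alexandre/GenomeAnalysisTK-3.7/GenomeAnalysisTK.jar "
        "-T CombineGVCFs -R data/genome.fasta {params.variants} -o {output}"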
Hope this helps.