How to change the output directory in Snakemake without using cd

I'm trying to get GiRaF (https://github.com/sdwfrost/giraf) running on Kubernetes using Azure Blob Storage. It's not my code; I've just fixed a few errors and written a Dockerfile and a test Snakefile. I want to do repeat runs, and my solution for the local filesystem is:
# Set number of repeats
N = 2

def repeat_runs():
    files = []
    for i in range(0, N, 1):
        files.append("run_" + str(i + 1) + "/left-right_report")
    return files

rule all:
    input:
        repeat_runs()

rule giraf:
    input:
        "{r}/in.giraf"
    output:
        "{r}/left-right_report"
    params:
        infile="in.giraf"
    shell:
        "cd {wildcards.r};giraf {params.infile}"

rule copy_infile:
    input:
        "in.giraf"
    output:
        "{r}/in.giraf"
    shell:
        "cp {input} {output}"
However, I can't change directory like this when using Azure Blob Storage, although I can create and copy files. Has anyone encountered something like this before? GiRaF actually consists of multiple subprograms, so adding an output-directory argument to each of them would be more time-consuming.
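One way to avoid `cd` in the shell command is to let the launching process set the working directory of the child process. The sketch below is plain Python and is not part of GiRaF or Snakemake: `run_in_dir` is a hypothetical helper, the "tool" is a stand-in command that writes a report into its current directory, and whether this approach behaves correctly with Azure Blob Storage as remote storage is an untested assumption.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_dir(command, workdir):
    """Run `command` (a list of arguments) with `workdir` as its working directory.

    The working directory is set only for the child process, so the caller's
    own working directory is left untouched and no `cd` is needed.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    subprocess.run(command, cwd=workdir, check=True)
    return workdir


if __name__ == "__main__":
    # Stand-in for `giraf in.giraf`: a tiny command that writes a report
    # into whatever directory it is started in.
    fake_tool = [sys.executable, "-c", "open('left-right_report', 'w').write('done')"]
    with tempfile.TemporaryDirectory() as tmp:
        run_dir = run_in_dir(fake_tool, Path(tmp) / "run_1")
        print((run_dir / "left-right_report").read_text())
```

Inside a Snakefile, something similar could in principle be called from a `run:` block, with the directory derived from the wildcard; that wiring is left out here because it has not been tested against the setup described above.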

Related

Snakemake: catch output file whose name cannot be changed

As part of a Snakemake pipeline that I'm building, I have to use a program that does not allow me to specify the file path or name of an output file.
E.g. when running the program in the working directory workdir/ it produces the following output:
workdir/output.txt
My snakemake rule looks something like this:
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell: "somecommand {input} {output}"
So every time the rule NAME runs, I get an additional file output.txt in the snakemake working directory, which is then overwritten if the rule NAME runs multiple times or in parallel.
I'm aware of shadow rules, and adding shadow: "full" allows me to simply ignore the output.txt file. However, I'd like to keep output.txt and save it in the same directory as the outputfile. Is there a way of achieving this, either with the shadow directive or otherwise?
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
How about simply moving it directly afterwards in the shell part (provided somecommand completes successfully)?
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    params:
        output_dir = "path/to/output_dir",
    shell: "somecommand {input} {output} && mv output.txt {params.output_dir}/output.txt"
EDIT: for multiple executions of NAME in parallel, combining with shadow: "full" could work:
rule NAME:
    input: "path/to/inputfile"
    output:
        output_file = "path/to/outputfile",
        output_txt = "path/to/output_dir/output.txt"
    shadow: "full"
    shell: "somecommand {input} {output.output_file} && mv output.txt {output.output_txt}"
That should run each execution of the rule in its own temporary dir, and by specifying the moved output.txt as an output Snakemake should move it to the real output dir once the rule is done running.
Quoting the question: "I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME."
I think you are on the right track here. Each shell block is run in a separate process with the working directory inherited from the snakemake process (specified with the --directory argument on the command line). Accordingly, cd commands in one shell block will not affect other jobs from the same rule or other downstream/upstream jobs.
rule NAME:
    input: "path/to/inputfile"
    output: "path/to/outputfile"
    shell:
        """
        input_file=$(realpath "{input}") # get the absolute path, before the `cd`
        base_dir=$(dirname "{output}")
        cd "$base_dir"
        somecommand ...
        """
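The claim that a `cd` inside one shell block does not leak into other jobs follows from how child processes work in general. The following is a minimal sketch in plain Python (not Snakemake code), assuming a POSIX shell where `cd` and `pwd` are available; the function name is made up for illustration. It runs `cd ... && pwd` in a child shell and shows that the parent's working directory is unchanged afterwards.

```python
import os
import subprocess
import tempfile


def cwd_before_and_after_child_cd(target_dir):
    """Run `cd target_dir && pwd` in a child shell and report working directories.

    Returns (parent cwd before, cwd seen by the child, parent cwd after).
    """
    before = os.getcwd()
    result = subprocess.run(
        f'cd "{target_dir}" && pwd',
        shell=True,
        check=True,
        capture_output=True,
        text=True,
    )
    child_cwd = result.stdout.strip()
    after = os.getcwd()
    return before, child_cwd, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        before, child_cwd, after = cwd_before_and_after_child_cd(tmp)
        print("parent before:", before)
        print("child after cd:", child_cwd)
        print("parent after:", after)
```

The child reports the temporary directory, while the parent's working directory is the same before and after the call.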

ChildIOException when creating a subdirectory of a subdirectory

I'm looking for help understanding an error.
I get a ChildIOException when I create a subdirectory inside a directory that was itself created by an earlier rule.
Basically, my first rule runs a script that creates a directory with a couple of subdirectories and files. My second rule then takes one particular subdirectory and, through another script, creates a new subdirectory alongside it in the same parent directory. My third rule takes that new subdirectory and creates yet another subdirectory (with other files) inside it.
I don't understand why rule 2 works while rule 3 doesn't.
My workflow is as follows:
configfile: "config.yaml"

dirname = config["dirname"].values()
script_dir = config["script_dir"]

rule all:
    # Contain all output
    input:
        expand(["{dirname}/GFF/","{dirname}/GFF/final_gffs/", "{dirname}/GFF/roary_results/",
                "{dirname}/GFF/roary_results/pangenome_multifastas/"], dirname=dirname)

rule prepa_gff:
    # Transform gbff files to gff through prepare to roary
    input:
        expand("{dirname}/GenBank/",dirname=dirname)
    output:
        gff_dir = directory(expand("{dirname}/GFF/",dirname=dirname)),
        gff_fin = directory(expand("{dirname}/GFF/final_gffs/",dirname=dirname))
    params:
        script_dir = script_dir
    message:
        "Converting gbff files into gff files."
    run:
        for dir in dirname:
            shell("cd {script_dir} && python3 prepare_to_roary.py -i {dir}/GenBank -o {dir}/GFF")

rule roary:
    # Launch roary, with the script itself launching the cluster for operating
    input:
        rules.prepa_gff.output.gff_fin
    output:
        dir = directory(expand("{dirname}/GFF/roary_results/", dirname=dirname)),
    params:
        script_dir = script_dir
    message:
        "Launching roary."
    run:
        for i in input:
            shell("cd {script_dir} && python3 roary_launcher.py -i {i}")

rule cluster_fasta:
    # Launch the script for creating multi-fasta files corresponding to each identified cluster
    input:
        rules.roary.output.dir
    output:
        directory(expand("{dirname}/GFF/roary_results/pangenome_multifastas/", dirname=dirname))
    params:
        script_dir = script_dir
    message:
        "Clustering in multi-fasta format."
    run:
        for i in input:
            shell("cd {script_dir} && python3 pan_genome_maker_T.py -i {i}")
ChildIOException:
File/directory is a child to another output:
('../Sero3/GFF/roary_results', roary)
('../Sero3/GFF/roary_results/pangenome_multifastas', cluster_fasta)
There is no strict order of execution if the rules have no dependencies. Your rule all: lists several directories as targets, but they are nested, so only the last is needed.
From the point of view of Snakemake, the goal is to create one directory: "{dirname}/GFF/roary_results/pangenome_multifastas/", and the rest is irrelevant. What does the prepare_to_roary.py script do? I don't know, and neither does Snakemake.
Try to rethink your task in terms of the files that your pipeline produces, and disambiguate your intention.
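To see which declared outputs conflict in this way, the paths can be compared directly. The sketch below is plain Python and only illustrates the kind of parent/child relationship the error message reports; it is not Snakemake's actual check, and `find_nested_outputs` is a hypothetical helper. The two paths are the ones from the ChildIOException above.

```python
from itertools import permutations
from pathlib import PurePosixPath


def find_nested_outputs(outputs):
    """Return (parent_rule, child_rule) pairs where one output lies inside another.

    `outputs` maps a rule name to the output path that rule declares.
    """
    nested = []
    for (parent_rule, parent_path), (child_rule, child_path) in permutations(
        outputs.items(), 2
    ):
        # A path is nested if the other path is one of its ancestors.
        if PurePosixPath(parent_path) in PurePosixPath(child_path).parents:
            nested.append((parent_rule, child_rule))
    return nested


outputs = {
    "roary": "../Sero3/GFF/roary_results/",
    "cluster_fasta": "../Sero3/GFF/roary_results/pangenome_multifastas/",
}
print(find_nested_outputs(outputs))  # [('roary', 'cluster_fasta')]
```

Declaring files rather than nested directories as outputs makes such a list come back empty.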

running metabat2 with snakemake but not getting the bin files

I have been trying to run metabat2 with Snakemake. It runs, but the output files in metabat2/ are missing. The CheckM step that runs after it does use the data and works; I just can't find the files afterwards. Numbered files should be created, but it is impossible to predict how many there will be. Is there a way to specify the output so that the files are created in that directory?
rule all:
    [f"metabat2/" for sample in samples],
    [f"checkm/" for sample in samples]

rule metabat2:
    input:
        "input/consensus.fasta"
    output:
        directory("metabat2/")
    conda:
        "envs/metabat2.yaml"
    shell:
        "metabat2 -i {input} -o {output} -v"

rule checkM:
    input:
        "metabat2/"
    output:
        c = "bacteria/CheckM.txt",
        d = directory("checkm/")
    conda:
        "envs/metabat2.yaml"
    shell:
        "checkm lineage_wf -f {output.c} -t 10 -x fa {input} {output.d}"
The normal command to run metabat2 would be
metabat2 -i path/to/consensus.fasta -o /outputdir/bin -v
This creates files named bin.[number].fa in outputdir.
I can't tell what the problem is but I have a couple of suggestions...
[f"metabat2/" for sample in samples]: I doubt this will do what you expect, as it simply creates a list with the string metabat2/ repeated len(samples) times. Maybe you want [f"metabat2/{sample}" for sample in samples]? The same goes for [f"checkm/" for sample in samples].
The samples variable is not used anywhere in the rules following all. I suspect somewhere it should be used and/or you should use something like output: directory("metabat2/{sample}")
Execute snakemake with -p option to see what commands are executed. It may be useful to post the stdout from it.
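The first suggestion can be checked in plain Python, outside Snakemake. The sample names below are hypothetical placeholders, since the question does not show how samples is defined.

```python
samples = ["sampleA", "sampleB"]  # hypothetical sample names

# The f-string has no placeholder, so every element is the same string.
same_target = [f"metabat2/" for sample in samples]

# With {sample} in the f-string, each sample gets its own path.
per_sample = [f"metabat2/{sample}" for sample in samples]

print(same_target)  # ['metabat2/', 'metabat2/']
print(per_sample)   # ['metabat2/sampleA', 'metabat2/sampleB']
```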

Does Snakefile location matter?

I am an absolute beginner with Snakemake, and I am building a pipeline as I learn. My question: if the Snakefile is placed in the same directory as the data file I want to process, a NameError occurs, but if I move the Snakefile to the parent directory and edit the paths in input: and output:, the code works. What am I missing?
rule sra_convert:
    input:
        "rna/{id}.sra"
    output:
        "rna/fastq/{id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
The code above works fine when I run it with
snakemake -p rna/fastq/SRR873382.fastq
However, if I move the file to "rna" directory where the SRR873382.sra file is and edit the code as below
rule sra_convert:
    input:
        "{id}.sra"
    output:
        "fastq/{id}.fastq"
    message:
        "Converting from {id}.sra to {id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
and run
snakemake -p fastq/SRR873382.fastq
I get the following error
Building DAG of jobs...
Job counts:
    count   jobs
    1       sra_convert
    1
RuleException in line 7 of /home/sarc/Data/rna/Snakefile:
NameError: The name 'id' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
Solution
rule sra_convert:
    input:
        "{id}.sra"
    output:
        "fastq/{id}.fastq"
    message:
        "Converting from {wildcards.id}.sra to {wildcards.id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
The code above runs without error.
I believe that the best source that answers your actual question is:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
If the rule’s output matches a requested file, the substrings matched
by the wildcards are propagated to the input files and to the variable
wildcards, that is here also used in the shell command. The wildcards
object can be accessed in the same way as input and output, which is
described above.
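The quoted passage can be illustrated with a small stand-alone sketch. It is a simplified model in plain Python, not Snakemake's implementation: `match_wildcards` is a hypothetical helper that assumes each wildcard name appears only once in the pattern, and Python's own str.format stands in for Snakemake's message formatting.

```python
import re
from types import SimpleNamespace


def match_wildcards(pattern, target):
    """Return {wildcard: value} if `target` matches `pattern`, else None."""
    # Splitting on {name} yields literal text at even indices and
    # wildcard names at odd indices.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)
        else:
            regex += f"(?P<{part}>.+)"
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None


values = match_wildcards("fastq/{id}.fastq", "fastq/SRR873382.fastq")
wildcards = SimpleNamespace(**values)

# The matched value is reachable through the wildcards object ...
message = "Converting from {wildcards.id}.sra to {wildcards.id}.fastq".format(
    wildcards=wildcards
)
print(message)

# ... whereas a bare {id} is not a known name when the message is formatted.
try:
    "Converting from {id}.sra".format(wildcards=wildcards)
except KeyError as error:
    print("unknown name:", error)
```

In this model the bare {id} fails with a KeyError; Snakemake reports the analogous situation as the NameError shown in the question.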

snakemake: Use different folders for input and output files

This is most likely a very basic issue, but I could not find it documented anywhere.
rule all:
    input:
        "fasta_file.fna"
    output:
        "headers.txt"
    shell:
        "grep '^>' {input} > {output}"
I want to run this for a set of files that are not necessarily in the same folder. Is there a way to provide as command (or config file) the input file name from another directory?
Okay, never mind; this was probably not a smart question.
rule all:
    input:
        "input/{sample}.fna"
    output:
        "output/{sample}_headers.txt"
    shell:
        "grep '^>' {input} > {output}"
And with that I can just run snakemake for my target file, something like snakemake output/A1_headers.txt, or build a for loop over my input sequences.
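The loop over input sequences mentioned above could look roughly like the following. This is a sketch in plain Python under the naming scheme of the rule (input/{sample}.fna to output/{sample}_headers.txt); `header_targets` and the example file names are made up for illustration.

```python
from pathlib import PurePosixPath


def header_targets(fasta_paths, output_dir="output"):
    """Map each input FASTA path to the header file the rule would produce."""
    return [
        f"{output_dir}/{PurePosixPath(path).stem}_headers.txt"
        for path in fasta_paths
    ]


targets = header_targets(["input/A1.fna", "input/B2.fna"])
print(targets)  # ['output/A1_headers.txt', 'output/B2_headers.txt']
```

The resulting paths can then be passed as targets on the snakemake command line, in the same way as output/A1_headers.txt above.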