Does Snakefile location matter? - snakemake

I am absolute beginner to snakemake. I am building a pipeline as I learn. My question is if the Snakefile is placed with data file that I want to process an NameError: occurs but if I move the Snakefile to a parent directory and edit the path information of input: and output: the code works. what am I missing?
rule sra_convert:
input:
"rna/{id}.sra"
output:
"rna/fastq/{id}.fastq"
shell:
"fastq-dump {input} -O {output}"
above code works fine when I run with
snakemake -p rna/fastq/SRR873382.fastq
However, if I move the file to "rna" directory where the SRR873382.sra file is and edit the code as below
rule sra_convert:
input:
"{id}.sra"
output:
"fastq/{id}.fastq"
message:
"Converting from {id}.sra to {id}.fastq"
shell:
"fastq-dump {input} -O {output}"
and run
snakemake -p fastq/SRR873382.fastq
I get the following error
Building DAG of jobs...
Job counts:
count jobs
1 sra_convert
1
RuleException in line 7 of /home/sarc/Data/rna/Snakefile:
NameError: The name 'id' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
Solution
rule sra_convert:
input:
"{id}.sra"
output:
"fastq/{id}.fastq"
message:
"Converting from {wildcards.id}.sra to {wildcards.id}.fastq"
shell:
"fastq-dump {input} -O {output}"
above code runs fine without error

I believe that the best source that answers your actual question is:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
If the rule’s output matches a requested file, the substrings matched
by the wildcards are propagated to the input files and to the variable
wildcards, that is here also used in the shell command. The wildcards
object can be accessed in the same way as input and output, which is
described above.

Related

Snakemake: catch output file whose name cannot be changed

As part of a Snakemake pipeline that I'm building, I have to use a program that does not allow me to specify the file path or name of an output file.
E.g. when running the program in the working directory workdir/ it produces the following output:
workdir/output.txt
My snakemake rule looks something like this:
rule NAME:
input: "path/to/inputfile"
output: "path/to/outputfile"
shell: "somecommand {input} {output}"
So every time the rule NAME runs, I get an additional file output.txt in the snakemake working directory, which is then overwritten if the rule NAME runs multiple times or in parallel.
I'm aware of shadow rules, and adding shadow: "full" allows me to simply ignore the output.txt file. However, I'd like to keep output.txt and save it in the same directory as the outputfile. Is there a way of achieving this, either with the shadow directive or otherwise?
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
How about simply moving it directly afterwards in the shell part (provided somecommand completes successfully)?
rule NAME:
input: "path/to/inputfile"
output: "path/to/outputfile"
params:
output_dir = "path/to/output_dir",
shell: "somecommand {input} {output} && mv output.txt {params.output_dir}/output.txt"
EDIT: for multiple executions of NAME in parallel, combining with shadow: "full" could work:
rule NAME:
input: "path/to/inputfile"
output:
output_file = "path/to/outputfile"
output_txt = "path/to/output_dir/output.txt"
shadow: "full"
shell: "somecommand {input} {output.output_file} && mv output.txt {output.output_txt}"
That should run each execution of the rule in its own temporary dir, and by specifying the moved output.txt as an output Snakemake should move it to the real output dir once the rule is done running.
I was also thinking I could prepend somecommand with a cd command, but then I'd probably run into other issues downstream when linking up other rules to the outputs of the rule NAME.
I think you are on the right track here. Each shell block is run in a separate process with the working directory inherited from the snakemake process (specified with the --directory argument on the command line). Accordingly, cd commands in one shell block will not affect other jobs from the same rule or other downstream/upstream jobs.
rule NAME:
input: "path/to/inputfile"
output: "path/to/outputfile"
shell:
"""
input_file=$(realpath "{input}") # get the absolute path, before the `cd`
base_dir=$(dirname "{output}")
cd "$base_dir"
somecommand ...
"""

Snakemake cannot handle very long command line?

This is a very strange problem.
When my {input} specified in the rule section is a list of <200 files, snakemake worked all right.
But when {input} has more than 500 files, snakemake just quitted with messages (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!). The complete log did not provide any error messages.
For the log, please see: https://github.com/snakemake/snakemake/files/5285271/2020-09-25T151835.613199.snakemake.log
The rule that worked is (NOTE the input is capped to 200 files):
rule combine_fastq:
input:
lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')[:200]
output:
"combined.fastq/{sample}.fastq.gz"
group: "minion_assemble"
shell:
"""
echo {input} > {output}
"""
The rule that failed is:
rule combine_fastq:
input:
lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
output:
"combined.fastq/{sample}.fastq.gz"
group: "minion_assemble"
shell:
"""
echo {input} > {output}
"""
My question is also posted in GitHub: https://github.com/snakemake/snakemake/issues/643.
I second Maarten's answer, with that many files you are running up against a shell limit; snakemake is just doing a poor job helping you identify the problem.
Based on the issue you reference, it seems like you are using cat to combine all of your files. Maybe following the answer here would help:
rule combine_fastq_list:
input:
lambda wildcards: samples.loc[(wildcards.sample), ["fq"]].dropna()[0].split(',')
output:
temp("{sample}.tmp.list")
group: "minion_assemble"
script:
with open(output[0]) as out:
out.write('\n'.join(input))
rule combine_fastq:
input:
temp("{sample}.tmp.list")
output:
'combined.fastq/{sample}.fastq.gz'
group: "minion_assemble"
shell:
'cat {input} | ' # this is reading the list of files from the file
'xargs zcat -f | '
'...'
Hope it gets you on the right track.
edit
The first option executes your command separately for each input file. A different option that executes the command once for the whole list of input is:
rule combine_fastq:
...
shell:
"""
command $(< {input}) ...
"""
For those landing here with similar questions (like Snakemake expand function alternative), snakemake 6 can handle long command lines. The following test fails on snakemake < 6 but succeeds on 6.0.0 on my Ubuntu machine:
rule all:
input:
'output.txt',
rule one:
output:
'output.txt',
params:
x= list(range(0, 1000000))
shell:
r"""
echo {params.x} > {output}
"""

Using directory as input in snakemake for particulars scripts

i'm sorry if my question may seem a bit dumb.
So, i'm currently trying to write a workflow on snakemake (my first, as a trainee), i've to automate a couple of steps, those steps dependings all on python scripts already made.
My trouble is that the input and outputs of those scripts are folders themselves (and their content corresponding of files linked of the first directory content..).
So far, i did this (which is not working, as we can expect)
configfile: "config.yaml"
rule all:
input:
"{dirname}/directory_results/sub_dir2", dirname=config["dirname"]
rule script1:
input:
"{dirname}/reference/{files}.gbff", dirname=config["dirname"]
output:
"{dirname}/directory_results", dirname=config["dirname"]
shell:
"python script_1.py -i {dirname}/reference -o {output}"
rule script2:
input:
"{dirname}/directory_results/sub_dir1/{files}.gbff.gff", dirname=config["dirname"]
output:
"{dirname}/directory_results/sub_dir2", dirname=config["dirname"]
shell:
"python script_2.py -i {dirname}/directory_results/sub_dir1"
As for config.yaml, it's a simple file that i used for now, to put the path of the said "dirname"
dirname:
Sero_1: /project/work/test_snake/Sero_1
I know that there is much to refactor (i'm still not accustomed to snakemake since, beside the tutorial, it's my first workflow ever made). I also understand that the problem lie probably in the fact that, input can't be directories. I tried a couple of things since a couple of days, and i thought i may ask some advice since i'm struggling
How can i put an input that will permit to use for scripts directories?
If it may help, i solved my rule "script1" by doing:
configfile: "config.yaml"
dirname = config["dirname"]
rule all:
input:
expand("{dirname}/directory_results/", "{dirname}/directory_results/subdir2" dirname=dirname)
rule script1:
input:
expand("{dirname}/reference/", dirname=dirname)
output:
directory(expand("{dirname}/directory_results", dirname=dirname))
shell:
"python script_1.py -i {input} -o {output}"
rule script2:
input:
rules.script1.output
output:
directory(extend("{dirname}/directory_results/sub_dir2", dirname=dirname))
shell:
"python script_2.py -i {input}"
As for the config.yaml file:
dirname:
- /project/work/test_snake/Sero_1
- /project/work/test_snake/Sero_2

running metabat2 with snakemake but not getting the bin files

I have been trying to run metabat2 with snakemake. I can run it but the output files in metabat2/ are missing. The checkM that works after it does use the data and can work I just cant find the files later. There should be files created with numbers but it is imposible to predict how many files will be created. Is there a way I can specify it to make sure that the files are created in that file?
rule all:
[f"metabat2/" for sample in samples],
[f"checkm/" for sample in samples]
rule metabat2:
input:
"input/consensus.fasta"
output:
directory("metabat2/")
conda:
"envs/metabat2.yaml"
shell:
"metabat2 -i {input} -o {output} -v"
rule checkM:
input:
"metabat2/"
output:
c = "bacteria/CheckM.txt",
d = directory("checkm/")
conda:
"envs/metabat2.yaml"
shell:
"checkm lineage_wf -f {output.c} -t 10 -x fa {input} {output.d}"
the normal code to run metabat2 would be
metabat2 -i path/to/consensus.fasta -o /outputdir/bin -v
this will create in outputdir files with bin.[number].fa
I can't tell what the problem is but I have a couple of suggestions...
[f"metabat2/" for sample in samples]: I doubt this will do what you expect as it will simply create a list with the string metabat2/ repeat len(samples) times. Maybe you want [f"metabat2/{sample}" for sample in samples]? The same for [f"checkm/" for sample in samples]
The samples variable is not used anywhere in the rules following all. I suspect somewhere it should be used and/or you should use something like output: directory("metabat2/{sample}")
Execute snakemake with -p option to see what commands are executed. It may be useful to post the stdout from it.

snakemake: Use different folders for input and output files

This is most likely a very basic issue, but I could not find it documented anywhere.
rule all:
input:
"fasta_file.fna"
output:
"headers.txt"
shell:
"grep "^>" {input} > {output}"
I want to run this for a set of files that are not necessarily in the same folder. Is there a way to provide as command (or config file) the input file name from another directory?
Okay never mind, this was probably no smart question indeed.
rule all:
input:
"input/{sample}.fna"
output:
"output/{sample}_headers.txt"
shell:
"grep "^>" {input} > {output}"
And with that I can just run snakemake for my target file, something like snakemake output/A1_headers.txt, or build a for loop over my input sequences.