I am using some python scripts with snakemake to automate the workflow. These scripts take in command line arguments and, while I could replace them with snakemake.input[0], snakemake.output[0], etc, I am reluctant to do so since I'd also like to be able to use them outside of snakemake.
One natural way to solve this problem -- what I have been doing -- is to run them with the shell directive rather than the script directive. However, when I do this the dependency graph is broken: if I update my script, the DAG doesn't think anything needs to be re-run.
Is there a way to pass command line arguments to my scripts but still run them through the script directive?
Edit: an example
My python script
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("-o", type=str)
args = parser.parse_args()
with open(args.o, "w") as file:
    file.write("My file output")
My Snakefile
rule some_rule:
    output: "some_file_name.txt"
    shell: "python my_script.py -o {output}"
Based on a comment from @troy-comi, I've been doing the following, which -- while a bit of a hack -- does exactly what I want: I define the script as an input to the snakemake rule, which can actually help readability as well. A typical rule (this is not a full MWE) might look like
rule some_rule:
    input:
        files=expand("path_to_files/{f}", f=config["my_files"]),
        script="scripts/do_something.py"
    output: "path/to/my/output.txt"
    shell: "python {input.script} -i {input.files} -o {output}"
When I modify the scripts, it triggers a re-run; it's readable; and it doesn't require me to insert snakemake.output[0] into my python scripts (which would make them hard to recycle outside this workflow).
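Since the script keeps its own argparse interface, it can still be invoked on its own outside Snakemake; for example (file names are just placeholders):
python scripts/do_something.py -i file_a.txt file_b.txt -o standalone_output.txt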
Could you do something as simple as an if statement to get parameters either from snakemake or from the command line?
That is:
if "snakemake" in globals():
    # get parameters from the snakemake object
else:
    # get parameters some other way
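A minimal sketch of that pattern, assuming the script only needs a single output path (argument names are illustrative):
import argparse

def get_output_path():
    # When run via Snakemake's script directive, a `snakemake` object is
    # injected into the global namespace; otherwise fall back to argparse.
    if "snakemake" in globals():
        return snakemake.output[0]
    parser = argparse.ArgumentParser()
    parser.add_argument("-o", type=str, required=True)
    return parser.parse_args().o

with open(get_output_path(), "w") as file:
    file.write("My file output")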
Sounds like what you need is argparse in your python script. Here is an example in which the python script accepts arguments via the command line:
Python script example.py
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("-i", "--infile", help="Filepath")
parser.add_argument("-o", "--outfile", help="Filepath")
args = parser.parse_args()
infilepath = args.infile
outfilepath = args.outfile
# blah blah code
Snakefile
rule xx:
    input: "in.txt"
    output: "out.txt"
    shell: "python example.py -i {input} -o {output}"
PS - When I'm lazy, I just use the Fire library instead of argparse. Fire easily exposes functions/classes to the command line with a few lines of code.
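As an illustration only, a Fire version of the same script might look something like this (function and file names are arbitrary):
import fire

def run(infile, outfile):
    # Fire turns the function's parameters into command-line flags, e.g.
    #   python example_fire.py --infile in.txt --outfile out.txt
    with open(infile) as src, open(outfile, "w") as dst:
        dst.write(src.read())  # stand-in for the real processing

if __name__ == "__main__":
    fire.Fire(run)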
I am new to Snakemake and planning to use Snakemake for qsub in my cluster environment. In order to avoid critical mistakes that may disturb the cluster, I would like to check the qsub job scripts and the qsub command that Snakemake will generate before actually submitting the jobs to the queue.
Is it possible to see qsub job script files etc. in the dry run mode or in some other ways? I searched for relevant questions but could not find the answer. Thank you for your kind help.
Best,
Schuma
Using --printshellcmds, or its short version -p, together with --dry-run will allow you to see the commands snakemake will feed to qsub, but you won't see the qsub options.
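For example, either of the following performs a dry run that also prints each job's shell command:
snakemake --dry-run --printshellcmds
snakemake -n -p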
I don't know of any option that shows which parameters are given to qsub, but snakemake follows a simple set of rules, about which you can find detailed information here and here. As you'll see, you can feed arguments to qsub in multiple ways:
With default values --default-resources resource_name1=<value1> resource_name2=<value2> when invoking snakemake.
On a per-rule basis, using resources in rules (prioritized over default values).
With explicitly set values, either for the whole pipeline using --set-resources resource_name1=<value1> or for a specific rule using --set-resources rule_name:resource_name1=<value1> (prioritized over default and per-rule values)
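For example, invocations along these lines (rule and resource names are purely illustrative):
snakemake --default-resources mem_mb=1000 runtime_min=60
snakemake --set-resources some_rule:mem_mb=8000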
Suppose you have the following pipeline:
rule all:
    input:
        "input.txt"
    output:
        "output.txt"
    resources:
        mem_mb=2000,
        runtime_min=240
    shell:
        """
        some_command {input} {output}
        """
If you call qsub using the --cluster directive, you can access all keywords of your rules. Your command could then look like this:
snakemake all --cluster "qsub --runtime {resources.runtime_min} -l mem={resources.mem_mb}mb"
This means snakemake will submit the job to the cluster with the following command, just as if you had typed it directly on your command line:
qsub --runtime 240 -l mem=2000mb some_command input.txt output.txt
It is up to you to see which parameters you define where. You might want to check your cluster's documentation or with its administrator what parameters are required and what to avoid.
Also note that for cluster use, Snakemake documentation recommends setting up a profile which you can then use with snakemake --profile myprofile instead of having to specify arguments and default values each time.
Such a profile can be written in a ~/.config/snakemake/profile_name/config.yaml file. Here is an example of such a profile:
cluster: "qsub -l mem={resources.mem_mb}mb other_resource={resources.other_resource_name}"
jobs: 256
printshellcmds: true
rerun-incomplete: true
default-resources:
- mem_mb=1000
- other_resource_name="foo"
Invoking snakemake all --profile profile_name corresponds to invoking
snakemake all --cluster "qsub -l mem={resources.mem_mb}mb other_resource={resources.other_resource_name}" --jobs 256 --printshellcmds --rerun-incomplete --default-resources mem_mb=1000 other_resource_name="foo"
You may also want to define test rules, like a minimal example of your pipeline for instance, and try these first to verify all goes well before running your full pipeline.
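For instance, a test rule along these lines (entirely illustrative) lets you check that jobs reach the cluster and that resources are passed as expected before launching the real workload:
rule cluster_test:
    output:
        "cluster_test.txt"
    resources:
        mem_mb=100
    shell:
        "hostname > {output}"
It could then be run with something like snakemake cluster_test --profile profile_name.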
I'm sorry if my question seems a bit dumb.
I'm currently trying to write my first workflow with snakemake (as a trainee). I have to automate a couple of steps, and those steps all depend on python scripts that already exist.
My trouble is that the inputs and outputs of those scripts are folders themselves (whose contents correspond to the files of the first directory).
So far, I did this (which is not working, as we can expect):
configfile: "config.yaml"
rule all:
    input:
        "{dirname}/directory_results/sub_dir2", dirname=config["dirname"]
rule script1:
    input:
        "{dirname}/reference/{files}.gbff", dirname=config["dirname"]
    output:
        "{dirname}/directory_results", dirname=config["dirname"]
    shell:
        "python script_1.py -i {dirname}/reference -o {output}"
rule script2:
    input:
        "{dirname}/directory_results/sub_dir1/{files}.gbff.gff", dirname=config["dirname"]
    output:
        "{dirname}/directory_results/sub_dir2", dirname=config["dirname"]
    shell:
        "python script_2.py -i {dirname}/directory_results/sub_dir1"
As for config.yaml, it's a simple file that I'm using for now to hold the path of the said "dirname":
dirname:
  Sero_1: /project/work/test_snake/Sero_1
I know that there is much to refactor (I'm still not accustomed to snakemake since, besides the tutorial, this is my first workflow ever). I also understand that the problem probably lies in the fact that inputs can't be directories. I have tried a couple of things over the last few days, and I thought I might ask for advice since I'm struggling.
How can I write the inputs so that the scripts can work on directories?
In case it helps, I solved my rule script1 by doing:
configfile: "config.yaml"
dirname = config["dirname"]
rule all:
    input:
        expand("{dirname}/directory_results/", dirname=dirname),
        expand("{dirname}/directory_results/sub_dir2", dirname=dirname)
rule script1:
    input:
        expand("{dirname}/reference/", dirname=dirname)
    output:
        directory(expand("{dirname}/directory_results", dirname=dirname))
    shell:
        "python script_1.py -i {input} -o {output}"
rule script2:
    input:
        rules.script1.output
    output:
        directory(expand("{dirname}/directory_results/sub_dir2", dirname=dirname))
    shell:
        "python script_2.py -i {input}"
As for the config.yaml file:
dirname:
- /project/work/test_snake/Sero_1
- /project/work/test_snake/Sero_2
I have been trying to run metabat2 with snakemake. I can run it, but the output files in metabat2/ are missing. The checkM rule that runs after it does use the data and works; I just can't find the files afterwards. There should be files created with numbers in their names, but it is impossible to predict how many files will be created. Is there a way I can make sure that the files are created in that directory?
rule all:
    input:
        [f"metabat2/" for sample in samples],
        [f"checkm/" for sample in samples]
rule metabat2:
    input:
        "input/consensus.fasta"
    output:
        directory("metabat2/")
    conda:
        "envs/metabat2.yaml"
    shell:
        "metabat2 -i {input} -o {output} -v"
rule checkM:
    input:
        "metabat2/"
    output:
        c = "bacteria/CheckM.txt",
        d = directory("checkm/")
    conda:
        "envs/metabat2.yaml"
    shell:
        "checkm lineage_wf -f {output.c} -t 10 -x fa {input} {output.d}"
The normal command to run metabat2 would be
metabat2 -i path/to/consensus.fasta -o /outputdir/bin -v
This will create files named bin.[number].fa in outputdir.
I can't tell what the problem is but I have a couple of suggestions...
[f"metabat2/" for sample in samples]: I doubt this will do what you expect as it will simply create a list with the string metabat2/ repeat len(samples) times. Maybe you want [f"metabat2/{sample}" for sample in samples]? The same for [f"checkm/" for sample in samples]
The samples variable is not used anywhere in the rules following all. I suspect somewhere it should be used and/or you should use something like output: directory("metabat2/{sample}")
Execute snakemake with -p option to see what commands are executed. It may be useful to post the stdout from it.
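For illustration only, a sketch of the per-sample layout suggested above, assuming samples is a list of sample names defined earlier in the Snakefile and that metabat2 should run once per sample (the input pattern is a placeholder):
rule all:
    input:
        [f"metabat2/{sample}" for sample in samples]
rule metabat2:
    input:
        "input/{sample}_consensus.fasta"
    output:
        directory("metabat2/{sample}")
    conda:
        "envs/metabat2.yaml"
    shell:
        "metabat2 -i {input} -o {output}/bin -v"
With the -o {output}/bin prefix, metabat2 would then write its bins as metabat2/{sample}/bin.[number].fa, matching the behaviour described above.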
I am an absolute beginner with snakemake. I am building a pipeline as I learn. My question is: if the Snakefile is placed alongside the data file that I want to process, a NameError occurs, but if I move the Snakefile to a parent directory and edit the path information of input: and output:, the code works. What am I missing?
rule sra_convert:
    input:
        "rna/{id}.sra"
    output:
        "rna/fastq/{id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
The above code works fine when I run it with
snakemake -p rna/fastq/SRR873382.fastq
However, if I move the Snakefile to the "rna" directory where the SRR873382.sra file is, and edit the code as below,
rule sra_convert:
    input:
        "{id}.sra"
    output:
        "fastq/{id}.fastq"
    message:
        "Converting from {id}.sra to {id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
and run
snakemake -p fastq/SRR873382.fastq
I get the following error
Building DAG of jobs...
Job counts:
count jobs
1 sra_convert
1
RuleException in line 7 of /home/sarc/Data/rna/Snakefile:
NameError: The name 'id' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
Solution
rule sra_convert:
    input:
        "{id}.sra"
    output:
        "fastq/{id}.fastq"
    message:
        "Converting from {wildcards.id}.sra to {wildcards.id}.fastq"
    shell:
        "fastq-dump {input} -O {output}"
The above code runs fine without error.
I believe that the best source that answers your actual question is:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#wildcards
If the rule’s output matches a requested file, the substrings matched
by the wildcards are propagated to the input files and to the variable
wildcards, that is here also used in the shell command. The wildcards
object can be accessed in the same way as input and output, which is
described above.
I have these two rules:
all_participants = ['01','03','04','05','06','07','08']
rule all:
    input: expand("data/interim/tables/screen/p{participant_id}.csv", participant_id=all_participants)
rule extract_screen_table:
    output: "data/interim/tables/screen/p{participant_id}.csv"
    shell: "python src/data/sql_table_to_csv.py --table screen"
If I execute snakemake everything works, but if I change the code and execute: snakemake -n -R 'snakemake --list-code-changes' I get this error:
Building DAG of jobs...
MissingRuleException:
No rule to produce snakemake --list-code-changes (if you use input functions make sure that they don't raise unexpected exceptions).
The output of snakemake --list-code-changes is:
Building DAG of jobs...
data/interim/tables/screen/p03.csv
which I reckon shouldn't be the case; I should get the python script instead.
You have to use backticks around the list-code-changes call: `snakemake --list-code-changes`. This is bash syntax for executing the contained command and substituting its STDOUT into the command line.
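Applied to the command above, the corrected invocation would then look like:
snakemake -n -R `snakemake --list-code-changes`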