Accessing the --default-remote-prefix within the Snakefile - snakemake

When I run Snakemake with the Google Life Sciences executor, I run something like:
snakemake --google-lifesciences --default-remote-prefix my_bucket_name --preemption-default 10 --use-conda
Now, my_bucket_name is going to get added to all of the input and output paths.
BUT for various reasons I need to recreate the full path within the Snakefile code, and therefore I want to be able to access whatever is passed to --default-remote-prefix from within the code.
Is there a way to do this?

I want to be able to access whatever is passed to --default-remote-prefix within the code
You can use the workflow object, like:
print(workflow.default_remote_prefix)  # Will print my_bucket_name in your example

rule all:
    input: ...
I'm not 100% sure whether the workflow object is supposed to be used by the user or whether it's private to Snakemake; if the latter, it could change in the future without warning. But I think it's OK; I use workflow.basedir all the time to get the directory where the Snakefile sits.
Alternatively, you could parse the sys.argv list, but I think that is hackier.
Another option:
bucket_name=foo
snakemake --default-remote-prefix $bucket_name --config bucket_name=$bucket_name ...
then use config["bucket_name"] within the code to get the value foo. But I still prefer the workflow solution.
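To make the path reconstruction concrete, here is a minimal plain-Python sketch; the helper name full_remote_path is made up for illustration, and inside a Snakefile the prefix would come from workflow.default_remote_prefix (or config["bucket_name"] with the --config approach):

```python
# Sketch: rebuild the full remote path from the prefix Snakemake was given.
# In a Snakefile the prefix would be workflow.default_remote_prefix; here it
# is a plain argument so the helper can be shown in isolation.
def full_remote_path(prefix, relative_path):
    """Join a bucket prefix and a rule-relative path into one remote path."""
    return f"{prefix.rstrip('/')}/{relative_path.lstrip('/')}"

print(full_remote_path("my_bucket_name", "results/sample1.bam"))
# my_bucket_name/results/sample1.bam
```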

Related

Snakemake live user input

I have a bunch of R scripts that follow one another, and I wanted to connect them using Snakemake. But I'm running into a problem.
One of my R scripts shows two images and asks for the user's input on how many clusters are present. The R function for this is [readline].
The question about the number of clusters is asked, but the next line of code runs immediately afterwards, without an opportunity to input a number. The rest of the program then crashes, since trying to calculate (empty number) of clusters doesn't really work. Is there a way around this, e.g. by getting the values via a function/rule from Snakemake?
Or is there another way to work around this issue?
Based on my testing with snakemake v5.8.2 on macOS, this is not a Snakemake issue. The example setup below works without any problem.
File test.R
cat("What's your name? ")
x <- readLines(file("stdin"),1)
print(x)
File Snakefile
rule all:
    input:
        "a.txt",
        "b.txt"

rule test_rule:
    output:
        "{sample}.txt"
    shell:
        "Rscript test.R; touch {output}"
Executing them with the command snakemake -p behaves as expected: it asks for user input and then touches the output file.
I used the function readLines in the R script, but this example shows that the error you are facing is likely not a Snakemake issue.

Snakemake › Access multiple keys from config file

I have a question about the proper handling of the config file. I've been trying to solve my issue for a couple of days now, but with the best will in the world, I just can't find out how to do it. I know that this question is maybe quite similar to all the others here, and I really tried to use them, but I didn't really get it. I hope that some things about how Snakemake works will become clearer once I've solved this problem.
I'm just switching to Snakemake, and I thought I could easily convert my bash script. To get familiar with Snakemake I started with a simple data-processing pipeline. I know I could solve my case by defining every variable within the Snakefile, but I want to use an external config file.
First, for better understanding, I decided just to post the code which I thought would work somehow. I already played around with different versions of a "rule all" and the "lambda" functions, but nothing worked so far, and listing them all would just be confusing. I'm really a bit embarrassed and confused about why I can't get this working. The variable differs from the key because I always had a version where I redefine the variable, like:
sample = config["samples"]
I would be incredibly thankful for an example code.
What I'd like to have is:
The config file:
samples:
  - SRX1232390
  - SRX2312380
names:
  - SomeData
  - SomeControl
adapters:
  - GATCGTAGC
  - GATCAGTCG
And then I thought I can just call the keys like different variables.
rule download_fastq:
    output:
        "fastq/{name}.fastq.gz"
    shell:
        "fastq-dump {wildcards.sample} > {output}"
later there will be more rules, so I thought for them I also just need a key:
rule trimming_cutadapt:
    input:
        "fastq/{name}.fastq"
    output:
        "ctadpt_{name}.fastq"
    shell:
        "cutadapt -a {adapt}"
I also tried something with a config file like this:
samples:
Somedata: SRX1232131
SomeControl: SRX12323
But in the end I also didn't find the final solution nor would I know how to add a third "variable" then.
I hope it is somehow understandable what I want to have. It would be very awesome if someone could help me.
EDIT:
OK - I reworked my code and tried to dig into everything. I fear my understanding is lacking when it comes to connecting the things I read to this case. I would really appreciate some tips, which will probably help me resolve my confusion.
First of all: rather than downloading data from within the pipeline, I decided to do this in a config step. I tried out two different versions now:
Based on this answer I tried version one. I like the version with the two files, but I'm stuck on how to deal with the variables now, e.g. how to use them with a lambda function or anywhere you would normally write config["sample"].
So my problem here is that I don't know how to proceed, or what the correct syntax is now to call the variables.
# version one
configfile: "config.yaml"

sample_file = config["sample_file"]

import pandas as pd
sample = pd.read_table(sample_file)['samples']
adapt = pd.read_table(sample_file)['adapters']

rule trimming_cutadapt:
    input:
        data=expand("../data/{sample}.fastq", name=pd.read_table(sample_file)['names']),
        lambda wildcards: ???
    output:
        "trimmed/ctadpt_{sample}.fastq"
    shell:
        "cutadapt -a {adapt}"
So I went back and tried to understand using and defining the wildcards. Among other things, I looked into the example Snakefile and the example rules of Johannes, and of course into the manual. Oh, and the thing about the zip function.
At least I don't get an error anymore that it can't deal with wildcards or whatever. Now it's just doing nothing, and I can't find out why, because I don't get any information. Additionally, I marked some points which I don't understand.
# version two
configfile: "config_ChIP_Seq_Pipeline.yaml"

rule all:
    input:
        expand("../data/{sample}.fastq", sample=config["samples"])
        # when to write the lambda or the expand in a rule all, and when in the actual rule?

rule trimming_cutadapt:
    input:
        "../data/{sample}.fastq"
    params:
        adapt=lambda wildcards: config[wildcards.sample]["adapt"]  # why do I have to write .sample here, when I have to use wildcards.XXX in the shell part?
    output:
        "trimmed/ctadpt_{sample}.fastq"
    shell:
        "cutadapt -a {params.adapt}"
As a testfile I used this one.
My configfile in version 1:
sample_file: "sample.tab"
and the tab file:
samples names adapters
test_1 input GACCTA
and the configfile from version two:
samples:
  - test_1
adapt:
  - GTACGTAG
Thanks for your help and patience!
Cheers
You can look at this post to see how to store and access sample information.
Then you can look at Snakemake documentation here, more specifically at the zip function, which might help you as well.
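To make the zip idea concrete, here is a minimal plain-Python sketch of how parallel lists from a config file can be combined into lookup tables keyed by name. The list contents mirror the asker's config; the names name_to_sample and name_to_adapter are made up for illustration:

```python
# Simulate the parsed YAML config: three parallel lists, matched by position.
config = {
    "samples":  ["SRX1232390", "SRX2312380"],
    "names":    ["SomeData", "SomeControl"],
    "adapters": ["GATCGTAGC", "GATCAGTCG"],
}

# zip pairs the lists element by element, giving per-name lookup tables.
name_to_sample  = dict(zip(config["names"], config["samples"]))
name_to_adapter = dict(zip(config["names"], config["adapters"]))

print(name_to_sample["SomeControl"])  # SRX2312380
print(name_to_adapter["SomeData"])    # GATCGTAGC
```

Inside a Snakefile, a rule could then use, for example, params: adapt=lambda wildcards: name_to_adapter[wildcards.name] instead of trying to index config directly.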

How to make Snakemake wrappers work with the {threads} variable

EDIT 27-07-2018: The wrapper does not account for threads. Furthermore, the syntax I'm trying here won't work, and as far as I can find, similar syntax is not supported. The answer comes from a cross-post on the Snakemake Google Groups and from meono below.
I am using Snakemake and I'm quite happy with it. However, for some processes I'm using wrappers (i.e. FastQC and Trimmomatic). I notice that these wrappers do not take the {threads} variable into account. Can someone explain what the proper syntax is to make this work?
I've tried setting threads: 4 and then specifying {threads} at the proper place in the code (e.g. for FastQC: params: "--threads {threads}"). Likewise, I've tried {wildcards.threads} and also {snakemake.threads}. It looks like the wrapper code block is unable to "see" the value of the threads variable.
Please see the example below.
Note: I've looked at the Bitbucket snakemake-wrapper repo and readthedocs readme, but could not find an answer.
rule FastQC_preTrim:
    input:
        join(RAW_DATA, PATTERN_ANY)
    output:
        html="FastQC_pretrim/{sample}.html",
        zip="FastQC_pretrim/{sample}_fastqc.zip"
    threads: 4
    params:
        "--threads {wildcards.threads}"  # Also tried {threads}
    wrapper:
        "0.20.1/bio/fastqc"
(would put this in a comment but don't have the reps)
The fastqc wrapper doesn't account for threads in the rule. I think
params:
    "--threads 4"
would work for you.
I ran into the same issue while trying to use Snakemake wrappers, and failed when putting {threads} or any other wildcard into params.
A workaround is to explicitly define a thread parameter in your config.yaml and read it from there. For example:
# In config.yaml
thread_use: 32
And
# In rule
rule some_rule_with_wrapper:
    input:
        "path/to/input"
    output:
        "path/to/output"
    params:
        extra="-t " + str(config["thread_use"])  # Remember to coerce the int to a string
    threads: 32
    wrapper:
        "some/wrapper"
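In newer Snakemake versions there is also a way to avoid hard-coding the thread count twice: a params entry may be a function, and such functions can accept threads as a keyword argument. A hedged sketch (the rule and wrapper names are placeholders; check that your Snakemake version supports the threads argument in params functions):

```
rule some_rule_with_wrapper:
    input:
        "path/to/input"
    output:
        "path/to/output"
    threads: 32
    params:
        # The callable receives the rule's resolved thread count at run time.
        extra=lambda wildcards, threads: "-t {}".format(threads)
    wrapper:
        "some/wrapper"
```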

How to overwrite a parameter from the configfile that is not at the first level in a snakemake call?

I can't figure out the syntax. For example:
snakemake --configfile myconfig.yml --config myparam="new value"
This will overwrite the value of config["myparam"] from the yaml file upon workflow execution.
But what if I want to overwrite config["myparam"]["otherparam"]?
Thanks!
This is currently not possible. A general remark: Note that --config should be used as little as possible, because it defeats the goal of reproducibility and data provenance (you would have to remember the command line with which you invoked snakemake).
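Since --config can only set top-level keys, one workaround is to accept a flat override key on the command line and fold it into the nested structure inside the Snakefile yourself. A minimal plain-Python sketch (the flat key name myparam_otherparam is made up for illustration):

```python
# Nested config as loaded from myconfig.yml.
config = {"myparam": {"otherparam": "old value"}}

# Simulate `--config myparam_otherparam="new value"`: Snakemake merges such
# flat keys into the top level of config.
config["myparam_otherparam"] = "new value"

# Fold the flat override into the nested structure.
if "myparam_otherparam" in config:
    config["myparam"]["otherparam"] = config.pop("myparam_otherparam")

print(config["myparam"]["otherparam"])  # new value
```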

How to gather files from subdirectories to run jobs in Snakemake?

I am currently working on a project where I am struggling with the following issue.
My current directory structure is
/shared/dir1/file1.bam
/shared/dir2/file2.bam
/shared/dir3/file3.bam
I want to convert various .bam files to fastq in the results directory
results/file1_1.fastq.gz
results/file1_2.fastq.gz
results/file2_1.fastq.gz
results/file2_2.fastq.gz
results/file3_1.fastq.gz
results/file3_2.fastq.gz
I have the following code:
END = ["1", "2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")

rule all:
    input: expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input: "{dir}/{sample}.bam"
    output: left="/results/{sample}_1.fastq", right="/results/{sample}_2.fastq"
    shell: "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"
This outputs the following error:
Wildcards in input files cannot be determined from output files:
'dir'
Any help would be appreciated
You're just missing an assignment for "dir" in the input directive of the rule bam_to_fq. In your code, you are trying to get Snakemake to determine "{dir}" from the output of the same rule, because you have it set up as a wildcard. Since it didn't exist as a variable in your output directive, you received an error.
input:
    "{dir}/{sample}.bam"
output:
    left="/results/{sample}_1.fastq",
    right="/results/{sample}_2.fastq",
Rule of thumb: input and output wildcards must match
rule all:
    input:
        expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)

rule bam_to_fq:
    input:
        expand("{dir}/{{sample}}.bam", dir=dirs)
    output:
        left="/results/{sample}_1.fastq",
        right="/results/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
NOTES
The sample variable in the input directive now requires double braces ({{sample}}), because that is how one keeps a wildcard as a wildcard inside an expand.
dir is no longer a wildcard; it is explicitly set to point to the list of directories determined by the glob_wildcards call and assigned to the variable "dirs", which I am assuming you define earlier in your script, since the assignment of one of the variables is already successful in your rule all input ("sample=files").
I like and recommend easily differentiable variable names. I'm not a huge fan of the variable names "dir" and "dirs"; this makes you prone to pedantic spelling errors. Consider changing them to "dirLIST" and "dir", or anything really. I just fear one day someone will miss an 's' somewhere and it's going to be frustrating to debug. I'm personally guilty, and thus a slight hypocrite, as I do use "sample=samples" in my core Snakefile. It has caused me minor stress, hence this recommendation. It also makes it easier for others to read your code.
EDIT 1: Adding to the response, as I had initially missed the requirement for key-value matching of dir and sample.
I recommend keeping the path and the sample name in separate variables. Two approaches I can think of:
Keep using glob_wildcards to make a blanket search for all possible variables, and then use a Python function to validate which path+file combinations are legitimate.
Drop the usage of glob_wildcards. Propagate the directory name as a wildcard variable, {dir}, throughout your rules. Just set it as a sub-directory of "results". Use pandas to pass known key-value pairs, listed in a file, to the rule all. Initially I suggest generating the key-value pairs file manually, but eventually its generation could just be a rule upstream of the others.
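As an illustration of the first approach, a small plain-Python helper can pair each file with the directory it was actually found in, so rule all only requests combinations that exist. The function pair_dirs_and_files and the example lists are made up for illustration; in a Snakefile the two lists would come from glob_wildcards:

```python
# glob_wildcards("/shared/{dir}/{file}.bam") returns parallel lists:
# position i of `dirs` is the directory in which position i of `files` was found.
dirs = ["dir1", "dir2", "dir3"]
files = ["file1", "file2", "file3"]

def pair_dirs_and_files(dirs, files):
    """Map each file to the directory it was globbed from."""
    return dict(zip(files, dirs))

file_to_dir = pair_dirs_and_files(dirs, files)
print(file_to_dir["file2"])  # dir2
```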
Generalizing bam_to_fq a little bit... utilizing an external config, something like....
from pandas import read_table

rule all:
    input:
        expand("/results/{sample[1][dir]}/{sample[1][file]}_{end}.fastq.gz",
               sample=read_table(config["sampleFILE"], sep=" ").iterrows(), end=['1', '2'])

rule bam_to_fq:
    input:
        "{dir}/{sample}.bam"
    output:
        left="/results/{dir}/{sample}_1.fastq",
        right="/results/{dir}/{sample}_2.fastq"
    shell:
        "/shared/packages/bam2fastq/bam2fastq --force -o /results/{wildcards.sample}.fastq {input}"
sampleFILE
dir file
dir1 file1
dir2 file2
dir3 file3
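To see what the expand over iterrows produces, here is a plain-Python sketch of the same target list built with the standard library (csv instead of pandas, with the sampleFILE content inlined for illustration):

```python
import csv
from io import StringIO

# Inline stand-in for the space-separated sampleFILE shown above.
sample_file = StringIO("dir file\ndir1 file1\ndir2 file2\ndir3 file3\n")

# Each row pairs a directory with the file found in it; every row is
# expanded into one target per read end, mirroring end=['1', '2'].
reader = csv.DictReader(sample_file, delimiter=" ")
targets = [
    f"/results/{row['dir']}/{row['file']}_{end}.fastq.gz"
    for row in reader
    for end in ("1", "2")
]

print(targets[0])    # /results/dir1/file1_1.fastq.gz
print(len(targets))  # 6
```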