Snakemake › Access multiple keys from config file - snakemake

I have the question about the proper handling of the config file. I'm trying to solve my issue for a couple of days now but with the best will, I just can't find out how to do it. I know that this question is maybe quite similar with all the others here and I really tried to use them - but I didn't really get it. I hope that some things about how snakemake works will be more clear when I solved this problem.
I'm just switching to snakemake and I thought I just can easily convert my bash script. To get familiar with snakemake I started trying a simple Data-Processing pipeline. I know I could solve my case while defining every variable within the snakefile. But I want to use an external config file.
First is to say, for better understanding I decided just to post the code which I thought would work somehow. I already played around with different versions for a "rule all" and the "lambda" functions, but nothing worked so far and it just would be confusing. I'm really a bit embarrassed and confused about why I can't get this working. The variable differs from the key because I aways had a version where I redefine the variable, like:
$ sample=config["samples"]
I would be incredibly thankful for an example code.
What I'd like to have is:
The config file:
samples:
- SRX1232390
- SRX2312380
names:
- SomeData
- SomeControl
adapters:
- GATCGTAGC
- GATCAGTCG
And then I thought I can just call the keys like different variables.
rule download_fastq:
output:
"fastq/{name}.fastq.gz"
shell:
"fastq-dump {wildcards.sample} > {output}"
later there will be more rules, so I thought for them I also just need a key:
rule trimming_cutadapt:
input:
"fastq/{name}.fastq"
output:
"ctadpt_{name}.fastq"
shell:
"cutadapt -a {adapt}"
I also tried something with a config file like this:
samples:
Somedata: SRX1232131
SomeControl: SRX12323
But in the end I also didn't find the final solution nor would I know how to add a third "variable" then.
I hope it is somehow understandable what I want to have. It would be very awesome if someone could help me.
EDIT:
Ok - I reworked my code and tried to dig into everything. I fear my understanding lacks in connecting the things I read in this case. I would really appreciate some tips which will probably help me to understand my confusion.
First of all: Rather than try to download data from a pipeline I decided to do this in a config step. I tried out two different versions now:
Based on this answer I tried version one. I like the version with the two files. But I'm stuck in how to deal with the variables now in things like using them with the lambda function or everything where you normally would write "config["sample"]".
So my problem here is that I don't knwo ho to proceed or how the correct syntax is now to call the variables.
#version one
configfile: "config.yaml"
sample_file = config["sample_file"]
import pandas as pd
sample = pd.read_table(sample_file)['samples']
adapt = pd.read_table(sample_file)['adapters']
rule trimming_cutadapt:
input:
data=expand("../data/{sample}.fastq", name = pd.read_table(sample_file)['names']),
lambda wildcards: ???
output:
"trimmed/ctadpt_{sample}.fastq"
shell:
"cutadapt -a {adapt}"
So I went back to try to understand using and defining the wildcards. So (among other things) I looked into the example snakefile and the example rules of Johannes. And of course into the man. Oh and the Thing about the zip function.
At least I don't get an error anymore that it can't deal with wildcards or whatever. Now it's just doing nothing. And I can't find out why because I don't get any information. Additionaly I marked some points which I don't understand.
#version two
configfile: "config_ChIP_Seq_Pipeline.yaml"
rule all:
input:
expand("../data/{sample}.fastq", sample=config["samples"])
#when to write the lambda or the expand in a rule all and when into the actual rule?
rule trimming_cutadapt:
input:
"../data/{sample}.fastq"
params:
adapt=lambda wildcards: config[wildcards.sample]["adapt"] #why do I have to write .samle? when I have to use wildcard.XXX in the shell part?
output:
"trimmed/ctadpt_{sample}.fastq"
shell:
"cutadapt -a {params.adapt}"
As a testfile I used this one.
My configfile in version 1:
sample_file: "sample.tab"
and the tab file:
samples names adapters
test_1 input GACCTA
and the configfile from version two:
samples:
- test_1
adapt:
- GTACGTAG
Thanks for your help and patients!
Cheers

You can look at this post to see how to store and access sample information.
Then you can look at Snakemake documentation here, more specifically at the zip function, which might help you as well.

Related

Accessing file path from a config.yaml in Snakemake

I'm working with Snakemake for NGS analysis. I have a list of input files, stored in a YAML file as follows:
DATASETS:
sample1: /path/to/input/bam
.
.
A very simplified skeleton of my Snakemake file, as described earlier in Snakemake: How to use config file efficiently and https://www.biostars.org/p/406452/, is as follows:
rule all:
input:
expand("report/{sample}.xlsx", sample = config["DATASETS"])
rule call:
input:
lambda wildcards: config["DATASETS"][wildcards.sample]
output:
"tmp/{sample}.vcf"
shell:
"some mutect2 script"
rule summarize:
input:
"tmp/{sample}.vcf"
output:
"report/{sample}.xlsx"
shell:
"processVCF.py"
This complains about missing input files for rule all. I'm really not too sure what I am missing out here: Could someone perhaps point out where I can start looking to try to solve my problem?
This problem persists even when I execute snakemake -n tmp/sample1.vcf, so it seems the problem is related to the inability to pass the input file to the rule call. I have a nagging feeling that I'm really missing something trivial here.

Snakemake live user input

I have an a bunch of R scripts that follow one another and I wanted to connect them using Snakemake. But I’m running in a problem.
One of my R scripts shows two images and asks a user’s input on how many cluster there are present. The R function for this is [readline]
This query on how many clusters is asked but directly after the next line of code is run. Without an opportunity to input a number. the rest of the program crashes, since trying to calculate (empty number) of clusters doesn’t really work. Is there a way around this. By getting the values via a function/rule from Snakemake
or is there a other way to work around this issue?
Based on my testing with snakemake v5.8.2 in MacOS, this is not a snakemake issue. Example setup below works without any problem.
File test.R
cat("What's your name? ")
x <- readLines(file("stdin"),1)
print(x)
File Snakefile
rule all:
input:
"a.txt",
"b.txt"
rule test_rule:
output:
"{sample}.txt"
shell:
"Rscript test.R; touch {output}"
Executing them with command snakemake -p behaves as expected. That is, it asks for user input and then touch output file.
I used function readLines in R script, but this example shows that error you are facing is likely not a snakemake issue.

How to make Snakemake wrappers work with the {threads} variable

EDIT 27-07-2018: Wrapper does not account for threads. Furthermore, the syntax I`m trying here won't work and as far as I can find similar syntax is not supported. Answer is from cross-post on Snakemake Google groups and meono below.
I am using Snakemake and I'm quite happy with it. However, for some processes I`m using wrappers (i.e. FastQC and Trimmomatic). However, I notice that these wrappers do not take the {threads} variable into account. Can someone explain what the proper syntax is to make this work?
I've tried setting threads: 4 and then specifying {threads} at the proper place in the code (e.g. for FastQC: params: "--threads {threads}"). Likewise, I've tested setting {wildcards.threads} and also {snakemake.threads}. It looks like that wrapper codeblock is unable to "see" the value of the threads variable.
Please see the example below.
Note: I've looked at the Bitbucket snakemake-wrapper repo and readthedocs readme, but could not find an answer.
rule FastQC_preTrim:
input:
join(RAW_DATA, PATTERN_ANY)
output:
html="FastQC_pretrim/{sample}.html",
zip="FastQC_pretrim/{sample}_fastqc.zip"
threads: 4
params:
"--threads {wildcards.threads}" # Also tried {threads}
wrapper:
"0.20.1/bio/fastqc"
(would put this in a comment but don't have the reps)
fastqc wrapper doesn't account for threads in rule. I think
params:
"--threads 4"
would work for you.
I ran into the same issue while trying to use snakemake wrappers but failed when specifying {threads} or any other wildcards into params.
A workaround is to explicitly define a "thread" parameter in your config.yaml and calling it from there. For example:
# In config.yaml
thread_use: 32
And
# In rule
rule some_rule_with_wrapper:
input:
"path/to/input"
output:
"path/to/output"
params:
extra="-t "+str(config["thread_use"]) # Remember to coerce int to string
threads: 32
wrapper:
"some/wrapper"

Manually create snakemake wildcards

I'm struggling to integrate my sample sheet (TSV) into my pipeline. Specifically, I want to define the samples wildcard manually instead of reading it from a patch. The reason is that not all samples in a path are supposed to be analysed. Instead, I made a sample sheet that contains the list of samples, the path where to find, reference genome, etc.
The sheet looks like this:
name path reference
sample1 path/to/fastq/files mm9
sample2 path/to/fastq/files mm9
I load the sheet in my snakefile:
table_samples = pd.read_table(config["samples"], index_col="name")
SAMPLES = table_samples.index.values.tolist()
The first rule is supposed to merge the FASTQ files inside, so it would be nice to do something like this:
rule merge_fastq:
output: "{sample}/{sample}.fastq.gz"
params: path = table_samples['path'][{sample}]
shell: """
cat {params.path}/*.fastq.gz > {output}
"""
But as written above it won't work because the sample wildcard is not defined. Is there a way I can say the sample list I defined above (SAMPLES) contains all the samples for which rules should be executed?
I honestly feel stupid asking this question but I've already spent a couple of hours finding/searching a solution and at this point I need to be a bit more time efficient :D
Thanks!
You just need a target rule listing all the concrete files you want after your rule "merge_fastq":
rule all:
input: expand("{sample}/{sample}.fastq.gz",sample=SAMPLES)
This rule must be put at the top of the other rules. Wildcards can only be used if you define the concrete files you want at the end of the workflow.

How to gather files from subdirectories to run jobs in Snakemake?

I am currently working on this project where iam struggling with this issue.
My current directory structure is
/shared/dir1/file1.bam
/shared/dir2/file2.bam
/shared/dir3/file3.bam
I want to convert various .bam files to fastq in the results directory
results/file1_1.fastq.gz
results/file1_2.fastq.gz
results/file2_1.fastq.gz
results/file2_2.fastq.gz
results/file3_1.fastq.gz
results/file3_2.fastq.gz
I have the following code:
END=["1","2"]
(dirs, files) = glob_wildcards("/shared/{dir}/{file}.bam")
rule all:
input: expand( "/results/{sample}_{end}.fastq.gz",sample=files, end=END)
rule bam_to_fq:
input: {dir}/{sample}.bam"
output: left="/results/{sample}_1.fastq", right="/results/{sample}_2.fastq"
shell: "/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}"
This outputs the following error:
Wildcards in input files cannot be determined from output files:
'dir'
Any help would be appreciated
You're just missing an assignment for "dir" in your input directive of the rule bam_to_fq. In your code, you are trying to get Snakemake to determine "{dir}" from the output of the same rule, because you have it setup as a wildcard. Since it didn't exist, as a variable in your output directive, you received an error.
input:
"{dir}/{sample}.bam"
output:
left="/results/{sample}_1.fastq",
right="/results/{sample}_2.fastq",
Rule of thumb: input and output wildcards must match
rule all:
input:
expand("/results/{sample}_{end}.fastq.gz", sample=files, end=END)
rule bam_to_fq:
input:
expand("{dir}/{{sample}}.bam", dir=dirs)
output:
left="/results/{sample}_1.fastq",
right="/results/{sample}_2.fastq"
shell:
"/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}
NOTES
the sample variable in the input directive now requires double {}, because that is how one identifies wildcards in an expand.
dir is no longer a wildcard, it is explicitly set to point to the list of directories determined by the glob_wildcard call and assigned to the variable "dirs" which I am assuming you make earlier in your script, since the assignment of one of the variables is successful already, in your rule all input "sample=files".
I like and recommend easily differentiable variable names. I'm not a huge fan of the usage of variable names "dir", and "dirs". This makes you prone to pedantic spelling errors. Consider changing it to "dirLIST" and "dir"... or anything really. I just fear one day someone will miss an 's' somewhere and it's going to be frustrating to debug. I'm personally guilty, an thus a slight hypocrite, as I do use "sample=samples" in my core Snakefile. It has caused me minor stress, thus why I make this recommendation. Also makes it easier for others to read your code as well.
EDIT 1; Adding to response as I had initially missed the requirement for key-value matching of the dir and sample
I recommend keeping separate the path and the sample name in different variables. Two approaches I can think of:
Keep using glob_wildcards to make a blanket search for all possible variables, and then use a python function to validate which path+file combinations are legit.
Drop the usage of glob_wildcards. Propagate the directory name as a wildcard variable, {dir}, throughout your rules. Just set it as a sub-directory of "results". Use pandas to pass known, key-value pairs listed in a file to the rule all. Initially I suggest generating the key-value pairs file manually, but eventually, it's generation could just be a rule upstream of others.
Generalizing bam_to_fq a little bit... utilizing an external config, something like....
from pandas import read_table
rule all:
input:
expand("/results/{{sample[1][dir]}}/{sample[1][file]}_{end}.fastq.gz", sample=read_table(config["sampleFILE"], " ").iterrows(), end=['1','2'])
rule bam_to_fq:
input:
"{dir}/{sample}.bam"
output:
left="/results/{dir}/{sample}_1.fastq",
right="/results/{dir}/{sample}_2.fastq"
shell:
"/shared/packages/bam2fastq/bam2fastq --force -o /results/{sample}.fastq {input}
sampleFILE
dir file
dir1 file1
dir2 file2
dir3 file3