Wildcards in input and output not working - snakemake

I added a rule get_timezone_periods with wildcards in the input and output, but it fails with the error "Missing input files for rule all".
Typing the paths manually works:
    "data/raw/test1/ros/timezone.csv",
    "data/raw/test3/t02/timezone.csv"
Using wildcards does not:
    "data/raw/{{db}}/{{user}}/timezone.csv"
My code:
SENSORS=["timezone", "touch"]
DBS_USERS={"test1":["ros"],
           "test3":["t02"]}
def db_user_path(paths):
    new_paths = []
    for db, users in DBS_USERS.items():
        for user in users:
            for path in paths:
                new_paths.append(path.replace("db/", db + "/").replace("user/", user+ "/"))
    return new_paths
rule all:
    input:
        sensors = db_user_path(expand("data/raw/db/user/{sensor}.csv", sensor=SENSORS)),
        timezone_periods = db_user_path(["data/processed/db/user/timezone_periods.csv"])
rule download_dataset:
    input:
        "data/external/{db}-{user}.participant"
    output:
        expand("data/raw/{{db}}/{{user}}/{sensor}.csv", sensor=SENSORS)
    script:
        "src/data/download_dataset.R"
rule get_timezone_periods:
    input:
        # This line below does not work
        # "data/raw/{{db}}/{{user}}/timezone.csv"
        # These two lines work
        "data/raw/test1/ros/timezone.csv",
        "data/raw/test3/t02/timezone.csv"
    output:
        # This line below does not work
        # "data/processed/{{db}}/{{user}}/timezone_periods.csv"
        # These two lines work
        "data/processed/test1/ros/timezone_periods.csv",
        "data/processed/test3/t02/timezone_periods.csv"
    script:
        "src/data/get_timezone_periods.R"

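I just realised that I was adding an extra pair of curly braces; it should have been only {db} (and likewise {user}). Double braces are only needed inside expand(), where they escape a wildcard that expand() should leave untouched.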
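The difference between single and double braces can be seen with plain Python string formatting. This is only a minimal sketch: str.format is used as a stand-in for expand(), so nothing here calls Snakemake itself.

```python
# str.format stands in for Snakemake's expand() here (an assumption made
# so that the sketch runs without Snakemake installed).
pattern_in_expand = "data/raw/{{db}}/{{user}}/{sensor}.csv"

# Inside expand(), doubled braces are escapes: they come out as single
# braces, so {db} and {user} are still wildcards after {sensor} is filled in.
expanded = [pattern_in_expand.format(sensor=s) for s in ["timezone", "touch"]]
print(expanded)

# Outside expand(), the rule string is taken as written, so the wildcard
# has to be spelled with single braces.
rule_pattern = "data/raw/{db}/{user}/timezone.csv"
resolved = rule_pattern.format(db="test1", user="ros")
print(resolved)
```

With doubled braces in a plain input or output string there is no wildcard left to fill in, which is consistent with the "Missing input files" error in the question.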

Related

Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline, but when I run it I get "Missing input files for rule all":
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']
def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)
def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)
rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files that are listed by the terminal seem correct, so it seems it is seeing what I want to produce but the shell is not making anything at all.
I want to produce a list of files that have grepped the list of barcodes I have in a file. But I get "Missing input files for rule all:"
There are two issues:
You have an impossible wildcard_constraints defined for {barcodes}.
Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now only matches capital letters. Previously it also included an end-of-string anchor (the trailing $), which could never match for this wildcard because more text follows it in the file path.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you'd get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
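The effect of each constraint can be reproduced with Python's re module. The sketch below only approximates how an output pattern with two adjacent wildcards becomes a regex with one named group per wildcard; the target filename, the barcode "ACGGT" and the sample "Well1" are made-up values for illustration.

```python
import re

# Hypothetical target file requested by rule all.
target = "grepped/ACGGTWell1_R1_001.plate.fastq"

def match_output(barcodes_re, sample_re):
    """Match the target against an approximate regex for the output pattern
    grepped/{barcodes}{sample}_R1_001.plate.fastq."""
    regex = (
        r"grepped/(?P<barcodes>" + barcodes_re + r")"
        r"(?P<sample>" + sample_re + r")_R1_001\.plate\.fastq"
    )
    m = re.fullmatch(regex, target)
    return m.groupdict() if m else None

# Original constraint: "$" demands the end of the string right after the
# barcode, but more text always follows, so nothing can match.
original = match_output(r"[a-z-A-Z]+$", r".+")

# Constraining only the barcode: the greedy match swallows the "W" of "Well1".
barcode_only = match_output(r"[A-Z]+", r".+")

# Both constraints from the answer: the split is unambiguous.
both = match_output(r"[A-Z]+", r"Well.*")

print(original)
print(barcode_only)
print(both)
```

The first call finds no match at all, the second splits the name in the wrong place, and only the third recovers the intended barcode and sample.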
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean having a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq.
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]
def getsamples():
    import glob
    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples
def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines
rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling the regex matching altogether. In your case, I would do something like:
import re
wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values in getbarcodes(), whatever they are, and the same for {sample}. In my opinion this is better than anticipating what combination of regex a wildcard can take.
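What the joined constraint looks like can be checked outside Snakemake. In this sketch the two lists are hypothetical stand-ins for the return values of getbarcodes() and getsamples(), and the combined regex is only an approximation of what is built from the output pattern.

```python
import re

# Hypothetical return values of getbarcodes() and getsamples().
barcodes = ["ACGGT", "TTGCA"]
samples = ["Well1", "Well2.rep"]

# Each constraint becomes an alternation of literal values; re.escape makes
# sure characters such as "." are matched literally, not as regex syntax.
barcodes_re = "|".join(re.escape(x) for x in barcodes)
sample_re = "|".join(re.escape(x) for x in samples)

pattern = re.compile(
    r"grepped/(?P<barcodes>" + barcodes_re + r")"
    r"(?P<sample>" + sample_re + r")_R1_001\.plate\.fastq"
)

# A name built from known values matches and splits as intended.
hit = pattern.fullmatch("grepped/TTGCAWell2.rep_R1_001.plate.fastq")
# A name that only looks similar ("x" where the "." should be) is rejected.
miss = pattern.fullmatch("grepped/TTGCAWell2xrep_R1_001.plate.fastq")

print(hit.groupdict())
print(miss)
```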

Using multiple config file in snakemake results in not same wildcard error

I had asked an earlier question about running the same snakemake pipeline for multiple datasets, and one of the solutions, suggested by @bli, was to use multiple config files. I am trying to implement it but get an error when I have to read in a file that contains sample information.
Error:
SyntaxError:
Not all output, log and benchmark files of rule fastqc contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
File "Snakefile", line 64, in <module>
I have seen this error before, but I cannot figure out why it occurs in this case, when every input and output has sample as a wildcard. Any help is much appreciated!
My Snakefile looks like this:
import os
import pandas as pd
import yaml
configfile: "main_config.yaml"
all_keys = list(config.keys())
print(all_keys)
datasets =config["datasets"]
print(datasets.items())
for p_id, p_info in datasets.items():
for key in p_info:
print(key + '------',p_info[key])
conf_file=p_info["conf"]
conf_fh=open(conf_file)
dat_conf = yaml.safe_load(conf_fh)
output = dat_conf["output_dir"]
samples = dat_conf["sampletable"]
R1 = dat_conf["R1"]
R2 = dat_conf["R2"]
print(samples)
print(R1)
SampleTable = pd.read_table(samples,index_col=0)
SAMPLES = list(SampleTable.index)
print(SAMPLES)
PAIRED_END= ('R2' in SampleTable.columns)
FRACTIONS= ['R1']
if PAIRED_END: FRACTIONS+= ['R2']
qc = config["qc_only"]
def all_input_reads(qc):
    if config["qc_only"]:
        return expand("{output}/fastqc/{sample}" + config["R1"] + "_fastqc.html", sample=SAMPLES)
    else:
        return expand("{output}/fastqc/{sample}" + config["R1"] + "_fastqc.html", sample=SAMPLES)
rule all:
    input:
        all_input_reads
rule fastqc:
    input:
        unpack( lambda wc: dict(SampleTable.loc[wc.sample]))
    output:
        R1= "{output}/fastqc/{sample}{R1}" +"_fastqc.html",
        R2 ="{output}/fastqc/{sample}{R2}" +"_fastqc.html"
    conda:
        "../envs/fastqc.yaml"
    log:
        "{output}/logs/qc/fastqc_{sample}_unfilt.log"
    shell: "fastqc -o {output}/fastqc {input.R1} {input.R2} >> {log}"
The main config file is:
datasets:
    dat1:
        conf: "config_files/data1_config.yaml"
    dat2:
        conf: "config_files/data2_config.yaml"
qc_only: FALSE
The individual config files look like this (data1_config.yaml):
# List of files
sampletable: "samples_data1.tsv"
output_dir: "data1"
## Cutadapt
## IMPORTANT ****** If you want to remove primers uncomment line 51 in utils/rules/qc_cutadapt.smk which will allow for primers to be removed
primers:
    # Illumina V3V4 protocol primers
    fwd_primer: "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
    rev_primer: "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"
    fwd_primer_rc: "CTGCWGCCNCCCGTAGGCTGTCTCTTATACACATCTGACGCTGCCGACGA"
    rev_primer_rc: "GGATTAGATACCCBDGTAGTCCTGTCTCTTATACACATCTCCGAGCCCACGAGAC"
R1: "_R1"
R2: "_R2"
maxEE:
    - 2
    - 2
truncQ: 2
Here is your fastqc/output with a little formatting:
rule fastqc:
    output:
        R1 = "{output}/fastqc/{sample}{R1}" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}{R2}" + "_fastqc.html"
The R1 and R2 wildcards are synonyms, and there is no way for Snakemake to differentiate them. For example, imagine that the rule all requires this file to be created: output_dir/fastqc/sample_aR1_fastqc.html. Which variable should Snakemake assign this file to, output.R1 or output.R2? In addition, the first output contains the wildcard {R1} and the second contains {R2}, so the two outputs do not share the same set of wildcards, which is what the error message reports.
You need to separate these parameters with a non-wildcard, like this:
rule fastqc:
    output:
        R1 = "{output}/fastqc/{sample}R1" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}R2" + "_fastqc.html"
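The "same wildcards" requirement can be illustrated by listing the placeholder names in each pattern. This is a rough sketch that imitates the check with Python's string.Formatter; it is not Snakemake's own implementation, and the helper names wildcard_names and same_wildcards are made up for this example.

```python
from string import Formatter

def wildcard_names(pattern):
    """Return the set of {name} placeholders found in a path pattern."""
    return {field for _, field, _, _ in Formatter().parse(pattern) if field}

def same_wildcards(patterns):
    """True if every pattern uses exactly the same set of placeholders."""
    names = [wildcard_names(p) for p in patterns]
    return all(n == names[0] for n in names)

# Output and log patterns from the question: each output adds its own
# extra wildcard ({R1} or {R2}).
before = [
    "{output}/fastqc/{sample}{R1}_fastqc.html",
    "{output}/fastqc/{sample}{R2}_fastqc.html",
    "{output}/logs/qc/fastqc_{sample}_unfilt.log",
]

# Patterns from the answer: R1 and R2 are literal text.
after = [
    "{output}/fastqc/{sample}R1_fastqc.html",
    "{output}/fastqc/{sample}R2_fastqc.html",
    "{output}/logs/qc/fastqc_{sample}_unfilt.log",
]

print(same_wildcards(before))
print(same_wildcards(after))
```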

Snakemake: wildcards for parameter keys

I'm trying to create a snakemake rule for which the input and output are config parameters specified by a wildcard but having problems.
I would like to do something like:
config.yaml:
cam1:
    raw: "src/in1.avi"
    bg: "out/bg1.png"
cam2:
    raw: "src/in2.avi"
    bg: "out/bg2.png"
cam3:
    raw: "src/in3.avi"
    bg: "out/bg3.png"
Snakefile:
configfile: "config.yml"
...
rule all:
    input:
        [config[f'cam{id}']['background'] for id in [1, 2, 3]]
rule make_bg:
    input:
        raw=config["{cam}"]["raw"]
    output:
        bg=config["{cam}"]["bg"]
    shell:
        """
        ./process.py {input.raw} {output.bg}
        """
But this doesn't seem to play - I would like {cam} to be treated as a wildcard, instead I get a KeyError for {cam}. Can anyone help?
Is it possible to specify {cam} as a wildcard (or something else) that could then be used as a config key?
I think that there are a few problems with this approach:
Conceptually
It does not make much sense to specify the exact input and output filenames in a config, since this is pretty much diametrically opposed to why you would use snakemake: Infer from the inputs what part of the pipeline needs to be run to create the desired outputs. In this case, you would always have to first edit the config for each input/output pair and the whole point of automatisation is lost.
Now, the actual problem is to access config variables from the config for input and output. Typically, you would e.g. provide some paths in the config and use something like:
config.yaml:
raw_input: 'src'
bg_output: 'out'
In the pipeline, you could then use it like this:
input: os.path.join(config['raw_input'], 'in{id}.avi')
output: os.path.join(config['bg_output'], 'bg{id}.png')
As I said, it makes little sense to specify the outputs in particular in the config file.
If you were to specify the inputs in config.yaml:
cam1:
    raw: "src/in1.avi"
cam2:
    raw: "src/in2.avi"
cam3:
    raw: "src/in3.avi"
you could then get the inputs with a function as below:
import os
from pathlib import Path
configfile: "config.yaml"
# create sample data
os.makedirs('src', exist_ok=True)
for i in [1,2,3]:
    Path(f'src/in{i}.avi').touch()
ids = [1,2,3]
# input function: look up the raw file for the requested id in the config
def get_raw(wildcards):
    id = 'cam' + wildcards.id
    raw = config[f'{id}']['raw']
    return raw
rule all:
    input: expand('out/bg{id}.png', id = ids)
rule make_bg:
    input:
        raw = get_raw
    output:
        bg='out/bg{id}.png'
    shell:
        " touch {input.raw} ;"
        " cp {input.raw} {output.bg};"
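The lookup done by the input function can be tried in plain Python. In this sketch the config dictionary mirrors the config.yaml above, and SimpleNamespace is only a stand-in for the wildcards object that Snakemake passes to the function.

```python
from types import SimpleNamespace

# Stand-in for the dictionary Snakemake builds from config.yaml.
config = {
    "cam1": {"raw": "src/in1.avi"},
    "cam2": {"raw": "src/in2.avi"},
    "cam3": {"raw": "src/in3.avi"},
}

def get_raw(wildcards):
    # Wildcard values arrive as strings, e.g. "1" when out/bg1.png is requested.
    return config["cam" + wildcards.id]["raw"]

# SimpleNamespace imitates the wildcards object (an assumption for this demo).
resolved = [get_raw(SimpleNamespace(id=str(i))) for i in (1, 2, 3)]
print(resolved)
```

The key point is that the wildcard value is only known inside the function, once a concrete output file has been requested; that is why config["{cam}"] at the top level of a rule raises a KeyError.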

Snakemake decide which rules to execute during execution

I'm working on a bioinformatics pipeline which must be able to run different rules to produce different outputs based on the contents of an input file:
def foo(file):
    '''
    Function will read the file contents and output a boolean value based on its contents
    '''
    # Code to read file here...
    return bool
rule check_input:
    input: "input.txt"
    run:
        bool = foo("input.txt")
rule bool_is_True:
    input: "input.txt"
    output: "out1.txt"
    run:
        # Some code to generate out1.txt. This rule is supposed to run only if foo("input.txt") is true
rule bool_is_False:
    input: "input.txt"
    output: "out2.txt"
    run:
        # Some code to generate out2.txt. This rule is supposed to run only if foo("input.txt") is False
How do I write my rules to handle this situation? Also how do I write my first rule all if the output files are unknown before the rule check_input is executed?
Thanks!
You're right, snakemake has to know which files to produce before executing the rules. Therefore, I suggest you use a function which reads what you called "the input file" and define the output of the workflow accordingly.
For example:
def getTargetsFromInput():
    targets = list()
    ## read file and add target files to targets
    return targets
rule all:
    input: getTargetsFromInput()
...
You can define the path of the input file with --config argument on the snakemake command line or directly use some sort of structured input file (yaml, json) and use the keyword configfile: in the Snakefile: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html
Thanks Eric. I got it to work with:
def getTargetsFromInput(file):
    with open(file) as f:
        line = f.readline()
        if line.strip() == "out1":
            return "out1.txt"
        else:
            return "out2.txt"
rule all:
    input: getTargetsFromInput("input.txt")
rule out1:
    input: "input.txt"
    output: "out1.txt"
    run: shell("echo 'out1' > out1.txt")
rule out2:
    input: "input.txt"
    output: "out2.txt"
    run: shell("echo 'out2' > out2.txt")
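The selection logic can be exercised on its own, outside Snakemake. Because getTargetsFromInput("input.txt") is called while the Snakefile is being read, before any job runs, input.txt has to exist already at that point. The sketch below uses a temporary file in place of input.txt; the helper target_for is made up for this example.

```python
import os
import tempfile

def getTargetsFromInput(file):
    # Same logic as above: the first line of the file selects the target.
    with open(file) as f:
        line = f.readline()
    return "out1.txt" if line.strip() == "out1" else "out2.txt"

def target_for(content):
    """Write content to a temporary input file and return the chosen target."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "input.txt")
        with open(path, "w") as f:
            f.write(content)
        return getTargetsFromInput(path)

print(target_for("out1\n"))
print(target_for("anything else\n"))
```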

Snakemake: How to use config file efficiently

I'm using the following config file format in snakemake for some sequencing analysis practice (I have loads of samples, each containing 2 fastq files):
samples:
    Sample1_XY:
        - fastq_files/SRR4356728_1.fastq.gz
        - fastq_files/SRR4356728_2.fastq.gz
    Sample2_AB:
        - fastq_files/SRR6257171_1.fastq.gz
        - fastq_files/SRR6257171_2.fastq.gz
I'm using the following rules at the start of my pipeline to run fastqc and to align the fastq files:
import os
# read config info into this namespace
configfile: "config.yaml"
rule all:
    input:
        expand("FastQC/{sample}_fastqc.zip", sample=config["samples"]),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/fastq_multiqc.html"
rule fastqc:
    input:
        sample=lambda wildcards: config['samples'][wildcards.sample]
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{sample}_fastqc.html",
        zip="FastQC/{sample}_fastqc.zip"
    params: ""
    wrapper:
        "0.21.0/bio/fastqc"
rule bowtie2:
    input:
        sample=lambda wildcards: config['samples'][wildcards.sample]
    output:
        "bam_files/{sample}.bam"
    log:
        "logs/bowtie2/{sample}.txt"
    params:
        index=config["index"], # prefix of reference genome index (built with bowtie2-build),
        extra=""
    threads: 8
    wrapper:
        "0.21.0/bio/bowtie2/align"
rule multiqc_fastq:
    input:
        expand("FastQC/{sample}_fastqc.html", sample=config["samples"])
    output:
        "FastQC/fastq_multiqc.html"
    params:
    log:
        "logs/multiqc.log"
    wrapper:
        "0.21.0/bio/multiqc"
My issue is with the fastqc rule.
Currently both the fastqc rule and the bowtie2 rule create one output file generated using two inputs SRRXXXXXXX_1.fastq.gz and SRRXXXXXXX_2.fastq.gz.
I need the fastqc rule to generate two files, a separate one for each of the fastq.gz files, but I'm unsure how to index the config file correctly from the fastqc rule's input statement, or how to combine the expand and wildcards commands to solve this. I can get an individual fastq file by adding [0] or [1] to the end of the input statement, but I cannot get both to run individually/separately.
I've been messing around trying to get the correct indexing format to access each file separately. The current format is the only one I've managed that allows snakemake -np to generate a job list.
Any tips would be greatly appreciated.
It appears each sample would have two fastq files, and they are named in format ***_1.fastq.gz and ***_2.fastq.gz. In that case, config and code below would work.
config.yaml:
samples:
    Sample_A: fastq_files/SRR4356728
    Sample_B: fastq_files/SRR6257171
Snakefile:
# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])
rule all:
    input:
        expand("FastQC/{sample}_{num}_fastqc.zip", sample=config["samples"], num=['1', '2']),
        expand("bam_files/{sample}.bam", sample=config["samples"]),
        "FastQC/fastq_multiqc.html"
rule fastqc:
    input:
        sample=lambda wildcards: f"{config['samples'][wildcards.sample]}_{wildcards.num}.fastq.gz"
    output:
        # Output needs to end in '_fastqc.html' for multiqc to work
        html="FastQC/{sample}_{num}_fastqc.html",
        zip="FastQC/{sample}_{num}_fastqc.zip"
    wrapper:
        "0.21.0/bio/fastqc"
rule bowtie2:
    input:
        sample=lambda wildcards: expand(f"{config['samples'][wildcards.sample]}_{{num}}.fastq.gz", num=[1,2])
    output:
        "bam_files/{sample}.bam"
    wrapper:
        "0.21.0/bio/bowtie2/align"
rule multiqc_fastq:
    input:
        expand("FastQC/{sample}_{num}_fastqc.html", sample=config["samples"], num=['1', '2'])
    output:
        "FastQC/fastq_multiqc.html"
    wrapper:
        "0.21.0/bio/multiqc"
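How the two input lambdas resolve can be traced in plain Python. This sketch assumes the config.yaml from this answer and uses str.format in place of expand(); the function names fastqc_input and bowtie2_input are made up for the illustration.

```python
# Stand-in for the dictionary Snakemake builds from config.yaml.
config = {
    "samples": {
        "Sample_A": "fastq_files/SRR4356728",
        "Sample_B": "fastq_files/SRR6257171",
    }
}

def fastqc_input(sample, num):
    # One job per (sample, num) pair, so exactly one fastq file is returned.
    return f"{config['samples'][sample]}_{num}.fastq.gz"

def bowtie2_input(sample):
    # The f-string fills in the sample prefix; the doubled braces survive as
    # a literal {num} placeholder for the next step.
    template = f"{config['samples'][sample]}_{{num}}.fastq.gz"
    # str.format stands in for expand() here (an assumption for this demo).
    return [template.format(num=n) for n in [1, 2]]

print(fastqc_input("Sample_A", "2"))
print(bowtie2_input("Sample_B"))
```

The fastqc rule therefore runs once per read file, while the bowtie2 rule still receives both files of a sample in a single job.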