Using multiple config files in snakemake results in "not same wildcard" error - snakemake

I had asked an earlier question about running the same snakemake pipeline for multiple datasets, and one of the solutions suggested, by @bli, was using multiple config files. I am trying to implement it, but I get an error when I have to read in a file which has the sample information.
error:

SyntaxError:
Not all output, log and benchmark files of rule fastqc contain the same wildcards. This is crucial though, in order to avoid that two or more jobs write to the same file.
  File "Snakefile", line 64, in <module>
I have seen this error before, but I cannot figure out why it occurs in this case, when every input and output has sample as a wildcard. Any help is much appreciated!
My Snakefile looks like this:
import os
import pandas as pd
import yaml

configfile: "main_config.yaml"

all_keys = list(config.keys())
print(all_keys)
datasets = config["datasets"]
print(datasets.items())

for p_id, p_info in datasets.items():
    for key in p_info:
        print(key + '------', p_info[key])
    conf_file = p_info["conf"]
    conf_fh = open(conf_file)
    dat_conf = yaml.safe_load(conf_fh)
    output = dat_conf["output_dir"]
    samples = dat_conf["sampletable"]
    R1 = dat_conf["R1"]
    R2 = dat_conf["R2"]
    print(samples)
    print(R1)

SampleTable = pd.read_table(samples, index_col=0)
SAMPLES = list(SampleTable.index)
print(SAMPLES)
PAIRED_END = ('R2' in SampleTable.columns)
FRACTIONS = ['R1']
if PAIRED_END:
    FRACTIONS += ['R2']

qc = config["qc_only"]

def all_input_reads(qc):
    if config["qc_only"]:
        return expand("{output}/fastqc/{sample}" + config["R1"] + "_fastqc.html", sample=SAMPLES)
    else:
        return expand("{output}/fastqc/{sample}" + config["R1"] + "_fastqc.html", sample=SAMPLES)

rule all:
    input:
        all_input_reads

rule fastqc:
    input:
        unpack(lambda wc: dict(SampleTable.loc[wc.sample]))
    output:
        R1 = "{output}/fastqc/{sample}{R1}" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}{R2}" + "_fastqc.html"
    conda:
        "../envs/fastqc.yaml"
    log:
        "{output}/logs/qc/fastqc_{sample}_unfilt.log"
    shell: "fastqc -o {output}/fastqc {input.R1} {input.R2} >> {log}"
The main config file is:

datasets:
  dat1:
    conf: "config_files/data1_config.yaml"
  dat2:
    conf: "config_files/data2_config.yaml"

qc_only: FALSE
and the individual config files look like this (data1_config.yaml):
# List of files
sampletable: "samples_data1.tsv"
output_dir: "data1"

## Cutadapt
## IMPORTANT ****** If you want to remove primers uncomment line 51 in utils/rules/qc_cutadapt.smk which will allow for primers to be removed
primers:
  # Illumina V3V4 protocol primers
  fwd_primer: "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGCCTACGGGNGGCWGCAG"
  rev_primer: "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGACTACHVGGGTATCTAATCC"
  fwd_primer_rc: "CTGCWGCCNCCCGTAGGCTGTCTCTTATACACATCTGACGCTGCCGACGA"
  rev_primer_rc: "GGATTAGATACCCBDGTAGTCCTGTCTCTTATACACATCTCCGAGCCCACGAGAC"

R1: "_R1"
R2: "_R2"

maxEE:
  - 2
  - 2
truncQ: 2

Here is your fastqc rule's output with a little formatting:
rule fastqc:
    output:
        R1 = "{output}/fastqc/{sample}{R1}" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}{R2}" + "_fastqc.html"
The R1 and R2 wildcards are synonyms here, and there is no way for Snakemake to differentiate them. For example, imagine that rule all requires this file to be created: output_dir/fastqc/sample_aR1_fastqc.html. Which variable should Snakemake assign this file to, output.R1 or output.R2?
You need to separate these outputs with non-wildcard text, like this:
rule fastqc:
    output:
        R1 = "{output}/fastqc/{sample}R1" + "_fastqc.html",
        R2 = "{output}/fastqc/{sample}R2" + "_fastqc.html"
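This fix also satisfies the check named in the error message: all output, log and benchmark files of a rule must contain the same set of wildcards. With literal R1/R2 in the output names, both the outputs and the log use exactly {output} and {sample}. A minimal sketch of the repaired rule, keeping the question's input function (the shell line is illustrative; note that in shell commands {output} expands to the list of output files, so the fastqc output directory is written as {wildcards.output}):

rule fastqc:
    input:
        unpack(lambda wc: dict(SampleTable.loc[wc.sample]))
    output:
        R1 = "{output}/fastqc/{sample}R1_fastqc.html",
        R2 = "{output}/fastqc/{sample}R2_fastqc.html"
    log:
        "{output}/logs/qc/fastqc_{sample}_unfilt.log"
    shell: "fastqc -o {wildcards.output}/fastqc {input.R1} {input.R2} >> {log}"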

Related

Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline, but when it runs I get missing-file errors for everything in rule all:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']

def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)

def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)

rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files listed in the terminal look correct, so Snakemake seems to see what I want to produce, but the shell commands never run.
I want to produce files that have been grepped for each of the barcodes I have listed in a file, but instead I get "Missing input files for rule all:".
There are two issues:
You have an impossible wildcard_constraints defined for {barcodes}.
Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now matches only capital letters. Before, it also included end-of-line matching (the trailing $), which could never match because additional text follows the wildcard in the file path.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you would get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq, as sketched below.
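A minimal sketch of that layout, assuming the same fastq-grep command as above (only the output paths change):

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq",
    output:
        out1 = "grepped/{barcodes}/{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}/{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"

With the slash separating the two wildcards, they no longer compete, so the wildcard constraints become optional.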
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",

refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]

def getsamples():
    import glob

    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples

def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines

rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),

rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling the regex matching altogether. In your case, I would do something like:
import re  # needed for re.escape

wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values returned by getbarcodes(), whatever they are, and the same goes for {sample}. In my opinion this is better than trying to anticipate every combination a wildcard's regex might need to match.

How do Snakemake checkpoints work when I do not want to make a folder?

I have a snakemake file where one rule produces a file from which I would like to extract the headers and use them as wildcards in my rule all.
The Snakemake guide provides an example where it creates new folders named like the wildcards, but if I can avoid that it would be nice, since in some cases it would need to create 100-200 folders. Any suggestions on how to make it work?
link to snakemake guide:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
import pandas as pd

rule all:
    input:
        final_report = expand('report_{fruit}.txt', fruit=???)

rule create_file:
    input:
    output:
        fruit = 'fruit_file.csv'
    run:
        ....

rule next:
    input:
        fruit = 'fruit_file.csv'
    output:
        report = 'report_{phenotype}.txt'
    run:
        fruit_file = pd.read_csv(input.fruit, header=0, sep='\t')
        fruits = fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = 'touch report_' + i + '.txt'
            shell(cmd)
This is a simplified workflow; I am actually using some long scripts to produce both the fruit_file.csv and the report files.
The fruit_file.csv is tab-separated and could look like this:

FID    IID      Apple  Banana  Plum
Mouse  Mickey   0      0       1
Mouse  Minnie   1      0       1
Duck   Donnald  0      1       0
I think you are misreading the snakemake checkpoint example. You only need to create one folder in your case. They have a wildcard (sample) in the folder name, but that part of the output name is known ahead of time.
checkpoint fruit_reports:
    input:
        fruit = 'fruit_file.csv'
    output:
        report_dir = directory('reports')
    run:
        fruit_file = pd.read_csv(input.fruit, header=0, sep='\t')
        fruits = fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = f'touch {output}/report_{i}.txt'
            shell(cmd)
Since you do not know all names (fruits) ahead of time, you cannot include them in the all rule. You need to reference an intermediate rule to bring everything together. Maybe use a final report file:
rule all:
    input: 'report.txt'
Then after the checkpoint:
import os  # for os.path.join below

def aggregate_fruit(wildcards):
    checkpoint_output = checkpoints.fruit_reports.get(**wildcards).output[0]
    return expand("reports/report_{i}.txt",
                  i=glob_wildcards(os.path.join(checkpoint_output, "report_{i}.txt")).i)

rule report:
    input:
        aggregate_fruit
    output:
        "report.txt"
    shell:
        "ls {input} > {output}"

Wildcards in input and output not working

I added a rule get_timezone_periods with wildcards in the input and output, but it is not working: I get the error Missing input files for rule all.
Manually typing the paths works:

"data/raw/test1/ros/timezone.csv",
"data/raw/test3/t02/timezone.csv"

Using wildcards does not:

"data/raw/{{db}}/{{user}}/timezone.csv"
My code:
SENSORS = ["timezone", "touch"]
DBS_USERS = {"test1": ["ros"],
             "test3": ["t02"]}

def db_user_path(paths):
    new_paths = []
    for db, users in DBS_USERS.items():
        for user in users:
            for path in paths:
                new_paths.append(path.replace("db/", db + "/").replace("user/", user + "/"))
    return new_paths

rule all:
    input:
        sensors = db_user_path(expand("data/raw/db/user/{sensor}.csv", sensor=SENSORS)),
        timezone_periods = db_user_path(["data/processed/db/user/timezone_periods.csv"])

rule download_dataset:
    input:
        "data/external/{db}-{user}.participant"
    output:
        expand("data/raw/{{db}}/{{user}}/{sensor}.csv", sensor=SENSORS)
    script:
        "src/data/download_dataset.R"

rule get_timezone_periods:
    input:
        # This line below does not work
        # "data/raw/{{db}}/{{user}}/timezone.csv"
        # These two lines work
        "data/raw/test1/ros/timezone.csv",
        "data/raw/test3/t02/timezone.csv"
    output:
        # This line below does not work
        # "data/processed/{{db}}/{{user}}/timezone_periods.csv"
        # These two lines work
        "data/processed/test1/ros/timezone_periods.csv",
        "data/processed/test3/t02/timezone_periods.csv"
    script:
        "src/data/get_timezone_periods.R"
I just realised that I was adding an extra pair of curly braces; inside the rule it should have been only {db} (see the illustration below).
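To spell out the difference: doubled braces are only needed inside expand(), where they escape a wildcard so that it survives the expansion; in a plain rule input/output string, single braces already denote wildcards. A small illustration:

# Inside expand(), {{db}} and {{user}} are escapes that survive expansion:
expand("data/raw/{{db}}/{{user}}/{sensor}.csv", sensor=SENSORS)
# -> ["data/raw/{db}/{user}/timezone.csv", "data/raw/{db}/{user}/touch.csv"]

# In a plain rule, single braces are already wildcards:
rule get_timezone_periods:
    input:
        "data/raw/{db}/{user}/timezone.csv"
    output:
        "data/processed/{db}/{user}/timezone_periods.csv"
    script:
        "src/data/get_timezone_periods.R"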

snakemake: how to implement log directive when using run directive?

Snakemake allows the creation of a log for each rule via the log directive, which specifies the name of the log file. It is relatively straightforward to pipe results from shell output to this log, but I am not able to figure out a way of logging the output of the run directive (i.e. Python code).
One workaround is to save the Python code in a script and then run it from the shell, but I wonder if there is another way?
I have some rules that use both the log and run directives. In the run directive, I "manually" open and write the log file.
For instance:
rule compute_RPM:
    input:
        counts_table = source_small_RNA_counts,
        summary_table = rules.gather_read_counts_summaries.output.summary_table,
        tags_table = rules.associate_small_type.output.tags_table,
    output:
        RPM_table = OPJ(
            annot_counts_dir,
            "all_{mapped_type}_on_%s" % genome, "{small_type}_RPM.txt"),
    log:
        log = OPJ(log_dir, "compute_RPM_{mapped_type}", "{small_type}.log"),
    benchmark:
        OPJ(log_dir, "compute_RPM_{mapped_type}", "{small_type}_benchmark.txt"),
    run:
        with open(log.log, "w") as logfile:
            logfile.write(f"Reading column counts from {input.counts_table}\n")
            counts_data = pd.read_table(
                input.counts_table,
                index_col="gene")
            logfile.write(f"Reading number of non-structural mappers from {input.summary_table}\n")
            norm = pd.read_table(input.summary_table, index_col=0).loc["non_structural"]
            logfile.write(str(norm))
            logfile.write("Computing counts per million non-structural mappers\n")
            RPM = 1000000 * counts_data / norm
            add_tags_column(RPM, input.tags_table, "small_type").to_csv(output.RPM_table, sep="\t")
For third-party code that writes to stdout, the redirect_stdout context manager could be helpful (found in https://stackoverflow.com/a/40417352/1878788, documented at https://docs.python.org/3/library/contextlib.html#contextlib.redirect_stdout).
Test snakefile, test_run_log.snakefile:
from contextlib import redirect_stdout

rule all:
    input:
        "test_run_log.txt"

rule test_run_log:
    output:
        "test_run_log.txt"
    log:
        "test_run_log.log"
    run:
        with open(log[0], "w") as log_file:
            with redirect_stdout(log_file):
                print(f"Writing result to {output[0]}")
                with open(output[0], "w") as out_file:
                    out_file.write("result\n")
Running it:
$ snakemake -s test_run_log.snakefile
Results:
$ cat test_run_log.log
Writing result to test_run_log.txt
$ cat test_run_log.txt
result
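If the third-party code also writes to stderr, contextlib.redirect_stderr can be stacked on top in the same way. A sketch of the same test rule with both streams captured (an untested variant of the example above):

from contextlib import redirect_stdout, redirect_stderr
import sys

rule test_run_log:
    output:
        "test_run_log.txt"
    log:
        "test_run_log.log"
    run:
        with open(log[0], "w") as log_file, \
                redirect_stdout(log_file), redirect_stderr(log_file):
            print(f"Writing result to {output[0]}")
            print("warnings would land here too", file=sys.stderr)
            with open(output[0], "w") as out_file:
                out_file.write("result\n")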
My solution was the following. It is useful both for a normal log and for logging exceptions with a traceback. You can wrap the logger setup in a function to make it more organized (a sketch of such a helper follows the rule below). It's not very pretty though; it would be much nicer if snakemake could do it by itself.
import logging

# some stuff

rule logging_test:
    input: 'input.json'
    output: 'output.json'
    log: 'rules_logs/logging_test.log'
    run:
        logger = logging.getLogger('logging_test')
        logger.setLevel(logging.INFO)  # without this, INFO messages are filtered out at the logger level
        fh = logging.FileHandler(str(log))
        fh.setLevel(logging.INFO)
        formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
        fh.setFormatter(formatter)
        logger.addHandler(fh)
        try:
            logger.info('Starting operation!')
            # do something
            with open(str(output), 'w') as f:
                f.write('success!')
            logger.info('Ended!')
        except Exception as e:
            logger.error(e, exc_info=True)
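A sketch of the logger-setup helper alluded to above (the function name is illustrative):

import logging

def setup_rule_logger(name, log_path):
    # Return a logger that writes INFO-level messages to log_path.
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    fh = logging.FileHandler(log_path)
    fh.setLevel(logging.INFO)
    fh.setFormatter(logging.Formatter('%(asctime)s - %(levelname)s - %(message)s'))
    logger.addHandler(fh)
    return logger

Each rule's run directive then reduces to logger = setup_rule_logger('logging_test', str(log)) followed by the try/except block.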

Parallel execution of a snakemake rule with same input and a range of values for a single parameter

I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all the parameter values at once, and fails.
My original bash script runs a program multiple times for a range of values of the parameter K.
for num in {1..3}
do
    structure.py -K $num --input=fileprefix --output=fileprefix
done
There are multiple input files that start with fileprefix, and there are two main outputs per run, e.g. for K=1 they are fileprefix.1.meanP and fileprefix.1.meanQ. My config and snakemake files are as follows.
Config:
$ cat config.yaml
infile: fileprefix
K:
  - 1
  - 2
  - 3
Snakemake:
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        """
        structure.py -K {params.K} \
        --input=output/{params.prefix} \
        --output=output/{params.prefix}
        """
This was executed with snakemake --cores 3; the problem persists even when I use only one thread.
I expected the outputs described above for each value of K, but the run fails with this error:

RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail; structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
  File "Snakefile", line 84, in __rule_Structure
  File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run

When I set K to a single value such as K = ['1'], everything works, so the problem seems to be that {params.K} expands to all values of K when the shell command is executed. I started teaching myself snakemake today, and it works really well, but I'm hitting a brick wall with this.
You need to retrieve the argument for -K from the wildcards, not from the config file. The config file is a plain Python dictionary and will simply return your whole list of possible values.
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        # K is taken from the wildcards in the shell command, not from the config
        prefix = config['infile']
    threads: 3
    shell:
        "structure.py -K {wildcards.K} "
        "--input=output/{params.prefix} "
        "--output=output/{params.prefix}"
Note that there are more things to improve here. For example, the rule structure does not define any input file, although it uses one; see the sketch below.
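A sketch of what declaring the input could look like, so that Snakemake tracks the dependency (the input file name is hypothetical, since the question only says that the inputs start with fileprefix):

rule structure:
    input:
        "output/{sample}.str"  # hypothetical; use whatever file structure.py actually reads
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    threads: 3
    shell:
        "structure.py -K {wildcards.K} "
        "--input=output/{wildcards.sample} "
        "--output=output/{wildcards.sample}"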
There is now also an option for parameter space exploration (a sketch follows):
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration
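A minimal sketch of that Paramspace feature applied to this case, based on the documented API (the output patterns are illustrative and would need to match what structure.py really writes):

from snakemake.utils import Paramspace
import pandas as pd

# one row per parameter combination; here just K = 1..3
paramspace = Paramspace(pd.DataFrame({"K": [1, 2, 3]}))

rule all:
    input:
        # instance_patterns yields one concrete pattern per row, e.g. "K~1"
        expand("output/fileprefix.{params}.meanQ", params=paramspace.instance_patterns)

rule structure:
    output:
        # wildcard_pattern is "K~{K}", so each job gets its own K
        f"output/fileprefix.{paramspace.wildcard_pattern}.meanQ"
    params:
        K = lambda wildcards: paramspace.instance(wildcards)["K"]
    shell:
        "structure.py -K {params.K} --input=output/fileprefix --output=output/fileprefix"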