Snakemake problem: all files are merged together with a space delimiter instead of being iterated over

I was trying to run a command which ideally looks like this:
minimap2 -a -x map-ont -t 20 /staging/reference.fasta fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample01.bam
I have multiple samples like fastq/sample01.fastq in the folder. However, the Snakemake file I wrote to automate this is passing all the files at once to a single command, like this:
minimap2 -a -x map-ont -t 1 /staging/reference.fasta fastq/sample02.fastq fastq/sample03.fastq fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample02.bam fastq_minon/sample03.bam fastq_minon/sample01.bam
I have pasted the code and log below. Please help me figure out this mistake.
Code
SAMPLES, = glob_wildcards("fastq/{smp}.fastq")

rule minimap:
    input:
        expand("fastq/{smp}.fastq", smp=SAMPLES)
    output:
        expand("fastq_minon/{smp}.bam", smp=SAMPLES)
    params:
        ref = FASTA
    threads: 40
    shell:
        """
        minimap2 -a -x map-ont -t {threads} {params.ref} {input} | samtools view -bS -F 4 - | samtools sort -o {output}
        """
Log
Building DAG of jobs...
Job counts:
count jobs
1 minimap
1
[Tue May 5 03:28:50 2020]
rule minimap:
input: fastq/sample02.fastq, fastq/sample03.fastq, fastq/sample01.fastq
output: fastq_minon/sample02.bam, fastq_minon/sample03.bam, fastq_minon/sample01.bam
jobid: 0
minimap2 -a -x map-ont -t 1 /staging/reference.fasta fastq/sample02.fastq fastq/sample03.fastq fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample02.bam fastq_minon/sample03.bam fastq_minon/sample01.bam
Job counts:
count jobs
1 minimap
1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

The expand function creates a list. Thus, in your rule minimap, you're telling Snakemake that you want all the fastq files as input of a single job, and that this one job will produce all the bam files at once. What you want is a rule that is triggered once for every sample, using a wildcard:
SAMPLES, = glob_wildcards("fastq/{smp}.fastq")

rule all:
    input: expand("fastq_minon/{smp}.bam", smp=SAMPLES)

rule minimap:
    input:
        "fastq/{smp}.fastq"
    output:
        "fastq_minon/{smp}.bam"
    params:
        ref = FASTA
    threads: 40
    shell:
        """
        minimap2 -a -x map-ont -t {threads} {params.ref} {input} | samtools view -bS -F 4 - | samtools sort -o {output}
        """
By listing all the wanted files in rule all, the rule minimap will be triggered as many times as necessary, each run creating ONE bam file from ONE fastq file.
Have a look at my answer to this question to understand the use of wildcards and expand.
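To make the difference concrete, here is a minimal, hypothetical illustration (assuming three samples) of what expand evaluates to inside a Snakefile. It is just a Python list, so a rule using it receives every file in one job, whereas a wildcard pattern such as "fastq/{smp}.fastq" is matched once for each output requested in rule all:

# Hypothetical snippet for a Snakefile, only to show what expand returns
SAMPLES = ["sample01", "sample02", "sample03"]

print(expand("fastq/{smp}.fastq", smp=SAMPLES))
# -> ['fastq/sample01.fastq', 'fastq/sample02.fastq', 'fastq/sample03.fastq']
# All three paths would end up in {input} of a single job.

# A wildcard pattern is resolved per requested output instead:
# asking for fastq_minon/sample01.bam runs minimap with smp="sample01" only.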

Related

snakemake - replacing command line parameters with wildcards by cluster profile

I am writing a Snakemake pipeline to eventually identify coronavirus variants.
Below is a minimal example with three steps:
LOGDIR = '/path/to/logDir'
barcodes = ['barcode49', 'barcode50', 'barcode51']

rule all:
    input:
        expand([
            # guppyplex
            "out/guppyplex/{barcode}/{barcode}.fastq",
            # catFasta
            "out/catFasta/cat_consensus.fasta",
            ], barcode = barcodes)

rule guppyplex:
    input:
        FQ = f"fastq/{{barcode}}"  # FASTQ_PATH is parsed from config.yaml
    output:
        "out/guppyplex/{barcode}/{barcode}.fastq"
    shell:
        "touch {output}"  # variables in CAPITALS are parsed from config.yaml

rule minion:
    input:
        INFQ = rules.guppyplex.output,
        FAST5 = f"fasta/{{barcode}}"
    params:
        OUTDIR = "out/nanopolish/{barcode}"
    output:
        "out/nanopolish/{barcode}/{barcode}.consensus.fasta"
    shell:
        """
        touch {output} && echo {wildcards.barcode} > {output}
        """

rule catFasta:
    input:
        expand("out/nanopolish/{barcode}/{barcode}.consensus.fasta", barcode = barcodes)
    output:
        "out/catFasta/cat_consensus.fasta"
    shell:
        "cat {input} > {output}"
If I run Snakemake locally by calling snakemake -p --cores 1 all, everything works. Yet my ultimate goal is to use qsub to run the jobs on a cluster. I also want the stderr and stdout from qsub to have meaningful names, which include the wildcards and the rule name for each job.
However, if I call snakemake with
snakemake -p --cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}" all
I will get the following error:
AttributeError: 'Wildcards' object has no attribute 'barcode'
I have recently read the Snakemake documentation, where it appears that I could replace the command line parameters (--cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}") with a yaml file, although the documentation is not all that clear to me.
I have created a config.yaml file at /home/user/.config/snakemake which looks like this:
cluster: 'qsub'
q: 'onlybngs05b'
e: '/home/ngs/tempOutSnakemake'
o: '/home/ngs/tempOutSnakemake'
j: 5
jobname: "{wildcards.barcode}.{rule}.{jobid}"
But then it appears that Snakemake is not properly parsing the config.yaml. I am getting:
snakemake: error: ambiguous option: --o=/home/ngs/tempOutSnakemake could match --omit-from, --output-wait, --overwrite-shellcmd
I also tried to replace o in the config file with stdout (a long form of the parameter, as with -h vs --help for many programs), but it does not work.
So my question is: how can I replace the command line parameters --cluster "qsub -q onlybngs05b -e {LOGDIR} -o {LOGDIR} -j y" -j 5 --jobname "{wildcards.barcode}.{rule}.{jobid}" with a config.yaml file that accepts wildcards?
I think the problem is that rule catFasta doesn't contain the wildcard barcode. If you think about it, what job name would you expect for it with {wildcards.barcode}.{rule}.{jobid}?
A solution could be to add to each rule a jobname parameter, set to {barcode} for guppyplex and minion and to 'all_barcodes' for catFasta, and then use --jobname "{params.jobname}.{rule}.{jobid}", roughly as sketched below.
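For illustration, here is a rough, untested sketch of that suggestion applied to two of the rules above (the jobname values are arbitrary labels):

rule guppyplex:
    input:
        FQ = f"fastq/{{barcode}}"
    output:
        "out/guppyplex/{barcode}/{barcode}.fastq"
    params:
        jobname = "{barcode}"        # expanded per sample from the wildcard
    shell:
        "touch {output}"

rule catFasta:
    input:
        expand("out/nanopolish/{barcode}/{barcode}.consensus.fasta", barcode = barcodes)
    output:
        "out/catFasta/cat_consensus.fasta"
    params:
        jobname = "all_barcodes"     # no {barcode} wildcard in this rule
    shell:
        "cat {input} > {output}"

and then submit with --jobname "{params.jobname}.{rule}.{jobid}" instead of --jobname "{wildcards.barcode}.{rule}.{jobid}".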

"Input files updated by another job" but not true

I have an issue with a pipeline and I don't know if it is misuse on my part or a bug. I am using snakemake 5.26.1. I did not have any problems until a few days ago, with the same version of snakemake, and I don't understand what changed.
Part of the pipeline comes just after a checkpoint rule and an aggregating step, which produces an output file, e.g. foo.fasta.
The rules following the production of foo.fasta use it as an input file, and they are run again with the --reason Input files updated by another job, but this is not true: their output is more recent than foo.fasta.
Additionally, with the --summary option the file foo.fasta is marked as no update, and the files have the right timestamp order, so those rules should not have to be executed again.
I cannot find the reason why the following rules are run again. Is it possible the checkpoint is causing an issue, making Snakemake think foo.fasta is updated when it is not?
Here is a dry-run output; rule agouti_scaffolding produces tros_v5.pseudohap.fasta.gz (my foo.fasta):
Building DAG of jobs...
Updating job 7 (aggregate_gff3).
Replace aggregate_gff3 with dynamic branch aggregate_gff3
updating depending job agouti_scaffolding
Updating job 3 (agouti_scaffolding).
Replace agouti_scaffolding with dynamic branch agouti_scaffolding
updating depending job unzip_fasta
updating depending job btk_prepare_workdir
updating depending job v6_clean_rename
updating depending job kat_comp
updating depending job assembly_stats
Job counts:
count jobs
1 kat_comp
1 kat_plot_spectra
2
[Tue Oct 27 12:08:41 2020]
rule kat_comp:
input: results/preprocessing/tros/tros_dedup_proc_fastp_R1_001.fastq.gz, results/preprocessing/tros/tros_dedup_proc_fastp_R2_001.fastq.gz, results/fasta/tros_v5.pseudohap.fasta.gz
output: results/kat/tros_v5/tros_v5_comp-main.mx
log: logs/kat_comp.tros_v5.log
jobid: 1
reason: Input files updated by another job: results/fasta/tros_v5.pseudohap.fasta.gz
wildcards: sample=tros, version=v5
[Tue Oct 27 12:08:41 2020]
rule kat_plot_spectra:
input: results/kat/tros_v5/tros_v5_comp-main.mx
output: results/kat/tros_v5/tros_v5_spectra.pdf
log: logs/kat_plot.tros_v5.log
jobid: 0
reason: Input files updated by another job: results/kat/tros_v5/tros_v5_comp-main.mx
wildcards: sample=tros, version=v5
Job counts:
count jobs
1 kat_comp
1 kat_plot_spectra
2
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
It seems updating depending job kat_comp is the issue.
Even adding the ancient flag to the kat_comp input does not change anything.
EDIT
Here is part of the pipeline.
As @TroyComi suggested, the latest development version (5.26.1+26.gc2e2b501.dirty) solves the issue with the ancient directive. When the ancient directive is removed, however, the rules are still executed.
rule all:
    input:
        "results/kat/tros_v5/tros_v5_spectra.pdf"

checkpoint split_fa_augustus:
    input:
        "results/fasta/{sample}_v4.pseudohap.fasta.gz"
    output:
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        directory("results/agouti/{sample}/split")
    params:
        split_size = 50000000
    conda:
        "../envs/augustus.yaml"
    shell:
        """
        zcat {input} > {output[0]}
        mkdir {output[1]}
        splitMfasta.pl {output[0]} \
            --outputpath={output[1]} --minsize={params.split_size}
        """

rule augustus:
    input:
        "results/agouti/{sample}/split/{sample}_v4.pseudohap.split.{i}.fa"
    output:
        "results/agouti/{sample}/split/pred_{i}.gff3"
    conda:
        "../envs/augustus.yaml"
    shell:
        """
        augustus --gff3=on --species=caenorhabditis {input} > {output}
        """

def aggregate_input_gff3(wildcards):
    checkpoint_output = checkpoints.split_fa_augustus.get(**wildcards).output[1]
    return expand("results/agouti/{sample}/split/pred_{i}.gff3",
                  sample=wildcards.sample,
                  i=glob_wildcards(os.path.join(checkpoint_output, f"{wildcards.sample}_v4.pseudohap.split." + "{i}.fa")).i)

rule aggregate_gff3:
    input:
        aggregate_input_gff3
    output:
        "results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
    conda:
        "../envs/augustus.yaml"
    shell:
        "cat {input} | join_aug_pred.pl > {output}"
#===============================
# Preprocess RNAseq data
# The RNAseq reads need to be in the folder resources/RNAseq_raw/{sample}
# Files must be named {run}_R1.fastq.gz and {run}_R2.fastq.gz for globbing to work
# globbing is done in the rule merge_RNA_bams
rule rna_rcorrector:
    input:
        expand("resources/RNAseq_raw/{{sample}}/{{run}}_{R}.fastq.gz",
               R=['R1', 'R2'])
    output:
        temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
                    R=['R1', 'R2']))
    params:
        outdir = lambda w, output: os.path.dirname(output[0])
    log:
        "logs/rcorrector_{sample}_{run}.log"
    threads:
        config['agouti']['threads']
    conda:
        "../envs/rna_seq.yaml"
    shell:
        """
        run_rcorrector.pl -1 {input[0]} -2 {input[1]} \
            -t {threads} \
            -od {params.outdir} \
            > {log} 2>&1
        """

rule rna_trimgalore:
    input:
        expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_{R}.cor.fq.gz",
               R=['R1', 'R2'])
    output:
        temp(expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
                    i=['1', '2']))
    params:
        outdir = lambda w, output: os.path.dirname(output[0]),
        basename = lambda w: f'{w.run}_trimgal'
    log:
        "logs/trimgalore_{sample}_{run}.log"
    threads:
        config['agouti']['threads']
    conda:
        "../envs/rna_seq.yaml"
    shell:
        """
        trim_galore --cores {threads} \
            --phred33 \
            --quality 20 \
            --stringency 1 \
            -e 0.1 \
            --length 70 \
            --output_dir {params.outdir} \
            --basename {params.basename} \
            --dont_gzip \
            --paired \
            {input} \
            > {log} 2>&1
        """
#===============================
# Map the RNAseq reads
rule index_ref:
    input:
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa"
    output:
        multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
                 ".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
    conda:
        "../envs/mapping.yaml"
    shell:
        "bwa-mem2 index {input}"

rule map_RNAseq:
    input:
        expand("results/agouti/{{sample}}/RNA_preproc/{{run}}_trimgal_val_{i}.fq",
               i=['1', '2']),
        "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        multiext("results/agouti/{sample}/{sample}_v4.pseudohap.fa",
                 ".0123", ".amb", ".ann", ".bwt.2bit.64", ".bwt.8bit.32", ".pac")
    output:
        "results/agouti/{sample}/mapping/{run}.bam"
    log:
        "logs/bwa_rna_{sample}_{run}.log"
    conda:
        "../envs/mapping.yaml"
    threads:
        config['agouti']['threads']
    shell:
        """
        bwa-mem2 mem -t {threads} {input[2]} {input[0]} {input[1]} 2> {log} \
            | samtools view -b -@ {threads} -o {output}
        """

def get_sample_rna_runs(w):
    list_R1_files = glob.glob(f"resources/RNAseq_raw/{w.sample}/*_R1.fastq.gz")
    list_runs = [re.sub('_R1\.fastq\.gz$', '', os.path.basename(f)) for f in list_R1_files]
    return [f'results/agouti/{w.sample}/mapping/{run}.bam' for run in list_runs]

rule merge_RNA_bams:
    input:
        get_sample_rna_runs
    output:
        "results/agouti/{sample}/RNAseq_mapped_merged.bam"
    params:
        tmp_merge = lambda w: f'results/agouti/{w.sample}/tmp_merge.bam'
    conda:
        "../envs/mapping.yaml"
    threads:
        config['agouti']['threads']
    shell:
        """
        samtools merge -@ {threads} {params.tmp_merge} {input}
        samtools sort -@ {threads} -n -o {output} {params.tmp_merge}
        rm {params.tmp_merge}
        """
#===============================
# Run agouti on all that
rule agouti_scaffolding:
    input:
        fa = "results/agouti/{sample}/{sample}_v4.pseudohap.fa",
        bam = "results/agouti/{sample}/RNAseq_mapped_merged.bam",
        gff = "results/agouti/{sample}/{sample}_v4.pseudohap.gff3"
    output:
        protected("results/fasta/{sample}_v5.pseudohap.fasta.gz")
    params:
        outdir = lambda w: f'results/agouti/{w.sample}/agouti_out',
        minMQ = 20,
        maxFracMM = 0.05
    log:
        "logs/agouti_{sample}.log"
    conda:
        "../envs/agouti.yaml"
    shell:
        """
        python /opt/agouti/agouti.py scaffold \
            -assembly {input.fa} \
            -bam {input.bam} \
            -gff {input.gff} \
            -outdir {params.outdir} \
            -minMQ {params.minMQ} -maxFracMM {params.maxFracMM} \
            > {log} 2>&1
        gzip -c {params.outdir}/agouti.agouti.fasta > {output}
        """
#===============================================================
# Now do something on output {sample}_{version}.pseudohap.fasta.gz
rule kat_comp:
    input:
        expand("results/preprocessing/{{sample}}/{{sample}}_dedup_proc_fastp_{R}_001.fastq.gz",
               R=["R1", "R2"]),
        ancient("results/fasta/{sample}_{version}.pseudohap.fasta.gz")
    output:
        "results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
    params:
        outprefix = lambda w: f'results/kat/{w.sample}_{w.version}/{w.sample}_{w.version}_comp'
    log:
        "logs/kat_comp.{sample}_{version}.log"
    conda:
        "../envs/kat.yaml"
    threads:
        16
    shell:
        """
        kat comp -t {threads} \
            -o {params.outprefix} \
            '{input[0]} {input[1]}' \
            {input[2]} \
            > {log} 2>&1
        """

rule kat_plot_spectra:
    input:
        "results/kat/{sample}_{version}/{sample}_{version}_comp-main.mx"
    output:
        "results/kat/{sample}_{version}/{sample}_{version}_spectra.pdf"
    params:
        title = lambda w: f'{w.sample}_{w.version}'
    log:
        "logs/kat_plot.{sample}_{version}.log"
    conda:
        "../envs/kat.yaml"
    shell:
        """
        kat plot spectra-cn \
            -o {output} \
            -t {params.title} \
            {input} \
            > {log} 2>&1
        """

Snakemake files in multiple directories

My {dir} is a variable, e.g. /nameX/tissueX/trimmed.
This is the code I use:
HISAT2_INDEX_PREFIX = "/index/genome_chromosomes"
directories, SAMPLES = glob_wildcards('/test/{dir}/{sample}_1.fastq.gz')

rule all:
    input:
        expand("{dir}/{sample}.bam", zip, dir=directories, sample=SAMPLES)

rule hisat2:
    input:
        hisat2_index=expand("%s.{ix}.ht2l" % HISAT2_INDEX_PREFIX, ix=range(1, 9)),
        fastq1="/test/{dir}/{sample}_1.fastq.gz",
        fastq2="/test/{dir}/{sample}_2.fastq.gz"
    output:
        bam = "{dir}/{sample}.bam",
        txt = "{dir}/{sample}.txt",
    log: "{dir}/{sample}.snakemake_log.txt"
    threads: 2
    shell:
        "hisat2 -p {threads} -x {HISAT2_INDEX_PREFIX}"
        " -1 {input.fastq1} -2 {input.fastq2} --summary-file {output.txt} | "
        "samtools sort -@ {threads} -o {output.bam}"
How can I modify this so that each bam file gets the nameX prefix and all bam files are saved in the same directory? And how can I create one single bam file per nameX?

snakemake running single jobs in parallel from all files in folder

My problem is related to Running parallel instances of a single job/rule on Snakemake, but I believe it is different.
I cannot create an all: rule for it in advance because the folder of input files will be created by a previous rule and depends on the user's initial data.
pseudocode
rule1: get a big file (OK)
rule2: split the file in parts in Split folder (OK)
rule3: run a program on each file created in Split
I am now at rule3 with Split containing 70 files like
Split/file_001.fq
Split/file_002.fq
..
Split/file_069.fq
Could you please help me create a rule for pigz that compresses the 70 files in parallel into 70 .gz files?
I am running with snakemake -j 24 ZipSplit.
config["pigt"] gives 4 threads for each compression job and I give 24 threads to Snakemake, so I expect 6 parallel compressions, but my current rule passes all the inputs to one single job instead of parallelizing!
Should I build the full list of inputs in the rule? How?
# parallel job
files, = glob_wildcards("Split/{x}.fq")

rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I tried to define the input directly with
input: glob_wildcards("Split/{x}.fq")
but a syntax error occurs. Here is the full Snakefile for context:
# InSilico_PCR Snakefile
import os
import re
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

# source config variables
configfile: "config.yaml"

# single job
rule GetRawData:
    input:
        HTTP.remote(os.path.join(config["host"], config["infile"]), keep_local=True, allow_redirects=True)
    output:
        os.path.join("RawData", config["infile"])
    run:
        shell("cp {input} {output}")

# single job
rule SplitFastq:
    input:
        os.path.join("RawData", config["infile"])
    params:
        lines_per_file = config["lines_per_file"]
    output:
        pfx = os.path.join("Split", config["infile"] + "_")
    shell:
        """
        zcat {input} | split --numeric-suffixes --additional-suffix=.fq -a 3 -l {params.lines_per_file} - {output.pfx}
        """

# parallel job
files, = glob_wildcards("Split/{x}.fq")

rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I think the example below should do it, using checkpoints as suggested by @Maarten-vd-Sande.
However, in your particular case of splitting a big file and compressing the output on the fly, you may be better off using the --filter option of split, as in:
split -a 3 -d -l 4 --filter='gzip -c > $FILE.fastq.gz' bigfile.fastq split/
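If you go that route, the split-and-compress step can still live inside the workflow as a single rule, roughly like this (a sketch only; the output directory name and the -l value are placeholders to adapt):

rule split_and_compress:
    input:
        "{sample}.fastq"
    output:
        directory("split_gz/{sample}")
    shell:
        r"""
        mkdir -p {output}
        split -a 3 -d -l 4 --filter='gzip -c > $FILE.fastq.gz' {input} {output}/
        """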
For the checkpoint-based Snakemake solution below, assuming your input file is called bigfile.fastq, the split and compressed output will be in the directory splitting/bigfile/:
rule all:
    input:
        expand("{sample}.split.done", sample=['bigfile']),

checkpoint splitting:
    input:
        "{sample}.fastq"
    output:
        directory("splitting/{sample}")
    shell:
        r"""
        mkdir splitting/{wildcards.sample}
        split -a 3 -d --additional-suffix .fastq -l 4 {input} splitting/{wildcards.sample}/
        """

rule compress:
    input:
        "splitting/{sample}/{i}.fastq",
    output:
        "splitting/{sample}/{i}.fastq.gz",
    shell:
        r"""
        gzip -c {input} > {output}
        """

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.splitting.get(**wildcards).output[0]
    return expand("splitting/{sample}/{i}.fastq.gz",
                  sample=wildcards.sample,
                  i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fastq")).i)

rule all_done:
    input:
        aggregate_input
    output:
        touch("{sample}.split.done")
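To keep the original goal of compressing with pigz and config["pigt"] threads per job, the compress rule above could be adapted along these lines (untested sketch):

rule compress:
    input:
        "splitting/{sample}/{i}.fastq",
    output:
        "splitting/{sample}/{i}.fastq.gz",
    threads:
        config["pigt"]
    shell:
        r"""
        pigz -c -p {threads} {input} > {output}
        """

With snakemake -j 24 and 4 threads per job, Snakemake would then run up to 6 compressions in parallel, as intended.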

rule not picked up by snakemake

I'm starting out with Snakemake. I managed to define some rules which I can run independently, but not in a workflow. Maybe the issue is that they have unrelated inputs and outputs.
My current workflow is like this:
configfile: './config.yaml'

rule all:
    input: dynamic("task/{job}/taskOutput.tab")

rule split_input:
    input: "input_fasta/snp.fa"
    output: dynamic("task/{job}/taskInput.fa")
    shell:
        "rm -Rf tasktmp task; \
mkdir tasktmp task; \
split -l 200 -d {input} ./tasktmp/; \
ls tasktmp | awk '{{print \"mkdir task/\"$0}}' | sh; \
ls tasktmp | awk '{{print \"mv ./tasktmp/\"$0\" ./task/\"$0\"/taskInput.fa\"}}' | sh"

rule task:
    input: "task/{job}/taskInput.fa"
    output: "task/{job}/taskOutput.tab"
    shell: "cp {input} {output}"

rule make_parameter_file:
    output:
        "par/parameters.txt"
    shell:
        "rm -Rf par; mkdir par; \
echo \"\
minimumFlankLength=5\n\
maximumFlankLength=200\n\
alignmentLengthDifference=2\n\
allowedMismatch=4\n\
allowedProxyMismatch=2\n\
allowedIndel=3\n\
ambiguitiesAsMatch=1\n\" \
> par/parameters.txt"

rule build_target:
    input:
        "./my_target"
    output:
        touch("build_target.done")
    shell:
        "build_target -template format_nt -source {input} -target my_target"
If I call this as such:
snakemake -p -s snakefile
the first three rules are executed, but the others are not.
I can run the last rule by specifying it as a target:
snakemake -p -s snakefile build_target
But I don't see how I can run them all.
Thanks a lot for any suggestion on how to solve this.
By default, Snakemake executes only the first rule of a Snakefile; here that is rule all. In order to produce rule all's input dynamic("task/{job}/taskOutput.tab"), it needs to run the two rules task and split_input, and so it does.
If you want the other rules to be run as well, you should add their outputs to rule all, e.g.:
rule all:
    input:
        dynamic("task/{job}/taskOutput.tab"),
        "par/parameters.txt",
        "build_target.done"