I have two parameters that I am trying to iterate over in a Snakemake workflow. This is normally feasible using expand; however, I want to use one of the expanded parameters inside the shell command:
STEPSMSA = ["0", "10", "20", "30", "40"]
LENGTHS = ["30", "35", "40", "45", "50"]

rule sim:
    input: expand("simulations/gen_{step}.fa", step=STEPSMSA)
    output: expand("simulations/gen_{step}_n1000000_l{len}.fa", step=STEPSMSA, len=LENGTHS)
    shell: expand("simulate -n 10000000 -l {len} {input} | gzip > {output}", len=LENGTHS)
Note that simulate is an in-house UNIX program that is available on the PATH. What I would like to run is:
simulate -n 10000000 -l 30 simulations/gen_0.fa |gzip > simulations/gen_0_n1000000_l30.fa
simulate -n 10000000 -l 35 simulations/gen_0.fa |gzip > simulations/gen_0_n1000000_l35.fa
simulate -n 10000000 -l 40 simulations/gen_0.fa |gzip > simulations/gen_0_n1000000_l40.fa
...
simulate -n 10000000 -l 50 simulations/gen_40.fa |gzip > simulations/gen_40_n1000000_l50.fa
Is there any way to specify this using Snakemake?
So you want simulations/gen_{step}_n1000000_l{len}.fa for all combinations of step and len, right? If so, you could do:
STEPSMSA = ["0", "10", "20", "30", "40"]
LENGTHS = ["30", "35", "40", "45", "50"]

rule all:
    input:
        expand("simulations/gen_{step}_n1000000_l{len}.fa", step=STEPSMSA, len=LENGTHS)

rule sim:
    input:
        "simulations/gen_{step}.fa"
    output:
        "simulations/gen_{step}_n1000000_l{len}.fa"
    shell:
        "simulate -n 10000000 -l {wildcards.len} {input} | gzip > {output}"
Below is a pseudocode version of my snakefile. Rule A creates the input files for rule B2, and I would like rules B1 and B2 to run at the same time, but I am not having success. I can run this snakefile on very small data without a problem (although rules B1 and B2 do not run in parallel), but once I give it larger data it fails to create the output of rule B1. The commands in rules B1 and B2 use the same program but with different arguments and input files, so I didn't think they should be in the same rule.
rule all:
    input: file_A_out, file_B1_out, file_B2_out, file_C_out

rule A:
    input: file_A_in
    output: file_A_out
    log: file_A_log
    shell: 'progA {input} --output {output}'

rule B1:
    input: file_B1_in
    output: file_B1_out
    group: 'groupB'
    log: file_B1_log
    shell: 'progB {input} -x 100 -o {output}'

rule B2:
    input: file_A_out
    output: file_B2_out
    group: 'groupB'
    log: file_B2_log
    shell: 'progB {input} -x 1 --y -o {output}'

rule C:
    input: file_B1_out, file_B2_out
    output: file_C_out
    log: file_C_log
    shell: 'progC {input[0]} {input[1]} -o {output}'
I thought using group would indicate to Snakemake that the two rules can be run at once. To execute snakemake I run nohup snakemake --cores 16 > log.txt 2>&1 &; however, it only successfully runs rule B2, while the output of rule B1 is deemed corrupted. I have seen solutions for running one rule in parallel, but what about running different rules in parallel?
Error in rule B1:
    jobid: 2
    input: 'file_B1_in'
    output: 'file_B1_out'
    log: 'file_B1_log' (check log file(s) for error details)
    shell: 'progB {input} -x 100 -o {output}'
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Removing output files of failed job B1 since they might be corrupted:
file_B1_out
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
The snakefile below runs rules A, B1, and B2 in parallel and then runs rule C, as expected. (Note that group mainly affects how jobs are bundled together for cluster submission; when running locally, parallelism simply comes from --cores.) Maybe there is something you are not showing us?
# Make dummy input files
touch file_A_in file_B1_in
# Run pipeline
snakemake -p -j 10
The snakefile:
rule all:
    input: 'file_A_out', 'file_B1_out', 'file_B2_out', 'file_C_out'

rule A:
    input: 'file_A_in'
    output: 'file_A_out'
    shell: 'sleep 10; echo {input} > {output}'

rule B1:
    input: 'file_B1_in'
    output: 'file_B1_out'
    shell: 'sleep 10; echo {input} > {output}'

rule B2:
    input: 'file_A_in'
    output: 'file_B2_out'
    shell: 'sleep 10; echo {input} > {output}'

rule C:
    input: 'file_B1_out', 'file_B2_out'
    output: 'file_C_out'
    shell: 'sleep 10; echo {input[0]} {input[1]} > {output}'
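One more thing worth checking, as a guess since the real commands are not shown: the log: directive does not capture anything on its own; the shell command has to write to {log} explicitly, otherwise the log file stays empty when the job fails. A minimal sketch of rule B1 with stderr redirected to the log (file names as in your pseudocode):

rule B1:
    input: file_B1_in
    output: file_B1_out
    group: 'groupB'
    log: file_B1_log
    # redirect stderr so "check log file(s) for error details" actually has something to show
    shell: 'progB {input} -x 100 -o {output} 2> {log}'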
I am trying to run a command that ideally looks like this:
minimap2 -a -x map-ont -t 20 /staging/reference.fasta fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample01.bam
Similarly, I have multiple samples (like fastq/sample01.fastq above) in the folder.
However, the Snakemake file I wrote to automate this passes all the files to a single command at once, like:
minimap2 -a -x map-ont -t 1 /staging/reference.fasta fastq/sample02.fastq fastq/sample03.fastq fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample02.bam fastq_minon/sample03.bam fastq_minon/sample01.bam
I have pasted the code and log below. Please help me figure out this mistake.
Code
SAMPLES, = glob_wildcards("fastq/{smp}.fastq")

rule minimap:
    input:
        expand("fastq/{smp}.fastq", smp=SAMPLES)
    output:
        expand("fastq_minon/{smp}.bam", smp=SAMPLES)
    params:
        ref = FASTA
    threads: 40
    shell:
        """
        minimap2 -a -x map-ont -t {threads} {params.ref} {input} | samtools view -bS -F 4 - | samtools sort -o {output}
        """
Log
Building DAG of jobs...
Job counts:
    count   jobs
    1       minimap
    1

[Tue May 5 03:28:50 2020]
rule minimap:
    input: fastq/sample02.fastq, fastq/sample03.fastq, fastq/sample01.fastq
    output: fastq_minon/sample02.bam, fastq_minon/sample03.bam, fastq_minon/sample01.bam
    jobid: 0

    minimap2 -a -x map-ont -t 1 /staging/reference.fasta fastq/sample02.fastq fastq/sample03.fastq fastq/sample01.fastq | samtools view -bS -F 4 - | samtools sort -o fastq_minon/sample02.bam fastq_minon/sample03.bam fastq_minon/sample01.bam
Job counts:
    count   jobs
    1       minimap
    1
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
The expand function creates a list. Thus, in your rule minimap, you're telling Snakemake that you want all the fastq files as input of a single job, and that this one job will produce all the bam files. What you want instead is a rule that is triggered once per sample, using a wildcard:
SAMPLES, = glob_wildcards("fastq/{smp}.fastq")

rule all:
    input: expand("fastq_minon/{smp}.bam", smp=SAMPLES)

rule minimap:
    input:
        "fastq/{smp}.fastq"
    output:
        "fastq_minon/{smp}.bam"
    params:
        ref = FASTA
    threads: 40
    shell:
        """
        minimap2 -a -x map-ont -t {threads} {params.ref} {input} | samtools view -bS -F 4 - | samtools sort -o {output}
        """
By listing all the wanted files in rule all, the rule minimap will be triggered as many times as necessary, creating ONE bam file from ONE fastq file each time.
Have a look at my answer to this question to understand the use of wildcards and expand.
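As a quick sanity check (assuming three fastq files are present, as in your log), a dry run of the corrected snakefile should now schedule one minimap job per sample rather than a single job holding all of them:

snakemake -n -p
# expected job counts (roughly):
#     count   jobs
#     1       all
#     3       minimap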
My problem is related to Running parallel instances of a single job/rule on Snakemake, but I believe it is different.
I cannot create an all: rule for it in advance because the folder of input files is created by a previous rule and depends on the user's initial data.
pseudocode
rule1: get a big file (OK)
rule2: split the file in parts in Split folder (OK)
rule3: run a program on each file created in Split
I am now at rule3 with Split containing 70 files like
Split/file_001.fq
Split/file_002.fq
..
Split/file_069.fq
Could you please help me create a rule that runs pigz to compress the 70 files in parallel into 70 .gz files?
I am running with snakemake -j 24 ZipSplit.
config["pigt"] gives 4 threads for each compression job and I give 24 threads to snakemake, so I expect 6 parallel compressions, but my current rule merges all the inputs into a single job instead of parallelizing!?
Should I build the list of inputs fully in the rule? How?
# parallel job
files, = glob_wildcards("Split/{x}.fq")

rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I tried to define the input directly with
input: glob_wildcards("Split/{x}.fq")
but a syntax error occurs.
# InSilico_PCR Snakefile
import os
import re
from snakemake.remote.HTTP import RemoteProvider as HTTPRemoteProvider

HTTP = HTTPRemoteProvider()

# source config variables
configfile: "config.yaml"

# single job
rule GetRawData:
    input:
        HTTP.remote(os.path.join(config["host"], config["infile"]), keep_local=True, allow_redirects=True)
    output:
        os.path.join("RawData", config["infile"])
    run:
        shell("cp {input} {output}")

# single job
rule SplitFastq:
    input:
        os.path.join("RawData", config["infile"])
    params:
        lines_per_file = config["lines_per_file"]
    output:
        pfx = os.path.join("Split", config["infile"] + "_")
    shell:
        """
        zcat {input} | split --numeric-suffixes --additional-suffix=.fq -a 3 -l {params.lines_per_file} - {output.pfx}
        """

# parallel job
files, = glob_wildcards("Split/{x}.fq")

rule ZipSplit:
    input: expand("Split/{x}.fq", x=files)
    threads: config["pigt"]
    shell:
        """
        pigz -k -p {threads} {input}
        """
I think the example below should do it, using checkpoints as suggested by @Maarten-vd-Sande.
However, in your particular case of splitting a big file and compressing the output on the fly, you may be better off using the --filter option of split, which pipes each chunk through the given command so no intermediate uncompressed files are written:
split -a 3 -d -l 4 --filter='gzip -c > $FILE.fastq.gz' bigfile.fastq split/
The snakemake solution, assuming your input file is called bigfile.fastq: the split and compressed output will be in directory splitting/bigfile/.
import os  # needed by aggregate_input below

rule all:
    input:
        expand("{sample}.split.done", sample=['bigfile']),

checkpoint splitting:
    input:
        "{sample}.fastq"
    output:
        directory("splitting/{sample}")
    shell:
        r"""
        mkdir splitting/{wildcards.sample}
        split -a 3 -d --additional-suffix .fastq -l 4 {input} splitting/{wildcards.sample}/
        """

rule compress:
    input:
        "splitting/{sample}/{i}.fastq",
    output:
        "splitting/{sample}/{i}.fastq.gz",
    shell:
        r"""
        gzip -c {input} > {output}
        """

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.splitting.get(**wildcards).output[0]
    return expand("splitting/{sample}/{i}.fastq.gz",
                  sample=wildcards.sample,
                  i=glob_wildcards(os.path.join(checkpoint_output, "{i}.fastq")).i)

rule all_done:
    input:
        aggregate_input
    output:
        touch("{sample}.split.done")
My code is as follows; I have set rule all:
rule all:
    input:
        expand("data/sam/{sample}.sam", sample=SAMPLE_NAMES)

rule trimmomatic:
    input:
        "data/samples/{sample}.fastq"
    output:
        "data/samples/{sample}.clean.fastq"
    shell:
        "trimmomatic SE -threads 5 -phred33 -trimlog trim.log {input} {output} LEADING:20 TRAILING:20 MINLEN:16"

rule hisat2:
    input:
        fa="data/genome.fa",
        fastq="data/samples/{sample}.clean.fastq"
    output:
        "data/sam/{sample}.sam"
    shell:
        "hisat2-build {input.fa} ./index/geneindex | hisat2 -x - -q samples/{inout.fastq} -S {output}"
but it still shows:
Nothing to be done.
I have tried to find a way out, but without success. Help!
I'm starting with snakemake. I managed to define some rules which I can run independently, but not in a workflow. Maybe the issue is that they have unrelated inputs and outputs.
My current workflow is like this:
configfile: './config.yaml'

rule all:
    input: dynamic("task/{job}/taskOutput.tab")

rule split_input:
    input: "input_fasta/snp.fa"
    output: dynamic("task/{job}/taskInput.fa")
    shell:
        "rm -Rf tasktmp task; \
        mkdir tasktmp task; \
        split -l 200 -d {input} ./tasktmp/; \
        ls tasktmp | awk '{{print \"mkdir task/\"$0}}' | sh; \
        ls tasktmp | awk '{{print \"mv ./tasktmp/\"$0\" ./task/\"$0\"/taskInput.fa\"}}' | sh"

rule task:
    input: "task/{job}/taskInput.fa"
    output: "task/{job}/taskOutput.tab"
    shell: "cp {input} {output}"

rule make_parameter_file:
    output:
        "par/parameters.txt"
    shell:
        "rm -Rf par; mkdir par; \
        echo \"\
        minimumFlankLength=5\n\
        maximumFlankLength=200\n\
        alignmentLengthDifference=2\n\
        allowedMismatch=4\n\
        allowedProxyMismatch=2\n\
        allowedIndel=3\n\
        ambiguitiesAsMatch=1\n\" \
        > par/parameters.txt"

rule build_target:
    input:
        "./my_target"
    output:
        touch("build_target.done")
    shell:
        "build_target -template format_nt -source {input} -target my_target"
If I call this as such:
snakemake -p -s snakefile
The first three rules are executed; the others are not.
I can run the last rule by specifying it as an argument:
snakemake -p -s snakefile build_target
But I don't see how I can run them all.
Thanks a lot for any suggestion on how to solve this.
By default snakemake executes only the first rule of a snakefile; here that is rule all. In order to produce rule all's input dynamic("task/{job}/taskOutput.tab"), it needs to run the two rules task and split_input, and so it does.
If you want the other rules to be run as well, you should put their output in rule all, e.g.:
rule all:
    input:
        dynamic("task/{job}/taskOutput.tab"),
        "par/parameters.txt",
        "build_target.done"