Snakemake perform rules based on configs - snakemake

I want to write a workflow in which I can choose which optional rules to run via the config.json file. For example, suppose I have a Snakefile with two rules, rule_a and rule_b, each with the same input but different outputs:
rule rule_a:
    input: "input.txt"
    output: "out_a.txt"
    run: ...
rule rule_b:
    input: "input.txt"
    output: "out_b.txt"
    run: ...
And I have the following configurations in the json file:
{
"run_a": "T",
"run_b": "F"
}
How do I write the Snakefile so that in this case only rule_a will be run while rule_b will be ignored?

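Because a Snakefile can contain arbitrary Python code, you can use Python to decide which files need to be created. Note that the flags below are JSON booleans rather than the strings "T" and "F": any non-empty string, including "F", is truthy in Python.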
Configfile config.json:
{
"run_a": true,
"run_b": false
}
Snakefile:
configfile: "config.json"
# Request the output of every rule that is enabled in the config.
targets = []
if config['run_a']:
    targets.append('out_a.txt')
if config['run_b']:
    targets.append('out_b.txt')
rule all:
    input:
        targets
rule a:
    input: 'input.txt'
    output: 'out_a.txt'
    shell:
        "touch {output}"
rule b:
    input: 'input.txt'
    output: 'out_b.txt'
    shell:
        "touch {output}"
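The selection logic can also be written as a plain Python function, which may scale better once there are more than two optional rules. This is only a minimal sketch: the RULE_OUTPUTS mapping and the select_targets helper are hypothetical names chosen for illustration, not part of Snakemake.

```python
import json

# Hypothetical mapping from config flag to the output of the rule it enables.
RULE_OUTPUTS = {
    "run_a": "out_a.txt",
    "run_b": "out_b.txt",
}


def select_targets(config):
    """Return the outputs of every rule whose flag is true in the config."""
    return [output for flag, output in RULE_OUTPUTS.items() if config.get(flag, False)]


# Same content as the config.json shown above.
config = json.loads('{"run_a": true, "run_b": false}')
print(select_targets(config))
```

Inside a Snakefile, the returned list could be passed as the input of rule all, so that Snakemake only schedules the rules needed to produce the selected files.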

Related

Nextflow name collision

I have files with identical names but in different folders. Nextflow stages these files into the same work directory resulting in name collisions. My question is how to deal with that without renaming the files. Example:
# Example data
mkdir folder1 folder2
echo 1 > folder1/file.txt
echo 2 > folder2/file.txt
# We read from samplesheet
$ cat samplesheet.csv
sample,file
sample1,/home/atpoint/foo/folder1/file.txt
sample1,/home/atpoint/foo/folder2/file.txt
# Nextflow main.nf
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
// Read samplesheet and group files by sample (first column)
samplesheet = Channel
    .fromPath(params.samplesheet)
    .splitCsv(header:true)
    .map {
        sample = it['sample']
        file = it['file']
        tuple(sample, file)
    }
ch_samplesheet = samplesheet.groupTuple(by:0)
// That creates a tuple like:
// [sample1, [/home/atpoint/foo/folder1/file.txt, /home/atpoint/foo/folder2/file.txt]]
// Dummy process that stages both files into the same work directory folder
process PRO {
    input:
    tuple val(samplename), path(files)
    output:
    path("out.txt")
    script:
    """
    echo $samplename with files $files > out.txt
    """
}
workflow { PRO(ch_samplesheet) }
# Run it
NXF_VER=21.10.6 nextflow run main.nf --samplesheet $(realpath samplesheet.csv)
...obviously resulting in:
N E X T F L O W ~ version 21.10.6
Launching `main.nf` [adoring_jennings] - revision: 87f26fa90b
[- ] process > PRO -
Error executing process > 'PRO (1)'
Caused by:
Process `PRO` input file name collision -- There are multiple input files for each of the following file names: file.txt
So, what now? The real world application here is sequencing replicates of the same fastq file, which then have the same name, but are in different folders, and I want to feed them into a process that merges them. I am aware of this section in the docs but cannot say that any of it was helpful or that I understand it properly.
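You can use the stageAs option of the path input qualifier in your process definition. The pattern "?/*" stages each file under its original name in its own numbered subdirectory (1/file.txt, 2/file.txt, ...), so identical file names no longer collide.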
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
samplesheet = Channel
    .fromPath(params.samplesheet)
    .splitCsv(header:true)
    .map {
        sample = it['sample']
        file = it['file']
        tuple(sample, file)
    }
    .groupTuple()
    .set { ch_samplesheet }
// [sample1, [/path/to/folder1/file.txt, /path/to/folder2/file.txt]]
process PRO {
    input:
    tuple val(samplename), path(files, stageAs: "?/*")
    output:
    path("out.txt")
    // A script block is used because ${input_str} is interpolated by Nextflow.
    script:
    def input_str = files instanceof List ? files.join(" ") : files
    """
    cat ${input_str} > out.txt
    """
}
workflow { PRO(ch_samplesheet) }
See an example from nf-core and the path input type docs
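To make the effect of the "?/*" pattern more concrete, here is a small Python sketch of the idea: each input is linked under its original name inside a numbered subdirectory. It is an illustration only, not Nextflow's actual staging code, and the stage_into_numbered_dirs helper is a hypothetical name.

```python
import os
import tempfile


def stage_into_numbered_dirs(paths, work_dir):
    """Symlink each input into work_dir/<index>/<original name>, counting from 1."""
    staged = []
    for index, source in enumerate(paths, start=1):
        target_dir = os.path.join(work_dir, str(index))
        os.makedirs(target_dir, exist_ok=True)
        target = os.path.join(target_dir, os.path.basename(source))
        os.symlink(os.path.abspath(source), target)
        staged.append(os.path.relpath(target, work_dir))
    return staged


# Demo: two files that share a name but live in different folders.
with tempfile.TemporaryDirectory() as tmp:
    inputs = []
    for folder, content in (("folder1", "1\n"), ("folder2", "2\n")):
        os.makedirs(os.path.join(tmp, folder))
        path = os.path.join(tmp, folder, "file.txt")
        with open(path, "w") as handle:
            handle.write(content)
        inputs.append(path)
    work = os.path.join(tmp, "work")
    os.makedirs(work)
    print(stage_into_numbered_dirs(inputs, work))
```

Because every staged path is unique, a downstream command such as cat can receive all of them without one file shadowing another.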

Nextflow - No such variable: prefix

I tried to run my Nextflow script. The first two processes worked fine, but the third process, Combinevcf, reported an error saying that the variable prefix was not found.
process Annovar_genebased {
publishDir "${params.output}/annovar", mode: 'copy'
input:
path 'snp_anatation' from anatation1.flatMap()
val humandb
val refgene
output:
path "*.exonic_variant_function" into end
"""
prefix=\$(basename \$(readlink snp_anatation) .avinput)
perl $refgene -geneanno -dbtype refGene -out \${prefix}.anatation -buildver hg19 $snp_anatation $humandb -hgvs
rm *.log
rm *.variant_function
"""
}
process Annovar {
publishDir "${params.output}/annovar", mode: 'copy'
input:
path 'snp_anatation' from anatation2.flatMap()
val annovar_table
val humandb
output:
path "*.csv" into end1
"""
prefix=\$(basename \$(readlink snp_anatation) .avinput)
perl $annovar_table $snp_anatation $humandb -buildver hg19 -out \${prefix}.anatation -remove -protocol refGene,cytoBand,exac03,clinvar_20200316,gnomad211_exome -operation g,r,f,f,f -nastring . -csvout -polish
"""
}
I got stuck on this process
process Combinevcf {
publishDir "${params.output}/combinevcf", mode: 'copy'
input:
path 'genebased' from end.flatMap()
path 'allbased' from end1.flatMap()
output:
path "*_3.csv" into end3
"""
prefix=\$(basename \$(readlink genebased) .exonic_variant_function)
prefix1=\$(basename \$(readlink allbased) .csv)
cat ${prefix}.exonic_variant_function | tr -s ‘[:blank:]’ ‘,’ | awk 'BEGIN{FS=",";OFS="," }{ print \$3,\$13,\$22}' | awk ' BEGIN { OFS=", "; print "refGene", "refGene", "refGene", "refGene", "refGene", "Zogysity","chr", "filter" } { print \$0, "" } ' > ${prefix}_1.csv
awk 'BEGIN{FS=",";OFS="," }{ print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$9,\$10,\$15,\$21,\$24,\$25}' ${prefix1}.csv > ${prefix1}_2.csv
paste ${prefix}_1.csv ${prefix1}_2.csv > ${prefix}_3.csv
"""
}
I am not sure what went wrong, any help would be appreciated.
You need to escape ${prefix} and ${prefix1} with backslashes (\${prefix} and \${prefix1}), as you already did in the first two processes, to tell Nextflow that these are Bash variables defined inside the script block and not Nextflow variables.
See https://www.nextflow.io/docs/latest/process.html#script for more info:
Since Nextflow uses the same Bash syntax for variable substitutions in strings, you must manage them carefully depending on whether you want to evaluate a Nextflow variable or a Bash variable
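The two-stage substitution can be illustrated with Python's string.Template, which also uses dollar-sign placeholders. This is only an analogy: Template escapes a dollar sign as $$ whereas Nextflow uses a backslash, and the render helper and the annotate.pl name are hypothetical.

```python
from string import Template


def render(script_text, **workflow_vars):
    """Substitute workflow-level variables, leaving escaped ones for the shell."""
    return Template(script_text).substitute(**workflow_vars)


# Escaped: "$$" yields a literal "$", so "${prefix}" survives for Bash to expand.
escaped = render("perl $refgene -out $${prefix}.anatation", refgene="annotate.pl")
print(escaped)

# Unescaped: the templating step itself looks up "prefix" and fails,
# which mirrors the "No such variable: prefix" error.
try:
    render("cat ${prefix}.csv", refgene="annotate.pl")
except KeyError as missing:
    print("no such variable:", missing)
```

The same reasoning applies to the Combinevcf script: any variable assigned inside the script by Bash has to be escaped so that the first substitution pass leaves it alone.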

How to process multiple samples as input in Nextflow?

I'm trying to learn Nextflow, but it's not going very well. I am using paired-end NGS data to build an analysis pipeline from FASTQ files to VCF files. However, I got stuck right at the beginning, as shown in the code below. The first and second processes work fine, but when the files are passed to the third process there is an error and I can't run the rest of the pipeline. What should I do? Thanks for any help.
My code follows:
#! /usr/bin/env nextflow
params.fq1 = "/home/duxu/project/data/*1.fq.gz"
params.fq2 = "/home/duxu/project/data/*2.fq.gz"
params.index = "/home/duxu/project/result/index.list"
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
params.ref = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/"
params.output='results'
params.refhg19 = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/hg19.p13.plusMT.no_alt_analysis_set.fa"
params.Mills = "/home/duxu/project/data/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf"
params.1000G = "/home/duxu/project/data/1000G_phase1.indels.hg19.sites.vcf"
params.dbsnp = "/home/duxu/project/data/dbsnp_138.hg19.vcf
fq2 = Channel.fromPath(params.fq2)
fq2 = Channel.fromPath(params.fq2)
index = Channel.fromPath(params.index)
index.into { index_1; index_2; index_3 }
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
ref = Channel.fromPath(params.ref)
refhg19 = Channel.fromPath(params.refhg19)
refhg19.into { refhg19_1; refhg19_2 ; refhg19_3; refhg19_4; refhg19_5}
Mills = Channel.fromPath(params.Mills)
1000G = Channel.fromPath(params.1000G)
dbsnp = Channel.fromPath(params.dbsnp)
This is the first process:
process soapnuke{
conda'soapnuke'
tag{"soapnuk ${fq1} ${fq2}"}
publishDir "${params.outdir}/SOAPnuke", mode: 'copy'
input:
file rawfq1 from fq1
file rawfq2 from fq2
output:
file 'clean1.fastq.gz' into clean_fq1
file 'clean2.fastq.gz' into clean_fq2
script:
"""
SOAPnuke filter -1 $rawfq1 -2 $rawfq2 -l 12 -q 0.5 -Q 2 -o . \
-C clean1.fastq.gz -D clean2.fastq.gz --trim 8,0,8,0
"""
}
The second process:
process barcode_splitter{
tag{"barcode_splitter"}
publishDir "${params.output}/barcode_splitter", mode: 'copy'
input:
file split1 from clean_fq1
file split2 from clean_fq2
file index from index_1
output:
file '*-read-1.fastq.gz' into trimmed_index1
file '*-read-2.fastq.gz' into trimmed_index2
script:
"""
barcode_splitter --bcfile $index $split1 $split2 --idxread 1 2 --mismatches 1 --suffix .fastq --gzipout
mv multimatched-read-1.fastq.gz multicatched.1.fastq.gz
mv multimatched-read-2.fastq.gz multicatched.2.fastq.gz
mv untimatched-read-1.fastq.gz multicatched.1.fastq.gz
mv untimatched-read-2.fastq.gz multicatched.2.fastq.gz
"""
}
The third process is where I get an error. This process handles multiple samples, since the previous process, barcode_splitter, outputs multiple files. The cutadapt process is meant to trim the first few bases from each of those samples.
process cutadapt{
tag{"cutadapt"}
publishDir "${params.output}/cut_primer", mode: 'copy'
input:
val sample from sample
file primer_1 from primer1
file primer_2 from primer2
file ${sample}-read-1.fastq.gz from trimmed_index1.collect()
file ${sample}-read-2.fastq.gz from trimmed_index2.collect()
output:
file '*.trim.1.fastq.gz' into trimmed_primer1
file '*.trim.2.fastq.gz' into trimmed_primer2
script:
"""
cutadapt -g file:$primer_1 -G file:$primer_2 -j 64 --discard-untrimmed -o \${sample}.trim.1.fastq.gz -p \$(sample}.trim.2.fastq.gz ${sample}-read-1.fastq.gz ${sample}-read-2.fastq.gz
"""
}
The fourth process is designed to map multiple samples to a reference genome:
process bwa_mapping{
tag{bwa_maping}
publishDir "${params.output}/barcode_splitter", mode: 'copy'
input:
file pair1 from trimmed_primer1
file pair2 from trimmed_primer2
file refhg19 from refhg19_1
output:
file '*_R1_R2.bam' into addheader
script:
"""
bwa mem -t 64 $refhg19 trimmed_primer_*-read-1.fastq trimmed_primer_*-read-2.fastq | samtools view -#8 -b | samtools sort -m 2G -#64 > * _R1_R2.bam
"""
}
The next remaining processes are all multisample-based operations
process {BaseRecalibrator
tag{"BaseRecalibrator"}
publishDir "${params.output}/BQSR/BaseRecalibratoraddheader", mode: 'copy'
input:
file BaseRecalibrator from BaseRecalibrator_1
file refhg19 from refhg19_2
file Mills from Mills
file 1000G from 1000G
file dbsnp from dbsnp
output:
file '*.recal.table' into ApplyBQSR
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" BaseRecalibrator -I BaseRecalibrator -R $refhg19_2 --known-sites $Mills --known-sites $1000G --known-sites $dbsnp -O *.recal.table
"""
}
process {ApplyBQSR
tag{"ApplyRecalibrator"}
publishDir "${params.output}/BQSR/ApplyRecalibrator", mode: 'copy'
input:
file ApplyBQSR from ApplyBQSR
file refhg19 from refhg19_3
file BaseRecalibrator from BaseRecalibrator_2
output:
file '*.bam' into HaplotypeCaller
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" ApplyBQSR -I BaseRecalibrator -R $refhg19_3 --bqsr-recal-file $AppleBQSR -O *.bam
"""
}
process {HaplotypeCaller
tag{"HaplotypeCallerr"}
publishDir "${params.output}/GATK/HaplotypeCaller", mode: 'copy'
input:
file HaplotypeCaller from HaplotypeCaller
file refhg19 from refhg19_4
file BaseRecalibrator from BaseRecalibrator_3
output:
file '*.g.vcf.gz' into GenotypeGVCFs
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" HaplotypeCaller -R $refhg19_4 -I BaseRecalibrator -O *.g.vcf.gz -ERC GVCF
"""
}
process {GenotypeGVCFs
tag{"GenotypeGVCFs"}
publishDir "${params.output}/GATK/GenotypeGVCFs", mode: 'copy'
input:
file GenotypeGVCFs from GenotypeGVCFs
file refhg19 from refhg19_5
output:
file '*.vcf.gz' into end
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" GenotypeGVCFs -R $refhg19_5 -V $GenotypeGVCFs -O *.vcf.gz
"""
I'm not sure exactly what the error is, but it may not matter. It looks like you've declared more than one queue channel in your 'cutadapt' input declaration, which is usually not what you want: a process consumes one item from each queue channel per task and stops as soon as any of them is empty. Please see the Nextflow documentation on how multiple input channels work.
Note that the Channel.fromPath factory method creates a queue channel. Here, 'primer1' and 'primer2' are both queue channels that each provide only a single value:
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
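What you want instead are value channels, which can be read any number of times without their content being consumed. Calling file() returns a plain path, which Nextflow treats as a value channel when it is used as a process input: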
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = file(params.primer1)
primer2 = file(params.primer2)
A typical process, executed once for each of n samples, takes a single queue channel and zero or more value channels. For example:
process cutadapt {
    tag { sample }
    publishDir "${params.output}/cut_primer", mode: 'copy'
    cpus 64
    input:
    tuple val(sample), path(fq1), path(fq2) from trimmed_index
    path primer1
    path primer2
    output:
    tuple val(sample), path("${sample}.trim.1.fq.gz"), path("${sample}.trim.2.fq.gz")
    script:
    """
    cutadapt \\
        -g "file:${primer1}" \\
        -G "file:${primer2}" \\
        -j ${task.cpus} \\
        --discard-untrimmed \\
        -o "${sample}.trim.1.fq.gz" \\
        -p "${sample}.trim.2.fq.gz" \\
        "${fq1}" \\
        "${fq2}"
    """
}
Note that when the file input name is the same as the channel name, the from channel declaration can be omitted. I also used the path qualifier above, as it should be preferred over the file qualifier when using Nextflow 19.10.0 or later.
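The difference between the two channel types can be mimicked in plain Python, where a one-item iterator plays the role of a queue channel and itertools.repeat plays the role of a value channel. This is an analogy under those assumptions rather than a model of Nextflow internals; pair_tasks and the sample names are hypothetical.

```python
from itertools import repeat

samples = ["s1", "s2", "s3"]


def pair_tasks(sample_channel, primer_channel):
    """Take one item from each channel per task, stopping when either runs dry."""
    return list(zip(sample_channel, primer_channel))


# Queue-like primer channel: a single item, consumed by the first task.
queue_tasks = pair_tasks(iter(samples), iter(["primer_1.fasta"]))

# Value-like primer channel: the same item can be read for every task.
value_tasks = pair_tasks(iter(samples), repeat("primer_1.fasta"))

print(len(queue_tasks), len(value_tasks))
```

With the queue-like primer only one task is formed, whereas the value-like primer yields one task per sample, which is the behaviour wanted for cutadapt.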

Take process input from either of two channels

How do I allow a process to take an input from either one of two channels that are outputs of processes with mutually exclusive conditions for running? For example, something like:
params.condition = false
process a {
output:
path "a.out" into a_into_c
when:
params.condition == true
"""
touch a.out
"""
}
process b {
output:
path "b.out" into b_into_c
when:
params.condition == false
"""
touch b.out
"""
}
process c {
publishDir baseDir, mode: 'copy'
input:
path foo from a_into_c or b_into_c
output:
path "final.out"
"""
echo $foo > final.out
"""
}
where final.out will contain a.out if params.condition is true (e.g. --condition is given on the command line), and b.out if it is false.
You can use the mix operator for this:
process c {
    publishDir baseDir, mode: 'copy'
    input:
    path foo from a_into_c.mix(b_into_c)
    output:
    path "final.out"
    """
    echo $foo > final.out
    """
}
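Why this works can be sketched in Python: a skipped process emits nothing, so merging both output channels leaves exactly the item from whichever process ran. The sketch uses itertools.chain as a stand-in for mix (mix emits items in arrival order rather than concatenating, which makes no difference when one side is empty); run_if and inputs_for_c are hypothetical helpers.

```python
from itertools import chain


def run_if(condition, output):
    """Mimic a process guarded by `when:`; it emits nothing if skipped."""
    return [output] if condition else []


def inputs_for_c(condition):
    """Merge the outputs of the two mutually exclusive processes."""
    a_into_c = run_if(condition, "a.out")
    b_into_c = run_if(not condition, "b.out")
    return list(chain(a_into_c, b_into_c))


print(inputs_for_c(True))
print(inputs_for_c(False))
```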

Multiple "params" in Snakemake file

I've got the following Snakemake file:
rule test:
    params:
        a = "a"
    shell:
        "echo {params.a}"
Which works as expected:
$ snakemake
a
But when I add a second parameter, I get an error:
rule test:
    params:
        a = "a"
        b = 5
    shell:
        "echo {params.a} {params.b}"
SyntaxError in line 4 of /home/mschu/Code/snakemake/Snakefile:
invalid syntax
Why is that?
The documentation also only shows examples with a single item in params.
Separate them with a comma. Snakemake parses the entries of params like the arguments of a Python function call:
rule test:
    params:
        a = "a",
        b = 5
    shell:
        "echo {params.a} {params.b}"