Nextflow name collision

I have files with identical names but in different folders. Nextflow stages these files into the same work directory resulting in name collisions. My question is how to deal with that without renaming the files. Example:
# Example data
mkdir folder1 folder2
echo 1 > folder1/file.txt
echo 2 > folder2/file.txt
# We read from samplesheet
$ cat samplesheet.csv
sample,file
sample1,/home/atpoint/foo/folder1/file.txt
sample1,/home/atpoint/foo/folder2/file.txt
# Nextflow main.nf
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
// Read samplesheet and group files by sample (first column)
samplesheet = Channel
.fromPath(params.samplesheet)
.splitCsv(header:true)
.map {
sample = it['sample']
file = it['file']
tuple(sample, file)
}
ch_samplesheet = samplesheet.groupTuple(by:0)
// That creates a tuple like:
// [sample1, [/home/atpoint/foo/folder1/file.txt, /home/atpoint/foo/folder2/file.txt]]
// Dummy process that stages both files into the same work directory folder
process PRO {
input:
tuple val(samplename), path(files)
output:
path("out.txt")
script:
"""
echo $samplename with files $files > out.txt
"""
}
workflow { PRO(ch_samplesheet) }
# Run it
NXF_VER=21.10.6 nextflow run main.nf --samplesheet $(realpath samplesheet.csv)
...obviously resulting in:
N E X T F L O W ~ version 21.10.6
Launching `main.nf` [adoring_jennings] - revision: 87f26fa90b
[- ] process > PRO -
Error executing process > 'PRO (1)'
Caused by:
Process `PRO` input file name collision -- There are multiple input files for each of the following file names: file.txt
So, what now? The real world application here is sequencing replicates of the same fastq file, which then have the same name, but are in different folders, and I want to feed them into a process that merges them. I am aware of this section in the docs but cannot say that any of it was helpful or that I understand it properly.

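You can use the stageAs option of the path input qualifier in your process definition. With the pattern "?/*", the ? is replaced by the index of each file, so every input is staged into its own numbered subdirectory (1/file.txt, 2/file.txt, ...) and identical file names no longer collide.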
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
samplesheet = Channel
.fromPath(params.samplesheet)
.splitCsv(header:true)
.map { row -> tuple(row.sample, file(row.file)) }
.groupTuple()
.set { ch_samplesheet }
// [sample1, [/path/to/folder1/file.txt, /path/to/folder2/file.txt]]
process PRO {
input:
tuple val(samplename), path(files, stageAs: "?/*")
output:
path("out.txt")
script:
def input_str = files instanceof List ? files.join(" ") : files
"""
cat ${input_str} > out.txt
"""
}
workflow { PRO(ch_samplesheet) }
See an example from nf-core and the path input type docs
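To see what the staging pattern changes, it can help to model the staging step outside Nextflow. The following is a minimal Python sketch, not Nextflow code: the helper names stage_inputs and demo are hypothetical, and the sketch assumes the documented behaviour of "?/*", where ? becomes the 1-based position of each input file. It recreates the folder1/folder2 example and shows that both files can then be read from the same work directory.

```python
import os
import tempfile


def stage_inputs(sources, work_dir):
    """Symlink each source to work_dir/<index>/<basename>, like stageAs: "?/*"."""
    staged = []
    for index, source in enumerate(sources, start=1):
        sub_dir = os.path.join(work_dir, str(index))
        os.makedirs(sub_dir)
        target = os.path.join(sub_dir, os.path.basename(source))
        os.symlink(source, target)
        staged.append(os.path.relpath(target, work_dir))
    return staged


def demo():
    """Build two same-named files, stage them, and concatenate them."""
    with tempfile.TemporaryDirectory() as root:
        sources = []
        for folder, content in (("folder1", "1\n"), ("folder2", "2\n")):
            os.makedirs(os.path.join(root, folder))
            path = os.path.join(root, folder, "file.txt")
            with open(path, "w") as handle:
                handle.write(content)
            sources.append(path)

        work_dir = os.path.join(root, "work")
        os.makedirs(work_dir)
        staged = stage_inputs(sources, work_dir)

        # Equivalent of `cat 1/file.txt 2/file.txt` inside the work directory.
        merged = ""
        for rel_path in staged:
            with open(os.path.join(work_dir, rel_path)) as handle:
                merged += handle.read()
        return staged, merged


if __name__ == "__main__":
    print(demo())
```

Because the original names are kept inside the numbered subdirectories, nothing has to be renamed; the script block simply receives paths such as 1/file.txt and 2/file.txt.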

Related

Nextflow - No such variable: prefix

I tried to run my Nextflow script and the first two processes worked fine, but the third process, Combinevcf, reported an error saying that the variable prefix was not found.
process Annovar_genebased {
publishDir "${params.output}/annovar", mode: 'copy'
input:
path 'snp_anatation' from anatation1.flatMap()
val humandb
val refgene
output:
path "*.exonic_variant_function" into end
"""
prefix=\$(basename \$(readlink snp_anatation) .avinput)
perl $refgene -geneanno -dbtype refGene -out \${prefix}.anatation -buildver hg19 $snp_anatation $humandb -hgvs
rm *.log
rm *.variant_function
"""
}
process Annovar {
publishDir "${params.output}/annovar", mode: 'copy'
input:
path 'snp_anatation' from anatation2.flatMap()
val annovar_table
val humandb
output:
path "*.csv" into end1
"""
prefix=\$(basename \$(readlink snp_anatation) .avinput)
perl $annovar_table $snp_anatation $humandb -buildver hg19 -out \${prefix}.anatation -remove -protocol refGene,cytoBand,exac03,clinvar_20200316,gnomad211_exome -operation g,r,f,f,f -nastring . -csvout -polish
"""
}
I got stuck on this process
process Combinevcf {
publishDir "${params.output}/combinevcf", mode: 'copy'
input:
path 'genebased' from end.flatMap()
path 'allbased' from end1.flatMap()
output:
path "*_3.csv" into end3
"""
prefix=\$(basename \$(readlink genebased) .exonic_variant_function)
prefix1=\$(basename \$(readlink allbased) .csv)
cat ${prefix}.exonic_variant_function | tr -s ‘[:blank:]’ ‘,’ | awk 'BEGIN{FS=",";OFS="," }{ print \$3,\$13,\$22}' | awk ' BEGIN { OFS=", "; print "refGene", "refGene", "refGene", "refGene", "refGene", "Zogysity","chr", "filter" } { print \$0, "" } ' > ${prefix}_1.csv
awk 'BEGIN{FS=",";OFS="," }{ print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$9,\$10,\$15,\$21,\$24,\$25}' ${prefix1}.csv > ${prefix1}_2.csv
paste ${prefix}_1.csv ${prefix1}_2.csv > ${prefix}_3.csv
"""
}
I am not sure what went wrong, any help would be appreciated.
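You need to escape ${prefix} and ${prefix1} with a backslash, i.e. write \${prefix} and \${prefix1}, to tell Nextflow that these are Bash variables set inside the script block and not variables in the Nextflow scope. Without the backslash, Nextflow tries to interpolate them before the script runs and fails with "No such variable: prefix".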
See https://www.nextflow.io/docs/latest/process.html#script for more info:
Since Nextflow uses the same Bash syntax for variable substitutions in strings, you must manage them carefully depending on whether you want to evaluate a Nextflow variable or a Bash variable

Execute a nextflow process for each CSV record (line)

I'm trying to read each line of a CSV file and then execute a Nextflow process for each line. However, when I run the Nextflow script I get the following error and I don't know why:
Argument of file function cannot be null
params.index_fasta = "/home/test_1000Genomes.csv"
Channel
.fromPath(params.index_fasta)
.splitCsv(header:true)
.map { row-> set(row.sampleId, file(row.read1), file(row.read2)) }
.set { sample_run_ch }
process FastQCFQ {
tag "QC of fasta"
publishDir (
path: "${params.PublishDir}/Reports/${sampleId}/FastQC",
mode: 'copy',
overwrite: 'true'
)
input:
set sampleId, file("${read1}"), file("${read2}") from sample_run_ch
output:
file("*.{html,zip}") into QC_Report
script:
"""
fastqc -t 2 -q $read1 $read2
"""
}
ch_qc = QC_Report
The CSV file is actually a tab-separated file with the header sampleId, read1, read2, where read1 and read2 are the paths of the fasta files. I have tried changing some parameters inside the Nextflow process, but without getting it to work.
Argument of file function cannot be null
As Pallie notes in the comments above, if the input CSV is not parsed correctly (for example, if the wrong delimiter is used) the variables that you expect to contain strings may actually be null. If your CSV is actually tab-separated, use the splitCsv sep parameter to set it:
params.samples_tsv = './samples.tsv'
params.publish_dir = './results'
Channel
.fromPath( params.samples_tsv )
.splitCsv( header: true, sep: '\t' )
.map { row -> tuple( row.sampleId, file(row.read1), file(row.read2) ) }
.set { sample_run_ch }
process FastQC {
tag { sampleId }
publishDir (
path: "${params.publish_dir}/Reports/${sampleId}/FastQC",
mode: 'copy',
overwrite: 'true',
)
input:
tuple val(sampleId), path(read1), path(read2) from sample_run_ch
output:
path "*.{html,zip}" into QC_Report
script:
"""
fastqc -t 2 -q "${read1}" "${read2}"
"""
}
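The null values are easy to reproduce outside Nextflow. The sketch below uses Python's csv module purely as an analogy for splitCsv (it is not the Nextflow implementation), with hypothetical sample paths. Parsing a tab-separated header with a comma delimiter produces one single column whose name is the whole header line, so a lookup of read1 finds nothing.

```python
import csv
import io

# Hypothetical tab-separated samplesheet with the header from the question.
TSV_TEXT = "sampleId\tread1\tread2\nS1\t/data/S1_1.fq.gz\t/data/S1_2.fq.gz\n"


def parse_rows(text, delimiter):
    """Parse text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Wrong delimiter: the whole header becomes one column name.
wrong = parse_rows(TSV_TEXT, ",")

# Correct delimiter: each column is available under its own name.
right = parse_rows(TSV_TEXT, "\t")

if __name__ == "__main__":
    print(list(wrong[0].keys()))
    print(wrong[0].get("read1"))
    print(right[0]["read1"])
```

In the comma-parsed rows the lookup returns None, which corresponds to the null that file() rejects in the original pipeline.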

How to process multiple samples as input in Nextflow?

I'm trying to learn Nextflow but it's not going very well. I used NGS-based paired-end sequencing data to build an analysis flow from fastq files to vcf files using Nextflow. However, I got stuck right at the beginning, as shown in the code. The first and second processes work fine, but when passing the files to the third process there is an error and I can't execute the whole pipeline anymore. What should I do? Thanks for any help.
Following is my code:
#! /usr/bin/env nextflow
params.fq1 = "/home/duxu/project/data/*1.fq.gz"
params.fq2 = "/home/duxu/project/data/*2.fq.gz"
params.index = "/home/duxu/project/result/index.list"
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
params.ref = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/"
params.output='results'
params.refhg19 = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/hg19.p13.plusMT.no_alt_analysis_set.fa"
params.Mills = "/home/duxu/project/data/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf"
params.1000G = "/home/duxu/project/data/1000G_phase1.indels.hg19.sites.vcf"
params.dbsnp = "/home/duxu/project/data/dbsnp_138.hg19.vcf"
fq1 = Channel.fromPath(params.fq1)
fq2 = Channel.fromPath(params.fq2)
index = Channel.fromPath(params.index)
index.into { index_1; index_2; index_3 }
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
ref = Channel.fromPath(params.ref)
refhg19 = Channel.fromPath(params.refhg19)
refhg19.into { refhg19_1; refhg19_2 ; refhg19_3; refhg19_4; refhg19_5}
Mills = Channel.fromPath(params.Mills)
1000G = Channel.fromPath(params.1000G)
dbsnp = Channel.fromPath(params.dbsnp)
This is the first process:
process soapnuke{
conda'soapnuke'
tag{"soapnuk ${fq1} ${fq2}"}
publishDir "${params.outdir}/SOAPnuke", mode: 'copy'
input:
file rawfq1 from fq1
file rawfq2 from fq2
output:
file 'clean1.fastq.gz' into clean_fq1
file 'clean2.fastq.gz' into clean_fq2
script:
"""
SOAPnuke filter -1 $rawfq1 -2 $rawfq2 -l 12 -q 0.5 -Q 2 -o . \
-C clean1.fastq.gz -D clean2.fastq.gz --trim 8,0,8,0
"""
}
The second process:
process barcode_splitter{
tag{"barcode_splitter"}
publishDir "${params.output}/barcode_splitter", mode: 'copy'
input:
file split1 from clean_fq1
file split2 from clean_fq2
file index from index_1
output:
file '*-read-1.fastq.gz' into trimmed_index1
file '*-read-2.fastq.gz' into trimmed_index2
script:
"""
barcode_splitter --bcfile $index $split1 $split2 --idxread 1 2 --mismatches 1 --suffix .fastq --gzipout
mv multimatched-read-1.fastq.gz multicatched.1.fastq.gz
mv multimatched-read-2.fastq.gz multicatched.2.fastq.gz
mv untimatched-read-1.fastq.gz multicatched.1.fastq.gz
mv untimatched-read-2.fastq.gz multicatched.2.fastq.gz
"""
}
The third process is where I get an error. This process handles multiple samples, since the previous process, barcode_splitter, outputs multiple files. The cutadapt process is designed to trim the first few bases from the reads of each sample.
process cutadapt{
tag{"cutadapt"}
publishDir "${params.output}/cut_primer", mode: 'copy'
input:
val sample from sample
file primer_1 from primer1
file primer_2 from primer2
file ${sample}-read-1.fastq.gz from trimmed_index1.collect()
file ${sample}-read-2.fastq.gz from trimmed_index2.collect()
output:
file '*.trim.1.fastq.gz' into trimmed_primer1
file '*.trim.2.fastq.gz' into trimmed_primer2
script:
"""
cutadapt -g file:$primer_1 -G file:$primer_2 -j 64 --discard-untrimmed -o \${sample}.trim.1.fastq.gz -p \$(sample}.trim.2.fastq.gz ${sample}-read-1.fastq.gz ${sample}-read-2.fastq.gz
"""
}
The fourth process is designed to align multiple samples to a reference genome:
process bwa_mapping{
tag{bwa_maping}
publishDir "${params.output}/barcode_splitter", mode: 'copy'
input:
file pair1 from trimmed_primer1
file pair2 from trimmed_primer2
file refhg19 from refhg19_1
output:
file '*_R1_R2.bam' into addheader
script:
"""
bwa mem -t 64 $refhg19 trimmed_primer_*-read-1.fastq trimmed_primer_*-read-2.fastq | samtools view -#8 -b | samtools sort -m 2G -#64 > * _R1_R2.bam
"""
}
The remaining processes are all multi-sample operations:
process {BaseRecalibrator
tag{"BaseRecalibrator"}
publishDir "${params.output}/BQSR/BaseRecalibratoraddheader", mode: 'copy'
input:
file BaseRecalibrator from BaseRecalibrator_1
file refhg19 from refhg19_2
file Mills from Mills
file 1000G from 1000G
file dbsnp from dbsnp
output:
file '*.recal.table' into ApplyBQSR
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" BaseRecalibrator -I BaseRecalibrator -R $refhg19_2 --known-sites $Mills --known-sites $1000G --known-sites $dbsnp -O *.recal.table
"""
}
process {ApplyBQSR
tag{"ApplyRecalibrator"}
publishDir "${params.output}/BQSR/ApplyRecalibrator", mode: 'copy'
input:
file ApplyBQSR from ApplyBQSR
file refhg19 from refhg19_3
file BaseRecalibrator from BaseRecalibrator_2
output:
file '*.bam' into HaplotypeCaller
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" ApplyBQSR -I BaseRecalibrator -R $refhg19_3 --bqsr-recal-file $AppleBQSR -O *.bam
"""
}
process {HaplotypeCaller
tag{"HaplotypeCallerr"}
publishDir "${params.output}/GATK/HaplotypeCaller", mode: 'copy'
input:
file HaplotypeCaller from HaplotypeCaller
file refhg19 from refhg19_4
file BaseRecalibrator from BaseRecalibrator_3
output:
file '*.g.vcf.gz' into GenotypeGVCFs
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" HaplotypeCaller -R $refhg19_4 -I BaseRecalibrator -O *.g.vcf.gz -ERC GVCF
"""
}
process {GenotypeGVCFs
tag{"GenotypeGVCFs"}
publishDir "${params.output}/GATK/GenotypeGVCFs", mode: 'copy'
input:
file GenotypeGVCFs from GenotypeGVCFs
file refhg19 from refhg19_5
output:
file '*.vcf.gz' into end
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" GenotypeGVCFs -R $refhg19_5 -V $GenotypeGVCFs -O *.vcf.gz
"""
Not sure what the error is exactly, but maybe it doesn't matter. It looks like you've declared more than one queue channel in your 'cutadapt' input declaration. Usually you don't want to do this, because a process stops launching tasks as soon as any one of its queue channels runs out of items. Please see the documentation section on how multiple input channels work.
Note that the Channel.fromPath factory method creates a queue channel. Here, 'primer1' and 'primer2' are both queue channels that each provide only a single value:
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
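What you want instead are value channels, which can be read any number of times without their content being consumed. The file() function returns a plain file object, and Nextflow implicitly treats it as a value channel when it is used as a process input: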
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = file(params.primer1)
primer2 = file(params.primer2)
A typical process that runs once for each of n samples takes a single queue channel and zero or more value channels. For example:
process cutadapt{
tag { sample }
publishDir "${params.output}/cut_primer", mode: 'copy'
cpus 64
input:
tuple val(sample), path(fq1), path(fq2) from trimmed_index
path primer1
path primer2
output:
tuple val(sample), path("${sample}.trim.1.fq.gz"), path("${sample}.trim.2.fq.gz")
script:
"""
cutadapt \\
-g "file:${primer1}" \\
-G "file:${primer2}" \\
-j ${task.cpus} \\
--discard-untrimmed \\
-o "${sample}.trim.1.fq.gz" \\
-p "${sample}.trim.2.fq.gz" \\
"${fq1}" \\
"${fq2}"
"""
}
Note that when the file input name is the same as the channel name, the from channel declaration can be omitted. I also used the path qualifier above, as it should be preferred over the file qualifier when using Nextflow 19.10.0 or later.
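The difference between the two channel types can be modelled in a few lines of plain Python. This is only an illustration under simplifying assumptions, not Nextflow's API: run_tasks is a hypothetical helper, a queue channel is modelled as a finite list that is consumed item by item, and a value channel as an endlessly repeatable item.

```python
from itertools import repeat


def run_tasks(*inputs):
    """Form one task per complete set of inputs; stop when any input runs dry."""
    return list(zip(*inputs))


# Three samples arriving on a queue-like channel (hypothetical names).
samples = ["s1", "s2", "s3"]

# Primer supplied as a queue-like channel holding a single item:
# it is consumed by the first task, so only one task is formed.
queue_tasks = run_tasks(samples, ["primer_1.fasta"])

# Primer supplied as a value-like channel: every task can read it again.
value_tasks = run_tasks(samples, repeat("primer_1.fasta"))

if __name__ == "__main__":
    print(len(queue_tasks), "task(s) with a queue-like primer input")
    print(len(value_tasks), "task(s) with a value-like primer input")
```

In this model the queue-like primer input limits the run to a single sample, while the value-like input lets all three samples be processed, which is the reason for switching from Channel.fromPath to file() for the primers.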

Take process input from either of two channels

How do I allow a process to take an input from either one of two channels that are outputs of processes with mutually exclusive conditions for running? For example, something like:
params.condition = false
process a {
output:
path "a.out" into a_into_c
when:
params.condition == true
"""
touch a.out
"""
}
process b {
output:
path "b.out" into b_into_c
when:
params.condition == false
"""
touch b.out
"""
}
process c {
publishDir baseDir, mode: 'copy'
input:
path foo from a_into_c or b_into_c
output:
path "final.out"
"""
echo $foo > final.out
"""
}
where final.out will contain a.out if params.condition is true (e.g. --condition is given on the command line), and b.out if it is false.
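You can use the mix operator for this. Because only one of the two processes ever runs, the other channel stays empty, and the mixed channel emits just the file that was actually produced: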
process c {
publishDir baseDir, mode: 'copy'
input:
path foo from a_into_c.mix(b_into_c)
output:
path "final.out"
"""
echo $foo > final.out
"""
}
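As a rough mental model, mixing two channels of which one is always empty just passes the other one through. The Python sketch below is an analogy only: mix and outputs are hypothetical helpers, channels are modelled as lists, and unlike Nextflow's mix operator this version emits items in a fixed order.

```python
def mix(*channels):
    """Emit every item from every channel (Nextflow's mix does not guarantee order)."""
    merged = []
    for channel in channels:
        merged.extend(channel)
    return merged


def outputs(condition):
    """Model processes a and b, whose `when:` guards are mutually exclusive."""
    a_into_c = ["a.out"] if condition else []  # process a ran only if condition is true
    b_into_c = [] if condition else ["b.out"]  # process b ran only otherwise
    return mix(a_into_c, b_into_c)


if __name__ == "__main__":
    print(outputs(True))
    print(outputs(False))
```

Process c therefore receives exactly one input whichever branch was taken.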

Multiple outputs to single list input - merging BAM files in Nextflow

I am attempting to merge x number of bam files produced via performing multiple alignments at once (on batches of y number of fastq files) into one single bam file in Nextflow.
So far I have the following when performing the alignment and sorting/indexing the resulting bam file:
//Run minimap2 on concatenated fastqs
process miniMap2Bam {
publishDir "$params.bamDir"
errorStrategy 'retry'
cache 'deep'
maxRetries 3
maxForks 10
memory { 16.GB * task.attempt }
input:
val dirString from dirStr
val runString from stringRun
each file(batchFastq) from fastqBatch.flatMap()
output:
val runString into stringRun1
file("${batchFastq}.bam") into bamFiles
val dirString into dirStrSam
script:
"""
minimap2 --secondary=no --MD -2 -t 10 -a $params.genome ${batchFastq} | samtools sort -o ${batchFastq}.bam
samtools index ${batchFastq}.bam
"""
}
Where ${batchFastq}.bam is a bam file containing a batch of y number of fastq files.
This pipeline completes just fine, however, when attempting to perform samtools merge on these bam files in another process (samToolsMerge), the process runs each time an alignment is run (in this case, 4), instead of once for all bam files collected:
//Run samtools merge
process samToolsMerge {
echo true
publishDir "$dirString/aligned_minimap/", mode: 'copy', overwrite: 'false'
cache 'deep'
errorStrategy 'retry'
maxRetries 3
maxForks 10
memory { 14.GB * task.attempt }
input:
val runString from stringRun1
file bamFile from bamFiles.collect()
val dirString from dirStrSam
output:
file("**")
script:
"""
samtools merge ${runString}.bam ${bamFile}
"""
}
With the output being:
executor > lsf (9)
[49/182ec0] process > catFastqs (1) [100%] 1 of 1 ✔
[- ] process > nanoPlotSummary -
[0e/609a7a] process > miniMap2Bam (1) [100%] 4 of 4 ✔
[42/72469d] process > samToolsMerge (2) [100%] 4 of 4 ✔
Completed at: 04-Mar-2021 14:54:21
Duration : 5m 41s
CPU hours : 0.2
Succeeded : 9
How can I take just the resulting bam files from miniMap2Bam and run them through samToolsMerge a single time, instead of the process running multiple times?
Thanks in advance!
EDIT:
Thanks to Pallie in the comments below, the issue was feeding the runString and dirString values from a prior process into miniMap2Bam and then samToolsMerge, causing the process to repeat itself each time a value was passed on.
The solution was as simple as removing the vals from miniMap2Bam (as follows):
//Run minimap2 on concatenated fastqs
process miniMap2Bam {
errorStrategy 'retry'
cache 'deep'
maxRetries 3
maxForks 10
memory { 16.GB * task.attempt }
input:
each file(batchFastq) from fastqBatch.flatMap()
output:
file("${batchFastq}.bam") into bamFiles
script:
"""
minimap2 --secondary=no --MD -2 -t 10 -a $params.genome ${batchFastq} | samtools sort -o ${batchFastq}.bam
samtools index ${batchFastq}.bam
"""
}
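The repeated merge follows from how a process pairs its inputs: bamFiles.collect() yields a single list that behaves like a value channel, but stringRun1 was still a queue channel with one item per alignment, so the merge was launched once per item. The Python sketch below models this under simplifying assumptions; Value and run_tasks are hypothetical names, not Nextflow API, and the file names are made up.

```python
class Value:
    """A reusable input, standing in for a value channel."""

    def __init__(self, item):
        self.item = item


def run_tasks(*inputs):
    """Return the input tuples of the tasks a process would launch.

    Lists model queue channels (consumed one item per task); Value objects
    can be read by every task. With only Value inputs, exactly one task runs.
    """
    queues = [inp for inp in inputs if not isinstance(inp, Value)]
    count = min(len(queue) for queue in queues) if queues else 1
    return [
        tuple(inp.item if isinstance(inp, Value) else inp[i] for inp in inputs)
        for i in range(count)
    ]


bams = ["batch1.bam", "batch2.bam", "batch3.bam", "batch4.bam"]
collected = Value(bams)       # like bamFiles.collect()
run_strings = ["run01"] * 4   # one runString emitted per alignment task

# Before the fix: the queue of run strings drives four merge tasks.
before = run_tasks(run_strings, collected)

# After the fix: only the collected list remains, so the merge runs once.
after = run_tasks(collected)

if __name__ == "__main__":
    print(len(before), "merge task(s) before,", len(after), "after")
```

Each of the four tasks in the first case receives the same complete list of BAM files, which matches the "4 of 4" line for samToolsMerge in the log above.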
The simplest fix would probably be to stop passing the static dirString and runString around via channels:
// Instead of a hardcoded path use a parameter you passed via CLI like you did with bamDir
dirString = file("/path/to/fastqs/")
runString = file("/path/to/fastqs/").getParent()
fastqBatch = Channel.from("/path/to/fastqs/")
//Run minimap2 on concatenated fastqs
process miniMap2Bam {
publishDir "$params.bamDir"
errorStrategy 'retry'
cache 'deep'
maxRetries 3
maxForks 10
memory { 16.GB * task.attempt }
input:
each file(batchFastq) from fastqBatch.flatMap()
output:
file("${batchFastq}.bam") into bamFiles
script:
"""
minimap2 --secondary=no --MD -2 -t 10 -a $params.genome ${batchFastq} | samtools sort -o ${batchFastq}.bam
samtools index ${batchFastq}.bam
"""
}
//Run samtools merge
process samToolsMerge {
echo true
publishDir "$dirString/aligned_minimap/", mode: 'copy', overwrite: 'false'
cache 'deep'
errorStrategy 'retry'
maxRetries 3
maxForks 10
memory { 14.GB * task.attempt }
input:
file bamFile from bamFiles.collect()
output:
file("**")
script:
"""
samtools merge ${runString}.bam ${bamFile}
"""