Getting weird error while running a Nextflow pipeline

For my data analysis pipeline I am using Nextflow as the workflow management system to run a tool called rMATS.
In the script section I gave all the required arguments, but when I run the pipeline with this command:
nextflow run -ansi-log false main.nf
I get this error:
Command error:
ERROR: output folder and temporary folder required. Please check --od and --tmp.
Here is the rmats.nf module:
process RMATS {
tag "paired_rmats: ${sample1Name}_${sample2Name}"
label 'rmats_4.1.2'
label 'rmats_4.1.2_RMATS'
container = 'quay.io/biocontainers/rmats:4.1.2--py37haf75f70_1'
shell = ['/bin/bash', '-euo', 'pipefail']
input:
path(STAR_genome_index)
path(genome_gtf)
path(s1)
path(s2)
output:
path("*.txt", emit: final_results_rmats)
script:
"""
rmats.py \
--s1 ${s1} \
--s2 ${s2} \
--gtf ${genome_gtf} \
--readLength 150 \
--nthread 10
--novelSS
--mil 50
--mel 500
--bi ${STAR_genome_index} \
--keepTemp \
--od final_results_rmats \
--tmp final_results_rmats
"""
}
Here is the main.nf:
#!/usr/bin/env nextflow
nextflow.preview.dsl=2
include RMATS from './modules/rmats.nf'
gtf_ch = Channel.fromPath(params.gtf)
s1_ch = Channel.fromPath(params.s1)
s2_ch = Channel.fromPath(params.s2)
STAR_genome_index_ch = Channel.fromPath(params.STAR_genome_index)
workflow {
rmats_AS_calling_ch=RMATS(s1_ch, s2_ch, gtf_ch, STAR_genome_index_ch)
}
In the script section, the arguments that are in ${} are given in the config file.
Do you know what could be the problem?

You're just missing some backslash characters in your script block, which are required by your shell for line continuation. Without them, the rmats.py command ends at --nthread 10 and the remaining options (including --od and --tmp) are never passed to the tool, which is exactly what the error message is complaining about. You may also like to escape the backslash characters to have Nextflow write scripts (.command.sh in the working directory) with line continuation:
script:
"""
rmats.py \\
--s1 ${s1} \\
--s2 ${s2} \\
--gtf ${genome_gtf} \\
--readLength 150 \\
--nthread 10 \\
--novelSS \\
--mil 50 \\
--mel 500 \\
--bi ${STAR_genome_index} \\
--keepTemp \\
--od final_results_rmats \\
--tmp final_results_rmats
"""

Related

Nextflow name collision

I have files with identical names but in different folders. Nextflow stages these files into the same work directory, resulting in name collisions. My question is how to deal with that without renaming the files. Example:
# Example data
mkdir folder1 folder2
echo 1 > folder1/file.txt
echo 2 > folder2/file.txt
# We read from samplesheet
$ cat samplesheet.csv
sample,file
sample1,/home/atpoint/foo/folder1/file.txt
sample1,/home/atpoint/foo/folder2/file.txt
# Nextflow main.nf
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
// Read samplesheet and group files by sample (first column)
samplesheet = Channel
.fromPath(params.samplesheet)
.splitCsv(header:true)
.map {
sample = it['sample']
file = it['file']
tuple(sample, file)
}
ch_samplesheet = samplesheet.groupTuple(by:0)
// That creates a tuple like:
// [sample1, [/home/atpoint/foo/folder1/file.txt, /home/atpoint/foo/folder2/file.txt]]
// Dummy process that stages both files into the same work directory folder
process PRO {
input:
tuple val(samplename), path(files)
output:
path("out.txt")
script:
"""
echo $samplename with files $files > out.txt
"""
}
workflow { PRO(ch_samplesheet) }
# Run it
NXF_VER=21.10.6 nextflow run main.nf --samplesheet $(realpath samplesheet.csv)
...obviously resulting in:
N E X T F L O W ~ version 21.10.6
Launching `main.nf` [adoring_jennings] - revision: 87f26fa90b
[- ] process > PRO -
Error executing process > 'PRO (1)'
Caused by:
Process `PRO` input file name collision -- There are multiple input files for each of the following file names: file.txt
So, what now? The real world application here is sequencing replicates of the same fastq file, which then have the same name, but are in different folders, and I want to feed them into a process that merges them. I am aware of this section in the docs but cannot say that any of it was helpful or that I understand it properly.
You can use the stageAs option in your process definition.
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
samplesheet = Channel
.fromPath(params.samplesheet)
.splitCsv(header:true)
.map {
sample = it['sample']
file = it['file']
tuple(sample, file)
}
.groupTuple()
.set { ch_samplesheet }
// [sample1, [/path/to/folder1/file.txt, /path/to/folder2/file.txt]]
process PRO {
input:
tuple val(samplename), path(files, stageAs: "?/*")
output:
path("out.txt")
shell:
def input_str = files instanceof List ? files.join(" ") : files
"""
cat ${input_str} > out.txt
"""
}
workflow { PRO(ch_samplesheet) }
See an example from nf-core and the path input type docs
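For what it's worth, with stageAs: "?/*" the ? expands to an incrementing index, so each input file is staged into its own numbered subdirectory of the task work directory and the name collision disappears. Something like:
1/file.txt
2/file.txt
The cat command above then receives 1/file.txt 2/file.txt as its arguments.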

Nextflow - No such variable: prefix

I tried to run my Nextflow script and the first two processes worked fine, but the third process, Combinevcf, reported an error, showing that the variable prefix was not found.
process Annovar_genebased {
publishDir "${params.output}/annovar", mode: 'copy'
input:
path 'snp_anatation' from anatation1.flatMap()
val humandb
val refgene
output:
path "*.exonic_variant_function" into end
"""
prefix=\$(basename \$(readlink snp_anatation) .avinput)
perl $refgene -geneanno -dbtype refGene -out \${prefix}.anatation -buildver hg19 $snp_anatation $humandb -hgvs
rm *.log
rm *.variant_function
"""
}
process Annovar {
publishDir "${params.output}/annovar", mode: 'copy'
input:
path 'snp_anatation' from anatation2.flatMap()
val annovar_table
val humandb
output:
path "*.csv" into end1
"""
prefix=\$(basename \$(readlink snp_anatation) .avinput)
perl $annovar_table $snp_anatation $humandb -buildver hg19 -out \${prefix}.anatation -remove -protocol refGene,cytoBand,exac03,clinvar_20200316,gnomad211_exome -operation g,r,f,f,f -nastring . -csvout -polish
"""
}
I got stuck on this process
process Combinevcf {
publishDir "${params.output}/combinevcf", mode: 'copy'
input:
path 'genebased' from end.flatMap()
path 'allbased' from end1.flatMap()
output:
path "*_3.csv" into end3
"""
prefix=\$(basename \$(readlink genebased) .exonic_variant_function)
prefix1=\$(basename \$(readlink allbased) .csv)
cat ${prefix}.exonic_variant_function | tr -s '[:blank:]' ',' | awk 'BEGIN{FS=",";OFS="," }{ print \$3,\$13,\$22}' | awk ' BEGIN { OFS=", "; print "refGene", "refGene", "refGene", "refGene", "refGene", "Zogysity","chr", "filter" } { print \$0, "" } ' > ${prefix}_1.csv
awk 'BEGIN{FS=",";OFS="," }{ print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$9,\$10,\$15,\$21,\$24,\$25}' ${prefix1}.csv > ${prefix1}_2.csv
paste ${prefix}_1.csv ${prefix1}_2.csv > ${prefix}_3.csv
"""
}
I am not sure what went wrong, any help would be appreciated.
You need to escape your ${prefix} with backslashes to tell Nextflow that the variable prefix is in the script block (Bash) scope, and not in the Nextflow scope.
See https://www.nextflow.io/docs/latest/process.html#script for more info:
Since Nextflow uses the same Bash syntax for variable substitutions in strings, you must manage them carefully depending on whether you want to evaluate a Nextflow variable or a Bash variable
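Applied to your Combinevcf process, the script block would look something like this (a trimmed sketch; your awk/tr pipelines are elided and unchanged, only the escaping differs):
"""
prefix=\$(basename \$(readlink genebased) .exonic_variant_function)
prefix1=\$(basename \$(readlink allbased) .csv)
cat \${prefix}.exonic_variant_function | ... > \${prefix}_1.csv
awk '...' \${prefix1}.csv > \${prefix1}_2.csv
paste \${prefix}_1.csv \${prefix1}_2.csv > \${prefix}_3.csv
"""
Note that the two basename lines already escaped their dollar signs, which is why they worked; it is only the unescaped ${prefix} and ${prefix1} references further down that Nextflow tries, and fails, to resolve as Nextflow variables.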

How to process multiple samples as input in Nextflow?

I'm trying to learn Nextflow but it's not going very well. I used NGS paired-end sequencing data to build an analysis workflow from fastq files to vcf files using Nextflow. However, I got stuck right at the beginning, as shown in the code. The first process and the second process work fine, but when passing the files to the third process
there is an ERROR and I can't execute the whole workflow anymore. What should I do? Thanks for any help.
Following is my code:
#! /usr/bin/env nextflow
params.fq1 = "/home/duxu/project/data/*1.fq.gz"
params.fq2 = "/home/duxu/project/data/*2.fq.gz"
params.index = "/home/duxu/project/result/index.list"
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
params.ref = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/"
params.output='results'
params.refhg19 = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/hg19.p13.plusMT.no_alt_analysis_set.fa"
params.Mills = "/home/duxu/project/data/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf"
params.1000G = "/home/duxu/project/data/1000G_phase1.indels.hg19.sites.vcf"
params.dbsnp = "/home/duxu/project/data/dbsnp_138.hg19.vcf
fq2 = Channel.fromPath(params.fq2)
fq2 = Channel.fromPath(params.fq2)
index = Channel.fromPath(params.index)
index.into { index_1; index_2; index_3 }
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
ref = Channel.fromPath(params.ref)
refhg19 = Channel.fromPath(params.refhg19)
refhg19.into { refhg19_1; refhg19_2 ; refhg19_3; refhg19_4; refhg19_5}
Mills = Channel.fromPath(params.Mills)
1000G = Channel.fromPath(params.1000G)
dbsnp = Channel.fromPath(params.dbsnp)
This is the first process:
process soapnuke{
conda'soapnuke'
tag{"soapnuk ${fq1} ${fq2}"}
publishDir "${params.outdir}/SOAPnuke", mode: 'copy'
input:
file rawfq1 from fq1
file rawfq2 from fq2
output:
file 'clean1.fastq.gz' into clean_fq1
file 'clean2.fastq.gz' into clean_fq2
script:
"""
SOAPnuke filter -1 $rawfq1 -2 $rawfq2 -l 12 -q 0.5 -Q 2 -o . \
-C clean1.fastq.gz -D clean2.fastq.gz --trim 8,0,8,0
"""
}
The second process
process barcode_splitter{
tag{"barcode_splitter"}
publishDir "${params.output}/barcode_splitter", mode: 'copy'
input:
file split1 from clean_fq1
file split2 from clean_fq2
file index from index_1
output:
file '*-read-1.fastq.gz' into trimmed_index1
file '*-read-2.fastq.gz' into trimmed_index2
script:
"""
barcode_splitter --bcfile $index $split1 $split2 --idxread 1 2 --mismatches 1 --suffix .fastq --gzipout
mv multimatched-read-1.fastq.gz multicatched.1.fastq.gz
mv multimatched-read-2.fastq.gz multicatched.2.fastq.gz
mv untimatched-read-1.fastq.gz multicatched.1.fastq.gz
mv untimatched-read-2.fastq.gz multicatched.2.fastq.gz
"""
}
The third is the one I got an error from. In fact, this process handles multiple samples, since the previous process barcode_splitter outputs multiple files. This cutadapt process is designed to excise the first few bases of each sample.
process cutadapt{
tag{"cutadapt"}
publishDir "${params.output}/cut_primer", mode: 'copy'
input:
val sample from sample
file primer_1 from primer1
file primer_2 from primer2
file ${sample}-read-1.fastq.gz from trimmed_index1.collect()
file ${sample}-read-2.fastq.gz from trimmed_index2.collect()
output:
file '*.trim.1.fastq.gz' into trimmed_primer1
file '*.trim.2.fastq.gz' into trimmed_primer2
script:
"""
cutadapt -g file:$primer_1 -G file:$primer_2 -j 64 --discard-untrimmed -o \${sample}.trim.1.fastq.gz -p \$(sample}.trim.2.fastq.gz ${sample}-read-1.fastq.gz ${sample}-read-2.fastq.gz
"""
}
The fourth process is designed to map multiple samples against a reference genome:
process bwa_mapping{
tag{bwa_maping}
publishDir "${params.output}/barcode_splitter", mode: 'copy'
input:
file pair1 from trimmed_primer1
file pair2 from trimmed_primer2
file refhg19 from refhg19_1
output:
file '*_R1_R2.bam' into addheader
script:
"""
bwa mem -t 64 $refhg19 trimmed_primer_*-read-1.fastq trimmed_primer_*-read-2.fastq | samtools view -#8 -b | samtools sort -m 2G -#64 > * _R1_R2.bam
"""
}
The next remaining processes are all multisample-based operations
process {BaseRecalibrator
tag{"BaseRecalibrator"}
publishDir "${params.output}/BQSR/BaseRecalibratoraddheader", mode: 'copy'
input:
file BaseRecalibrator from BaseRecalibrator_1
file refhg19 from refhg19_2
file Mills from Mills
file 1000G from 1000G
file dbsnp from dbsnp
output:
file '*.recal.table' into ApplyBQSR
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" BaseRecalibrator -I BaseRecalibrator -R $refhg19_2 --known-sites $Mills --known-sites $1000G --known-sites $dbsnp -O *.recal.table
"""
}
process {ApplyBQSR
tag{"ApplyRecalibrator"}
publishDir "${params.output}/BQSR/ApplyRecalibrator", mode: 'copy'
input:
file ApplyBQSR from ApplyBQSR
file refhg19 from refhg19_3
file BaseRecalibrator from BaseRecalibrator_2
output:
file '*.bam' into HaplotypeCaller
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" ApplyBQSR -I BaseRecalibrator -R $refhg19_3 --bqsr-recal-file $AppleBQSR -O *.bam
"""
}
process {HaplotypeCaller
tag{"HaplotypeCallerr"}
publishDir "${params.output}/GATK/HaplotypeCaller", mode: 'copy'
input:
file HaplotypeCaller from HaplotypeCaller
file refhg19 from refhg19_4
file BaseRecalibrator from BaseRecalibrator_3
output:
file '*.g.vcf.gz' into GenotypeGVCFs
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" HaplotypeCaller -R $refhg19_4 -I BaseRecalibrator -O *.g.vcf.gz -ERC GVCF
"""
}
process {GenotypeGVCFs
tag{"GenotypeGVCFs"}
publishDir "${params.output}/GATK/GenotypeGVCFs", mode: 'copy'
input:
file GenotypeGVCFs from GenotypeGVCFs
file refhg19 from refhg19_5
output:
file '*.vcf.gz' into end
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" GenotypeGVCFs -R $refhg19_5 -V $GenotypeGVCFs -O *.vcf.gz
"""
Not sure what the error is exactly, but maybe it doesn't matter. It looks like you've declared more than one queue channel in your 'cutadapt' input declaration. Usually you don't want to do this. Please see: understand how multiple input channels work.
Note that the Channel.fromPath factory method creates a queue channel. Here, 'primer1' and 'primer2' are both queue channels that each provide only a single value:
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
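Because queue channels are consumed as they are read, a process with more than one queue channel only executes as many times as the channel with the fewest values allows. A toy illustration of the problem (hypothetical process, DSL1 syntax):
process TOY {
input:
val x from Channel.of(1, 2, 3)
val y from Channel.of('a')
script:
"""
echo $x $y
"""
}
// TOY runs only once, with x=1 and y='a'; the remaining values of x are never used.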
What you want instead are value channels, which can be read an infinite number of times without consuming their content:
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = file(params.primer1)
primer2 = file(params.primer2)
A typical process, which involves executing a task for each of n samples, will involve a single queue channel and zero or more value channels. For example:
process cutadapt{
tag { sample }
publishDir "${params.output}/cut_primer", mode: 'copy'
cpus 64
input:
tuple val(sample), path(fq1), path(fq2) from trimmed_index
path primer1
path primer2
output:
tuple val(sample), path("${sample}.trim.1.fq.gz"), path("${sample}.trim.2.fq.gz")
script:
"""
cutadapt \\
-g "file:${primer1}" \\
-G "file:${primer2} \\
-j ${task.cpus} \\
--discard-untrimmed \\
-o "${sample}.trim.1.fq.gz" \\
-p "${sample}.trim.2.fq.gz" \\
"${fq1}" \\
"${fq2}"
"""
}
Note that when the file input name is the same as the channel name, the from channel declaration can be omitted. I also used the path qualifier above, as it should be preferred over the file qualifier when using Nextflow 19.10.0 or later.
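For example, since the file object created with primer1 = file(params.primer1) has the same name as the input, these two declarations behave the same (a minimal DSL1 sketch):
path primer1 from primer1
// same as:
path primer1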

Jenkins on Windows. Declarative pipeline Jenkinsfile. How to set and get correct variable values

I use a Jenkinsfile to run the stages.
It is a declarative pipeline in Jenkins installed on Windows.
At the beginning I do:
pipeline {
agent { label 'master'}
environment {
My_build_result = 7
}
....
Then:
stage('Test') {
steps {
echo 'Testing..'
bat """
cd Utils
"C:\\Program Files\\MATLAB\\R2019b\\bin\\matlab.exe" -wait -nodisplay -nosplash -nodesktop -r "run('automatic_tests\\run_test.m');"
echo %errorlevel%
set /a My_build_result_temp = %errorlevel%
set My_build_result = %My_build_result_temp%
"""
script {
My_build_result = bat(returnStatus:true , script: "exit (2)").trim()
echo "My_build_result ${env.My_build_result}"
if (My_build_result != 0) {
echo "inside if"
}
}
}
}
The variable My_build_result gets the value 7 at the beginning.
Inside the bat section, it is supposed to get the value 0 from %errorlevel%.
Inside the script section it is supposed to get the value 2.
BUT
in the echo "My_build_result ${env.My_build_result}" I get a print of 7
(and it goes inside the if statement).
How do I define a variable that can be assigned a value both in the bat """ ... """ and in the script { ... } sections of the stage, and that is also visible in other stages and in the post { always { .. } } at the end?
BTW: adding env. before My_build_result (env.My_build_result) does not work.
Thanks a lot
In the first bat call, you are setting the environment variable only inside the batch script environment. Environment variable values assigned through set don't persist after the script ends; think of them as local variables. Simply use returnStatus: true to return the last value of ERRORLEVEL. There is no need to use %ERRORLEVEL% in the batch script here.
steps {
script {
My_build_result = bat returnStatus: true, script: """
cd Utils
"C:\\Program Files\\MATLAB\\R2019b\\bin\\matlab.exe" -wait -nodisplay -nosplash -nodesktop -r "run('automatic_tests\\run_test.m');"
"""
// My_build_result now has the value of ERRORLEVEL from the last command
// called in the batch script.
}
}
In the 2nd bat call, the 1st mistake is calling the trim() method. The result type of the bat step is Integer when returnStatus: true is passed. The trim() method is only available when returnStdout: true is passed, in which case the result type would be String. The 2nd mistake is the parentheses around the exit code value. The fixed code should look like:
My_build_result = bat returnStatus: true, script: "exit 2"
// My_build_result now equals 2
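If you also need the value in later stages and in post { always { .. } }, one option (a sketch, untested on your setup; the variable name is illustrative) is to assign it to env inside a script block. Values assigned to env this way persist for the remainder of the run, but note they are stored as Strings:
script {
def rc = bat returnStatus: true, script: 'exit 2'
env.MY_BUILD_RESULT = "${rc}"
}
// later, e.g. in another stage or in post { always { .. } }:
echo "My_build_result ${env.MY_BUILD_RESULT}"
// compare as a String, since env values are Strings:
// if (env.MY_BUILD_RESULT != '0') { ... }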

How to prompt for target-specific Makefile variable if undefined?

This is similar to another issue, but I only want make to prompt for a value if I'm running a specific target and a mandatory variable has not been specified.
The current code:
install-crontab: PASSWORD ?= "$(shell read -p "Password: "; echo "$$REPLY")"
install-crontab: $(SCRIPT_PATH)
	@echo "#midnight \"$(SCRIPT_PATH)\" [...] \"$(PASSWORD)\""
This just results in the following output and no prompt:
Password: read: 1: arg count
#midnight [...] ""
The important point here is that I have to ask only when running this target, and only if the variable has not been defined. I can't use a configure script, because obviously I shouldn't store passwords in a config script, and because this target is not part of the standard installation procedure.
Turns out the problem was that Makefiles don't use Dash / Bash-style quotation, and that Dash's read built-in needs a variable name, unlike Bash. Resulting code:
install-crontab-delicious: $(DELICIOUS_TARGET_PATH)
	@while [ -z "$$DELICIOUS_USER" ]; do \
		read -r -p "Delicious user name: " DELICIOUS_USER; \
	done && \
	while [ -z "$$DELICIOUS_PASSWORD" ]; do \
		read -r -p "Delicious password: " DELICIOUS_PASSWORD; \
	done && \
	while [ -z "$$DELICIOUS_PATH" ]; do \
		read -r -p "Delicious backup path: " DELICIOUS_PATH; \
	done && \
	( \
		CRONTAB_NOHEADER=Y crontab -l || true; \
		printf '%s' \
			'#midnight ' \
			'"$(DELICIOUS_TARGET_PATH)" ' \
			"\"$$DELICIOUS_USER\" " \
			"\"$$DELICIOUS_PASSWORD\" " \
			"\"$$DELICIOUS_PATH\""; \
		printf '\n') | crontab -
Result:
$ crontab -r; make install-crontab-delicious && crontab -l
Delicious user name: a\b c\d
Delicious password: e f g
Delicious backup path: h\ i
no crontab for <user>
#midnight "/usr/local/bin/export_Delicious" "a\b c\d" "e f g" "h\ i"
$ DELICIOUS_PASSWORD=foo make install-crontab-delicious && crontab -l
Delicious user name: bar
Delicious backup path: baz
#midnight "/usr/local/bin/export_Delicious" "a\b c\d" "e f g" "h\ i"
#midnight "/usr/local/bin/export_Delicious" "bar" "foo" "baz"
This code:
treats all input characters as literals, so it works with spaces and backslashes,
avoids problems if the user presses Enter without writing anything,
uses environment variables if they exist, and
works whether crontab is empty or not.
l0b0's answer helped me with a similar problem where I wanted to exit if the user doesn't input 'y'. I ended up doing this:
	@while [ -z "$$CONTINUE" ]; do \
		read -r -p "Type anything but Y or y to exit. [y/N] " CONTINUE; \
	done ; \
	if [ ! $$CONTINUE == "y" ]; then \
		if [ ! $$CONTINUE == "Y" ]; then \
			echo "Exiting." ; exit 1 ; \
		fi \
	fi
I hope that helps someone. It's hard to find more info about using user input for an if/else in a makefile.