I tried to run my Nextflow script and the first two processes worked fine, but the third process, Combinevcf, reported an error saying that the variable prefix could not be found.
process Annovar_genebased {
publishDir "${params.output}/annovar", mode: 'copy'
input:
path 'snp_anatation' from anatation1.flatMap()
val humandb
val refgene
output:
path "*.exonic_variant_function" into end
"""
prefix=\$(basename \$(readlink snp_anatation) .avinput)
perl $refgene -geneanno -dbtype refGene -out \${prefix}.anatation -buildver hg19 $snp_anatation $humandb -hgvs
rm *.log
rm *.variant_function
"""
}
process Annovar {
publishDir "${params.output}/annovar", mode: 'copy'
input:
path 'snp_anatation' from anatation2.flatMap()
val annovar_table
val humandb
output:
path "*.csv" into end1
"""
prefix=\$(basename \$(readlink snp_anatation) .avinput)
perl $annovar_table $snp_anatation $humandb -buildver hg19 -out \${prefix}.anatation -remove -protocol refGene,cytoBand,exac03,clinvar_20200316,gnomad211_exome -operation g,r,f,f,f -nastring . -csvout -polish
"""
}
I got stuck on this process:
process Combinevcf {
publishDir "${params.output}/combinevcf", mode: 'copy'
input:
path 'genebased' from end.flatMap()
path 'allbased' from end1.flatMap()
output:
path "*_3.csv" into end3
"""
prefix=\$(basename \$(readlink genebased) .exonic_variant_function)
prefix1=\$(basename \$(readlink allbased) .csv)
cat ${prefix}.exonic_variant_function | tr -s '[:blank:]' ',' | awk 'BEGIN{FS=",";OFS="," }{ print \$3,\$13,\$22}' | awk ' BEGIN { OFS=", "; print "refGene", "refGene", "refGene", "refGene", "refGene", "Zogysity","chr", "filter" } { print \$0, "" } ' > ${prefix}_1.csv
awk 'BEGIN{FS=",";OFS="," }{ print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$9,\$10,\$15,\$21,\$24,\$25}' ${prefix1}.csv > ${prefix1}_2.csv
paste ${prefix}_1.csv ${prefix1}_2.csv > ${prefix}_3.csv
"""
}
I am not sure what went wrong; any help would be appreciated.
You need to escape your ${prefix} with backslashes (i.e. \${prefix}) to tell Nextflow that the variable prefix belongs to the script block (Bash) scope, and not to the Nextflow scope.
See https://www.nextflow.io/docs/latest/process.html#script for more info:
Since Nextflow uses the same Bash syntax for variable substitutions in strings, you must manage them carefully depending on whether you want to evaluate a Nextflow variable or a Bash variable
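For example, the Combinevcf script block could be written along these lines (an untested sketch of your own commands; the Bash variables keep their backslash escapes, and the inputs are read through their staged names genebased and allbased):
"""
prefix=\$(basename \$(readlink genebased) .exonic_variant_function)
prefix1=\$(basename \$(readlink allbased) .csv)
tr -s '[:blank:]' ',' < genebased | awk 'BEGIN{FS=",";OFS=","}{ print \$3,\$13,\$22 }' | awk 'BEGIN{ OFS=", "; print "refGene","refGene","refGene","refGene","refGene","Zogysity","chr","filter" }{ print \$0, "" }' > \${prefix}_1.csv
awk 'BEGIN{FS=",";OFS=","}{ print \$1,\$2,\$3,\$4,\$5,\$6,\$7,\$8,\$9,\$10,\$15,\$21,\$24,\$25 }' allbased > \${prefix1}_2.csv
paste \${prefix}_1.csv \${prefix1}_2.csv > \${prefix}_3.csv
"""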
I have files with identical names but in different folders. Nextflow stages these files into the same work directory resulting in name collisions. My question is how to deal with that without renaming the files. Example:
# Example data
mkdir folder1 folder2
echo 1 > folder1/file.txt
echo 2 > folder2/file.txt
# We read from samplesheet
$ cat samplesheet.csv
sample,file
sample1,/home/atpoint/foo/folder1/file.txt
sample1,/home/atpoint/foo/folder2/file.txt
# Nextflow main.nf
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
// Read samplesheet and group files by sample (first column)
samplesheet = Channel
.fromPath(params.samplesheet)
.splitCsv(header:true)
.map {
sample = it['sample']
file = it['file']
tuple(sample, file)
}
ch_samplesheet = samplesheet.groupTuple(by:0)
// That creates a tuple like:
// [sample1, [/home/atpoint/foo/folder1/file.txt, /home/atpoint/foo/folder2/file.txt]]
// Dummy process that stages both files into the same work directory folder
process PRO {
input:
tuple val(samplename), path(files)
output:
path("out.txt")
script:
"""
echo $samplename with files $files > out.txt
"""
}
workflow { PRO(ch_samplesheet) }
# Run it
NXF_VER=21.10.6 nextflow run main.nf --samplesheet $(realpath samplesheet.csv)
...obviously resulting in:
N E X T F L O W ~ version 21.10.6
Launching `main.nf` [adoring_jennings] - revision: 87f26fa90b
[- ] process > PRO -
Error executing process > 'PRO (1)'
Caused by:
Process `PRO` input file name collision -- There are multiple input files for each of the following file names: file.txt
So, what now? The real-world application here is sequencing replicates of the same sample, which end up as fastq files with the same name but in different folders, and I want to feed them into a process that merges them. I am aware of this section in the docs but cannot say that it was helpful or that I understand it properly.
You can use the stageAs option in your process definition.
#! /usr/bin/env nextflow
nextflow.enable.dsl=2

samplesheet = Channel
    .fromPath(params.samplesheet)
    .splitCsv(header:true)
    .map {
        sample = it['sample']
        file = it['file']
        tuple(sample, file)
    }
    .groupTuple()
    .set { ch_samplesheet }

// [sample1, [/path/to/folder1/file.txt, /path/to/folder2/file.txt]]

process PRO {
    input:
    tuple val(samplename), path(files, stageAs: "?/*")

    output:
    path("out.txt")

    script:
    def input_str = files instanceof List ? files.join(" ") : files
    """
    cat ${input_str} > out.txt
    """
}

workflow { PRO(ch_samplesheet) }
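With the stageAs: "?/*" pattern, each incoming file should be staged into its own numbered subdirectory of the task work directory, so the two file.txt no longer collide. Roughly, the generated command then looks like this (a sketch assuming two input files):
cat 1/file.txt 2/file.txt > out.txt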
See an example from nf-core and the path input type docs
I'm trying to read each line from a CSV file and then execute a Nextflow process for each line of it. However, I don't know why I get the following error when I run the Nextflow script:
Argument of file function cannot be null
params.index_fasta = "/home/test_1000Genomes.csv"
Channel
.fromPath(params.index_fasta)
.splitCsv(header:true)
.map { row-> set(row.sampleId, file(row.read1), file(row.read2)) }
.set { sample_run_ch }
process FastQCFQ {
tag "QC of fasta"
publishDir (
path: "${params.PublishDir}/Reports/${sampleId}/FastQC",
mode: 'copy',
overwrite: 'true'
)
input:
set sampleId, file("${read1}"), file("${read2}") from sample_run_ch
output:
file("*.{html,zip}") into QC_Report
script:
"""
fastqc -t 2 -q $read1 $read2
"""
}
ch_qc = QC_Report
The CSV file is actually a tab-separated file with a header of the same names (sampleId, read1, read2), where read1 and read2 are the paths of the fasta files. I've tried changing some parameters inside the Nextflow process but without getting it to work.
Argument of file function cannot be null
As Pallie notes in the comments above, if the input CSV is not parsed correctly (for example, if the wrong delimiter is used) the variables that you expect to contain strings may actually be null. If your CSV is actually tab-separated, use the splitCsv sep parameter to set it:
params.samples_tsv = './samples.tsv'
params.publish_dir = './results'

Channel
    .fromPath( params.samples_tsv )
    .splitCsv( header: true, sep: '\t' )
    .map { row -> tuple( row.sampleId, file(row.read1), file(row.read2) ) }
    .set { sample_run_ch }

process FastQC {

    tag { sampleId }

    publishDir (
        path: "${params.publish_dir}/Reports/${sampleId}/FastQC",
        mode: 'copy',
        overwrite: 'true',
    )

    input:
    tuple val(sampleId), path(read1), path(read2) from sample_run_ch

    output:
    path "*.{html,zip}" into QC_Report

    script:
    """
    fastqc -t 2 -q "${read1}" "${read2}"
    """
}
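For reference, a matching samples.tsv could look something like this (columns separated by real tab characters; the paths are placeholders):
sampleId	read1	read2
sample1	/path/to/sample1_R1.fastq.gz	/path/to/sample1_R2.fastq.gz
sample2	/path/to/sample2_R1.fastq.gz	/path/to/sample2_R2.fastq.gz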
I'm trying to learn Nextflow but it's not going very well. I used NGS paired-end sequencing data to build an analysis workflow from fastq files to vcf files with Nextflow. However, I got stuck right at the beginning, as shown in the code. The first and second processes work fine, but when the files are passed to the third process there is an error and I can't run the rest of the workflow. What should I do? Thanks for any help.
Following is my code:
#! /usr/bin/env nextflow
params.fq1 = "/home/duxu/project/data/*1.fq.gz"
params.fq2 = "/home/duxu/project/data/*2.fq.gz"
params.index = "/home/duxu/project/result/index.list"
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
params.ref = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/"
params.output='results'
params.refhg19 = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/hg19.p13.plusMT.no_alt_analysis_set.fa"
params.Mills = "/home/duxu/project/data/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf"
params.1000G = "/home/duxu/project/data/1000G_phase1.indels.hg19.sites.vcf"
params.dbsnp = "/home/duxu/project/data/dbsnp_138.hg19.vcf"
fq1 = Channel.fromPath(params.fq1)
fq2 = Channel.fromPath(params.fq2)
index = Channel.fromPath(params.index)
index.into { index_1; index_2; index_3 }
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
ref = Channel.fromPath(params.ref)
refhg19 = Channel.fromPath(params.refhg19)
refhg19.into { refhg19_1; refhg19_2 ; refhg19_3; refhg19_4; refhg19_5}
Mills = Channel.fromPath(params.Mills)
1000G = Channel.fromPath(params.1000G)
dbsnp = Channel.fromPath(params.dbsnp)
This is the first process:
process soapnuke{
conda'soapnuke'
tag{"soapnuk ${fq1} ${fq2}"}
publishDir "${params.outdir}/SOAPnuke", mode: 'copy'
input:
file rawfq1 from fq1
file rawfq2 from fq2
output:
file 'clean1.fastq.gz' into clean_fq1
file 'clean2.fastq.gz' into clean_fq2
script:
"""
SOAPnuke filter -1 $rawfq1 -2 $rawfq2 -l 12 -q 0.5 -Q 2 -o . \
-C clean1.fastq.gz -D clean2.fastq.gz --trim 8,0,8,0
"""
}
The second process:
process barcode_splitter{
tag{"barcode_splitter"}
publishDir "${params.output}/barcode_splitter", mode: 'copy'
input:
file split1 from clean_fq1
file split2 from clean_fq2
file index from index_1
output:
file '*-read-1.fastq.gz' into trimmed_index1
file '*-read-2.fastq.gz' into trimmed_index2
script:
"""
barcode_splitter --bcfile $index $split1 $split2 --idxread 1 2 --mismatches 1 --suffix .fastq --gzipout
mv multimatched-read-1.fastq.gz multicatched.1.fastq.gz
mv multimatched-read-2.fastq.gz multicatched.2.fastq.gz
mv untimatched-read-1.fastq.gz multicatched.1.fastq.gz
mv untimatched-read-2.fastq.gz multicatched.2.fastq.gz
"""
}
The third process is where I got the error. This process handles multiple samples, since the previous process barcode_splitter outputs multiple files. The cutadapt process is designed to excise the first few bases of each sample.
process cutadapt{
tag{"cutadapt"}
publishDir "${params.output}/cut_primer", mode: 'copy'
input:
val sample from sample
file primer_1 from primer1
file primer_2 from primer2
file ${sample}-read-1.fastq.gz from trimmed_index1.collect()
file ${sample}-read-2.fastq.gz from trimmed_index2.collect()
output:
file '*.trim.1.fastq.gz' into trimmed_primer1
file '*.trim.2.fastq.gz' into trimmed_primer2
script:
"""
cutadapt -g file:$primer_1 -G file:$primer_2 -j 64 --discard-untrimmed -o \${sample}.trim.1.fastq.gz -p \$(sample}.trim.2.fastq.gz ${sample}-read-1.fastq.gz ${sample}-read-2.fastq.gz
"""
}
The fourth process is designed to map multiple samples to a reference genome:
process bwa_mapping{
tag{bwa_maping}
publishDir "${params.output}/barcode_splitter", mode: 'copy'
input:
file pair1 from trimmed_primer1
file pair2 from trimmed_primer2
file refhg19 from refhg19_1
output:
file '*_R1_R2.bam' into addheader
script:
"""
bwa mem -t 64 $refhg19 trimmed_primer_*-read-1.fastq trimmed_primer_*-read-2.fastq | samtools view -#8 -b | samtools sort -m 2G -#64 > * _R1_R2.bam
"""
}
The remaining processes all operate on multiple samples:
process BaseRecalibrator {
tag{"BaseRecalibrator"}
publishDir "${params.output}/BQSR/BaseRecalibratoraddheader", mode: 'copy'
input:
file BaseRecalibrator from BaseRecalibrator_1
file refhg19 from refhg19_2
file Mills from Mills
file 1000G from 1000G
file dbsnp from dbsnp
output:
file '*.recal.table' into ApplyBQSR
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" BaseRecalibrator -I BaseRecalibrator -R $refhg19_2 --known-sites $Mills --known-sites $1000G --known-sites $dbsnp -O *.recal.table
"""
}
process ApplyBQSR {
tag{"ApplyRecalibrator"}
publishDir "${params.output}/BQSR/ApplyRecalibrator", mode: 'copy'
input:
file ApplyBQSR from ApplyBQSR
file refhg19 from refhg19_3
file BaseRecalibrator from BaseRecalibrator_2
output:
file '*.bam' into HaplotypeCaller
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" ApplyBQSR -I BaseRecalibrator -R $refhg19_3 --bqsr-recal-file $AppleBQSR -O *.bam
"""
}
process HaplotypeCaller {
tag{"HaplotypeCallerr"}
publishDir "${params.output}/GATK/HaplotypeCaller", mode: 'copy'
input:
file HaplotypeCaller from HaplotypeCaller
file refhg19 from refhg19_4
file BaseRecalibrator from BaseRecalibrator_3
output:
file '*.g.vcf.gz' into GenotypeGVCFs
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" HaplotypeCaller -R $refhg19_4 -I BaseRecalibrator -O *.g.vcf.gz -ERC GVCF
"""
}
process GenotypeGVCFs {
tag{"GenotypeGVCFs"}
publishDir "${params.output}/GATK/GenotypeGVCFs", mode: 'copy'
input:
file GenotypeGVCFs from GenotypeGVCFs
file refhg19 from refhg19_5
output:
file '*.vcf.gz' into end
script:
"""
gatk --java-options "-XX:ParallelGCThreads=2" GenotypeGVCFs -R $refhg19_5 -V $GenotypeGVCFs -O *.vcf.gz
"""
Not sure what the error is exactly, but maybe it doesn't matter. It looks like you've declared more than one queue channel in your 'cutadapt' input declaration. Usually you don't want to do this. Please see: understand how multiple input channels work.
Note that the Channel.fromPath factory method creates a queue channel. Here, 'primer1' and 'primer2' are both queue channels that each provide only a single value:
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
What you want instead are value channels which can be read an infinite number of times without consuming their content:
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = file(params.primer1)
primer2 = file(params.primer2)
A typical process, in which a task is executed for each of n samples, will involve a single queue channel and zero or more value channels. For example:
process cutadapt {

    tag { sample }

    publishDir "${params.output}/cut_primer", mode: 'copy'

    cpus 64

    input:
    tuple val(sample), path(fq1), path(fq2) from trimmed_index
    path primer1
    path primer2

    output:
    tuple val(sample), path("${sample}.trim.1.fq.gz"), path("${sample}.trim.2.fq.gz")

    script:
    """
    cutadapt \\
        -g "file:${primer1}" \\
        -G "file:${primer2}" \\
        -j ${task.cpus} \\
        --discard-untrimmed \\
        -o "${sample}.trim.1.fq.gz" \\
        -p "${sample}.trim.2.fq.gz" \\
        "${fq1}" \\
        "${fq2}"
    """
}
Note that when the file input name is the same as the channel name, the from channel declaration can be omitted. I also used the path qualifier above, as it should be preferred over the file qualifier when using Nextflow 19.10.0 or later.
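For the tuple input above, the upstream channel needs to emit (sample, read1, read2) tuples. One possible sketch for building it from your barcode_splitter outputs, pairing the two read files by sample name (the suffix handling is an assumption based on the file names above, adjust it to your real names):
trimmed_index = trimmed_index1
    .flatten()
    .map { r1 -> tuple( r1.name - '-read-1.fastq.gz', r1 ) }
    .join( trimmed_index2
        .flatten()
        .map { r2 -> tuple( r2.name - '-read-2.fastq.gz', r2 ) } )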
I am struggling to parse a log file.
Here is how it looks:
node_name: na2-devdb-cssx
run_id: 3c3424f3-8a62-4f4c-b97a-2096a2afc070
start_time: 2015-06-26T21:00:44Z
status: failure

node_name: eu1-devsx
run_id: f5ed13a3-1f02-490f-b518-97de9649daf5
start_time: 2015-06-26T21:00:34Z
status: success
I need to get the blocks whose last line contains "failure".
Ideally it would also take the timestamp into account, e.g. only blocks whose start_time matches "2015-06-26T2*".
And here what I have tried so far:
sed -e '/node_name/./failure/p' /file
sed -n '/node_name/./failure/p' /file
awk '/node_name/,/failure/' file
sed -e 's/node_name\(.*\)failure/\1/' file
None of them works for me.
They just give me everything, not only the failure blocks...
For example:
[root@localhost chef-repo-zilliant]# sed -n '/node_name/,/failure/p' /tmp/run.txt | head
node_name: eu1-devdb-linc
run_id: e49fe64d-567d-4627-a10d-477e17fb6016
start_time: 2015-06-28T20:59:55Z
status: success

node_name: eu1-devjs1
run_id: c6c7f668-b912-4459-9d56-94d1e0788802
start_time: 2015-06-28T20:59:53Z
status: success
I have no idea why it doesn't work; these methods seem to work fine for everyone else...
Thank you in advance.
A way with GNU sed:
sed -n ':a;/^./{H;n;ba;};x;/2015-06-26T21/{/failure$/p;};' file.txt
details:
:a; # define the label "a"
/^./ { # condition: when a line is not empty
H; # append it to the buffer space
n; # load the next line in the pattern space
ba; # go to label "a"
};
x; # swap buffer space and pattern space
/2015-06-26T21/ { # condition: if the needed date is in the block
/failure$/ p; # condition: if "failure" is in the block then print
};
I noticed you tried awk, although you only tagged the question with sed, so I will add an awk solution as well.
You can play with the built-in variables that control how records and fields are split, like:
awk '
BEGIN { RS = ""; FS = OFS = "\n"; ORS = "\n\n" }
$NF ~ /failure/ && $(NF-1) ~ /2015-06-26T2/ { print }
' infile
RS = "" switches to paragraph mode, so records are separated by blank lines. FS and OFS split each record into fields at newlines, and ORS prints the output like the original input, with a blank line between records.
It yields:
node_name: na2-devdb-cssx
run_id: 3c3424f3-8a62-4f4c-b97a-2096a2afc070
start_time: 2015-06-26T21:00:44Z
status: failure
Use grep.
grep -oPz '\bnode_name:(?:(?!\n\n)[\s\S])*?2015-06-26T2(?:(?!\n\n)[\s\S])*?\bfailure\b' file
The main part here is (?:(?!\n\n)[\s\S])*?, which matches any character, as long as it does not run into a blank line, zero or more times.
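Against the sample log above, this prints only the failing block, roughly:
node_name: na2-devdb-cssx
run_id: 3c3424f3-8a62-4f4c-b97a-2096a2afc070
start_time: 2015-06-26T21:00:44Z
status: failure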
I have an smb.conf ini file which is overwritten whenever edited with a certain GUI tool, wiping out a custom setting. This means I need a cron job to ensure that one particular section in the file contains a certain option=value pair, and insert it at the end of the section if it doesn't exist.
Example
Ensure that hosts deny=192.168.23. exists within the [myshare] section:
[global]
printcap name = cups
winbind enum groups = yes
security = user
[myshare]
path=/mnt/myshare
browseable=yes
enable recycle bin=no
writeable=yes
hosts deny=192.168.23.
[Another Share]
invalid users=nobody,nobody
valid users=nobody,nobody
path=/mnt/share2
browseable=no
Long-winded solution using awk
After a long time struggling with sed, I concluded that it might not be the right tool for the job. So I moved over to awk and came up with this:
#!/bin/sh
file="smb.conf"
tmp="smb.conf.tmp"
section="myshare"
opt="hosts deny=192.168.23."
awk '
BEGIN {
this_section=0;
opt_found=0;
}
# Match the line where our section begins
/^[ \t]*\['"$section"'\][ \t]*$/ {
this_section=1;
print $0;
next;
}
# Match lines containing our option
this_section == 1 && /^[ \t]*'"$opt"'[ \t]*$/ {
opt_found=1;
}
# Match the following section heading
this_section == 1 && /^[ \t]*\[.*$/ {
this_section=0;
if (opt_found != 1) {
print "\t'"$opt"'";
}
}
# Print every line
{ print $0; }
END {
# In case our section is the very last in the file
if (this_section == 1 && opt_found != 1) {
print "\t'"$opt"'";
}
}
' $file > $tmp
# Overwrite $file only if $tmp is different
diff -q $file $tmp > /dev/null 2>&1
if [ $? -ne 0 ]; then
mv $tmp $file
# reload smb.conf here
else
rm $tmp
fi
I can't help feeling that this is a long script to achieve a simple task. Is there a more efficient/elegant way to insert a property in an ini file using basic shell tools like sed and awk?
Consider using Python 3's configparser:
#!/usr/bin/python3
import sys
from configparser import ConfigParser

# Read the ini file given as the first argument, set the option, write it back.
cfg = ConfigParser()
cfg.read(sys.argv[1])
cfg['myshare']['hosts deny'] = '192.168.23.'
with open(sys.argv[1], 'w') as f:
    cfg.write(f)
To be called as ./filename.py smb.conf (i.e., the first parameter is the file to change).
Note that comments are not preserved by this. However, since a GUI overwrites the config and doesn't preserve custom options, I suspect that comments are already nuked and that this is not a worry in your case.
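If you run it from cron as you describe, the crontab entry could look something like this (schedule and script path are placeholders):
*/10 * * * * /usr/local/bin/ensure_smb_option.py /etc/samba/smb.conf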
Untested, should work though
awk -vT="hosts deny=192.168.23" 'x&&$0~T{x=0}x&&/^ *\[[^]]+\]/{print "\t\t"T;x=0}
/^ *\[myshare\]/{x++}1' file
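For readability, the same one-liner expanded with comments (behavior unchanged):
awk -v T="hosts deny=192.168.23" '
    x && $0 ~ T         { x = 0 }                    # option already present in [myshare]: stop watching
    x && /^ *\[[^]]+\]/ { print "\t\t" T; x = 0 }    # next section header reached: insert the option first
    /^ *\[myshare\]/    { x++ }                      # entering [myshare]: start watching
    1                                                # print every input line unchanged
' file
Note that, like the one-liner, this only inserts before the next section header, so it would not add the option if [myshare] were the last section in the file.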
This solution is a bit awkward. It uses the INI section header as the record separator. This means that there is an empty record before the first header, so when we match the header we're interested in, we have to read the next record to handle that INI section. Also, there are some printf commands because the records still contain leading and trailing newlines.
awk -v RS='[[][^]]+[]]' -v str="hosts deny=192.168.23." '
{printf "%s", $0; printf "%s", RT}
RT == "[myshare]" {
getline
printf "%s", $0
if (index($0, str) == 0) print str
printf "%s", RT
}
' smb.conf
RS is the awk variable that contains the regex to split the text into records.
RT is the awk variable that contains the actual text of the current record separator.
With GNU awk for a couple of extensions:
$ cat tst.awk
index($0,str) { found = 1 }
match($0,/^\s*\[([^]]+).*/,a) {
if ( (name == tgt) && !found ) { print indent str }
name = a[1]
found = 0
}
{ print; indent=gensub(/\S.*/,"","") }
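In short: the index($0,str) rule flags that the option was seen somewhere in the current section; the match() rule fires on every section header, inserts str (using the remembered indentation) if the section just ended was the target and the option was missing, then records the new section name and resets the flag; the last rule prints every line and remembers its leading whitespace as indent.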
$ awk -v tgt="myshare" -v str="hosts deny=192.168.23." -f tst.awk file
[global]
printcap name = cups
winbind enum groups = yes
security = user
[myshare]
path=/mnt/myshare
browseable=yes
enable recycle bin=no
writeable=yes
hosts deny=192.168.23.
[Another Share]
invalid users=nobody,nobody
valid users=nobody,nobody
path=/mnt/share2
browseable=no
$ awk -v tgt="myshare" -v str="fluffy bunny" -f tst.awk file
[global]
printcap name = cups
winbind enum groups = yes
security = user
[myshare]
path=/mnt/myshare
browseable=yes
enable recycle bin=no
writeable=yes
hosts deny=192.168.23.
fluffy bunny
[Another Share]
invalid users=nobody,nobody
valid users=nobody,nobody
path=/mnt/share2
browseable=no