I have files with identical names but in different folders. Nextflow stages these files into the same work directory, resulting in name collisions. My question is how to deal with that without renaming the files. Example:
# Example data
mkdir folder1 folder2
echo 1 > folder1/file.txt
echo 2 > folder2/file.txt
# We read from samplesheet
$ cat samplesheet.csv
sample,file
sample1,/home/atpoint/foo/folder1/file.txt
sample1,/home/atpoint/foo/folder2/file.txt
# Nextflow main.nf
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
// Read samplesheet and group files by sample (first column)
samplesheet = Channel
    .fromPath(params.samplesheet)
    .splitCsv(header: true)
    .map {
        sample = it['sample']
        file = it['file']
        tuple(sample, file)
    }
ch_samplesheet = samplesheet.groupTuple(by:0)
// That creates a tuple like:
// [sample1, [/home/atpoint/foo/folder1/file.txt, /home/atpoint/foo/folder2/file.txt]]
// Dummy process that stages both files into the same work directory folder
process PRO {
    input:
    tuple val(samplename), path(files)

    output:
    path("out.txt")

    script:
    """
    echo $samplename with files $files > out.txt
    """
}
workflow { PRO(ch_samplesheet) }
# Run it
NXF_VER=21.10.6 nextflow run main.nf --samplesheet $(realpath samplesheet.csv)
...obviously resulting in:
N E X T F L O W ~ version 21.10.6
Launching `main.nf` [adoring_jennings] - revision: 87f26fa90b
[- ] process > PRO -
Error executing process > 'PRO (1)'
Caused by:
Process `PRO` input file name collision -- There are multiple input files for each of the following file names: file.txt
So, what now? The real-world application here is sequencing replicates of the same sample, which produce fastq files with identical names in different folders, and I want to feed them into a process that merges them. I am aware of this section in the docs but cannot say that any of it was helpful or that I understand it properly.
You can use the stageAs option in your process definition.
#! /usr/bin/env nextflow
nextflow.enable.dsl=2
samplesheet = Channel
    .fromPath(params.samplesheet)
    .splitCsv(header: true)
    .map {
        sample = it['sample']
        file = it['file']
        tuple(sample, file)
    }
    .groupTuple()
    .set { ch_samplesheet }
// [sample1, [/path/to/folder1/file.txt, /path/to/folder2/file.txt]]
process PRO {
    input:
    tuple val(samplename), path(files, stageAs: "?/*")

    output:
    path("out.txt")

    script:
    def input_str = files instanceof List ? files.join(" ") : files
    """
    cat ${input_str} > out.txt
    """
}
workflow { PRO(ch_samplesheet) }
See this example from nf-core and the path input type docs.
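For reference: with stageAs: "?/*" each incoming file should be staged into its own numbered subdirectory of the task work directory (the ? being replaced with an index), so identically named files no longer collide. A sketch of the expected staging layout for the example above:

    1/file.txt    <- from folder1
    2/file.txt    <- from folder2

The process script then sees both paths, and files.join(" ") expands to "1/file.txt 2/file.txt", so the cat merges both files.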
I'm trying to learn Nextflow but it's not going very well. I used NGS-based paired-end sequencing data to build an analysis workflow from fastq files to vcf files using Nextflow. However, I got stuck right at the beginning, as shown in the code. The first process and the second process work fine, but when passing the files to the third process there is an error and I can't execute the whole workflow anymore. What should I do? Thanks for any help.
Following is my code:
#! /usr/bin/env nextflow
params.fq1 = "/home/duxu/project/data/*1.fq.gz"
params.fq2 = "/home/duxu/project/data/*2.fq.gz"
params.index = "/home/duxu/project/result/index.list"
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
params.ref = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/"
params.output='results'
params.refhg19 = "/home/duxu/project/data/hg19.p13.plusMT.no_alt_analysis_set/hg19.p13.plusMT.no_alt_analysis_set.fa"
params.Mills = "/home/duxu/project/data/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf"
params.1000G = "/home/duxu/project/data/1000G_phase1.indels.hg19.sites.vcf"
params.dbsnp = "/home/duxu/project/data/dbsnp_138.hg19.vcf"
fq1 = Channel.fromPath(params.fq1)
fq2 = Channel.fromPath(params.fq2)
index = Channel.fromPath(params.index)
index.into { index_1; index_2; index_3 }
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
ref = Channel.fromPath(params.ref)
refhg19 = Channel.fromPath(params.refhg19)
refhg19.into { refhg19_1; refhg19_2 ; refhg19_3; refhg19_4; refhg19_5}
Mills = Channel.fromPath(params.Mills)
1000G = Channel.fromPath(params.1000G)
dbsnp = Channel.fromPath(params.dbsnp)
This is the first process:
process soapnuke {
    conda 'soapnuke'
    tag { "soapnuke ${fq1} ${fq2}" }
    publishDir "${params.output}/SOAPnuke", mode: 'copy'

    input:
    file rawfq1 from fq1
    file rawfq2 from fq2

    output:
    file 'clean1.fastq.gz' into clean_fq1
    file 'clean2.fastq.gz' into clean_fq2

    script:
    """
    SOAPnuke filter -1 $rawfq1 -2 $rawfq2 -l 12 -q 0.5 -Q 2 -o . \
        -C clean1.fastq.gz -D clean2.fastq.gz --trim 8,0,8,0
    """
}
The second process:
process barcode_splitter {
    tag { "barcode_splitter" }
    publishDir "${params.output}/barcode_splitter", mode: 'copy'

    input:
    file split1 from clean_fq1
    file split2 from clean_fq2
    file index from index_1

    output:
    file '*-read-1.fastq.gz' into trimmed_index1
    file '*-read-2.fastq.gz' into trimmed_index2

    script:
    """
    barcode_splitter --bcfile $index $split1 $split2 --idxread 1 2 --mismatches 1 --suffix .fastq --gzipout
    mv multimatched-read-1.fastq.gz multicatched.1.fastq.gz
    mv multimatched-read-2.fastq.gz multicatched.2.fastq.gz
    mv untimatched-read-1.fastq.gz multicatched.1.fastq.gz
    mv untimatched-read-2.fastq.gz multicatched.2.fastq.gz
    """
}
The third is the one I got an error from. In fact, this process handles multiple samples, since the previous process barcode_splitter outputs multiple files. The cutadapt process is designed to excise the first few bases of each sample.
process cutadapt {
    tag { "cutadapt" }
    publishDir "${params.output}/cut_primer", mode: 'copy'

    input:
    val sample from sample
    file primer_1 from primer1
    file primer_2 from primer2
    file ${sample}-read-1.fastq.gz from trimmed_index1.collect()
    file ${sample}-read-2.fastq.gz from trimmed_index2.collect()

    output:
    file '*.trim.1.fastq.gz' into trimmed_primer1
    file '*.trim.2.fastq.gz' into trimmed_primer2

    script:
    """
    cutadapt -g file:$primer_1 -G file:$primer_2 -j 64 --discard-untrimmed -o \${sample}.trim.1.fastq.gz -p \$(sample}.trim.2.fastq.gz ${sample}-read-1.fastq.gz ${sample}-read-2.fastq.gz
    """
}
The fourth process is designed to map multiple samples to a reference genome:
process bwa_mapping {
    tag { "bwa_mapping" }
    publishDir "${params.output}/barcode_splitter", mode: 'copy'

    input:
    file pair1 from trimmed_primer1
    file pair2 from trimmed_primer2
    file refhg19 from refhg19_1

    output:
    file '*_R1_R2.bam' into addheader

    script:
    """
    bwa mem -t 64 $refhg19 trimmed_primer_*-read-1.fastq trimmed_primer_*-read-2.fastq | samtools view -@8 -b | samtools sort -m 2G -@64 > *_R1_R2.bam
    """
}
The remaining processes all operate on multiple samples:
process {BaseRecalibrator
    tag { "BaseRecalibrator" }
    publishDir "${params.output}/BQSR/BaseRecalibratoraddheader", mode: 'copy'

    input:
    file BaseRecalibrator from BaseRecalibrator_1
    file refhg19 from refhg19_2
    file Mills from Mills
    file 1000G from 1000G
    file dbsnp from dbsnp

    output:
    file '*.recal.table' into ApplyBQSR

    script:
    """
    gatk --java-options "-XX:ParallelGCThreads=2" BaseRecalibrator -I BaseRecalibrator -R $refhg19_2 --known-sites $Mills --known-sites $1000G --known-sites $dbsnp -O *.recal.table
    """
}
process {ApplyBQSR
    tag { "ApplyRecalibrator" }
    publishDir "${params.output}/BQSR/ApplyRecalibrator", mode: 'copy'

    input:
    file ApplyBQSR from ApplyBQSR
    file refhg19 from refhg19_3
    file BaseRecalibrator from BaseRecalibrator_2

    output:
    file '*.bam' into HaplotypeCaller

    script:
    """
    gatk --java-options "-XX:ParallelGCThreads=2" ApplyBQSR -I BaseRecalibrator -R $refhg19_3 --bqsr-recal-file $AppleBQSR -O *.bam
    """
}
process {HaplotypeCaller
    tag { "HaplotypeCaller" }
    publishDir "${params.output}/GATK/HaplotypeCaller", mode: 'copy'

    input:
    file HaplotypeCaller from HaplotypeCaller
    file refhg19 from refhg19_4
    file BaseRecalibrator from BaseRecalibrator_3

    output:
    file '*.g.vcf.gz' into GenotypeGVCFs

    script:
    """
    gatk --java-options "-XX:ParallelGCThreads=2" HaplotypeCaller -R $refhg19_4 -I BaseRecalibrator -O *.g.vcf.gz -ERC GVCF
    """
}
process {GenotypeGVCFs
    tag { "GenotypeGVCFs" }
    publishDir "${params.output}/GATK/GenotypeGVCFs", mode: 'copy'

    input:
    file GenotypeGVCFs from GenotypeGVCFs
    file refhg19 from refhg19_5

    output:
    file '*.vcf.gz' into end

    script:
    """
    gatk --java-options "-XX:ParallelGCThreads=2" GenotypeGVCFs -R $refhg19_5 -V $GenotypeGVCFs -O *.vcf.gz
    """
}
Not sure what the error is exactly, but maybe it doesn't matter. It looks like you've declared more than one queue channel in your 'cutadapt' input declaration. Usually you don't want to do this. Please see: understand how multiple input channels work.
Note that the Channel.fromPath factory method creates a queue channel. Here, 'primer1' and 'primer2' are both queue channels that each provide only a single value:
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = Channel.fromPath(params.primer1)
primer2 = Channel.fromPath(params.primer2)
What you want instead are value channels which can be read an infinite number of times without consuming their content:
params.primer1 = "/home/duxu/project/data/primer_1.fasta"
params.primer2 = "/home/duxu/project/data/primer_2.fasta"
primer1 = file(params.primer1)
primer2 = file(params.primer2)
A typical process, which involves executing a task for each of n samples, will use a single queue channel and zero or more value channels. For example:
process cutadapt {
    tag { sample }
    publishDir "${params.output}/cut_primer", mode: 'copy'
    cpus 64

    input:
    tuple val(sample), path(fq1), path(fq2) from trimmed_index
    path primer1
    path primer2

    output:
    tuple val(sample), path("${sample}.trim.1.fq.gz"), path("${sample}.trim.2.fq.gz")

    script:
    """
    cutadapt \\
        -g "file:${primer1}" \\
        -G "file:${primer2}" \\
        -j ${task.cpus} \\
        --discard-untrimmed \\
        -o "${sample}.trim.1.fq.gz" \\
        -p "${sample}.trim.2.fq.gz" \\
        "${fq1}" \\
        "${fq2}"
    """
}
Note that when the file input name is the same as the channel name, the from channel declaration can be omitted. I also used the path qualifier above, as it should be preferred over the file qualifier when using Nextflow 19.10.0 or later.
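For completeness, here is one hypothetical way such a trimmed_index channel could be built from the barcode_splitter output files; the glob pattern and directory are illustrative assumptions, not taken from your pipeline:

// sketch: pair up *-read-1/*-read-2 files and emit flat (sample, fq1, fq2) tuples
trimmed_index = Channel
    .fromFilePairs('results/barcode_splitter/*-read-{1,2}.fastq.gz', flat: true)

With flat: true, each element is emitted as a flat tuple like [sample, sample-read-1.fastq.gz, sample-read-2.fastq.gz], which matches the tuple val(sample), path(fq1), path(fq2) input declaration above.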
I have a script that I used before and it worked. But now it is not working (I changed machines).
To convert eps files to pdf files:
#!/usr/bin/perl
use File::Find::Rule;
my @files = File::Find::Rule->file()
    ->name( '*.eps' )
    ->in( './' );
for my $file (@files) {
    print "file: $file\n";
    system("epstopdf", $file);
}
To crop white space:
#!/usr/bin/perl
use File::Find::Rule;
my @files = File::Find::Rule->file()
    ->name( '*.pdf' )
    ->in( './' );
for my $file (@files) {
    print "file: $file\n";
    system("pdfcrop", $file, $file);
}
But when I run the scripts, just the names of the files are printed and the conversion is not done. Is there something I need to install?
Your Perl scripts are trying to run external programs called epstopdf and pdfcrop. If those programs are not installed on the new machine (or are not on your PATH), the system calls fail silently because their return values are never checked, which matches what you see: the filenames are printed but nothing else happens. Both tools ship with TeX distributions such as TeX Live, so installing one of those should fix it.
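If you want the script to report the failure instead of staying silent, you can check the return value of system. A minimal sketch of the eps-to-pdf script with that check added (same logic as above, only the error handling is new):

#!/usr/bin/perl
use strict;
use warnings;
use File::Find::Rule;

my @files = File::Find::Rule->file()
    ->name( '*.eps' )
    ->in( './' );
for my $file (@files) {
    print "file: $file\n";
    # system returns -1 if epstopdf could not be started at all,
    # e.g. because it is not installed or not on the PATH
    if (system("epstopdf", $file) != 0) {
        warn "epstopdf failed for $file: $!\n";
    }
}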
I would like to run a GNU AWK script that is editing files in-place (e.g. with the -i inplace option) and have it print the filenames it is working on to STDOUT. Is there a way to do this? The following just adds the filename as the first line in the modified file, rather than printing the filename on the command line:
BEGINFILE {
print FILENAME
}
Here is a workaround: drop -i inplace from the command line (not obligatory, though; see -e/-f) and place the following at the very beginning of your script. Before starting to process a file's content, this will disable inplace temporarily and print FILENAME. Then inplace's own BEGINFILE rule will enable it again.
BEGINFILE {
    if (inplace::filename != "") {
        inplace::end(inplace::filename, inplace::suffix)
        inplace::filename = ""
    }
    print FILENAME
}

@include "inplace"
See how inplace is implemented for a better understanding.
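The invocation would then look something like this (script.awk and the file names are placeholders):

gawk -f script.awk file1.txt file2.txt

The filenames go to the terminal, while the files are still edited in place, since @include "inplace" loads the same library that -i inplace does.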
OK, so I have this:
{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}
and at the moment I'm using this shell command to decode it and get the string I need:
echo $x | grep -Po '"utterance":.*?[^\\]"' | sed -e s/://g -e s/utterance//g -e 's/"//g'
but this only works when you have a grep compiled with Perl support, and besides, the script I use to get that JSON string is written in Perl. So is there any way I can do this same decoding in a simple Perl script or a simpler Unix command, or better yet, C or Objective-C?
The script I'm using to get the JSON is here: http://pastebin.com/jBGzJbMk and if you want a file to test with, download http://trevorrudolph.com/a.flac
How about:
perl -MJSON -nE 'say decode_json($_)->{hypotheses}[0]{utterance}'
In script form:
use JSON;
while (<>) {
print decode_json($_)->{hypotheses}[0]{utterance}, "\n"
}
Well, I'm not sure if I can deduce what you are after correctly, but this is a way to decode that JSON string in Perl.
Of course, you'll need to know the data structure in order to get the data you need. The line that prints the "utterance" string is commented out in the code below.
use strict;
use warnings;
use Data::Dumper;
use JSON;
my $json = decode_json
q#{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}#;
#print $json->{'hypotheses'}[0]{'utterance'};
print Dumper $json;
Output:
$VAR1 = {
'status' => 0,
'hypotheses' => [
{
'utterance' => 'hello how are you',
'confidence' => '0.96311796'
}
],
'id' => '7aceb216d02ecdca7ceffadcadea8950-1'
};
Quick hack:
while (<>) {
say for /"utterance":"?(.*?)(?<!\\)"/;
}
Or as a one-liner:
perl -lnwe 'print for /"utterance":"(.+?)(?<!\\)"/g' inputfile.txt
The one-liner is troublesome if you happen to be using Windows, since " is interpreted by the shell.
Quick hack#2:
This will hopefully go through any hash structure and find keys.
my $json = decode_json $str;
say find_key($json, 'utterance');

sub find_key {
    my ($ref, $find) = @_;
    if (ref $ref) {
        if (ref $ref eq 'HASH' and defined $ref->{$find}) {
            return $ref->{$find};
        } else {
            # recurse into hash values or array elements
            for (ref $ref eq 'HASH' ? values %$ref : @$ref) {
                my $found = find_key($_, $find);
                if (defined $found) {
                    return $found;
                }
            }
        }
    }
    return;
}
Based on the naming, it's possible to have multiple hypotheses. This prints the utterance of each hypothesis:
echo '{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}' | \
perl -MJSON::XS -n000E'
say $_->{utterance}
for @{ JSON::XS->new->decode($_)->{hypotheses} }'
Or as a script:
use feature qw( say );
use JSON::XS;
my $json = '{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}';
say $_->{utterance}
for @{ JSON::XS->new->decode($json)->{hypotheses} };
If you don't want to use any modules from CPAN and want to try a regex instead, there are multiple variants you can try:
# JSON is on a single line:
$json = '{"other":"stuff","hypo":[{"utterance":"hi, this is \"bob\"","moo":0}]}';
# RegEx with negative look behind:
# Match everything up to a double quote without a Backslash in front of it
print "$1\n" if ($json =~ m/"utterance":"(.*?)(?<!\\)"/)
This regex works if there is only one utterance. It doesn't matter what else is in the string around it, since it only searches for the double quoted string following the utterance key.
For a more robust version you could add whitespace where necessary/possible and make the . in the RegEx match newlines: m/"utterance"\s*:\s*"(.*?)(?<!\\)"/s
If you have multiple entries for the utterance confidence hash/object, changing case and weird formatting of the JSON string try this:
# weird JSON:
$json = <<'EOJSON';
{
"status":0,
"id":"an ID",
"hypotheses":[
{
"UtTeraNcE":"hello my name is \"Bob\".",
"confidence":0.0
},
{
'utterance' : 'how are you?',
"confidence":0.1
},
{
"utterance"
: "
thought
so!
",
"confidence" : 0.9
}
]
}
EOJSON
# RegEx with alternatives:
print "$1\n" while ( $json =~ m/["']utterance["']\s*:\s*["'](([^\\"']|\\.)*)["']/gis);
The main part of this RegEx is (([^\\"']|\\.)*). Here it is described in detail as an extended regex:
/
["'] # opening quotes
( # start capturing parentheses for $1
( # start of grouping alternatives
[^\\"'] # anything that's not a backslash or a quote
| # or
\\. # a backslash followed by anything
) # end of grouping
* # in any quantity
) # end capturing parentheses
["'] # closing quotes
/xgs
If you have many data sets and speed is a concern, you can add the o modifier to the regex and use character classes instead of the i modifier. You can suppress the capturing of the alternatives to $2 with clustering parentheses (?:pattern). Then you get this final result:
m/["'][uU][tT][tT][eE][rR][aA][nN][cC][eE]["']\s*:\s*["']((?:[^\\"']|\\.)*)["']/gos
Yes, sometimes perl looks like a big explosion in a bracket factory ;-)
Just stumbled upon another nice method of doing this. I finally found out how to access the Mac OS X JavaScript engine from the command line; here's the script:
alias jsc='/System/Library/Frameworks/JavaScriptCore.framework/Versions/A/Resources/jsc'
x='{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}'
jsc -e "print(${x}['hypotheses'][0]['utterance'])"
Ugh, yes, I came up with another answer. I'm studying Python, and it reads arrays in both its own format and JSON format, so I just made this one-liner for when your variable is x:
python -c "print ${x}['hypotheses'][0]['utterance']"
Figured it out for Unix, but I would love to see your Perl and C / Objective-C answers...
echo $X | sed -e 's/.*utterance//' -e 's/confidence.*//' -e s/://g -e 's/"//g' -e 's/,//g'
:D
A shorter copy of the same sed:
echo $X | sed -e 's/.*utterance//;s/confidence.*//;s/://g;s/"//g;s/,//g'