Nextflow - cannot get access to the variable at publishDir - scripting

I have a pipeline with two processes. I want to pass the result files of the first process to the second process and save the output files in a separate directory for each file:
process one {
.
.
output:
file "split.*.json" into groups
.
.
}
process two {
.
.
publishDir params.output_path + "/exec_logs/train/${split.baseName}", mode: 'copy'
input:
set split from groups.flatten()
output:
file ".command.out"
file ".command.err"
.
.
"""
echo $split
"""
}
When I try to run the pipeline I get the following error:
No such variable: split, pointing at the publishDir line, even though I can access the split variable in the script block.
How can I get access to the split variable in publishDir?
Thanks

I think the problem is actually the string concatenation. Try supplying instead a single GString:
publishDir "${params.output_path}/exec_logs/train/${split.baseName}", mode: 'copy'
For the example above, ensure also that the 'split' variable is provided using the path qualifier:
input:
path split from groups.flatten()
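Putting both changes together, a minimal sketch of the second process might look like this (the script body and the .command.out/.command.err outputs are just placeholders carried over from the question; the elided directives are omitted):
process two {
    publishDir "${params.output_path}/exec_logs/train/${split.baseName}", mode: 'copy'
    input:
    path split from groups.flatten()
    output:
    file ".command.out"
    file ".command.err"
    script:
    """
    echo $split
    """
}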

Related

How to save the multiple outputs of a single process in publishDir in Nextflow

I have the process create_parallel_params whose output is a parallel_params folder containing JSON files.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.spectra = "$baseDir/data/spectra/"
params.library = "$baseDir/data/library/"
params.workflow_parameter="$baseDir/data/workflowParameters.xml"
TOOL_FOLDERS="$baseDir/bin"
process create_parallel_params{
publishDir "$baseDir/nf_output", mode: 'copy'
output:
path "parallel_params/*.json"
script:
"""
mkdir parallel_params | python $TOOL_FOLDERS/parallel_paramgen.py \
parallel_params \
10
"""
}
The output of the above process is passed into the process searchlibrarysearch_molecularv2_parallelstep1, which processes each JSON file.
process searchlibrarysearch_molecularv2_parallelstep1{
publishDir "$baseDir/nf_output", mode: 'copy'
input:
path json_file
//path params.spectra
//path params.library
output:
path "result_folder" emit:"result_folder/*.tsv"
script:
"""
mkdir result_folder convert_binary librarysearch_binary | \
python $TOOL_FOLDERS/searchlibrarysearch_molecularv2_parallelstep1.py \
$params.spectra \
$json_file \
$params.workflow_parameter \
$params.library \
result_folder \
convert_binary \
librarysearch_binary \
"""
}
workflow{
ch_parallel_params=create_parallel_params()
ch_searchlibrarysearch=searchlibrarysearch_molecularv2_parallelstep1(create_parallel_params.out.flatten())
ch_searchlibrarysearch.view()
}
I want the output of these files in the publishDir (nf_output) in a single folder. How can I do that? Please provide an example.
The emit option can be used to assign a name identifier to an output channel. This is helpful if your output declaration defines more than one output channel, but it isn't usually necessary if you make only a single declaration. Providing a glob pattern as an identifier doesn't make much sense: if you need only the output TSV files (and not the whole folder), you can just use the following, and the output TSV files will be published to the publishDir:
output:
path "result_folder/*.tsv"
If you want to declare the folder itself, usually you can just update your publishDir to include a subdirectory with a unique name. You could use something like:
publishDir "$baseDir/nf_output/${json_file.baseName}", mode: 'copy'
But this will give you a 'result_folder' in every subdirectory. If that's not desirable, it might be preferable to change your output declaration to:
output:
path "result_folder/*"

Passing list of filenames to nextflow process

I am a newcomer to Nextflow and I am trying to process multiple files in a workflow. There are more than 300 of these files, so I would prefer not to paste them onto the command line as an option. What I have done is create a file listing the filenames of the files I need to process, but I am not sure how to pass it into the process. This is what I've tried:
params.SRRs = "srr_ids.txt"
process tmp {
input:
file ids
output:
path "*.txt"
script:
'''
while read id; do
touch ${id}.txt;
echo ${id} > ${id}.txt;
done < $ids
'''
}
workflow {
tmp(params.SRRs)
}
The script is supposed to read in the file srr_ids.txt and create files that have their IDs in them (just testing on a smaller task). The error log says that the id variable is unbound, but I don't understand why. What is the conventional way of passing lots of filenames to a pipeline? Should I write some other process that parses the list?
Maybe there's a typo in your question, but the error is actually that the ids variable is unbound:
Command error:
.command.sh: line 5: ids: unbound variable
The problem is that when you use a single-quoted script string, you will not be able to access Nextflow variables in your script block. You can either define your script using a double-quoted string and escape your shell variables:
params.SRRs = "srr_ids.txt"
process tmp {
input:
path ids
output:
path "*.txt"
script:
"""
while read id; do
touch "\${id}.txt"
echo "\${id}" > "\${id}.txt"
done < "${ids}"
"""
}
workflow {
SRRs = file(params.SRRs)
tmp(SRRs)
}
Or, use a shell block which uses the exclamation mark ! character as the variable placeholder for Nextflow variables. This makes it possible to use both Nextflow and shell variables in the same piece of code without having to escape each of the shell variables:
params.SRRs = "srr_ids.txt"
process tmp {
input:
path ids
output:
path "*.txt"
shell:
'''
while read id; do
touch "${id}.txt"
echo "${id}" > "${id}.txt"
done < "!{ids}"
'''
}
workflow {
SRRs = file(params.SRRs)
tmp(SRRs)
}
What is the conventional way of passing lots of filenames to a pipeline?
The conventional way, I think, is to actually supply one (or more) glob patterns to the fromPath channel factory method. For example:
params.SRRs = "./path/to/files/SRR*.fastq.gz"
workflow {
Channel
.fromPath( params.SRRs )
.view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [sleepy_bernard] DSL2 - revision: 30020008a7
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910483.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910482.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448795.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448793.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448794.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448792.fastq.gz
If instead you would prefer to pass in a list of filenames, like in your example, use either the splitCsv or the splitText operator to get what you want. For example:
params.SRRs = "srr_ids.txt"
workflow {
Channel
.fromPath( params.SRRs )
.splitText() { it.strip() }
.view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [fervent_ramanujan] DSL2 - revision: 89a1771d50
SRR1448794
SRR1448795
SRR1448792
SRR1448793
SRR1910483
SRR1910482
Should I write some other process that parses the list?
You may not need to. My feeling is that your code might benefit from using the fromSRA factory method, but we don't really have enough details to say one way or the other. If you need to, you could just write a function that returns a channel.
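If you did want to keep the plain list of IDs, a minimal sketch of such a function (the helper name is hypothetical) could be:
// hypothetical helper: turn a file of IDs (one per line) into a channel of IDs
def ids_from_file( filename ) {
    return Channel
        .fromPath( filename )
        .splitText() { it.strip() }
}
workflow {
    ids_from_file( params.SRRs ).view()
}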

Nextflow: publishDir, output channels, and output subdirectories

I've been trying to learn how to use Nextflow and have come across an issue with adding output to a channel, as I need the processes to run in a certain order. I want to pass output files from one of the output subdirectories created by the tool (ONT-Guppy) into a channel, but can't seem to figure out how.
Here is the nextflow process in question:
process GupcallBases {
publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"
executor = 'pbspro'
clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"
output:
path "*.bam" into bams_ch
script:
"""
module load cuda/11.4.2
singularity exec --nv $params.Gup_container \
guppy_basecaller --config $params.P1_gupConf \
--device "cuda:0" \
--bam_out \
--recursive \
--compress \
--align_ref $params.refGen \
-i $params.P1_inDir \
-s $params.P1_outDir \
--gpu_runners_per_device $params.P1_GPU_runners \
--num_callers $params.P1_callers
"""
}
The output of the process is something like this:
$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)
I only want to keep the BAM files in $params.P1_outDir/pass/, hence trying to use pattern: "pass/*.bam", but I've tried a few other patterns to no avail.
This output syntax was chosen because, once this process is done, using the following channel works:
// Channel
// .fromPath("${params.P1_outDir}/pass/*.bam")
// .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
// .set { bams_ch }
But the problem is that if I don't pass the files into the output channel of the first process, the processes run in parallel. I could simply be missing something in the extensive documentation on how to order processes, which would be an alternative solution.
Edit: I forgot to add the error message, which is: Missing output file(s) `*.bam` expected by process `GupcallBases`. Also, $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.
Thanks in advance.
Nextflow processes are designed to run isolated from each other, but this can be circumvented somewhat when the command-line input and/or outputs are specified using params. Using params like this can be problematic because if, for example, a params variable specifies an absolute path but your output declaration expects files in the Nextflow working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing.
The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
Also, best to avoid calling Singularity manually in your script block. Instead just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:
params.publishDir = './results'
input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )
process GuppyBasecaller {
publishDir(
path: "${params.publishDir}/GuppyBasecaller",
mode: 'copy',
saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
)
beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'
container '/path/to/guppy_basecaller.img'
input:
path input_dir
path guppy_config
path ref_genome
output:
path "outdir/pass/*.bam" into bams_ch
"""
mkdir outdir
guppy_basecaller \\
--config "${guppy_config}" \\
--device "cuda:0" \\
--bam_out \\
--recursive \\
--compress \\
--align_ref "${ref_genome}" \\
-i "${input_dir}" \\
-s outdir \\
--gpu_runners_per_device "${params.guppy_gpu_runners}" \\
--num_callers "${params.guppy_callers}"
"""
}
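A minimal nextflow.config sketch to go with this might look like the following (the executor and resource values are assumptions carried over from the question, not requirements):
// nextflow.config
singularity.enabled = true
process {
    withName: 'GuppyBasecaller' {
        executor       = 'pbspro'
        clusterOptions = '-lselect=1:ncpus=8:mem=32gb:ngpus=1 -lwalltime=24:00:00'
    }
}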

Nextflow: how do you pass an output (multiple files) from the publishDir to the next process?

I have a process generating two files that I am interested in, hitsort.cls and contigs.fasta.
I output these using publishDir:
process RUN_RE {
publishDir "$baseDir/RE_output", mode: 'copy'
input:
file 'interleaved.fq'
output:
file "${params.RE_run}/seqclust/clustering/hitsort.cls"
file "${params.RE_run}/contigs.fasta"
script:
"""
some_code
"""
}
Now, I need these two files to be an input for another process but I don't know how to do that.
I have tried calling this process with
NEXT_PROCESS(params.hitsort, params.contigs)
while specifying the input as:
process NEXT_PROCESS {
input:
path hitsort
path contigs
but it's not working, because only the basename is used instead of the full path. Basically what I want is to wait for RUN_RE to finish, and then use the two files it outputs for the next process.
Best to avoid accessing files in the publishDir, since:
Files are copied into the specified directory in an asynchronous manner, thus they may not be immediately available in the published directory at the end of the process execution. For this reason files published by a process must not be accessed by other downstream processes.
The recommendation is therefore to ensure your processes only access files in the working directory (i.e. ./work). What this means is: it's best to avoid things like absolute paths in your input and output declarations. This will also help ensure your workflows are portable.
nextflow.enable.dsl=2
params.interleaved_fq = './path/to/interleaved.fq'
params.publish_dir = './results'
process RUN_RE {
publishDir "${params.publish_dir}/RE_output", mode: 'copy'
input:
path interleaved
output:
path "./seqclust/clustering/hitsort.cls", emit: hitsort_cls
path "./contigs.fasta", emit: contigs_fasta
"""
# do something with ${interleaved}...
ls -l "${interleaved}"
# create some outputs...
mkdir -p ./seqclust/clustering
touch ./seqclust/clustering/hitsort.cls
touch ./contigs.fasta
"""
}
process NEXT_PROCESS {
input:
path hitsort
path contigs
"""
ls -l
"""
}
workflow {
interleaved_fq = file( params.interleaved_fq )
NEXT_PROCESS( RUN_RE( interleaved_fq ) )
}
The above workflow block is effectively the same as:
workflow {
interleaved_fq = file( params.interleaved_fq )
RUN_RE( interleaved_fq )
NEXT_PROCESS( RUN_RE.out.hitsort_cls, RUN_RE.out.contigs_fasta )
}

Snakemake - input function exception

I am trying to run Snakemake code using a .json file as input. While checking the dry run I got the following error:
InputFunctionException in line 172 of /home/Snakefile_ChIPseq_pe:
KeyError: '130241_1'
Wildcards:
library=130241_1
This is the relevant part of the Snakemake code:
rule findPeaks:
input:
sample = os.path.join(HOMERTAG_DIR, "{library}"),
input = lambda wildcards: os.path.join(HOMERTAG_DIR, config['lib_input'][wildcards.library])
output:
os.path.join(HOMERPEAK_DIR, "{library}.all.hpeaks")
params:
config['homer_findPeaks_params']
shell:
"findPeaks {input.sample} -i {input.input} {params} -o {output}"
There is a single quote around the input sample which is missing in the 'lib_input' part. How do I add that single quote ahead of the variable?
Also, the library names are like 12345_1, 12345_2, etc. I've never had this problem before, but this is the first time I have libraries with an underscore in their names.
Snakemake will first try to interpret the given value as a number. Only if that fails will it interpret the value as a string. Here, it does not fail, because the underscore _ is interpreted as a thousands separator.
My guess is that in your json file the library IDs are not quoted. E.g. you have this:
{
"lib_input": {1234_1: "input.txt"}
}
Instead of:
{
"lib_input": {"1234_1": "input.txt"}
}
Or maybe library 130241_1 is not in the json at all?