I've been trying to learn how to use Nextflow and have come across an issue with adding output to a channel, as I need the processes to run in order. I want to pass output files from one of the output subdirectories created by the tool (ONT-Guppy) into a channel, but can't seem to figure out how.
Here is the nextflow process in question:
process GupcallBases {
publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"
executor = 'pbspro'
clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"
output:
path "*.bam" into bams_ch
script:
"""
module load cuda/11.4.2
singularity exec --nv $params.Gup_container \
guppy_basecaller --config $params.P1_gupConf \
--device "cuda:0" \
--bam_out \
--recursive \
--compress \
--align_ref $params.refGen \
-i $params.P1_inDir \
-s $params.P1_outDir \
--gpu_runners_per_device $params.P1_GPU_runners \
--num_callers $params.P1_callers
"""
}
The output of the process is something like this:
$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)
I only want to keep the bam files in $params.P1_outDir/pass/, hence trying to use pattern: "pass/*.bam", but I've tried a few other patterns to no avail.
The output syntax was chosen because, once this process is done, the following channel works:
// Channel
// .fromPath("${params.P1_outDir}/pass/*.bam")
// .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
// .set { bams_ch }
But the problem is that if I don't pass the files into the output channel of the first process, the processes run in parallel. I could simply be missing something in the extensive documentation on how to order processes, which would be an alternative solution.
Edit: I forgot to add the error message, which is: Missing output file(s) `*.bam` expected by process `GupcallBases`. Also, $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.
Thanks in advance.
Nextflow processes are designed to run isolated from each other, but this can be circumvented somewhat when the command-line input and/or outputs are specified using params. Using params like this can be problematic because if, for example, a params variable specifies an absolute path but your output declaration expects files in the Nextflow working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing.
The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
Also, best to avoid calling Singularity manually in your script block. Instead just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:
params.publishDir = './results'
input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )
process GuppyBasecaller {
publishDir(
path: "${params.publishDir}/GuppyBasecaller",
mode: 'copy',
saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
)
beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'
container '/path/to/guppy_basecaller.img'
input:
path input_dir
path guppy_config
path ref_genome
output:
path "outdir/pass/*.bam" into bams_ch
"""
mkdir outdir
guppy_basecaller \\
--config "${guppy_config}" \\
--device "cuda:0" \\
--bam_out \\
--recursive \\
--compress \\
--align_ref "${ref_genome}" \\
-i "${input_dir}" \\
-s outdir \\
--gpu_runners_per_device "${params.guppy_gpu_runners}" \\
--num_callers "${params.guppy_callers}"
"""
}
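For completeness, a minimal nextflow.config to go with this could look something like the sketch below (untested; it assumes your existing P1_* params are defined elsewhere, e.g. in a params block or on the command line, and reuses the process name from the example above):

// enable Singularity so the 'container' directive declared in the process is used
singularity.enabled = true

process {
    executor = 'pbspro'

    // scheduler options moved out of the process definition
    withName: 'GuppyBasecaller' {
        clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"
    }
}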
Related
I have the process create_parallel_params, whose output is a parallel_params folder containing JSON files.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.spectra = "$baseDir/data/spectra/"
params.library = "$baseDir/data/library/"
params.workflow_parameter="$baseDir/data/workflowParameters.xml"
TOOL_FOLDERS="$baseDir/bin"
process create_parallel_params{
publishDir "$baseDir/nf_output", mode: 'copy'
output:
path "parallel_params/*.json"
script:
"""
mkdir parallel_params | python $TOOL_FOLDERS/parallel_paramgen.py \
parallel_params \
10
"""
}
The output of the above process is passed into the process searchlibrarysearch_molecularv2_parallelstep1, which processes each JSON file.
process searchlibrarysearch_molecularv2_parallelstep1{
publishDir "$baseDir/nf_output", mode: 'copy'
input:
path json_file
//path params.spectra
//path params.library
output:
path "result_folder" emit:"result_folder/*.tsv"
script:
"""
mkdir result_folder convert_binary librarysearch_binary | \
python $TOOL_FOLDERS/searchlibrarysearch_molecularv2_parallelstep1.py \
$params.spectra \
$json_file \
$params.workflow_parameter \
$params.library \
result_folder \
convert_binary \
librarysearch_binary \
"""
}
workflow{
ch_parallel_params=create_parallel_params()
ch_searchlibrarysearch=searchlibrarysearch_molecularv2_parallelstep1(create_parallel_params.out.flatten())
ch_searchlibrarysearch.view()
}
I want the output of these files in the publishDir (nf_output) in a single folder. How can I do that? Please provide an example.
The emit option can be used to assign a name identifier to an output channel. This is helpful if your output declaration defines more than one output channel, but it isn't usually necessary if you make only a single declaration. Providing a glob pattern as an emit identifier doesn't make much sense: if you need only the output TSV files (and not the whole folder), you can just use the following, and the output TSV files will be published to the publishDir:
output:
path "result_folder/*.tsv"
If you want to declare the folder itself, usually you can just update your publishDir to include a subdirectory with a unique name. You could use something like:
publishDir "$baseDir/nf_output/${json_file.baseName}", mode: 'copy'
But this will give you a 'result_folder' in every subdirectory. If that's not desirable, it might be preferable to change your output declaration to:
output:
path "result_folder/*"
Hello all!
I'm trying to write a small Nextflow pipeline that runs vcftools commands on 300 VCFs. The pipeline takes four inputs: vcf, pop1, pop2 and a .txt file, and should generate two outputs: a .log.weir.fst and a .log.log file. When I run the pipeline, it only gives the .log.weir.fst files but not the .log files.
Here's my process definition:
process fst_calculation {
publishDir "${results_dir}/fst_results_pop1_pop2/", mode:"copy"
input:
file vcf
file pop_1
file pop_2
file mart
output:
path "*.log.*"
"""
while read linea
do
echo "[DEBUG] working in line: \$linea"
inicio=\$(echo "\$linea" | cut -f3)
final=\$(echo "\$linea" | cut -f4)
cromosoma=\$(echo "\$linea" | cut -f1)
segmento=\$(echo "\$linea" | cut -f5)
vcftools --vcf ${vcf} \
--weir-fst-pop ${pop_1} \
--weir-fst-pop ${pop_2} \
--out \$inicio.log --chr \$cromosoma \
--from-bp \$inicio --to-bp \$final
done < ${mart}
"""
}
And here's the workflow that calls my process:
/* Load files into channel*/
pop_1 = Channel.fromPath("${params.fst_path}/pop_1")
pop_2 = Channel.fromPath("${params.fst_path}/pop_2")
vcf = Channel.fromPath("${params.fst_path}/*.vcf")
mart = Channel.fromPath("${params.fst_path}/*.txt")
/* Import modules
*/
include {
fst_calculation } from './nf_modules/modules.nf'
/*
* main pipeline logic
*/
workflow {
p1 = fst_calculation(vcf, pop_1, pop_2, mart)
p1.view()
}
When I check the work directory of the pipeline, I can see that it only generates the .log.weir.fst files. To verify whether my code was wrong, I ran "bash .command.sh" in the working directory, and this actually generates the two output files. So, is there a reason for not getting the two output files when I run the pipeline?
I appreciate any help.
Note that bash .command.sh and bash .command.run do different things. The latter is basically a wrapper around the former that sets up the environment and stages the declared input files, among other things. If running the latter produces the unusual behavior, you'll need to dig deeper.
It's not completely clear to me what the problem is here. My guess is that vcftools might behave differently when run non-interactively, such that it sends its logging to STDERR. If that's the case, the logging will be captured in a file called .command.err. To instead send it to a file, you can just redirect STDERR in the usual way (untested):
while IFS=\$'\\t' read -r cromosoma null inicio final segmento ; do
>&2 echo "[DEBUG] Working with: \${cromosoma}, \${inicio}, \${final}, \${segmento}"
vcftools \\
--vcf "${vcf}" \\
--weir-fst-pop "${pop_1}" \\
--weir-fst-pop "${pop_2}" \\
--out "\${inicio}.log" \\
--chr "\${cromosoma}" \\
--from-bp "\${inicio}" \\
--to-bp "\${final}" \\
2> "\${cromosoma}.\${inicio}.\${final}.log.log"
done < "${mart}"
I have a process generating two files that I am interested in, hitsort.cls and contigs.fasta.
I output these using publishDir:
process RUN_RE {
publishDir "$baseDir/RE_output", mode: 'copy'
input:
file 'interleaved.fq'
output:
file "${params.RE_run}/seqclust/clustering/hitsort.cls"
file "${params.RE_run}/contigs.fasta"
script:
"""
some_code
"""
}
Now, I need these two files to be an input for another process but I don't know how to do that.
I have tried calling this process with
NEXT_PROCESS(params.hitsort, params.contigs)
while specifying the input as:
process NEXT_PROCESS {
input:
path hitsort
path contigs
but it's not working, because only the basename is used instead of the full path. Basically what I want is to wait for RUN_RE to finish, and then use the two files it outputs for the next process.
Best to avoid accessing files in the publishDir, since:
Files are copied into the specified directory in an asynchronous manner, thus they may not be immediately available in the published directory at the end of the process execution. For this reason files published by a process must not be accessed by other downstream processes.
The recommendation is therefore to ensure your processes only access files in the working directory, (i.e. ./work). What this means is: it's best to avoid things like absolute paths in your input and output declarations. This will also help ensure your workflows are portable.
nextflow.enable.dsl=2
params.interleaved_fq = './path/to/interleaved.fq'
params.publish_dir = './results'
process RUN_RE {
publishDir "${params.publish_dir}/RE_output", mode: 'copy'
input:
path interleaved
output:
path "./seqclust/clustering/hitsort.cls", emit: hitsort_cls
path "./contigs.fasta", emit: contigs_fasta
"""
# do something with ${interleaved}...
ls -l "${interleaved}"
# create some outputs...
mkdir -p ./seqclust/clustering
touch ./seqclust/clustering/hitsort.cls
touch ./contigs.fasta
"""
}
process NEXT_PROCESS {
input:
path hitsort
path contigs
"""
ls -l
"""
}
workflow {
interleaved_fq = file( params.interleaved_fq )
NEXT_PROCESS( RUN_RE( interleaved_fq ) )
}
The above workflow block is effectively the same as:
workflow {
interleaved_fq = file( params.interleaved_fq )
RUN_RE( interleaved_fq )
NEXT_PROCESS( RUN_RE.out.hitsort_cls, RUN_RE.out.contigs_fasta )
}
I'm kinda new at snakemake and I'm trying to understand how it works.
I tried to put together a simple Snakefile:
from snakemake.utils import min_version
min_version("5.3.0")
max_reads: 250000
sra_id: ["SRR1187735"]
rule all:
    input:
        "DATA/{sra_id}.fastq.gz"

rule prefetch:
    output:
        "DATA/{sra_id}.fastq.gz"
    params:
        max_reads = "max_reads"
    version: "1.0"
    shell:
        "conda activate sra-tools-2.10.1 "
        "&& "
        "fastq-dump {wildcards.sra_id} -X {params.max_reads} --readids \
        --dumpbase --skip-technical --gzip -Z > {output} "
        "&& "
        "conda deactivate "
But I'm getting this error :
WildcardError in line 5 of /save_home/skastalli/test_rule/Snakefile:
Wildcards in input files cannot be determined from output files:
'sra_id'
Can someone help me, please?
Most of your rules should be generalizable and include wildcards, so prefetch is correct. At some point, though, you have to ask for a specific file for snakemake to try to generate. This can happen at several spots in the workflow, but at the very least rule all needs to ask for specific files. Based on your preamble, I think the input of your rule all should be:
rule all:
    input:
        expand("DATA/{sra_id}.fastq.gz", sra_id=sra_id)
Also, because snakemake runs shell commands in strict mode (set -euo pipefail), you don't need the && in the shell directive and can just use newlines instead. Personal preference though.
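For example (untested, keeping your commands unchanged), the shell directive of rule prefetch could use a single multi-line string rather than chained && operators:

    shell:
        """
        conda activate sra-tools-2.10.1
        fastq-dump {wildcards.sra_id} -X {params.max_reads} --readids \
            --dumpbase --skip-technical --gzip -Z > {output}
        conda deactivate
        """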
How can I make sure in rule all that the output folder was well created?
Should I add each expected result file?
This somehow relates to "snakemake define folder as output", but in my case the specified output is a combination of a path to a directory and a prefix for all result files (there will be multiple).
The following rule creates the folder path Analysis/MosDepth and adds these files to that path:
gt0.mosdepth.global.dist.txt
gt0.mosdepth.region.dist.txt
gt0.per-base.bed.gz
gt0.per-base.bed.gz.csi
gt0.regions.bed.gz
gt0.regions.bed.gz.csi
rule MosDepth:
    input:
        bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
        bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
    output:
        pfx = "Analysis/MosDepth/gt0"
    threads: config["threads"]
    shell:
        "mosdepth -t {threads} -b {input.bed} {output.pfx} {input.bam}"
I currently have only one of the files in rule all. Is this enough, or is there a better way to ensure that mosdepth has run well and is not redone in a later re-run?
rule all:
    input:
        "Analysis/MosDepth/gt0.regions.bed.gz"
I would recommend something like this:
mos_out = [
    'gt0.mosdepth.global.dist.txt',
    'gt0.mosdepth.region.dist.txt',
    'gt0.per-base.bed.gz',
    'gt0.per-base.bed.gz.csi',
    'gt0.regions.bed.gz',
    'gt0.regions.bed.gz.csi',
]
rule MosDepth:
    input:
        bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
        bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
    output:
        expand("Analysis/MosDepth/{mos_out}", mos_out=mos_out)
    params:
        pfx = "Analysis/MosDepth/gt0"
    threads: config["threads"]
    shell:
        "mosdepth -t {threads} -b {input.bed} {params.pfx} {input.bam}"
If one of the output files is not created by the rule, snakemake will remove all the output files for you, and throw an error.
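To answer the rule all part directly: you can reuse the mos_out list there as well, so a later run will only skip MosDepth if every expected file is present:

rule all:
    input:
        expand("Analysis/MosDepth/{mos_out}", mos_out=mos_out)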