Problems getting two output files in Nextflow - nextflow

Hello all!
I'm trying to write a small Nextflow pipeline that runs vcftools commands on 300 VCFs. The pipeline takes four inputs: vcf, pop1, pop2 and a .txt file, and should generate two outputs: a .log.weir.fst file and a .log.log file. When I run the pipeline, it only gives me the .log.weir.fst files, not the .log files.
Here's my process definition:
process fst_calculation {

    publishDir "${results_dir}/fst_results_pop1_pop2/", mode: "copy"

    input:
    file vcf
    file pop_1
    file pop_2
    file mart

    output:
    path "*.log.*"

    """
    while read linea
    do
        echo "[DEBUG] working in line: \$linea"
        inicio=\$(echo "\$linea" | cut -f3)
        final=\$(echo "\$linea" | cut -f4)
        cromosoma=\$(echo "\$linea" | cut -f1)
        segmento=\$(echo "\$linea" | cut -f5)
        vcftools --vcf ${vcf} \
            --weir-fst-pop ${pop_1} \
            --weir-fst-pop ${pop_2} \
            --out \$inicio.log --chr \$cromosoma \
            --from-bp \$inicio --to-bp \$final
    done < ${mart}
    """
}
And here's the workflow that calls my process:
/* Load files into channel */
pop_1 = Channel.fromPath("${params.fst_path}/pop_1")
pop_2 = Channel.fromPath("${params.fst_path}/pop_2")
vcf = Channel.fromPath("${params.fst_path}/*.vcf")
mart = Channel.fromPath("${params.fst_path}/*.txt")

/* Import modules */
include { fst_calculation } from './nf_modules/modules.nf'

/*
 * main pipeline logic
 */
workflow {
    p1 = fst_calculation(vcf, pop_1, pop_2, mart)
    p1.view()
}
When I check the work directory of the pipeline, I can see that it only generates the .log.weir.fst files. To verify whether my code was wrong, I ran "bash .command.sh" in the working directory, and that actually generates both output files. So, is there a reason why I'm not getting the two output files when I run the pipeline?
I appreciate any help.

Note that bash .command.sh and bash .command.run do different things. The latter is basically a wrapper around the former that sets up the environment and stages the declared input files, among other things. If running the latter produces the unusual behavior, you'll need to dig deeper.
It's not completely clear to me what the problem is here. My guess is that vcftools might behave differently when run non-interactively, such that it sends its logging to STDERR. If that's the case, the logging will be captured in a file called .command.err. To send it to a file of your own instead, you can just redirect STDERR in the usual way (untested):
while IFS=\$'\\t' read -r cromosoma null inicio final segmento ; do

    >&2 echo "[DEBUG] Working with: \${cromosoma}, \${inicio}, \${final}, \${segmento}"

    vcftools \\
        --vcf "${vcf}" \\
        --weir-fst-pop "${pop_1}" \\
        --weir-fst-pop "${pop_2}" \\
        --out "\${inicio}.log" \\
        --chr "\${cromosoma}" \\
        --from-bp "\${inicio}" \\
        --to-bp "\${final}" \\
        2> "\${cromosoma}.\${inicio}.\${final}.log.log"

done < "${mart}"
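For completeness, a quick way to check where the vcftools logging actually ends up is to re-run one task the way Nextflow does (a sketch; the work-directory path below is illustrative):

cd work/ab/1234567890abcdef    # one of the task work directories (illustrative)
bash .command.run              # stages the inputs and runs .command.sh like Nextflow does
ls -l *.log.*                  # which of the declared outputs were actually produced?
cat .command.err               # any logging vcftools wrote to STDERR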

Related

How to save the multiple output of single process in publishDir in Nextflow

I have the process create_parallel_params, whose output is the parallel_params folder containing JSON files.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.spectra = "$baseDir/data/spectra/"
params.library = "$baseDir/data/library/"
params.workflow_parameter = "$baseDir/data/workflowParameters.xml"

TOOL_FOLDERS = "$baseDir/bin"

process create_parallel_params {

    publishDir "$baseDir/nf_output", mode: 'copy'

    output:
    path "parallel_params/*.json"

    script:
    """
    mkdir parallel_params | python $TOOL_FOLDERS/parallel_paramgen.py \
        parallel_params \
        10
    """
}
The output of the above process is passed into the process searchlibrarysearch_molecularv2_parallelstep1, which processes each JSON file.
process searchlibrarysearch_molecularv2_parallelstep1 {

    publishDir "$baseDir/nf_output", mode: 'copy'

    input:
    path json_file
    //path params.spectra
    //path params.library

    output:
    path "result_folder" emit:"result_folder/*.tsv"

    script:
    """
    mkdir result_folder convert_binary librarysearch_binary | \
        python $TOOL_FOLDERS/searchlibrarysearch_molecularv2_parallelstep1.py \
        $params.spectra \
        $json_file \
        $params.workflow_parameter \
        $params.library \
        result_folder \
        convert_binary \
        librarysearch_binary \
    """
}
workflow {
    ch_parallel_params = create_parallel_params()
    ch_searchlibrarysearch = searchlibrarysearch_molecularv2_parallelstep1(create_parallel_params.out.flatten())
    ch_searchlibrarysearch.view()
}
I want the output of these files in the publishDir (nf_output) in a single folder. How can I do that? Please provide an example.
The emit option can be used to assign a name identifier to an output channel. This is helpful if your output declaration defines more than one output channel, but isn't usually necessary if you make only a single declaration. Providing a glob pattern as an identifier doesn't make much sense: if you need only the output TSV files (and not the whole folder), you can just use the following, and the output TSV files will be published to the publishDir:
output:
path "result_folder/*.tsv"
If you want to declare the folder itself, usually you can just update your publishDir to include a subdirectory with a unique name. You could use something like:
publishDir "$baseDir/nf_output/${json_file.baseName}", mode: 'copy'
But this will give you a 'result_folder' in every subdirectory. If that's not desirable, it might be preferable to change your output declaration to:
output:
path "result_folder/*"

How does snakemake handle possible corruptions due to a rule run in parallel simultaneously appending to a single file?

I would like to learn how Snakemake handles the following situations, and what the best practice is to avoid collisions/corruption.
rule something:
    input:
        expand("/path/to/out-{asd}.txt", asd=LIST)
    output:
        "/path/to/merged.txt"
    shell:
        "cat {input} >> {output}"
With snakemake -j10, the command will try to append to the same file simultaneously, and I could not figure out whether this could lead to corruption or whether it is already handled.
Also, how are more complicated cases handled, e.g. where it is not just cat but the return value of another process, computed from each input, that is appended to the same file? Is the best practice to first write to individual files and then cat them together?
rule get_merged_total_distinct:
    input:
        expand("{dataset_id}/merge_libraries/{tomerge}_merged_rmd.bam", dataset_id=config["dataset_id"], tomerge=list(TOMERGE.keys())),
    output:
        "{dataset_id}/merge_libraries/merged_total_distinct.csv"
    params:
        with_dups="{dataset_id}/merge_libraries/{tomerge}_merged.bam"
    shell:
        """
        RCT=$(samtools view -#4 -c -F1 -F4 -q 30 {params.with_dups})
        RCD=$(samtools view -#4 -c -F1 -F4 -q 30 {input})
        printf "{wildcards.tomerge},${{RCT}},${{RCD}}\n" >> {output}
        """
or cases where an external script is being called to print the result to a single output file?
input:
    expand("infile/{x}",...) # expanded as above
output:
    "results/all.txt"
shell:
    """
    bash script.sh {params.x} {input} {params.y} >> {output}
    """
With your example, the shell directive will expand to
cat /path/to/out-SAMPLE1.txt /path/to/out-SAMPLE2.txt [...] >> /path/to/merged.txt
where SAMPLE1, etc., come from LIST. In this case, there are no collisions, corruption, or race conditions. One thread will run that command as if you typed it in your shell, and all inputs will get cat'ed to the output. Since Snakemake is pull based, once the output exists, that rule will only run again if the inputs change, at which point the new inputs will be appended to the old contents because of >>. As such, I would recommend using > so the old contents are removed; rules should be deterministic where possible.
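Concretely, the same rule with a truncating redirect would look like this (untested sketch):

rule something:
    input:
        expand("/path/to/out-{asd}.txt", asd=LIST)
    output:
        "/path/to/merged.txt"
    # '>' truncates merged.txt first, so re-running the rule is deterministic
    shell:
        "cat {input} > {output}"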
Now, if you had done something like
rule something:
    input:
        "/path/to/out-{asd}.txt"
    output:
        touch("/path/to/merged-{asd}.txt")
    params:
        output="/path/to/merged.txt"
    shell:
        "cat {input} >> {params.output}"

# then invoke
snakemake -j10 /path/to/merged-{a..z}.txt
Things are messier. Snakemake will launch up to 10 jobs at once, all appending to the single merged.txt. Note that the file is now a parameter and we are targeting some dummy files. This will behave as if you had 10 different shells and executed the commands
cat /path/to/out-a.txt >> /path/to/merged.txt
# ...
cat /path/to/out-z.txt >> /path/to/merged.txt
all at once. The output will have a random order and lines may be interleaved or interrupted.
As some guidance
Try to make outputs deterministic. Given the same inputs you should always produce the same outputs. If possible, set random seeds and enforce input ordering. In the second example, you have no idea what the output will be.
Don't use the append operator. This follows from the first point. If the output already exists and needs to be updated, start from scratch.
If you need to append a bunch of outputs, say log files or to create a summary, do so in a separate rule (see the sketch below). This again follows from the first point, but it's the only reason I can think of to use append.
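As a rough illustration of that last point (rule, file and sample names are invented), a summary rule can collect per-sample results and rebuild its output from scratch every time:

SAMPLES = ["a", "b", "c"]   # hypothetical sample names

rule summarize:
    input:
        # per-sample BAMs produced by earlier rules
        expand("aligned/{sample}.bam", sample=SAMPLES)
    output:
        "stats/summary.csv"
    shell:
        """
        for bam in {input}; do
            printf "%s,%s\\n" "$bam" "$(samtools view -c $bam)"
        done > {output}
        """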
Hope that helps. Otherwise you can comment or edit with a more realistic example of what you are worried about.

Nextflow: publishDir, output channels, and output subdirectories

I've been trying to learn how to use Nextflow and have come across an issue with adding output to a channel, as I need the processes to run in a particular order. I want to pass output files from one of the output subdirectories created by the tool (ONT-Guppy) into a channel, but can't seem to figure out how.
Here is the nextflow process in question:
process GupcallBases {

    publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"

    executor = 'pbspro'
    clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"

    output:
    path "*.bam" into bams_ch

    script:
    """
    module load cuda/11.4.2
    singularity exec --nv $params.Gup_container \
        guppy_basecaller --config $params.P1_gupConf \
        --device "cuda:0" \
        --bam_out \
        --recursive \
        --compress \
        --align_ref $params.refGen \
        -i $params.P1_inDir \
        -s $params.P1_outDir \
        --gpu_runners_per_device $params.P1_GPU_runners \
        --num_callers $params.P1_callers
    """
}
The output of the process is something like this:
$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)
I only want to keep the BAM files in $params.P1_outDir/pass/, hence trying to use pattern: "pass/*.bam", but I've tried a few other patterns to no avail.
The output syntax was chosen because, once this process is done, the following channel works:
// Channel
//     .fromPath("${params.P1_outDir}/pass/*.bam")
//     .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
//     .set { bams_ch }
But the problem is that if I don't pass the files into the output channel of the first process, the processes run in parallel. I could simply be missing something in the extensive documentation on how to order processes, which would be an alternative solution.
Edit: I forgot to add the error message, which is: Missing output file(s) `*.bam` expected by process `GupcallBases`. Also, $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.
Thanks in advance.
Nextflow processes are designed to run isolated from each other, but this can be circumvented somewhat when the command-line input and/or outputs are specified using params. Using params like this can be problematic because if, for example, a params variable specifies an absolute path but your output declaration expects files in the Nextflow working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing.
The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
Also, best to avoid calling Singularity manually in your script block. Instead just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:
params.publishDir = './results'

input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )

process GuppyBasecaller {

    publishDir(
        path: "${params.publishDir}/GuppyBasecaller",
        mode: 'copy',
        saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
    )

    beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'
    container '/path/to/guppy_basecaller.img'

    input:
    path input_dir
    path guppy_config
    path ref_genome

    output:
    path "outdir/pass/*.bam" into bams_ch

    """
    mkdir outdir

    guppy_basecaller \\
        --config "${guppy_config}" \\
        --device "cuda:0" \\
        --bam_out \\
        --recursive \\
        --compress \\
        --align_ref "${ref_genome}" \\
        -i "${input_dir}" \\
        -s outdir \\
        --gpu_runners_per_device "${params.guppy_gpu_runners}" \\
        --num_callers "${params.guppy_callers}"
    """
}
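A matching nextflow.config might then look something like the sketch below; the scheduler options and resource values are placeholders rather than values taken from the question:

// nextflow.config
singularity.enabled = true

process {
    withName: GuppyBasecaller {
        executor = 'pbspro'
        clusterOptions = '-lselect=1:ncpus=16:mem=64gb:ngpus=1 -lwalltime=24:00:00'
    }
}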

Snakemake-Wildcards in input files cannot be determined from output files

I'm kind of new to Snakemake and I'm trying to understand how it works.
I tried to put together a simple Snakefile:
from snakemake.utils import min_version
min_version("5.3.0")

max_reads: 250000
sra_id: ["SRR1187735"]

rule all:
    input:
        "DATA/{sra_id}.fastq.gz"

rule prefetch:
    output:
        "DATA/{sra_id}.fastq.gz"
    params:
        max_reads = "max_reads"
    version: "1.0"
    shell:
        "conda activate sra-tools-2.10.1 "
        "&& "
        "fastq-dump {wildcards.sra_id} -X {params.max_reads} --readids \
            --dumpbase --skip-technical --gzip -Z > {output} "
        "&& "
        "conda deactivate "
But I'm getting this error:
WildcardError in line 5 of /save_home/skastalli/test_rule/Snakefile:
Wildcards in input files cannot be determined from output files:
'sra_id'
Can someone help me, please?
Most of your rules should be generalizable and include wildcards, so prefetch is correct. At some point you have to ask for a specific file for snakemake to try and generate. This can occur in several spots of the workflow, but at least the rule all needs to ask for specific files. Based on your preamble, I think the input of your rule all should be:
rule all:
    input:
        expand("DATA/{sra_id}.fastq.gz", sra_id=sra_id)
Also, because Snakemake runs shell commands in bash strict mode, you don't need the && in the shell directive and can just use newlines instead. Personal preference though.
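To illustrate, the prefetch rule's shell block could be written with plain newlines, for example (an untested sketch that keeps your conda commands as they are and passes the number 250000, rather than the string "max_reads", to -X):

rule prefetch:
    output:
        "DATA/{sra_id}.fastq.gz"
    params:
        max_reads = 250000
    shell:
        """
        conda activate sra-tools-2.10.1
        fastq-dump {wildcards.sra_id} -X {params.max_reads} --readids \
            --dumpbase --skip-technical --gzip -Z > {output}
        conda deactivate
        """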

How to merge multiple markdown files with pandoc while retaining cross document links?

I am trying to merge multiple markdown documents in a single folder together into a PDF with pandoc.
The documents may contain links to each other which should be browseable in the markdown format, e.g. through IntelliJ or within GitLab.
Simple example documents:
0001-file-a.md
---
id: 0001
---
# File a
This is a simple file without an [external link](www.stackoverflow.com).
0002-file-b.md
---
id: 0002
---
# File b
This file links to [another file](0001-file-a.md).
By default pandoc does not handle this case out of the box, e.g. when running the following command:
pandoc -s -f markdown -t pdf *.md -V linkcolor=blue -o test.pdf
It merges the files, creates a PDF and highlights the links correctly, but when clicking the second link it wants to open the file instead of jumping to the right location in the document.
This problem has been experienced by many before me but none of the solutions I found so far have solved it. The closest I came was with the help of this answer: https://stackoverflow.com/a/61908457/6628753
It defines a filter that is first applied to each file and then the resulting JSON files are merged.
I modified this filter to fit my needs:
Add the number of the file to the label of the top-level header
Prepend the top-level header to all other header labels
Remove .md from internal links
Here is the filter:
#!/usr/bin/env python3

from pandocfilters import toJSONFilter, Header, Link
import re
import sys

"""
Pandoc filter to convert internal links for multifile documents
"""

headerL1 = []

def fix_links(key, value, format, meta):
    global headerL1

    # Store level 1 headers
    if key == "Header":
        [level, [label, t1, t2], header] = value
        if level == 1:
            id = meta.get("id")
            newlabel = f"{id['c'][0]['c']}-{label}"
            headerL1 = [newlabel]
            sys.stderr.write(f"\nGlobal header: {headerL1}\n")
            return Header(level, [newlabel, t1, t2], header)

        # Prepend level 1 header label to all other header labels
        if level > 1:
            prefix = headerL1[0]
            newlabel = prefix + "-" + label
            sys.stderr.write(f"Header label: {label} -> {newlabel}\n")
            return Header(level, [newlabel, t1, t2], header)

    if key == "Link":
        [t1, linktext, [linkref, t4]] = value
        if ".md" in linkref:
            newlinkref = re.sub(r'.md', r'', linkref)
            sys.stderr.write(f'Link: {linkref} -> {newlinkref}\n')
            return Link(t1, linktext, [newlinkref, t4])
        else:
            sys.stderr.write(f'External link: {linkref}\n')

if __name__ == "__main__":
    toJSONFilter(fix_links)
And here is a script that executes the whole thing:
#!/bin/bash

MD_INPUT=$(find . -type f | grep md | sort)

# Pass the markdown through the gitlab filters into Pandoc JSON files
echo "Filtering Gitlab markdown"
for file in $MD_INPUT
do
    echo "Filtering $file"
    pandoc \
        --filter fix-links.py \
        "$file" \
        -t json \
        -o "${file%.md}.json"
done

JSON_INPUT=$(find . -type f | grep json | sort)

echo "Generating LaTeX"
pandoc -s -f json -t latex $JSON_INPUT -V linkcolor=blue -o test.tex

echo "Generating PDF"
pandoc -s -f json -t pdf $JSON_INPUT -V linkcolor=blue -o test.pdf
Applying this script generates a PDF where the second link does not work at all.
Looking at the LaTeX code, the problem can be solved by replacing the generated \href command with \hyperlink.
Once this is done the linking works as expected.
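As a stop-gap, that manual replacement can be scripted. A rough, untested sketch: the pattern only rewrites \href targets containing no dot, colon or slash, which matches the internal labels produced by the filter above while leaving links such as www.stackoverflow.com alone:

# rewrite internal \href{label}{text} links to \hyperlink{label}{text}
sed 's|\\href{\([^}.:/]*\)}|\\hyperlink{\1}|g' test.tex > test-fixed.tex
# recompile the patched standalone LaTeX (pandoc's default PDF engine is pdflatex)
pdflatex test-fixed.tex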
The problem now is that this isn't done automatically by pandoc, which almost seems like a bug.
Is there a way to tell pandoc a link is internal from within the filter?
After running the filter it is non-trivial to fix the issue since there is no good way to differentiate internal and external links.