Issue with bwa mem process not running on all output files from previous process - nextflow

I'm building a Nextflow pipeline to map and variant-call genotyping-by-sequencing (GBS) data (single-end Illumina). I've based much of it on the nf-core/eager pipeline, as that already includes many of the tools I want to incorporate. I've tested the pipeline on a single sample and it works perfectly. However, when I run it on more samples, the read files are pulled in and trimmed with fastp as expected, but bwa mem then runs on only one of the trimmed fastq files (seemingly picked at random), so the downstream processes also run on just that one file. I've tried a few different things, none of which worked. I'm guessing it may be because the fasta reference/bwa index is not a value channel? Any suggestions?
//read reference fasta channel
Channel.fromPath("${params.fasta}")
    .ifEmpty { exit 1, "No genome specified! Please specify one with --fasta or --bwa_index" }
    .into { ch_fasta_for_bwa_indexing; ch_fasta_for_faidx_indexing; ch_fasta_for_variant_call; ch_fasta_for_bwamem_mapping; ch_fasta_for_qualimap }

///build_bwa_index
process build_bwa_index {

    tag { fasta }

    publishDir path: "${params.outdir}/bwa_index", mode: 'copy', saveAs: { filename ->
        if (params.saveReference) filename
        else if (!params.saveReference && filename == "where_are_my_files.txt") filename
        else null
    }

    when: !params.bwa_index && params.fasta

    input:
    file fasta from ch_fasta_for_bwa_indexing
    file wherearemyfiles

    output:
    file "*.{amb,ann,bwt,pac,sa,fasta,fa}" into bwa_index_bwamem
    file "where_are_my_files.txt"

    """
    bwa index $fasta
    """
}
///bwa_align process
process bwa_align {

    tag "$name"

    publishDir "${params.outdir}/mapping/bwamem", mode: 'copy'

    input:
    set val(name), file(reads) from trimmed_fastq
    file fasta from ch_fasta_for_bwamem_mapping
    file "*" from bwa_index_bwamem

    output:
    file "*_sorted.bam" into bwa_sorted_bam_idxstats, bwa_sorted_bam_filter
    file "*.bai"

    script:
    if (params.singleEnd) {
        """
        bwa mem $fasta ${reads[0]} -t ${task.cpus} | samtools sort -@ ${task.cpus} -o ${name}_sorted.bam
        samtools index -@ ${task.cpus} ${name}_sorted.bam
        """
    } else {
        """
        bwa mem $fasta ${reads[0]} ${reads[1]} -t ${task.cpus} | samtools sort -@ ${task.cpus} -o ${name}_sorted.bam
        samtools index -@ ${task.cpus} ${name}_sorted.bam
        """
    }
}
I would expect the bwa_align process to run on both files produced by the fastp process in this example:
Pipeline name : trishulagenetics/genocan
Pipeline version: 0.1dev
Run name : exotic_hoover
Reads : data_2/*.R{1,2}.fastq.gz
Fasta reference: GCA_000230575.4_ASM23057v4_genomic.fna
bwa index : false
Data type : Single-end
Max Memory : null
Max CPUs : null
Max Time : null
Output dir : ./results
Working dir : /home/debian/Trishula/SRR2060630_split/test/work
Container Engine: docker
Container : trishulagenetics/genocan:latest
Current home : /home/debian
Current user : debian
Current path : /home/debian/Trishula/SRR2060630_split/test
Script dir : /home/debian/.nextflow/assets/trishulagenetics/genocan
Config Profile : docker
=========================================
executor > local (14)
[b1/080d6a] process > get_software_versions [100%] 1 of 1 ✔
[4e/87b4c2] process > build_bwa_index (GCA_000230575.4_ASM23057v4_genomic.fna) [100%] 1 of 1 ✔
[27/64b776] process > build_fasta_index (GCA_000230575.4_ASM23057v4_genomic.fna) [100%] 1 of 1 ✔
[f6/b07508] process > fastqc (P2_E07_M_0055) [100%] 2 of 2 ✔
[87/ecd07c] process > fastp (P2_E07_M_0055) [100%] 2 of 2 ✔
[50/e7bf8c] process > bwa_align (P2_A01_M_0001) [100%] 1 of 1 ✔
[c1/3647bc] process > samtools_idxstats (P2_A01_M_0001_sorted) [100%] 1 of 1 ✔
[0c/68b22c] process > samtools_filter (P2_A01_M_0001_sorted) [100%] 1 of 1 ✔
[de/c26b2d] process > qualimap (P2_A01_M_0001_sorted.filtered) [100%] 1 of 1 ✔
[bc/f7cf86] process > variant_call (P2_A01_M_0001) [100%] 1 of 1 ✔
[6f/2a9ab8] process > multiqc [100%] 1 of 1 ✔
[bb/b8b957] process > output_documentation (null) [100%] 1 of 1 ✔
[trishulagenetics/genocan] Pipeline Complete
Completed at: 17-Aug-2019 09:51:48
Duration : 19m 34s
CPU hours : 0.3
Succeeded : 14

Yes. Basically, it's better to avoid splitting your fasta file into multiple channels and to just use a plain file object, which is implicitly treated as a value channel:
ref_fasta = file(params.fasta)

process build_bwa_index {

    storeDir ...

    input:
    file ref_fasta

    output:
    file "*.{amb,ann,bwt,pac,sa}" into bwa_index

    """
    bwa index "${ref_fasta}"
    """
}

process bwa_mem {

    publishDir ...

    input:
    set name, file(reads) from trimmed_fastq
    file ref_fasta
    file "*" from bwa_index

    ...
}
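For context (this is the underlying cause, not part of the answer above): Channel.fromPath creates a queue channel, so ch_fasta_for_bwamem_mapping emits the fasta exactly once, the first bwa_align task consumes it, and the remaining trimmed samples are left waiting on an exhausted channel. If you would rather keep the split channels, an alternative sketch is to turn the reference channels into value channels with standard operators such as first() or collect(), which can be read by any number of tasks:

ch_fasta_for_bwamem_mapping = Channel.fromPath(params.fasta).first()    // first() returns a value channel

process bwa_align {

    input:
    set val(name), file(reads) from trimmed_fastq
    file fasta from ch_fasta_for_bwamem_mapping
    file "*" from bwa_index_bwamem.collect()    // collect() also returns a re-usable value channel

    ...
}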

Related

How to save the multiple outputs of a single process in publishDir in Nextflow

I have the process create_parallel_params, whose output is a parallel_params folder containing JSON files.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.spectra = "$baseDir/data/spectra/"
params.library = "$baseDir/data/library/"
params.workflow_parameter="$baseDir/data/workflowParameters.xml"
TOOL_FOLDERS="$baseDir/bin"
process create_parallel_params {

    publishDir "$baseDir/nf_output", mode: 'copy'

    output:
    path "parallel_params/*.json"

    script:
    """
    mkdir parallel_params | python $TOOL_FOLDERS/parallel_paramgen.py \
        parallel_params \
        10
    """
}
The output of the above process is passed into the process searchlibrarysearch_molecularv2_parallelstep1, which processes each JSON file.
process searchlibrarysearch_molecularv2_parallelstep1 {

    publishDir "$baseDir/nf_output", mode: 'copy'

    input:
    path json_file
    //path params.spectra
    //path params.library

    output:
    path "result_folder" emit:"result_folder/*.tsv"

    script:
    """
    mkdir result_folder convert_binary librarysearch_binary | \
    python $TOOL_FOLDERS/searchlibrarysearch_molecularv2_parallelstep1.py \
        $params.spectra \
        $json_file \
        $params.workflow_parameter \
        $params.library \
        result_folder \
        convert_binary \
        librarysearch_binary \
    """
}

workflow {
    ch_parallel_params = create_parallel_params()
    ch_searchlibrarysearch = searchlibrarysearch_molecularv2_parallelstep1(create_parallel_params.out.flatten())
    ch_searchlibrarysearch.view()
}
I want the outputs of these files to end up in the publishDir (nf_output) in a single folder. How can I do that? Please provide an example.
The emit option can be used to assign a name identifier to an output channel. This is helpful if your output declaration defines more than one output channel, but it isn't usually necessary if you make only a single declaration. Providing a glob pattern as an identifier doesn't make much sense: if you only need the output TSV files (and not the whole folder), you can just use the following, and the output TSV files will be published to the publishDir:
output:
path "result_folder/*.tsv"
If you want to declare the folder itself, usually you can just update your publishDir to include a subdirectory with a unique name. You could use something like:
publishDir "$baseDir/nf_output/${json_file.baseName}", mode: 'copy'
But this will give you a 'result_folder' in every subdirectory. If that's not desirable, it might be preferable to change your output declaration to:
output:
path "result_folder/*"

Snakemake checkpoint output unknown number of files with no subsequent aggregation but instead rules that perform actions on individual files?

Thanks for any help ahead of time.
I'm trying to use Snakemake's checkpoint functionality to produce an unknown number of files in a directory, which I've gotten to work using the pattern described in the docs. However, I don't want any kind of aggregation rule afterwards; I want rules that act on each individual file (inherently in parallel via wildcards).
Here's a simple reproducible example of my problem:
from os.path import join

rule all:
    input:
        "aggregated.txt",

checkpoint create_gzip_file:
    output:
        directory("my_directory/"),
    shell:
        """
        mkdir my_directory/
        cd my_directory
        for i in 1 2 3; do gzip < /dev/null > $i.txt.gz; done
        """

rule gunzip_file:
    input:
        join("my_directory", "{i}.txt.gz"),
    output:
        join("my_directory", "{i}.txt"),
    shell:
        """
        gunzip -c {input} > {output}
        """

def gather_gunzip_input(wildcards):
    out_dir = checkpoints.create_gzip_file.get(**wildcards).output[0]
    i = glob_wildcards(join(out_dir, "{i}.txt.gz"))
    return expand(f"{out_dir}/{{i}}", i=i)

rule aggregate:
    input:
        gather_gunzip_input,
    output:
        "aggregated.txt",
    shell:
        "cat {input} > {output}"
I'm getting the following error:
$ snakemake --printshellcmds --cores all
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
---------------- ------- ------------- -------------
aggregate 1 1 1
all 1 1 1
create_gzip_file 1 1 1
total 3 1 1
Select jobs to execute...
[Wed Jul 13 14:57:09 2022]
checkpoint create_gzip_file:
output: my_directory
jobid: 2
reason: Missing output files: my_directory
resources: tmpdir=/tmp
Downstream jobs will be updated after completion.
mkdir my_directory/
cd my_directory
for i in 1 2 3; do gzip < /dev/null > $i.txt.gz; done
[Wed Jul 13 14:57:09 2022]
Finished job 2.
1 of 3 steps (33%) done
MissingInputException in line 20 of /home/hermidalc/projects/github/hermidalc/test/Snakefile:
Missing input files for rule gunzip_file:
output: my_directory/['1', '2', '3'].txt
wildcards: i=['1', '2', '3']
affected files:
my_directory/['1', '2', '3'].txt.gz
I had a syntax issue (which wasn't triggering any syntax check or compiler issues) that was causing the seemingly unrelated MissingInputException. The glob_wildcards line:
i = glob_wildcards(join(out_dir, "{i}.txt.gz"))
needs a trailing comma, because glob_wildcards returns a Wildcards namedtuple of lists and the comma unpacks its single field:
i, = glob_wildcards(join(out_dir, "{i}.txt.gz"))
or
i = glob_wildcards(join(out_dir, "{i}.txt.gz")).i
Also, to answer the other part of the question: if you don't want an aggregation-type rule (one that takes the gathering function as its input), then I believe you need to put that function as the input of your rule all instead. As shown in this question, you can still have downstream rules of your checkpoint that don't aggregate but act on the individual, unknown files; you just have to use the wildcards collected in your gather function and write the expand so that it requests the file paths produced by the last rule acting on the checkpoint's files.
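For example, a minimal sketch building on the example above (the function name is mine) drops the aggregate rule and instead asks rule all for the per-file outputs of gunzip_file:

def gather_gunzipped_files(wildcards):
    # wait for the checkpoint, then see which files it actually produced
    out_dir = checkpoints.create_gzip_file.get(**wildcards).output[0]
    i, = glob_wildcards(join(out_dir, "{i}.txt.gz"))
    # request the per-file outputs of gunzip_file rather than an aggregated file
    return expand(join(out_dir, "{i}.txt"), i=i)

rule all:
    input:
        gather_gunzipped_files,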

Nextflow deepTools fingerprint

I am trying to use a Nextflow pipeline to make a fingerprint (bamCoverage) from deepTools. When I input the BAM files and run the script, it says I don't have index files:
[E::idx_find_and_load] Could not retrieve index file for 'Kasumi_NCOR1.genome.sorted.bam'
[E::idx_find_and_load] Could not retrieve index file for 'Kasumi_NCOR1.genome.sorted.bam'
'Kasumi_NCOR1.genome.sorted.bam' does not appear to have an index. You MUST index the file first!
process fingerprint_cov {

    publishDir "${params.outdir}/fingerprint_cov", mode: 'copy'

    input:
    set val(sample_id), file(samples) from sorted_bam_sample_control_ch.samples
    set val(sample_id_c), file(controls) from sorted_bam_sample_control_ch.controls

    output:
    set val(sample_id), file("${sample_id}.cov.bedgraph") into sample_cov_ch
    set val(sample_id_c), file("${sample_id_c}.cov.bedgraph") into control_cov_ch

    script:
    """
    bamCoverage -b ${samples} -o ${sample_id}.cov.bedgraph -of bedgraph -bs 1000 -p 10
    bamCoverage -b ${controls} -o ${sample_id_c}.cov.bedgraph -of bedgraph -bs 1000 -p 10
    """
}
sorted_bam_sample_control_ch.samples has all the sample BAM files, and sorted_bam_sample_control_ch.controls has the control BAM files. How do I input the bam.bai files? I have also seen pipelines that output the bam and bam.bai together into a channel, but how do I handle that in this step?
This is my sample input, but when I run the process it only runs one sample:
[Kasumi_H3K36, [/mnt/Data/cut_and_tag/work/0c/24e138a92a1eb0d906e1e9fad9ba4b/Kasumi_H3K36.genome.sorted.bam, /mnt/Data/cut_and_tag/work/0c/24e138a92a1eb0d906e1e9fad9ba4b/Kasumi_H3K36.genome.sorted.bam.bai]]
[Kasumi_H4K5, [/mnt/Data/cut_and_tag/work/7e/a740e11ce39f2a310b749603c785a4/Kasumi_H4K5.genome.sorted.bam, /mnt/Data/cut_and_tag/work/7e/a740e11ce39f2a310b749603c785a4/Kasumi_H4K5.genome.sorted.bam.bai]]
[Kasumi_NCOR1, [/mnt/Data/cut_and_tag/work/b8/e91ff7c7aea0fa3a0814530ab07972/Kasumi_NCOR1.genome.sorted.bam, /mnt/Data/cut_and_tag/work/b8/e91ff7c7aea0fa3a0814530ab07972/Kasumi_NCOR1.genome.sorted.bam.bai]]
[Kasumi_JMJD1C, [/mnt/Data/cut_and_tag/work/49/99ebe402d2b1953a95968525e258f6/Kasumi_JMJD1C.genome.sorted.bam, /mnt/Data/cut_and_tag/work/49/99ebe402d2b1953a95968525e258f6/Kasumi_JMJD1C.genome.sorted.bam.bai]]
Here is the control input
[Kasumi_IgG, [/mnt/Data/cut_and_tag/work/0e/1cd7aefd90105205e58fb6ef912aa4/Kasumi_IgG.genome.sorted.bam, /mnt/Data/cut_and_tag/work/0e/1cd7aefd90105205e58fb6ef912aa4/Kasumi_IgG.genome.sorted.bam.bai]]
You'll need to index your BAM files first if the index (.bai) files don't already exist. You can use samtools index <bam> for this.
Then all you need to do is get these index files into your process. Rather than adding a separate variable to each of your input sets/tuples, what I find works quite nicely is grouping each BAM file and its index into a tuple of the form: tuple( bam, bai )
Then your process might look like:
process fingerprint_cov {

    publishDir "${params.outdir}/fingerprint_cov", mode: 'copy'

    input:
    set val(test_sample_id), file(indexed_test_bam) from sorted_bam_sample_control_ch.samples
    set val(control_sample_id), file(indexed_control_bam) from sorted_bam_sample_control_ch.controls

    output:
    set val(test_sample_id), file("${test_sample_id}.cov.bedgraph") into sample_cov_ch
    set val(control_sample_id), file("${control_sample_id}.cov.bedgraph") into control_cov_ch

    script:
    def test_bam = indexed_test_bam.first()
    def control_bam = indexed_control_bam.first()
    """
    bamCoverage -b "${test_bam}" -o "${test_sample_id}.cov.bedgraph" -of bedgraph -bs 1000 -p 10
    bamCoverage -b "${control_bam}" -o "${control_sample_id}.cov.bedgraph" -of bedgraph -bs 1000 -p 10
    """
}
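If the (bam, bai) grouping doesn't exist yet, one way to produce a channel with the same shape as the input shown in the question ([sample_id, [bam, bai]]) is to sort and index in a single upstream process and capture both files with a glob. This is only an illustrative sketch; the process and channel names (sort_and_index, aligned_bam_ch, indexed_bam_ch) are made up, not taken from the original pipeline:

process sort_and_index {

    input:
    set val(sample_id), file(bam) from aligned_bam_ch    // hypothetical upstream channel of unsorted BAMs

    output:
    // the glob matches both the .bam and its .bam.bai, giving tuples of [sample_id, [bam, bai]]
    set val(sample_id), file("${sample_id}.genome.sorted.bam*") into indexed_bam_ch

    script:
    """
    samtools sort -o ${sample_id}.genome.sorted.bam ${bam}
    samtools index ${sample_id}.genome.sorted.bam
    """
}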

How to call a process in workflow.onError

I have this small pipeline:
process test {
    """
    echo 'hello'
    exit 1
    """
}

workflow.onError {
    process finish_error {
        script:
        """
        echo 'blablabla'
        """
    }
}
I want to trigger a Python script via the finish_error process in case the pipeline has an error, but this process doesn't seem to be triggered at all, even with the simple echo example above:
nextflow run test.nf
N E X T F L O W ~ version 20.10.0
Launching `test.nf` [cheesy_banach] - revision: 9020d641ca
executor > local (1)
[56/994298] process > test [100%] 1 of 1, failed: 1 ✘
[- ] process > finish_error [ 0%] 0 of 1
Error executing process > 'test'
Caused by:
Process `test` terminated with an error exit status (1)
Command executed:
echo 'hello'
exit 1
Command exit status:
1
Command output:
hello
Command wrapper:
hello
Work dir:
/home/joost/nextflow/work/56/9942985fc9948fd9bf7797d39c1785
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
How can I trigger this finish_error process, and how can I view its output?
The onError handler is invoked when a process causes pipeline execution to terminate prematurely. Since a Nextflow pipeline is really just a series of processes joined together, launching another pipeline process from within an event handler doesn't make much sense to me. If your python script should be run using the local executor, you can just execute it in the usual way. This example assumes your script is executable and has an appropriate shebang:
process test {
    """
    echo 'hello'
    exit 1
    """
}

workflow.onError {
    def proc = "${baseDir}/test.py".execute()
    proc.waitFor()
    println proc.text
}
Run using:
nextflow run -ansi-log false test.nf

Snakemake “Missing files after X seconds” error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 99
Job counts:
count jobs
1 all
1 antiSMASH
1 pear
1 prodigal
4
[Wed Dec 11 14:59:43 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 3
wildcards: sample=Unmap_41, extension=fastq
Submitted job 3 with external jobid 'Submitted batch job 4572437'.
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
It would seem that the first rule is not executing, but I am unsure why, since from what I can see all the syntax is correct. Does anyone have any advice?
The Snakefile is the following:
#!/miniconda/bin/python

workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])

# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("pear -f {input.forward} -r {input.reverse} -o {output} -t 21")

# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"

# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move

# annotating the metagenome with prodigal. Can be done inside antiSMASH but prefer to do it out
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta")

# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}")
I am running the pipeline on only one file at the moment, but here is the YAML config file in case it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
- Unmap_41
separator: _
I know the error can occur when you use certain flags in Snakemake, but I don't believe I am using those flags. The command being submitted to run the Snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml --snakefile antismash.smk
I have tried the -F flag to force a rerun, but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
I think it is likely something to do with the way I am calling the conda environment in the run commands, but using the conda: option with a YAML file returns 'version not found'-style errors.
From what I read of the PEAR documentation:
-o Specify the name to be used as base for the output files. PEAR outputs four files: a file containing the assembled reads with an assembled.fastq extension, two files containing the forward, resp. reverse, unassembled reads with extensions unassembled.forward.fastq, resp. unassembled.reverse.fastq, and a file containing the discarded reads with a discarded.fastq extension.
So if the output defined in your rule is just a base name, I suggest you put it in params and declare the real output file names under output:
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}.assembled.fastq",
        "merged_reads/{sample}.{extension}.unassembled.forward.fastq",
        "merged_reads/{sample}.{extension}.unassembled.reverse.fastq",
        "merged_reads/{sample}.{extension}.discarded.fastq"
    params:
        base = "merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("pear -f {input.forward} -r {input.reverse} -o {params.base} -t 21")
I haven't tested PEAR, so I'm not sure exactly what the output file names are.