Nextflow multiple inputs with different number of files - nextflow

I'm trying to input two channels: seacr_res_ch2 has 4 files, while bigwig_ch3 has 5 files (a control plus 4 samples). I was trying to run the following process to compute the peak center.
When I run this process I get this error: unexpected EOF while looking for matching `"'
process compute_matrix_peak_center {
    input:
    set val(sample_id), file(seacr_bed) from seacr_res_ch2
    set val(sample_id), file(bigwig) from bigwig_ch3

    output:
    set val(sample_id), file("${sample_id}.peak_centered.mat.gz") into peak_center_ch

    script:
    """
    "computeMatrix reference-point \
        -S ${bigwig} \
        -R ${seacr_bed} \
        -a 1000 \
        -b 1000 \
        -o ${sample_id}.peak_centered.mat.gz \
        --referencePoint center \
        -p 10
    """
}

Likely, the input files are not file objects. Try replacing file with path in your input declarations, e.g.:
input:
    set val(sample_id), path(seacr_bed) from seacr_res_ch2
    set val(sample_id), path(bigwig) from bigwig_ch3
Check the documentation for details: https://www.nextflow.io/docs/latest/process.html#input-of-type-path

Your input block declares a value called sample_id twice. There's no guarantee that these values will be the same when they are derived from two (or more) channels; one value will simply clobber the other(s). You'll need to join() these channels first:
input:
    set val(sample_id), file(seacr_bed), file(bigwig) from seacr_res_ch2.join(bigwig_ch3)
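To illustrate how join() behaves, here is a minimal hypothetical sketch (the sample names, file names, and channel names are made up): by default, join() matches items on their first element (the key) and drops items whose key has no counterpart in the other channel, so the control entry that exists only in bigwig_ch3 would simply be discarded:
Channel
    .from( ['sampleA', 'A.stringent.bed'], ['sampleB', 'B.stringent.bed'] )
    .set { demo_seacr_ch }

Channel
    .from( ['sampleA', 'A.bw'], ['sampleB', 'B.bw'], ['IgG_control', 'IgG.bw'] )
    .set { demo_bigwig_ch }

demo_seacr_ch
    .join( demo_bigwig_ch )
    .view()

// prints (order may vary):
// [sampleA, A.stringent.bed, A.bw]
// [sampleB, B.stringent.bed, B.bw]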

Related

How to save the multiple outputs of a single process to publishDir in Nextflow

I have the process create_parallel_params, whose output is a parallel_params folder containing JSON files.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.spectra = "$baseDir/data/spectra/"
params.library = "$baseDir/data/library/"
params.workflow_parameter = "$baseDir/data/workflowParameters.xml"
TOOL_FOLDERS = "$baseDir/bin"

process create_parallel_params {
    publishDir "$baseDir/nf_output", mode: 'copy'

    output:
    path "parallel_params/*.json"

    script:
    """
    mkdir parallel_params | python $TOOL_FOLDERS/parallel_paramgen.py \
        parallel_params \
        10
    """
}
The output of the above process is passed into the process searchlibrarysearch_molecularv2_parallelstep1, which processes each JSON file.
process searchlibrarysearch_molecularv2_parallelstep1 {
    publishDir "$baseDir/nf_output", mode: 'copy'

    input:
    path json_file
    //path params.spectra
    //path params.library

    output:
    path "result_folder" emit:"result_folder/*.tsv"

    script:
    """
    mkdir result_folder convert_binary librarysearch_binary | \
        python $TOOL_FOLDERS/searchlibrarysearch_molecularv2_parallelstep1.py \
        $params.spectra \
        $json_file \
        $params.workflow_parameter \
        $params.library \
        result_folder \
        convert_binary \
        librarysearch_binary \
    """
}
workflow {
    ch_parallel_params = create_parallel_params()
    ch_searchlibrarysearch = searchlibrarysearch_molecularv2_parallelstep1(create_parallel_params.out.flatten())
    ch_searchlibrarysearch.view()
}
I want the output of these files in the publishDir (nf_output) in a single folder. How can I do that? Please provide an example.
The emit option can be used to assign a name identifier to an output channel. This is helpful if your output declaration defines more than one output channel, but it isn't usually necessary if you make only a single declaration. Providing a glob pattern as an identifier doesn't make much sense. If you need only the output TSV files (and not the whole folder), you can just use the following, and the TSV files will be published to the publishDir:
output:
    path "result_folder/*.tsv"
If you want to declare the folder itself, you can usually just update your publishDir to include a subdirectory with a unique name. You could use something like:
publishDir "$baseDir/nf_output/${json_file.baseName}", mode: 'copy'
But this will give you a 'result_folder' in every subdirectory. If that's not desirable, it might be preferable to change your output declaration to:
output:
    path "result_folder/*"

Nextflow: publishDir, output channels, and output subdirectories

I've been trying to learn how to use Nextflow and have come across an issue with adding output to a channel, as I need the processes to run in order. I want to pass output files from one of the output subdirectories created by the tool (ONT Guppy) into a channel, but I can't seem to figure out how.
Here is the nextflow process in question:
process GupcallBases {
    publishDir "$params.P1_outDir", mode: 'copy', pattern: "pass/*.bam"
    executor = 'pbspro'
    clusterOptions = "-lselect=1:ncpus=${params.P1_threads}:mem=${params.P1_memory}:ngpus=1:gpu_type=${params.P1_GPU} -lwalltime=${params.P1_walltime}:00:00"

    output:
    path "*.bam" into bams_ch

    script:
    """
    module load cuda/11.4.2
    singularity exec --nv $params.Gup_container \
        guppy_basecaller --config $params.P1_gupConf \
        --device "cuda:0" \
        --bam_out \
        --recursive \
        --compress \
        --align_ref $params.refGen \
        -i $params.P1_inDir \
        -s $params.P1_outDir \
        --gpu_runners_per_device $params.P1_GPU_runners \
        --num_callers $params.P1_callers
    """
}
The output of the process is something like this:
$params.P1_outDir/pass/(lots of bams and fastqs)
$params.P1_outDir/fail/(lots of bams and fastqs)
$params.P1_outDir/(a few txt and log files)
I only want to keep the BAM files in $params.P1_outDir/pass/, hence trying to use pattern: "pass/*.bam", but I've tried a few other patterns to no avail.
The output syntax was chosen because, once this process is done, the following channel works:
// Channel
//     .fromPath("${params.P1_outDir}/pass/*.bam")
//     .ifEmpty { error "Cannot find any bam files in ${params.P1_outDir}" }
//     .set { bams_ch }
But the problem is that if I don't pass the files into the output channel of the first process, the processes run in parallel. I could simply be missing something in the extensive documentation about how to order processes, which would be an alternative solution.
Edit: I forgot to add the error message, which is: Missing output file(s) `*.bam` expected by process `GupcallBases`. Also, $params.P1_outDir/ contains the subdirectories and all the log files despite the pattern argument.
Thanks in advance.
Nextflow processes are designed to run isolated from each other, but this isolation can be circumvented somewhat when command-line inputs and/or outputs are specified using params. Using params like this can be problematic because if, for example, a params variable specifies an absolute path but your output declaration expects files in the Nextflow working directory (e.g. ./work/fc/0249e72585c03d08e31ce154b6d873), you will get the 'Missing output file(s) expected by process' error you're seeing.
The solution is to ensure your inputs are localized in the working directory using an input declaration block and that the outputs are also written to the work dir. Note that only files specified in the output declaration block can be published using the publishDir directive.
Also, it's best to avoid calling Singularity manually in your script block. Instead, just add singularity.enabled = true to your nextflow.config. This should also work nicely with the beforeScript process directive to initialize your environment:
params.publishDir = './results'

input_dir = file( params.input_dir )
guppy_config = file( params.guppy_config )
ref_genome = file( params.ref_genome )

process GuppyBasecaller {
    publishDir(
        path: "${params.publishDir}/GuppyBasecaller",
        mode: 'copy',
        saveAs: { fn -> fn.substring(fn.lastIndexOf('/')+1) },
    )
    beforeScript 'module load cuda/11.4.2; export SINGULARITY_NV=1'
    container '/path/to/guppy_basecaller.img'

    input:
    path input_dir
    path guppy_config
    path ref_genome

    output:
    path "outdir/pass/*.bam" into bams_ch

    """
    mkdir outdir
    guppy_basecaller \\
        --config "${guppy_config}" \\
        --device "cuda:0" \\
        --bam_out \\
        --recursive \\
        --compress \\
        --align_ref "${ref_genome}" \\
        -i "${input_dir}" \\
        -s outdir \\
        --gpu_runners_per_device "${params.guppy_gpu_runners}" \\
        --num_callers "${params.guppy_callers}"
    """
}
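For reference, the Singularity setting mentioned above lives in nextflow.config; a minimal sketch (autoMounts is optional, but often useful for binding host paths automatically):
singularity {
    enabled = true
    autoMounts = true
}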

Nextflow deepTools fingerprint

I am trying to use a Nextflow pipeline to do a fingerprint (bamCoverage) from deepTools. When I input the BAM files and run the script, it says I don't have index files. Error: [E::idx_find_and_load] Could not retrieve index file for 'Kasumi_NCOR1.genome.sorted.bam'
[E::idx_find_and_load] Could not retrieve index file for 'Kasumi_NCOR1.genome.sorted.bam'
'Kasumi_NCOR1.genome.sorted.bam' does not appear to have an index. You MUST index the file first!
process fingerprint_cov {
    publishDir "${params.outdir}/fingerprint_cov", mode: 'copy'

    input:
    set val(sample_id), file(samples) from sorted_bam_sample_control_ch.samples
    set val(sample_id_c), file(controls) from sorted_bam_sample_control_ch.controls

    output:
    set val(sample_id), file("${sample_id}.cov.bedgraph") into sample_cov_ch
    set val(sample_id_c), file("${sample_id_c}.cov.bedgraph") into control_cov_ch

    script:
    """
    bamCoverage -b ${samples} -o ${sample_id}.cov.bedgraph -of bedgraph -bs 1000 -p 10
    bamCoverage -b ${controls} -o ${sample_id_c}.cov.bedgraph -of bedgraph -bs 1000 -p 10
    """
}
sorted_bam_sample_control_ch.samples has all the sample BAM files, and sorted_bam_sample_control_ch.controls has the control BAM files. How do I input the bam.bai files? I have also seen examples that output the BAM and bam.bai together into a channel, but how do I process that step?
This is my sample input, but when I run the process it only runs one sample:
[Kasumi_H3K36, [/mnt/Data/cut_and_tag/work/0c/24e138a92a1eb0d906e1e9fad9ba4b/Kasumi_H3K36.genome.sorted.bam, /mnt/Data/cut_and_tag/work/0c/24e138a92a1eb0d906e1e9fad9ba4b/Kasumi_H3K36.genome.sorted.bam.bai]]
[Kasumi_H4K5, [/mnt/Data/cut_and_tag/work/7e/a740e11ce39f2a310b749603c785a4/Kasumi_H4K5.genome.sorted.bam, /mnt/Data/cut_and_tag/work/7e/a740e11ce39f2a310b749603c785a4/Kasumi_H4K5.genome.sorted.bam.bai]]
[Kasumi_NCOR1, [/mnt/Data/cut_and_tag/work/b8/e91ff7c7aea0fa3a0814530ab07972/Kasumi_NCOR1.genome.sorted.bam, /mnt/Data/cut_and_tag/work/b8/e91ff7c7aea0fa3a0814530ab07972/Kasumi_NCOR1.genome.sorted.bam.bai]]
[Kasumi_JMJD1C, [/mnt/Data/cut_and_tag/work/49/99ebe402d2b1953a95968525e258f6/Kasumi_JMJD1C.genome.sorted.bam, /mnt/Data/cut_and_tag/work/49/99ebe402d2b1953a95968525e258f6/Kasumi_JMJD1C.genome.sorted.bam.bai]]
Here is the control input
[Kasumi_IgG, [/mnt/Data/cut_and_tag/work/0e/1cd7aefd90105205e58fb6ef912aa4/Kasumi_IgG.genome.sorted.bam, /mnt/Data/cut_and_tag/work/0e/1cd7aefd90105205e58fb6ef912aa4/Kasumi_IgG.genome.sorted.bam.bai]]
You'll need to index your BAM files first if the index (.bai) files don't already exist. You can use samtools index <bam> for this.
Then all you need to do is input these into your process. Rather than having a separate variable in each of your input sets/tuples, what I find works quite nicely is grouping each BAM file and its index into a tuple of the form tuple( bam, bai ).
Then your process might look like:
process fingerprint_cov {
    publishDir "${params.outdir}/fingerprint_cov", mode: 'copy'

    input:
    set val(test_sample_id), file(indexed_test_bam) from sorted_bam_sample_control_ch.samples
    set val(control_sample_id), file(indexed_control_bam) from sorted_bam_sample_control_ch.controls

    output:
    set val(test_sample_id), file("${test_sample_id}.cov.bedgraph") into sample_cov_ch
    set val(control_sample_id), file("${control_sample_id}.cov.bedgraph") into control_cov_ch

    script:
    def test_bam = indexed_test_bam.first()
    def control_bam = indexed_control_bam.first()

    """
    bamCoverage -b "${test_bam}" -o "${test_sample_id}.cov.bedgraph" -of bedgraph -bs 1000 -p 10
    bamCoverage -b "${control_bam}" -o "${control_sample_id}.cov.bedgraph" -of bedgraph -bs 1000 -p 10
    """
}
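If nothing upstream produces the [sample_id, [bam, bai]] tuples shown in the question yet, a minimal sketch of an indexing process that could feed them (the channel and process names here are illustrative, not from the original pipeline):
process index_bam {
    input:
    set val(sample_id), file(bam) from sorted_bam_ch

    output:
    set val(sample_id), file(bam), file("${bam}.bai") into indexed_bam_ch

    script:
    """
    samtools index "${bam}"
    """
}

// Group each BAM with its index into a single element, matching the question's channel layout:
indexed_bam_ch
    .map { sample_id, bam, bai -> tuple( sample_id, [ bam, bai ] ) }
    .set { indexed_pairs_ch }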

Snakemake “Missing files after X seconds” error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 99
Job counts:
count jobs
1 all
1 antiSMASH
1 pear
1 prodigal
4
[Wed Dec 11 14:59:43 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 3
wildcards: sample=Unmap_41, extension=fastq
Submitted job 3 with external jobid 'Submitted batch job 4572437'.
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
It would seem that the first rule is not executing, but I am unsure as to why, since from what I can see the syntax is correct. Does anyone have some advice?
The Snakefile is the following:
#!/miniconda/bin/python

workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])

# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("pear -f {input.forward} -r {input.reverse} -o {output} -t 21")

# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"

# Setting the rule order on the 3 above rules which should be treated equally and only one run.
ruleorder: pear > move

# annotating the metagenome with prodigal. Can be done inside antiSMASH but prefer to do it out
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta")

# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}")
I am running the pipeline on only one file at the moment, but the YAML file looks like this if it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
- Unmap_41
separator: _
I know the error can occur when you use certain flags in Snakemake, but I don't believe I am using those flags. The command being submitted to run the Snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml --snakefile antismash.smk
I have tried the -F flag to force a rerun, but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
I think it is likely to be something involving the way I am calling the conda environment in the run commands, but using the conda: option with a YAML file returns 'version not found'-style errors.
From what I read in the PEAR documentation:
-o  Specify the name to be used as base for the output files. PEAR outputs four files: a file containing the assembled reads with an assembled.fastq extension, two files containing the forward, resp. reverse, unassembled reads with extensions unassembled.forward.fastq, resp. unassembled.reverse.fastq, and a file containing the discarded reads with a discarded.fastq extension.
So if the output defined in your rule is just a base name, I suggest you put it in params and list the real names of the output files under output:
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}.assembled.fastq",
        "merged_reads/{sample}.{extension}.unassembled.forward.fastq",
        "merged_reads/{sample}.{extension}.unassembled.reverse.fastq",
        "merged_reads/{sample}.{extension}.discarded.fastq"
    params:
        base = "merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("pear -f {input.forward} -r {input.reverse} -o {params.base} -t 21")
I haven't tested PEAR, so I'm not sure exactly what the output file names are.

Parallel execution of a snakemake rule with same input and a range of values for a single parameter

I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all parameters and fails.
My original bash script runs a program multiple times for a range of values of the parameter K.
for num in {1..3}
do
structure.py -K $num --input=fileprefix --output=fileprefix
done
There are multiple input files that start with fileprefix, and there are two main outputs per run, e.g. for K=1 they are fileprefix.1.meanP and fileprefix.1.meanQ. My config and Snakemake files are as follows.
Config:
cat config.yaml
infile: fileprefix
K:
- 1
- 2
- 3
Snakemake:
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        """
        structure.py -K {params.K} \
            --input=output/{params.prefix} \
            --output=output/{params.prefix}
        """
This was executed with snakemake --cores 3. The problem persists when I only use one thread.
I expected the outputs described above for each value of K, but the run fails with this error:
RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail; structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
File "Snakefile", line 84, in __rule_Structure
File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
When I set K to a single value, such as K = ['1'], everything works. So the problem seems to be that {params.K} is being expanded to all values of K when the shell command is executed. I started teaching myself Snakemake today, and it works really well, but I'm hitting a brick wall with this.
You need to retrieve the argument for -K from the wildcards, not from the config file. The config entry will simply return your whole list of possible values; config is a plain Python dictionary.
configfile: 'config.yaml'

rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])

rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        "structure.py -K {wildcards.K} "
        "--input=output/{params.prefix} "
        "--output=output/{params.prefix}"
Note that there are more things to improve here. For example, the rule structure does not define any input files, although it uses some; a possible input block is sketched below.
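A hypothetical sketch of such an input declaration, assuming structure.py reads a set of files named output/fileprefix.&lt;suffix&gt; (the suffixes are an assumption for illustration, not taken from the original post); the rest of the rule stays as above:
rule structure:
    input:
        # suffixes are an assumption here; replace with the files structure.py actually reads
        expand("output/{{sample}}.{suffix}", suffix = ['bed', 'bim', 'fam'])
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    # params, threads and shell as in the rule above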
There is now also an option for parameter space exploration:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration
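For this single-parameter case, expand() plus wildcards.K (as above) is simplest, but a minimal, self-contained sketch of the Paramspace helper from that page could look like the following (the params.tsv file, its "K" column, and the rule names are assumptions for illustration):
from snakemake.utils import Paramspace
import pandas as pd

# params.tsv is assumed to contain a single column "K" listing the values to sweep
paramspace = Paramspace(pd.read_csv("params.tsv", sep="\t"))

rule all:
    input:
        # one target per parameter instance, e.g. results/K~1.txt, results/K~2.txt, ...
        expand("results/{params}.txt", params=paramspace.instance_patterns)

rule sweep:
    output:
        # wildcard_pattern expands to "K~{K}", so wildcards.K is available in the shell
        f"results/{paramspace.wildcard_pattern}.txt"
    shell:
        # placeholder command; a real rule would call your tool with {wildcards.K}
        "echo 'ran with K={wildcards.K}' > {output}"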